[components] Scrapeless - fix actions #17377
Conversation
@joy-chanboop is attempting to deploy a commit to the Pipedreamers Team on Vercel. A member of the Team first needs to authorize it.
Thank you so much for submitting this! We've added it to our backlog to review, and our team has been notified.
Thanks for submitting this PR! When we review PRs, we follow the Pipedream component guidelines. If you're not familiar, here's a quick checklist:
Force-pushed from 3ca44b5 to 3544fe8 (Compare)
Hi @jcortes, I noticed that in the previously deployed Scrapeless service, inputProps in the run method did not contain the form field values returned by additionalProps. By explicitly adding an await, the parameters are now correctly resolved, and inputProps contains the form data as expected. Could you please help review this fix? Let me know if you have insights on why the previous combination caused inputProps to be empty. Thanks a lot for your time!
@joy-chanboop Can you please try this version and see if it works on your side?
Hi @joy-chanboop I've just tried with this modification and it worked just fine, although I ran out of credits with the account that Leo shared: [Scrapeless]: insufficient balance, please recharge first.
import scrapeless from "../../scrapeless.app.mjs";
export default {
key: "scrapeless-crawler",
name: "Crawler",
description: "Crawl any website at scale and say goodbye to blocks. [See the documentation](https://apidocs.scrapeless.com/api-17509010).",
version: "0.0.9",
type: "action",
props: {
scrapeless,
apiServer: {
type: "string",
label: "Please select a API server",
description: "Please select a API server to use",
default: "crawl",
options: [
{
label: "Crawl",
value: "crawl",
},
{
label: "Scrape",
value: "scrape",
},
],
reloadProps: true,
},
},
additionalProps() {
const { apiServer } = this;
const props = {};
if (apiServer === "crawl" || apiServer === "scrape") {
props.url = {
type: "string",
label: "URL to Crawl",
description: "If you want to crawl in batches, please refer to the SDK of the document",
};
}
if (apiServer === "crawl") {
props.limitCrawlPages = {
type: "integer",
label: "Number Of Subpages",
default: 5,
description: "Max number of results to return",
};
}
return props;
},
async run({ $ }) {
const {
scrapeless,
apiServer,
url,
limitCrawlPages,
} = this;
console.log("url", url);
console.log("limitCrawlPages", limitCrawlPages);
console.log("apiServer", apiServer);
const browserOptions = {
"proxy_country": "ANY",
"session_name": "Crawl",
"session_recording": true,
"session_ttl": 900,
};
let response;
if (apiServer === "crawl") {
response =
await scrapeless._scrapelessClient().scrapingCrawl.crawl.crawlUrl(url, {
limit: limitCrawlPages,
browserOptions,
});
}
if (apiServer === "scrape") {
response =
await scrapeless._scrapelessClient().scrapingCrawl.scrape.scrapeUrl(url, {
browserOptions,
});
}
if (response?.status === "completed" && response?.data) {
$.export("$summary", `Successfully retrieved crawling results for ${url}`);
return response;
} else {
throw new Error(response?.error || "Failed to retrieve crawling results");
}
},
};
Hi @jcortes,
First, regarding the error you encountered — it’s actually due to your Scrapeless API KEY having no remaining balance. Could you please provide your email address? We’ll send you a dedicated test API KEY so you can continue testing without interruptions.
Also, I’m not sure if this was influenced by running in Pipedream’s production environment, but I found that previously, when executing the scraping-api action, inputProps didn’t contain the form field values returned by the additionalProps function. After reviewing the code, I realized that by adding an explicit await, I was able to correctly retrieve the props values.
Let me know if you have any thoughts on this, or if there’s more you’d like me to check.
Thanks a lot for your time!
You can use the following online environment code for testing.
import scrapeless from "../../scrapeless.app.mjs";
import { log } from "../../common/utils.mjs";
export default {
key: "scrapeless-scraping-api",
name: "Scraping API",
description: "Endpoints for fresh, structured data from 100+ popular sites. [See the documentation](https://apidocs.scrapeless.com/api-12919045).",
version: "0.0.1",
type: "action",
props: {
scrapeless,
apiServer: {
type: "string",
label: "Please select a API server",
default: "googleSearch",
description: "Please select a API server to use",
options: [
{
label: "Google Search",
value: "googleSearch",
},
],
reloadProps: true,
},
},
async run({ $ }) {
const {
scrapeless, apiServer, ...inputProps
} = this;
const MAX_RETRIES = 3;
// 10 seconds
const DELAY = 1000 * 10;
const { run } = $.context;
let submitData;
let job;
// pre check if the job is already in the context
if (run?.context?.job) {
job = run.context.job;
}
if (apiServer === "googleSearch") {
submitData = {
actor: "scraper.google.search",
input: {
q: inputProps.q,
hl: inputProps.hl,
gl: inputProps.gl,
},
};
}
if (!submitData) {
throw new Error("No actor found");
}
// 1. Create a new scraping job
if (!job) {
job = await scrapeless._scrapelessClient().deepserp.createTask({
actor: submitData.actor,
input: submitData.input,
});
if (job.status === 200) {
$.export("$summary", "Successfully retrieved scraping results");
return job.data;
}
log("task in progress");
}
// 2. Wait for the job to complete
if (run.runs === 1) {
$.flow.rerun(DELAY, {
job,
}, MAX_RETRIES);
} else if (run.runs > MAX_RETRIES ) {
throw new Error("Max retries reached");
} else if (job && job?.data?.taskId) {
const result = await scrapeless._scrapelessClient().deepserp.getTaskResult(job.data.taskId);
if (result.status === 200) {
$.export("$summary", "Successfully retrieved scraping results");
return result.data;
} else {
$.flow.rerun(DELAY, {
job,
}, MAX_RETRIES);
}
} else {
throw new Error("No job found");
}
},
additionalProps() {
const { apiServer } = this;
const props = {};
if (apiServer === "googleSearch") {
props.q = {
type: "string",
label: "Search Query",
description: "Parameter defines the query you want to search. You can use anything that you would use in a regular Google search. e.g. inurl:, site:, intitle:. We also support advanced search query parameters such as as_dt and as_eq.",
default: "coffee",
};
props.hl = {
type: "string",
label: "Language",
description: "Parameter defines the language to use for the Google search. It's a two-letter language code. (e.g., en for English, es for Spanish, or fr for French).",
default: "en",
};
props.gl = {
type: "string",
label: "Country",
description: "Parameter defines the country to use for the Google search. It's a two-letter country code. (e.g., us for the United States, uk for United Kingdom, or fr for France).",
default: "us",
};
}
return props;
},
};
Hi @joy-chanboop this is my email [email protected]. The way you are deferring the values with await seems odd to me, because the additionalProps method doesn't have an async signature, so the await shouldn't be needed in this case. However, in my test I can see the logs with the values of the props whenever I run the action, so I'm wondering if you are able to see them too.
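For context, here is a minimal sketch of the two additionalProps signatures being discussed. This is a hypothetical example, not code from this PR; in either form, the values entered for the dynamic props are read from this inside run():
// Hypothetical example for illustration; not part of this PR.
export default {
  key: "example-dynamic-props",
  name: "Example Dynamic Props",
  version: "0.0.1",
  type: "action",
  props: {
    mode: {
      type: "string",
      label: "Mode",
      options: ["crawl", "scrape"],
      reloadProps: true, // re-evaluates additionalProps when this value changes
    },
  },
  // This could equally be declared as `async additionalProps()` if building
  // the definitions required awaited work; the platform accepts both forms.
  additionalProps() {
    return {
      url: {
        type: "string",
        label: "URL",
      },
    };
  },
  async run({ $ }) {
    // The value entered for the dynamic prop is available on `this`,
    // just like a statically declared prop; no await is needed to read it.
    $.export("$summary", `Received URL ${this.url}`);
    return this.url;
  },
};
In this sketch the value is a plain property on this, so awaiting it would be a no-op, which is the point raised above.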
Hi @jcortes,
I’ve just sent the testing API KEY to your email, please check your inbox. Let me know if you didn’t receive it.
Additionally, I found that only the scraping-api.mjs action requires adding await to properly retrieve the props values, which is quite odd — because other actions like crawler.mjs work fine without using await, and still correctly get the props. This inconsistency is also puzzling to me.
Could you help by running a quick test on the current scraping-api.mjs action in the master branch of the pipedream repo? In my local testing, without adding await, it consistently fails to retrieve the props, which led me to apply this fix.
Thanks a lot for taking a look!
Hi @joy-chanboop I've left a comment here! https://github.com/PipedreamHQ/pipedream/pull/17377/files#r2178808794
Hi @joy-chanboop thanks for the API key. As I was playing around with the SDK, I ran into two different issues. One is that if the SDK is running offline, it tries to create a storage dir on the host, which in this case is the Pipedream infrastructure, in a default path that it doesn't have permission to write to, so that's why I had to set the env var:
process.env.SCRAPELESS_IS_ONLINE = "true";
On the other hand, I noticed there were also some warnings regarding logging, so I had to set the other env var that I saw in the SDK code, which is:
process.env.SCRAPELESS_LOG_ROOT_DIR = "/tmp";
I had to put both of these env vars inside the _scrapelessClient function and make it async in scrapeless.app.mjs:
async _scrapelessClient() {
process.env.SCRAPELESS_IS_ONLINE = "true";
process.env.SCRAPELESS_LOG_ROOT_DIR = "/tmp";
const { Scrapeless } = await import("@scrapeless-ai/sdk");
const { api_key } = this.$auth;
if (!api_key) {
throw new ConfigurationError("API key is required");
}
return new Scrapeless({
apiKey: api_key,
baseUrl: this._baseUrl(),
});
},
So the import worked just fine. I also refactored the crawler.mjs component a bit so you can test it on your side:
import scrapeless from "../../scrapeless.app.mjs";
export default {
key: "scrapeless-crawler",
name: "Crawler",
description: "Crawl any website at scale and say goodbye to blocks. [See the documentation](https://apidocs.scrapeless.com/api-17509010).",
version: "0.0.1",
type: "action",
props: {
scrapeless,
apiServer: {
type: "string",
label: "Please select a API server",
description: "Please select a API server to use",
default: "crawl",
options: [
{
label: "Crawl",
value: "crawl",
},
{
label: "Scrape",
value: "scrape",
},
],
reloadProps: true,
},
},
additionalProps() {
const props = {
url: {
type: "string",
label: "URL to Crawl",
description: "If you want to crawl in batches, please refer to the SDK of the document",
},
};
if (this.apiServer === "crawl") {
return {
...props,
limitCrawlPages: {
type: "integer",
label: "Number Of Subpages",
default: 5,
description: "Max number of results to return",
},
};
}
return props;
},
async run({ $ }) {
const {
scrapeless,
apiServer,
...inputProps
} = this;
const browserOptions = {
"proxy_country": "ANY",
"session_name": "Crawl",
"session_recording": true,
"session_ttl": 900,
};
let response;
const client = await scrapeless._scrapelessClient();
if (apiServer === "crawl") {
response =
await client.scrapingCrawl.crawl.crawlUrl(inputProps.url, {
limit: inputProps.limitCrawlPages,
browserOptions,
});
}
if (apiServer === "scrape") {
response =
await client.scrapingCrawl.scrape.scrapeUrl(inputProps.url, {
browserOptions,
});
}
if (response?.status === "completed" && response?.data) {
$.export("$summary", `Successfully retrieved crawling results for \`${inputProps.url}\``);
return response.data;
} else {
throw new Error(response?.error || "Failed to retrieve crawling results");
}
},
};
So let me know if that works!
WHY
The run method in universal-scraping-api.mjs was using async and destructured rest parameters, but the inputProps field was an empty object. As a result, accessing inputProps.url (and similar fields) returned undefined.
Added an explicit await to properly wait for the resolved inputProps, ensuring the values are available before proceeding.
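For illustration, a minimal sketch of what the described change to the run method could look like, following the rest-destructuring pattern used in the actions above. This is a hypothetical sketch rather than the exact diff, and why the extra await changes behavior in the deployed environment remains the open question in the thread.
// Hypothetical sketch of the described fix; not the exact PR diff.
async run({ $ }) {
  const {
    scrapeless, apiServer, ...rest
  } = this;
  // Explicitly await the dynamic props before reading them. For a plain
  // object this is a no-op, but it resolves the value first if the props
  // are ever delivered as a thenable.
  const inputProps = await rest;
  if (!inputProps?.url) {
    throw new Error("Expected url from additionalProps to be set");
  }
  // ... continue with inputProps.url and the other dynamic props.
},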