# JavaScript and TypeScript (Node.js) WebCrawler API SDK

## Installation

```bash
npm i webcrawlerapi-js
```

## Usage

### Synchronous Crawling

The synchronous method waits for the crawl to complete and returns all data at once.
```javascript
import webcrawlerapi from "webcrawlerapi-js";

const client = new webcrawlerapi.WebcrawlerClient("YOUR_API_KEY");

// Synchronous crawling: resolves once the whole job has finished
const result = await client.crawl({
    "url": "https://stripe.com/",
    "scrape_type": "markdown",
    "items_limit": 10
});
console.log(result);

// Each crawled page is a job item; see GetContent below
for (const item of result.job_items) {
    const content = await item.getContent();
    console.log(content.slice(0, 100));
}
```
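Because `getContent()` returns a promise, item contents can also be fetched concurrently instead of one at a time. A minimal sketch, reusing the `result` object from the example above:

```javascript
// Fetch every item's content in parallel (sketch; assumes the
// `result` object returned by client.crawl above)
const contents = await Promise.all(
    result.job_items.map((item) => item.getContent())
);
contents.forEach((content) => console.log(content.slice(0, 100)));
```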
### Asynchronous Crawling

The asynchronous method returns a job ID immediately and allows you to check the status later.

```javascript
import webcrawlerapi from "webcrawlerapi-js";

const client = new webcrawlerapi.WebcrawlerClient("YOUR_API_KEY");

// Start the async crawl job
const job = await client.crawlAsync({
    "url": "https://stripe.com/",
    "scrape_type": "markdown",
    "items_limit": 10
});

// Get the job ID
const jobId = job.id;

// Check job status
let jobStatus = await client.getJob(jobId);
console.log(jobStatus);

// Poll the job status until it is complete, waiting the
// server-recommended delay between requests
while (jobStatus.status === 'in_progress') {
    await new Promise(resolve => setTimeout(resolve, jobStatus.recommended_pull_delay_ms));
    jobStatus = await client.getJob(jobId);
}
console.log('Final result:', jobStatus);
```
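For reuse, the polling logic above can be wrapped in a small helper. This is a sketch, not part of the SDK: it relies only on the fields shown above (`status` and `recommended_pull_delay_ms`) and treats any status other than `in_progress` as terminal.

```javascript
// Hypothetical helper, not part of the SDK: poll until the job
// leaves the 'in_progress' state, then return the final job object.
async function waitForJob(client, jobId) {
    let job = await client.getJob(jobId);
    while (job.status === 'in_progress') {
        // Respect the server-recommended delay between polls
        await new Promise(resolve =>
            setTimeout(resolve, job.recommended_pull_delay_ms)
        );
        job = await client.getJob(jobId);
    }
    return job;
}

const finalJob = await waitForJob(client, jobId);
console.log('Final result:', finalJob);
```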
## Options

Both methods support the following options:

- `url`: The target URL to crawl
- `scrape_type`: Type of content to extract (`markdown`, `html`, etc.)
- `items_limit`: Maximum number of pages to crawl
- `allow_subdomains`: Whether to crawl subdomains (default: `false`)
- `whitelist_regexp`: Regular expression for allowed URLs
- `blacklist_regexp`: Regular expression for blocked URLs
- `webhook_url`: URL to receive notifications when the job completes
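A crawl call combining several of these options might look like the following sketch. The values are illustrative only, and the webhook endpoint is a made-up placeholder:

```javascript
const job = await client.crawlAsync({
    "url": "https://stripe.com/",
    "scrape_type": "markdown",
    "items_limit": 50,
    "allow_subdomains": false,
    // Only follow documentation URLs...
    "whitelist_regexp": "https://stripe\\.com/docs.*",
    // ...and skip blog pages
    "blacklist_regexp": ".*/blog/.*",
    // Hypothetical endpoint that will be notified on completion
    "webhook_url": "https://example.com/crawl-webhook"
});
```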
## GetContent

Each job item contains a link to its content. For convenience, the `getContent()` method fetches that content for you. Here's an example:

```javascript
const result = await client.crawl({
    "url": "https://stripe.com/",
    "scrape_type": "markdown",
    "items_limit": 10
});

for (const item of result.job_items) {
    const content = await item.getContent();
    console.log(content.slice(0, 100));
}
```

This method retrieves the full content associated with each job item, which is useful for processing or displaying the crawled data.
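As a usage example, the fetched content can be written straight to disk with Node's built-in `fs` module. This is a sketch under the assumption that `getContent()` resolves to a string (as the slicing above suggests); the file names are made up for illustration:

```javascript
import { writeFile } from "node:fs/promises";

// Save each item's markdown to a numbered file (sketch)
for (const [index, item] of result.job_items.entries()) {
    const content = await item.getContent();
    await writeFile(`page-${index}.md`, content);
}
```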