A job is a task that you run on the Webcrawler API. Jobs are asynchronous: you receive a notification when the job is done (read more about async requests).
Job request parameters
- url - (required) the seed URL where the crawler starts. Can be any valid URL.
- scrape_type - (default: html) the type of scraping you want to perform. Can be html or cleaned.
- items_limit - (default: 20) the crawler stops when it reaches this limit of pages for the job.
- webhook_url - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
- crawl_delay_ms - (default: 2000) the delay between requests in milliseconds. To respect the website and avoid being blocked, we recommend leaving the default.
- max_retries - (default: 2) the number of retries if a page request fails.
- whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
- blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
- allow_subdomains - (default: false) if true, the crawler will also crawl subdomains (for example, blog.example.com if the seed URL is example.com).
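The URL filters can be tried out locally before submitting a job. The sketch below is only an illustration: the function names are hypothetical, and it assumes that patterns are matched anywhere in the URL, that an empty pattern means "no filter", and that the blacklist takes precedence over the whitelist. The crawler's actual matching rules may differ.

```python
import re
from urllib.parse import urlsplit


def host_allowed(url: str, seed_url: str, allow_subdomains: bool) -> bool:
    """Return True if url is on the seed host, or on a subdomain when allowed."""
    host = (urlsplit(url).hostname or "").lower()
    seed_host = (urlsplit(seed_url).hostname or "").lower()
    if host == seed_host:
        return True
    return allow_subdomains and host.endswith("." + seed_host)


def should_crawl(url: str, seed_url: str, whitelist_regexp: str = "",
                 blacklist_regexp: str = "", allow_subdomains: bool = False) -> bool:
    """Hypothetical local pre-check that mirrors the documented filters."""
    if not host_allowed(url, seed_url, allow_subdomains):
        return False
    # Assumption: a blacklist match always skips the URL.
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    # Assumption: when a whitelist is set, only matching URLs are crawled.
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True
```

For instance, with the seed https://example.com/ and whitelist_regexp set to /docs/, this check accepts https://example.com/docs/a and rejects https://example.com/blog/a.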
Example:
{
  "url": "https://stripe.com/",
  "webhook_url": "https://yourserver.com/webhook",
  "items_limit": 10,
  "crawl_delay_ms": 2000,
  "max_retries": 1,
  "scrape_type": "cleaned",
  "allow_subdomains": false
}
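A request body like the one above could be assembled in code. The following is a minimal sketch, not an official client: build_job_request is a hypothetical helper that fills in the documented defaults and rejects unknown parameters. The endpoint and authentication are not covered in this section, so the sketch stops at producing the JSON body.

```python
import json

# Defaults as documented in the parameter list.
DEFAULTS = {
    "scrape_type": "html",
    "items_limit": 20,
    "crawl_delay_ms": 2000,
    "max_retries": 2,
    "allow_subdomains": False,
}
SCRAPE_TYPES = {"html", "cleaned"}
OPTIONAL_KEYS = {"webhook_url", "whitelist_regexp", "blacklist_regexp"}


def build_job_request(url: str, **options) -> dict:
    """Hypothetical helper: merge options over the defaults and validate them."""
    if not url:
        raise ValueError("url is required")
    unknown = set(options) - set(DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"url": url, **DEFAULTS, **options}
    if payload["scrape_type"] not in SCRAPE_TYPES:
        raise ValueError("scrape_type must be 'html' or 'cleaned'")
    return payload


# Serialize the payload; sending it is left to your HTTP client of choice.
body = json.dumps(
    build_job_request("https://stripe.com/", items_limit=10, scrape_type="cleaned")
)
```

Validating on the client side catches a misspelled parameter or scrape_type before the job is created.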
Job response
- id - the unique identifier of the job.
- url - the seed URL where the crawler started.
- status - the status of the job. Can be new, in_progress, done, or error.
- scrape_type - the type of scraping performed for the job.
- extract_rules - an object with rules to extract data from the page.
- whitelist_regexp - a regular expression to whitelist URLs.
- blacklist_regexp - a regular expression to blacklist URLs.
- allow_subdomains - whether the crawler also crawls subdomains.
- items_limit - the limit of pages for this job.
- crawl_delay_ms - the delay between requests in milliseconds.
- max_retries - the number of retries if a page request fails.
- created_at - the date when the job was created.
- finished_at - the date when the job was finished.
- webhook_url - the URL where the server will send a POST request once the task is completed.
- webhook_status - the status of the webhook request.
- webhook_error - the error message if the webhook request failed.
- job_items - an array of items that were extracted from the pages.

Job Item:
- id - the unique identifier of the item.
- status - the status of the item. Can be new, in_progress, done, or error.
- job_id - the job identifier.
- original_url - the URL of the page.
- page_status_code - the status code of the page request.
- raw_content_url - the URL to the raw content of the page.
- cleaned_content_url - the URL to the cleaned content of the page (if scrape_type is cleaned).
- title - the title of the page.
- created_at - the date when the item was created.
- cost - the cost of the item in $.
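The status fields above lend themselves to a few small helpers. This is a sketch under one stated assumption: done and error are treated as terminal statuses, while new and in_progress are not. The function names are hypothetical and work on the job response parsed into a plain dict.

```python
# Assumption: "done" and "error" are terminal; "new" and "in_progress" are not.
TERMINAL_STATUSES = {"done", "error"}


def is_finished(job: dict) -> bool:
    """Return True once the job has reached a terminal status."""
    return job.get("status") in TERMINAL_STATUSES


def items_by_status(job: dict) -> dict:
    """Group the job's items by their status field."""
    groups: dict = {}
    for item in job.get("job_items", []):
        groups.setdefault(item.get("status", "unknown"), []).append(item)
    return groups


def content_url(item: dict, scrape_type: str) -> str:
    """Pick cleaned_content_url for cleaned jobs, raw_content_url otherwise."""
    if scrape_type == "cleaned" and item.get("cleaned_content_url"):
        return item["cleaned_content_url"]
    return item.get("raw_content_url", "")
```

Grouping items by status makes it easy to see which pages failed after a job finishes.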
Example:
{
  "job_id": "23b81e21-c672-4402-a886-303f18de9555",
  "url": "https://stripe.com/",
  "scrape_type": "cleaned",
  "extract_rules": "",
  "whitelist_regexp": "",
  "blacklist_regexp": "",
  "allow_subdomains": false,
  "items_limit": 10,
  "created_at": "2024-06-17T12:22:08.034Z",
  "crawl_delay_ms": 0,
  "finished_at": "2024-06-17T12:23:01.53Z",
  "webhook_url": "https://yourserver.com/webhook",
  "webhook_status": 0,
  "webhook_error": "",
  "status": "done",
  "job_items": [
    {
      "id": "3542eeb1-dd99-4e92-88d4-774a1424737d",
      "job_id": "23b81e21-c672-4402-a886-303f18de9555",
      "original_url": "https://stripe.com/docs/no-code/tap-to-pay",
      "page_status_code": 200,
      "raw_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
      "cleaned_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
      "status": "done",
      "title": "Tap to Pay on the Dashboard mobile app | Stripe Documentation",
      "created_at": "2024-06-17T12:22:19.511Z",
      "updated_at": "2024-06-17T12:22:33.334Z",
      "retries": 0,
      "cost": 0.002
    }
  ]
}
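A response like the one above can be reduced to a short summary. The sketch below assumes that timestamps always carry fractional seconds and a trailing Z (UTC), as they do in this example; summarize and parse_time are hypothetical helper names, and the embedded JSON is a trimmed copy of the example response.

```python
import json
from datetime import datetime, timezone

# Assumption: timestamps look like 2024-06-17T12:22:08.034Z (UTC, fractional seconds).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_time(value: str) -> datetime:
    """Parse an API timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def summarize(job: dict) -> dict:
    """Collect status, page count, duration, and total cost of a finished job."""
    items = job.get("job_items", [])
    duration = parse_time(job["finished_at"]) - parse_time(job["created_at"])
    return {
        "status": job["status"],
        "pages": len(items),
        "duration_seconds": duration.total_seconds(),
        "total_cost": sum(item.get("cost", 0) for item in items),
    }


# Trimmed copy of the example response.
response_text = """
{
  "status": "done",
  "created_at": "2024-06-17T12:22:08.034Z",
  "finished_at": "2024-06-17T12:23:01.53Z",
  "job_items": [
    {"original_url": "https://stripe.com/docs/no-code/tap-to-pay",
     "status": "done", "cost": 0.002}
  ]
}
"""
summary = summarize(json.loads(response_text))
print(summary)
```

Summing the per-item cost field gives the total cost of the job in $.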