What is a Crawling Job?
A job is a task that you run on the WebcrawlerAPI. Jobs are asynchronous: you receive a notification when a job is done (read more about async requests).
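Because jobs are asynchronous, a common pattern is to expose a small HTTP endpoint and pass its URL as `webhook_url` when creating the job. Here is a minimal sketch of such a receiver using Flask; the payload shape is an assumption (the job object described under "Job response" below), so verify it against the webhooks documentation:

```python
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhook")
def crawl_finished():
    # Assumed payload: the job object described under "Job response" below.
    job = request.get_json(force=True)
    print(f"Job {job.get('id')} finished with status {job.get('status')}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```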
Job request parameters
- `url` - (required) the seed URL where the crawler starts. Can be any valid URL.
- `scrape_type` - (default: `html`) the type of scraping you want to perform. Can be `html`, `cleaned`, or `markdown`.
- `items_limit` - (default: `10`) the crawler stops when it reaches this limit of pages for this job.
- `webhook_url` - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
- `whitelist_regexp` - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled (see the sketch after this list).
- `blacklist_regexp` - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
- `allow_subdomains` - (default: `false`) if `true`, the crawler will also crawl subdomains (for example, `blog.example.com` if the seed URL is `example.com`).
- `max_depth` - (optional) the maximum depth of crawling from the starting URL. A value of `0` means only the starting page, `1` means the starting page plus pages directly linked from it, `2` adds one more level of depth, and so on. By default, there is no depth limit.
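To make the URL-filtering parameters concrete, here is a small Python sketch of how `whitelist_regexp` and `blacklist_regexp` can be expected to interact. The exact matching semantics (pattern flavor, precedence) are an assumption for illustration, so verify them against your own jobs:

```python
import re

WHITELIST = r".*category.*"  # only URLs matching this are crawled
BLACKLIST = r".*login.*"     # URLs matching this are skipped

def should_crawl(url: str) -> bool:
    # Assumed semantics: the blacklist is checked first, then the
    # whitelist must match (when either pattern is set).
    if BLACKLIST and re.search(BLACKLIST, url):
        return False
    if WHITELIST and not re.search(WHITELIST, url):
        return False
    return True

print(should_crawl("https://books.toscrape.com/catalogue/category/books/travel_2/index.html"))  # True
print(should_crawl("https://books.toscrape.com/login"))  # False
```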
Example:
```json
{
"url": "https://stripe.com/",
"webhook_url": "https://yourserver.com/webhook",
"items_limit": 10,
"scrape_type": "markdown",
"allow_subdomains": false,
"max_depth": 2
}
```

```bash
curl --request POST \
--url https://api.webcrawlerapi.com/v1/crawl \
--header 'Authorization: Bearer <PASTE YOUR API KEY HERE>' \
--data '{
"url": "https://stripe.com/",
"webhook_url": "https://yourserver.com/webhook",
"items_limit": 10,
"scrape_type": "markdown",
"allow_subdomains": false,
"max_depth": 2
}'
```
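The same job can be created from code. Here is a minimal Python sketch using the `requests` library, mirroring the curl example above (paste your own API key in place of the placeholder):

```python
import requests

API_KEY = "<PASTE YOUR API KEY HERE>"

# Create a crawling job; this mirrors the curl example above.
response = requests.post(
    "https://api.webcrawlerapi.com/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://stripe.com/",
        "webhook_url": "https://yourserver.com/webhook",
        "items_limit": 10,
        "scrape_type": "markdown",
        "allow_subdomains": False,
        "max_depth": 2,
    },
)
response.raise_for_status()
print(response.json())  # the returned JSON identifies the created job (see "Job response" below)
```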
Job response

- `id` - the unique identifier of the job.
- `org_id` - your organization identifier.
- `url` - the seed URL where the crawler started.
- `status` - the status of the job. Can be `new`, `in_progress`, `done`, or `error`.
- `scrape_type` - the type of scraping performed (`html`, `cleaned`, or `markdown`).
- `whitelist_regexp` - a regular expression to whitelist URLs.
- `blacklist_regexp` - a regular expression to blacklist URLs.
- `allow_subdomains` - whether the crawler also crawls subdomains.
- `items_limit` - the limit of pages for this job.
- `max_depth` - the maximum depth of crawling from the starting URL (if specified in the request).
- `created_at` - the date when the job was created.
- `finished_at` - the date when the job finished.
- `webhook_url` - the URL where the server will send a POST request once the task is completed.
- `webhook_status` - the status of the webhook request.
- `webhook_error` - the error message if the webhook request failed.
- `job_items` - an array of items that were extracted from the pages.

Job Item:

- `id` - the unique identifier of the item.
- `status` - the status of the item. Can be `new`, `in_progress`, `done`, or `error`.
- `job_id` - the job identifier.
- `original_url` - the URL of the page.
- `page_status_code` - the status code of the page request.
- `raw_content_url` - the URL to the raw content of the page.
- `cleaned_content_url` - the URL to the cleaned content of the page (if `scrape_type` is `cleaned`; check Crawling Types).
- `markdown_content_url` - the URL to the markdown content of the page (if `scrape_type` is `markdown`; check Crawling Types).
- `title` - the title of the page (the `<title>` tag content).
- `created_at` - the date when the item was created.
- `cost` - the cost of the item in $.
- `referred_url` - the URL the page was referred from.
- `last_error` - the last error message if the item failed.
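Putting the lifecycle together, a client typically polls the job until its `status` becomes `done` (or waits for the webhook) and then downloads each item's content URL. Here is a minimal polling sketch; note that the job-status route used here is an assumption for illustration, so check the API reference for the exact endpoint:

```python
import time

import requests

API_KEY = "<PASTE YOUR API KEY HERE>"
JOB_ID = "abb39f29-087e-4714-aa05-15537be12f90"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# NOTE: the polling route below is an assumption for illustration;
# consult the API reference for the exact job-status endpoint.
job = {}
while True:
    job = requests.get(
        f"https://api.webcrawlerapi.com/v1/job/{JOB_ID}", headers=HEADERS
    ).json()
    if job["status"] in ("done", "error"):
        break
    time.sleep(5)  # jobs are asynchronous, so poll with a delay

# Download the markdown content of every finished item.
for item in job.get("job_items", []):
    if item["status"] == "done" and item.get("markdown_content_url"):
        markdown = requests.get(item["markdown_content_url"]).text
        print(item["original_url"], len(markdown))
```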
Example:
```json
{
"id": "abb39f29-087e-4714-aa05-15537be12f90",
"org_id": "cm48ww9kw00019rv7bsyfko1d",
"url": "https://books.toscrape.com/",
"status": "done",
"scrape_type": "markdown",
"whitelist_regexp": ".*category.*",
"blacklist_regexp": "",
"allow_subdomains": false,
"items_limit": 10,
"max_depth": 2,
"created_at": "2024-12-15T10:26:13.893Z",
"finished_at": "2024-12-15T10:26:37.118Z",
"updated_at": "2024-12-15T10:26:37.118Z",
"webhook_url": "",
"job_items": [
{
"id": "a46f3117-f97a-4ca2-a434-6cfdcd022b72",
"job_id": "abb39f29-087e-4714-aa05-15537be12f90",
"original_url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
"page_status_code": 200,
"markdown_content_url": "https://data.webcrawlerapi.com/markdown/books.toscrape.com/https___books_toscrape_com_catalogue_category_books_travel_2_index_html",
"status": "done",
"title": "All products | Books to Scrape - Sandbox",
"last_error": "",
"created_at": "2024-12-15T10:26:17.941Z",
"updated_at": "2024-12-15T10:26:23.915Z",
"cost": 2000,
"referred_url": "https://books.toscrape.com/"
}
]
}
```