What is a Crawling Job?

A job is a task that you run on the Webcrawler API. Jobs are asynchronous: instead of waiting for the result in the original request, you get a notification when the job is done (read more about async requests).
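To make the notification step concrete, here is a minimal sketch of how a receiver might interpret the webhook POST body. It assumes the body is a JSON object carrying the job's id and status fields as described under Job response; that payload shape is not confirmed on this page, and parse_webhook_notification is a hypothetical helper name.

```python
import json


def parse_webhook_notification(body: bytes) -> dict:
    """Decode a webhook POST body and report whether the job has finished.

    Assumption: the body is a JSON object with "id" and "status" keys,
    mirroring the job response fields. Verify against the webhook docs.
    """
    payload = json.loads(body.decode("utf-8"))
    status = payload.get("status")
    return {
        "job_id": payload.get("id"),
        "status": status,
        # Assumption: "done" and "error" are the terminal statuses.
        "finished": status in ("done", "error"),
    }
```

A handler would typically acknowledge the request quickly and then fetch or process the job results only when finished is true.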

Job request parameters

url - (required) the seed URL where the crawler starts. Can be any valid URL.
scrape_type - (default: html) the type of scraping you want to perform. Can be html, cleaned, or markdown.
items_limit - (default: 10) the crawler stops when it reaches this limit of pages for this job.
webhook_url - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
allow_subdomains - (default: false) if true, the crawler also crawls subdomains (for example, blog.example.com if the seed URL is example.com).
max_depth - (optional) maximum depth of crawling from the starting URL. A value of 0 means only the starting page, 1 means the starting page plus pages directly linked from it, 2 adds one more level of depth, and so on. By default, there is no depth limit.
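The filtering parameters above can be read as a per-URL decision. The sketch below is an illustration of that reading, not the crawler's actual implementation: would_crawl is a hypothetical function, and it assumes that patterns match anywhere in the URL (re.search), that the blacklist is checked before the whitelist, and that only the seed host (plus subdomains when allowed) is in scope.

```python
import re
from urllib.parse import urlparse


def would_crawl(url, seed_url, depth, *, whitelist_regexp=None,
                blacklist_regexp=None, allow_subdomains=False, max_depth=None):
    """Return True if a URL found at the given depth passes the job's filters."""
    seed_host = urlparse(seed_url).hostname or ""
    host = urlparse(url).hostname or ""

    # Same host is always in scope; subdomains only when allow_subdomains is true.
    in_scope = host == seed_host or (
        allow_subdomains and host.endswith("." + seed_host)
    )
    if not in_scope:
        return False

    # Depth 0 is the seed page, 1 is pages linked from it, and so on.
    if max_depth is not None and depth > max_depth:
        return False

    # Assumption: a blacklist match always wins over a whitelist match.
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True
```

For instance, with the seed https://example.com and allow_subdomains left at false, a link to https://blog.example.com/post is rejected under these assumptions, while the same link is accepted once allow_subdomains is true.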

Example:

{
    "url": "https://stripe.com/",
    "webhook_url": "https://yourserver.com/webhook",
    "items_limit": 10,
    "scrape_type": "markdown",
    "allow_subdomains": false,
    "max_depth": 2
}

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/crawl \
  --header 'Authorization: Bearer <PASTE YOUR API KEY HERE>' \
  --data '{
    "url": "https://stripe.com/",
    "webhook_url": "https://yourserver.com/webhook",
    "items_limit": 10,
    "scrape_type": "markdown",
    "allow_subdomains": false,
    "max_depth": 2
}'
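The same request can be sketched in Python using only the standard library. The endpoint, Authorization header, and body fields come from the curl example; the Content-Type header is an addition (a common convention for JSON bodies, not shown in the curl example), and build_crawl_request and start_crawl are hypothetical helper names. The shape of the reply is not assumed here; the sketch just decodes it as JSON.

```python
import json
import urllib.request

API_URL = "https://api.webcrawlerapi.com/v1/crawl"


def build_crawl_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a crawling job."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API accepts an explicit JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_crawl(api_key: str, payload: dict) -> dict:
    """Send the request and return the decoded JSON reply."""
    with urllib.request.urlopen(build_crawl_request(api_key, payload)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect or log the request without touching the network.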

Job response

id - the unique identifier of the job.
org_id - your organization identifier.
url - the seed URL where the crawler started.
status - the status of the job. Can be new, in_progress, done, error.
scrape_type - the type of scraping requested for the job (html, cleaned, or markdown).
whitelist_regexp - a regular expression to whitelist URLs.
blacklist_regexp - a regular expression to blacklist URLs.
allow_subdomains - whether the crawler also crawls subdomains.
items_limit - the limit of pages for this job.
max_depth - maximum depth of crawling from the starting URL (if specified in the request).
created_at - the date when the job was created.
finished_at - the date when the job was finished.
webhook_url - the URL where the server will send a POST request once the task is completed.
webhook_status - the status of the webhook request.
webhook_error - the error message if the webhook request failed.
job_items - an array of items that were extracted from the pages.

Job Item:
- id - the unique identifier of the item.
- status - the status of the item. Can be new, in_progress, done, error.
- job_id - the job identifier.
- original_url - the URL of the page.
- page_status_code - the status code of the page request.
- raw_content_url - the URL to the raw content of the page.
- cleaned_content_url - the URL to the cleaned content of the page (present if scrape_type is cleaned; see Crawling Types).
- markdown_content_url - the URL to the markdown content of the page (present if scrape_type is markdown; see Crawling Types).
- title - the title of the page (<title> tag content).
- created_at - the date when the item was created.
- cost - the cost of the item in $.
- referred_url - the URL where the page was referred from.
- last_error - the last error message if the item failed.
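As an illustration of how these fields fit together, the following sketch condenses a decoded job response into a short summary. summarize_job is a hypothetical helper, and it assumes that done and error are the terminal values of status for both jobs and items.

```python
def summarize_job(job: dict) -> dict:
    """Condense a decoded job response into a few headline values."""
    items = job.get("job_items") or []
    failed = [item for item in items if item.get("status") == "error"]
    return {
        # Assumption: "done" and "error" mean the job will not change further.
        "finished": job.get("status") in ("done", "error"),
        "pages": len(items),
        # Map each failed page URL to its last_error message.
        "errors": {item.get("original_url"): item.get("last_error") for item in failed},
    }
```

A summary like this is convenient for logging or for deciding whether a failed page is worth retrying in a new job.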

Example:

{
	"id": "abb39f29-087e-4714-aa05-15537be12f90",
	"org_id": "cm48ww9kw00019rv7bsyfko1d",
	"url": "https://books.toscrape.com/",
	"status": "done",
	"scrape_type": "markdown",
	"whitelist_regexp": ".*category.*",
	"blacklist_regexp": "",
	"allow_subdomains": false,
	"items_limit": 10,
	"max_depth": 2,
	"created_at": "2024-12-15T10:26:13.893Z",
	"finished_at": "2024-12-15T10:26:37.118Z",
	"updated_at": "2024-12-15T10:26:37.118Z",
	"webhook_url": "",
	"job_items": [
		{
			"id": "a46f3117-f97a-4ca2-a434-6cfdcd022b72",
			"job_id": "abb39f29-087e-4714-aa05-15537be12f90",
			"original_url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
			"page_status_code": 200,
			"markdown_content_url": "https://data.webcrawlerapi.com/markdown/books.toscrape.com/https___books_toscrape_com_catalogue_category_books_travel_2_index_html",
			"status": "done",
			"title": "All products | Books to Scrape - Sandbox",
			"last_error": "",
			"created_at": "2024-12-15T10:26:17.941Z",
			"updated_at": "2024-12-15T10:26:23.915Z",
			"cost": 2000,
			"referred_url": "https://books.toscrape.com/"
		}
	]
}
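Given a response like the one above, a client usually wants the content URLs that match the job's scrape_type. The sketch below shows one way to collect them. content_urls is a hypothetical helper, and the mapping from html to raw_content_url is an assumption based on the field descriptions rather than something this page states outright.

```python
# Which job item field holds the content for each scrape_type.
# Assumption: html output is exposed through raw_content_url.
CONTENT_FIELD = {
    "html": "raw_content_url",
    "cleaned": "cleaned_content_url",
    "markdown": "markdown_content_url",
}


def content_urls(job: dict) -> list:
    """Return content URLs of successfully crawled items for the job's scrape_type."""
    field = CONTENT_FIELD[job["scrape_type"]]
    return [
        item[field]
        for item in job.get("job_items", [])
        if item.get("status") == "done" and item.get(field)
    ]
```

Applied to the example response, this would return a single-element list containing the markdown_content_url of the travel category page.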
