Job

A job is a task that you run on the Webcrawler API. Jobs are asynchronous: you submit a job and receive a notification when it is done (read more about async requests).

Job request parameters

  • url - (required) the seed URL where the crawler starts. Can be any valid URL.
  • scrape_type - (default: html) the type of scraping you want to perform. Can be html or cleaned.
  • items_limit - (default: 20) the crawler stops when it reaches this limit of pages for this job.
  • webhook_url - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
  • crawl_delay_ms - (default: 2000) delay between requests in milliseconds. To respect the website and avoid being blocked, we recommend leaving it at the default.
  • max_retries - (default: 2) the number of retries if a page request fails.
  • whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
  • blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
  • allow_subdomains - (default: false) if true, the crawler will also crawl subdomains (for example, blog.example.com if the seed URL is example.com).

Example:

{
    "url": "https://stripe.com/",
    "webhook_url": "https://yourserver.com/webhook",
    "items_limit": 10,
    "crawl_delay_ms": 2000,
    "max_retries": 1,
    "scrape_type": "cleaned",
    "allow_subdomains": false
}
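
The parameters above can be assembled into a request body programmatically. The following is a minimal sketch in Python, not part of the official API client: build_job_request is a hypothetical helper name, the defaults are copied from the parameter list above, and the regular expressions are only checked with Python's re module, which may differ from the regex flavor the crawler itself uses.

```python
import json
import re

# Defaults copied from the parameter list above.
DEFAULTS = {
    "scrape_type": "html",
    "items_limit": 20,
    "crawl_delay_ms": 2000,
    "max_retries": 2,
    "allow_subdomains": False,
}
ALLOWED_SCRAPE_TYPES = {"html", "cleaned"}
OPTIONAL_KEYS = set(DEFAULTS) | {"webhook_url", "whitelist_regexp", "blacklist_regexp"}


def build_job_request(url, **overrides):
    """Hypothetical helper: merge overrides onto the documented defaults."""
    if not url:
        raise ValueError("url is required")
    unknown = set(overrides) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    payload = {"url": url, **DEFAULTS, **overrides}

    if payload["scrape_type"] not in ALLOWED_SCRAPE_TYPES:
        raise ValueError("scrape_type must be html or cleaned")

    # Assumption: a local syntax check with Python's re is a useful sanity
    # check, even though the server may use a different regex engine.
    for key in ("whitelist_regexp", "blacklist_regexp"):
        if key in payload:
            re.compile(payload[key])  # raises re.error on invalid syntax

    return payload


if __name__ == "__main__":
    body = build_job_request(
        "https://stripe.com/",
        webhook_url="https://yourserver.com/webhook",
        items_limit=10,
        max_retries=1,
        scrape_type="cleaned",
    )
    print(json.dumps(body, indent=4))
```

Validating on the client side catches a misspelled scrape_type or a malformed pattern before a job is submitted. How the resulting body is sent (endpoint, authentication) is outside this section and is not assumed here.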

Job response

  • id - the unique identifier of the job.

  • url - the seed URL where the crawler started.

  • status - the status of the job. Can be new, in_progress, done, error.

  • scrape_type - the type of scraping performed for this job.

  • extract_rules - an object with rules to extract data from the page.

  • whitelist_regexp - a regular expression to whitelist URLs.

  • blacklist_regexp - a regular expression to blacklist URLs.

  • allow_subdomains - whether the crawler also crawls subdomains.

  • items_limit - the limit of pages for this job.

  • crawl_delay_ms - delay between requests in milliseconds.

  • max_retries - the number of retries if a page request fails.

  • created_at - the date when the job was created.

  • finished_at - the date when the job was finished.

  • webhook_url - the URL where the server will send a POST request once the task is completed.

  • webhook_status - the status of the webhook request.

  • webhook_error - the error message if the webhook request failed.

  • job_items - an array of items that were extracted from the pages.

    Job Item:

    • id - the unique identifier of the item.
    • status - the status of the item. Can be new, in_progress, done, error.
    • job_id - the job identifier.
    • original_url - the URL of the page.
    • page_status_code - the status code of the page request.
    • raw_content_url - the URL to the raw content of the page.
    • cleaned_content_url - the URL to the cleaned content of the page (if scrape_type is cleaned).
    • title - the title of the page.
    • created_at - the date when the item was created.
    • cost - the cost of the item in $.

Example:

{
	"job_id": "23b81e21-c672-4402-a886-303f18de9555",
	"url": "https://stripe.com/",
	"scrape_type": "cleaned",
	"extract_rules": "",
	"whitelist_regexp": "",
	"blacklist_regexp": "",
	"allow_subdomains": false,
	"items_limit": 10,
	"created_at": "2024-06-17T12:22:08.034Z",
	"crawl_delay_ms": 0,
	"finished_at": "2024-06-17T12:23:01.53Z",
	"webhook_url": "https://yourserver.com/webhook",
	"webhook_status": 0,
	"webhook_error": "",
	"status": "done",
	"job_items": [
		{
			"id": "3542eeb1-dd99-4e92-88d4-774a1424737d",
			"job_id": "23b81e21-c672-4402-a886-303f18de9555",
			"original_url": "https://stripe.com/docs/no-code/tap-to-pay",
			"page_status_code": 200,
			"raw_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
			"cleaned_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
			"status": "done",
			"title": "Tap to Pay on the Dashboard mobile app | Stripe Documentation",
			"created_at": "2024-06-17T12:22:19.511Z",
			"updated_at": "2024-06-17T12:22:33.334Z",
			"retries": 0,
			"cost": 0.002
		}
	]
}
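
Once a job response like the one above has been received, its job_items can be reduced to the details most callers need. The sketch below is an illustration in Python under stated assumptions: summarize_job is a hypothetical helper, the field names are taken from the response description above, and the sample input is a trimmed copy of the example rather than live API output.

```python
import json


def summarize_job(job):
    """Hypothetical helper: condense a job response into counts, cost, and URLs."""
    items = job.get("job_items", [])
    done = [item for item in items if item.get("status") == "done"]
    failed = [item for item in items if item.get("status") == "error"]

    # Prefer the cleaned content when present, otherwise fall back to raw.
    content_urls = [
        item.get("cleaned_content_url") or item.get("raw_content_url")
        for item in done
    ]

    return {
        "status": job.get("status"),
        "pages_done": len(done),
        "pages_failed": len(failed),
        "total_cost": sum(item.get("cost", 0) for item in items),
        "content_urls": content_urls,
    }


if __name__ == "__main__":
    # Trimmed copy of the example response above.
    sample = json.loads("""
    {
        "status": "done",
        "job_items": [
            {
                "original_url": "https://stripe.com/docs/no-code/tap-to-pay",
                "page_status_code": 200,
                "raw_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
                "status": "done",
                "cost": 0.002
            }
        ]
    }
    """)
    print(summarize_job(sample))
```

The content URLs point to stored page content rather than embedding it in the response, so fetching each URL is a separate step that this sketch leaves out.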