GET /job/:id
API endpoint to get the status and details of a crawling job.
https://api.webcrawlerapi.com/v1/job/:id
Method: GET
Request
Available request params
id - (required) the unique identifier of the job.
Example:
https://api.webcrawlerapi.com/v1/job/6c391693-e566-4b99-97ca-5fa00032e281
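A minimal sketch of calling this endpoint with Python's requests library. The Bearer token in the Authorization header is an assumption; check your account documentation for the exact authentication scheme.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your WebCrawlerAPI key
JOB_ID = "6c391693-e566-4b99-97ca-5fa00032e281"

# Assumed auth scheme: API key sent as a Bearer token.
response = requests.get(
    f"https://api.webcrawlerapi.com/v1/job/{JOB_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
job = response.json()
print(job["status"])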
Response
Job contains:
- id - the unique identifier of the job.
- org_id - your organization identifier.
- url - the seed URL where the crawler started.
- status - the status of the job. Can be new, in_progress, done or error.
- scrape_type - the type of scraping you want to perform (html, cleaned or markdown).
- whitelist_regexp - a regular expression to whitelist URLs.
- blacklist_regexp - a regular expression to blacklist URLs.
- allow_subdomains - if the crawler will also crawl subdomains.
- items_limit - the limit of pages for this job.
- created_at - the date when the job was created.
- finished_at - the date when the job was finished.
- webhook_url - the URL where the server will send a POST request once the task is completed.
- webhook_status - the status of the webhook request.
- webhook_error - the error message if the webhook request failed.
- job_items - an array of items that were extracted from the pages (see the download sketch after the example below).

Job Item:
- id - the unique identifier of the item.
- status - the status of the item. Can be new, in_progress, done or error.
- job_id - the job identifier.
- original_url - the URL of the page.
- page_status_code - the status code of the page request.
- raw_content_url - the URL to the raw content of the page.
- cleaned_content_url - the URL to the cleaned content of the page (if scrape_type is cleaned; check Crawling Types).
- markdown_content_url - the URL to the markdown content of the page (if scrape_type is markdown; check Crawling Types).
- title - the title of the page (<title> tag content).
- created_at - the date when the item was created.
- cost - the cost of the item in $.
- referred_url - the URL where the page was referred from.
- last_error - the last error message if the item failed.
Example:
{
"id": "abb39f29-087e-4714-aa05-15537be12f90",
"org_id": "cm48ww9kw00019rv7bsyfko1d",
"url": "https://books.toscrape.com/",
"scrape_type": "markdown",
"whitelist_regexp": ".*category.*",
"blacklist_regexp": "",
"allow_subdomains": false,
"items_limit": 10,
"created_at": "2024-12-15T10:26:13.893Z",
"finished_at": "2024-12-15T10:26:37.118Z",
"updated_at": "2024-12-15T10:26:37.118Z",
"webhook_url": "",
"status": "done",
"job_items": [
{
"id": "a46f3117-f97a-4ca2-a434-6cfdcd022b72",
"job_id": "abb39f29-087e-4714-aa05-15537be12f90",
"original_url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
"page_status_code": 200,
"markdown_content_url": "https://data.webcrawlerapi.com/markdown/books.toscrape.com/https___books_toscrape_com_catalogue_category_books_travel_2_index_html",
"status": "done",
"title": "All products | Books to Scrape - Sandbox",
"last_error": "",
"created_at": "2024-12-15T10:26:17.941Z",
"updated_at": "2024-12-15T10:26:23.915Z",
"cost": 2000,
"referred_url": "https://books.toscrape.com/"
}
]
}
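Each finished job item exposes a content URL matching the job's scrape_type (here markdown_content_url, since the job used markdown). A short sketch of downloading the content, assuming job holds a parsed response like the one above:
# Download the markdown content of each finished item.
for item in job["job_items"]:
    # Only finished items are guaranteed to have a content URL.
    if item["status"] == "done" and item.get("markdown_content_url"):
        content = requests.get(item["markdown_content_url"]).text
        print(item["original_url"], len(content))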
Refer to Job overview for more information about the response fields.
Crawling requests are processed asynchronously: starting a crawl returns a response with a job id, which you can pass to this endpoint to check the status of the scraping task (read more about Async Requests).
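Because a job may still be in_progress when first queried, a common pattern is to poll this endpoint until status reaches a terminal value (done or error). A sketch under the same assumed Bearer auth as the first example:
import time
import requests

def wait_for_job(job_id: str, api_key: str, poll_interval: float = 5.0, timeout: float = 300.0) -> dict:
    # Poll GET /job/:id until the job reaches a terminal status.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"https://api.webcrawlerapi.com/v1/job/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        )
        resp.raise_for_status()
        job = resp.json()
        if job["status"] in ("done", "error"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")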