Crawl (Start Job)

POST /crawl

API endpoint that starts a crawl job for a website.

https://api.webcrawlerapi.com/v1/crawl

Format: JSON
Method: POST

Request

Available request params

  • url - (required) the seed URL where the crawler starts. Can be any valid URL.
  • scrape_type - (default: html) the type of scraping to perform. Can be html, cleaned, or markdown.
  • items_limit - (default: 10) the crawler will stop when it reaches this limit of pages for this job.
  • webhook_url - (optional) the URL where the server will send a POST request once the job is completed (read more about webhooks and async requests; see the receiver sketch after this list).
  • allow_subdomains - (default: false) if true the crawler will also crawl subdomains (for example, blog.example.com if the seed URL is example.com).
  • whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
  • blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
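
If you supply webhook_url, your server needs an endpoint that accepts the completion POST. Below is a minimal receiver sketch in Python using only the standard library; the payload field shown ("id") is an assumption, so check the webhooks documentation for the exact schema:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body sent when the job completes.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # The "id" field name is an assumption; verify it against the webhook docs.
        print("crawl job finished:", payload.get("id"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()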

Example:

{
    "url": "https://stripe.com/",
    "webhook_url": "https://yourserver.com/webhook",
    "items_limit": 10,
    "scrape_type": "cleaned",
    "allow_subdomains": false
}
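
The same request can be sent from code. Here is a minimal sketch using Python's requests library; the Bearer token header is an assumption, so use whatever authentication scheme your account requires:

import requests

API_KEY = "YOUR_API_KEY"  # placeholder; the auth scheme here is an assumption

response = requests.post(
    "https://api.webcrawlerapi.com/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://stripe.com/",
        "webhook_url": "https://yourserver.com/webhook",
        "items_limit": 10,
        "scrape_type": "cleaned",
        "allow_subdomains": False,
    },
    timeout=30,
)
response.raise_for_status()
print("job id:", response.json()["id"])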

Response

Example:

{
    "id": "23b81e21-c672-4402-a886-303f18de9555"
}

The crawl request is processed asynchronously: the response contains a job id, which you can use to check the status of the crawling job (read more about Async Requests), as in the polling sketch below.
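
If you are not using webhooks, you can poll for the result instead. The sketch below assumes a hypothetical GET /v1/job/{id} status endpoint and a "status" field; check the Async Requests documentation for the actual path and response shape:

import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
job_id = "23b81e21-c672-4402-a886-303f18de9555"

while True:
    # The /v1/job/{id} path and the "status" values are assumptions.
    r = requests.get(
        f"https://api.webcrawlerapi.com/v1/job/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    r.raise_for_status()
    job = r.json()
    if job.get("status") in ("done", "error"):
        break
    time.sleep(5)

print(job)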