
POST /crawl

Basic API endpoint to start crawling a website.

https://api.webcrawlerapi.com/v1/crawl

Format: JSON Method: POST

Request

Available request params

  • url - (required) the seed URL where the crawler starts. Can be any valid URL.
  • scrape_type - (default: markdown) the type of scraping you want to perform. Can be html, cleaned, markdown.
  • items_limit - (required) the crawler will stop when it reaches this limit of pages for this job.
  • webhook_url - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
  • main_content_only - (optional) extract only the main content of an article or blog post. When set to true, the scraper will focus on extracting the primary article content while filtering out navigation, sidebars, ads, and other non-essential elements. Default is false.
  • allow_subdomains - (default: false) if true the crawler will also crawl subdomains (for example, blog.example.com if the seed URL is example.com).
  • whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
  • blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
  • respect_robots_txt - (optional) if set to true, the crawler will respect the website's robots.txt file and skip pages that are disallowed by it. Default is false.
  • max_depth - (optional) maximum depth of crawling from the starting URL. A value of 0 means only the starting page, 1 means the starting page plus pages directly linked from it, 2 adds one more level of depth, and so on. By default, there is no depth limit.

Example:

{
    "url": "https://stripe.com/",
    "webhook_url": "https://yourserver.com/webhook",
    "items_limit": 10,
    "scrape_type": "cleaned",
    "main_content_only": true,
    "allow_subdomains": false,
    "respect_robots_txt": true,
    "max_depth": 2
}
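
For illustration, a minimal Python sketch of sending this request with the requests library. The Authorization header shown here is an assumption about the authentication scheme, so check your account settings or the authentication docs for the exact method:

```python
import requests

API_KEY = "YOUR_API_KEY"  # assumed bearer-token auth; verify the actual scheme in your dashboard

payload = {
    "url": "https://stripe.com/",
    "webhook_url": "https://yourserver.com/webhook",
    "items_limit": 10,
    "scrape_type": "cleaned",
    "main_content_only": True,
    "allow_subdomains": False,
    "respect_robots_txt": True,
    "max_depth": 2,
}

# Start the crawl job; the response contains the task id used to track progress.
response = requests.post(
    "https://api.webcrawlerapi.com/v1/crawl",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
job_id = response.json()["id"]
print(f"Crawl job started: {job_id}")
```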

Response

Example:

{
    "id": "23b81e21-c672-4402-a886-303f18de9555"
}

The crawling request is processed asynchronously: the response contains a task id, which you can use to check the status of the scraping task (read more about Async Requests).
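
If you pass a webhook_url, the server will POST a notification there once the job completes. As a sketch of that flow, the handler below uses Flask (a hypothetical choice of framework) and simply accepts whatever JSON the notification contains; the exact payload fields are described on the webhooks page and are not assumed here:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def crawl_finished():
    # Accept the completion notification; see the webhooks docs for the payload structure.
    notification = request.get_json(force=True)
    print("Crawl job finished:", notification)
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)
```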
