API
POST /crawl
Basic API endpoint to start crawling a website.
https://api.webcrawlerapi.com/v1/crawl
Format: JSON
Method: POST
Request
Available request params
url - (required) the seed URL where the crawler starts. Can be any valid URL.
scrape_type - (default: markdown) the type of scraping you want to perform. Can be html, cleaned, or markdown.
items_limit - (required) the crawler stops when it reaches this limit of pages for this job.
webhook_url - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
allow_subdomains - (default: false) if true, the crawler will also crawl subdomains (for example, blog.example.com if the seed URL is example.com).
whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
respect_robots_txt - (optional) if set to true, the crawler will respect the website's robots.txt file and skip pages that are disallowed by it. Default is false.
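The parameters above can be assembled into a request body on the client side. The following is a minimal sketch in Python, not part of the API itself: the helper name build_crawl_payload and all of its validation rules are assumptions made for illustration (in particular, restricting the seed URL to http/https and requiring a positive integer items_limit are client-side guesses, not documented server behavior).

```python
from urllib.parse import urlparse

# Values listed for scrape_type in the parameter description.
ALLOWED_SCRAPE_TYPES = {"html", "cleaned", "markdown"}


def build_crawl_payload(url, items_limit, scrape_type="markdown",
                        webhook_url=None, allow_subdomains=False,
                        whitelist_regexp=None, blacklist_regexp=None,
                        respect_robots_txt=False):
    """Hypothetical helper: build the JSON body for POST /crawl."""
    # Assumption: only absolute http(s) URLs are accepted as seeds.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")

    # Assumption: items_limit must be a positive integer (bool is rejected).
    if isinstance(items_limit, bool) or not isinstance(items_limit, int) or items_limit < 1:
        raise ValueError("items_limit must be a positive integer")

    if scrape_type not in ALLOWED_SCRAPE_TYPES:
        raise ValueError(f"scrape_type must be one of {sorted(ALLOWED_SCRAPE_TYPES)}")

    payload = {
        "url": url,
        "items_limit": items_limit,
        "scrape_type": scrape_type,
        "allow_subdomains": allow_subdomains,
        "respect_robots_txt": respect_robots_txt,
    }

    # Optional string parameters are sent only when the caller sets them.
    optional = {
        "webhook_url": webhook_url,
        "whitelist_regexp": whitelist_regexp,
        "blacklist_regexp": blacklist_regexp,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Checking required fields and the scrape_type value before sending catches obvious mistakes locally instead of spending a request on them; the server remains the authority on what is actually valid.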
Example:
{
"url": "https://stripe.com/",
"webhook_url": "https://yourserver.com/webhook",
"items_limit": 10,
"scrape_type": "cleaned",
"allow_subdomains": false,
"respect_robots_txt": true
}
Response
Example:
{
"id": "23b81e21-c672-4402-a886-303f18de9555"
}
Crawl requests are processed asynchronously: the response contains a task id, which you can use to check the status of the scraping task (read more about Async Requests).
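As an illustration of the request/response flow, here is a minimal sketch that submits a crawl job and reads the task id from the JSON response, using only Python's standard library. The function names are hypothetical. This section does not describe authentication, so the Authorization bearer header and the api_key argument are assumptions; adjust them to whatever your account requires.

```python
import json
import urllib.request

# Endpoint URL as given in this section.
CRAWL_ENDPOINT = "https://api.webcrawlerapi.com/v1/crawl"


def make_crawl_request(payload, api_key):
    """Build (but do not send) the POST request for a crawl job."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: bearer-token auth; not documented in this section.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(CRAWL_ENDPOINT, data=body,
                                  headers=headers, method="POST")


def extract_task_id(response_body):
    """Pull the task id out of a response such as {"id": "..."}."""
    return json.loads(response_body)["id"]


def start_crawl(payload, api_key, timeout=30):
    """Send the request and return the task id of the queued crawl."""
    request = make_crawl_request(payload, api_key)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_task_id(response.read())


# Example usage (performs a real network call, so it is left commented out):
# task_id = start_crawl({"url": "https://stripe.com/", "items_limit": 10},
#                       api_key="YOUR_API_KEY")
# print(task_id)
```

Because the call returns as soon as the job is queued, the returned id is the only handle on the work; store it so that the status can be checked later or matched against the webhook notification.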