Scraping request

A scraping request fetches a web page and returns its content, either as raw HTML or in cleaned form.

Request

Configure the scraping request with the following parameters:

  • url - the URL of the page you want to scrape.
  • scrape_type - the type of scraping to perform. Either html or cleaned.
  • webhook_url - (optional) a URL to which the server sends a POST request once the task is completed.

Example:

{
  "url": "https://stripe.com/",
  "scrape_type": "cleaned",
  "webhook_url": "https://yourserver.com/webhook"
}
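
The request body above can be assembled and checked in code before it is sent. The following is a minimal Python sketch, not an official client: the names build_scrape_request and ALLOWED_SCRAPE_TYPES are illustrative, and because this section does not state the endpoint or the authentication scheme, the sketch stops at producing the JSON body.

```python
import json
from typing import Optional
from urllib.parse import urlparse

# Values listed for scrape_type in the parameter list above.
ALLOWED_SCRAPE_TYPES = {"html", "cleaned"}


def build_scrape_request(url: str, scrape_type: str,
                         webhook_url: Optional[str] = None) -> dict:
    """Build a request body from the documented parameters (hypothetical helper)."""
    # Local sanity check only; the server may apply different rules.
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")
    if scrape_type not in ALLOWED_SCRAPE_TYPES:
        raise ValueError(
            f"scrape_type must be one of {sorted(ALLOWED_SCRAPE_TYPES)}"
        )
    body = {"url": url, "scrape_type": scrape_type}
    if webhook_url is not None:
        # webhook_url is optional, so it is left out when not provided.
        body["webhook_url"] = webhook_url
    return body


payload = build_scrape_request(
    "https://stripe.com/",
    "cleaned",
    webhook_url="https://yourserver.com/webhook",
)
print(json.dumps(payload, indent=2))
```

Validating scrape_type locally is a design choice made here so that an unsupported value is caught before a request is made; how the server itself handles such values is not described in this section.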

Response

How to retrieve the scrape response is described in a separate guide.

The response contains the content extracted from the page, along with the following fields:

  • id - the unique identifier of the request (shown as job_id in the example below).
  • scrape_type - the type of scraping that was performed (shown as type in the example below).
  • extracted_content - the content extracted from the page: raw HTML, cleaned content, or a JSON string, depending on the scrape_type.
  • created_at - the date and time when the request was created.
  • page_status_code - the HTTP status code returned by the page request.

Example:

{
  "job_id": "bd98c98a-99a5-43ea-b650-a8c7662d4d28",
  "type": "html",
  "extracted_content": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset=\"utf-8\" />\n <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n <style type=\"text/css\">\n body {\n background-color: #f0f0f2;\n margin: 0;\n padding: 0;\n font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n \n }\n div {\n width: 600px;\n margin: 5em auto;\n padding: 2em;\n background-color: #fdfdff;\n border-radius: 0.5em;\n box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n }\n a:link, a:visited {\n color: #38488f;\n text-decoration: none;\n }\n @media (max-width: 700px) {\n div {\n margin: 0 auto;\n width: auto;\n }\n }\n </style> \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n</div>\n</body>\n</html>\n\n",
  "page_status_code": 200,
  "created_at": "2024-06-17 07:02:51"
}
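
A response like the one above can be turned into typed values on the client side. The sketch below is written under a few assumptions: parse_scrape_response is a hypothetical helper, it accepts both id / scrape_type (the names in the field list) and job_id / type (the names in the example), and it assumes created_at always uses the "YYYY-MM-DD HH:MM:SS" layout seen in the example. The time zone of created_at is not stated here, so the result is a naive datetime. The extracted_content in the sample is shortened for readability.

```python
import json
from datetime import datetime

# Assumed layout of created_at, inferred from the example response.
CREATED_AT_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_scrape_response(raw: str) -> dict:
    """Parse a scrape response body into typed fields (hypothetical helper)."""
    data = json.loads(raw)
    return {
        # Accept either naming, since the field list and the example differ.
        "id": data.get("id", data.get("job_id")),
        "scrape_type": data.get("scrape_type", data.get("type")),
        "extracted_content": data["extracted_content"],
        "page_status_code": int(data["page_status_code"]),
        # Naive datetime: no time zone is given in the response.
        "created_at": datetime.strptime(data["created_at"], CREATED_AT_FORMAT),
    }


# Sample body modeled on the example above, with shortened content.
sample = json.dumps({
    "job_id": "bd98c98a-99a5-43ea-b650-a8c7662d4d28",
    "type": "html",
    "extracted_content": "<!doctype html>\n<html>\n<body>\n<h1>Example Domain</h1>\n</body>\n</html>\n",
    "page_status_code": 200,
    "created_at": "2024-06-17 07:02:51",
})

result = parse_scrape_response(sample)
if result["page_status_code"] == 200:
    print(result["id"], result["scrape_type"], result["created_at"].isoformat())
```

Checking page_status_code before using extracted_content is a cautious default, since a non-200 status suggests the content may be an error page rather than the page you asked for.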