
Scraping type

WebcrawlerAPI supports several types of scraping. You select one by setting the scrape_type parameter in the request.

Currently supported types are:

  • html - returns the raw HTML of the page.
  • cleaned - returns the content of the page with unnecessary elements removed.

HTML scraping

HTML scraping is the most basic type of scraping. It returns the raw HTML of the page, with no manipulation of the content.

This is the default scrape option. To use it, omit the scrape_type option in the request or set it to html.
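
As a minimal sketch of the rule above, the following Python builds a request body that either omits scrape_type (falling back to the html default) or sets it to one of the two supported values. Only the scrape_type parameter and its values html and cleaned come from this page. The url field name and the build_scrape_payload helper are assumptions made for illustration, and no endpoint is shown because this page does not specify one.

```python
import json
from typing import Optional

# The two values listed on this page as currently supported.
SUPPORTED_SCRAPE_TYPES = {"html", "cleaned"}


def build_scrape_payload(url: str, scrape_type: Optional[str] = None) -> dict:
    """Build a request body (hypothetical helper, not part of any SDK).

    The "url" key is an assumed field name; "scrape_type" is the
    parameter described on this page.
    """
    payload = {"url": url}
    if scrape_type is not None:
        if scrape_type not in SUPPORTED_SCRAPE_TYPES:
            raise ValueError(f"unsupported scrape_type: {scrape_type!r}")
        payload["scrape_type"] = scrape_type
    # When scrape_type is None the key is omitted, so the default applies.
    return payload


if __name__ == "__main__":
    print(json.dumps(build_scrape_payload("https://example.com")))
    print(json.dumps(build_scrape_payload("https://example.com", "html")))
```

Omitting the key rather than sending an empty value keeps the request aligned with the documented default behavior.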

Example of the content:

<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Cleaned scraping

Cleaned scraping removes unnecessary elements from the page and returns the remaining content. BeautifulSoup4 is used to clean the data.

To use it, set the scrape_type option to cleaned.
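
To give a rough idea of what this kind of cleaning involves, here is a dependency-free sketch that strips tags and skips script and style content. It deliberately swaps in Python's standard-library html.parser instead of BeautifulSoup4, so it is only an approximation: the TextExtractor class, the extract_text helper, and the choice of which elements to skip are assumptions, and the output is not guaranteed to match what the service returns.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content.

    Illustrative only: the actual cleaning rules of the service
    are not documented on this page.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text nodes of the document joined by newlines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = "<html><body><h1>Example Domain</h1><p>More information...</p></body></html>"
    print(extract_text(page))
```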

Example of cleaned content:

Example Domain\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n
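
The cleaned content above is a single string with newline separators. A small sketch of how a client might turn it into a list of non-empty lines follows; the cleaned_lines helper is hypothetical, and the sample string is copied from the example above, assuming the \n sequences have already been decoded into real newlines (as a JSON parser would do).

```python
from typing import List


def cleaned_lines(cleaned: str) -> List[str]:
    """Split cleaned content on newlines and drop blank entries."""
    return [line.strip() for line in cleaned.split("\n") if line.strip()]


if __name__ == "__main__":
    sample = (
        "Example Domain\nExample Domain\n"
        "This domain is for use in illustrative examples in documents. You may use this\n"
        " domain in literature without prior coordination or asking for permission.\n"
        "More information...\n"
    )
    for line in cleaned_lines(sample):
        print(line)
```

Stripping each line also removes the leading space that appears where the original HTML wrapped a paragraph across two lines.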