WebcrawlerAPI supports several types of scraping. You choose one by setting the scrape_type parameter in the request.
Currently supported types are:
- html: returns the raw HTML of the page.
- cleaned: returns the cleaned content of the page.
HTML scraping
HTML scraping is the most basic type of scraping. It returns the raw HTML of the page. No manipulation is done on the content.
This is the default scrape option. To use it, omit the scrape_type option in the request or set it to html.
Example of the content:
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
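Selecting a scrape type when preparing a request could be sketched as follows. This is only an illustration: the helper build_request_body and the "url" field name are assumptions made for the example, not part of the documented API. Only the scrape_type parameter and its two values (html, cleaned) come from this page.

```python
import json

# The two values described on this page.
ALLOWED_SCRAPE_TYPES = ("html", "cleaned")


def build_request_body(url, scrape_type=None):
    """Build a request body dict (hypothetical helper, not an official client).

    When scrape_type is None the key is omitted, which this page describes
    as equivalent to requesting the default "html" type.
    """
    body = {"url": url}  # assumption: the target page is passed as "url"
    if scrape_type is not None:
        if scrape_type not in ALLOWED_SCRAPE_TYPES:
            raise ValueError(
                f"scrape_type must be one of {ALLOWED_SCRAPE_TYPES}, got {scrape_type!r}"
            )
        body["scrape_type"] = scrape_type
    return body


# Default behaviour: scrape_type is left out of the request entirely.
default_body = build_request_body("https://example.com")

# Explicitly asking for raw HTML.
html_body = build_request_body("https://example.com", "html")

print(json.dumps(default_body))
print(json.dumps(html_body))
```

Validating the value before sending keeps a typo such as "clean" from reaching the service as an unsupported type.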
Cleaned scraping
Cleaned scraping removes unnecessary elements from the page and returns the cleaned content. BeautifulSoup4 is used to clean the data.
To use it, set the scrape_type option to cleaned.
Example of cleaned content:
Example Domain\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n
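To get a feel for what cleaned content looks like, the sketch below strips tags from a small HTML string and joins the remaining text with newlines. It is a rough approximation only: the service uses BeautifulSoup4, while this example swaps in Python's standard html.parser module so that it runs without extra dependencies. The class TextExtractor and the function approximate_cleaned are hypothetical names, and the exact cleaning rules of the service are not reproduced here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style blocks.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def approximate_cleaned(html):
    """Return the text of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>Example Domain</title><style>p{margin:0}</style></head>"
    "<body><h1>Example Domain</h1><p>More information...</p></body></html>"
)
cleaned = approximate_cleaned(sample)
print(cleaned)
```

As in the example above, the page title and the heading both survive as text, which is why "Example Domain" appears twice.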