Crawling output format types
WebcrawlerAPI can return scraped page content in several formats. Select the format by setting the scrape_type parameter in the request.
Currently supported types are:
markdown - returns the content of the page in markdown format.
cleaned - returns the cleaned content of the page.
html - returns the raw HTML of the page.
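As a rough illustration of how a client might pick one of these types, here is a minimal sketch that builds a request body and rejects unknown values before anything is sent. Only the scrape_type parameter name and its three values come from this page. The build_crawl_payload helper and the url field name are assumptions made for the example, so check the API reference for the real request shape.

```python
from typing import Dict, Optional

# The three values listed on this page; everything else here is illustrative.
SUPPORTED_SCRAPE_TYPES = ("markdown", "cleaned", "html")


def build_crawl_payload(url: str, scrape_type: Optional[str] = None) -> Dict[str, str]:
    """Build a JSON-serialisable request body (hypothetical shape).

    The "url" field name is an assumption; only "scrape_type" is documented here.
    """
    payload = {"url": url}
    if scrape_type is None:
        # Leaving scrape_type out lets the service apply its default type.
        return payload
    if scrape_type not in SUPPORTED_SCRAPE_TYPES:
        raise ValueError(
            f"unsupported scrape_type {scrape_type!r}; "
            f"expected one of {SUPPORTED_SCRAPE_TYPES}"
        )
    payload["scrape_type"] = scrape_type
    return payload
```

Calling build_crawl_payload("https://example.com", "markdown") returns a dictionary that can be serialised with json.dumps and sent with whichever HTTP client you already use.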
Markdown format
The markdown type returns the page content with markdown formatting preserved, such as headings, links, and lists.
Markdown is often more useful than plain text as reference data for LLMs and other AI tools, because the formatting signals how much weight each piece of text carries. For example, headings indicate what the text that follows is about, which can help a model understand the context and produce better results.
Example of markdown formatted text:
# Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
(See also the UrlToMarkdown API.)
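To make the point about headings concrete, the sketch below splits markdown output into sections keyed by their headings, so each chunk passed to an LLM carries its own title. This is one possible client-side approach, not something the API does for you. The split_by_headings helper is a hypothetical name, and it only recognises "#"-style headings, so a "#" line inside a code fence would be misread as a heading.

```python
import re
from typing import Dict, List

# Matches headings such as "# Title" or "### Sub title" (one to six "#").
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split markdown text into sections, one per heading.

    Text before the first heading becomes a section with level 0 and an
    empty title. Headings inside fenced code blocks are not special-cased.
    """
    sections = []
    current = {"level": 0, "title": "", "lines": []}

    def flush(section):
        body = "\n".join(section["lines"]).strip()
        # Skip a leading section that has neither a title nor any text.
        if section["title"] or body:
            sections.append(
                {"level": section["level"], "title": section["title"], "body": body}
            )

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    flush(current)
    return sections
```

Each returned section keeps the heading level, so a caller can, for instance, prepend parent titles to a chunk before sending it to a model.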
Cleaned scraping
Cleaned scraping removes unnecessary elements from the page and returns the remaining content. BeautifulSoup4 is used to clean the data.
To use it, set the scrape_type option to cleaned.
Example of cleaned content:
Example Domain
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
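For a feel of what this kind of cleaning involves, the following sketch extracts the visible text from an HTML string. It is an approximation only: the service uses BeautifulSoup4, while this example uses Python's standard html.parser module as a dependency-free stand-in, and the choice of skipped tags is an assumption rather than the service's actual rule set. The names TextExtractor and clean_html are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text (an assumed, minimal set).
SKIPPED_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script and style contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace, including line breaks, to one space.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Because the page title and the first heading are both text nodes, a page like example.com yields "Example Domain" twice, which is consistent with the cleaned example above.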
HTML scraping
HTML scraping is the most basic type of scraping. It returns the raw HTML of the page. No manipulation is done on the content.
This is the default scrape option. To use it, omit the scrape_type option from the request or set it to html.
Example of the content:
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>