Crawling output format types

WebcrawlerAPI can return scraped content in several output formats. Select the format by setting the scrape_type parameter in the request.

Currently supported types are:

  • markdown - returns the content of the page in markdown format.
  • cleaned - returns the content of the page with unnecessary elements removed.
  • html - returns the raw HTML of the page (default).
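
As a minimal sketch of selecting an output type, the helper below builds a JSON request body that carries the scrape_type parameter. Only scrape_type and its three values come from this page. The helper name build_crawl_payload and the "url" field name are assumptions made for illustration, and the endpoint and authentication are deliberately left out.

```python
import json

# The three values listed above; "html" is the documented default.
SUPPORTED_SCRAPE_TYPES = ("markdown", "cleaned", "html")


def build_crawl_payload(url, scrape_type=None):
    """Build a request body (hypothetical helper, not an official SDK call).

    The "url" field name is an assumption; "scrape_type" comes from the docs.
    """
    payload = {"url": url}
    if scrape_type is not None:
        # Reject values that are not among the documented types.
        if scrape_type not in SUPPORTED_SCRAPE_TYPES:
            raise ValueError(f"unsupported scrape_type: {scrape_type!r}")
        payload["scrape_type"] = scrape_type
    # If scrape_type is omitted, the key is left out of the body entirely.
    return payload


if __name__ == "__main__":
    print(json.dumps(build_crawl_payload("https://example.com", "markdown")))
```

Validating the value on the client side is a design choice of this sketch; it surfaces a typo such as "markdwon" before any request is sent.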

Markdown format

The markdown type returns the page content with markdown formatting preserved, such as headings, links, and lists. To use it, set the scrape_type option to markdown.

Markdown is often more useful than plain text as reference data for LLMs, because the formatting signals the relative "weight" of the words. For example, headings indicate what the following text is about, which can help the model interpret the context and produce better results.

Example of markdown-formatted text:

# Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.


[More information...](https://www.iana.org/domains/example)

(See also the UrlToMarkdown API.)
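
To illustrate how headings can carry context, here is a minimal sketch that groups markdown output into (heading, body) sections, for example before passing them to an LLM as separate reference chunks. The function name split_by_headings is hypothetical and not part of WebcrawlerAPI. The sketch only recognizes "#"-style headings and does not account for "#" lines inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text):
    """Group markdown lines into (heading, body) sections.

    Text that appears before the first heading gets a heading of None.
    """
    sections = []
    heading, body = None, []

    def flush():
        # Store the current section unless it is completely empty.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = (
        "# Example Domain\n\n"
        "This domain is for use in illustrative examples in documents.\n\n"
        "[More information...](https://www.iana.org/domains/example)\n"
    )
    for title, text in split_by_headings(sample):
        print(title, "->", len(text), "characters")
```

Each section keeps its heading next to its text, so the heading can be supplied to the model as a label for the chunk.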

Cleaned scraping

Cleaned scraping removes unnecessary elements from the page and returns the remaining content. BeautifulSoup4 is used to clean the data.

To use it, set the scrape_type option to cleaned.

Example of cleaned content:

Example Domain
Example Domain
    This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...
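
The sketch below gives a rough idea of what this kind of cleaning involves. It is not the service's implementation: the service uses BeautifulSoup4, while this example swaps in Python's standard-library html.parser so that it runs without extra dependencies. The names TextExtractor and extract_text are hypothetical, and the choice to skip only script and style content is an assumption, since the exact cleaning rules are not documented here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: the real rules may differ

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text nodes of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><title>Example Domain</title>"
        "<style>body { margin: 0; }</style></head>"
        "<body><h1>Example Domain</h1>"
        "<p>This domain is for use in illustrative examples.</p></body></html>"
    )
    print(extract_text(page))
```

Because both the title and the h1 element contain the same text, that text appears twice in the extracted result, which matches the repeated "Example Domain" lines in the example above.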

HTML scraping

HTML scraping is the most basic type of scraping. It returns the raw HTML of the page. No manipulation is done on the content.

This is the default scrape option. To use it, omit the scrape_type option in the request or set it to html.

Example of HTML content:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>