Website crawler (Webcrawler)

📍 What is Webcrawler?

The Page Content Webcrawler traverses the web by following hyperlinks from page to page and extracting the content of every page it visits. It begins at a given URL and explores the pages linked from it, allowing you to gather large volumes of data across multiple domains. This makes it suitable for applications such as data analysis and AI model training.

Input parameters

url 🌐 (required): The initial URL from which the crawler begins its journey across the web.
blacklist_regexp 🚫 (optional): A regular expression to exclude certain URLs from being crawled.
whitelist_regexp ✅ (optional): A regular expression to include only specific URLs for crawling.
allow_subdomains 🔗 (optional): Set to true if the crawler should include subdomains in its search.
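
The way these filters combine can be illustrated with a small sketch. This is not the Webcrawler's actual implementation. It assumes that an empty pattern means "no filter", that the blacklist is checked before the whitelist, and that without allow_subdomains only the host of the start URL is followed. The function name should_crawl and its start_url parameter are hypothetical.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, start_url, whitelist_regexp="", blacklist_regexp="",
                 allow_subdomains=False):
    """Decide whether a discovered URL passes the filters (illustrative only)."""
    host = urlparse(url).hostname or ""
    start_host = urlparse(start_url).hostname or ""

    # Assumption: only the start host is crawled unless subdomains are allowed.
    same_host = host == start_host
    is_subdomain = host.endswith("." + start_host)
    if not (same_host or (allow_subdomains and is_subdomain)):
        return False

    # Assumption: an empty pattern disables the corresponding filter.
    if blacklist_regexp and re.search(blacklist_regexp, url):
        return False
    if whitelist_regexp and not re.search(whitelist_regexp, url):
        return False
    return True
```

For example, should_crawl("https://example.com/login", "https://example.com/", blacklist_regexp=r"/login") returns False under these assumptions. The regular expression dialect used by the service may differ from Python's re module, so test patterns against the service itself.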

🧹 Cleaned Content

For each crawled page the Webcrawler delivers content that is free of HTML, providing pure, unstructured text. This feature is particularly useful for applications requiring clean data, such as training AI language models, where pure textual data is needed for effective learning and analysis.

In other words, it converts a web page like this:


<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

into output like this:

Example Domain 
Example Domain 
This domain is for use in illustrative examples in documents. You may use this
   domain in literature without prior coordination or asking for permission.
   More information...
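
A conversion of this kind can be approximated with Python's standard html.parser module. The sketch below is only an approximation of the idea, not the Webcrawler's own cleaning logic, which is not described here. The names TextExtractor and html_to_text are hypothetical, and unlike the output above this version strips the whitespace around each text fragment.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores tags, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text fragments of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling html_to_text on the page above yields the title, the heading, the paragraph text and the link text as separate lines.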

How can I use Webcrawler?

The Page Content Webcrawler is a versatile tool that can be utilized in various scenarios:

🧠 Train LLMs: Gather large-scale textual data to train and refine large language models (LLMs). The clean, unstructured text extracted by the Webcrawler is well suited to use as training data.
📰 Content Aggregation: Automatically collect and aggregate articles, blogs, and news content from multiple websites to keep up with the latest trends and updates in specific industries.
📊 Market Research: Conduct comprehensive market research by crawling competitor websites and extracting valuable insights on pricing, product offerings, and customer feedback, enabling data-driven strategic decisions.

Request example

{
  "url": "https://example.com/",
  "whitelist_regexp": "",
  "blacklist_regexp": "",
  "allow_subdomains": false,
  "clean": true
}
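
A request body like the one above can be assembled and sanity-checked before it is sent. The helper below is a minimal sketch: build_crawl_request is a hypothetical name, the field names are taken from the example, and the regular expressions are validated with Python's re module, which may not match the dialect the service uses. The endpoint and authentication are not covered here and are left to the caller.

```python
import json
import re


def build_crawl_request(url, whitelist_regexp="", blacklist_regexp="",
                        allow_subdomains=False, clean=True):
    """Return the JSON request body as a string (illustrative only)."""
    if not url:
        raise ValueError("url is required")

    # Catch malformed patterns locally instead of after submitting the job.
    for name, pattern in (("whitelist_regexp", whitelist_regexp),
                          ("blacklist_regexp", blacklist_regexp)):
        try:
            re.compile(pattern)
        except re.error as exc:
            raise ValueError(f"{name} is not a valid regular expression: {exc}") from exc

    return json.dumps({
        "url": url,
        "whitelist_regexp": whitelist_regexp,
        "blacklist_regexp": blacklist_regexp,
        "allow_subdomains": allow_subdomains,
        "clean": clean,
    }, indent=2)
```

With only the url argument set to "https://example.com/", the helper produces the same fields and values as the request example.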

Response example

The main point is that each crawled page comes with two links:

raw_content_url - link to the raw content of the page
cleaned_content_url - link to the cleaned content of the page

{
  "job_id": "23b81e21-c672-4402-a886-303f18de9555",
  "url": "https://stripe.com/",
  "scrape_type": "cleaned",
  "extract_rules": "",
  "whitelist_regexp": "",
  "blacklist_regexp": "",
  "allow_subdomains": false,
  "items_limit": 10,
  "created_at": "2024-06-17T12:22:08.034Z",
  "crawl_delay_ms": 0,
  "finished_at": "2024-06-17T12:23:01.53Z",
  "webhook_url": "https://yourserver.com/webhook",
  "webhook_status": 0,
  "webhook_error": "",
  "status": "done",
  "job_items": [
    {
      "id": "3542eeb1-dd99-4e92-88d4-774a1424737d",
      "job_id": "23b81e21-c672-4402-a886-303f18de9555",
      "original_url": "https://stripe.com/docs/no-code/tap-to-pay",
      "page_status_code": 200,
      "raw_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
      "cleaned_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
      "status": "done",
      "title": "Tap to Pay on the Dashboard mobile app | Stripe Documentation",
      "created_at": "2024-06-17T12:22:19.511Z",
      "updated_at": "2024-06-17T12:22:33.334Z",
      "retries": 0,
      "cost": 0.002
    }
  ]
}
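
A response of this shape can be reduced to the content links with a few lines of Python. This is a sketch that assumes the field names shown in the example and keeps only items whose status is "done". The function name content_links is hypothetical.

```python
import json


def content_links(response_text):
    """Extract per-page content links from a job response (illustrative only)."""
    job = json.loads(response_text)
    links = []
    for item in job.get("job_items", []):
        # Assumption: only finished items carry usable content links.
        if item.get("status") != "done":
            continue
        links.append({
            "page": item["original_url"],
            "raw": item.get("raw_content_url"),
            "cleaned": item.get("cleaned_content_url"),
        })
    return links
```

Each returned entry pairs the original page URL with its raw_content_url and cleaned_content_url, which can then be downloaded with any HTTP client.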

Legal

This web scraping tool is designed to collect only publicly available data from websites, ensuring that no private information such as emails, phone numbers, or other sensitive details is accessed or gathered. Responsibility for how the collected data is used lies entirely with the user of the tool. Users must ensure that their actions comply with relevant laws and regulations. The owner of the Scraper does not control or influence the use of the data and is not liable for any misuse by the user.