How to crawl the website with Python

Python is the most popular language for developing AI models nowadays. It has many tools and frameworks for building and deploying AI models. However, any AI model requires the data to train. Website content has been used to train almost all existing models so far.

There are several options for how to crawl the content of the website using Python. All methods have their pros and cons. Let's take a look at more detail:

Build from scratch - building a crawler from scratch is time-consuming, and you must manage all your resources yourself. Moreover, you must consider all edge cases and deal with headless browsers, queues, etc. But this option gives you real flexibility to control your crawling process fully.
Use open-source framework - there are numerous ready-to-use crawling frameworks written in Python that can help you build custom crawling solutions. To do this, you have to be familiar with software development. However, this also requires time and effort to get a solution ready.
Use an API - if you don't want to spend time building your own crawler from scratch or with an existing framework - API is the best choice for you. It is simple: just send an HTTP request and get the content in minutes. However, you must pay for each crawled page (See pricing section for example).

In this article, you can find how to crawl content using the simplest option - API.

Crawl data from the website with an API in Python.

To start using it in Python, you first need to register in the Dashboard and get an Access Key under the "Access" section. You will get 10$ for free to test it out, so no payment or credit card is required.

Start crawling job in Python.

Assuming you have your access key, here is the code with the basic parameters to start crawling the website:

import time

import requests

# Constants
API_BASE_URL = "https://api.webcrawlerapi.com/v1"
BEARER_TOKEN = "000000000"  # Replace with your actual bearer token


def create_crawl_job():
    print("Creating a new job...")
    url = f"{API_BASE_URL}/crawl"
    headers = {
        "Authorization": f"Bearer {BEARER_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": "http://example.com",  # URL to crawl
        "items_limit": 10,
        "clean_content": True,
        "whitelist_regexp": ".*example\.com.*",
        "blacklist_regexp": ".*login.*"
    }

    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to create job: {response.status_code} - {response.text}")


    # Response:
    # {
    # 	"id": "36686bf1-d76e-4489-a739-58d52db13c4a"
    # }

In response, you will get your job_id, which you can use to track your job progress. You can also use UI to get detailed info.

Get crawled content.

You can use the next Python function in your code:

def get_job_details(job_id):
    while True:  # Loop indefinitely until the job status is 'done'
        url = f"{API_BASE_URL}/job/{job_id}"
        headers = {
            "Authorization": f"Bearer {BEARER_TOKEN}",
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            job_details = response.json()
            if job_details['status'] == 'in_progress':
                print(f"Job status is {job_details['status']}. Checking again in 3 seconds...")
                time.sleep(3)  # Wait for 3 seconds before the next request
            else:
                return job_details
        else:
            raise Exception(f"Failed to retrieve job: {response.status_code} - {response.text}")

It will make a request to get your crawling job and check its status. Once it is ready, it prints a link to cleaned content.

The best is that you don't need to clean received content. It is already cleaned with BeatifulSoup4 if you send the clean_content option true in your crawling request.

Make everything work together

Here is the full working example:

import time

import requests

# Constants
API_BASE_URL = "https://api.webcrawlerapi.com/v1"
BEARER_TOKEN = "000000000000000"  # Replace with your actual bearer token


def create_crawl_job():
    print("Creating a new job...")
    url = f"{API_BASE_URL}/crawl"
    headers = {
        "Authorization": f"Bearer {BEARER_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": "http://example.com",  # URL to crawl
        "items_limit": 1,
        "clean_content": True,
        "whitelist_regexp": ".*example\.com.*",
        "blacklist_regexp": ".*login.*"
    }

    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to create job: {response.status_code} - {response.text}")

    # Response:
    # {
    # 	"id": "36686bf1-d76e-4489-a739-58d52db13c4a"
    # }

def get_job_details(job_id):
    while True:  # Loop indefinitely until the job status is 'done'
        url = f"{API_BASE_URL}/job/{job_id}"
        headers = {
            "Authorization": f"Bearer {BEARER_TOKEN}",
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            job_details = response.json()
            if job_details['status'] == 'in_progress':
                print(f"Job status is {job_details['status']}. Checking again in 3 seconds...")
                time.sleep(3)  # Wait for 3 seconds before the next request
            else:
                return job_details
        else:
            raise Exception(f"Failed to retrieve job: {response.status_code} - {response.text}")


# help function to download content
def fetch_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Will raise an exception for HTTP errors
        return response.text
    except requests.RequestException as e:
        return f"Error fetching content from {url}: {str(e)}"


# Main code to create a job, retrieve its details, and print content of URLs
try:
    # Assume job_id is retrieved from create_crawl_job or predefined
    job_response = create_crawl_job()
    job_id = job_response['job_id']
    print(f"Job created successfully with ID: {job_id}")

    # Function to get job details
    job_details = get_job_details(job_id)
    print(f"Job Details: {job_details}")

    # Iterate over job items and fetch content from the URLs
    for job_item in job_details['job_items']:
        print(f'{job_item["original_url"]} [{job_item["page_status_code"]}]: {job_item["cleaned_content_url"]}')
        content = fetch_content(job_item["raw_content_url"])  # Fetch content using the raw_content_url
        print(f"Raw contnet from {job_item['raw_content_url']}:
{content}")
        cleaned_content = fetch_content(job_item['cleaned_content_url'])
        print(f"Cleaned from {job_item['raw_content_url']}:
{cleaned_content}")
except Exception as e:
    print(e)

After a few seconds you will see raw and clean content of the website.

Raw response example:

Raw contnet from https://data.webcrawlerapi.com/raw/cluzx4y4p0001iwpwf7y9lfb7/cee1dba6-593e-456d-b47a-cd5bf3d016fb/http___example_com:
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;

...

</div>
</body>
</html>

Cleaned content example:

Cleaned from https://data.webcrawlerapi.com/raw/cluzx4y4p0001iwpwf7y9lfb7/cee1dba6-593e-456d-b47a-cd5bf3d016fb/http___example_com:
Example Domain
Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...

Summary

The choice of the way to crawl the website depends on several factors: your development skills, how much time and money do you have, and how deep do you want to dive into crawling techniques?

If you are an experienced developer who likes to build software for fun and doesn't want to spend money - your choice is to build a crawler yourself either from scratch or using some existing Python framework.

If you are not familiar with development or don't have much time and can pay for crawling, your option is to use an API to get a result in a minute.