BeautifulSoup4 Web Crawler

A tiny BeautifulSoup4 + requests crawler that stays on one site, normalizes URLs, and deduplicates links.

Written by Andrii

If you want a tiny “just works” crawler, start here. Copy-paste this file, run it, and you will get a deduped list of internal URLs.

#!/usr/bin/env python3

import argparse
import collections
import re
import sys
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup


SKIP_SCHEMES = {"mailto", "tel", "javascript", "data"}
DROP_QUERY_PREFIXES = {"utm_"}
DROP_QUERY_KEYS = {"fbclid", "gclid", "igshid"}


def normalize_url(raw: str, *, base: str | None = None) -> str | None:
    try:
        u = urllib.parse.urljoin(base, raw) if base else raw
        p = urllib.parse.urlsplit(u)
    except Exception:
        return None

    if not p.scheme or p.scheme.lower() in SKIP_SCHEMES:
        return None

    scheme = p.scheme.lower()
    netloc = p.netloc.lower()

    # drop default ports
    if (scheme == "http" and netloc.endswith(":80")) or (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]

    # normalize path: collapse // and ensure leading /
    path = re.sub(r"/+$", "", p.path or "/") or "/"
    path = re.sub(r"/{2,}", "/", path)

    # normalize query: drop common tracking params and sort remaining
    q = urllib.parse.parse_qsl(p.query, keep_blank_values=True)
    q2: list[tuple[str, str]] = []
    for k, v in q:
        lk = k.lower()
        if lk in DROP_QUERY_KEYS:
            continue
        if any(lk.startswith(prefix) for prefix in DROP_QUERY_PREFIXES):
            continue
        q2.append((k, v))
    q2.sort(key=lambda kv: (kv[0], kv[1]))
    query = urllib.parse.urlencode(q2, doseq=True)

    # drop fragments
    return urllib.parse.urlunsplit((scheme, netloc, path, query, ""))


def same_site(url: str, root: str) -> bool:
    try:
        a = urllib.parse.urlsplit(url)
        b = urllib.parse.urlsplit(root)
        return a.scheme == b.scheme and a.netloc == b.netloc
    except Exception:
        return False


def extract_links(html: str, *, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    out: list[str] = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href or not isinstance(href, str):
            continue
        n = normalize_url(href, base=base_url)
        if n:
            out.append(n)
    return out


def crawl(start_url: str, *, max_pages: int, delay_s: float, timeout_s: float) -> list[str]:
    start = normalize_url(start_url)
    if not start:
        raise ValueError("Invalid start URL")

    session = requests.Session()
    session.headers.update({"User-Agent": "BeautifulSoupCrawler/1.0 (+https://webcrawlerapi.com)"})

    seen: set[str] = set()
    q: collections.deque[str] = collections.deque([start])
    crawled: list[str] = []

    while q and len(crawled) < max_pages:
        url = q.popleft()
        if url in seen:
            continue
        if not same_site(url, start):
            continue

        seen.add(url)
        try:
            res = session.get(url, timeout=timeout_s, allow_redirects=True)
        except requests.RequestException:
            continue

        final_url = normalize_url(res.url) or url
        if final_url not in seen:
            seen.add(final_url)
        crawled.append(final_url)

        ctype = (res.headers.get("content-type") or "").lower()
        if res.ok and "text/html" in ctype:
            for link in extract_links(res.text, base_url=final_url):
                if link not in seen:
                    q.append(link)

        if delay_s:
            time.sleep(delay_s)

    return crawled


def main(argv: list[str]) -> int:
    ap = argparse.ArgumentParser(description="Tiny site crawler using BeautifulSoup4")
    ap.add_argument("start_url", help="Seed URL, e.g. https://example.com")
    ap.add_argument("--max-pages", type=int, default=100, help="Hard stop")
    ap.add_argument("--delay", type=float, default=0.2, help="Delay between requests (seconds)")
    ap.add_argument("--timeout", type=float, default=10.0, help="Request timeout (seconds)")
    args = ap.parse_args(argv)

    urls = crawl(args.start_url, max_pages=args.max_pages, delay_s=args.delay, timeout_s=args.timeout)
    for u in urls:
        print(u)
    print(f"\nCrawled {len(urls)} pages", file=sys.stderr)
    return 0


if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1:]))

BeautifulSoup4 web crawler in one file

Hi, I'm Andrii. This is a tiny crawler built for one job: start from a URL, follow internal links, and keep going until it runs out of pages or hits the page limit.

It is not a production crawler. It is a learning script that shows the core loop: fetch -> parse -> enqueue -> dedupe.

How to run it

The virtualenv is already created in this repo at content/blog/beatifulsoup-webcrawler/extra/code/.venv.

cd content/blog/beatifulsoup-webcrawler/extra/code
source .venv/bin/activate
python -m pip install -r requirements.txt
python crawler.py https://example.com --max-pages 50 --delay 0.2

URLs are printed to stdout. The progress line (Crawled N pages) is printed to stderr.

If you want the longer, step-by-step version (from copy-paste crawler to more production-ish concerns), read: How to crawl the website with Python.

What it does (and why it works)

This script is small, but it is not naive. Three guardrails are doing most of the work.

1) Deduplication

Without dedupe, the crawl becomes infinite. Navigation menus alone can re-add the same URLs forever.

In the script, seen is the simplest correct version:

seen: set[str] = set()

if url in seen:
    continue
seen.add(url)
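To see why this matters, here is a toy link graph (made-up paths, purely for illustration) where every page links back to the others. With `seen`, each page is visited exactly once and the loop terminates:

```python
import collections

# toy link graph (hypothetical paths): every page links back to the others
links = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b"],
    "/b": ["/", "/a"],
}

seen: set[str] = set()
q = collections.deque(["/"])
order = []

while q:
    url = q.popleft()
    if url in seen:
        continue
    seen.add(url)
    order.append(url)
    q.extend(links[url])

print(order)  # ['/', '/a', '/b'] — each page visited once, then the queue drains
```

Delete the `seen` check and the same loop re-enqueues `/` forever.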

2) Scope control (same site)

If scope is not defined, the crawler leaves the site. It follows socials, auth providers, CDNs, random third-party links.

This crawler stays strict: same scheme + same host.

from urllib.parse import urlsplit


def same_site(url: str, root: str) -> bool:
    a = urlsplit(url)
    b = urlsplit(root)
    return a.scheme == b.scheme and a.netloc == b.netloc

It is boring. It is also safe.
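Strict means strict: subdomains and scheme changes are out of scope too. A quick check with the same function:

```python
from urllib.parse import urlsplit


def same_site(url: str, root: str) -> bool:
    a = urlsplit(url)
    b = urlsplit(root)
    return a.scheme == b.scheme and a.netloc == b.netloc


print(same_site("https://example.com/about", "https://example.com"))      # True
print(same_site("https://blog.example.com/post", "https://example.com"))  # False: subdomain
print(same_site("http://example.com/", "https://example.com"))            # False: http vs https
```

If you want subdomains included, compare registrable domains instead of `netloc`, but that is a deliberate loosening, not a bug fix.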

3) URL normalization

Duplicates are not only caused by re-visiting the same link. They are also caused by URL variants.

  • /page vs /page/
  • #fragment variants
  • tracking params like utm_*, fbclid, gclid

So normalization is used before URLs are added to the queue. In this script it is done in normalize_url().

Here is the small idea, without all the extra rules:

from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(raw: str, base: str) -> str:
    u = urljoin(base, raw)
    p = urlsplit(u)
    path = p.path.rstrip("/") or "/"  # keep the root path as "/", not ""
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), path, p.query, ""))

The tradeoff is real. Aggressive normalization can merge pages that are actually different. That is why the script drops only known tracking params instead of stripping the whole query string.
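A quick sanity check of the idea, using a trimmed-down version of the script's rules (the URLs here are made up). Trailing slash, fragment, relative form, and tracking params all collapse to one canonical URL:

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

DROP = {"fbclid", "gclid", "igshid"}


def normalize(raw: str, base: str) -> str:
    p = urlsplit(urljoin(base, raw))
    path = p.path.rstrip("/") or "/"
    # keep only non-tracking query params, sorted for a stable order
    q = [(k, v) for k, v in parse_qsl(p.query, keep_blank_values=True)
         if k.lower() not in DROP and not k.lower().startswith("utm_")]
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), path, urlencode(sorted(q)), ""))


base = "https://example.com/docs/"
variants = ["/docs/page/", "/docs/page#intro", "/docs/page?utm_source=x", "page"]
print({normalize(v, base) for v in variants})  # {'https://example.com/docs/page'}
```

Four raw hrefs, one queue entry. That is the whole point of normalizing before enqueueing.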

Politeness (delay + timeout)

Two settings are used so the crawl does not hang and the target is not hammered.

import time

import requests

session = requests.Session()
url = "https://example.com"  # whatever you are fetching

timeout_s = 10.0  # give up on a hung response instead of waiting forever
delay_s = 0.2     # pause between requests so the site is not hammered

res = session.get(url, timeout=timeout_s, allow_redirects=True)
time.sleep(delay_s)

0.2s can still be too fast for many sites. If you crawl something you do not control, go slower.

What this script does not do

This is where real crawling work starts.

  • robots.txt parsing and per-URL allow checks
  • backoff on 429 and retries on flaky networks
  • JavaScript rendering (SPA pages that ship empty HTML)
  • anti-bot handling
  • storage (results, crawl state, link graph)

Also: crawling is only step one. After you fetch pages, you usually need to clean the HTML (remove scripts, nav, boilerplate) before you can use the text. See: Clean crawled or scraped data with BeautifulSoup in Python.
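As a taste of that cleaning step, here is a rough sketch (my own helper, not from the linked post): drop the tags that never carry content, then collapse whitespace:

```python
from bs4 import BeautifulSoup


def clean_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # remove non-content tags entirely, including their text
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # get_text(" ") joins text nodes with spaces; split/join collapses runs of whitespace
    return " ".join(soup.get_text(" ").split())


html = "<html><body><nav>Menu</nav><p>Hello,   world.</p><script>x()</script></body></html>"
print(clean_text(html))  # Hello, world.
```

Real pages need more than this (boilerplate detection is its own topic), but it covers the obvious noise.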

Once you need those features, the script stops being a script: you are building crawling infrastructure.

When that point is reached, a managed service like WebCrawlerAPI is usually cheaper than rebuilding everything.


About the Author

Andrii Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. He founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and has been shipping it every day since.