
    BeautifulSoup4 Web Crawler

    A tiny BeautifulSoup4 + requests crawler that stays on one site, normalizes URLs, and deduplicates links.

    Written by Andrew
    Published on Feb 3, 2026

    Table of Contents

    • BeautifulSoup4 web crawler in one file
    • How to run it
    • What it does (and why it works)
    • 1) Deduplication
    • 2) Scope control (same site)
    • 3) URL normalization
    • Politeness (delay + timeout)
    • What this script does not do

    If you want a tiny “just works” crawler, start here. Copy-paste this file, run it, and you will get a deduped list of internal URLs.

    #!/usr/bin/env python3
    
    import argparse
    import collections
    import re
    import sys
    import time
    import urllib.parse
    
    import requests
    from bs4 import BeautifulSoup
    
    
    SKIP_SCHEMES = {"mailto", "tel", "javascript", "data"}
    DROP_QUERY_PREFIXES = {"utm_"}
    DROP_QUERY_KEYS = {"fbclid", "gclid", "igshid"}
    
    
    def normalize_url(raw: str, *, base: str | None = None) -> str | None:
        try:
            u = urllib.parse.urljoin(base, raw) if base else raw
            p = urllib.parse.urlsplit(u)
        except Exception:
            return None
    
        if not p.scheme or p.scheme.lower() in SKIP_SCHEMES:
            return None
    
        scheme = p.scheme.lower()
        netloc = p.netloc.lower()
    
        # drop default ports
        if (scheme == "http" and netloc.endswith(":80")) or (scheme == "https" and netloc.endswith(":443")):
            netloc = netloc.rsplit(":", 1)[0]
    
        # normalize path: default empty path to "/", strip trailing slashes, collapse repeated slashes
        path = re.sub(r"/+$", "", p.path or "/") or "/"
        path = re.sub(r"/{2,}", "/", path)
    
        # normalize query: drop common tracking params and sort remaining
        q = urllib.parse.parse_qsl(p.query, keep_blank_values=True)
        q2: list[tuple[str, str]] = []
        for k, v in q:
            lk = k.lower()
            if lk in DROP_QUERY_KEYS:
                continue
            if any(lk.startswith(prefix) for prefix in DROP_QUERY_PREFIXES):
                continue
            q2.append((k, v))
        q2.sort(key=lambda kv: (kv[0], kv[1]))
        query = urllib.parse.urlencode(q2, doseq=True)
    
        # drop fragments
        return urllib.parse.urlunsplit((scheme, netloc, path, query, ""))
    
    
    def same_site(url: str, root: str) -> bool:
        try:
            a = urllib.parse.urlsplit(url)
            b = urllib.parse.urlsplit(root)
            return a.scheme == b.scheme and a.netloc == b.netloc
        except Exception:
            return False
    
    
    def extract_links(html: str, *, base_url: str) -> list[str]:
        soup = BeautifulSoup(html, "html.parser")
        out: list[str] = []
        for a in soup.select("a[href]"):
            href = a.get("href")
            if not href or not isinstance(href, str):
                continue
            n = normalize_url(href, base=base_url)
            if n:
                out.append(n)
        return out
    
    
    def crawl(start_url: str, *, max_pages: int, delay_s: float, timeout_s: float) -> list[str]:
        start = normalize_url(start_url)
        if not start:
            raise ValueError("Invalid start URL")
    
        session = requests.Session()
        session.headers.update({"User-Agent": "BeautifulSoupCrawler/1.0 (+https://webcrawlerapi.com)"})
    
        seen: set[str] = set()
        q: collections.deque[str] = collections.deque([start])
        crawled: list[str] = []
    
        while q and len(crawled) < max_pages:
            url = q.popleft()
            if url in seen:
                continue
            if not same_site(url, start):
                continue
    
            seen.add(url)
            try:
                res = session.get(url, timeout=timeout_s, allow_redirects=True)
            except requests.RequestException:
                continue
    
            final_url = normalize_url(res.url) or url
            if final_url != url and final_url in seen:
                continue  # redirect landed on a page we already crawled; skip duplicate output
            seen.add(final_url)
            crawled.append(final_url)
    
            ctype = (res.headers.get("content-type") or "").lower()
            if res.ok and "text/html" in ctype:
                for link in extract_links(res.text, base_url=final_url):
                    if link not in seen:
                        q.append(link)
    
            if delay_s:
                time.sleep(delay_s)
    
        return crawled
    
    
    def main(argv: list[str]) -> int:
        ap = argparse.ArgumentParser(description="Tiny site crawler using BeautifulSoup4")
        ap.add_argument("start_url", help="Seed URL, e.g. https://example.com")
        ap.add_argument("--max-pages", type=int, default=100, help="Hard stop")
        ap.add_argument("--delay", type=float, default=0.2, help="Delay between requests (seconds)")
        ap.add_argument("--timeout", type=float, default=10.0, help="Request timeout (seconds)")
        args = ap.parse_args(argv)
    
        urls = crawl(args.start_url, max_pages=args.max_pages, delay_s=args.delay, timeout_s=args.timeout)
        for u in urls:
            print(u)
        print(f"\nCrawled {len(urls)} pages", file=sys.stderr)
        return 0
    
    
    if __name__ == "__main__":
        raise SystemExit(main(sys.argv[1:]))
    

    BeautifulSoup4 web crawler in one file

    Hi, I'm Andrew. This is a tiny crawler built for one job: start from a URL, follow links on the same site, and keep going until it hits the page limit.

    It is not a production crawler. It is a learning script that shows the core loop: fetch -> parse -> enqueue -> dedupe.
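
    Stripped of requests and BeautifulSoup, the loop has this shape. This is only a sketch, not part of the script above: fetch_page and find_links are placeholder callables for whatever fetching and parsing you plug in.

    from collections import deque


    def crawl_core(start, fetch_page, find_links, max_pages=100):
        seen = set()            # dedupe: never process the same URL twice
        queue = deque([start])  # enqueue: the frontier of URLs to visit
        crawled = []
        while queue and len(crawled) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = fetch_page(url)                  # fetch
            crawled.append(url)
            for link in find_links(html, url):      # parse
                if link not in seen:
                    queue.append(link)
        return crawled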

    How to run it

    The virtualenv is already created in this repo at content/blog/beatifulsoup-webcrawler/extra/code/.venv.

    cd content/blog/beatifulsoup-webcrawler/extra/code
    python -m venv .venv  # skip this line if .venv already exists
    source .venv/bin/activate
    python -m pip install -r requirements.txt
    python crawler.py https://example.com --max-pages 50 --delay 0.2
    

    URLs are printed to stdout. The progress line (Crawled N pages) is printed to stderr.

    If you want the longer, step-by-step version (from copy-paste crawler to more production-ish concerns), read: How to crawl the website with Python.

    What it does (and why it works)

    This script is small, but it is not naive. Three guardrails are doing most of the work.

    1) Deduplication

    Without dedupe, the crawl becomes infinite. Navigation menus alone can re-add the same URLs forever.

    In the script, seen is the simplest correct version:

    seen: set[str] = set()
    
    if url in seen:
        continue
    seen.add(url)
    

    2) Scope control (same site)

    If scope is not defined, the crawler leaves the site. It follows social links, auth providers, CDNs, and random third-party sites.

    This crawler stays strict: same scheme + same host.

    from urllib.parse import urlsplit
    
    
    def same_site(url: str, root: str) -> bool:
        a = urlsplit(url)
        b = urlsplit(root)
        return a.scheme == b.scheme and a.netloc == b.netloc
    

    It is boring. It is also safe.
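
    If you ever need to include subdomains (say, blog.example.com while crawling example.com), a slightly looser check is enough. This is a sketch of an alternative, not what the script above does, and the naive suffix match can over-match on shared hosting domains like *.github.io.

    from urllib.parse import urlsplit


    def same_site_or_subdomain(url: str, root: str) -> bool:
        a = urlsplit(url)
        b = urlsplit(root)
        if a.scheme != b.scheme:
            return False
        # exact host match, or the URL's host is a subdomain of the root host
        return a.netloc == b.netloc or a.netloc.endswith("." + b.netloc)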

    3) URL normalization

    Duplicates are not only caused by re-visiting the same link. They are also caused by URL variants.

    • /page vs /page/
    • #fragment variants
    • tracking params like utm_*, fbclid, gclid

    So normalization is used before URLs are added to the queue. In this script it is done in normalize_url().

    Here is the small idea, without all the extra rules:

    from urllib.parse import urljoin, urlsplit, urlunsplit
    
    
    def normalize(raw: str, base: str) -> str:
        u = urljoin(base, raw)
        p = urlsplit(u)
        return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/"), p.query, ""))
    

    The tradeoff is real. Aggressive normalization can merge pages that are actually different. That is why only known tracking params are dropped, not every query parameter.
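
    To see the effect, feed a few variants through normalize_url(). This assumes the full script above is saved as crawler.py in the working directory:

    from crawler import normalize_url

    variants = [
        "https://Example.com/page/",
        "https://example.com:443/page?utm_source=newsletter",
        "https://example.com/page#section-2",
    ]
    # all three collapse to the same canonical URL
    print({normalize_url(v) for v in variants})
    # {'https://example.com/page'}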

    Politeness (delay + timeout)

    Two settings are used so the crawl does not hang and the target is not hammered.

    import time
    
    import requests
    
    session = requests.Session()
    
    timeout_s = 10.0
    delay_s = 0.2
    url = "https://example.com"  # placeholder; in the full script this comes from the queue
    
    res = session.get(url, timeout=timeout_s, allow_redirects=True)
    time.sleep(delay_s)
    

    0.2s can still be too fast for many sites. If you crawl something you do not control, go slower.

    What this script does not do

    This is where real crawling work starts.

    • robots.txt parsing and per-URL allow checks (see the sketch after this list)
    • backoff on 429 and retries on flaky networks
    • JavaScript rendering (SPA pages that ship empty HTML)
    • anti-bot handling
    • storage (results, crawl state, link graph)
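
    As a taste of the first item, a robots.txt check can be added with the standard library alone. This is a minimal sketch, not part of the script above: allowed_by_robots is a hypothetical helper, and a real crawler would cache the parser per host instead of re-fetching robots.txt for every URL.

    import urllib.parse
    import urllib.robotparser

    USER_AGENT = "BeautifulSoupCrawler/1.0"


    def allowed_by_robots(url: str) -> bool:
        parts = urllib.parse.urlsplit(url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        try:
            rp.read()
        except OSError:
            return True  # robots.txt unreachable: fail open here; stricter crawlers fail closed
        return rp.can_fetch(USER_AGENT, url)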

    Also: crawling is only step one. After you fetch pages, you usually need to clean the HTML (remove scripts, nav, boilerplate) before you can use the text. See: Clean crawled or scraped data with BeautifulSoup in Python.
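
    As a minimal sketch of that cleanup step (the tag list here is an illustrative starting point, not the linked article's exact method):

    from bs4 import BeautifulSoup


    def extract_text(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # drop elements that carry no article text
        for tag in soup(["script", "style", "noscript", "nav", "header", "footer"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)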

    If you need those features, the script stops being a script. You are building crawling infrastructure.

    When that point is reached, a managed service like WebCrawlerAPI is usually cheaper than rebuilding everything.