If you want a tiny “just works” crawler, start here. Copy-paste this file, run it, and you will get a deduped list of internal URLs.
#!/usr/bin/env python3
import argparse
import collections
import re
import sys
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

SKIP_SCHEMES = {"mailto", "tel", "javascript", "data"}
DROP_QUERY_PREFIXES = {"utm_"}
DROP_QUERY_KEYS = {"fbclid", "gclid", "igshid"}


def normalize_url(raw: str, *, base: str | None = None) -> str | None:
    try:
        u = urllib.parse.urljoin(base, raw) if base else raw
        p = urllib.parse.urlsplit(u)
    except Exception:
        return None
    if not p.scheme or p.scheme.lower() in SKIP_SCHEMES:
        return None
    scheme = p.scheme.lower()
    netloc = p.netloc.lower()
    # drop default ports
    if (scheme == "http" and netloc.endswith(":80")) or (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]
    # normalize path: strip trailing slashes, collapse duplicate slashes
    path = re.sub(r"/+$", "", p.path or "/") or "/"
    path = re.sub(r"/{2,}", "/", path)
    # normalize query: drop common tracking params and sort remaining
    q = urllib.parse.parse_qsl(p.query, keep_blank_values=True)
    q2: list[tuple[str, str]] = []
    for k, v in q:
        lk = k.lower()
        if lk in DROP_QUERY_KEYS:
            continue
        if any(lk.startswith(prefix) for prefix in DROP_QUERY_PREFIXES):
            continue
        q2.append((k, v))
    q2.sort(key=lambda kv: (kv[0], kv[1]))
    query = urllib.parse.urlencode(q2, doseq=True)
    # drop fragments
    return urllib.parse.urlunsplit((scheme, netloc, path, query, ""))


def same_site(url: str, root: str) -> bool:
    try:
        a = urllib.parse.urlsplit(url)
        b = urllib.parse.urlsplit(root)
        return a.scheme == b.scheme and a.netloc == b.netloc
    except Exception:
        return False


def extract_links(html: str, *, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    out: list[str] = []
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href or not isinstance(href, str):
            continue
        n = normalize_url(href, base=base_url)
        if n:
            out.append(n)
    return out


def crawl(start_url: str, *, max_pages: int, delay_s: float, timeout_s: float) -> list[str]:
    start = normalize_url(start_url)
    if not start:
        raise ValueError("Invalid start URL")
    session = requests.Session()
    session.headers.update({"User-Agent": "BeautifulSoupCrawler/1.0 (+https://webcrawlerapi.com)"})
    seen: set[str] = set()
    q: collections.deque[str] = collections.deque([start])
    crawled: list[str] = []
    while q and len(crawled) < max_pages:
        url = q.popleft()
        if url in seen:
            continue
        if not same_site(url, start):
            continue
        seen.add(url)
        try:
            res = session.get(url, timeout=timeout_s, allow_redirects=True)
        except requests.RequestException:
            continue
        # follow redirects to their normalized final URL; skip it if that page was already crawled
        final_url = normalize_url(res.url) or url
        if final_url != url:
            if final_url in seen:
                continue
            seen.add(final_url)
        crawled.append(final_url)
        ctype = (res.headers.get("content-type") or "").lower()
        if res.ok and "text/html" in ctype:
            for link in extract_links(res.text, base_url=final_url):
                if link not in seen:
                    q.append(link)
        if delay_s:
            time.sleep(delay_s)
    return crawled


def main(argv: list[str]) -> int:
    ap = argparse.ArgumentParser(description="Tiny site crawler using BeautifulSoup4")
    ap.add_argument("start_url", help="Seed URL, e.g. https://example.com")
    ap.add_argument("--max-pages", type=int, default=100, help="Hard stop")
    ap.add_argument("--delay", type=float, default=0.2, help="Delay between requests (seconds)")
    ap.add_argument("--timeout", type=float, default=10.0, help="Request timeout (seconds)")
    args = ap.parse_args(argv)
    urls = crawl(args.start_url, max_pages=args.max_pages, delay_s=args.delay, timeout_s=args.timeout)
    for u in urls:
        print(u)
    print(f"\nCrawled {len(urls)} pages", file=sys.stderr)
    return 0


if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1:]))
BeautifulSoup4 web crawler in one file
Hi, I'm Andrew. This is a tiny crawler that is built for one job: start from a URL, follow links, and keep going.
It is not a production crawler. It is a learning script that shows the core loop: fetch -> parse -> enqueue -> dedupe.
How to run it
The virtualenv is already created in this repo at content/blog/beatifulsoup-webcrawler/extra/code/.venv.
cd content/blog/beatifulsoup-webcrawler/extra/code
source .venv/bin/activate
python -m pip install -r requirements.txt
python crawler.py https://example.com --max-pages 50 --delay 0.2
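If the .venv directory is missing (for example after a fresh checkout), create it first and then repeat the activate and install steps above:

python3 -m venv .venv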
URLs are printed to stdout. The progress line (Crawled N pages) is printed to stderr.
If you want the longer, step-by-step version (from a copy-paste crawler to more production-ish concerns), read: How to crawl the website with Python.
What it does (and why it works)
This script is small, but it is not naive. Three guardrails are doing most of the work.
1) Deduplication
Without dedupe, the crawl becomes infinite. Navigation menus alone can re-add the same URLs forever.
In the script, seen is the simplest correct version:
seen: set[str] = set()
if url in seen:
    continue
seen.add(url)
2) Scope control (same site)
If scope is not defined, the crawler leaves the site. It follows socials, auth providers, CDNs, random third-party links.
This crawler stays strict: same scheme + same host.
from urllib.parse import urlsplit

def same_site(url: str, root: str) -> bool:
    a = urlsplit(url)
    b = urlsplit(root)
    return a.scheme == b.scheme and a.netloc == b.netloc
It is boring. It is also safe.
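If you do want subdomains (say, blog.example.com when the seed is example.com), here is a minimal sketch of a looser check. same_site_with_subdomains is a hypothetical helper, and a real crawler would consult the public suffix list rather than a bare suffix match:

from urllib.parse import urlsplit

def same_site_with_subdomains(url: str, root: str) -> bool:
    # Looser variant: accept the seed host and any of its subdomains.
    a = (urlsplit(url).hostname or "").lower()
    b = (urlsplit(root).hostname or "").lower()
    return a == b or a.endswith("." + b)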
3) URL normalization
Duplicates are not only caused by re-visiting the same link. They are also caused by URL variants.
- /page vs /page/
- #fragment variants
- tracking params like utm_*, fbclid, gclid
So every URL is normalized before it goes into the queue. In this script that happens in normalize_url().
Here is the core idea, without all the extra rules:
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(raw: str, base: str) -> str:
    u = urljoin(base, raw)
    p = urlsplit(u)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/"), p.query, ""))
The tradeoff is real. Aggressive normalization can merge pages that are actually different. That is why only known tracking params are dropped, not the whole query string.
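For example, normalize_url() from the script above maps all of these to the same key, so the page is fetched only once:

normalize_url("https://example.com/page/")
normalize_url("https://example.com/page?utm_source=x&fbclid=123")
normalize_url("https://example.com:443/page#top")
# each returns "https://example.com/page"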
Politeness (delay + timeout)
Two settings do the work: a timeout so a slow page cannot hang the crawl, and a delay so the target is not hammered.
import time
import requests
session = requests.Session()
timeout_s = 10.0
delay_s = 0.2
res = session.get(url, timeout=timeout_s, allow_redirects=True)
time.sleep(delay_s)
0.2s can still be too fast for many sites. If you crawl something you do not control, go slower.
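The script stops at a fixed delay. If you want one small step past that, here is a sketch (not part of the script) of backing off when the server answers 429. polite_get is a hypothetical helper and the Retry-After handling is deliberately simple:

import time
import requests

def polite_get(session: requests.Session, url: str, *, timeout_s: float = 10.0, tries: int = 3) -> requests.Response:
    # Retry on 429, waiting roughly as long as the server asks for.
    for _ in range(tries):
        res = session.get(url, timeout=timeout_s, allow_redirects=True)
        if res.status_code != 429:
            break
        try:
            wait = float(res.headers.get("Retry-After", "5"))
        except ValueError:
            wait = 5.0  # Retry-After can also be an HTTP date; keep the sketch simple
        time.sleep(wait)
    return res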
What this script does not do
This is where real crawling work starts.
- robots.txt parsing and per-URL allow checks (see the sketch after this list)
- backoff on 429 and retries on flaky networks
- JavaScript rendering (SPA pages that ship empty HTML)
- anti-bot handling
- storage (results, crawl state, link graph)
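For the first item on that list, the standard library gets you most of the way. A minimal sketch, assuming allowed_by_robots (a hypothetical helper) is called before each fetch and that re-reading robots.txt per request is acceptable for a small crawl:

import urllib.parse
import urllib.robotparser

def allowed_by_robots(url: str, user_agent: str = "BeautifulSoupCrawler") -> bool:
    # Fetch and parse robots.txt for the URL's host, then ask if this path is allowed.
    # (A real crawler would cache one parser per host.)
    parts = urllib.parse.urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # no readable robots.txt: most crawlers treat this as "allowed"
    return rp.can_fetch(user_agent, url)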
Also: crawling is only step one. After you fetch pages, you usually need to clean the HTML (remove scripts, nav, boilerplate) before you can use the text. See: Clean crawled or scraped data with BeautifulSoup in Python.
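As a taste of that cleanup step, a minimal sketch that keeps only the visible text; the tag list is a reasonable default, not a rule:

from bs4 import BeautifulSoup

def visible_text(html: str) -> str:
    # Strip tags that carry no readable content, then flatten the rest to text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)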
Once you need more than one or two items from that list, a script stops being a script: you are building crawling infrastructure.
At that point, a managed service like WebCrawlerAPI is usually cheaper than rebuilding all of it yourself.