    Python · Tutorial · API · 10 min read

    How to Crawl a Website with Python

    There are several ways to crawl a website's content with Python. Each method has its pros and cons. Let's take a closer look.

    Written by Andrew
    Published on Feb 6, 2026

    Table of Contents

    • How to Crawl a Website with Python: Complete Guide with Code Examples
    • Simplest copy-paste working Python crawling example
    • What is Web Crawling (and How It Differs from Scraping)
    • Simple Python Website Crawler with Requests and BeautifulSoup
    • Installing the Required Libraries
    • Crawling a Single Page
    • Following Links to Crawl Multiple Pages
    • Extracting and Storing Data
    • Building a Production Web Crawler with Scrapy
    • Why Scrapy for Larger Projects
    • Creating Your First Scrapy Spider
    • Scrapy Crawling Rules and Link Extraction
    • Processing and Exporting Scraped Data
    • Crawling JavaScript-Heavy Websites with Python
    • The JavaScript Problem
    • Using Selenium for JavaScript Rendering
    • Playwright as a Selenium Alternative
    • Crawling All Links on a Website (Full Site Crawl)
    • Method 1: Start with `sitemap.xml`
    • Method 2: Breadth-first link crawling (BFS)
    • Best Practices for Python Web Crawlers
    • Respecting `robots.txt` and Rate Limiting
    • Handling Errors and Retries
    • Using User Agents and Headers
    • Avoiding Blocks and CAPTCHA
    • Scaling Your Python Crawler (When DIY Gets Hard)
    • Common Use Cases for Python Web Crawlers
    • Price monitoring
    • Lead generation (contact discovery)
    • SEO audits and competitor research
    • Content aggregation
    • Market research
    • Troubleshooting Common Crawling Problems
    • Frequently Asked Questions
    • Is web crawling legal?
    • What is the difference between crawling and scraping?
    • How fast should a crawler run?
    • Should `robots.txt` be respected?
    • What is the best Python library for crawling?
    • How should pagination be handled?
    • How should duplicates be handled?
    • Crawl data from the website with an API in Python.
    • When should a crawling API be used?
    • Start crawling job in Python.


    How to Crawl a Website with Python: Complete Guide with Code Examples

    Possible ways to crawl a website in Python:

    • Simplest copy-paste working Python crawling example
    • Simple Python Website Crawler with Requests and BeautifulSoup
    • Building a Production Web Crawler with Scrapy
    • Crawling JavaScript-Heavy Websites with Python
    • Crawl data from the website with an API in Python

    Simplest copy-paste working Python crawling example

    Before we dive into different approaches, here's a minimal working crawler you can run right now:

    #!/usr/bin/env python3
    # Install dependencies first
    # pip install requests beautifulsoup4
    
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse
    from collections import deque
    
    def crawl(start_url, max_pages=10):
        """Simple web crawler - just copy and run!"""
        visited = []
        queue = deque([start_url])
        domain = urlparse(start_url).netloc
    
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
    
            try:
                # Fetch the page
                response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
                response.raise_for_status()
                visited.append(url)
                print(f"✓ Crawled: {url}")
    
                # Parse HTML and find all links
                soup = BeautifulSoup(response.text, "html.parser")
                for link in soup.find_all("a", href=True):
                    full_url = urljoin(url, link["href"])
                    # Only crawl same domain
                    if urlparse(full_url).netloc == domain and full_url not in visited:
                        queue.append(full_url)
            except Exception as e:
            print(f"✗ Failed: {url} - {e}")
    
        return visited
    
    # Run the crawler
    if __name__ == "__main__":
        urls = crawl("https://quotes.toscrape.com/", max_pages=5)
        print(f"\nTotal pages crawled: {len(urls)}")
    

    Save this as crawler.py and run with python3 crawler.py. It will crawl up to 5 pages from quotes.toscrape.com.

    What this does: It starts at one URL, fetches the HTML with requests, parses it with BeautifulSoup to find all links, and follows links within the same domain. Perfect for learning, but for production use (handling JavaScript, rate limits, proxies, etc.) - keep reading.

    If you want to crawl a website with Python, you can get surprisingly far with a small script - as long as you know what will break in real life. I will start with a copy-paste crawler that actually runs, then build it up step by step (from Requests + BeautifulSoup to Scrapy, and to browser tools for JavaScript-heavy sites). Along the way I will show the boring but important parts: robots.txt, rate limits, and what to do when you hit 403/429 blocks. If you only need a few hundred pages, DIY is fine - if you need thousands, retries, proxies, and scheduling become the real work, and that is where a service like WebCrawlerAPI can make sense later.


    What is Web Crawling (and How It Differs from Scraping)

    People mix these terms up constantly, so let me clear it up.

    Crawling is about discovering pages. You start at one URL, grab all the links on that page, then visit those links, grab more links, and keep going. Think of it like exploring a maze - you're mapping out what exists, not necessarily reading every sign on the wall.

    Scraping is about extracting specific data from pages you already found. You grab product prices, article titles, contact info, reviews - whatever data you actually need from the HTML.

    Here's the real difference in practice:

    • A crawler hits 100 pages and returns a list of URLs
    • A scraper hits those same 100 pages and returns structured data (JSON, CSV, database rows)

    Most real projects do both. You crawl to find all product pages on an e-commerce site, then scrape each page to extract the price, title, and specs. The crawler discovers, the scraper extracts.

    Scraping has its own problems (parsing messy HTML, handling JavaScript, dealing with rate limits), but crawling adds the complexity of navigation logic on top.

    If you just need data from 5 specific URLs you already know? Skip the crawler, just scrape those pages directly. If you need to discover everything on a site first? You need a proper crawler.


    Simple Python Website Crawler with Requests and BeautifulSoup

    This section builds up a working crawler step by step. If you want to see the complete final version first, check out this gist - it's a production-ready crawler with robots.txt handling, proper delays, and CSV export. We'll break down the key parts below.

    If you want the smallest possible one-file crawler (dedupe + same-site scope + URL normalization), see: BeautifulSoup4 Web Crawler.

    Installing the Required Libraries

    You need two packages: requests for fetching web pages, and beautifulsoup4 for parsing HTML.

    pip install requests beautifulsoup4
    

    What each library does:

    • requests - Makes HTTP requests to fetch web pages. It handles all the low-level networking, headers, cookies, timeouts. Much easier than Python's built-in urllib.
    • beautifulsoup4 - Parses messy HTML into a tree you can navigate with simple Python code. Handles broken HTML that would crash a strict parser.

    Python version: You need Python 3.7 or higher. These libraries work with Python 3.12+ just fine.

    If you're in a virtual environment (you should be):

    python3 -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install requests beautifulsoup4
    

    That's it. No browser drivers, no headless Chrome, no Docker - just two pure Python packages.

    Crawling a Single Page

    Let's start simple. Fetch one page, grab its title and all the links on it.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    
    
    def crawl_single_page(url: str):
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")
    
        title = soup.title.get_text(strip=True) if soup.title else ""
        links = [urljoin(url, a["href"]) for a in soup.select("a[href]")]
        return title, links
    

    What happens here:

    1. requests.get() fetches the page. The User-Agent header makes us look like a browser instead of Python (some sites block the default Python user agent).
    2. BeautifulSoup parses the HTML. The html.parser is built into Python - no extra install needed.
    3. soup.title.get_text() extracts the text from the <title> tag.
    4. soup.select("a[href]") finds every <a> tag that has an href attribute.
    5. urljoin() converts relative URLs like /page/2 into absolute URLs like https://quotes.toscrape.com/page/2.

    This works fine for one page. But if you try to run this on 100 pages, you'll hit problems: no retry logic, no delay between requests, no way to avoid visiting the same page twice.

    Following Links to Crawl Multiple Pages

    Now we scale it up. Visit multiple pages by following links, but stay on the same domain and avoid infinite loops.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse
    from collections import deque
    
    def crawl_multiple_pages(seed_url, max_pages=10):
        domain = urlparse(seed_url).netloc
        visited = set()
        queue = deque([seed_url])

        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)

            try:
                r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
                r.raise_for_status()
            except requests.RequestException:
                continue  # Skip failed pages (timeouts, 404s) and keep going.

            soup = BeautifulSoup(r.text, "html.parser")
            for a in soup.select("a[href]"):
                next_url = urljoin(url, a["href"])
                if urlparse(next_url).netloc == domain and next_url not in visited:
                    queue.append(next_url)

        return list(visited)
    

    Key parts:

    • deque - A queue for breadth-first crawling. We add links to the right, pop URLs from the left. This crawls level by level instead of diving deep into one branch.
    • visited set - Prevents visiting the same URL twice. Crucial for avoiding infinite loops.
    • domain check - urlparse(full_url).netloc == domain keeps us on the same site. Without this, we'd crawl the entire internet.
    • try/except - If one page fails (timeout, 404, connection error), we skip it and keep going.

    What's still missing: This doesn't respect robots.txt, doesn't add delays between requests (will get you blocked fast), and doesn't handle redirects properly. The full gist example fixes all of this.
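The robots.txt part takes only a few lines with Python's standard library. Here's a minimal sketch (the `allowed` helper is an illustrative name, not from the gist) that checks an already-fetched robots.txt body against a URL:

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check an already-fetched robots.txt body against a URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /admin/"
print(allowed(rules, "MyCrawler", "https://example.com/page"))          # True
print(allowed(rules, "MyCrawler", "https://example.com/admin/secret"))  # False
```

In a real crawler you'd fetch `https://site.com/robots.txt` once per domain, cache the parsed result, and consult it before every request.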

    Extracting and Storing Data

    Now let's extract real data and save it somewhere useful. The idea: the crawl loop produces structured rows, and those rows get exported at the end.

    import csv
    from dataclasses import dataclass
    
    
    @dataclass
    class PageResult:
        url: str
        status: int
        title: str
    
    
    def save_to_csv(rows: list[PageResult], output_path: str) -> None:
        with open(output_path, "w", newline="", encoding="utf-8") as f:
            w = csv.DictWriter(f, fieldnames=["url", "status", "title"])
            w.writeheader()
            for r in rows:
                w.writerow({"url": r.url, "status": r.status, "title": r.title})
    
    
    # In the crawl loop, PageResult objects will be appended and then exported:
    # results.append(PageResult(url=final_url, status=resp.status_code, title=title))
    # save_to_csv(results, "out/crawl_results.csv")
    

    What this adds (the snippet above shows the dataclass and CSV export; the remaining pieces are implemented in the full gist):

    • dataclass - Clean way to store structured data. Better than dicts for type safety.
    • Session object - Reuses the same HTTP connection. Faster than creating a new connection for every request.
    • normalize_url() - Removes URL fragments (#section) so page.html and page.html#top count as the same page.
    • Content-Type check - Skips PDFs, images, and other non-HTML files. Prevents trying to parse binary data with BeautifulSoup.
    • time.sleep() - Adds a 0.5 second delay between requests. This is critical. Without delays, many sites will ban your IP after 10-20 requests.
    • CSV export - Saves data in a format you can open in Excel, import into a database, or process with pandas.
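The normalization and content-type checks from that list fit in a few lines. A sketch, with illustrative function names rather than the gist's exact ones:

```python
from urllib.parse import urldefrag


def normalize_url(url: str) -> str:
    """Drop the #fragment so page.html and page.html#top count as one page."""
    clean, _fragment = urldefrag(url)
    return clean


def is_html(content_type: str) -> bool:
    """True for text/html responses; lets us skip PDFs, images, and other binaries."""
    return content_type.split(";")[0].strip().lower() == "text/html"


print(normalize_url("https://example.com/page.html#top"))  # https://example.com/page.html
print(is_html("text/html; charset=utf-8"))                 # True
print(is_html("application/pdf"))                          # False
```

In the crawl loop you'd call `normalize_url()` before the visited-set check, and `is_html(resp.headers.get("Content-Type", ""))` before handing the body to BeautifulSoup.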

    Alternative: JSON output

    If you prefer JSON instead of CSV:

    import json
    
    def save_to_json(rows, output_path: str) -> None:
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump([r.__dict__ for r in rows], f, indent=2)
    

    Real-world extraction:

    For production use, you'd extract more fields:

    # Examples of common extractions
    description = (soup.find("meta", attrs={"name": "description"}) or {}).get("content", "")
    heading = (soup.find("h1").get_text(strip=True) if soup.find("h1") else "")
    price_el = soup.select_one("span.price")
    product_price = price_el.get_text(strip=True) if price_el else None
    

    This basic crawler will get you surprisingly far for small-scale projects. For the complete version with robots.txt handling and better error handling, see the full gist example.

    When this breaks: JavaScript-heavy sites, aggressive rate limiting, CAPTCHAs, login-required pages. We'll cover those problems in the next sections.


    Building a Production Web Crawler with Scrapy

    If you're crawling hundreds of pages and the Requests+BeautifulSoup approach starts to feel like you're duct-taping features together (retry logic here, rate limiting there, duplicate detection somewhere else), it's time to switch to Scrapy. Scrapy is a production web crawling framework - not just a library. It handles all the annoying infrastructure stuff so you can focus on extracting data.

    Check all examples in the Scrapy Website Crawler Examples Github repo.

    Why Scrapy for Larger Projects

    Scrapy gives you features that would take weeks to build yourself:

    • Built-in concurrency - Scrapy handles multiple requests in parallel automatically. You write single-threaded code, Scrapy runs it concurrently using Twisted. No threading, no async/await complexity. Set CONCURRENT_REQUESTS = 16 and you're crawling 16 pages at once.
    • Automatic retries - Network failures, timeouts, 500 errors - Scrapy retries them automatically. Retry counts and which HTTP codes trigger a retry are configurable (RETRY_TIMES, RETRY_HTTP_CODES).
    • robots.txt handling - Set ROBOTSTXT_OBEY = True and Scrapy checks robots.txt before every request. No manual parsing needed.
    • Request prioritization - Scrapy uses a priority queue. You can mark certain URLs as high priority and they'll get crawled first.
    • Middlewares and pipelines - Clean separation between fetching data (spider), processing data (pipeline), and request handling (middleware). Add logging, duplicate filtering, database saving without touching your spider code.
    • Response caching - Enable HTTP cache middleware and Scrapy stores responses on disk. Run the same crawl 100 times while testing your parser without hitting the server once.
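Most of these features are switched on through settings rather than code. Here's a sketch of a polite, cache-enabled development configuration (each key is a real Scrapy setting name):

```python
# Scrapy settings for a polite development crawl.
# Put these in settings.py or in a spider's custom_settings dict.
POLITE_DEV_SETTINGS = {
    "CONCURRENT_REQUESTS": 16,   # fetch up to 16 pages in parallel
    "ROBOTSTXT_OBEY": True,      # check robots.txt before every request
    "RETRY_TIMES": 3,            # retry failed requests up to 3 times
    "DOWNLOAD_DELAY": 0.5,       # politeness delay between requests
    "HTTPCACHE_ENABLED": True,   # cache responses on disk while testing parsers
}
```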

    When to use Scrapy instead of BeautifulSoup in Python

    • Crawling more than 50 pages
    • Need to crawl regularly (daily/weekly jobs)
    • Multi-step crawling (list pages → detail pages → pagination)
    • Need structured data output (JSON, CSV, database)
    • Care about politeness (delays, robots.txt)

    When to stick with Requests

    • One-off script for 5-10 pages
    • Simple proof of concept
    • Already embedded in a larger codebase

    The initial setup cost is higher with Scrapy, but for any serious crawling work, it pays off fast.

    Creating Your First Scrapy Spider

    First, install Scrapy:

    pip install scrapy
    

    Unlike the Requests approach, you don't need BeautifulSoup - Scrapy ships its own selector layer (parsel, built on lxml, which is typically faster than BeautifulSoup's default parser).

    Simple spider example:

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["https://quotes.toscrape.com/"]
    
        custom_settings = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 0.5}
    
        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
    
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
    

    Run it:

    scrapy runspider quotes_spider.py -o output.json
    

    Full example in crawl_spider.py.

    That's it. Scrapy handles everything: fetches pages, calls parse() for each response, follows the links you yield, exports results to JSON.

    Key parts explained:

    • name - Spider identifier. Required. Used when running via scrapy crawl quotes.
    • allowed_domains - Scrapy won't follow links outside these domains. Safety feature to prevent runaway crawls.
    • start_urls - List of URLs to start crawling. Scrapy fetches these first.
    • custom_settings - Spider-specific settings. Override global config without editing files.
    • parse(response) - Called for every successful response. Return/yield dictionaries (data) or Request objects (more pages to crawl).
    • response.css() - CSS selector API. ::text extracts text, ::attr(href) extracts attributes, .get() returns first match, .getall() returns all matches.
    • response.follow() - Creates a new Request. Handles relative URLs automatically. You specify the callback method.

    CSS selectors vs XPath:

    Scrapy supports both. CSS is more readable for simple cases:

    # CSS
    response.css("div.quote span.text::text").get()
    
    # XPath (equivalent)
    response.xpath("//div[@class='quote']//span[@class='text']/text()").get()
    

    Use CSS for 90% of cases. Switch to XPath when you need complex logic like "find the table cell in the same row as the one containing 'Price'".

    Complete working example:

    The full spider code including author page parsing is available in crawl-scrapy-examples. It demonstrates multiple callback methods and structured data extraction.

    Scrapy Crawling Rules and Link Extraction

    For complex crawling patterns (follow all pagination, follow all category pages, but don't follow external links), use CrawlSpider instead of the basic Spider. You define rules, Scrapy does the rest.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class QuotesCrawlSpider(CrawlSpider):
        name = "quotes_crawl"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["https://quotes.toscrape.com/"]
    
        rules = (
            Rule(LinkExtractor(restrict_css="li.next a"), callback="parse_quotes", follow=True),
        )
    
        def parse_quotes(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
    

    Full example in quotes_crawlspider.py. Run it and pagination will be followed automatically based on the rule. One gotcha: CrawlSpider does not pass the start URL's response through the rules' callback - override parse_start_url if you also need data from the first page.

    How rules work:

    Each Rule tells Scrapy:

    1. What links to extract - LinkExtractor() finds matching links
    2. What to do with them - Call a callback method to parse the page
    3. Whether to follow - If follow=True, Scrapy extracts links from those pages too

    LinkExtractor options:

    # Common patterns
    LinkExtractor(restrict_css="a.product-link")
    LinkExtractor(allow=r"/product/\d+", deny=r"/admin/")
    

    Depth limiting:

    Set DEPTH_LIMIT to prevent crawling too deep. Depth 0 is start URLs, depth 1 is pages linked from start URLs, depth 2 is pages linked from those, etc.

    custom_settings = {
        "DEPTH_LIMIT": 2,  # Only crawl 2 levels deep
    }
    

    Processing and Exporting Scraped Data

    For production crawlers, you want structured data, not just dictionaries. Scrapy provides Items for type-safe data structures and Pipelines for processing.

    Define data structures with Items:

    from scrapy import Item, Field
    
    class QuoteItem(Item):
        text = Field()
        author = Field()
    

    Use Items in your spider:

    class QuotesItemSpider(scrapy.Spider):
        name = "quotes_items"
        start_urls = ["https://quotes.toscrape.com/"]
    
        def parse(self, response):
            for quote in response.css("div.quote"):
                item = QuoteItem()
                item["text"] = quote.css("span.text::text").get()
                item["author"] = quote.css("small.author::text").get()
                yield item
    

    Why use Items instead of dicts?

    • Type safety - Know what fields exist
    • Validation - Add field processors to clean data
    • IDE autocomplete - Better developer experience
    • Pipeline compatibility - Pipelines can check item types

    Data processing with Pipelines:

    Full example in quotes_items.py.

    Pipelines receive items after extraction and before export. Use them to clean, validate, deduplicate, or save to databases.

    from scrapy.exceptions import DropItem


    class QuotesPipeline:
        def __init__(self):
            self.seen_quotes = set()

        def process_item(self, item, spider):
            if isinstance(item, QuoteItem):
                text = (item.get("text") or "").strip()
                if text in self.seen_quotes:
                    raise DropItem("duplicate")
                self.seen_quotes.add(text)
                item["text"] = text

            return item
    

    Enable pipelines in settings:

    custom_settings = {
        "ITEM_PIPELINES": {
            "myproject.pipelines.QuotesPipeline": 300,
            "myproject.pipelines.DatabasePipeline": 400,
        },
    }
    

    The number (300, 400) is the priority - lower numbers run first.

    Export formats:

    Scrapy exports to multiple formats out of the box:

    scrapy runspider spider.py -o output.json
    scrapy runspider spider.py -o output.csv
    

    Custom export to database:

    For database export, use a pipeline:

    import sqlite3

    class DatabasePipeline:
        def open_spider(self, spider):
            self.conn = sqlite3.connect("quotes.db")
            self.cursor = self.conn.cursor()
            self.cursor.execute(
                "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)"
            )

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            if isinstance(item, QuoteItem):
                self.cursor.execute(
                    "INSERT INTO quotes VALUES (?, ?, ?)",
                    (item.get("text"), item.get("author"), "")
                )
            return item
    

    Complete working example with Items and Pipeline: quotes_items.py. It demonstrates structured data extraction, duplicate detection, and data cleaning.

    Real production pattern:

    In production, you'd have multiple pipelines:

    1. ValidationPipeline (priority 100) - Check required fields, validate formats
    2. CleaningPipeline (priority 200) - Clean text, normalize data
    3. DuplicatePipeline (priority 300) - Filter duplicates
    4. DatabasePipeline (priority 400) - Save to database
    5. ImagePipeline (priority 500) - Download and process images (built-in)

    Each pipeline does one thing. Easy to test, easy to debug, easy to reorder.

    Scrapy transforms web crawling from "writing networking code" to "writing extraction logic." You focus on what data to extract, Scrapy handles how to fetch it reliably at scale.


    Crawling JavaScript-Heavy Websites with Python

    If your crawler keeps returning empty pages, you are probably not doing anything wrong. You are just fetching the wrong thing.

    Modern sites often ship a tiny HTML shell and then render the real content in the browser with JavaScript. requests, BeautifulSoup, and vanilla Scrapy will happily download the shell. Then you parse it. And you get... nothing.

    This section is about fixing that without turning your crawler into a fragile, slow headless-browser monster.

    The JavaScript Problem

    The core problem is simple:

    • requests.get(url).text returns the initial HTML document.
    • The stuff you actually want (products, posts, quotes, etc.) gets loaded later via XHR/fetch and rendered by the browser.

    Quick reality check (30 seconds):

    1. Open the page in Chrome.
    2. Right click -> View Page Source.
    3. Search for the thing you want (a product title, a quote, a price).

    If it is not in "View Page Source" but it is visible in the normal page, you are looking at a JavaScript-rendered site.

    Before you reach for a headless browser, try the cheap wins first:

    • Look for an underlying JSON API. Open DevTools -> Network -> Fetch/XHR, refresh, and watch what endpoints return the data. If the data is already in JSON, scraping the JSON is faster and more reliable than rendering HTML.
    • Check for embedded state. Next.js pages often have data inside __NEXT_DATA__. Many apps ship a big JSON blob in a <script> tag.
    • Sitemaps still work. Even JS-heavy sites often expose URLs in sitemap.xml. Discovery can stay "static" while only a subset of pages gets rendered.
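The embedded-state check is a one-liner on top of BeautifulSoup. A sketch (the `parse_next_data` helper is an illustrative name) that pulls the `__NEXT_DATA__` blob out of an already-fetched HTML string:

```python
import json

from bs4 import BeautifulSoup


def parse_next_data(html: str) -> dict:
    """Extract the __NEXT_DATA__ JSON blob a Next.js page embeds, if present."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if script is None or not script.string:
        return {}
    return json.loads(script.string)


page = '<html><script id="__NEXT_DATA__">{"props": {"page": 1}}</script></html>'
print(parse_next_data(page))  # {'props': {'page': 1}}
```

When this works, you get clean structured data without launching a browser at all.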

    If those options fail (or you need the fully rendered DOM), you need a browser renderer: Selenium or Playwright.

    Using Selenium for JavaScript Rendering

    Selenium drives a real browser (Chrome, Firefox, etc.) and gives you the rendered DOM. That is the whole point.

    The catch is that browsers are heavier than HTTP requests. So the workflow is usually kept simple:

    1. Open a page.
    2. Wait for a selector that proves content rendered.
    3. Extract HTML (or the specific fields) and pass it back to your Python parser.

    The Selenium example is kept as a public gist (so it can be copied into any project without hunting around this repo):

    https://gist.github.com/n10ty/988fe84ee2bb0722e2e14303ba36d3b7

    Here is the core Selenium flow in a few lines (open -> wait -> grab rendered HTML):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    driver.get("https://quotes.toscrape.com/js/")
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".quote")))
    html = driver.page_source
    driver.quit()
    

    What I would do in real crawls:

    • Render only when you must. Rendering every page is slow and expensive.
    • Wait for a specific element. Waiting for "page load" is not enough on many sites.
    • Set timeouts and treat rendering as flaky (retries, backoff).
    • Disable images/fonts to speed up loads.

    Playwright as a Selenium Alternative

    Playwright does the same job (browser automation), but it is usually more predictable than Selenium for crawling work. It also has a first-class Python library, so the whole pipeline can stay in Python.

    The full working script is kept as a public gist:

    Playwright renderer (Python) gist

    Here is the core Playwright flow in a few lines (open -> wait -> grab rendered HTML):

    from playwright.sync_api import sync_playwright
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")
        page.wait_for_selector(".quote")
        html = page.content()
        browser.close()
    

    When to use which:

    • If you want the most "it just works" Python browser renderer, Playwright is usually the cleanest path.
    • If you already have Selenium infrastructure (grid, browser profiles, existing scripts), stick with Selenium.
    • If you only need data that already exists in JSON, skip browsers entirely and hit the JSON endpoint.

    Crawling All Links on a Website (Full Site Crawl)

    There are two ways to discover pages on a domain:

    1. Ask the site for a list (sitemaps).
    2. Walk the site like a user (follow links).

    In real crawls, you'll use both.

    • Sitemaps give you coverage fast.
    • Link-following finds orphan pages, parameterized URLs, and things that never made it into the sitemap.

    Method 1: Start with sitemap.xml

    On many sites this is the highest-ROI move.

    • Discovery is fast.
    • You won't get trapped in infinite calendars.
    • You won't accidentally hammer the same nav pages 10,000 times.

    The catch: sitemaps don't always exist, and they aren't always complete.

    Here is a minimal sitemap URL collector:

    import xml.etree.ElementTree as ET
    from urllib.parse import urljoin
    
    import requests
    
    
    def get_sitemap_urls(base_url: str) -> list[str]:
        xml = requests.get(urljoin(base_url, "/sitemap.xml"), timeout=20).text
        root = ET.fromstring(xml)
        return [loc.text.strip() for loc in root.findall(".//{*}loc") if loc.text]
    

    If the sitemap index pattern is used (<sitemapindex> pointing to multiple sitemaps), it is the same approach: parse <loc> and recurse.
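    That recursion can be sketched like this, with the HTTP fetch injected as a callable so the parsing stays testable offline (the namespace-agnostic `{*}loc` matching is the same as above):

```python
import xml.etree.ElementTree as ET


def extract_locs(xml_text: str) -> tuple[list[str], bool]:
    """Return (<loc> URLs, is_index) for either a <urlset> or a <sitemapindex>."""
    root = ET.fromstring(xml_text)
    is_index = root.tag.endswith("sitemapindex")
    locs = [loc.text.strip() for loc in root.findall(".//{*}loc") if loc.text]
    return locs, is_index


def collect_all_urls(fetch, sitemap_url: str) -> list[str]:
    # `fetch` is any callable returning the XML body for a URL,
    # e.g. lambda u: requests.get(u, timeout=20).text
    locs, is_index = extract_locs(fetch(sitemap_url))
    if not is_index:
        return locs
    urls: list[str] = []
    for child in locs:
        urls.extend(collect_all_urls(fetch, child))
    return urls
```

    Injecting `fetch` also makes it trivial to add caching or rate limiting later without touching the parser.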

    Method 2: Breadth-first link crawling (BFS)

    This is the classic crawler loop: queue -> fetch -> extract links -> enqueue.

    What matters in practice is scope control. A full-site crawl can be destroyed by:

    • infinite query strings (?page=1, ?page=2, ...)
    • faceted navigation (?color=red&size=m&brand=...)
    • calendars
    • internal search pages
    • duplicate pages (same content, different URLs)

    If you want a crawl you can trust, add these controls:

    • Domain allowlist (stay on one domain)
    • URL normalization (remove fragments, normalize trailing slashes)
    • Query strategy (drop all query params, or allow a small allowlist)
    • Depth/page limits (hard stops)
    • Content-type filters (HTML only)

    Here is a compact BFS crawler skeleton with those guardrails:

    from collections import deque
    from urllib.parse import urljoin, urlparse, urldefrag
    
    import requests
    from bs4 import BeautifulSoup
    
    
    def normalize(url: str) -> str:
        url, _frag = urldefrag(url)
        return url.rstrip("/")
    
    
    def crawl_site(seed_url: str, max_pages: int = 200) -> list[str]:
        domain = urlparse(seed_url).netloc
        seen = set()
        queue = deque([seed_url])
    
        while queue and len(seen) < max_pages:
            url = normalize(queue.popleft())
            if url in seen or urlparse(url).netloc != domain:
                continue
            if urlparse(url).query:  # Drop query params by default.
                continue
            seen.add(url)
    
            try:
                resp = requests.get(url, timeout=20)
            except requests.RequestException:
                continue  # Skip unreachable URLs instead of crashing the crawl.
            if "text/html" not in resp.headers.get("Content-Type", ""):
                continue  # Content-type filter from the guardrail list above.
            soup = BeautifulSoup(resp.text, "html.parser")
            queue.extend(normalize(urljoin(url, a["href"])) for a in soup.select("a[href]"))
    
        return list(seen)
    

    If you copy only one idea from this section, make it this one:

    Separate discovery from fetching. Discover URLs first (sitemaps + BFS), then fetch and extract with the right tool (Requests/Scrapy/Playwright), depending on what each URL needs.


    Best Practices for Python Web Crawlers

    The crawler that works on a demo site is not the crawler that survives a real site.

    This is where the boring parts will save you.

    Respecting robots.txt and Rate Limiting

    robots.txt is not a law. It is a policy file.

    If a site says "do not crawl /private", do not crawl it.

    At a minimum, check:

    • whether the URL is allowed for your crawler user agent
    • whether a crawl delay is specified

    Python has a standard library parser:

    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser
    
    
    def robots_allows(url: str, user_agent: str = "*") -> bool:
        rp = RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        rp.read()
        return rp.can_fetch(user_agent, url)
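
    The same parser can also answer the crawl-delay question from the list above. A small sketch, parsing the robots.txt body directly so no network call is needed:

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_from(robots_txt: str, user_agent: str = "*") -> float:
    """Read Crawl-delay from a robots.txt body, defaulting to 1s when unset."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else 1.0
```

    The 1-second fallback is an assumption, not a standard; pick a default that suits the target site.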
    

    Treat rate limiting as part of correctness.

    • A crawler that gets blocked at page 50 is not "fast".
    • It is just wrong.

    The simple rule: go slower than you think. Then speed up with concurrency only after error rates and blocks are under control.
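    One way to sketch "go slower than you think" is a minimum-delay throttle with a little jitter, so requests never land on a fixed cadence (this is a hypothetical helper, not from any specific library):

```python
import random
import time


class Throttle:
    """Enforce a minimum delay between requests, plus small random jitter."""

    def __init__(self, delay: float = 1.0, jitter: float = 0.3):
        self.delay = delay
        self.jitter = jitter
        self._last = 0.0  # Monotonic timestamp of the previous request.

    def wait(self) -> None:
        # Sleep only for whatever part of the delay has not already elapsed.
        remaining = self.delay - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining + random.uniform(0, self.jitter))
        self._last = time.monotonic()
```

    Call `throttle.wait()` before every request; the jitter makes the traffic pattern look less bot-like than a perfectly even interval.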

    Handling Errors and Retries

    Failures will happen. You will see:

    • timeouts
    • temporary 5xx responses
    • 429 rate limits
    • random connection resets

    So add retries, and make them polite.

    import random
    import time
    
    import requests
    
    
    def get_with_backoff(session: requests.Session, url: str, tries: int = 5) -> requests.Response:
        resp = None
        for attempt in range(tries):
            try:
                resp = session.get(url, timeout=20)
                if resp.status_code < 400:
                    return resp
                if resp.status_code not in (429, 500, 502, 503, 504):
                    break  # Other 4xx errors will not improve on retry.
            except requests.RequestException:
                resp = None
            # Exponential backoff with jitter, capped at 30 seconds.
            time.sleep(min(30, 2 ** attempt) + random.random())
        if resp is None:
            raise RuntimeError(f"request failed: {url}")
        resp.raise_for_status()
        return resp
    

    In production, write failures to a log with:

    • URL
    • status code
    • exception type
    • retry count
    • timestamp

    If you cannot answer "how many URLs failed and why", you do not have a crawler yet.
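    A minimal sketch of such a record, emitted as one JSON line per failure so it is easy to grep and aggregate (the field names are just a suggestion):

```python
import datetime
import json


def failure_record(url: str, status=None, exc=None, retries: int = 0) -> str:
    """Serialize one crawl failure as a single JSON line."""
    return json.dumps({
        "url": url,
        "status": status,                                  # HTTP status, if any
        "exception": type(exc).__name__ if exc else None,  # e.g. "Timeout"
        "retries": retries,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
```

    Append these lines to a file and you can answer "how many URLs failed and why" with a one-liner.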

    Using User Agents and Headers

    The default Python user agent is a red flag for some sites.

    This does not mean you should pretend to be Chrome 124 with 40 headers.

    It means:

    • a realistic User-Agent
    • basic Accept headers
    • consistent behavior (timeouts, redirects)

    import requests
    
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; WebCrawlerAPI/1.0)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    })
    

    If sessions/cookies are required (login flows), things get harder fast. At that point, browser automation or an API-based approach is usually the better choice.

    Avoiding Blocks and CAPTCHA

    This is the part people try to skip.

    Blocking happens when:

    • too many requests are sent from one IP
    • patterns look too bot-like (same path cadence, no cookies, no JS)
    • the target has aggressive bot protection

    The early warning signs:

    • 403 spikes
    • 429 spikes
    • HTML that suddenly becomes a challenge page
    • response sizes that drop to a tiny constant

    What helps before anything fancy:

    1. Slow down.
    2. Respect robots.txt.
    3. Cache responses while developing parsers.
    4. Stop crawling when blocks start. (Backoff, rotate targets, retry later.)

    Proxies can help, but they will not fix a broken crawler design.
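    Point 3 (cache while developing) can be a tiny disk cache that keys files by URL hash; this is one possible layout, not a prescribed one:

```python
import hashlib
from pathlib import Path


def cached_fetch(url: str, fetch, cache_dir: Path = Path(".crawl_cache")) -> str:
    """Return the cached body if present; otherwise call fetch(url) and store it."""
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)  # e.g. lambda u: requests.get(u, timeout=20).text
    path.write_text(body, encoding="utf-8")
    return body
```

    While you iterate on parsers, every page is fetched exactly once, which keeps you off the target's radar.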


    Scaling Your Python Crawler (When DIY Gets Hard)

    DIY crawlers break in predictable ways:

    • A laptop will not like thousands of browser sessions.
    • IP blocks will show up as soon as the crawl is big enough.
    • Retrying and scheduling will become the real work.

    At this point, three paths are usually taken:

    1. Stay DIY, but invest in infrastructure. Queues, storage, distributed workers, observability.
    2. Move more of the work to Scrapy, which manages concurrency and retry logic for you.
    3. Use a crawling API when rendering, proxies, retries, and job scheduling are not where you want to spend your time.

    This is where a service like WebCrawlerAPI can make sense.

    The tradeoff is simple:

    • Money will be spent.
    • Engineering time will be saved.

    If your crawl is a one-off for 50 pages, it will not be worth it. If you are running jobs daily across thousands of URLs, it often will be.


    Common Use Cases for Python Web Crawlers

    The crawler is just a tool. The value comes from what is built on top.

    Price monitoring

    Crawl product pages on a schedule, extract the price fields, and store the diffs.

    Real-life caveats:

    • prices may be personalized
    • currencies change by region
    • stock may be hidden behind JS

    Lead generation (contact discovery)

    This usually means crawling:

    • team pages
    • directory pages
    • "contact" pages

    Then extracting emails, phone numbers, or forms.

    Be careful here. Legal rules differ by country, and terms of service apply.

    SEO audits and competitor research

    This is a classic crawl job:

    • find all internal URLs
    • check status codes (404/500)
    • identify redirect chains
    • detect thin pages (low word count)
    • map internal links and depth

    Content aggregation

    Crawl blogs, docs, and knowledge bases to:

    • build internal search
    • create datasets
    • keep local mirrors

    If the goal is just “tell me when this page changes”, a feed is often simpler than a full crawl pipeline. See: convert any website to an RSS feed.

    Market research

    This is the messy one.

    You will be dealing with:

    • inconsistent HTML
    • JS-heavy listing pages
    • rate limits
    • pagination patterns that change without warning

    This is where crawlers become products.


    Troubleshooting Common Crawling Problems

    Problem: "My crawler returns empty content"

    Likely cause: JavaScript rendering.

    Fix:

    • check View Page Source
    • find the JSON endpoint in DevTools
    • use Selenium/Playwright only where needed

    Problem: "I keep getting blocked (403/429)"

    Likely cause: rate limiting or bot protection.

    Fix:

    • slow down
    • add backoff and retries
    • reduce concurrency
    • respect robots.txt
    • stop the crawl when blocks spike and retry later

    Problem: "My crawl never ends"

    Likely cause: infinite URL space.

    Fix:

    • drop query params by default
    • add max_pages and/or depth limits
    • add allow/deny patterns

    Problem: "My output has duplicates"

    Likely cause: URL variants and redirects.

    Fix:

    • normalize URLs (remove fragments, consistent trailing slashes)
    • store and dedupe final URLs after redirects
    • consider canonical URLs if provided
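    For the canonical-URL point, a stdlib-only sketch that pulls `<link rel="canonical">` out of a page, assuming the page declares one:

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of <link rel="canonical"> if the page declares one."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "link" and (d.get("rel") or "").lower() == "canonical":
            self.canonical = d.get("href")


def canonical_url(html: str):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

    Dedupe on the canonical URL when it exists, and fall back to the normalized final URL when it does not.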

    Problem: "It is too slow"

    Fix:

    • cache responses during development
    • use a requests.Session()
    • add controlled concurrency (Scrapy will do this well)
    • avoid rendering unless needed
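    Controlled concurrency outside Scrapy can be as small as a bounded thread pool; `fetch` here is any callable that takes a URL:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls: list[str], fetch, workers: int = 8) -> dict:
    """Fetch URLs with a small, bounded pool; results stay keyed by URL."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

    Keep `workers` small and combine it with the throttling and retry helpers above; raw parallelism without rate limiting is how crawls get blocked.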

    Frequently Asked Questions

    Is web crawling legal?

    It depends.

    Public pages can be crawled, but terms of service, robots.txt, and local laws can apply. This is not legal advice.

    If you are crawling anything sensitive (accounts, paywalls, personal data), talk to a lawyer.

    What is the difference between crawling and scraping?

    Crawling discovers pages. Scraping extracts data from those pages.

    Most projects will do both.

    How fast should a crawler run?

    As slow as needed to avoid being blocked and to avoid hurting the target site.

    If you cannot get stable results at 1 request per second, going faster will not help.

    Should robots.txt be respected?

    Yes.

    If the goal is a reliable crawl, you do not want to fight the target site.

    What is the best Python library for crawling?

    • For small scripts: requests + BeautifulSoup.
    • For real crawling: Scrapy.
    • For JS-heavy pages: Playwright or Selenium.
    • For when you just need the data without building a crawler: a crawling API.

    How should pagination be handled?

    If there is a clear "next" link, follow it.

    If pagination is query-based (?page=2), add an allowlist and a hard cap.
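    The allowlist plus hard cap can be sketched like this (the `page` parameter name and the cap of 50 are assumptions; adjust them to the site):

```python
import re

MAX_PAGE = 50  # Hard cap; an illustrative value, tune per site.


def keep_url(url: str) -> bool:
    """Keep query-free URLs; allow only ?page=N up to the hard cap."""
    if "?" not in url:
        return True
    m = re.search(r"\?page=(\d+)$", url)
    return bool(m) and int(m.group(1)) <= MAX_PAGE
```

    Run every discovered URL through a filter like this before enqueueing it, and query-string explosions stop at the gate.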

    How should duplicates be handled?

    Normalize URLs, store the final redirected URLs, and prefer canonical URLs when the site provides them.

    Crawl Data from a Website with an API in Python

    Using an API is the shortcut when crawling stops being "just a script" and starts turning into infrastructure. Browser rendering, proxies, retries, and fingerprinting will already be solved on the other side, so time is not spent rebuilding them. The crawl can be made more predictable too: one request starts a job, results come back in a consistent format, and failures are handled with retries instead of manual babysitting.

    When should a crawling API be used?

    When the crawl becomes infrastructure:

    • proxies
    • browser rendering
    • job scheduling
    • retries at scale

    If those are not your focus, using an API can be the right move.

    Start a Crawling Job in Python

    Assuming you have your access key, here is the code with the basic parameters to crawl any site with Python:

    #!/usr/bin/env python3
    
    from webcrawlerapi import WebCrawlerAPI
    
    API_KEY = "Your API KEY from https://dash.webcrawlerapi.com/access"
    
    crawler = WebCrawlerAPI(api_key=API_KEY)
    
    job = crawler.crawl(
        url="https://books.toscrape.com",
        scrape_type="markdown",
        items_limit=10,
    )
    
    print(job.status)
    print(len(job.job_items))
    

    Conclusion: Start Crawling Websites with Python Today

    If you only need a few pages, the copy-paste Requests crawler will be enough.

    If you need hundreds, Scrapy will save you time.

    If the site is JavaScript-heavy, use a browser renderer only for the pages that need it, not for everything.

    And if the crawl becomes a recurring job with retries, proxies, and rendering, that is when a managed crawler like WebCrawlerAPI will start to look reasonable.