    7 min read

    How to build a web crawler with Scrapy in Python

    Scrapy is a powerful tool for crawling and scraping websites. In this tutorial, you will learn how to build a crawler using this framework, render JavaScript, and save the content of the website page by page.

    Written by Andrew
    Published on May 25, 2024

    Table of Contents

    • Basic crawler with Python
    • Scrapy crawl delay
    • Render JavaScript in Scrapy
    • Final Scrapy crawler script
    • Summary


    Basic crawler with Python

    To start crawling, first install Scrapy:

    pip install scrapy
    

    Then, create a basic script:

    import scrapy
    import os
    import hashlib
    
    class PageSaverSpider(scrapy.Spider):
        name = "page_saver"
        start_urls = [
            'https://books.toscrape.com/index.html',
        ]
    
        def parse(self, response):
            # Extracting the URL to use as a filename
            url = response.url
            # Using hashlib to create a unique hash for the URL to ensure filename is filesystem safe
            url_hash = hashlib.md5(url.encode()).hexdigest()
            # Create a safe filename from the URL
            safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
            filename = f'{safe_url}_{url_hash}.html'
    
            # Ensuring filename length does not exceed filesystem limits
            filename = (filename[:245] + '..html') if len(filename) > 250 else filename
    
            # Creating a directory to save files if it doesn't exist
            os.makedirs('saved_pages', exist_ok=True)
            file_path = os.path.join('saved_pages', filename)
    
            # Writing the response body to the file
            with open(file_path, 'wb') as f:
                f.write(response.body)
            self.log(f'Saved file {file_path}')
    
            # Following links to the next page
            next_pages = response.css('a::attr(href)').getall()
            for next_page in next_pages:
                next_page_url = response.urljoin(next_page)
                yield scrapy.Request(next_page_url, callback=self.parse)
    

    This spider does the following:

    • downloads the content of the initial page
    • saves the content to a file in the saved_pages directory, using a filename derived from the URL plus an MD5 hash
    • finds all <a href=""></a> elements and queues the extracted links so they are crawled next

    In other words, this basic Scrapy script crawls the website and saves the content of every page it reaches to a file.
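
    To try it out, one option is to run the spider programmatically with Scrapy's CrawlerProcess. This is just a sketch, assuming the spider class above is defined in the same file (you can equally save it as page_saver.py and run scrapy runspider page_saver.py):

    from scrapy.crawler import CrawlerProcess

    if __name__ == '__main__':
        process = CrawlerProcess(settings={
            'LOG_LEVEL': 'INFO',  # keep the console output readable
        })
        process.crawl(PageSaverSpider)  # the spider class defined above
        process.start()                 # blocks until the crawl finishes

    Note that the spider follows every link it finds, including links to other sites. Setting allowed_domains = ['books.toscrape.com'] on the spider class keeps the crawl on a single domain.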

    Scrapy crawl delay

    If you run the script above as is, every new page is requested immediately after the previous one. This can put unwanted load on the website, which may lead to downtime, or to the owner (if that isn't you, of course) making life harder for crawler bots with bot protection, CAPTCHAs, IP bans, and so on.

    To be respectful to the websites you crawl, set a delay between requests with the DOWNLOAD_DELAY setting in the spider's custom_settings:

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
    }
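
    A fixed delay works, but Scrapy also ships with the AutoThrottle extension, which adapts the delay to how quickly the server responds. A minimal sketch of the relevant settings (the values here are just illustrative):

    custom_settings = {
        'DOWNLOAD_DELAY': 1,            # baseline delay between requests, in seconds
        'AUTOTHROTTLE_ENABLED': True,   # let Scrapy adjust the delay dynamically
        'AUTOTHROTTLE_START_DELAY': 1,  # delay used before any feedback is collected
        'AUTOTHROTTLE_MAX_DELAY': 10,   # upper bound on the adjusted delay
        'ROBOTSTXT_OBEY': True,         # also respect the site's robots.txt rules
    }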
    

    Render JavaScript in Scrapy

    By default, Scrapy does not render JavaScript. This is a significant limitation, since more and more websites rely on JS to render their pages.

    Fortunately, JavaScript rendering is possible with [Splash](https://splash.readthedocs.io/), a lightweight JavaScript rendering service.

    Run Splash using Docker first:

    docker pull scrapinghub/splash
    docker run -p 8050:8050 scrapinghub/splash
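
    Once the container is up, you can quickly confirm that Splash renders pages by calling its render.html endpoint. A small check script (this assumes the requests package is installed):

    import requests

    # Ask Splash to fetch a page and wait roughly one second for its JavaScript to run
    resp = requests.get(
        'http://localhost:8050/render.html',
        params={'url': 'https://books.toscrape.com/index.html', 'wait': 1},
        timeout=30,
    )
    print(resp.status_code, len(resp.text))  # 200 and a non-trivial length mean Splash is working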
    

    Next, install the scrapy-splash package:

    pip install scrapy-splash
    

    Add these settings to your scraper, either in the project's settings.py or in the spider's custom_settings:

    
    # Enable splash middleware
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    
    # Enable splash deduplicate filter
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    # Enable splash HTTP cache
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
    # Splash server URL
    SPLASH_URL = 'http://localhost:8050'
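
    Enabling the middleware alone does not route anything through Splash: requests are only rendered when they are created as SplashRequest objects instead of plain scrapy.Request. A minimal sketch of what that looks like inside a spider:

    from scrapy_splash import SplashRequest

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page time to execute its JavaScript before the HTML is returned
            yield SplashRequest(url, callback=self.parse, args={'wait': 1})

    The final script below combines this with the crawl delay and the page-saving logic.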
    

    Final Scrapy crawler script

    import scrapy
    import os
    import hashlib
    from scrapy_splash import SplashRequest
    
    class PageSaverSpider(scrapy.Spider):
        name = "page_saver"
        start_urls = [
            'https://books.toscrape.com/index.html',
        ]
        custom_settings = {
            'DOWNLOAD_DELAY': 1,  # Adding a delay of 1 second between requests
            'BOT_NAME': 'myproject',
            'DOWNLOADER_MIDDLEWARES': {
                'scrapy_splash.SplashCookiesMiddleware': 723,
                'scrapy_splash.SplashMiddleware': 725,
                'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            },
            'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
            'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
            'SPLASH_URL': 'http://localhost:8050',
        }

        def start_requests(self):
            # Send the initial requests through Splash so JavaScript is rendered
            for url in self.start_urls:
                yield SplashRequest(url, callback=self.parse, args={'wait': 1})

        def parse(self, response):
            # Extracting the URL to use as a filename
            url = response.url
            # Using hashlib to create a unique hash for the URL to ensure filename is filesystem safe
            url_hash = hashlib.md5(url.encode()).hexdigest()
            # Create a safe filename from the URL
            safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
            filename = f'{safe_url}_{url_hash}.html'
    
            # Ensuring filename length does not exceed filesystem limits
            filename = (filename[:245] + '..html') if len(filename) > 250 else filename
    
            # Creating a directory to save files if it doesn't exist
            os.makedirs('saved_pages', exist_ok=True)
            file_path = os.path.join('saved_pages', filename)
    
            # Writing the response body to the file
            with open(file_path, 'wb') as f:
                f.write(response.body)
            self.log(f'Saved file {file_path}')
    
            # Following links and rendering them through Splash as well
            next_pages = response.css('a::attr(href)').getall()
            for next_page in next_pages:
                next_page_url = response.urljoin(next_page)
                yield SplashRequest(next_page_url, callback=self.parse, args={'wait': 1})
    

    Summary

    Scrapy is a powerful Python crawling and scraping framework, and it works great when you need to crawl or scrape a website. However, you have to be familiar with the Python programming language and manage the infrastructure, including Splash for JavaScript rendering, yourself.

    If you don't have time for that and simply want to make an HTTP call and get the data, it is better to try WebCrawler API, which handles all of this for you; note, however, that it is a paid service.