Basic crawler with Python
To start crawling, first install Scrapy:
pip install scrapy
Then, create a basic script:
import scrapy
import os
import hashlib


class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    start_urls = [
        'https://books.toscrape.com/index.html',
    ]

    def parse(self, response):
        # Extracting the URL to use as a filename
        url = response.url
        # Using hashlib to create a unique hash for the URL so the filename is filesystem-safe
        url_hash = hashlib.md5(url.encode()).hexdigest()
        # Create a safe filename from the URL
        safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
        filename = f'{safe_url}_{url_hash}.html'
        # Ensuring the filename length does not exceed filesystem limits
        filename = (filename[:245] + '..html') if len(filename) > 250 else filename
        # Creating a directory to save files if it doesn't exist
        os.makedirs('saved_pages', exist_ok=True)
        file_path = os.path.join('saved_pages', filename)
        # Writing the response body to the file
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {file_path}')
        # Following links found on the page
        next_pages = response.css('a::attr(href)').getall()
        for next_page in next_pages:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
This scraper script does the following:
- downloads the content of the initial page
- saves the content to a file in the saved_pages directory; the filename includes the URL
- finds all <a href=""></a> elements and sends the extracted links to the crawl queue
This basic Scrapy script saves the content of a website's pages to files.
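To try it out, save the spider to a file and run it with Scrapy's runspider command (the filename page_saver.py below is just an example):
scrapy runspider page_saver.py
The downloaded pages will appear in a saved_pages directory created in your working directory.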
Scrapy crawl delay
If you run the script above on a website, every new page will be crawled immediately after the previous one. This can create unwanted load, which may lead to downtime or to your crawler being blocked. Website owners (if that is not you, of course) can make crawler bots' lives harder with bot protection, CAPTCHAs, IP bans, and so on.
To be respectful of the websites you crawl, you can add a delay between page requests using the DOWNLOAD_DELAY setting:
custom_settings = {
    'DOWNLOAD_DELAY': 1,
}
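custom_settings is a class attribute on the spider (the same values can also go into your project's settings.py). As a sketch, the politeness-related part of the spider could look like this; ROBOTSTXT_OBEY is an extra, optional setting shown here to honour robots.txt and is not part of the original script:
import scrapy

class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    # Politeness-related settings; tune the values for the site you crawl
    custom_settings = {
        'DOWNLOAD_DELAY': 1,     # wait about 1 second between requests
        'ROBOTSTXT_OBEY': True,  # optionally honour the site's robots.txt rules
    }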
Render JavaScript in Scrapy
By default, Scrapy doesn't render JavaScript. This is a significant limitation since more and more websites rely on JS to render their pages.
Fortunately, rendering is possible with [Splash](https://splash.readthedocs.io/), a lightweight JavaScript rendering service.
Run Splash using Docker first:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
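To verify that Splash is up, you can ask it to render a page through its render.html HTTP endpoint (the target URL and wait value here are just examples):
curl 'http://localhost:8050/render.html?url=https://books.toscrape.com/index.html&wait=1'
If Splash is running, the command returns the rendered HTML of the page.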
Next, install the scrapy-splash package:
pip install scrapy-splash
Add these settings to your scraper:
# Enable the Splash middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable the Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Enable the Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Splash server URL
SPLASH_URL = 'http://localhost:8050'
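Enabling the middleware alone does not route requests through Splash: a request is only rendered when it carries Splash metadata, which is easiest to do with SplashRequest from the scrapy-splash package. A minimal sketch of how this looks inside the spider:
from scrapy_splash import SplashRequest

class PageSaverSpider(scrapy.Spider):
    # ... name, start_urls and settings as above ...

    def start_requests(self):
        # Send the start URLs through Splash; 'wait' gives the page time to run its JavaScript
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, args={'wait': 1})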
Final Scrapy crawler script
import scrapy
import os
import hashlib

from scrapy_splash import SplashRequest


class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    start_urls = [
        'https://books.toscrape.com/index.html',
    ]

    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # Adding a delay of 1 second between requests
        'BOT_NAME': 'myproject',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
        'SPLASH_URL': 'http://localhost:8050',
    }

    def start_requests(self):
        # Route the initial requests through Splash so JavaScript is rendered
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, args={'wait': 1})

    def parse(self, response):
        # Extracting the URL to use as a filename
        url = response.url
        # Using hashlib to create a unique hash for the URL so the filename is filesystem-safe
        url_hash = hashlib.md5(url.encode()).hexdigest()
        # Create a safe filename from the URL
        safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
        filename = f'{safe_url}_{url_hash}.html'
        # Ensuring the filename length does not exceed filesystem limits
        filename = (filename[:245] + '..html') if len(filename) > 250 else filename
        # Creating a directory to save files if it doesn't exist
        os.makedirs('saved_pages', exist_ok=True)
        file_path = os.path.join('saved_pages', filename)
        # Writing the rendered response body to the file
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {file_path}')
        # Following links found on the page, again through Splash
        next_pages = response.css('a::attr(href)').getall()
        for next_page in next_pages:
            next_page_url = response.urljoin(next_page)
            yield SplashRequest(next_page_url, callback=self.parse, args={'wait': 1})
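Because the spider follows every link it finds, it can easily wander off to other domains. If you only want to stay on the target site, Scrapy's allowed_domains spider attribute (not used in the script above) restricts crawling to the listed domains, for example:
allowed_domains = ['books.toscrape.com']
Requests to any other domain are then filtered out by Scrapy's offsite middleware.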
Summary
Scrapy is a powerful Python crawling and scraping framework. It works great if you need to crawl or scrape a website. However, you have to be familiar with the Python programming language and manage the infrastructure yourself.
If you don't have time for that and simply want to make an HTTP call and get the data, it is better to try a WebCrawler API, which handles all of this for you; however, it is a paid service.