Basic crawler with Python
To start crawling, first install Scrapy:
pip install scrapy
Then, create a basic script:
import scrapy
import os
import hashlib


class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    start_urls = [
        'https://books.toscrape.com/index.html',
    ]

    def parse(self, response):
        # Extracting the URL to use as a filename
        url = response.url
        # Using hashlib to create a unique hash for the URL so the filename is filesystem-safe
        url_hash = hashlib.md5(url.encode()).hexdigest()
        # Create a safe filename from the URL
        safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
        filename = f'{safe_url}_{url_hash}.html'
        # Ensuring the filename length does not exceed filesystem limits
        filename = (filename[:245] + '..html') if len(filename) > 250 else filename
        # Creating a directory to save files if it doesn't exist
        os.makedirs('saved_pages', exist_ok=True)
        file_path = os.path.join('saved_pages', filename)
        # Writing the response body to the file
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {file_path}')
        # Following links found on the page
        next_pages = response.css('a::attr(href)').getall()
        for next_page in next_pages:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
This scraper script does the following:
- downloads the content of the initial page
- saves the content to a file in the saved_pages directory; the filename includes the URL
- finds all <a href=""></a> elements and sends the extracted links to the crawl queue
This basic Scrapy script saves the content of a website's pages to files.
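To try it out, save the spider to a file and run it with Scrapy's runspider command (the filename page_saver.py below is just an example):
scrapy runspider page_saver.py
The downloaded pages will appear in a saved_pages directory created in your working directory.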
Scrapy crawl delay
If you run the script above on a website, every new page will be crawled immediately after the previous one. This can create unwanted load, which may lead to downtime or to your crawler being blocked. Website owners (if that is not you, of course) can make crawler bots' lives harder with bot protection, CAPTCHAs, IP bans, and so on.
To be respectful of the websites you crawl, you can add a delay between page requests using the DOWNLOAD_DELAY setting:
custom_settings = {
    'DOWNLOAD_DELAY': 1,
}
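custom_settings is a class attribute on the spider (the same values can also go into your project's settings.py). As a sketch, the politeness-related part of the spider could look like this; ROBOTSTXT_OBEY is an extra, optional setting shown here to honour robots.txt and is not part of the original script:
import scrapy

class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    # Politeness-related settings; tune the values for the site you crawl
    custom_settings = {
        'DOWNLOAD_DELAY': 1,     # wait about 1 second between requests
        'ROBOTSTXT_OBEY': True,  # optionally honour the site's robots.txt rules
    }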
Render JavaScript in Scrapy
By default, Scrapy doesn't render JavaScript. This is a significant limitation since more and more websites rely on JS to render their pages.
Fortunately, rendering is possible with [Splash](https://splash.readthedocs.io/), a lightweight JavaScript rendering service.
Run Splash using Docker first:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
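To verify that Splash is up, you can ask it to render a page through its render.html HTTP endpoint (the target URL and wait value here are just examples):
curl 'http://localhost:8050/render.html?url=https://books.toscrape.com/index.html&wait=1'
If Splash is running, the command returns the rendered HTML of the page.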
Next, install the scrapy-splash package:
pip install scrapy-splash
Add these settings to your scraper:
# Enable the Splash middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable the Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Enable the Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Splash server URL
SPLASH_URL = 'http://localhost:8050'
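Enabling the middleware alone does not route requests through Splash: a request is only rendered when it carries Splash metadata, which is easiest to do with SplashRequest from the scrapy-splash package. A minimal sketch of how this looks inside the spider:
from scrapy_splash import SplashRequest

class PageSaverSpider(scrapy.Spider):
    # ... name, start_urls and settings as above ...

    def start_requests(self):
        # Send the start URLs through Splash; 'wait' gives the page time to run its JavaScript
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, args={'wait': 1})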
Final Scrapy crawler script
import scrapy
import os
import hashlib

from scrapy_splash import SplashRequest


class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    start_urls = [
        'https://books.toscrape.com/index.html',
    ]

    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # Adding a delay of 1 second between requests
        'BOT_NAME': 'myproject',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
        'SPLASH_URL': 'http://localhost:8050',
    }

    def start_requests(self):
        # Route the initial requests through Splash so JavaScript is rendered
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, args={'wait': 1})

    def parse(self, response):
        # Extracting the URL to use as a filename
        url = response.url
        # Using hashlib to create a unique hash for the URL so the filename is filesystem-safe
        url_hash = hashlib.md5(url.encode()).hexdigest()
        # Create a safe filename from the URL
        safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
        filename = f'{safe_url}_{url_hash}.html'
        # Ensuring the filename length does not exceed filesystem limits
        filename = (filename[:245] + '..html') if len(filename) > 250 else filename
        # Creating a directory to save files if it doesn't exist
        os.makedirs('saved_pages', exist_ok=True)
        file_path = os.path.join('saved_pages', filename)
        # Writing the rendered response body to the file
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {file_path}')
        # Following links found on the page, again through Splash
        next_pages = response.css('a::attr(href)').getall()
        for next_page in next_pages:
            next_page_url = response.urljoin(next_page)
            yield SplashRequest(next_page_url, callback=self.parse, args={'wait': 1})
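Because the spider follows every link it finds, it can easily wander off to other domains. If you only want to stay on the target site, Scrapy's allowed_domains spider attribute (not used in the script above) restricts crawling to the listed domains, for example:
allowed_domains = ['books.toscrape.com']
Requests to any other domain are then filtered out by Scrapy's offsite middleware.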
Summary
Scrapy is a powerful Python crawling and scraping framework. It works great if you need to crawl or scrape a website. However, you have to be familiar with the Python programming language and manage the infrastructure yourself.
If you don't have time for that and simply want to make an HTTP call and get the data, it is better to try a WebCrawler API, which handles all of this for you; however, it is a paid service.