To start crawling, first install Scrapy:
pip install scrapy
Then, create a basic script:
import scrapy
import os
import hashlib


class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    start_urls = [
        'https://books.toscrape.com/index.html',
    ]

    def parse(self, response):
        # Extracting the URL to use as a filename
        url = response.url
        # Using hashlib to create a unique hash for the URL to ensure the filename is filesystem safe
        url_hash = hashlib.md5(url.encode()).hexdigest()
        # Create a safe filename from the URL
        safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
        filename = f'{safe_url}_{url_hash}.html'

        # Ensuring the filename length does not exceed filesystem limits
        filename = (filename[:245] + '..html') if len(filename) > 250 else filename

        # Creating a directory to save files if it doesn't exist
        os.makedirs('saved_pages', exist_ok=True)
        file_path = os.path.join('saved_pages', filename)

        # Writing the response body to the file
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {file_path}')

        # Following links to the next pages
        next_pages = response.css('a::attr(href)').getall()
        for next_page in next_pages:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
This scraper script does the following:

- Saves every crawled page to the saved_pages directory. The filename includes a sanitized version of the URL plus an MD5 hash, so it stays unique and filesystem safe.
- Extracts the href attribute of every <a> element on the page and sends the extracted links back to the crawl queue as new requests.

This basic Python Scrapy script saves the content of the website's pages to files; you can run it directly, as shown below.
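Assuming you save the spider above as page_saver.py (the filename is just an illustration), you can run it without creating a full Scrapy project:

scrapy runspider page_saver.py

The saved HTML files will appear in the saved_pages directory next to the script.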
If you run the script above against a website, every new page is crawled immediately after the previous one. This can put unwanted load on the server, which may lead to downtime or to your crawler being blocked. Website owners (assuming the site is not your own, of course) can make a crawler's life harder by installing bot protection, adding CAPTCHAs, banning IPs, and so on.
To crawl websites respectfully, you can add a delay between requests using the DOWNLOAD_DELAY custom setting:
custom_settings = {
    'DOWNLOAD_DELAY': 1,
}
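DOWNLOAD_DELAY is the simplest option. If you want the crawler to adapt to the server's response times automatically, Scrapy also ships with an AutoThrottle extension and a robots.txt middleware. The settings below are a minimal sketch; the values are illustrative rather than recommendations:

custom_settings = {
    'ROBOTSTXT_OBEY': True,               # respect the site's robots.txt rules
    'AUTOTHROTTLE_ENABLED': True,         # adjust delays based on server load
    'AUTOTHROTTLE_START_DELAY': 1,        # initial delay in seconds
    'AUTOTHROTTLE_MAX_DELAY': 10,         # upper bound for the delay
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # limit parallel requests to one site
}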
By default, Scrapy doesn’t render JavaScript. This is a significant limitation, since more and more websites nowadays use JavaScript to render their pages.
Fortunately, you can add JavaScript rendering with Splash (https://splash.readthedocs.io/), a lightweight JavaScript rendering service.
Run Splash using Docker first:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
Next, install the scrapy-splash package:
pip install scrapy-splash
Add these settings to your scraper:
# Enable the Splash middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable the Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Enable the Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Splash server URL
SPLASH_URL = 'http://localhost:8050'
These settings can also live in the spider's custom_settings, as in the updated script below. Note that the settings alone do not route traffic through Splash; requests have to be issued as SplashRequest objects:

import scrapy
import os
import hashlib

from scrapy_splash import SplashRequest


class PageSaverSpider(scrapy.Spider):
    name = "page_saver"
    start_urls = [
        'https://books.toscrape.com/index.html',
    ]
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # Adding a delay of 1 second between requests
        'BOT_NAME': 'myproject',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
        'SPLASH_URL': 'http://localhost:8050',
    }

    def start_requests(self):
        # Sending the start URLs through Splash so JavaScript gets executed
        for url in self.start_urls:
            # 'wait' gives the page time to render; the value is illustrative
            yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        # Extracting the URL to use as a filename
        url = response.url
        # Using hashlib to create a unique hash for the URL to ensure the filename is filesystem safe
        url_hash = hashlib.md5(url.encode()).hexdigest()
        # Create a safe filename from the URL
        safe_url = url.replace('://', '_').replace('/', '_').replace(':', '_')
        filename = f'{safe_url}_{url_hash}.html'

        # Ensuring the filename length does not exceed filesystem limits
        filename = (filename[:245] + '..html') if len(filename) > 250 else filename

        # Creating a directory to save files if it doesn't exist
        os.makedirs('saved_pages', exist_ok=True)
        file_path = os.path.join('saved_pages', filename)

        # Writing the response body to the file
        with open(file_path, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {file_path}')

        # Following links to the next pages, also rendered through Splash
        next_pages = response.css('a::attr(href)').getall()
        for next_page in next_pages:
            next_page_url = response.urljoin(next_page)
            yield SplashRequest(next_page_url, callback=self.parse, args={'wait': 0.5})
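Before running the spider, you can check that the Splash service is reachable by rendering a page through its HTTP API (the URL parameter here is just an example):

curl 'http://localhost:8050/render.html?url=https://books.toscrape.com/index.html&wait=0.5'

If that returns HTML, Splash is up, and you can start the spider the same way as before, for example with scrapy runspider.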
Scrapy is a powerful Python crawling and scraping framework, and it works great if you need to crawl or scrape a website. However, you have to be familiar with the Python programming language and manage the infrastructure yourself.
If you don’t have time for that and simply want to make an HTTP call and get the data, it is better to try a WebCrawler API, which handles all of this for you; note, however, that it is a paid service.