Want to scrape website data using PHP? Here’s a quick guide to get started. PHP offers two main paths for web crawling:
- Frameworks like Goutte and Spatie/Crawler:
  - Goutte: Simple and great for static websites.
  - Spatie/Crawler: Handles dynamic, JavaScript-heavy sites with advanced features.
- WebCrawlerAPI: A cloud-based service for effortless, scalable web scraping without manual setup.
Quick Comparison:
| Feature | Goutte | Spatie/Crawler | WebCrawlerAPI |
|---|---|---|---|
| JavaScript Support | No | Yes | Yes |
| Scalability | Manual setup | Manual setup | Automatic (cloud) |
| Ease of Use | Simple | Moderate | Very simple |
| Best For | Static websites | Dynamic websites | Large-scale projects |
Whether you need full control through frameworks or a hassle-free API solution, PHP has you covered. Let’s dive into the details!
Selecting a PHP Crawler Framework
When working on web crawling projects with PHP, Goutte and Spatie/Crawler are two popular options. Each has its own strengths, making them suitable for different types of tasks.
Goutte
Goutte is built on Symfony components and is well suited for crawling static websites. Its object-oriented design and efficient handling of HTML/XML make it a great choice for straightforward data extraction, and beginners often find its intuitive DOM crawler easy to use [1].
Here’s a quick example of how Goutte works:
```php
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.example.com');

// Print the text of every <h1> element on the page
$crawler->filter('h1')->each(function ($node) {
    echo $node->text();
});
```
Spatie/Crawler
Spatie/Crawler is better equipped for modern web applications, especially those with dynamic, client-side rendering. Pairing it with tools like Puppeteer allows it to handle JavaScript-heavy websites effectively [3].
Some of its standout features include:
- Asynchronous crawling
- Compliance with robots.txt
- The ability to index PDFs [3]
| Feature | Goutte | Spatie/Crawler |
|---|---|---|
| JavaScript Support | No | Yes |
| Memory Efficiency | Moderate | High |
| Ease of Use | Simple | Moderate |
| Best For | Static websites | Dynamic websites |
| Error Handling | Basic | Advanced |
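To make the comparison concrete, here is a minimal Spatie/Crawler sketch. The observer class name and the specific crawl settings are illustrative choices, not prescribed by the library; check the Spatie/Crawler documentation for the exact method signatures in your installed version.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Psr\Http\Message\UriInterface;
use Psr\Http\Message\ResponseInterface;
use GuzzleHttp\Exception\RequestException;

// Hypothetical observer that logs each crawled URL
class LoggingObserver extends CrawlObserver
{
    public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        echo "Crawled: {$url}\n";
    }

    public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void
    {
        echo "Failed: {$url}\n";
    }
}

Crawler::create()
    ->setCrawlObserver(new LoggingObserver())
    ->setConcurrency(10)      // asynchronous, concurrent requests
    ->respectRobots()         // honor robots.txt directives
    ->setMaximumDepth(2)
    ->startCrawling('https://example.com');
```

The observer pattern is what separates Spatie/Crawler from Goutte's one-page-at-a-time model: the crawler discovers and queues links itself, and your code only reacts to each result.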
Choosing the Right Framework
The best framework depends on your project's needs. If you’re working with static websites and simple structures, Goutte is a solid option. On the other hand, for dynamic sites or projects requiring features like asynchronous crawling, Spatie/Crawler is the better fit [1][3].
Next, we’ll dive into how to set up and use Goutte for website crawling.
Guide to Using Goutte
Setup
Start by installing Goutte through Composer:
```bash
composer require fabpot/goutte
```
Next, include the library in your project:
```php
require_once 'vendor/autoload.php';

use Goutte\Client;
```
Crawling a Website
Here's how you can crawl a single webpage:
```php
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com');

// Check if the response is successful
if ($client->getResponse()->getStatusCode() === 200) {
    $pageTitle = $crawler->filter('title')->text();
    echo "Page Title: " . $pageTitle;
}
```
To handle multiple pages, you can use the following approach:
```php
$baseUrl = 'https://books.toscrape.com/catalogue/page-';
$page = 1;

try {
    do {
        $crawler = $client->request('GET', $baseUrl . $page . '.html');
        // Process the data here
        $page++;
    } while ($crawler->filter('.next > a')->count() > 0);
} catch (\Exception $e) {
    echo "Error: " . $e->getMessage();
}
```
Data Extraction
To extract links, prices, or table data, use these examples:
```php
// Extract links and their text
$links = $crawler->filter('a')->each(function ($node) {
    return [
        'text' => $node->text(),
        'href' => $node->attr('href'),
    ];
});

// Extract prices
$prices = $crawler->filter('.price_color')->each(function ($node) {
    return $node->text();
});

// Extract table data
$tableData = $crawler->filter('table tr')->each(function ($row) {
    return $row->filter('td')->each(function ($cell) {
        return $cell->text();
    });
});
```
Below are some common CSS selectors and their purposes:
| Selector | Purpose |
|---|---|
| .class-name | Selects elements by class name |
| a | Retrieves all hyperlinks |
| table tr td | Extracts content from table cells |
| #id-name | Targets elements with a specific ID |
| .parent .child | Finds nested elements within a parent container |
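Extracted data usually needs to be stored somewhere. As a small, self-contained sketch using PHP's built-in `fputcsv()` (the sample prices are placeholders standing in for the `$prices` array extracted above), scraped values could be written to a CSV file like this:

```php
<?php
// Placeholder data standing in for scraped prices
$prices = ['£51.77', '£53.74', '£50.10'];

$fh = fopen('prices.csv', 'w');
fputcsv($fh, ['price']);          // header row
foreach ($prices as $price) {
    fputcsv($fh, [$price]);       // one row per extracted value
}
fclose($fh);
```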
With Goutte, you can efficiently scrape and extract data from websites. But if you're looking for another option, you might want to explore WebCrawlerAPI.
Alternative: WebCrawlerAPI
If you're looking for a simpler, managed option for web crawling, WebCrawlerAPI might be the solution. Unlike tools like Goutte or Spatie/Crawler that require custom setups, WebCrawlerAPI is a cloud-based service designed to let developers focus on their core tasks without worrying about infrastructure.
How to Use WebCrawlerAPI
Here’s a quick example of how you can use WebCrawlerAPI in your PHP project:
Installation
```bash
composer require webcrawlerapi/sdk
```
Usage
```php
use WebCrawlerAPI\WebCrawlerAPI;

// Initialize the client
$crawler = new WebCrawlerAPI('your_api_key');

// Synchronous crawling (blocks until completion)
$job = $crawler->crawl(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10,
);
```
Why Choose WebCrawlerAPI?
WebCrawlerAPI brings several advantages to the table for developers:
- Simple Integration: You can get started with just a few lines of code.
- Cloud Scalability: Handles large-scale crawling without manual effort.
- Anti-Bot Features: Includes tools like CAPTCHA bypassing.
- Multiple Output Formats: Supports Markdown, text, and raw HTML.
- Pay-As-You-Go Pricing: You only pay for the resources you use.
Comparing WebCrawlerAPI to Traditional PHP Frameworks
| Feature | WebCrawlerAPI | Traditional PHP Frameworks |
|---|---|---|
| Scalability | Automatically scales in the cloud | Requires manual configuration |
| Data Formats | HTML, Text, Markdown | Limited to framework capabilities |
| Maintenance | Fully managed service | Requires ongoing self-maintenance |
This makes WebCrawlerAPI a great choice for developers who want a hassle-free, scalable solution for web crawling.
Summary
Web crawling with PHP can be done using either traditional frameworks or modern API-based solutions. Tools like Goutte and Spatie/Crawler provide detailed control over the crawling process, while services like WebCrawlerAPI offer a managed, scalable option with minimal configuration.
Using PHP frameworks gives you more control over how the crawling works but requires manual setup and ongoing maintenance. On the other hand, WebCrawlerAPI handles challenges like anti-bot measures and scalability for you, making it a great choice for larger or more resource-intensive projects.
Frameworks like Goutte and Spatie/Crawler work well for smaller projects or when custom crawling behavior is essential [1][3]. If you're looking for a simpler solution that minimizes effort, WebCrawlerAPI takes care of the heavy lifting.
The right choice depends on your project's needs, including its size, technical complexity, and how much control you want over the crawling process.
Next, let’s dive into some common questions about using PHP for web crawling.
FAQs
Is PHP good for web scraping?
PHP is a practical choice for web scraping, thanks to its array of libraries and tools that simplify the process of extracting and processing web data [1][2].
With features like HTML parsing, HTTP request handling, and effective error management, PHP makes web scraping tasks smoother. Its design, tailored for web development, also aids in handling sessions and text encoding, making it well-suited for these projects [1][3].
It's important to follow ethical scraping practices, such as respecting robots.txt guidelines and including request delays. Whether using frameworks or APIs, PHP offers a dependable solution for web scraping tasks [4].
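As a minimal sketch of the request-delay practice mentioned above (the one-second pause and the example URLs are arbitrary choices, not recommendations from any specific site's robots.txt), a Goutte crawl loop can simply sleep between requests:

```php
use Goutte\Client;

$client = new Client();
$urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
];

foreach ($urls as $url) {
    $crawler = $client->request('GET', $url);
    echo $crawler->filter('title')->text() . "\n";
    sleep(1); // polite delay between requests to avoid overloading the server
}
```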