
PHP WebCrawler API SDK

Obtain an API Key

To use the WebCrawler API, you need an API key. You can obtain one by signing up for a free account and then creating a new project.

Installation

composer require webcrawlerapi/webcrawlerapi-php-sdk

Requirements

  • PHP 8.0 or higher
  • Composer
  • ext-json PHP extension
  • Guzzle HTTP Client 7.0 or higher

Usage

Synchronous Crawling

The synchronous method waits for the crawl to complete and returns all data at once.

<?php
require_once('vendor/autoload.php');
 
use WebCrawlerAPI\WebCrawlerAPI;
 
// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');
 
// Synchronous crawling
$job = $crawler->crawl([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);
 
echo "Job completed with status: {$job->status}\n";
 
// Access job items and their content
foreach ($job->jobItems as $item) {
    echo "Page title: {$item->title}\n";
    echo "Original URL: {$item->originalUrl}\n";
    
    // Get the content based on job's scrape_type
    $content = $item->getContent();
    if ($content) {
        echo "Content preview: " . substr($content, 0, 200) . "...\n";
    }
}
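
The SDK uses Guzzle for HTTP transport, so network failures, invalid API keys, or rate limits can surface as exceptions. The exact exception classes thrown by the SDK are not documented here, so the sketch below catches \Throwable as a conservative assumption.

<?php
require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

try {
    $job = $crawler->crawl([
        'url' => 'https://example.com',
        'scrape_type' => 'markdown',
        'items_limit' => 10
    ]);
    echo "Job completed with status: {$job->status}\n";
} catch (\Throwable $e) {
    // Invalid API keys, network failures, or rate limits end up here
    echo "Crawl failed: {$e->getMessage()}\n";
}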

Asynchronous Crawling

The asynchronous method returns a job ID immediately and allows you to check the status later.

<?php
require_once('vendor/autoload.php');
 
use WebCrawlerAPI\WebCrawlerAPI;
 
// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');
 
// Start async crawl job
$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);
 
// Get the job ID
$jobId = $response->id;
echo "Crawling job started with ID: {$jobId}\n";
 
// Check job status
$job = $crawler->getJob($jobId);
echo "Job status: {$job->status}\n";
 
// Poll until complete if needed
while ($job->status === 'in_progress') {
    usleep($job->recommendedPullDelayMs * 1000); // recommendedPullDelayMs is in milliseconds; usleep() expects microseconds
    $job = $crawler->getJob($jobId);
}
 
// Process results
if ($job->status === 'done') {
    foreach ($job->jobItems as $item) {
        echo "Page title: {$item->title}\n";
        echo "Original URL: {$item->originalUrl}\n";
    }
}
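
Polling without an upper bound can leave a script waiting indefinitely if a job never leaves the in_progress state. The sketch below repeats the async flow with a simple attempt cap; the limit of 100 attempts is an illustrative choice, not an SDK default for async jobs.

<?php
require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Start the job exactly as in the async example above
$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);

// Poll with an upper bound so the script cannot wait forever
$maxAttempts = 100;
$job = $crawler->getJob($response->id);

for ($i = 0; $i < $maxAttempts && $job->status === 'in_progress'; $i++) {
    usleep($job->recommendedPullDelayMs * 1000); // Respect the recommended delay
    $job = $crawler->getJob($response->id);
}

if ($job->status === 'done') {
    echo "Crawled " . count($job->jobItems) . " pages\n";
} elseif ($job->status === 'error') {
    echo "Job failed\n";
} else {
    echo "Gave up after {$maxAttempts} status checks\n";
}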

Available Parameters

Both crawling methods support these parameters:

  • url (required): The target URL to crawl
  • scrape_type: Type of content to extract ('markdown', 'html', 'cleaned')
  • items_limit: Maximum number of pages to crawl (default: 10)
  • allow_subdomains: Whether to crawl subdomains (default: false)
  • whitelist_regexp: Regular expression for allowed URLs
  • blacklist_regexp: Regular expression for blocked URLs
  • webhook_url: URL to receive notifications when the job completes
  • max_polls: Maximum number of status checks (sync only, default: 100)
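
As a combined illustration, the call below passes several of these parameters together. The regular expressions and webhook endpoint are placeholders; adjust them to your own site structure.

<?php
require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Crawl only blog pages, skip tag archives, and get notified when the job finishes.
// The regexp patterns and webhook URL below are illustrative placeholders.
$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'cleaned',
    'items_limit' => 50,
    'allow_subdomains' => false,
    'whitelist_regexp' => '.*/blog/.*',
    'blacklist_regexp' => '.*/tag/.*',
    'webhook_url' => 'https://your-app.example.com/webhooks/crawl'
]);

echo "Crawling job started with ID: {$response->id}\n";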

Response Objects

Job Object

$job->id                         // Unique job identifier
$job->status                     // Job status (new, in_progress, done, error)
$job->url                        // Original crawl URL
$job->createdAt                  // Job creation timestamp
$job->finishedAt                 // Job completion timestamp
$job->jobItems                   // Array of crawled items
$job->recommendedPullDelayMs     // Recommended delay between status checks

JobItem Object

$item->id                    // Unique item identifier
$item->originalUrl           // URL of the crawled page
$item->title                 // Page title
$item->status                // Item status
$item->pageStatusCode        // HTTP status code
$item->markdownContentUrl    // URL to markdown content (if applicable)
$item->rawContentUrl         // URL to raw content
$item->cleanedContentUrl     // URL to cleaned content
$item->getContent()          // Method to get content based on scrape_type
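
The content URLs point to the stored page content, and getContent() already resolves whichever URL matches the job's scrape_type. If you want to fetch one of the URLs yourself, the sketch below uses Guzzle (already a dependency of the SDK) and assumes the content URLs are directly downloadable over HTTP.

<?php
require_once('vendor/autoload.php');

use GuzzleHttp\Client;

// Assumes $item is a JobItem from a completed job with scrape_type 'markdown'.
// The markdown content URL is assumed to be a plain, directly fetchable HTTP resource.
$http = new Client();

if ($item->markdownContentUrl) {
    $response = $http->get($item->markdownContentUrl);
    $markdown = (string) $response->getBody();
    echo substr($markdown, 0, 200) . "...\n";
}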