PHP WebCrawler API SDK
Obtain an API Key
To use the WebCrawler API, you need an API key. You can obtain one by signing up for a free account and creating a new project.
Installation
composer require webcrawlerapi/webcrawlerapi-php-sdk
Requirements
- PHP 8.0 or higher
- Composer
- ext-json PHP extension
- Guzzle HTTP Client 7.0 or higher
Usage
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
<?php
require_once('vendor/autoload.php');
use WebCrawlerAPI\WebCrawlerAPI;
// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');
// Synchronous crawling
$job = $crawler->crawl([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);

echo "Job completed with status: {$job->status}\n";

// Access job items and their content
foreach ($job->jobItems as $item) {
    echo "Page title: {$item->title}\n";
    echo "Original URL: {$item->originalUrl}\n";

    // Get the content based on job's scrape_type
    $content = $item->getContent();
    if ($content) {
        echo "Content preview: " . substr($content, 0, 200) . "...\n";
    }
}
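Network failures and API errors generally surface as exceptions from the underlying Guzzle client. The exact exception classes the SDK throws are not documented here, so the sketch below simply catches the generic Guzzle exception interface and falls back to Throwable:

<?php
require_once('vendor/autoload.php');
use WebCrawlerAPI\WebCrawlerAPI;
use GuzzleHttp\Exception\GuzzleException;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

try {
    $job = $crawler->crawl([
        'url' => 'https://example.com',
        'scrape_type' => 'markdown',
        'items_limit' => 10
    ]);
    echo "Job completed with status: {$job->status}\n";
} catch (GuzzleException $e) {
    // Transport-level or HTTP error from the underlying Guzzle client
    echo "Request failed: {$e->getMessage()}\n";
} catch (\Throwable $e) {
    // Any other error raised by the SDK
    echo "Crawl failed: {$e->getMessage()}\n";
}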
Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
<?php
require_once('vendor/autoload.php');
use WebCrawlerAPI\WebCrawlerAPI;
// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');
// Start async crawl job
$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);
// Get the job ID
$jobId = $response->id;
echo "Crawling job started with ID: {$jobId}\n";
// Check job status
$job = $crawler->getJob($jobId);
echo "Job status: {$job->status}\n";
// Poll until the job completes (jobs start as 'new' before moving to 'in_progress')
while (in_array($job->status, ['new', 'in_progress'], true)) {
    usleep($job->recommendedPullDelayMs * 1000); // recommendedPullDelayMs is in milliseconds
    $job = $crawler->getJob($jobId);
}
// Process results
if ($job->status === 'done') {
    foreach ($job->jobItems as $item) {
        echo "Page title: {$item->title}\n";
        echo "Original URL: {$item->originalUrl}\n";
    }
}
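The loop above runs until the job leaves its in-progress states. In production you will usually want to cap the number of status checks and handle the error status explicitly. Continuing from the example above, a minimal sketch (the 60-attempt cap is an arbitrary illustration, not an SDK setting):

$maxAttempts = 60; // illustrative cap, not an SDK constant
$attempts = 0;

$job = $crawler->getJob($jobId);
while (in_array($job->status, ['new', 'in_progress'], true) && $attempts < $maxAttempts) {
    usleep($job->recommendedPullDelayMs * 1000); // delay is given in milliseconds
    $job = $crawler->getJob($jobId);
    $attempts++;
}

if ($job->status === 'error') {
    echo "Crawl failed for job {$jobId}\n";
} elseif ($job->status !== 'done') {
    echo "Gave up waiting after {$attempts} checks\n";
}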
Available Parameters
Both crawling methods support these parameters (see the example after this list):
- url (required): The target URL to crawl
- scrape_type: Type of content to extract ('markdown', 'html', 'cleaned')
- items_limit: Maximum number of pages to crawl (default: 10)
- allow_subdomains: Whether to crawl subdomains (default: false)
- whitelist_regexp: Regular expression for allowed URLs
- blacklist_regexp: Regular expression for blocked URLs
- webhook_url: URL to receive notifications when the job completes
- max_polls: Maximum number of status checks (sync only, default: 100)
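As a rough illustration of how these parameters combine (the URL, regular expressions, limits, and webhook endpoint below are placeholders, not recommended values):

$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'cleaned',
    'items_limit' => 50,
    'allow_subdomains' => true,
    // Only crawl blog pages, skip anything under /tag/
    'whitelist_regexp' => '.*/blog/.*',
    'blacklist_regexp' => '.*/tag/.*',
    // Get notified when the job finishes instead of polling
    'webhook_url' => 'https://your-app.example.com/webhooks/crawl-finished'
]);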
Response Objects
Job Object
$job->id // Unique job identifier
$job->status // Job status (new, in_progress, done, error)
$job->url // Original crawl URL
$job->createdAt // Job creation timestamp
$job->finishedAt // Job completion timestamp
$job->jobItems // Array of crawled items
$job->recommendedPullDelayMs // Recommended delay between status checks
JobItem Object
$item->id // Unique item identifier
$item->originalUrl // URL of the crawled page
$item->title // Page title
$item->status // Item status
$item->pageStatusCode // HTTP status code
$item->markdownContentUrl // URL to markdown content (if applicable)
$item->rawContentUrl // URL to raw content
$item->cleanedContentUrl // URL to cleaned content
$item->getContent() // Method to get content based on scrape_type
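getContent() is the simplest way to read an item's content, but the *ContentUrl fields can also be fetched directly, for example when you store job results and download the content later. A sketch using the Guzzle client the SDK already depends on; it assumes the content URLs are fetchable without extra authentication, and the file naming is illustrative:

<?php
require_once('vendor/autoload.php');
use GuzzleHttp\Client;

// $job is a completed Job object from one of the examples above
$http = new Client();
foreach ($job->jobItems as $item) {
    // Pick the URL that matches the scrape_type used for the job
    $contentUrl = $item->markdownContentUrl;
    if ($contentUrl) {
        $markdown = (string) $http->get($contentUrl)->getBody();
        file_put_contents("crawl-{$item->id}.md", $markdown);
    }
}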