PHP WebCrawler API SDK
Obtain an API Key
To use the WebCrawler API, you need an API key. You can obtain one by signing up for a free account and creating a new project.
Installation
composer require webcrawlerapi/webcrawlerapi-php-sdk
Requirements
- PHP 8.0 or higher
- Composer
- ext-json PHP extension
- Guzzle HTTP Client 7.0 or higher
Usage
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
<?php
require_once('vendor/autoload.php');
use WebCrawlerAPI\WebCrawlerAPI;
// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');
// Synchronous crawling
$job = $crawler->crawl([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);

echo "Job completed with status: {$job->status}\n";

// Access job items and their content
foreach ($job->jobItems as $item) {
    echo "Page title: {$item->title}\n";
    echo "Original URL: {$item->originalUrl}\n";

    // Get the content based on job's scrape_type
    $content = $item->getContent();
    if ($content) {
        echo "Content preview: " . substr($content, 0, 200) . "...\n";
    }
}
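Network failures and API errors generally surface as exceptions from the underlying Guzzle client. The exact exception classes the SDK throws are not documented here, so the sketch below simply catches the generic Guzzle exception interface and falls back to Throwable:

<?php
require_once('vendor/autoload.php');
use WebCrawlerAPI\WebCrawlerAPI;
use GuzzleHttp\Exception\GuzzleException;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

try {
    $job = $crawler->crawl([
        'url' => 'https://example.com',
        'scrape_type' => 'markdown',
        'items_limit' => 10
    ]);
    echo "Job completed with status: {$job->status}\n";
} catch (GuzzleException $e) {
    // Transport-level or HTTP error from the underlying Guzzle client
    echo "Request failed: {$e->getMessage()}\n";
} catch (\Throwable $e) {
    // Any other error raised by the SDK
    echo "Crawl failed: {$e->getMessage()}\n";
}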
Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
<?php
require_once('vendor/autoload.php');
use WebCrawlerAPI\WebCrawlerAPI;
// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');
// Start async crawl job
$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'markdown',
    'items_limit' => 10
]);
// Get the job ID
$jobId = $response->id;
echo "Crawling job started with ID: {$jobId}\n";
// Check job status
$job = $crawler->getJob($jobId);
echo "Job status: {$job->status}\n";
// Poll until the job completes (jobs start as 'new' before moving to 'in_progress')
while (in_array($job->status, ['new', 'in_progress'], true)) {
    usleep($job->recommendedPullDelayMs * 1000); // recommendedPullDelayMs is in milliseconds
    $job = $crawler->getJob($jobId);
}
// Process results
if ($job->status === 'done') {
    foreach ($job->jobItems as $item) {
        echo "Page title: {$item->title}\n";
        echo "Original URL: {$item->originalUrl}\n";
    }
}
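The loop above runs until the job leaves its in-progress states. In production you will usually want to cap the number of status checks and handle the error status explicitly. Continuing from the example above, a minimal sketch (the 60-attempt cap is an arbitrary illustration, not an SDK setting):

$maxAttempts = 60; // illustrative cap, not an SDK constant
$attempts = 0;

$job = $crawler->getJob($jobId);
while (in_array($job->status, ['new', 'in_progress'], true) && $attempts < $maxAttempts) {
    usleep($job->recommendedPullDelayMs * 1000); // delay is given in milliseconds
    $job = $crawler->getJob($jobId);
    $attempts++;
}

if ($job->status === 'error') {
    echo "Crawl failed for job {$jobId}\n";
} elseif ($job->status !== 'done') {
    echo "Gave up waiting after {$attempts} checks\n";
}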
Available Parameters
Both crawling methods support these parameters (see the example after this list):
- url (required): The target URL to crawl
- scrape_type: Type of content to extract ('markdown', 'html', 'cleaned')
- items_limit: Maximum number of pages to crawl (default: 10)
- allow_subdomains: Whether to crawl subdomains (default: false)
- whitelist_regexp: Regular expression for allowed URLs
- blacklist_regexp: Regular expression for blocked URLs
- webhook_url: URL to receive notifications when the job completes
- max_polls: Maximum number of status checks (sync only, default: 100)
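As a rough illustration of how these parameters combine (the URL, regular expressions, limits, and webhook endpoint below are placeholders, not recommended values):

$response = $crawler->crawlAsync([
    'url' => 'https://example.com',
    'scrape_type' => 'cleaned',
    'items_limit' => 50,
    'allow_subdomains' => true,
    // Only crawl blog pages, skip anything under /tag/
    'whitelist_regexp' => '.*/blog/.*',
    'blacklist_regexp' => '.*/tag/.*',
    // Get notified when the job finishes instead of polling
    'webhook_url' => 'https://your-app.example.com/webhooks/crawl-finished'
]);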
Response Objects
Job Object
$job->id // Unique job identifier
$job->status // Job status (new, in_progress, done, error)
$job->url // Original crawl URL
$job->createdAt // Job creation timestamp
$job->finishedAt // Job completion timestamp
$job->jobItems // Array of crawled items
$job->recommendedPullDelayMs // Recommended delay between status checks
JobItem Object
$item->id // Unique item identifier
$item->originalUrl // URL of the crawled page
$item->title // Page title
$item->status // Item status
$item->pageStatusCode // HTTP status code
$item->markdownContentUrl // URL to markdown content (if applicable)
$item->rawContentUrl // URL to raw content
$item->cleanedContentUrl // URL to cleaned content
$item->getContent() // Method to get content based on scrape_type
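getContent() is the simplest way to read an item's content, but the *ContentUrl fields can also be fetched directly, for example when you store job results and download the content later. A sketch using the Guzzle client the SDK already depends on; it assumes the content URLs are fetchable without extra authentication, and the file naming is illustrative:

<?php
require_once('vendor/autoload.php');
use GuzzleHttp\Client;

// $job is a completed Job object from one of the examples above
$http = new Client();
foreach ($job->jobItems as $item) {
    // Pick the URL that matches the scrape_type used for the job
    $contentUrl = $item->markdownContentUrl;
    if ($contentUrl) {
        $markdown = (string) $http->get($contentUrl)->getBody();
        file_put_contents("crawl-{$item->id}.md", $markdown);
    }
}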