PHP WebCrawler API SDK
Obtain an API Key
To use the WebCrawler API, you need to obtain an API key. You can do this by signing up for a free account and then creating a new project.
Installation
composer require webcrawlerapi/sdk
Requirements
- PHP 8.1 or higher
- Composer
- ext-json PHP extension
- Guzzle HTTP Client 7.0 or higher
Usage
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Synchronous crawling
$job = $crawler->crawl(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10
);

echo "Job completed with status: {$job->status}\n";

// Access job items and their content
foreach ($job->jobItems as $item) {
    echo "Page title: {$item->title}\n";
    echo "Original URL: {$item->originalUrl}\n";

    // Get the content based on the job's scrape_type.
    // Returns null if the item is not in "done" status.
    $content = $item->getContent();
    if ($content) {
        echo "Content preview: " . substr($content, 0, 200) . "...\n";
    } else {
        echo "Content not available (item not done)\n";
    }
}
Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Start an async crawl job
$response = $crawler->crawlAsync(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10
);

// Get the job ID
$jobId = $response->id;
echo "Crawling job started with ID: {$jobId}\n";

// Check job status
$job = $crawler->getJob($jobId);
echo "Job status: {$job->status}\n";

// Poll until the job leaves the "new" and "in_progress" states
while (in_array($job->status, ['new', 'in_progress'], true)) {
    // usleep() takes microseconds; sleep() takes whole seconds and would truncate sub-second delays
    usleep($job->recommendedPullDelayMs * 1000);
    $job = $crawler->getJob($jobId);
}

// Process results
if ($job->status === 'done') {
    foreach ($job->jobItems as $item) {
        echo "Page title: {$item->title}\n";
        echo "Original URL: {$item->originalUrl}\n";
    }
}
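The requirements above include Guzzle, so transport and HTTP failures will typically surface as Guzzle exceptions. Whether the SDK wraps them in its own exception type is not documented here, so treat this error-handling pattern as an assumption-laden sketch:

<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;
use GuzzleHttp\Exception\GuzzleException;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

try {
    $response = $crawler->crawlAsync(url: 'https://example.com');
    echo "Crawling job started with ID: {$response->id}\n";
} catch (GuzzleException $e) {
    // Assumption: the SDK lets Guzzle's exceptions propagate unchanged
    echo "Request failed: {$e->getMessage()}\n";
}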
Available Parameters
Both crawling methods support these parameters:
- url (required): The target URL to crawl
- scrapeType: Type of content to extract ('markdown', 'html', 'cleaned')
- itemsLimit: Maximum number of pages to crawl (default: 10)
- allowSubdomains: Whether to crawl subdomains (default: false)
- whitelistRegexp: Regular expression for allowed URLs
- blacklistRegexp: Regular expression for blocked URLs
- webhookUrl: URL to receive notifications when the job completes
- maxPolls: Maximum number of status checks (sync only, default: 100)
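As a sketch, several of these options can be combined in one call using the same named-argument style as the examples above. The method and parameter names come from this list; the regular expressions and webhook URL below are placeholders, not required values:

<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Crawl only blog URLs, include subdomains, and notify a webhook on completion
$response = $crawler->crawlAsync(
    url: 'https://example.com',
    scrapeType: 'cleaned',
    itemsLimit: 50,
    allowSubdomains: true,
    whitelistRegexp: '.*/blog/.*',     // placeholder: only matching URLs are crawled
    blacklistRegexp: '.*/blog/tag/.*', // placeholder: matching URLs are skipped
    webhookUrl: 'https://example.com/crawl-webhook' // placeholder notification endpoint
);

echo "Job started: {$response->id}\n";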
Response Objects
Job Object
$job->id                     // Unique job identifier
$job->status                 // Job status (new, in_progress, done, error)
$job->url                    // Original crawl URL
$job->createdAt              // Job creation timestamp
$job->finishedAt             // Job completion timestamp
$job->jobItems               // Array of crawled items
$job->recommendedPullDelayMs // Recommended delay between status checks, in milliseconds
JobItem Object
$item->id                 // Unique item identifier
$item->originalUrl        // URL of the crawled page
$item->title              // Page title
$item->status             // Item status
$item->pageStatusCode     // HTTP status code of the crawled page
$item->markdownContentUrl // URL to markdown content (if applicable)
$item->rawContentUrl      // URL to raw content
$item->cleanedContentUrl  // URL to cleaned content
$item->getContent()       // Returns content matching the job's scrape_type, or null if the item is not "done"
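Putting the pieces together, here is a minimal sketch that saves each finished item's content to disk. getContent() and the fields used are documented above; the one-file-per-item naming scheme is illustrative:

<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

$job = $crawler->crawl(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10
);

foreach ($job->jobItems as $item) {
    $content = $item->getContent(); // null unless the item's status is "done"
    if ($content === null) {
        continue;
    }

    // Illustrative naming scheme: one markdown file per item ID
    file_put_contents("{$item->id}.md", $content);
    echo "Saved {$item->originalUrl} (HTTP {$item->pageStatusCode})\n";
}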