SDKs and Code Examples
PHP
Learn how to use the WebCrawler API PHP SDK to crawl websites and extract data.
Obtain an API Key
To use the WebCrawler API, you need to obtain an API key. You can do this by signing up for a free account.
Installation
composer require webcrawlerapi/sdk
Requirements
- PHP 8.1 or higher
- Composer
- ext-json PHP extension
- Guzzle HTTP Client 7.0 or higher
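If you prefer declaring the dependency in composer.json instead of running the command above, the entry would look roughly like this (the version constraint is an assumption; check the package's Packagist page for the current release):
{
    "require": {
        "php": ">=8.1",
        "webcrawlerapi/sdk": "^1.0"
    }
}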
Usage
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Synchronous crawling
$job = $crawler->crawl(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10
);

echo "Job completed with status: {$job->status}\n";

// Access job items and their content
foreach ($job->jobItems as $item) {
    echo "Page title: {$item->title}\n";
    echo "Original URL: {$item->originalUrl}\n";

    // Get the content based on job's scrape_type
    // Returns null if item is not in "done" status
    $content = $item->getContent();
    if ($content) {
        echo "Content preview: " . substr($content, 0, 200) . "...\n";
    } else {
        echo "Content not available (item not done)\n";
    }
}
Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

// Initialize the client
$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Start async crawl job
$response = $crawler->crawlAsync(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 10
);

// Get the job ID
$jobId = $response->id;
echo "Crawling job started with ID: {$jobId}\n";

// Check job status
$job = $crawler->getJob($jobId);
echo "Job status: {$job->status}\n";

// Poll until the job leaves the "new"/"in_progress" states
while (in_array($job->status, ['new', 'in_progress'], true)) {
    // recommendedPullDelayMs is in milliseconds; usleep expects microseconds
    usleep($job->recommendedPullDelayMs * 1000);
    $job = $crawler->getJob($jobId);
}

// Process results
if ($job->status === 'done') {
    foreach ($job->jobItems as $item) {
        echo "Page title: {$item->title}\n";
        echo "Original URL: {$item->originalUrl}\n";
    }
}
Available Parameters
Both crawling methods support these parameters:
- url (required): The target URL to crawl
- scrapeType: Type of content to extract ('markdown', 'html', 'cleaned')
- itemsLimit: Maximum number of pages to crawl (default: 10)
- allowSubdomains: Whether to crawl subdomains (default: false)
- whitelistRegexp: Regular expression for allowed URLs
- blacklistRegexp: Regular expression for blocked URLs
- webhookUrl: URL to receive notifications when the job completes
- maxPolls: Maximum number of status checks (sync only, default: 100)
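As a sketch of how these options combine, the call below starts an asynchronous crawl restricted to a blog section. The parameter names come from the list above; the regular expressions, limits, and webhook URL are placeholder values, not recommendations:
<?php

require_once('vendor/autoload.php');

use WebCrawlerAPI\WebCrawlerAPI;

$crawler = new WebCrawlerAPI('YOUR_API_KEY');

// Crawl only /blog/ pages, skip /tag/ pages, and notify a webhook when done
$response = $crawler->crawlAsync(
    url: 'https://example.com',
    scrapeType: 'markdown',
    itemsLimit: 50,
    allowSubdomains: false,
    whitelistRegexp: '.*/blog/.*',
    blacklistRegexp: '.*/tag/.*',
    webhookUrl: 'https://your-app.example.com/webhooks/crawl-finished'
);

echo "Started job {$response->id}\n";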
Response Objects
Job Object
$job->id // Unique job identifier
$job->status // Job status (new, in_progress, done, error)
$job->url // Original crawl URL
$job->createdAt // Job creation timestamp
$job->finishedAt // Job completion timestamp
$job->jobItems // Array of crawled items
$job->recommendedPullDelayMs // Recommended delay between status checks
JobItem Object
$item->id // Unique item identifier
$item->originalUrl // URL of the crawled page
$item->title // Page title
$item->status // Item status
$item->pageStatusCode // HTTP status code
$item->markdownContentUrl // URL to markdown content (if applicable)
$item->rawContentUrl // URL to raw content
$item->cleanedContentUrl // URL to cleaned content
$item->getContent() // Method to get content based on scrape_type (returns null if not "done")
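For illustration, here is one way to read an item's content by fetching markdownContentUrl directly with Guzzle (already a requirement of this SDK) instead of calling getContent(). This is a sketch that assumes the job was created with scrapeType 'markdown' and that the URL field is populated once the item is done:
<?php

require_once('vendor/autoload.php');

use GuzzleHttp\Client;

// $item is a JobItem taken from a finished job (see the examples above)
if ($item->status === 'done' && !empty($item->markdownContentUrl)) {
    $http = new Client();
    $markdown = (string) $http->get($item->markdownContentUrl)->getBody();
    echo "Content preview: " . substr($markdown, 0, 200) . "...\n";
}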