SDKs and Code Examples
Python
Learn how to use the WebCrawler API Python SDK to crawl websites and extract data.
Installation
pip install webcrawlerapi
Usage
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
from webcrawlerapi import WebCrawlerAPI
# Initialize the client
crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")
# Synchronous crawling
result = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10
)
print(f"Job completed with status: {result.status}")
print(f"Number of items crawled: {len(result.job_items)}")Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
from webcrawlerapi import WebCrawlerAPI
import time
# Initialize the client
crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")
# Start async crawl job
job = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10
)
# Get the job ID
job_id = job.id
# Check job status
job_status = crawler.get_job(job_id)
# Poll until job is complete
while job_status.status == 'in_progress':
    time.sleep(job_status.recommended_pull_delay_ms / 1000)  # Convert ms to seconds
    job_status = crawler.get_job(job_id)
# Process results
if job_status.status == 'done':
    for item in job_status.job_items:
        print(f"Page title: {item.title}")
        print(f"Original URL: {item.original_url}")
        print(f"Markdown content URL: {item.markdown_content_url}")
Available Parameters
Both crawling methods support these parameters:
url (required): The target URL to crawl
scrape_type: Type of content to extract ('markdown', 'html', 'cleaned')
items_limit: Maximum number of pages to crawl (default: 10)
allow_subdomains: Whether to crawl subdomains (default: False)
whitelist_regexp: Regular expression for allowed URLs
blacklist_regexp: Regular expression for blocked URLs
webhook_url: URL to receive notifications when the job completes
max_polls: Maximum number of status checks (sync only, default: 100)
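For example, these parameters can be combined to restrict a crawl to part of a site and receive a notification when it finishes. The following is a minimal sketch; the target URL, regular expression, and webhook endpoint are placeholders, not values required by the API.

from webcrawlerapi import WebCrawlerAPI

crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")

# Crawl only pages whose URLs match the whitelist pattern, stay on the main
# domain, and notify a webhook when the job completes.
# The URL, regexp, and webhook endpoint below are illustrative placeholders.
job = crawler.crawl_async(
    url="https://example.com/docs/",
    scrape_type="markdown",
    items_limit=50,
    allow_subdomains=False,
    whitelist_regexp=".*/docs/.*",
    webhook_url="https://your-app.example.com/webhooks/crawl-finished"
)
print(f"Started job {job.id}")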
Response Objects
Job Object
job.id # Unique job identifier
job.status # Job status (new, in_progress, done, error)
job.url # Original crawl URL
job.created_at # Job creation timestamp
job.finished_at # Job completion timestamp
job.job_items # List of crawled items
job.recommended_pull_delay_ms # Recommended delay between status checks
JobItem Object
item.id # Unique item identifier
item.original_url # URL of the crawled page
item.title # Page title
item.status # Item status
item.page_status_code # HTTP status code
item.markdown_content_url # URL to markdown content (if applicable)
item.raw_content_url # URL to raw content
item.cleaned_content_url # URL to cleaned content
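The content URLs point to the stored page content rather than embedding it in the response. A common pattern is to download them with an HTTP client once the job is done. The sketch below assumes the URLs are directly fetchable with a standard GET request and that a 'done' item status indicates a successfully crawled page.

import requests

for item in job_status.job_items:
    # Fetch markdown content for successfully crawled items.
    # Assumes 'done' marks a successful item and the content URL is publicly fetchable.
    if item.status == 'done' and item.markdown_content_url:
        response = requests.get(item.markdown_content_url)
        response.raise_for_status()
        print(f"--- {item.title} ({item.original_url}) ---")
        print(response.text[:200])  # Preview the first 200 characters of markdown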