Python WebCrawler API SDK
Installation
pip install webcrawlerapi
Usage
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
from webcrawlerapi import WebCrawlerAPI
# Initialize the client
crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")
# Synchronous crawling
result = crawler.crawl(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10
)
print(f"Job completed with status: {result.status}")
print(f"Number of items crawled: {len(result.job_items)}")
Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
from webcrawlerapi import WebCrawlerAPI
import time
# Initialize the client
crawler = WebCrawlerAPI(api_key="YOUR_API_KEY")
# Start async crawl job
job = crawler.crawl_async(
    url="https://example.com",
    scrape_type="markdown",
    items_limit=10
)
# Get the job ID
job_id = job.id
# Check job status
job_status = crawler.get_job(job_id)
# Poll until job is complete
while job_status.status == 'in_progress':
    time.sleep(job_status.recommended_pull_delay_ms / 1000)  # convert ms to seconds
    job_status = crawler.get_job(job_id)
# Process results
if job_status.status == 'done':
    for item in job_status.job_items:
        print(f"Page title: {item.title}")
        print(f"Original URL: {item.original_url}")
        print(f"Markdown content URL: {item.markdown_content_url}")
Available Parameters
Both crawling methods support these parameters (a combined example follows the list):
- url (required): The target URL to crawl
- scrape_type: Type of content to extract ('markdown', 'html', or 'cleaned')
- items_limit: Maximum number of pages to crawl (default: 10)
- allow_subdomains: Whether to crawl subdomains (default: False)
- whitelist_regexp: Regular expression for allowed URLs
- blacklist_regexp: Regular expression for blocked URLs
- webhook_url: URL to receive a notification when the job completes
- max_polls: Maximum number of status checks (synchronous method only, default: 100)
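For illustration, a sketch combining several of these options in one call; the regular expression and webhook endpoint are placeholder values, not recommendations:

result = crawler.crawl(
    url="https://example.com",
    scrape_type="cleaned",
    items_limit=50,
    allow_subdomains=False,
    whitelist_regexp=r"https://example\.com/blog/.*",  # only follow blog URLs (placeholder)
    webhook_url="https://your-server.example/crawl-done",  # placeholder endpoint
)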
Response Objects
Job Object
job.id # Unique job identifier
job.status # Job status (new, in_progress, done, error)
job.url # Original crawl URL
job.created_at # Job creation timestamp
job.finished_at # Job completion timestamp
job.job_items # List of crawled items
job.recommended_pull_delay_ms # Recommended delay between status checks
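For example, once a job is done you can report how long it took. This sketch assumes created_at and finished_at are datetime objects; if the SDK returns ISO strings, parse them with datetime.fromisoformat first:

if job_status.status == 'done' and job_status.finished_at:
    elapsed = job_status.finished_at - job_status.created_at
    print(f"Crawled {len(job_status.job_items)} pages in {elapsed}")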
JobItem Object
item.id # Unique item identifier
item.original_url # URL of the crawled page
item.title # Page title
item.status # Item status
item.page_status_code # HTTP status code
item.markdown_content_url # URL to markdown content (if applicable)
item.raw_content_url # URL to raw content
item.cleaned_content_url # URL to cleaned content
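Which content URL is populated depends on the scrape_type you requested. A small helper sketch; content_url_for is a hypothetical function of our own, and mapping 'html' to raw_content_url is an assumption based on the attribute names above:

import requests

def content_url_for(item, scrape_type):
    # Assumed mapping from scrape_type to the matching JobItem attribute.
    return {
        "markdown": item.markdown_content_url,
        "html": item.raw_content_url,
        "cleaned": item.cleaned_content_url,
    }.get(scrape_type)

first = job_status.job_items[0]
url = content_url_for(first, "markdown")
if url:
    print(requests.get(url).text[:200])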