# .NET WebCrawler API SDK

## Installation

```bash
dotnet add package WebCrawlerApi
```

## Requirements
- .NET 7.0 or higher
## Usage

### Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
```csharp
using WebCrawlerApi;
using WebCrawlerApi.Models;

// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

// Synchronous crawling
var job = await crawler.CrawlAndWaitAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

Console.WriteLine($"Job completed with status: {job.Status}");

// Access job items and their content
foreach (var item in job.JobItems)
{
    Console.WriteLine($"Page title: {item.Title}");
    Console.WriteLine($"Original URL: {item.OriginalUrl}");

    var content = await item.GetContentAsync();
    if (content != null)
    {
        Console.WriteLine($"Content preview: {content[..Math.Min(200, content.Length)]}...");
    }
}
```
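If you want to persist what the crawler returns, the string from `GetContentAsync()` can be written straight to disk. The sketch below continues the example above; the output directory and the one-file-per-item naming scheme are illustrative choices, not part of the SDK.

```csharp
using WebCrawlerApi;

var crawler = new WebCrawlerApiClient("YOUR_API_KEY");
var job = await crawler.CrawlAndWaitAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

// Write each crawled page to its own file (naming scheme is illustrative)
Directory.CreateDirectory("output");
foreach (var item in job.JobItems)
{
    var content = await item.GetContentAsync();
    if (content == null) continue;

    var path = Path.Combine("output", $"{item.Id}.md");
    await File.WriteAllTextAsync(path, content);
    Console.WriteLine($"Saved {item.OriginalUrl} -> {path}");
}
```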
### Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
```csharp
using WebCrawlerApi;
using WebCrawlerApi.Models;

// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

// Start async crawl job
var response = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

// Get the job ID
var jobId = response.Id;
Console.WriteLine($"Crawling job started with ID: {jobId}");

// Poll until the job leaves the "new"/"in_progress" states
var job = await crawler.GetJobAsync(jobId);
while (job.Status == "new" || job.Status == "in_progress")
{
    await Task.Delay(job.RecommendedPullDelayMs);
    job = await crawler.GetJobAsync(jobId);
}

// Process results
if (job.Status == "done")
{
    foreach (var item in job.JobItems)
    {
        Console.WriteLine($"Page title: {item.Title}");
        Console.WriteLine($"Original URL: {item.OriginalUrl}");
    }
}
```
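The loop above polls until the job settles. If you want to bound the wait yourself (the `maxPolls` parameter applies only to the synchronous helper), a simple attempt counter works. A minimal sketch, continuing from the example above and using only the members already shown; the limit of 100 attempts is illustrative:

```csharp
// Poll with an upper bound on the number of status checks (illustrative limit)
const int maxAttempts = 100;
var attempts = 0;

var job = await crawler.GetJobAsync(jobId);
while ((job.Status == "new" || job.Status == "in_progress") && attempts < maxAttempts)
{
    await Task.Delay(job.RecommendedPullDelayMs);
    job = await crawler.GetJobAsync(jobId);
    attempts++;
}

if (job.Status != "done")
{
    Console.WriteLine($"Job did not finish cleanly (status: {job.Status})");
}
```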
## Available Parameters
Both crawling methods support these parameters:
- `url` (required): The target URL to crawl
- `scrapeType`: Type of content to extract ('markdown', 'html', 'cleaned')
- `itemsLimit`: Maximum number of pages to crawl (default: 10)
- `allowSubdomains`: Whether to crawl subdomains (default: false)
- `whitelistRegexp`: Regular expression for allowed URLs
- `blacklistRegexp`: Regular expression for blocked URLs
- `webhookUrl`: URL to receive notifications when the job completes
- `maxPolls`: Maximum number of status checks (sync only, default: 100)
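The optional filters can be combined in a single call. The sketch below uses the client from the examples above; the values are illustrative, and it assumes the parameters are passed as optional named arguments, as in the earlier calls:

```csharp
// Crawl only blog pages, skip the tag archive, and get notified when done
// (URLs, patterns, and limits are illustrative)
var response = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "cleaned",
    itemsLimit: 50,
    allowSubdomains: false,
    whitelistRegexp: ".*/blog/.*",
    blacklistRegexp: ".*/tag/.*",
    webhookUrl: "https://your-server.com/webhooks/crawl-finished"
);
```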
## Response Objects

### Job Object
```csharp
job.Id                      // Unique job identifier
job.Status                  // Job status (new, in_progress, done, error)
job.Url                     // Original crawl URL
job.CreatedAt               // Job creation timestamp
job.FinishedAt              // Job completion timestamp
job.JobItems                // List of crawled items
job.RecommendedPullDelayMs  // Recommended delay between status checks
```
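A typical use of these fields is a post-run summary. A minimal sketch, given a `job` returned by `GetJobAsync` or `CrawlAndWaitAsync` as in the examples above:

```csharp
// Summarize a finished job using the metadata fields above
if (job.Status == "done")
{
    Console.WriteLine($"Job {job.Id} finished");
    Console.WriteLine($"Pages crawled: {job.JobItems.Count}");
    Console.WriteLine($"Started:  {job.CreatedAt}");
    Console.WriteLine($"Finished: {job.FinishedAt}");
}
```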
### JobItem Object
```csharp
item.Id                  // Unique item identifier
item.OriginalUrl         // URL of the crawled page
item.Title               // Page title
item.Status              // Item status
item.PageStatusCode      // HTTP status code
item.MarkdownContentUrl  // URL to markdown content (if applicable)
item.RawContentUrl       // URL to raw content
item.CleanedContentUrl   // URL to cleaned content
item.GetContentAsync()   // Method to get content based on scrapeType
```
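`GetContentAsync()` is the simplest way to read an item's content. If you need a specific format instead, the per-format URLs can be downloaded directly; a minimal sketch using `HttpClient`, assuming `PageStatusCode` is the numeric HTTP code and the content URL properties are plain strings:

```csharp
// Download the markdown version of each successfully crawled item
using var http = new HttpClient();

foreach (var item in job.JobItems)
{
    if (item.PageStatusCode != 200 || item.MarkdownContentUrl == null)
        continue;

    var markdown = await http.GetStringAsync(item.MarkdownContentUrl);
    Console.WriteLine($"{item.OriginalUrl}: {markdown.Length} characters of markdown");
}
```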