
.NET WebCrawler API SDK

Installation

dotnet add package WebCrawlerApi

Requirements

  • .NET 7.0 or higher

Usage

Synchronous Crawling

The synchronous method, CrawlAndWaitAsync, waits for the crawl to complete (polling internally, up to maxPolls status checks) and returns the finished job with all of its items at once.

using WebCrawlerApi;
using WebCrawlerApi.Models;
 
// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");
 
// Synchronous crawling
var job = await crawler.CrawlAndWaitAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);
 
Console.WriteLine($"Job completed with status: {job.Status}");
 
// Access job items and their content
foreach (var item in job.JobItems)
{
    Console.WriteLine($"Page title: {item.Title}");
    Console.WriteLine($"Original URL: {item.OriginalUrl}");
    
    var content = await item.GetContentAsync();
    if (content != null)
    {
        Console.WriteLine($"Content preview: {content[..Math.Min(200, content.Length)]}...");
    }
}

Asynchronous Crawling

The asynchronous method starts the crawl and returns a job ID immediately, so you can poll the job status yourself and fetch the results later.

using WebCrawlerApi;
using WebCrawlerApi.Models;
 
// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");
 
// Start async crawl job
var response = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);
 
// Get the job ID
var jobId = response.Id;
Console.WriteLine($"Crawling job started with ID: {jobId}");
 
// Check job status
var job = await crawler.GetJobAsync(jobId);
while (job.Status == "new" || job.Status == "in_progress")
{
    await Task.Delay(job.RecommendedPullDelayMs);
    job = await crawler.GetJobAsync(jobId);
}
 
// Process results
if (job.Status == "done")
{
    foreach (var item in job.JobItems)
    {
        Console.WriteLine($"Page title: {item.Title}");
        Console.WriteLine($"Original URL: {item.OriginalUrl}");
    }
}
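
If the crawl fails instead, the job leaves the polling loop with the error status; a minimal sketch of handling that case (continuing the example above):

if (job.Status == "error")
{
    // The crawl did not complete successfully; log the job ID for follow-up.
    Console.WriteLine($"Crawl job {job.Id} finished with status: {job.Status}");
}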

Available Parameters

Both crawling methods support these parameters (a combined usage sketch follows the list):

  • url (required): The target URL to crawl
  • scrapeType: Type of content to extract ('markdown', 'html', 'cleaned')
  • itemsLimit: Maximum number of pages to crawl (default: 10)
  • allowSubdomains: Whether to crawl subdomains (default: false)
  • whitelistRegexp: Regular expression for allowed URLs
  • blacklistRegexp: Regular expression for blocked URLs
  • webhookUrl: URL to receive notifications when the job completes
  • maxPolls: Maximum number of status checks (sync only, default: 100)
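
As an illustration, the sketch below combines several of the optional parameters in one asynchronous call. The URL, regular expressions, and webhook endpoint are placeholders, not values from the API:

// Crawl only pages under /docs/, skip binary assets, and notify a webhook on completion.
// All argument values below are illustrative placeholders.
var response = await crawler.CrawlAsync(
    url: "https://example.com/docs/",
    scrapeType: "markdown",
    itemsLimit: 50,
    allowSubdomains: false,
    whitelistRegexp: ".*/docs/.*",
    blacklistRegexp: @".*\.(png|jpg|pdf)$",
    webhookUrl: "https://your-app.example.com/crawl-finished"
);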

Response Objects

Job Object

job.Id                         // Unique job identifier
job.Status                     // Job status (new, in_progress, done, error)
job.Url                        // Original crawl URL
job.CreatedAt                  // Job creation timestamp
job.FinishedAt                 // Job completion timestamp
job.JobItems                   // List of crawled items
job.RecommendedPullDelayMs     // Recommended delay between status checks
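
For example, once a job has finished you can read its metadata straight from these properties (a short sketch continuing the examples above):

Console.WriteLine($"Job {job.Id} is {job.Status}");
Console.WriteLine($"Crawled from: {job.Url}");
Console.WriteLine($"Created: {job.CreatedAt}, finished: {job.FinishedAt}");
Console.WriteLine($"Pages collected: {job.JobItems.Count}");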

JobItem Object

item.Id                        // Unique item identifier
item.OriginalUrl               // URL of the crawled page
item.Title                     // Page title
item.Status                    // Item status
item.PageStatusCode            // HTTP status code
item.MarkdownContentUrl        // URL to markdown content (if applicable)
item.RawContentUrl             // URL to raw content
item.CleanedContentUrl         // URL to cleaned content
item.GetContentAsync()         // Method to get content based on the scrapeType used
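
GetContentAsync() returns the content matching the scrapeType that was requested for the job. A minimal sketch that also checks the per-item HTTP status before reading content (assumes you only want pages that returned 200):

foreach (var item in job.JobItems)
{
    // Skip pages that did not return a successful HTTP response.
    if (item.PageStatusCode != 200)
    {
        Console.WriteLine($"Skipping {item.OriginalUrl} (HTTP {item.PageStatusCode})");
        continue;
    }

    var content = await item.GetContentAsync();
    Console.WriteLine($"{item.Title}: {content?.Length ?? 0} characters");
}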