# .NET WebCrawler API SDK

## Installation

```bash
dotnet add package WebCrawlerApi
```

## Requirements
- .NET 7.0 or higher
## Usage

### Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
```csharp
using WebCrawlerApi;
using WebCrawlerApi.Models;

// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

// Synchronous crawling
var job = await crawler.CrawlAndWaitAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

Console.WriteLine($"Job completed with status: {job.Status}");

// Access job items and their content
foreach (var item in job.JobItems)
{
    Console.WriteLine($"Page title: {item.Title}");
    Console.WriteLine($"Original URL: {item.OriginalUrl}");

    var content = await item.GetContentAsync();
    if (content != null)
    {
        Console.WriteLine($"Content preview: {content[..Math.Min(200, content.Length)]}...");
    }
}
```
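If you want to persist what the crawler returns, the string from `GetContentAsync()` can be written straight to disk. The sketch below continues the example above; the output directory and the one-file-per-item naming scheme are illustrative choices, not part of the SDK.

```csharp
using WebCrawlerApi;

var crawler = new WebCrawlerApiClient("YOUR_API_KEY");
var job = await crawler.CrawlAndWaitAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

// Write each crawled page to its own file (naming scheme is illustrative)
Directory.CreateDirectory("output");
foreach (var item in job.JobItems)
{
    var content = await item.GetContentAsync();
    if (content == null) continue;

    var path = Path.Combine("output", $"{item.Id}.md");
    await File.WriteAllTextAsync(path, content);
    Console.WriteLine($"Saved {item.OriginalUrl} -> {path}");
}
```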
### Asynchronous Crawling
The asynchronous method returns a job ID immediately and allows you to check the status later.
```csharp
using WebCrawlerApi;
using WebCrawlerApi.Models;

// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");

// Start async crawl job
var response = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

// Get the job ID
var jobId = response.Id;
Console.WriteLine($"Crawling job started with ID: {jobId}");

// Poll until the job leaves the "new"/"in_progress" states
var job = await crawler.GetJobAsync(jobId);
while (job.Status == "new" || job.Status == "in_progress")
{
    await Task.Delay(job.RecommendedPullDelayMs);
    job = await crawler.GetJobAsync(jobId);
}

// Process results
if (job.Status == "done")
{
    foreach (var item in job.JobItems)
    {
        Console.WriteLine($"Page title: {item.Title}");
        Console.WriteLine($"Original URL: {item.OriginalUrl}");
    }
}
```
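The loop above polls until the job settles. If you want to bound the wait yourself (the `maxPolls` parameter applies only to the synchronous helper), a simple attempt counter works. A minimal sketch, continuing from the example above and using only the members already shown; the limit of 100 attempts is illustrative:

```csharp
// Poll with an upper bound on the number of status checks (illustrative limit)
const int maxAttempts = 100;
var attempts = 0;

var job = await crawler.GetJobAsync(jobId);
while ((job.Status == "new" || job.Status == "in_progress") && attempts < maxAttempts)
{
    await Task.Delay(job.RecommendedPullDelayMs);
    job = await crawler.GetJobAsync(jobId);
    attempts++;
}

if (job.Status != "done")
{
    Console.WriteLine($"Job did not finish cleanly (status: {job.Status})");
}
```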
## Available Parameters
Both crawling methods support these parameters:
- `url` (required): The target URL to crawl
- `scrapeType`: Type of content to extract ('markdown', 'html', 'cleaned')
- `itemsLimit`: Maximum number of pages to crawl (default: 10)
- `allowSubdomains`: Whether to crawl subdomains (default: false)
- `whitelistRegexp`: Regular expression for allowed URLs
- `blacklistRegexp`: Regular expression for blocked URLs
- `webhookUrl`: URL to receive notifications when the job completes
- `maxPolls`: Maximum number of status checks (sync only, default: 100)
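The optional filters can be combined in a single call. The sketch below uses the client from the examples above; the values are illustrative, and it assumes the parameters are passed as optional named arguments, as in the earlier calls:

```csharp
// Crawl only blog pages, skip the tag archive, and get notified when done
// (URLs, patterns, and limits are illustrative)
var response = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "cleaned",
    itemsLimit: 50,
    allowSubdomains: false,
    whitelistRegexp: ".*/blog/.*",
    blacklistRegexp: ".*/tag/.*",
    webhookUrl: "https://your-server.com/webhooks/crawl-finished"
);
```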
## Response Objects

### Job Object
```csharp
job.Id                      // Unique job identifier
job.Status                  // Job status (new, in_progress, done, error)
job.Url                     // Original crawl URL
job.CreatedAt               // Job creation timestamp
job.FinishedAt              // Job completion timestamp
job.JobItems                // List of crawled items
job.RecommendedPullDelayMs  // Recommended delay between status checks
```
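A typical use of these fields is a post-run summary. A minimal sketch, given a `job` returned by `GetJobAsync` or `CrawlAndWaitAsync` as in the examples above:

```csharp
// Summarize a finished job using the metadata fields above
if (job.Status == "done")
{
    Console.WriteLine($"Job {job.Id} finished");
    Console.WriteLine($"Pages crawled: {job.JobItems.Count}");
    Console.WriteLine($"Started:  {job.CreatedAt}");
    Console.WriteLine($"Finished: {job.FinishedAt}");
}
```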
### JobItem Object
```csharp
item.Id                  // Unique item identifier
item.OriginalUrl         // URL of the crawled page
item.Title               // Page title
item.Status              // Item status
item.PageStatusCode      // HTTP status code
item.MarkdownContentUrl  // URL to markdown content (if applicable)
item.RawContentUrl       // URL to raw content
item.CleanedContentUrl   // URL to cleaned content
item.GetContentAsync()   // Method to get content based on scrapeType
```
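`GetContentAsync()` is the simplest way to read an item's content. If you need a specific format instead, the per-format URLs can be downloaded directly; a minimal sketch using `HttpClient`, assuming `PageStatusCode` is the numeric HTTP code and the content URL properties are plain strings:

```csharp
// Download the markdown version of each successfully crawled item
using var http = new HttpClient();

foreach (var item in job.JobItems)
{
    if (item.PageStatusCode != 200 || item.MarkdownContentUrl == null)
        continue;

    var markdown = await http.GetStringAsync(item.MarkdownContentUrl);
    Console.WriteLine($"{item.OriginalUrl}: {markdown.Length} characters of markdown");
}
```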