Java
Learn how to use the WebCrawler API Java SDK to crawl websites and extract data.
Obtain an API Key
To use the WebCrawler API, you need to obtain an API key. You can do this by signing up for a free account.
Installation
The Java SDK is a standalone, single-file implementation that requires no external dependencies. Simply copy the WebCrawlerAPI.java file into your project.
Download the SDK
Get the SDK from the GitHub repository:
# Download directly
curl -O https://raw.githubusercontent.com/WebCrawlerAPI/java-sdk/main/WebCrawlerAPI.java
# Or clone the repository
git clone https://github.com/WebCrawlerAPI/java-sdk.git
Add to Your Project
Copy WebCrawlerAPI.java into your project's source directory:
# For a standalone project
cp WebCrawlerAPI.java /path/to/your/project/src/
# For Maven projects
cp WebCrawlerAPI.java /path/to/your/project/src/main/java/
# For Gradle projects
cp WebCrawlerAPI.java /path/to/your/project/src/main/java/
Requirements
- Java 17 or higher
- No external dependencies or build tools required
Usage
Quick Start
Here's a simple example to get you started:
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Scrape a single page
    WebCrawlerAPI.ScrapeResult result = client.scrape(
        "https://example.com",
        "markdown"
    );
    if ("done".equals(result.status)) {
        System.out.println("Content: " + result.content);
    }
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error: " + e.getMessage());
}
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Crawl a website
    WebCrawlerAPI.CrawlResult result = client.crawl(
        "https://example.com", // URL to crawl
        "markdown",            // Scrape type: html, cleaned, or markdown
        10                     // Maximum number of pages to crawl
    );
    System.out.println("Crawl completed with status: " + result.status);
    System.out.println("Number of items crawled: " + result.items.size());
    // Access crawled items
    for (WebCrawlerAPI.CrawlItem item : result.items) {
        System.out.println("URL: " + item.url);
        System.out.println("Status: " + item.status);
        System.out.println("Content URL: " + item.getContentUrl("markdown"));
    }
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error: " + e.getMessage());
}
Asynchronous Scraping
Start a scrape job and check status later:
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Start scrape job asynchronously
    String scrapeId = client.scrapeAsync("https://example.com", "html");
    System.out.println("Scrape started with ID: " + scrapeId);
    // Do other work here...
    // Check status later
    WebCrawlerAPI.ScrapeResult result = client.getScrape(scrapeId);
    System.out.println("Status: " + result.status);
    // Poll until complete
    while (!"done".equals(result.status) && !"error".equals(result.status)) {
        Thread.sleep(2000);
        result = client.getScrape(scrapeId);
    }
    if ("done".equals(result.status)) {
        System.out.println("Content: " + result.html);
    }
} catch (Exception e) {
    System.err.println("Error: " + e.getMessage());
}
Scraping Single Pages
For scraping a single page without crawling (synchronous):
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Scrape a single page
    WebCrawlerAPI.ScrapeResult result = client.scrape(
        "https://example.com",
        "markdown"
    );
    System.out.println("Scrape status: " + result.status);
    if ("done".equals(result.status)) {
        System.out.println("Content: " + result.markdown);
    }
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error: " + e.getMessage());
}
API Methods
crawl()
Crawl a website and return all discovered pages.
CrawlResult crawl(String url, String scrapeType, int itemsLimit)
CrawlResult crawl(String url, String scrapeType, int itemsLimit, int maxPolls)
Parameters:
- url (String, required): The target URL to crawl
- scrapeType (String): Type of content to extract: "markdown", "html", or "cleaned"
- itemsLimit (int): Maximum number of pages to crawl
- maxPolls (int, optional): Maximum polling attempts (default: 100)
scrape()
Scrape a single page synchronously (waits for completion).
ScrapeResult scrape(String url, String scrapeType)
ScrapeResult scrape(String url, String scrapeType, int maxPolls)
Parameters:
- url (String, required): The target URL to scrape
- scrapeType (String): Type of content to extract: "markdown", "html", or "cleaned"
- maxPolls (int, optional): Maximum polling attempts (default: 100)
scrapeAsync()
Start a scrape job asynchronously (returns immediately).
String scrapeAsync(String url, String scrapeType)
Parameters:
- url (String, required): The target URL to scrape
- scrapeType (String): Type of content to extract: "markdown", "html", or "cleaned"
Returns: Scrape ID (String) that can be used with getScrape()
getScrape()
Get the status and result of a scrape job.
ScrapeResult getScrape(String scrapeId)
Parameters:
- scrapeId (String, required): The scrape ID returned from scrapeAsync()
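When polling getScrape() in your own code, it is worth capping the number of attempts. The helper below is not part of the SDK; it is a minimal sketch, and the attempt limit and delay values are arbitrary.
// Hypothetical helper (not part of the SDK): polls getScrape() until the job
// finishes or the maximum number of attempts is reached.
static WebCrawlerAPI.ScrapeResult waitForScrape(WebCrawlerAPI client, String scrapeId,
        int maxAttempts, long delayMs)
        throws WebCrawlerAPI.WebCrawlerAPIException, InterruptedException {
    WebCrawlerAPI.ScrapeResult result = client.getScrape(scrapeId);
    for (int attempt = 1; attempt < maxAttempts; attempt++) {
        if ("done".equals(result.status) || "error".equals(result.status)) {
            break; // job finished, successfully or not
        }
        Thread.sleep(delayMs);               // wait before polling again
        result = client.getScrape(scrapeId); // refresh the job status
    }
    return result;
}
// Usage: WebCrawlerAPI.ScrapeResult result = waitForScrape(client, scrapeId, 100, 2000);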
Response Objects
CrawlResult
Contains the result of a crawl operation.
public class CrawlResult {
    public String id;                  // Job ID
    public String status;              // Job status: "new", "in_progress", "done", "error"
    public String url;                 // Original URL
    public String scrapeType;          // Scrape type used
    public int recommendedPullDelayMs; // Recommended delay between polls
    public List<CrawlItem> items;      // List of crawled items
}
CrawlItem
Individual crawled page item.
public class CrawlItem {
    public String url;                // Page URL
    public String status;             // Item status
    public String rawContentUrl;      // URL to raw HTML content
    public String cleanedContentUrl;  // URL to cleaned content
    public String markdownContentUrl; // URL to markdown content
    // Helper method to get content URL based on scrape type
    public String getContentUrl(String scrapeType);
}
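A CrawlItem exposes URLs that point to the stored content rather than the content itself. Below is a minimal sketch for downloading it, assuming the URL returned by getContentUrl() can be fetched with a plain HTTP GET (item is a CrawlItem from a completed crawl).
// Sketch: download the markdown content of a crawled item.
// Assumes the content URL is directly fetchable; uses java.net.http (Java 11+).
java.net.http.HttpClient http = java.net.http.HttpClient.newHttpClient();
String contentUrl = item.getContentUrl("markdown");
if (contentUrl != null) {
    try {
        java.net.http.HttpRequest request = java.net.http.HttpRequest
            .newBuilder(java.net.URI.create(contentUrl))
            .GET()
            .build();
        java.net.http.HttpResponse<String> response =
            http.send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
        System.out.println("Markdown content:\n" + response.body());
    } catch (java.io.IOException | InterruptedException e) {
        System.err.println("Download failed: " + e.getMessage());
    }
}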
ScrapeResult
Contains the result of a scrape operation.
public class ScrapeResult {
    public String status;      // Scrape status: "in_progress", "done", "error"
    public String content;     // Scraped content (based on scrape_type)
    public String html;        // Raw HTML content
    public String markdown;    // Markdown content
    public String cleaned;     // Cleaned text content
    public String url;         // Page URL
    public int pageStatusCode; // HTTP status code
}
Error Handling
The SDK throws WebCrawlerAPIException for API errors:
try {
    WebCrawlerAPI.CrawlResult result = client.crawl(url, "markdown", 10);
    // Process result...
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error code: " + e.getErrorCode());
    System.err.println("Error message: " + e.getMessage());
}
Common error codes:
- network_error - Network/connection error
- invalid_response - Invalid API response
- interrupted - Operation was interrupted
- unknown_error - Unknown error occurred
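The error code makes it possible to tell transient failures apart from permanent ones. The snippet below is a minimal sketch, assuming network_error is worth retrying; the retry count and delay are arbitrary.
// Sketch: retry a scrape a few times when the failure is a transient network error.
WebCrawlerAPI.ScrapeResult result = null;
for (int attempt = 1; attempt <= 3 && result == null; attempt++) {
    try {
        result = client.scrape("https://example.com", "markdown");
    } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
        if ("network_error".equals(e.getErrorCode()) && attempt < 3) {
            System.err.println("Network error, retrying (attempt " + attempt + ")...");
            try {
                Thread.sleep(1000); // brief pause before the next attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        } else {
            System.err.println("Giving up: " + e.getErrorCode() + " - " + e.getMessage());
            break;
        }
    }
}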
Advanced Usage
Custom Base URL
For testing or custom endpoints:
// Use custom API endpoint (e.g., for local development)
WebCrawlerAPI client = new WebCrawlerAPI(
    "YOUR_API_KEY",
    "http://localhost:8080" // Custom base URL
);
Control Polling Behavior
Customize the maximum number of polling attempts:
// Crawl with custom max polls
WebCrawlerAPI.CrawlResult result = client.crawl(
    "https://example.com",
    "markdown",
    10,
    50 // Max 50 polling attempts
);
// Scrape with custom max polls
WebCrawlerAPI.ScrapeResult scrape = client.scrape(
    "https://example.com",
    "markdown",
    30 // Max 30 polling attempts
);
Complete Example
Here's a complete example showing compilation and execution:
public class MyApp {
    public static void main(String[] args) {
        // Get API key from environment variable
        String apiKey = System.getenv("API_KEY");
        if (apiKey == null || apiKey.isEmpty()) {
            System.err.println("Error: API_KEY environment variable not set");
            System.exit(1);
        }
        // Create client
        WebCrawlerAPI client = new WebCrawlerAPI(apiKey);
        try {
            // Crawl a website
            WebCrawlerAPI.CrawlResult result = client.crawl(
                "https://books.toscrape.com",
                "markdown",
                5
            );
            System.out.println("Found " + result.items.size() + " items");
            // Display results
            for (WebCrawlerAPI.CrawlItem item : result.items) {
                System.out.println("URL: " + item.url);
                System.out.println("Status: " + item.status);
            }
        } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
            System.err.println("Error: " + e.getMessage());
            System.exit(1);
        }
    }
}
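The local-testing command below also sets API_BASE_URL, which the example above does not read. If you want the example to honor that variable, a small sketch using the two-argument constructor from Custom Base URL:
// Optional: point the client at a custom endpoint when API_BASE_URL is set
String baseUrl = System.getenv("API_BASE_URL");
WebCrawlerAPI client = (baseUrl == null || baseUrl.isEmpty())
        ? new WebCrawlerAPI(apiKey)
        : new WebCrawlerAPI(apiKey, baseUrl);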
Compile and Run
# Compile (make sure WebCrawlerAPI.java is in the same directory)
javac MyApp.java WebCrawlerAPI.java
# Run with your API key
API_KEY=your-api-key java MyApp
# Or for local testing
API_KEY=test-api-key API_BASE_URL=http://localhost:8080 java MyApp
More Information
For more examples and the complete source code, visit the GitHub repository.