Java
Learn how to use the WebCrawler API Java SDK to crawl websites and extract data.
Obtain an API Key
To use the WebCrawler API, you need to obtain an API key. You can do this by signing up for a free account.
Installation
The Java SDK is a standalone, single-file implementation that requires no external dependencies. Simply copy the WebCrawlerAPI.java file into your project.
Download the SDK
Get the SDK from the GitHub repository:
# Download directly
curl -O https://raw.githubusercontent.com/WebCrawlerAPI/java-sdk/main/WebCrawlerAPI.java
# Or clone the repository
git clone https://github.com/WebCrawlerAPI/java-sdk.git
Add to Your Project
Copy WebCrawlerAPI.java into your project's source directory:
# For a standalone project
cp WebCrawlerAPI.java /path/to/your/project/src/
# For Maven projects
cp WebCrawlerAPI.java /path/to/your/project/src/main/java/
# For Gradle projects
cp WebCrawlerAPI.java /path/to/your/project/src/main/java/
Requirements
- Java 17 or higher
- No external dependencies or build tools required
Usage
Quick Start
Here's a simple example to get you started:
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Scrape a single page
    WebCrawlerAPI.ScrapeResult result = client.scrape(
        "https://example.com",
        "markdown"
    );
    if ("done".equals(result.status)) {
        System.out.println("Content: " + result.content);
    }
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error: " + e.getMessage());
}
Synchronous Crawling
The synchronous method waits for the crawl to complete and returns all data at once.
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Crawl a website
    WebCrawlerAPI.CrawlResult result = client.crawl(
        "https://example.com", // URL to crawl
        "markdown",            // Scrape type: html, cleaned, or markdown
        10                     // Maximum number of pages to crawl
    );
    System.out.println("Crawl completed with status: " + result.status);
    System.out.println("Number of items crawled: " + result.items.size());
    // Access crawled items
    for (WebCrawlerAPI.CrawlItem item : result.items) {
        System.out.println("URL: " + item.url);
        System.out.println("Status: " + item.status);
        System.out.println("Content URL: " + item.getContentUrl("markdown"));
    }
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error: " + e.getMessage());
}
Asynchronous Scraping
Start a scrape job and check status later:
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Start scrape job asynchronously
    String scrapeId = client.scrapeAsync("https://example.com", "html");
    System.out.println("Scrape started with ID: " + scrapeId);
    // Do other work here...
    // Check status later
    WebCrawlerAPI.ScrapeResult result = client.getScrape(scrapeId);
    System.out.println("Status: " + result.status);
    // Poll until complete
    while (!"done".equals(result.status) && !"error".equals(result.status)) {
        Thread.sleep(2000);
        result = client.getScrape(scrapeId);
    }
    if ("done".equals(result.status)) {
        System.out.println("Content: " + result.html);
    }
} catch (Exception e) {
    System.err.println("Error: " + e.getMessage());
}
Scraping Single Pages
For scraping a single page without crawling (synchronous):
// Initialize the client
WebCrawlerAPI client = new WebCrawlerAPI("YOUR_API_KEY");
try {
    // Scrape a single page
    WebCrawlerAPI.ScrapeResult result = client.scrape(
        "https://example.com",
        "markdown"
    );
    System.out.println("Scrape status: " + result.status);
    if ("done".equals(result.status)) {
        System.out.println("Content: " + result.markdown);
    }
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error: " + e.getMessage());
}
API Methods
crawl()
Crawl a website and return all discovered pages.
CrawlResult crawl(String url, String scrapeType, int itemsLimit)
CrawlResult crawl(String url, String scrapeType, int itemsLimit, int maxPolls)
Parameters:
- url (String, required): The target URL to crawl
- scrapeType (String): Type of content to extract: "markdown", "html", or "cleaned"
- itemsLimit (int): Maximum number of pages to crawl
- maxPolls (int, optional): Maximum polling attempts (default: 100)
scrape()
Scrape a single page synchronously (waits for completion).
ScrapeResult scrape(String url, String scrapeType)
ScrapeResult scrape(String url, String scrapeType, int maxPolls)
Parameters:
- url (String, required): The target URL to scrape
- scrapeType (String): Type of content to extract: "markdown", "html", or "cleaned"
- maxPolls (int, optional): Maximum polling attempts (default: 100)
scrapeAsync()
Start a scrape job asynchronously (returns immediately).
String scrapeAsync(String url, String scrapeType)
Parameters:
- url (String, required): The target URL to scrape
- scrapeType (String): Type of content to extract: "markdown", "html", or "cleaned"
Returns: Scrape ID (String) that can be used with getScrape()
getScrape()
Get the status and result of a scrape job.
ScrapeResult getScrape(String scrapeId)
Parameters:
- scrapeId (String, required): The scrape ID returned from scrapeAsync()
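When polling getScrape() in your own code, it is worth capping the number of attempts. The helper below is not part of the SDK; it is a minimal sketch, and the attempt limit and delay values are arbitrary.
// Hypothetical helper (not part of the SDK): polls getScrape() until the job
// finishes or the maximum number of attempts is reached.
static WebCrawlerAPI.ScrapeResult waitForScrape(WebCrawlerAPI client, String scrapeId,
        int maxAttempts, long delayMs)
        throws WebCrawlerAPI.WebCrawlerAPIException, InterruptedException {
    WebCrawlerAPI.ScrapeResult result = client.getScrape(scrapeId);
    for (int attempt = 1; attempt < maxAttempts; attempt++) {
        if ("done".equals(result.status) || "error".equals(result.status)) {
            break; // job finished, successfully or not
        }
        Thread.sleep(delayMs);               // wait before polling again
        result = client.getScrape(scrapeId); // refresh the job status
    }
    return result;
}
// Usage: WebCrawlerAPI.ScrapeResult result = waitForScrape(client, scrapeId, 100, 2000);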
Response Objects
CrawlResult
Contains the result of a crawl operation.
public class CrawlResult {
    public String id;                  // Job ID
    public String status;              // Job status: "new", "in_progress", "done", "error"
    public String url;                 // Original URL
    public String scrapeType;          // Scrape type used
    public int recommendedPullDelayMs; // Recommended delay between polls
    public List<CrawlItem> items;      // List of crawled items
}
CrawlItem
Individual crawled page item.
public class CrawlItem {
    public String url;                // Page URL
    public String status;             // Item status
    public String rawContentUrl;      // URL to raw HTML content
    public String cleanedContentUrl;  // URL to cleaned content
    public String markdownContentUrl; // URL to markdown content
    // Helper method to get content URL based on scrape type
    public String getContentUrl(String scrapeType);
}
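A CrawlItem exposes URLs that point to the stored content rather than the content itself. Below is a minimal sketch for downloading it, assuming the URL returned by getContentUrl() can be fetched with a plain HTTP GET (item is a CrawlItem from a completed crawl).
// Sketch: download the markdown content of a crawled item.
// Assumes the content URL is directly fetchable; uses java.net.http (Java 11+).
java.net.http.HttpClient http = java.net.http.HttpClient.newHttpClient();
String contentUrl = item.getContentUrl("markdown");
if (contentUrl != null) {
    try {
        java.net.http.HttpRequest request = java.net.http.HttpRequest
            .newBuilder(java.net.URI.create(contentUrl))
            .GET()
            .build();
        java.net.http.HttpResponse<String> response =
            http.send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
        System.out.println("Markdown content:\n" + response.body());
    } catch (java.io.IOException | InterruptedException e) {
        System.err.println("Download failed: " + e.getMessage());
    }
}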
ScrapeResult
Contains the result of a scrape operation.
public class ScrapeResult {
    public String status;      // Scrape status: "in_progress", "done", "error"
    public String content;     // Scraped content (based on scrape_type)
    public String html;        // Raw HTML content
    public String markdown;    // Markdown content
    public String cleaned;     // Cleaned text content
    public String url;         // Page URL
    public int pageStatusCode; // HTTP status code
}
Error Handling
The SDK throws WebCrawlerAPIException for API errors:
try {
    WebCrawlerAPI.CrawlResult result = client.crawl(url, "markdown", 10);
    // Process result...
} catch (WebCrawlerAPI.WebCrawlerAPIException e) {
    System.err.println("Error code: " + e.getErrorCode());
    System.err.println("Error message: " + e.getMessage());
}
Common error codes:
- network_error - Network/connection error
- invalid_response - Invalid API response
- interrupted - Operation was interrupted
- unknown_error - Unknown error occurred
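The error code makes it possible to tell transient failures apart from permanent ones. The snippet below is a minimal sketch, assuming network_error is worth retrying; the retry count and delay are arbitrary.
// Sketch: retry a scrape a few times when the failure is a transient network error.
WebCrawlerAPI.ScrapeResult result = null;
for (int attempt = 1; attempt <= 3 && result == null; attempt++) {
    try {
        result = client.scrape("https://example.com", "markdown");
    } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
        if ("network_error".equals(e.getErrorCode()) && attempt < 3) {
            System.err.println("Network error, retrying (attempt " + attempt + ")...");
            try {
                Thread.sleep(1000); // brief pause before the next attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        } else {
            System.err.println("Giving up: " + e.getErrorCode() + " - " + e.getMessage());
            break;
        }
    }
}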
Advanced Usage
Custom Base URL
For testing or custom endpoints:
// Use custom API endpoint (e.g., for local development)
WebCrawlerAPI client = new WebCrawlerAPI(
    "YOUR_API_KEY",
    "http://localhost:8080" // Custom base URL
);
Control Polling Behavior
Customize the maximum number of polling attempts:
// Crawl with custom max polls
WebCrawlerAPI.CrawlResult result = client.crawl(
    "https://example.com",
    "markdown",
    10,
    50 // Max 50 polling attempts
);
// Scrape with custom max polls
WebCrawlerAPI.ScrapeResult scrape = client.scrape(
    "https://example.com",
    "markdown",
    30 // Max 30 polling attempts
);
Complete Example
Here's a complete example showing compilation and execution:
public class MyApp {
    public static void main(String[] args) {
        // Get API key from environment variable
        String apiKey = System.getenv("API_KEY");
        if (apiKey == null || apiKey.isEmpty()) {
            System.err.println("Error: API_KEY environment variable not set");
            System.exit(1);
        }
        // Create client
        WebCrawlerAPI client = new WebCrawlerAPI(apiKey);
        try {
            // Crawl a website
            WebCrawlerAPI.CrawlResult result = client.crawl(
                "https://books.toscrape.com",
                "markdown",
                5
            );
            System.out.println("Found " + result.items.size() + " items");
            // Display results
            for (WebCrawlerAPI.CrawlItem item : result.items) {
                System.out.println("URL: " + item.url);
                System.out.println("Status: " + item.status);
            }
        } catch (WebCrawlerAPI.WebCrawlerAPIException e) {
            System.err.println("Error: " + e.getMessage());
            System.exit(1);
        }
    }
}
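The local-testing command below also sets API_BASE_URL, which the example above does not read. If you want the example to honor that variable, a small sketch using the two-argument constructor from Custom Base URL:
// Optional: point the client at a custom endpoint when API_BASE_URL is set
String baseUrl = System.getenv("API_BASE_URL");
WebCrawlerAPI client = (baseUrl == null || baseUrl.isEmpty())
        ? new WebCrawlerAPI(apiKey)
        : new WebCrawlerAPI(apiKey, baseUrl);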
Compile and Run
# Compile (make sure WebCrawlerAPI.java is in the same directory)
javac MyApp.java WebCrawlerAPI.java
# Run with your API key
API_KEY=your-api-key java MyApp
# Or for local testing
API_KEY=test-api-key API_BASE_URL=http://localhost:8080 java MyApp
More Information
For more examples and the complete source code, visit the GitHub repository.