Errors

A complete guide to WebcrawlerAPI error codes, covering job level and job item level errors.

There are 2 levels of errors: job level and job item level.

Job level error codes:

insufficient_balance - Insufficient balance
invalid_request - Invalid request
internal_error - Internal server error

Job item level error codes:

host_returned_error - Unsuccessful HTTP response from the host
website_access_denied - Website access denied
blocked_by_robots_txt - URL blocked by robots.txt
name_not_resolved - Name resolution error
internal_error - Internal server error
timeout_error - Website timeout
webpage_non_success - Crawling attempt unsuccessful
llm_max_context_length_error - AI request error: maximum context length 128k tokens exceeded
duplicate_item - Duplicate content detected within the same job

Job Level Errors

A job level error means that the job failed to run, for example because the balance is insufficient or the service hit an internal error.

Insufficient Balance

This error occurs when the balance is not enough to run the job. Go to the dashboard to top up your balance.

API error response example:

{
  "error_code": "insufficient_balance",
  "error_message": "Your balance is not enough to run this job"
}

Invalid request

This error occurs when the request is invalid, for example when the URL or one of the parameters is invalid.

API error response example:

{
  "error_code": "invalid_request",
  "error_message": "whitelist_regexp is invalid"
}

Internal error

This error means that something went wrong on our side. Please contact us at [email protected] if you encounter this error.

API error response example:

{
  "error_code": "internal_error",
  "error_message": "Internal server error"
}
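
Client code can branch on error_code when a job fails to start. The following is a minimal sketch, assuming the response body has already been parsed into a dict with the error_code and error_message fields shown above. The helper name and the suggested actions are illustrative and not part of the API.

```python
# Hypothetical helper: maps a job level error response to a suggested next step.
JOB_LEVEL_ACTIONS = {
    "insufficient_balance": "top up your balance in the dashboard",
    "invalid_request": "fix the request parameters and resubmit",
    "internal_error": "retry later or contact support",
}


def describe_job_error(response: dict) -> str:
    """Return a human-readable hint for a job level error response."""
    code = response.get("error_code")
    message = response.get("error_message", "")
    # Unknown codes fall through to a generic hint instead of raising.
    action = JOB_LEVEL_ACTIONS.get(code, "unrecognized error code; inspect the response")
    return f"{code}: {message} ({action})"


if __name__ == "__main__":
    example = {
        "error_code": "invalid_request",
        "error_message": "whitelist_regexp is invalid",
    }
    print(describe_job_error(example))
```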

Job Item Level Errors

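A job item level error means that an individual job item failed with a specific error. As the examples below show, the job itself can still report the status done.

Job item level errors are returned in the job_items array. Each error code is described below.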

Host returned error

This is the most common error. It means that the HTTP response status code is outside the 200-299 range. The exception is the 403 status code, which has a different error code: website_access_denied.

API error response example:

{
    "id": "60b7c4a5-aca7-4183-87db-017418218641",
    //...
	"status": "done",
	"job_items": [
		{
			//...
			"error_code": "host_returned_error",
			"status": "error",
			"last_error": "Webpage returned error status code: 404"
		}
	]
}

Website access denied

This is a special case of the host_returned_error error. It means that the website returned a 403 status code.

API error response example:

{
    //...
	"status": "done",
	"job_items": [
		{
			//...
			"error_code": "website_access_denied",
			"status": "error",
			"last_error": "Webpage returned access denied status code: 403"
		}
	]
}
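
The relationship between HTTP status codes and these two error codes can be sketched as a small function. This only restates the rule described above (non-2xx gives host_returned_error, except 403, which gives website_access_denied). The function name is hypothetical, and how the service treats redirects is not stated here.

```python
from typing import Optional


def expected_error_code(http_status: int) -> Optional[str]:
    """Return the job item error code implied by an HTTP status, or None on success."""
    if 200 <= http_status <= 299:
        return None  # successful response, no error code
    if http_status == 403:
        return "website_access_denied"  # special case of host_returned_error
    return "host_returned_error"


if __name__ == "__main__":
    for status in (200, 403, 404, 500):
        print(status, expected_error_code(status))
```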

Blocked by robots.txt

This error occurs when the respect_robots_txt parameter is set to true and the website's robots.txt file disallows access to the specific URL for crawlers. The robots.txt file is a standard used by websites to communicate with web crawlers about which parts of the site should not be crawled.

API error response example:

{
    "error_code": "blocked_by_robots_txt",
    "error_message": "URL is blocked by robots.txt. The website's robots.txt file disallows access to this URL for crawlers. Respect robots.txt can be disabled in the request."
}
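
To see in advance whether a URL is likely to be blocked, the robots.txt rules can be evaluated locally. The sketch below uses Python's standard urllib.robotparser on rules supplied as a string. The user agent the service identifies as is not documented here, so the wildcard "*" is an assumption, and the service's own matching may differ from this local check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check whether the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/private/page"))  # False
    print(is_allowed(rules, "https://example.com/public/page"))   # True
```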

Name resolution error

This error means that there was a problem resolving the website's host name. Most likely, the website does not exist or there is a typo in the URL.

API error response example:

{
    //...
	"job_items": [
		{
			//...
			"error_code": "name_not_resolved",
			"status": "error",
			"last_error": "Connection refused"
		}
	]
}
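
A quick local check can catch typos before a URL is submitted. This is a minimal sketch, assuming a plain DNS lookup from your own machine is a reasonable proxy for what the crawler sees. The function name and the injectable resolver argument are illustrative only.

```python
import socket
from urllib.parse import urlsplit


def host_resolves(url: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the URL has a host name and that name resolves."""
    host = urlsplit(url).hostname
    if not host:
        return False  # no scheme or no host, e.g. "example.com/page"
    try:
        resolver(host, None)
    except (socket.gaierror, UnicodeError):
        return False  # lookup failed or the host name is malformed
    return True
```

Calling host_resolves("https://example.com/") performs a real DNS lookup by default; passing a stub as resolver keeps the check offline for tests.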

Website timeout

This error occurs when we tried to reach the webpage several times with different proxies, but the website did not respond within a reasonable time.

Troubleshooting steps:

Check website accessibility - First, verify that the website and webpage are accessible by visiting the webpage manually in your browser
If it loads slowly in your browser - The issue is likely on the website's side (slow server, high traffic, or downtime)
If it loads instantly in your browser but still times out in the API - This indicates the website has sophisticated anti-bot protection that we cannot bypass

Common causes:

The website is slow to respond or experiencing high traffic
The website is temporarily down or experiencing server issues
The website has advanced anti-bot protection systems
The website has captcha or other interactive elements that weren't solved in time

We recommend retrying the request. If the problem persists and the website loads normally in your browser, please contact us at [email protected].

API error response example:

{
    //...
	"job_items": [
		{
			//...
			"error_code": "timeout_error",
			"status": "error",
			"last_error": "Website timeout. Please try again later or contact support at [email protected]"
		}
	]
}
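
The retry recommendation above can be sketched as a small loop with exponential backoff. Here crawl is a hypothetical callable that submits one URL and returns the resulting job item as a dict. The attempt count and delays are arbitrary illustrative values, not limits defined by the API.

```python
import time
from typing import Callable


def retry_on_timeout(
    crawl: Callable[[str], dict],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call crawl(url) again while the item reports timeout_error."""
    item = crawl(url)
    for attempt in range(1, max_attempts):
        if item.get("error_code") != "timeout_error":
            break  # success or a different error: retrying will not help
        sleep(base_delay * 2 ** (attempt - 1))  # wait 2s, then 4s, ...
        item = crawl(url)
    return item
```

Only timeout_error triggers a retry here. Other error codes are returned to the caller immediately, since repeating the same request is unlikely to change them.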

Webpage Non Success

We tried hard, but the crawling attempt was not successful. This typically happens when the webpage either returns no useful content or has sophisticated anti-bot protection that prevents access.

Common causes:

The webpage has advanced anti-bot protection systems
The page content is blocked or restricted
The page loaded but contained no extractable content

We recommend checking if the webpage is accessible normally and trying again. If the problem persists, please contact us at [email protected].

API error response example:

{
    //...
	"job_items": [
		{
			//...
			"error_code": "webpage_non_success",
			"status": "error",
			"last_error": "The crawling attempt was not successful. The content may be empty or blocked by anti-bot protection."
		}
	]
}

LLM Max Context Length Error

This error occurs when the webpage content is too large to fit within the AI model's context window of 128k tokens. AI processing requires the entire webpage content to fit within this limit. When a webpage has too much text, images, or other content, it exceeds the limit and cannot be processed.

A possible solution is to use the clean_selectors parameter, which lets you exclude unneeded content (such as navigation, ads, and footers) before it is sent to the LLM. See the cleaning documentation for details on how to use clean selectors.

API error response example:

{
    //...
	"job_items": [
		{
			//...
			"error_code": "llm_max_context_length_error",
			"status": "error",
			"last_error": "AI request error: maximum context length 128k tokens exceeded for this page"
		}
	]
}

Duplicate Content

This error occurs when the same content is detected multiple times within the same crawling job. The system uses content hashing to identify pages with identical content, even if they have different URLs. When a duplicate is found, the job item will fail with this error code and reference the URL where the content was first seen.

Common causes:

Multiple URLs pointing to the same content (e.g., with different query parameters or URL paths)
Mirror pages or duplicate content on the website
Pagination pages with identical content
URL variations that load the same content

When a duplicate is detected, you will not be charged for processing the duplicate item, as the balance is automatically refunded.

API error response example:

{
    //...
	"job_items": [
		{
			//...
			"error_code": "duplicate_item",
			"status": "error",
			"last_error": "Duplicate content of: https://example.com/original-page"
		}
	]
}
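
The idea of detecting duplicates by content hash can be illustrated with a short sketch. It assumes SHA-256 over the raw page text; the hash function and any normalization the service actually applies are not documented here, so treat this as an illustration of the technique rather than the service's implementation.

```python
import hashlib


def find_duplicates(pages: dict) -> dict:
    """Map each duplicate URL to the URL where the same content was first seen."""
    first_seen = {}  # content digest -> first URL with that content
    duplicates = {}  # duplicate URL -> original URL
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates[url] = first_seen[digest]
        else:
            first_seen[digest] = url
    return duplicates


if __name__ == "__main__":
    pages = {
        "https://example.com/original-page": "<h1>Hello</h1>",
        "https://example.com/original-page?ref=nav": "<h1>Hello</h1>",
        "https://example.com/other": "<h1>Other</h1>",
    }
    print(find_duplicates(pages))
```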

Internal error

This error means that something went wrong on our side. Please contact us at [email protected] if you encounter this error.

API error response example:

{
    //...
	"job_items": [
		{
			//...
			"error_code": "internal_error",
			"status": "error",
			"last_error": "Internal server error"
		}
	]
}
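
Because item errors arrive inside the job_items array, it can help to summarize them per error code once a job finishes. The following is a minimal sketch, assuming the job response has been parsed into a dict shaped like the examples above. The function name is hypothetical.

```python
from collections import defaultdict


def group_item_errors(job: dict) -> dict:
    """Group the last_error messages of failed job items by error_code."""
    grouped = defaultdict(list)
    for item in job.get("job_items", []):
        if item.get("status") == "error":
            grouped[item.get("error_code")].append(item.get("last_error"))
    return dict(grouped)


if __name__ == "__main__":
    job = {
        "status": "done",  # the job finished even though items failed
        "job_items": [
            {
                "error_code": "host_returned_error",
                "status": "error",
                "last_error": "Webpage returned error status code: 404",
            },
            {
                "error_code": "website_access_denied",
                "status": "error",
                "last_error": "Webpage returned access denied status code: 403",
            },
        ],
    }
    for code, messages in group_item_errors(job).items():
        print(code, messages)
```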
