Job URL filters
Sometimes you don't need the content of every page on a website. You may want only specific blog posts, or to filter out certain pages. For this, you can use the `whitelist_regexp` and `blacklist_regexp` parameters in your job request.
Whitelisting specific pages for website crawling
Whitelisting is a way to include only specific pages in your crawling job. You can use a regular expression to match the URLs you want to include.
Whitelist example
For example, if you want to include only blog posts from a specific category, you can use the following regular expression:
```json
{
  "url": "https://example.com/",
  "whitelist_regexp": "/blog/category/technology.*"
}
```
Let's say you have a blog with the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
In this case, the crawler will include only the URLs that match the `whitelist_regexp` pattern; URLs that do not match are excluded from the crawling job. The resulting job will contain only the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
Excluded URLs:
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
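If you want to sanity-check a pattern before submitting a job, you can reproduce the filtering locally. The exact regex dialect the crawler uses isn't specified here, so this is just a minimal sketch assuming Python-compatible patterns matched anywhere in the URL:

```python
import re

# Sample URLs standing in for the pages the crawler would discover.
urls = [
    "https://example.com/blog/category/technology/post1",
    "https://example.com/blog/category/technology/post2",
    "https://example.com/blog/category/lifestyle/post1",
    "https://example.com/blog/category/lifestyle/post2",
]

whitelist = re.compile(r"/blog/category/technology.*")

for url in urls:
    # re.search matches the pattern anywhere in the URL, which mirrors
    # the substring-style matching in the example above.
    status = "included" if whitelist.search(url) else "excluded"
    print(f"{status}: {url}")
```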
Blacklisting specific pages for website crawling
Blacklisting is a way to exclude specific pages from your crawling job. You can use a regular expression to match the URLs you want to exclude.
Blacklist example
For example, if you want to exclude all URLs that contain the word "admin", you can use the following regular expression:
```json
{
  "url": "https://example.com/",
  "blacklist_regexp": "/admin.*"
}
```
Let's say you have a website with the following URLs:
- https://example.com/admin/dashboard
- https://example.com/admin/settings
- https://example.com/blog/post1
- https://example.com/blog/post2
In this case, the crawler will exclude the URLs that match the `blacklist_regexp` pattern; URLs that do not match are included in the crawling job. The resulting job will contain only the following URLs:
- https://example.com/blog/post1
- https://example.com/blog/post2
Excluded URLs:
- https://example.com/admin/dashboard
- https://example.com/admin/settings
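The same local sanity check works for blacklists. Again a sketch assuming Python-compatible patterns, this time keeping only the URLs the pattern does not match:

```python
import re

urls = [
    "https://example.com/admin/dashboard",
    "https://example.com/admin/settings",
    "https://example.com/blog/post1",
    "https://example.com/blog/post2",
]

blacklist = re.compile(r"/admin.*")

# Keep only the URLs that do NOT match the blacklist pattern.
kept = [url for url in urls if not blacklist.search(url)]
print(kept)  # ['https://example.com/blog/post1', 'https://example.com/blog/post2']
```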
Combining Whitelisting and Blacklisting
You can combine both whitelisting and blacklisting in your crawling job. This allows you to include only specific pages while excluding others.
First, the crawler applies the `whitelist_regexp` to include only the URLs that match it. Then it applies the `blacklist_regexp` to exclude any URLs that match that pattern.
Example
For example, if you want to include all blog posts except those in the lifestyle category, you can use the following regular expressions:
```json
{
  "url": "https://example.com/",
  "whitelist_regexp": "/blog/category/.*",
  "blacklist_regexp": "/blog/category/lifestyle.*"
}
```
Let's say you have a website with the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
- https://example.com/blog/category/sports/post1
- https://example.com/blog/category/sports/post2
- https://example.com/admin/dashboard
- https://example.com/admin/settings
In this case, the crawler will include only the URLs that match the `whitelist_regexp` pattern and do not match the `blacklist_regexp` pattern. The resulting job will contain only the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
- https://example.com/blog/category/sports/post1
- https://example.com/blog/category/sports/post2
Excluded URLs:
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
- https://example.com/admin/dashboard
- https://example.com/admin/settings
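Expressed as code, the order of application looks like this. This is a minimal sketch of the include-then-exclude logic, again assuming Python-compatible patterns:

```python
import re

whitelist = re.compile(r"/blog/category/.*")
blacklist = re.compile(r"/blog/category/lifestyle.*")

def should_crawl(url: str) -> bool:
    # The whitelist is applied first: a URL must match it to be considered.
    if not whitelist.search(url):
        return False
    # The blacklist is applied second: a match here drops the URL again.
    return not blacklist.search(url)

print(should_crawl("https://example.com/blog/category/technology/post1"))  # True
print(should_crawl("https://example.com/blog/category/lifestyle/post1"))   # False
print(should_crawl("https://example.com/admin/dashboard"))                 # False
```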
What if I need to whitelist by several patterns?
You can use the `|` operator to combine multiple patterns in a single regular expression. This allows you to include URLs that match any of the specified patterns.
For example:
```json
{
  "url": "https://example.com/",
  "whitelist_regexp": "/blog/category/(technology|lifestyle).*"
}
```
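A quick check that the alternation matches both categories (a sketch assuming Python-compatible patterns):

```python
import re

pattern = re.compile(r"/blog/category/(technology|lifestyle).*")

print(bool(pattern.search("https://example.com/blog/category/technology/post1")))  # True
print(bool(pattern.search("https://example.com/blog/category/lifestyle/post2")))   # True
print(bool(pattern.search("https://example.com/blog/category/sports/post1")))      # False
```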
How to debug whitelist_regexp and blacklist_regexp
When you create a job with `whitelist_regexp` or `blacklist_regexp`, it can be difficult to know whether your regular expression is correct and which pages will be included in or excluded from the crawling job. In addition, you usually don't know the website's URL structure in advance.
We recommend crawling without any filters first (with a small item limit, e.g. 100) and then using the `urls` API endpoint to get the list of URLs that were crawled.
You can then use tools like regex101 to debug your regexps: paste the list of URLs into the Find matches section and write your regexp against them.
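You can also script this step. The sketch below is hypothetical: the exact `urls` endpoint path, authentication scheme, and response shape are assumptions here, so substitute the actual values from the API reference:

```python
import re
import requests

# Hypothetical endpoint and auth header -- replace with the actual
# `urls` API endpoint and credentials for your job.
resp = requests.get(
    "https://api.example.com/v1/jobs/JOB_ID/urls",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
urls = resp.json()  # assumed here to be a plain list of crawled URLs

candidate = re.compile(r"/blog/category/technology.*")

for url in urls:
    status = "match" if candidate.search(url) else "no match"
    print(f"{status}: {url}")
```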