Job URL filters
Sometimes you don't need the content of every page on a website. You may want only specific blog posts, or to filter out certain pages. For this, you can use the `whitelist_regexp` and `blacklist_regexp` parameters in your job request.
Whitelisting specific pages for website crawling
Whitelisting is a way to include only specific pages in your crawling job. You can use a regular expression to match the URLs you want to include.
Whitelist example
For example, if you want to include only blog posts from a specific category, you can use the following regular expression:
```json
{
  "url": "https://example.com/",
  "whitelist_regexp": "/blog/category/technology.*"
}
```
Let's say you have a blog with the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
In this case, the crawler will include only the URLs that match the `whitelist_regexp` pattern; URLs that do not match are excluded from the crawling job. The resulting job will contain only the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
Excluded URLs:
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
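If you want to sanity-check a pattern before submitting a job, you can reproduce the filtering locally. The exact regex dialect the crawler uses isn't specified here, so this is just a minimal sketch assuming Python-compatible patterns matched anywhere in the URL:

```python
import re

# Sample URLs standing in for the pages the crawler would discover.
urls = [
    "https://example.com/blog/category/technology/post1",
    "https://example.com/blog/category/technology/post2",
    "https://example.com/blog/category/lifestyle/post1",
    "https://example.com/blog/category/lifestyle/post2",
]

whitelist = re.compile(r"/blog/category/technology.*")

for url in urls:
    # re.search matches the pattern anywhere in the URL, which mirrors
    # the substring-style matching in the example above.
    status = "included" if whitelist.search(url) else "excluded"
    print(f"{status}: {url}")
```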
Blacklisting specific pages for website crawling
Blacklisting is a way to exclude specific pages from your crawling job. You can use a regular expression to match the URLs you want to exclude.
Blacklist example
For example, if you want to exclude all URLs that contain the word "admin", you can use the following regular expression:
```json
{
  "url": "https://example.com/",
  "blacklist_regexp": "/admin.*"
}
```
Let's say you have a website with the following URLs:
- https://example.com/admin/dashboard
- https://example.com/admin/settings
- https://example.com/blog/post1
- https://example.com/blog/post2
In this case, the crawler will exclude the URLs that match the `blacklist_regexp` pattern; URLs that do not match are included in the crawling job. The resulting job will contain only the following URLs:
- https://example.com/blog/post1
- https://example.com/blog/post2
Excluded URLs:
- https://example.com/admin/dashboard
- https://example.com/admin/settings
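The same local sanity check works for blacklists. Again a sketch assuming Python-compatible patterns, this time keeping only the URLs the pattern does not match:

```python
import re

urls = [
    "https://example.com/admin/dashboard",
    "https://example.com/admin/settings",
    "https://example.com/blog/post1",
    "https://example.com/blog/post2",
]

blacklist = re.compile(r"/admin.*")

# Keep only the URLs that do NOT match the blacklist pattern.
kept = [url for url in urls if not blacklist.search(url)]
print(kept)  # ['https://example.com/blog/post1', 'https://example.com/blog/post2']
```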
Combining Whitelisting and Blacklisting
You can combine both whitelisting and blacklisting in your crawling job. This allows you to include only specific pages while excluding others.
First, the crawler applies the `whitelist_regexp` to include only the URLs that match it. Then it applies the `blacklist_regexp` to exclude any URLs that match that pattern.
Example
For example, if you want to include all blog posts except those in the lifestyle category, you can use the following regular expressions:
```json
{
  "url": "https://example.com/",
  "whitelist_regexp": "/blog/category/.*",
  "blacklist_regexp": "/blog/category/lifestyle.*"
}
```
Let's say you have a website with the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
- https://example.com/blog/category/sports/post1
- https://example.com/blog/category/sports/post2
- https://example.com/admin/dashboard
- https://example.com/admin/settings
In this case, the crawler will include only the URLs that match the `whitelist_regexp` pattern and do not match the `blacklist_regexp` pattern. The resulting job will contain only the following URLs:
- https://example.com/blog/category/technology/post1
- https://example.com/blog/category/technology/post2
- https://example.com/blog/category/sports/post1
- https://example.com/blog/category/sports/post2
Excluded URLs:
- https://example.com/blog/category/lifestyle/post1
- https://example.com/blog/category/lifestyle/post2
- https://example.com/admin/dashboard
- https://example.com/admin/settings
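Expressed as code, the order of application looks like this. This is a minimal sketch of the include-then-exclude logic, again assuming Python-compatible patterns:

```python
import re

whitelist = re.compile(r"/blog/category/.*")
blacklist = re.compile(r"/blog/category/lifestyle.*")

def should_crawl(url: str) -> bool:
    # The whitelist is applied first: a URL must match it to be considered.
    if not whitelist.search(url):
        return False
    # The blacklist is applied second: a match here drops the URL again.
    return not blacklist.search(url)

print(should_crawl("https://example.com/blog/category/technology/post1"))  # True
print(should_crawl("https://example.com/blog/category/lifestyle/post1"))   # False
print(should_crawl("https://example.com/admin/dashboard"))                 # False
```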
What if I need to whitelist by several patterns?
You can use the `|` operator to combine multiple patterns in a single regular expression. This allows you to include URLs that match any of the specified patterns.
For example:
```json
{
  "url": "https://example.com/",
  "whitelist_regexp": "/blog/category/(technology|lifestyle).*"
}
```
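A quick check that the alternation matches both categories (a sketch assuming Python-compatible patterns):

```python
import re

pattern = re.compile(r"/blog/category/(technology|lifestyle).*")

print(bool(pattern.search("https://example.com/blog/category/technology/post1")))  # True
print(bool(pattern.search("https://example.com/blog/category/lifestyle/post2")))   # True
print(bool(pattern.search("https://example.com/blog/category/sports/post1")))      # False
```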
How to debug whitelist_regexp and blacklist_regexp
When you create a job with `whitelist_regexp` or `blacklist_regexp`, it can be difficult to know whether your regular expression is correct and which pages will be included in or excluded from the crawling job. In addition, you usually don't know the website's URL structure in advance.
We recommend crawling without any filters first (with a small item limit, e.g. 100) and then using the `urls` API endpoint to get the list of URLs that were crawled.
You can then use tools like regex101 to debug your regexps: paste the list of URLs into the Find matches section and write your regexp against them.
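You can also script this step. The sketch below is hypothetical: the exact `urls` endpoint path, authentication scheme, and response shape are assumptions here, so substitute the actual values from the API reference:

```python
import re
import requests

# Hypothetical endpoint and auth header -- replace with the actual
# `urls` API endpoint and credentials for your job.
resp = requests.get(
    "https://api.example.com/v1/jobs/JOB_ID/urls",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
urls = resp.json()  # assumed here to be a plain list of crawled URLs

candidate = re.compile(r"/blog/category/technology.*")

for url in urls:
    status = "match" if candidate.search(url) else "no match"
    print(f"{status}: {url}")
```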