Table of Contents
- How to Build a Web Crawler
- What is a Web Crawler?
- Key Parts of Every Web Crawler
- Fetcher (The Scraper)
- Parser (The Reader)
- URL Manager (The To-Do List)
- Storage (The Memory)
- Planning Your Web Crawler
- Set Clear Goals
- Know Your Target Websites
- Decide on Depth and Scope
- Common Challenges You Will Face
- Rate Limits and Being Polite
- Pages That Need JavaScript
- Anti-Bot Protection
- Best Practices for a Good Web Crawler
- Have a Dashboard
- Respect robots.txt Rules
- Handle Errors Without Crashing
- Avoid Duplicate Pages
- Start Simple
- Scale Step by Step
- Build It Yourself or Use an API?
- Summary
How to Build a Web Crawler
Hi, I'm Andrew. I'm a software engineer with fifteen years of experience, and I've been building WebCrawlerAPI for more than two years. In this article I'll show you the problems you will run into and the work required if you want to build your own web crawler. By the end you will understand how much work it takes, and whether it is worth it or better to use an existing API.
What is a Web Crawler?
A web crawler is a tool that automates the process of getting data from websites. It could be any kind of data: webpage content, headers, tags, SEO information, links, images, etc. The difference between a scraper and a crawler is that a scraper only gets data from a single page and returns the result. A web crawler starts from a seed page, then processes all links and follows them to get information from the linked pages too. Like a spider crawling the web. That's why it is called a web crawler.
Key Parts of Every Web Crawler
Every web crawler contains the same vital parts:
- Fetcher. The content receiver: it downloads pages over the network.
- Parser (Reader). The part responsible for processing the content that the Fetcher retrieved. It extracts the required information, for example images or a blog post article, and it also extracts links.
- URL manager. The part responsible for managing URLs: it prepares and schedules URLs for further fetching.
- Storage. The part that keeps crawl state and extracted data.
Fetcher (The Scraper)
The Fetcher is the part that actually does the network job: it makes the request, downloads the response, and returns the raw content to the next steps. If you think about a crawler as a small browser, the Fetcher is your browser tab.
Here is a tiny fetcher example with a few headers:
// Node 18+
// Idea: fetch HTML with a few headers.
// Real crawlers also need timeouts, retries, and backoff.
export async function fetchPage(url) {
  const res = await fetch(url, {
    redirect: "follow",
    headers: {
      "user-agent": "MyCrawler/1.0",
      "accept-language": "en-US,en;q=0.9",
    },
  });
  const html = await res.text();
  return { url: res.url, status: res.status, html };
}
In real life the Fetcher is not just fetch(url). You will need to decide a lot of things:
- What HTTP client you use (Node fetch, axios, undici, etc.)
- How you set headers (User-Agent, Accept-Language, cookies)
- How you handle redirects (follow, stop, max redirect count)
- How you handle timeouts and retries (and when you should NOT retry)
- How you limit concurrency (so you don't DDoS the website and you don't kill your own server)
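To give you an idea, here is a minimal sketch of a fetcher with a timeout and a simple retry with backoff. The numbers (8s timeout, 3 attempts, the delays) are just example values, not a recommendation:
// Idea: add a timeout via AbortController and retry transient failures.
// Timeout, attempts, and delays are example values only.
export async function fetchWithRetry(url, { timeoutMs = 8_000, attempts = 3 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(url, { signal: controller.signal, redirect: "follow" });
      // Retry only transient server errors; 4xx is usually final.
      if (res.status < 500 || attempt === attempts) return res;
    } catch (err) {
      // Timeouts and network errors: give up on the last attempt.
      if (attempt === attempts) throw err;
    } finally {
      clearTimeout(timer);
    }
    // Simple backoff between attempts.
    await new Promise((r) => setTimeout(r, 1_000 * attempt));
  }
}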
You will also meet websites that work only with JavaScript. For simple HTML pages a normal HTTP request is enough. But if the content is rendered in the browser, you need a headless browser (Playwright/Puppeteer) or some rendering service. This is the moment when the crawler becomes expensive: CPU, memory, and time per page go up a lot.
One more thing: the Fetcher is the first place where you fight anti-bot protection. You will see blocks, CAPTCHAs, 403/429 responses, weird redirects, and sometimes just empty HTML. So you need good logging (request id, status code, response size) and you need to store some debug info (headers, final URL) to understand what is going on.
Parser (The Reader)
The Parser is the part that takes the raw response from the Fetcher and turns it into structured data. For HTML that means: read the document, find the important blocks, extract fields, and extract links for the next crawl.
A simple HTML parser example (extract title + links) using cheerio:
import * as cheerio from "cheerio";
export function parseHtml({ url, html }) {
  const $ = cheerio.load(html);
  const title = $("title").text().trim() || null;
  // Idea: collect links and resolve relative URLs.
  // Real code must skip junk links and handle invalid URLs.
  const links = $("a[href]")
    .map((_, a) => new URL($(a).attr("href"), url).toString())
    .get();
  return { title, links };
}
The biggest mistake is to think that the Parser is only “select some CSS selectors and done”. In practice pages are messy:
- HTML is broken, tags not closed, weird nesting
- Text is mixed with navigation, ads, cookie banners
- Same site can have multiple templates for different pages
- Encoding can be wrong (UTF-8 vs something else)
So you usually implement parsing in layers. First: detect the content type (HTML, JSON, PDF, image) and choose a parser. Second: clean the HTML (remove scripts/styles, normalize whitespace). Third: extract what you need (title, main content, meta tags, headings, images, etc.). Fourth: convert it to the format you need (markdown, JSON, text, etc.).
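As a rough sketch of that first layer, here is a content type dispatch. It assumes a hypothetical parse-html.js module exporting the cheerio parser from the example above:
// Idea: pick a parser based on the Content-Type of the response.
// parseHtml is the cheerio-based parser from above; the file name is hypothetical.
import { parseHtml } from "./parse-html.js";

export function parseResponse({ url, contentType, body }) {
  const type = (contentType || "").toLowerCase();
  if (type.includes("text/html")) return parseHtml({ url, html: body });
  if (type.includes("application/json")) return { url, data: JSON.parse(body) };
  if (type.includes("application/pdf")) return { url, skipped: "pdf not supported yet" };
  return { url, skipped: `unsupported content type: ${type}` };
}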
For a crawler the most important output of the Parser is links. You need to find all <a href> links (and sometimes src attributes, like iframes), resolve relative URLs, remove junk (mailto, tel, javascript links), and normalize. If you don't normalize, you will crawl duplicates forever: /page, /page/, /page?utm=....
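A minimal sketch of that link cleanup (the list of skipped schemes is just a starting point):
// Idea: resolve relative links against the page URL and drop junk schemes.
const SKIP_PROTOCOLS = new Set(["mailto:", "tel:", "javascript:", "data:"]);

export function cleanLinks(links, baseUrl) {
  const out = [];
  for (const href of links) {
    try {
      const u = new URL(href, baseUrl);
      if (SKIP_PROTOCOLS.has(u.protocol)) continue;
      u.hash = ""; // fragments point to the same document
      out.push(u.toString());
    } catch {
      // invalid URL in href, ignore it
    }
  }
  return [...new Set(out)]; // dedupe within the page
}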
A good Parser also produces signals for the URL manager and storage: canonical URL, robots meta tags, noindex/nofollow, content hash, detected language, and errors. This is how you keep the crawler stable when you scale from 100 pages to millions.
URL Manager (The To-Do List)
The URL Manager is the brain of the crawler. The Fetcher and Parser can be perfect, but if you manage URLs wrong you will waste money and time. This part decides what URL you crawl next, what you skip, and how you avoid infinite loops.
In the simplest version it's just a queue: push the seed URL, pop a URL, fetch, parse, push new links. But real websites will break this naive approach immediately.
Here is a tiny (but practical) URL normalization + dedupe loop you can build on:
export function normalizeUrl(input) {
  const u = new URL(input);
  u.hash = "";
  return u.toString();
}

export async function crawl(seedUrl, { fetcher, parser, maxPages = 100 } = {}) {
  const seen = new Set();
  const queue = [normalizeUrl(seedUrl)];
  while (queue.length && seen.size < maxPages) {
    const url = queue.shift();
    if (seen.has(url)) continue;
    seen.add(url);
    const { html } = await fetcher(url);
    const { links } = parser({ url, html });
    for (const link of links) queue.push(normalizeUrl(link));
  }
  return { pagesCrawled: seen.size };
}
Things the URL Manager usually does:
- Deduplication: don't crawl same URL again and again
- Redirects: many links redirect to the same page
- Normalization: make URL consistent (/page vs /page/, remove #hash, sort query params)
- Filtering: skip mailto:, tel:, javascript:, logout links, calendar pages, etc.
- Scope rules: stay inside domain/subdomain/path, limit depth, limit total pages
- Priorities: home page first, then important pages, then everything else
- Depth: how deep to follow the links
A very important problem is canonicals and redirects. You can fetch URL A, it redirects to URL B, and the HTML says the canonical is URL C. If you don't unify this, you will store duplicates and your crawl graph becomes garbage. So the URL Manager usually keeps a mapping: requested URL -> final URL -> canonical URL.
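A small sketch of that mapping, assuming the Fetcher already gives you the final URL after redirects and the Parser gives you the canonical from the HTML:
// Idea: keep one record per requested URL and resolve it to a single primary URL.
// finalUrl comes from the fetch response, canonicalUrl from the parsed HTML.
export function resolvePrimaryUrl({ requestedUrl, finalUrl, canonicalUrl }) {
  const primary = canonicalUrl || finalUrl || requestedUrl;
  return {
    requestedUrl,
    finalUrl: finalUrl || requestedUrl,
    canonicalUrl: canonicalUrl || null,
    primaryUrl: primary, // use this as the key for dedupe and storage
  };
}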
You also need a strategy for failures. Some URLs should be retried (temporary network issue), some should be dropped (404), some should be paused (429 rate limit). This logic lives here too, because it controls the queue.
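For example, a minimal classification by status code could look like this (the exact rules depend on your use case):
// Idea: decide what the queue should do with a URL after a fetch attempt.
export function decideNext({ status, error, attempts, maxAttempts = 3 }) {
  if (error) return attempts < maxAttempts ? "retry" : "fail"; // network/DNS/timeout
  if (status === 429) return "pause-host"; // slow down this host, retry later
  if (status === 404 || status === 410) return "drop"; // gone, don't retry
  if (status >= 500) return attempts < maxAttempts ? "retry" : "fail";
  return "done";
}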
When the crawler grows, the URL Manager becomes a real system: a database table, indexes, statuses, an attempts counter, and maybe a distributed queue. This is why people underestimate crawling: it's not “fetch pages”, it's “manage millions of URLs safely”.
Storage (The Memory)
Storage is where the crawler stops being a script and becomes a product, because if you can fetch and parse but can't store results correctly, you basically did nothing.
You usually store two types of data:
- Crawl state (operational data): URL statuses, attempts, next run time, last seen, redirect/canonical mapping.
- Extracted data (business data): page HTML snapshot (optional), cleaned text, markdown, metadata, links, images, SEO fields, etc.
The first one must be reliable and fast for updates. This is usually Postgres or Redis + Postgres. If you lose the crawl state you will re-crawl the same pages and burn resources.
The second one depends on your use case. If you need search, you probably want a search index. If you need analytics, you might want columnar storage later. If you only need “give me parsed content”, Postgres can be enough for a long time. But storing content only in the DB can become very expensive, so you have to think about moving it to file storage, like Cloudflare R2 or Amazon S3, and only saving a link in the main DB.
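Here is a minimal sketch of that offload, assuming the AWS SDK v3 S3 client (Cloudflare R2 is S3-compatible, so the same client works with a custom endpoint). The bucket name and key scheme are just examples:
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { createHash } from "node:crypto";

const s3 = new S3Client({}); // for R2: pass { endpoint, region: "auto", credentials }

// Idea: put the heavy HTML into object storage, keep only the key in the main DB.
export async function storeHtml(url, html) {
  const key = `pages/${createHash("sha256").update(url).digest("hex")}.html`;
  await s3.send(new PutObjectCommand({
    Bucket: "my-crawler-pages", // example bucket name
    Key: key,
    Body: html,
    ContentType: "text/html; charset=utf-8",
  }));
  return key; // save this key in the DB next to the URL record
}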
Important details that people miss:
- Versioning: your parser will change, so store parser_version to reprocess old data
- Idempotency: same URL processed twice should not create duplicates
- Raw vs processed: sometimes you need to keep raw HTML for debugging or re-parsing
- Size limits: pages can be huge, don't save everything forever by default
- Link graph: storing edges (from -> to) is expensive, but it's super useful
And last: storage is where you answer “how do I debug this?”. When a user says “my crawler missed my page”, you need to open the record and see: request URL, final URL, status, response size, parse error, extracted links. If you don't store this, you will be blind.
Planning Your Web Crawler
Before you write code, do some small planning. It will save you a lot of time, because crawler problems are usually product decisions, not technical bugs.
Set Clear Goals
Define what you want to extract and what “done” means. If you skip this step, you will build a crawler that can do everything, and it will cost you a lot.
Start from the output. Do you need full HTML, cleaned text, markdown, screenshots, SEO fields, structured data (JSON-LD), only links, or something else? Every extra field increases complexity because you need parsing rules, storage, and support/debugging later.
Then define constraints:
- Freshness: one-time crawl or re-crawl every day/week?
- Quality: is “good enough” OK or do you need 99% accuracy?
- Performance: how many pages per minute do you expect?
- Budget: how much money per 1k pages is acceptable?
And define failure rules. For example: skip pages that require login, skip pages with CAPTCHA, stop at 429, or retry 3 times. This sounds boring, but it's exactly how you avoid endless edge cases.
Know Your Target Websites
Check how the websites behave (static vs JS, robots rules, rate limits) so you don't build the wrong fetcher/parsing stack.
And here is the main difference in real life. You can have two very different use cases:
- Small known list of websites. The final list is fixed and rarely changes. This is easier because you can test every website in advance, understand if you need JavaScript rendering, and tune parsing rules per site. Usually this list comes from you, or from a customer who knows exactly what they want.
- Unpredictable / user-provided websites. URLs come from users and the list is changing all the time. This is much harder because you must predict everything: broken HTML, heavy JS apps, redirect chains, weird encodings, anti-bot, random downtime. In this case you plan more around safety: limits, fallbacks, retries, good logs, and clear error messages.
Decide on Depth and Scope
Decide how far you go from the seed URLs and where you stop, otherwise the crawl will grow forever. This is not theory: if you crawl without limits you will hit crawl traps like calendars, endless filters, and infinite pagination, and you will waste weeks.
Depth is about how many link hops you allow. For example:
- Depth 0: only seed URL
- Depth 1: all links from seed
- Depth 2+: links from links, etc.
Scope is about what URLs are allowed. Typical scope rules:
- Domain: only example.com (or allow subdomains)
- Path prefix: only /blog/ and skip /admin/
- Query params: allowlist important params, drop tracking (utm_*, fbclid)
- Content types: HTML only, or also PDFs/images
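A minimal scope filter could look like this (the host and path prefix are example values):
// Idea: decide if a URL is inside the crawl scope before it enters the queue.
export function isInScope(url, { host = "example.com", allowSubdomains = true, pathPrefix = "/" } = {}) {
  const u = new URL(url);
  if (u.protocol !== "http:" && u.protocol !== "https:") return false;
  const sameHost = u.hostname === host;
  const isSubdomain = allowSubdomains && u.hostname.endsWith(`.${host}`);
  if (!sameHost && !isSubdomain) return false;
  return u.pathname.startsWith(pathPrefix);
}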
Also set a crawl budget. Even for “crawl the entire site” you still need maximums:
- Max pages per job (hard stop)
- Max time per job (so it doesn't run forever)
- Max depth (so it doesn't go too deep)
And define stop conditions. For example: stop when the queue is empty, stop after N consecutive errors, stop if too many pages look like duplicates, stop if you hit rate limits for too long.
In practice I recommend starting conservative. First run with depth 1-2, only the same domain, and drop most query params. Then look at the results and expand the rules slowly. This is exactly how you avoid “my crawler downloaded 2 million URLs and 95% is garbage”. It all depends on your use case, of course.
Common Challenges You Will Face
If you build a crawler, you will meet all of this. It doesn't matter what language you use. Most problems are not “bugs in code”, they are the reality of the web: websites protect themselves, websites are slow, websites are inconsistent and buggy.
Below I list the most common challenges. You can solve all of them, but every solution adds cost and complexity, so it is better to know about them early.
Rate Limits and Being Polite
Rate limits are the first thing you will hit when you crawl something bigger than 50 pages. Websites don't want you to send 200 requests per second. Even if they don't have strict rules, their servers are not prepared for that.
Sometimes you will see open-source crawlers that promise “10k pages per second”. If that were true for real websites, they would be blocked in seconds by anti-bot protections. High throughput is possible only in a controlled environment (your own websites) or with a lot of expensive infrastructure.
This is why I implemented limits in WebcrawlerAPI and why the notion of “politeness” exists. It means you crawl in a way that looks like normal user traffic:
- Limit concurrency per host (for example 1-5 requests at the same time)
- Add delay between requests (random jitter is good)
- Respect robots.txt crawl-delay (if present)
- Use caching for repeated resources when possible
You also need to handle responses like 429 (Too Many Requests). If you ignore them and continue, you will get blocked. So the typical behavior is:
- backoff and retry later
- reduce concurrency
- read Retry-After header when it exists
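A small sketch of that 429 handling: read Retry-After when it exists, otherwise fall back to exponential backoff with jitter (the numbers are example values):
// Idea: compute how long to wait after a 429 before retrying this host.
export function backoffMs(res, attempt) {
  const retryAfter = res.headers.get("retry-after");
  if (retryAfter && !Number.isNaN(Number(retryAfter))) {
    return Number(retryAfter) * 1000; // server told us how long to wait (seconds)
  }
  // Fallback: exponential backoff with jitter, capped at 60s (example values).
  const base = Math.min(60_000, 1_000 * 2 ** attempt);
  return base / 2 + Math.random() * (base / 2);
}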
An important detail: rate limits are not only about being nice. They are also about stability for your own system. If you start 1000 requests at once, you will run out of sockets, memory, and CPU, and your crawler will crash.
So even in a perfect world with no anti-bot, you still need rate limiting. This is why the URL manager and fetcher must work together: a queue plus a per-host scheduler.
One simple pattern is “per-host queue + delay”. It is not perfect, but it prevents you from blasting a single domain:
const nextAt = new Map(); // host -> unix ms

export async function politeFetch(url) {
  const { host } = new URL(url);
  const waitMs = Math.max(0, (nextAt.get(host) ?? 0) - Date.now());
  if (waitMs) await new Promise((r) => setTimeout(r, waitMs));
  // Idea: keep a small gap between requests to the same host.
  nextAt.set(host, Date.now() + 500);
  return fetch(url);
}
Pages That Need JavaScript
This is the moment when many crawler projects die, because the HTML you get from a normal HTTP request is not always the HTML the user sees in the browser.
Modern websites often ship an empty shell and then render the content with JavaScript. So you request a page, and the response body looks like:
- a <div id="root"></div>
- a bunch of scripts
- and no real content
You have a few options:
- Try to find the API behind the page. Many sites load data from a JSON endpoint. If you can call it directly it will be faster and cheaper than browser rendering.
- Use a headless browser (Playwright/Puppeteer). It works for most cases because it executes JS like a real browser. But it is expensive: each page needs CPU + memory, it is slower, and it is much easier to detect and block.
- Hybrid approach. Start with a normal fetch. If you detect “empty content” or you don't find the required fields, fall back to browser rendering only for those pages. However, this can also be tricky: websites can load a content frame (template) first and then use JS to download the specific data.
In planning you should decide how much JS rendering you can afford. If you render everything in a browser, your 1k-page crawl takes minutes or hours instead of seconds, and your infra cost will be much higher.
And one more problem: JS pages are not deterministic. You will deal with timeouts, loading spinners, cookie banners, late-rendered content, and A/B tests. So you need time budgets and clear rules like “wait for selector X” or “wait for network idle”, otherwise the crawler will hang forever.
A common hybrid pattern: try a plain fetch first, then fall back to Playwright only if the content looks empty:
// Idea: if plain HTML looks empty, render in a browser.
// Real code reuses browser instances instead of launching one per page.
import { chromium } from "playwright";

async function renderInBrowser(url) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle" });
    return await page.content();
  } finally {
    await browser.close();
  }
}

export async function fetchHtmlSmart(url) {
  const html = await fetch(url).then((r) => r.text());
  if (html.length > 2_000) return html; // crude "looks like real content" heuristic
  return renderInBrowser(url);
}
Anti-Bot Protection
Sooner or later you will get blocked. Sometimes on the first request, sometimes after 1000 pages. It depends on the website and how aggressively you crawl.
If you crawl from your local machine, your IP can end up on a blacklist and you will start seeing anti-bot checks very often. On the other side, datacenter IPs are suspicious by default.
Anti-bot is a whole world. It can be simple (rate limit + IP ban) or very advanced (fingerprinting, behavior analysis, challenges). Common signals that you are blocked:
- 403/401 when it should be 200
- 429 even with low traffic
- Redirect to “verify you are human”
- HTML that looks normal, but content is empty (they serve you different page)
- CAPTCHA or JavaScript challenge
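A rough detection sketch based on these signals (the marker list is just a starting point and needs tuning per site):
// Idea: flag responses that look like a block page instead of real content.
const BLOCK_MARKERS = ["captcha", "verify you are human", "access denied"]; // example markers

export function looksBlocked({ status, html }) {
  if (status === 403 || status === 401 || status === 429) return true;
  const text = (html || "").toLowerCase();
  // Short pages containing block markers are usually challenge pages.
  if (text.length < 500 && BLOCK_MARKERS.some((m) => text.includes(m))) return true;
  return false;
}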
The important thing: fighting anti-bot is not only technical. It is also a legal and ethical area, depending on what you crawl. So the first rule is: crawl public pages, respect robots, and don't try to break protections on websites that clearly don't want you there.
From the engineering side, you still need to handle it gracefully. Don't just retry forever. You need to:
- Detect block pages and mark URL as blocked
- Backoff and slow down per host
- Rotate IPs/proxies (if your use case allows it)
- Keep consistent headers and cookies (session)
- Use browser rendering for some sites (but it can be even more detectable)
Also remember that anti-bot is why “10k pages per second” promises are mostly marketing. If you crawl fast, you look like a bot. If you crawl slowly, it can work, but now you need a queue, a scheduler, and good monitoring.
This is why observability is critical. Store the response status, final URL, response size, and a small HTML snippet for debugging. Without it you will not even understand that you are blocked.
Best Practices for a Good Web Crawler
Best practices depend on many factors. If you crawl 5 known websites it is one thing; if you crawl user-provided URLs at scale it is completely different. It also depends on what content you want:
- Only HTML status + headers
- Full HTML snapshot
- Cleaned text or markdown
- Links graph
- SEO fields (title, meta description, canonical, hreflang)
- Images and files (PDFs)
- Structured data (JSON)
- Screenshots (real browser)
But there are a few general pieces of advice that work almost always.
Have a Dashboard
Have a simple dashboard to track progress and understand what is happening right now: queue size, processed pages, errors, slow domains, and current retries. Without it you will debug the crawler by guessing and reading logs all day.
Respect robots.txt Rules
robots.txt is a small text file at the website root (like https://example.com/robots.txt) that describes what crawlers are allowed to access. It is not a security tool. But it is a rule of the web, and if you ignore it you will get blocked faster and you can create legal problems for yourself.
In your crawler you should treat robots as a first-class input. At the entry point, before you start crawling a domain, download and parse its robots.txt and check if your crawler (by User-Agent) is allowed. And it is not enough to check it once: for every single URL you are going to fetch, you should check that path against the robots rules again.
Why again? Because the robots file can allow / but disallow /private/, or allow /blog/ but disallow /search. If you skip per-page checks, your URL manager will happily enqueue forbidden URLs and you will waste requests.
So keep the robots rules cached per host, refresh them sometimes, and make the URL manager use them as a filter.
In Node, it is easiest to use an existing parser (because robots syntax has edge cases):
// Idea: download /robots.txt and check if a URL is allowed for our User-Agent.
// Real code also caches the parsed rules per host and refreshes them sometimes.
import robotsParser from "robots-parser";

export async function isAllowedByRobots(url, userAgent = "MyCrawler/1.0") {
  const robotsUrl = new URL("/robots.txt", url).toString();
  const robotsTxt = await fetch(robotsUrl).then((r) => (r.ok ? r.text() : ""));
  const robots = robotsParser(robotsUrl, robotsTxt);
  return robots.isAllowed(url, userAgent) !== false; // no rule means allowed
}
Handle Errors Without Crashing
The crawler will fail all the time. The network fails, DNS fails, Puppeteer/Playwright fails, websites return 500, HTML is broken, the parser throws an exception. If one error crashes the whole job, you will never crawl anything big.
So you need error handling on every layer. The Fetcher should return a structured error (timeout, DNS, status code, blocked) instead of throwing and killing the process. The Parser should fail per page, not per job. The URL manager should mark the URL as failed with an attempts counter and a reason.
Practical rules that help:
- Timeouts everywhere (connect + response + overall)
- Retries only for retryable errors (timeouts, temporary 5xx), not for 404
- Max attempts per URL (otherwise you will retry forever)
- Circuit breaker per host (if domain is down, pause it)
- Store all error messages/statuses so you can debug later
Also make the crawler idempotent. If a job restarts in the middle, it should continue from the stored state, not start from scratch. This is the difference between a demo crawler and a production crawler.
One small thing that helps a lot is to make the fetcher return structured outcomes instead of throwing:
export async function safeFetch(url) {
  try {
    const res = await fetch(url);
    return { ok: res.ok, status: res.status, html: await res.text() };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}
Avoid Duplicate Pages
Duplicates are a silent killer. You think you crawled 100k pages, but in reality it can be 10k unique pages and 90k duplicates with different URLs. It happens because the web is full of aliases:
- Trailing slash: /page vs /page/
- Tracking params: ?utm_source=...
- Session params: ?sid=...
- Sort/filter params that generate endless combinations
- Same content on multiple paths
- Redirects
The first line of defense is URL normalization in the URL manager. Strip the #hash, normalize the trailing slash, lower-case the host, remove known tracking params, and sort query params. Then dedupe by normalized URL.
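An extended version of the normalizeUrl from above could look like this (the tracking param list is an example, not a complete one):
// Example list of tracking params to drop; extend it for your use case.
const TRACKING_PARAMS = ["fbclid", "gclid", "sid"];

export function normalizeUrlStrict(input) {
  const u = new URL(input);
  u.hash = "";
  u.hostname = u.hostname.toLowerCase();
  // Drop tracking params (utm_* plus the list above), then sort the rest.
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.includes(key)) u.searchParams.delete(key);
  }
  u.searchParams.sort();
  // Normalize the trailing slash (keep "/" for the root path).
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) u.pathname = u.pathname.slice(0, -1);
  return u.toString();
}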
The second line is canonicals and redirects. After the fetch, save the mapping requested URL -> final URL (after redirects), then read the canonical from the HTML. Use the canonical as the primary key when it exists.
The third line is content-level dedupe. Sometimes URLs are different but the content is the same. So store a content hash (for example a hash of the cleaned text) and detect duplicates. It is very helpful for pagination traps.
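A minimal content hash sketch using Node's built-in crypto, hashing a lightly normalized version of the text:
import { createHash } from "node:crypto";

// Idea: hash the cleaned text so different URLs with the same content collide.
export function contentHash(text) {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}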
If you do these three steps, your crawl becomes cheaper, faster, and your data is cleaner.
Start Simple
Start with a small crawler that actually works end-to-end: one seed URL, fetch HTML, parse a few fields, extract links, store the result. No proxies, no headless browser, no distributed queue.
When this simple version is stable, you will see real problems in the logs and data. Then you can add features one by one: retries, better normalization, per-host limits, JS rendering, more parsers. If you start from an “enterprise crawler”, you will spend months and still not have something you can trust.
Scale Step by Step
Scaling is not “add more servers”. First scale your correctness and observability.
Do it in steps:
- Crawl 100 pages and verify output manually
- Crawl 10k pages and watch error types, duplicates, and storage size
- Only after that go to 1M pages with proper limits and monitoring
Every scale level will reveal new problems: memory leaks, slow parsers, database hot spots, and anti-bot blocks. If you jump directly to a big crawl, you will just burn money and still not know what is wrong.
Build It Yourself or Use an API?
You can build a crawler yourself. It is possible. But now you see what it really means: fetcher, parser, URL manager, storage, retries, rate limits, robots rules, JS rendering, anti-bot, monitoring.
If you have a fixed list of websites and you crawl them every day, building a custom crawler can make sense. You can tune it, you can control costs, and you can get exactly the data format you need in a predictable way.
But if your use case is “a user gives me any URL and I should crawl it”, that is a different level. You will spend a lot of time on edge cases and infra, not on your product. In this case using an existing crawling API is often cheaper and faster.
My rule of thumb:
- Build it if crawling is core feature and you have people to maintain it.
- Use API if crawling is supporting feature and you just need reliable data.
Summary
A web crawler is not hard to start, but it is hard to make reliable. Start with a simple version, define goals and scope, respect robots and rate limits, and build good visibility into what the crawler is doing. And before you invest months, be honest: maybe an existing crawler API is already good enough for your use case.