Table of Contents
- 1. Crawlee
- When Crawlee fits
- Key Features
- Code Example
- Potential Drawbacks
- 2. Crawl4AI
- When Crawl4AI fits
- Key Features
- Code Example
- Potential Drawbacks
- 3. LLM Scraper
- When LLM Scraper fits
- Key Features
- Code Example
- Potential Drawbacks
- 4. Katana
- When Katana fits
- Key Features
- Usage Example
- Potential Drawbacks
- 5. GPT-Crawler
- When GPT-Crawler fits
- Key Features
- Code Example
- Potential Drawbacks
Looking for an open source web crawler you can run yourself? This guide covers the best self-hostable options: libraries and tools you install, run, and maintain on your own infrastructure.
Not interested in managing infrastructure? If you want a production-ready crawling API without the operational overhead, see our managed web crawling services guide instead. Managed services handle proxies, retries, anti-bot handling, and scaling for you.
Open source tools make sense when you need full control over the crawling process, have specific requirements that a hosted API won't meet, or simply prefer not to send data through a third-party service. The tradeoff is real though: you own the infrastructure, the maintenance, and every edge case that comes up.
1. Crawlee

Crawlee is an open source web scraping and browser automation library built and maintained by Apify. It's available in both JavaScript/TypeScript and Python. 22k+ GitHub stars. Apache 2.0 license.
The core idea is simple: Crawlee handles the infrastructure so you can focus on the scraping logic. Proxy rotation, session management, human-like browser fingerprints, automatic link queuing - all built in. What it doesn't do is decide what to extract or how to structure the output. That part is yours to write.
One thing worth knowing: Crawlee is built by Apify, the same team behind the Apify platform. If your project grows beyond what you want to self-host, there's a natural path to running your Crawlee scrapers as Apify Actors in their cloud - same code, managed execution.
When Crawlee fits
Crawlee is a good fit when:
- You want full code-level control over the crawling logic
- Your output format varies - sometimes JSON, sometimes markdown, sometimes raw HTML - and you need to decide that per-scraper
- You're comfortable managing your own infrastructure (a server, Docker, a cron job)
- You don't need structured extraction by default - you're fine writing selectors or processing raw content yourself
- You want to build something repeatable and maintainable, not a one-off script
It's not the right tool if you want to hand it a URL and get clean markdown back without writing any code. For that, use a crawl API.
Key Features
- HTTP crawling (CheerioCrawler) and headless browser crawling (PlaywrightCrawler, PuppeteerCrawler) in the same library
- Automatic link extraction and request queue management
- Built-in proxy rotation and session handling
- Human-like browser fingerprints out of the box - helps avoid detection without extra config
- Export to JSON, CSV, or any custom format
- Available in JavaScript/TypeScript (Node.js 16+) and Python
- Forever free, Apache 2.0
Code Example
Here's a basic Crawlee crawler in Node.js that crawls a site, logs page titles, and saves results to a dataset:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Crawling: ${request.loadedUrl}`);
        // Save whatever you need - title, HTML, markdown, custom extraction
        await Dataset.pushData({ title, url: request.loadedUrl });
        // Automatically finds and queues links on the page
        await enqueueLinks();
    },
});
await crawler.run(['https://example.com']);
Install and run:
npx crawlee create my-crawler
# or manually
npm install crawlee playwright
The Python version works similarly with BeautifulSoupCrawler for lightweight HTTP crawling or PlaywrightCrawler for JS-heavy sites.
Potential Drawbacks
- You write and maintain all the code - selectors, output format, storage, scheduling
- No built-in markdown or LLM-ready output - you handle content processing yourself
- Infrastructure is your responsibility: server, scaling, monitoring, retries at the job level
- Steeper learning curve than calling a crawl API
Crawlee is the strongest option on this list if you're building a serious, maintainable scraping system and want library-level control. If you're prototyping or need something working in an afternoon, a managed API will be faster.
2. Crawl4AI

Crawl4AI is an open source Python crawler built specifically for AI workflows. 62k+ GitHub stars, actively maintained, no cloud version - you run it yourself. Give it a URL, get back clean markdown. Simple model, no API keys required.
The project moves fast. Recent releases added crash recovery and a prefetch mode that's 5-10x faster on large jobs. There's an active community (50k+ developers) and the repo is sponsored, which suggests the maintainer isn't going away soon.
When Crawl4AI fits
- You want clean markdown output from URLs but don't want to pay for a managed service
- You're already working in Python
- You need to self-host for data privacy or compliance reasons
- You want more extraction control than a simple crawl API provides - CSS selectors, semantic filtering, JS execution
- You're building something at scale and want to avoid per-page API costs
No cloud version means no managed proxies, no anti-bot handling as a service, no SLA. You handle all of that.
Key Features
- Clean markdown output from any URL - no API keys, no subscriptions
- Multiple extraction strategies: CSS/XPath selectors, cosine similarity semantic matching, BM25 relevance filtering
- JavaScript execution via Playwright - handles dynamic, JS-heavy pages
- Session reuse, proxy support, stealth modes, persistent browser authentication
- Parallel processing built in - designed for high-volume crawling
- Deploy as CLI (crwl), Docker container, REST API, or use as a Python library
- Fully open source, no mandatory subscriptions
Code Example
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/"
        )
        print(result.markdown)  # Clean markdown, ready to use

asyncio.run(main())
Install:
pip install -U crawl4ai
crawl4ai-setup # installs Playwright browsers
crawl4ai-doctor # verifies the setup
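The BM25 relevance filtering listed under key features is worth understanding on its own: it ranks content chunks against a query so irrelevant boilerplate can be dropped. This is a from-scratch sketch of the classic BM25 formula, not Crawl4AI's implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25: score each tokenized doc against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency per query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "crawl the web and extract clean markdown".split(),
    "recipes for baking sourdough bread".split(),
]
scores = bm25_scores("crawl markdown".split(), docs)
print(scores)  # the crawling doc scores higher than the baking doc
```

Crawl4AI wires this kind of scoring into its content filters so you pass a query string instead of hand-tuning selectors.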
Potential Drawbacks
- Python only - no JavaScript/TypeScript SDK
- No managed infrastructure: you're responsible for proxies, anti-bot handling, scaling, and uptime
- Self-hosting adds operational overhead compared to a crawl API
- Still pre-1.0 (v0.8.x) - API can change between releases
Crawl4AI is the strongest Python option on this list if you want AI-ready output without a managed service. It won't suit production use cases where reliability and anti-bot handling matter at scale, but for research, prototyping, or internal tooling it's hard to beat.
3. LLM Scraper

LLM Scraper is a TypeScript library that uses an LLM to extract structured data from web pages. You define a Zod schema describing what you want, point it at a page, and the model figures out the extraction. 6.2k stars, actively maintained, last commit March 2026.
The key difference from everything else on this list: LLM Scraper doesn't use CSS selectors or XPath. It sends the page content to a language model and asks it to fill in your schema. That means it handles messy, inconsistent HTML well - but it also means every extraction costs an LLM API call.
It's built on Playwright and supports OpenAI, Claude, Gemini, Llama, and others through the Vercel AI SDK.
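The schema-first idea is simple enough to sketch without the library. This hypothetical Python snippet (stdlib only, not LLM Scraper's API) shows the core loop: ask a model for JSON matching a declared shape, then validate before trusting it:

```python
import json

# Hypothetical schema: field name -> expected type, loosely mirroring a Zod object
SCHEMA = {"title": str, "points": int, "by": str}

def validate(record: dict, schema: dict) -> bool:
    """True if record has exactly the declared fields, each with the right type."""
    return (set(record) == set(schema)
            and all(isinstance(record[k], t) for k, t in schema.items()))

# Stand-in for a real LLM response asked to fill the schema from page content
llm_output = '{"title": "Show HN: a crawler", "points": 120, "by": "someuser"}'
record = json.loads(llm_output)
print(validate(record, SCHEMA))  # True
```

LLM Scraper does the equivalent with Zod, which also gives you the TypeScript types for free.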
When LLM Scraper fits
- You need structured typed data from pages with inconsistent or complex HTML
- You don't want to write and maintain CSS selectors
- You're already using an LLM in your stack and the per-request cost is acceptable
- You're building in TypeScript and want full type safety on the extracted output
- The target sites are few and the extraction logic is hard to express as selectors
It's not a general-purpose site crawler. It loads one page at a time via Playwright, extracts from it, and returns typed data. There's no built-in link following or bulk crawling.
Key Features
- Schema-based extraction via Zod - define what you want, get typed TypeScript output back
- Supports any LLM through Vercel AI SDK: OpenAI, Claude, Gemini, Llama, Qwen
- 6 content formats: HTML, raw HTML, markdown, plain text, screenshots (for multimodal models), or custom
- Streaming support - get partial results as the model processes
- Code generation mode - outputs reusable Playwright scripts instead of just data
- TypeScript, MIT license
Code Example
import { chromium } from 'playwright'
import LLMScraper from 'llm-scraper'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'
const browser = await chromium.launch()
const scraper = new LLMScraper(openai('gpt-4o'))
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')
// Define exactly what you want - the LLM does the extraction
const schema = z.object({
  top: z.array(z.object({
    title: z.string(),
    points: z.number(),
    by: z.string(),
  })).length(5).describe('Top 5 stories on Hacker News'),
})
const { data } = await scraper.run(page, schema, { format: 'html' })
console.log(data.top)
await browser.close()
Install:
npm install zod playwright llm-scraper @ai-sdk/openai
Potential Drawbacks
- Every extraction costs an LLM API call - this adds up at scale
- Slower than selector-based extraction - you're waiting on an LLM response each time
- No built-in crawling - you handle page navigation and link following yourself
- Requires a Playwright browser running locally - adds setup overhead
- 6k stars is small compared to Crawlee or Crawl4AI - smaller community, fewer examples
LLM Scraper is the right tool when the extraction logic is too messy or variable for selectors and you'd rather let a model handle it. If you need to crawl many pages cheaply and fast, the per-call LLM cost will make it impractical - use Crawlee or Crawl4AI instead.
4. Katana

Katana is a crawling and spidering framework by ProjectDiscovery, the team behind tools like Nuclei and httpx. Written in Go. 16k GitHub stars. Actively maintained at v1.5.0.
It's the most different tool on this list. Katana is not about extracting content - it's about discovering URLs and endpoints. You give it a domain, it maps every URL, endpoint, and path it can find, including ones buried in JavaScript files. The primary audience is security researchers and pentesters mapping an attack surface, not developers building data pipelines.
That said, it's useful in a content crawling context too: if you need a fast, complete list of all URLs on a site before deciding what to scrape, Katana is one of the best tools for that job.
When Katana fits
- You need to map all URLs and endpoints on a site quickly
- You want to discover hidden endpoints in JavaScript files
- You're doing security research, bug bounty, or reconnaissance
- You need a fast CLI tool you can pipe into other tools (httpx, nuclei, etc.)
- You want a Go binary with no runtime dependencies to install
Not the right fit if you need to extract page content, generate markdown, or get structured data out of pages.
Key Features
- Dual crawling modes: fast HTTP-based and headless browser (for JS-heavy sites)
- JavaScript endpoint extraction - parses JS files to find hidden URLs and API paths
- Scope management - crawl within specific domains, subdomains, or regex-defined boundaries
- Output to stdout, file, or JSON - easy to pipe into other tools
- Experimental form filling for discovering endpoints behind forms
- CAPTCHA solving support (reCAPTCHA, Turnstile, hCaptcha via capsolver)
- Single binary, no runtime required - install via Go, Homebrew, or Docker
Usage Example
# Basic crawl - outputs all discovered URLs
katana -u https://example.com
# Headless mode for JavaScript-heavy sites
katana -u https://example.com -headless
# Output as JSONL
katana -u https://example.com -jsonl
# Pipe from a list of domains
cat domains.txt | katana -o output.txt
Install:
go install github.com/projectdiscovery/katana/cmd/katana@latest
# or
brew install katana
Potential Drawbacks
- Not a content extractor - outputs URLs, not page content or markdown
- Go-only - no library to import into your Python or TypeScript project
- CLI-first design, less suited for embedding in an application
- Security tooling origins mean features are oriented toward recon, not data extraction
Katana is a great addition to a crawling pipeline when you need to discover all URLs on a site fast and reliably. Pair it with something like Crawl4AI if you need both URL discovery and content extraction.
5. GPT-Crawler

GPT-Crawler is a TypeScript tool by Builder.io that crawls a website and outputs a single JSON file you can upload to OpenAI to create a custom GPT. That's the whole pitch. 18k+ GitHub stars.
The project is no longer actively maintained. The last meaningful development was in 2023-2024. There are open issues and pull requests that haven't been touched in over a year. If you use it, expect to fix things yourself.
That said, it's still functional for its specific use case and the code is simple enough to fork and adapt.
When GPT-Crawler fits
- You want to build a custom GPT or OpenAI assistant from a website's content
- You have a small, well-structured site with consistent HTML
- You're comfortable with TypeScript and willing to maintain a fork if things break
- You need a one-time output, not an ongoing crawling pipeline
It's not a general-purpose crawler. It doesn't do markdown output for arbitrary use cases, it doesn't handle complex anti-bot situations, and there's no active community fixing issues.
Key Features
- Crawls a site from a starting URL, follows links matching a pattern you configure
- Extracts content using CSS selectors - you tell it exactly which element to pull
- Outputs a single JSON file formatted for OpenAI's custom GPT upload
- Runs locally via Node.js or Docker, or as a small Express API
- TypeScript, Apache 2.0 license
Code Example
Configuration lives in config.ts:
export const defaultConfig: Config = {
  url: "https://docs.example.com/",
  match: "https://docs.example.com/**",
  selector: `.docs-content`, // CSS selector for main content
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
Then run:
git clone https://github.com/builderio/gpt-crawler
npm install
npm start
The output is an output.json file you upload directly to OpenAI when creating a custom GPT.
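Before uploading, it's worth a quick sanity check on the file size and page count, since OpenAI caps upload sizes. A stdlib-only sketch (the title/url/html field names are assumptions about the record shape; check them against your own file):

```python
import json

# Stand-in for reading output.json: GPT-Crawler writes an array of page objects
sample = '[{"title": "Intro", "url": "https://docs.example.com/", "html": "<p>hi</p>"}]'
pages = json.loads(sample)

total_chars = sum(len(p.get("html", "")) for p in pages)
print(f"{len(pages)} pages, {total_chars} characters of content")
```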
Potential Drawbacks
- Abandoned - no active maintenance, issues go unanswered
- Very narrow use case: OpenAI custom GPT creation only
- No built-in proxy support, and JavaScript-heavy sites may fail silently
- CSS selector approach breaks if the target site changes its HTML structure
- Not suitable for production pipelines or anything requiring reliability
GPT-Crawler is worth using if your exact need is "turn this documentation site into a custom GPT" and you accept that you're on your own if it breaks. For anything more general, use Crawlee or a managed crawl API.