Cloudflare is one of the biggest cloud providers in the world. Following the AI trend, Cloudflare has been expanding into AI and LLM infrastructure, introducing Vectorize, Workers AI and AI Gateway.
Recently, they launched a fully managed RAG service called AutoRAG. There are several ways to fill AutoRAG with data, and one of them is a Web Crawler.
However, at the time of writing this blog post, the Cloudflare Web Crawler has not been released yet.
As a workaround, Cloudflare offers several guides on how to get website content into AutoRAG in Markdown format.
1. Use a Worker with URL parameters
The idea is to have a serverless function that accepts a URL, runs a headless browser and gets the content of the webpage:
import puppeteer from "@cloudflare/puppeteer";
// Define our environment bindings
interface Env {
MY_BROWSER: any;
HTML_BUCKET: R2Bucket;
}
// Define request body structure
interface RequestBody {
url: string;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// Only accept POST requests
if (request.method !== "POST") {
return new Response("Please send a POST request with a target URL", {
status: 405,
});
}
// Get URL from request body
const body = (await request.json()) as RequestBody;
// Note: Only use this parser for websites you own
const targetUrl = new URL(body.url);
// Launch browser and create new page
const browser = await puppeteer.launch(env.MY_BROWSER);
const page = await browser.newPage();
// Navigate to the page and fetch its html
await page.goto(targetUrl.href);
const htmlPage = await page.content();
// Create filename and store in R2
const key = targetUrl.hostname + "_" + Date.now() + ".html";
await env.HTML_BUCKET.put(key, htmlPage);
// Close browser
await browser.close();
// Return success response
return new Response(
JSON.stringify({
success: true,
message: "Page rendered and stored successfully",
key: key,
}),
{
headers: { "Content-Type": "application/json" },
},
);
},
} satisfies ExportedHandler<Env>;
It uses Cloudflare's Browser Rendering platform (via the @cloudflare/puppeteer package) to navigate to the webpage and grab its content. The rendered HTML then goes into an R2 bucket that AutoRAG indexes. Read the full guide here.
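Once deployed, the Worker is invoked with a single POST request. Here is a minimal sketch of the client side (the Worker URL below is a placeholder for your own deployment):
// Hypothetical Worker URL; replace with your own deployment
const workerUrl = "https://html-renderer.example.workers.dev";
const res = await fetch(workerUrl, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com" }),
});
// The Worker responds with the R2 key of the stored HTML, for example:
// { success: true, message: "Page rendered and stored successfully", key: "example.com_1712345678901.html" }
console.log(await res.json());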
2. Build a simple web crawler using Queues and Browser Rendering
The idea is similar to the previous one: use serverless functions to run a headless browser. But to crawl more than a single page, you now have to set up KV and a Queue and push all discovered links there (a sketch of that part follows the snippet below):
import puppeteer from "@cloudflare/puppeteer";

// The browser is launched elsewhere in the Worker and reused across pages
let browser: puppeteer.Browser;

type Result = {
  numCloudflareLinks: number;
  screenshot: ArrayBuffer;
};

const crawlPage = async (url: string): Promise<Result> => {
  const page = await browser.newPage();
  await page.goto(url, {
    waitUntil: "load",
  });
  // Count the links on the page that point to cloudflare.com
  const numCloudflareLinks = await page.$$eval("a", (links) => {
    return links.filter((link) => {
      try {
        return new URL(link.href).hostname.includes("cloudflare.com");
      } catch {
        return false;
      }
    }).length;
  });
  // Set the viewport before taking a full-page screenshot
  await page.setViewport({
    width: 1920,
    height: 1080,
    deviceScaleFactor: 1,
  });
  const screenshot = ((await page.screenshot({ fullPage: true })) as Buffer)
    .buffer;
  await page.close();
  return {
    numCloudflareLinks,
    screenshot,
  };
};
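The Queue and KV pieces aren't shown above. As a rough sketch (not part of the Cloudflare tutorial), the producer side could look like this, with CRAWLER_QUEUE and SEEN_URLS as hypothetical binding names:
// Rough sketch: enqueue newly discovered links and use KV
// to avoid crawling the same URL twice.
// CRAWLER_QUEUE and SEEN_URLS are hypothetical binding names.
interface CrawlerEnv {
  CRAWLER_QUEUE: Queue<string>;
  SEEN_URLS: KVNamespace;
}

async function enqueueLinks(links: string[], env: CrawlerEnv): Promise<void> {
  for (const link of links) {
    // Skip URLs that were already pushed to the queue
    if (await env.SEEN_URLS.get(link)) continue;
    await env.SEEN_URLS.put(link, "1");
    await env.CRAWLER_QUEUE.send(link);
  }
}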
This is much closer to a real web crawler than the previous approach. Read the full article here.
Problems with the Cloudflare Web Crawler approach
Since Cloudflare offers a basic solution, it only covers primitive scenarios. It works well if you crawl only your own website, the site has no anti-bot protection (or you can switch it off temporarily), and it uses only basic JavaScript.
If you need to crawl various websites, you have to write a lot of code: manage the infrastructure yourself, deal with Puppeteer, maintain a queue, store links, deduplicate content, clean the data, and, of course, find a proxy to bypass the Cloudflare Turnstile CAPTCHA.
Yes, Cloudflare's own browser can't bypass Cloudflare's CAPTCHA.
Most likely, if your business is ready to pay to save time and get crawled website data as Markdown, you don't want all this hassle with infrastructure and proxies.
WebcrawlerAPI vs Cloudflare Web Crawler
This is where WebcrawlerAPI can help you. WebcrawlerAPI lets you get website data with a single API call.
For example, the JavaScript integration takes two steps:
- Get an access key in the Dashboard
- Use this JS snippet:
import webcrawlerapi from "webcrawlerapi-js";

async function main() {
  const client = new webcrawlerapi.WebcrawlerClient(
    "YOUR API ACCESS KEY HERE",
  );
  const response = await client.crawl({
    items_limit: 3,
    url: "https://books.toscrape.com/",
    scrape_type: "markdown",
  });
  console.log(response);
}

main().catch(console.error);
Or Python:
# pip install webcrawlerapi-python
from webcrawlerapi.client import WebCrawlerAPI
client = WebCrawlerAPI("<YOUR API KEY HERE>")
# Create a new crawl job
job = client.crawl(
url="https://books.toscrape.com",
items_limit=10,
scrape_type="markdown",
)
print(f"Created job ID: {job.id}")
# Output item details
for item in job.job_items:
print(f"\nPage: {item.title}")
print(f"URL: {item.original_url}")
print(f"Item status: {item.status}")
print(f"Error code: {item.error_code}")
content = item.content
if content:
print(f"Content preview: {content[:100]}")
else:
print("Content not available or item not done")
That's it. No need to worry about links, content, blocking or headless browsers: WebcrawlerAPI handles all of that for you.
Cloudflare has good infrastructure for AI and a solid basic solution. It fits well if you want to scrape your own website, and it can be very cheap or even free because Cloudflare has a generous free plan.
When you need a more advanced solution that crawls various websites and deals with anti-bot protection, a more specialised service like WebcrawlerAPI can be a much better fit.