What data does a web crawler collect?

Web crawling

Answer

Common crawler data includes URLs, HTTP status codes, response headers, page content, metadata, extracted links, and fetch timestamps. Many systems also store canonical URLs, redirect chains, and content hashes for deduplication. If JavaScript rendering is needed, a crawler can capture the final rendered DOM or even screenshots, and some pipelines attach extracted fields or structured data for downstream use. The exact fields depend on the purpose of the crawl; keeping a consistent record schema, like the sketch below, makes analysis and monitoring easier.
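As a minimal sketch of what such a schema might look like, here is one possible Python record type covering the fields above. The class name, field names, and helper function are illustrative assumptions, not a standard; real pipelines vary widely.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative schema; the names here are assumptions, not a standard.
@dataclass
class CrawlRecord:
    url: str                      # URL as requested
    canonical_url: str            # canonical URL, if the page declares one
    status_code: int              # HTTP status of the final response
    headers: dict[str, str]       # response headers worth keeping
    content_hash: str             # SHA-256 of the body, for deduplication
    links: list[str]              # outbound links extracted from the page
    redirect_chain: list[str] = field(default_factory=list)
    fetched_at: str = ""          # ISO-8601 crawl timestamp

def make_record(url: str, status: int, headers: dict[str, str],
                body: bytes, links: list[str],
                redirects: list[str] | None = None) -> CrawlRecord:
    """Build a record from one fetch. Hashing the body lets the pipeline
    skip pages whose content is identical to something already stored."""
    return CrawlRecord(
        url=url,
        canonical_url=url,  # would be replaced by <link rel="canonical"> if present
        status_code=status,
        headers=headers,
        content_hash=hashlib.sha256(body).hexdigest(),
        links=links,
        redirect_chain=redirects or [],
        fetched_at=datetime.now(timezone.utc).isoformat(),
    )

# Example with dummy data:
record = make_record(
    "https://example.com/", 200,
    {"content-type": "text/html"},
    b"<html>...</html>",
    ["https://example.com/about"],
)
print(record.content_hash[:12], record.fetched_at)
```

Keeping every fetch in one flat record like this makes it straightforward to load crawl output into a database or log pipeline and to monitor fields such as status codes and duplicate rates over time.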