Table of Contents
- Quick comparison
- What HTML is good at
- What cleaned text is good at
- Use cases for crawling and RAG ingestion
- When HTML should be used
- When cleaned text should be used
- Practical tradeoffs (what tends to break)
- Link-heavy pages
- Layout-heavy pages
- Node.js snippet: Strip HTML tags into rough cleaned text
- Conclusion
HTML and cleaned text sit at opposite ends of the output spectrum. HTML keeps almost everything (including markup). Cleaned text keeps only readable text (and usually drops most structure).
If Markdown is also an option, see HTML vs Markdown and Cleaned Text vs Markdown.
Quick comparison
| Topic | HTML | Cleaned Text |
|---|---|---|
| Best for | Fidelity and re-processing later | RAG, embeddings, fast reading |
| Keeps links | Yes (as <a href> etc.) | Usually no (or links are flattened) |
| Keeps structure | Yes (DOM) | Limited |
| Size | Larger | Smaller |
| Common failure | Noise: scripts, nav, repeated UI | Context loss: lists, tables, link targets |
What HTML is good at
HTML is usually preferred when:
- Maximum fidelity is needed
- The page must be re-parsed later with different rules
- Link targets, attributes, and DOM structure matter
Typical crawling cases:
- Product pages where microdata or attributes are needed
- Pages where selectors will be applied later
- Audits where evidence must be preserved
If extracted fields are the goal, a structured format should be used after parsing, as covered in Best Prompt Data.
What cleaned text is good at
Cleaned text is usually preferred when:
- The content will be embedded for RAG
- Token cost should be reduced
- Navigation and boilerplate should be removed
For how cleaned text compares with Markdown, see Cleaned Text vs Markdown.
Use cases for crawling and RAG ingestion
When HTML should be used
HTML is usually the safer choice when:
- Re-processing is expected (parsing rules will change)
- Link URLs must be preserved exactly
- Tables and lists must be reconstructed later
A practical downside is that raw HTML often carries a lot of noise (scripts, navigation, repeated UI), so boilerplate has to be removed in a second step before the content is usable.
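That second step can be sketched roughly as follows. This is a minimal, regex-based illustration rather than a robust cleaner: the function name `stripBoilerplate` and the tag list are assumptions, and nested elements with the same tag name are not handled. A real HTML parser is likely the safer choice outside of quick tests.

```javascript
// Sketch: drop common boilerplate containers from stored HTML.
// Assumption: this tag list covers the noise on the target site.
const BOILERPLATE_TAGS = ["script", "style", "noscript", "nav", "header", "footer", "aside"];

function stripBoilerplate(html) {
  // Remove HTML comments first so they cannot hide tag-like text.
  let out = html.replace(/<!--[\s\S]*?-->/g, "");
  for (const tag of BOILERPLATE_TAGS) {
    // Non-greedy match from an opening tag to the nearest closing tag
    // of the same name. Same-name nesting is not handled correctly.
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    out = out.replace(block, "");
  }
  return out;
}

const sample =
  "<header><a href='/'>Home</a></header>" +
  "<main><p>Pricing starts at $5.</p></main>" +
  "<footer>Copyright Example</footer>";

console.log(stripBoilerplate(sample));
// <main><p>Pricing starts at $5.</p></main>
```

The remaining markup can still be re-parsed later, which keeps the main advantage of storing HTML.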
When cleaned text should be used
Cleaned text is usually the safer choice when:
- The primary goal is retrieval over the readable content
- Chunking will be done without relying on DOM structure
- Storage and token costs must be kept down
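Because cleaned text has no DOM to split on, chunking usually falls back to size-based windows. A minimal sketch, assuming fixed character windows with overlap; `chunkText` and its default sizes are hypothetical and would need tuning for the embedding model in use.

```javascript
// Sketch: fixed-size character chunks with overlap, no DOM required.
// chunkSize and overlap are illustrative defaults, not recommendations.
function chunkText(text, chunkSize = 800, overlap = 100) {
  if (overlap >= chunkSize) {
    throw new RangeError("overlap must be smaller than chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 1));
// [ 'abcd', 'defg', 'ghij' ]
```

Splitting on sentence or paragraph boundaries is a common refinement, but it depends on how much whitespace survives the cleaning step.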
A practical downside is that important structure can be lost, especially:
- Tables (column meaning is lost)
- Lists (nesting can be flattened)
- Links (anchor text remains but target URLs can be dropped)
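One way to soften the link problem is to inline each target next to its anchor text before the tags are stripped. A rough sketch, assuming quoted `href` values; `flattenLinks` is a hypothetical helper, not part of any library.

```javascript
// Sketch: rewrite <a href="...">label</a> as "label (url)" so the target
// survives a later tag-stripping pass. Unquoted href values are not handled.
function flattenLinks(html) {
  const anchor = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a>/gi;
  return html.replace(anchor, (_match, _quote, href, inner) => {
    // Strip any nested tags from the label and normalize whitespace.
    const label = inner.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    return label ? `${label} (${href})` : href;
  });
}

console.log(flattenLinks('See <a class="x" href="/docs/install">the <b>install</b> guide</a>.'));
// See the install guide (/docs/install).
```

The output is still plain text, so it can pass through the same embedding pipeline, at the cost of a few extra tokens per link.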
If structure must be preserved for readability, Markdown may be a better fit; see HTML vs Markdown.
Practical tradeoffs (what tends to break)
Link-heavy pages
If a page is mostly a set of links (directories, documentation sidebars), cleaned text can become hard to use because the URL targets are lost. HTML keeps every target in its href attribute.
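For pages like these, the links can be pulled out of the stored HTML as a separate list before any text cleaning. The sketch below is an assumption-laden illustration: `extractLinks` and the base URL are hypothetical, only quoted `href` values are matched, and relative targets are resolved with the standard `URL` constructor.

```javascript
// Sketch: collect { text, url } pairs from anchors in an HTML string.
function extractLinks(html, baseUrl) {
  const anchor = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a>/gi;
  const links = [];
  for (const match of html.matchAll(anchor)) {
    const text = match[3].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    let url;
    try {
      // Resolve relative hrefs against the page URL.
      url = new URL(match[2], baseUrl).href;
    } catch {
      continue; // Skip hrefs that cannot be parsed as URLs.
    }
    links.push({ text, url });
  }
  return links;
}

const sidebar =
  '<ul><li><a href="/a">Alpha</a></li>' +
  '<li><a href="https://other.example/b">Beta</a></li></ul>';

console.log(extractLinks(sidebar, "https://docs.example.com/index.html"));
```

Keeping this list alongside the cleaned text lets a retrieval result point back to the exact targets.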
Layout-heavy pages
If a page is mostly layout (menus, cards, footers), HTML can be too noisy. Cleaned text usually performs better for RAG, because the noise is removed.
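Whether a page is layout-heavy can be estimated roughly by comparing the length of the visible text with the length of the full HTML. The heuristic and the name `textRatio` are assumptions for illustration only; no particular threshold is implied here.

```javascript
// Sketch: share of an HTML string that survives as visible text (0 to 1).
// A low value suggests the page is mostly markup and layout.
function textRatio(html) {
  if (html.length === 0) return 0;
  const text = html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, "") // drop script/style blocks
    .replace(/<[^>]+>/g, " ")                          // drop remaining tags
    .replace(/\s+/g, " ")
    .trim();
  return text.length / html.length;
}

const menuPage = '<nav><a href="/a">A</a><a href="/b">B</a></nav>';
const articlePage = "<p>This paragraph has much more readable text than markup.</p>";

console.log(textRatio(menuPage) < textRatio(articlePage));
// true
```

A ratio like this only hints at noise; it says nothing about whether the surviving text is the content that matters.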
Node.js snippet: Strip HTML tags into rough cleaned text
This is intentionally rough. It is only suitable as a fallback or a quick test.
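```js
// Node 18+, run as an ES module (.mjs or "type": "module") for top-level await.
// Rough HTML-to-text conversion without external dependencies.
import { readFile } from "node:fs/promises";

const html = await readFile("page.html", "utf8");

// Remove comments and script/style blocks, including their contents
let text = html
  .replace(/<!--[\s\S]*?-->/g, "")
  .replace(/<script\b[\s\S]*?<\/script>/gi, "")
  .replace(/<style\b[\s\S]*?<\/style>/gi, "");

// Replace remaining tags with spaces, then normalize whitespace.
// HTML entities such as &amp; are left undecoded.
text = text.replace(/<[^>]+>/g, " ");
text = text.replace(/\s+/g, " ").trim();

// Print a preview of the first 600 characters
console.log(text.slice(0, 600));
```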
Conclusion
- HTML is usually selected when fidelity and re-processing matter.
- Cleaned text is usually selected when RAG and readable content are the goal.
- A common pattern is: HTML is stored for traceability, and cleaned text is produced for embeddings.
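The pattern in the last bullet can be sketched as a single record that carries both forms. The field names and `buildRecord` are hypothetical, and the cleaning step reuses the same rough regex approach as the Node.js snippet, so the same limitations apply.

```javascript
// Sketch: keep the raw HTML for traceability and a cleaned-text
// version for embeddings in one record.
function buildRecord(url, html, fetchedAt = new Date().toISOString()) {
  const text = html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, "") // drop script/style blocks
    .replace(/<[^>]+>/g, " ")                          // drop remaining tags
    .replace(/\s+/g, " ")
    .trim();
  return { url, fetchedAt, rawHtml: html, text };
}

const record = buildRecord(
  "https://example.com/p",
  "<style>p{}</style><p>Hi <b>there</b></p>",
  "2024-01-01T00:00:00.000Z"
);

console.log(record.text);
// Hi there
```

If the cleaning rules change later, `text` can be regenerated from `rawHtml` without crawling the page again.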
If a single default is needed for RAG, HTML vs Cleaned Text vs Markdown can serve as the tie-breaker.