HTML vs Cleaned Text: Choosing the Right Output Format

HTML and cleaned text sit at opposite ends of the output spectrum. HTML keeps almost everything (including markup). Cleaned text keeps only readable text (and usually drops most structure).

If Markdown is also an option, see HTML vs Markdown and Cleaned Text vs Markdown.

Quick comparison

Topic	HTML	Cleaned Text
Best for	Fidelity and re-processing later	RAG, embeddings, fast reading
Keeps links	Yes (as <a href> etc.)	Usually no (or links are flattened)
Keeps structure	Yes (DOM)	Limited
Size	Larger	Smaller
Common failure	Noise: scripts, nav, repeated UI	Context loss: lists, tables, link targets

What HTML is good at

HTML is usually preferred when:

Maximum fidelity is needed
The page must be re-parsed later with different rules
Link targets, attributes, and DOM structure matter

Typical crawling cases:

Product pages where microdata or attributes are needed
Pages where selectors will be applied later
Audits where evidence must be preserved

If extracted fields are the goal, a structured format should be used after parsing, as covered in Best Prompt Data.
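As an illustration of moving from parsed HTML to a structured format, the sketch below collects microdata `itemprop` values into a plain object. The function name `extractItemprops` and the sample markup are hypothetical, and the regex approach only covers simple cases (an `itemprop` attribute followed by `content="..."`, or an element whose body is plain text). A real pipeline would more likely use an HTML parser.

```javascript
// Hypothetical helper: collect itemprop values from raw HTML into a plain object.
// Regex-based; handles only simple, double-quoted, non-nested cases.
function extractItemprops(html) {
  const fields = {};

  // Case 1: value carried in an attribute, e.g. <meta itemprop="price" content="19.99">
  // (assumes itemprop appears before content inside the tag)
  const metaRe = /<[^>]*\bitemprop="([^"]+)"[^>]*\bcontent="([^"]*)"[^>]*>/gi;
  for (const m of html.matchAll(metaRe)) fields[m[1]] = m[2];

  // Case 2: value carried as element text, e.g. <span itemprop="name">Widget</span>
  const textRe = /<(\w+)[^>]*\bitemprop="([^"]+)"[^>]*>([^<]*)<\/\1>/gi;
  for (const m of html.matchAll(textRe)) {
    if (!(m[2] in fields)) fields[m[2]] = m[3].trim();
  }
  return fields;
}

// Made-up sample markup for demonstration only.
const sample = `
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <meta itemprop="price" content="19.99">
  <meta itemprop="priceCurrency" content="USD">
</div>`;

console.log(JSON.stringify(extractItemprops(sample), null, 2));
```

This kind of extraction is only possible while the attributes are still present, which is why the HTML has to be kept until the fields have been pulled out.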

What cleaned text is good at

Cleaned text is usually preferred when:

The content will be embedded for RAG
Token cost should be reduced
Navigation and boilerplate should be removed

For how cleaned text compares with Markdown, see Cleaned Text vs Markdown.

Use cases for crawling and RAG ingestion

When HTML should be used

HTML is usually the safer choice when:

Re-processing is expected (parsing rules will change)
Link URLs must be preserved exactly
Tables and lists must be reconstructed later

A practical downside is that HTML often carries a lot of noise (scripts, navigation, repeated UI), so boilerplate must be removed in a second step.
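A second-step boilerplate pass could look roughly like the sketch below. The tag list and the function name `stripBoilerplate` are assumptions chosen for illustration, not a fixed rule set. Because regular expressions cannot match nested tags reliably, the sketch assumes these elements are not nested inside themselves.

```javascript
// Hypothetical second-pass cleaner: drop elements that are usually boilerplate.
// The tag list is an assumption; adjust it to the pages being crawled.
const BOILERPLATE_TAGS = ["script", "style", "noscript", "nav", "header", "footer", "aside"];

function stripBoilerplate(html) {
  // Remove HTML comments first
  let out = html.replace(/<!--[\s\S]*?-->/g, "");
  for (const tag of BOILERPLATE_TAGS) {
    // \b keeps "header" from matching "head", and "nav" from matching longer tag names
    const re = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}\\s*>`, "gi");
    out = out.replace(re, "");
  }
  return out;
}

// Made-up sample page for demonstration only.
const page = `<body><nav><a href="/">Home</a></nav>
<main><h1>Title</h1><p>Body text.</p></main>
<footer>Copyright</footer></body>`;

console.log(stripBoilerplate(page));
```

The output is still HTML, so the main content can be re-parsed later with different rules.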

When cleaned text should be used

Cleaned text is usually the safer choice when:

The primary goal is retrieval over the readable content
Chunking will be done without relying on DOM structure
Storage and token costs must be kept down
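Chunking without DOM structure can be sketched as fixed-size character windows with some overlap. The function name `chunkText` and the default sizes are hypothetical; real pipelines often chunk by tokens or sentences instead.

```javascript
// Hypothetical chunker: fixed-size character windows with overlap,
// cut at a space where possible so words are not split.
function chunkText(text, maxChars = 200, overlap = 40) {
  if (overlap >= maxChars) throw new RangeError("overlap must be smaller than maxChars");
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      // Only move the cut if the window still advances past the overlap region
      if (lastSpace > start + overlap) end = lastSpace;
    }
    chunks.push(text.slice(start, end).trim());
    if (end === text.length) break;
    start = end - overlap; // the next window re-reads the last `overlap` characters
  }
  return chunks;
}

// Made-up sample text for demonstration only.
const sampleText =
  "Cleaned text has no tags, so chunk boundaries come from length and whitespace only. "
    .repeat(6)
    .trim();

const chunks = chunkText(sampleText, 200, 40);
console.log(chunks.length, chunks.map((c) => c.length));
```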

A practical downside is that important structure can be lost, especially:

Tables (column meaning is lost)
Lists (nesting can be flattened)
Links (anchor text remains but target URLs can be dropped)
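One way to limit the table problem is to rewrite rows as "Header: value" lines before tags are stripped. The sketch below is a minimal illustration; `tableToLines` and `cellText` are hypothetical names, and it assumes a single header row of `<th>` cells, no nested tables, and no `colspan` or `rowspan`.

```javascript
// Hypothetical helpers: keep column meaning by pairing each cell with its header.
function cellText(cellHtml) {
  return cellHtml.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

function tableToLines(tableHtml) {
  // Inner HTML of every <tr>
  const rows = [...tableHtml.matchAll(/<tr\b[^>]*>([\s\S]*?)<\/tr\s*>/gi)].map((m) => m[1]);

  // Text of every <th> or <td> cell in one row
  const cells = (row, tag) =>
    [...row.matchAll(new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)<\\/${tag}\\s*>`, "gi"))].map((m) =>
      cellText(m[1])
    );

  // Assumption: the first row is the header row
  const headers = rows.length ? cells(rows[0], "th") : [];
  return rows.slice(1).map((row) =>
    cells(row, "td")
      .map((value, i) => `${headers[i] ?? `col${i + 1}`}: ${value}`)
      .join("; ")
  );
}

// Made-up sample table for demonstration only.
const table = `<table>
<tr><th>Plan</th><th>Price</th></tr>
<tr><td>Basic</td><td>$5</td></tr>
<tr><td>Pro</td><td>$20</td></tr>
</table>`;

console.log(tableToLines(table).join("\n"));
```

With plain tag stripping the same table collapses to "Plan Price Basic $5 Pro $20", where nothing marks which value belongs to which column.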

If structure must be preserved for readability, Markdown is worth considering; see HTML vs Markdown.

Practical tradeoffs (what tends to break)

Link-heavy pages

If a page is mostly a set of links (directories, documentation sidebars), cleaned text can become hard to use because the URL targets are lost. HTML keeps them.
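If cleaned text is still wanted for such pages, one option is to inline each link target next to its anchor text before stripping tags. The sketch below is a rough illustration; `inlineLinkTargets` is a hypothetical name, and the regex assumes quoted `href` values and no nested anchors.

```javascript
// Hypothetical helper: rewrite <a href="...">text</a> as "text (url)"
// so that link targets survive tag stripping.
function inlineLinkTargets(html) {
  return html.replace(
    /<a\b[^>]*\bhref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a\s*>/gi,
    (_match, _quote, url, inner) => {
      // Anchor text may contain inline tags such as <b>; reduce it to plain text
      const label = inner.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
      return label ? `${label} (${url})` : url;
    }
  );
}

// Made-up sample sidebar for demonstration only.
const sidebar = `<ul>
<li><a href="/docs/install">Install</a></li>
<li><a class="ext" href='https://example.com/api'><b>API</b> reference</a></li>
</ul>`;

const text = inlineLinkTargets(sidebar)
  .replace(/<[^>]+>/g, " ")
  .replace(/\s+/g, " ")
  .trim();

console.log(text);
```

Relative URLs such as `/docs/install` stay relative here; resolving them against the page URL would be a separate step.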

Layout-heavy pages

If a page is mostly layout (menus, cards, footers), HTML can be too noisy. Cleaned text usually performs better for RAG, because the noise is removed.

Node.js snippet: Strip HTML tags into rough cleaned text

This is intentionally rough. It is only suitable as a fallback or a quick test.

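// Node 18+ (run as an ES module, e.g. a .mjs file, so top-level await works)
// Rough HTML-to-text conversion without external deps.

import { readFile } from "node:fs/promises";

const html = await readFile("page.html", "utf8");

// Remove comments and script/style blocks, including their contents
let text = html
  .replace(/<!--[\s\S]*?-->/g, "")
  .replace(/<script\b[\s\S]*?<\/script\s*>/gi, "")
  .replace(/<style\b[\s\S]*?<\/style\s*>/gi, "");

// Replace remaining tags with spaces so adjacent words do not merge
text = text.replace(/<[^>]+>/g, " ");

// Decode a few common entities (not a complete list); "&amp;" goes last
// so that text like "&amp;lt;" is not decoded twice
text = text
  .replace(/&nbsp;/g, " ")
  .replace(/&lt;/g, "<")
  .replace(/&gt;/g, ">")
  .replace(/&quot;/g, '"')
  .replace(/&#39;/g, "'")
  .replace(/&amp;/g, "&");

// Normalize whitespace
text = text.replace(/\s+/g, " ").trim();

// Print a preview of the result
console.log(text.slice(0, 600));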

Conclusion

HTML is usually selected when fidelity and re-processing matter.
Cleaned text is usually selected when RAG and readable content are the goal.
A common pattern is: HTML is stored for traceability, and cleaned text is produced for embeddings.
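The pattern above can be sketched as a single record per page that holds both representations. The record shape, the function names `buildPageRecord` and `htmlToRoughText`, and the sample URL are all assumptions for illustration; the cleaning step is the same rough tag stripping used earlier, not a production cleaner.

```javascript
// Hypothetical cleaner: rough tag stripping, same idea as the snippet earlier in this article.
function htmlToRoughText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical record builder: raw HTML is stored as-is, cleaned text is derived from it.
function buildPageRecord(url, rawHtml, fetchedAt = new Date()) {
  return {
    url,
    fetchedAt: fetchedAt.toISOString(),
    rawHtml, // kept unchanged so it can be re-parsed or audited later
    text: htmlToRoughText(rawHtml), // derived; can be regenerated when cleaning rules change
  };
}

// Made-up sample page for demonstration only.
const record = buildPageRecord(
  "https://example.com/post",
  "<html><body><script>track()</script><h1>Post</h1><p>Hello world.</p></body></html>"
);

console.log(record.text);
```

Only `text` would be sent to the embedding step; `rawHtml` stays in storage as the source of truth.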

If a single default is needed for RAG, HTML vs Cleaned Text vs Markdown can serve as the tie-breaker.


