HTML vs Cleaned Text vs Markdown: Which Should Be Used for RAG?

When RAG is built on top of crawled pages, the choice of output format shapes much of the pipeline. HTML, cleaned text, and Markdown can all work, but each carries different costs.

Quick comparison

Topic	HTML	Cleaned Text	Markdown
Best for	Fidelity and re-processing	Embeddings and retrieval	Readable structure for humans
Keeps links (targets)	Yes	Usually no	Sometimes (depends on conversion)
Keeps structure	High (DOM)	Low	Medium
Token cost	High	Low	Medium
RAG chunking	Harder (needs parsing)	Simple	Simple (headings help)

What should be optimized for in RAG

In real pipelines, three goals usually compete:

Retrieval quality (what gets found)
Answer quality (what gets used)
Traceability (what was the source and where)

All three goals depend on how much structure the chosen format preserves and how much noise it carries.

If extracted structured fields are required too, prompt data formats are covered in Best Prompt Data.

Recommended decision flow

Step 1: Is re-processing expected?

If parsing rules are expected to change, HTML is often stored as the source of truth. Cleaned text and Markdown can be re-generated later.

Step 2: Is retrieval being done over full content?

If embedding-based retrieval is the core of the system, cleaned text is usually the default because it reduces noise and token cost.

Step 3: Is human review part of the workflow?

If humans must read chunks, Markdown is often used because headings and lists remain scannable.
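The three questions above can be sketched as a small helper. This is only an illustration: the function name `chooseFormats` and its option flags are hypothetical, and the fallback to cleaned text when every answer is "no" is an assumption, not a rule from this guide.

```javascript
// Hypothetical helper: maps the three decision-flow questions to the
// formats worth producing. Names and the fallback are assumptions.
function chooseFormats({
  reprocessingExpected = false,
  embeddingsAreCore = false,
  humanReview = false,
} = {}) {
  const formats = [];
  // Step 1: keep raw HTML when parsing rules may change later.
  if (reprocessingExpected) formats.push("html");
  // Step 2: cleaned text is the usual default for embeddings.
  if (embeddingsAreCore) formats.push("cleaned_text");
  // Step 3: Markdown keeps headings and lists scannable for reviewers.
  if (humanReview) formats.push("markdown");
  // Assumption: with no "yes" answers, fall back to the cheapest format.
  if (formats.length === 0) formats.push("cleaned_text");
  return formats;
}

console.log(chooseFormats({ reprocessingExpected: true, embeddingsAreCore: true }));
```

Returning a list rather than a single value reflects that the answers are not mutually exclusive: a pipeline can store one format and index another.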

Practical patterns that tend to work

Pattern A: Store HTML, embed cleaned text

This pattern is common because both traceability and retrieval are supported.

HTML is stored for evidence and re-processing.
Cleaned text is chunked and embedded.
URLs and titles are stored as metadata.
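A minimal sketch of the chunking step in Pattern A follows. It assumes fixed-size character windows with a small overlap, which is only one of several possible chunking strategies. The function name `chunkCleanedText`, the `title` field, and the default sizes are illustrative assumptions.

```javascript
// Sketch: split cleaned text into overlapping character windows and
// attach URL/title metadata to every chunk. Sizes are assumptions.
function chunkCleanedText(record, { size = 500, overlap = 50 } = {}) {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const text = record.cleaned_text;
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push({
      url: record.url,       // kept for traceability
      title: record.title,   // kept for traceability
      offset: start,         // character offset into cleaned_text
      text: text.slice(start, start + size),
    });
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

const page = {
  url: "https://example.com/page",
  title: "Example page",
  cleaned_text: "word ".repeat(240), // 1200 characters of placeholder text
};

const chunks = chunkCleanedText(page);
console.log(chunks.length, chunks.map((c) => c.offset));
```

Keeping the offset alongside the URL makes it possible to point from a retrieved chunk back to its position in the stored page.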

Pattern B: Convert to Markdown, then chunk by headings

This pattern is common for docs and knowledge bases.

HTML is converted to Markdown.
## headings are used as chunk boundaries.
Lists and code blocks are preserved.

Markdown conversion tradeoffs are covered in HTML vs Markdown.
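Heading-based chunking as described in Pattern B can be sketched as below. The sketch assumes the Markdown already exists (the HTML-to-Markdown conversion is not shown) and that only `## ` lines outside fenced code blocks count as boundaries. The function name `chunkByHeadings` is hypothetical.

```javascript
// Sketch: split Markdown into chunks at "## " headings, ignoring
// heading-like lines inside fenced code blocks.
const FENCE = "`".repeat(3); // code fence marker, built to avoid nesting fences here

function chunkByHeadings(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };
  let inFence = false;

  const flush = () => {
    const hasText = current.lines.some((l) => l.trim() !== "");
    if (current.heading !== null || hasText) chunks.push(current);
  };

  for (const line of markdown.split("\n")) {
    if (line.startsWith(FENCE)) inFence = !inFence;
    if (!inFence && line.startsWith("## ")) {
      flush();
      current = { heading: line.slice(3).trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();

  return chunks.map((c) => ({ heading: c.heading, text: c.lines.join("\n").trim() }));
}

const doc = [
  "# Guide",
  "Intro text.",
  "## Install",
  "- step one",
  FENCE,
  "## not a heading, inside code",
  FENCE,
  "## Usage",
  "Call the API.",
].join("\n");

const sections = chunkByHeadings(doc);
console.log(sections.map((s) => s.heading));
```

Content before the first `## ` heading is kept as a chunk with a `null` heading so that introductions are not dropped.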

Pattern C: Cleaned text only (fast path)

This pattern is used when:

The site is mostly prose
Links and tables are not critical
Cost and simplicity are prioritized

The downside is that structure and link targets can be lost.
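To make the tradeoff concrete, here is a deliberately naive cleaner. It is a regex-based sketch under the assumption of simple, well-formed HTML. The function name `htmlToCleanedText` is hypothetical, and a real pipeline would more likely rely on a proper HTML parser.

```javascript
// Naive sketch: strip tags with regular expressions. Not a real HTML
// parser; it only illustrates what "cleaned text" keeps and drops.
function htmlToCleanedText(html) {
  return html
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, " ") // drop non-content blocks
    .replace(/<!--[\s\S]*?-->/g, " ")                           // drop comments
    .replace(/<\/(p|div|li|h[1-6]|tr)>|<br\s*\/?>/gi, "\n")     // block ends become line breaks
    .replace(/<\/(td|th)>/gi, " ")                              // keep table cells apart
    .replace(/<[^>]+>/g, "")                                    // remove remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&")                                     // decoded last to avoid double decoding
    .split("\n")
    .map((line) => line.replace(/\s+/g, " ").trim())
    .filter(Boolean)
    .join("\n");
}

const sampleHtml =
  "<html><head><style>p{color:red}</style></head><body>" +
  '<h1>Pricing</h1><p>See the <a href="/docs">docs</a> for details &amp; limits.</p>' +
  "<script>track()</script></body></html>";

const cleaned = htmlToCleanedText(sampleHtml);
console.log(cleaned);
```

The anchor text "docs" survives, but its target `/docs` does not, which is exactly the loss described above.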

Common RAG edge cases

Tables

If tables carry meaning (specs, pricing), cleaned text can flatten them into nonsense. HTML can preserve them, but additional parsing is required. Markdown tables can work, but conversion is not always reliable, for example when cells span multiple rows or columns, which Markdown table syntax cannot express.
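One way to make table output more predictable is to serialize rows deterministically. The sketch below assumes the header and rows have already been extracted from the HTML by a parser (not shown) and that the table has no merged cells. The function name `toMarkdownTable` is hypothetical.

```javascript
// Sketch: serialize already-extracted rows into a pipe-delimited
// Markdown table. Assumes a simple grid without merged cells.
function toMarkdownTable(header, rows) {
  const escapeCell = (value) =>
    String(value)
      .replace(/\|/g, "\\|")        // a literal pipe would break the column layout
      .replace(/\s*\n\s*/g, " ")    // cells must stay on one line
      .trim();
  const line = (cells) => "| " + cells.map(escapeCell).join(" | ") + " |";

  // Pad or truncate every row to the header width so columns stay aligned.
  const normalized = rows.map((row) =>
    Array.from({ length: header.length }, (_, i) => row[i] ?? "")
  );

  return [line(header), line(header.map(() => "---")), ...normalized.map(line)].join("\n");
}

const table = toMarkdownTable(
  ["Plan", "Price"],
  [["Free", "$0"], ["Pro", "$20 | month"], ["Team"]]
);
console.log(table);
```

Short rows are padded with empty cells instead of being dropped, so a missing value stays visible in the chunk.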

Link directories

If a page is mostly links, cleaned text can lose targets. HTML keeps them. Markdown can keep them if links are preserved as [text](url).
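Preserving targets as `[text](url)` can be sketched with a regular expression over anchor tags. This is a naive illustration that assumes quoted `href` attributes and does not decode HTML entities. The function name `extractLinks` is hypothetical, and a real pipeline would usually use an HTML parser instead.

```javascript
// Naive sketch: pull anchors out of HTML and emit Markdown links with
// absolute URLs. Assumes quoted href values; entities are not decoded.
function extractLinks(html, baseUrl) {
  const anchor = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a>/gi;
  const links = [];
  for (const match of html.matchAll(anchor)) {
    const text = match[3].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    let url;
    try {
      url = new URL(match[2], baseUrl).href; // resolves relative targets
    } catch {
      continue; // skip targets that cannot be parsed as URLs
    }
    links.push(`[${text || url}](${url})`);
  }
  return links;
}

const directoryHtml =
  '<ul><li><a href="/docs">Docs</a></li>' +
  "<li><a class=\"x\" href='https://example.org/a?b=1'><b>External</b> site</a></li></ul>";

const links = extractLinks(directoryHtml, "https://example.com/dir/");
console.log(links.join("\n"));
```

Resolving against the page URL matters because relative targets such as `/docs` are meaningless once the chunk is separated from its page.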

Boilerplate-heavy pages

HTML often includes repeated headers, footers, cookie banners, and navigation. If these are not removed, they pollute embeddings because many chunks end up carrying the same repeated text. Cleaned text usually reduces this problem.
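One simple heuristic for catching leftover boilerplate is to drop lines that repeat across most pages of the same site. The sketch below is an assumption-laden illustration rather than a method prescribed by this guide: the function name `stripRepeatedLines` and the 80% threshold are hypothetical, and the heuristic only makes sense when several pages from one site are available.

```javascript
// Heuristic sketch: a line that appears on most pages of a site is
// treated as boilerplate and removed. The threshold is an assumption.
function stripRepeatedLines(pages, { minShare = 0.8 } = {}) {
  if (pages.length < 2) return pages.slice(); // nothing to compare against

  // Count on how many pages each distinct line occurs.
  const counts = new Map();
  for (const page of pages) {
    const unique = new Set(page.split("\n").map((l) => l.trim()).filter(Boolean));
    for (const line of unique) counts.set(line, (counts.get(line) ?? 0) + 1);
  }

  const threshold = Math.ceil(pages.length * minShare);
  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => {
        const key = line.trim();
        return key === "" || counts.get(key) < threshold;
      })
      .join("\n")
  );
}

const crawled = [
  "Home | Docs | Pricing\nPricing starts at $10.\nAccept cookies",
  "Home | Docs | Pricing\nInstall with one command.\nAccept cookies",
  "Home | Docs | Pricing\nContact support by email.\nAccept cookies",
];

const stripped = stripRepeatedLines(crawled);
console.log(stripped);
```

The obvious risk is that legitimately repeated content, such as a shared disclaimer that users ask about, is removed as well, so the threshold deserves tuning per site.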

Node.js snippet: A simple "store HTML + embed cleaned text" record

This example shows a practical envelope for storage. No product-specific features are implied.

// Node 18+
// Create an ingestion record that keeps HTML for traceability
// and keeps cleaned text for embedding.

const record = {
  url: "https://example.com/page",
  title: "Example page", // placeholder; Pattern A stores titles as metadata
  fetched_at: new Date().toISOString(),
  html: "<html>...</html>",
  cleaned_text: "Readable content goes here...",
};

console.log(JSON.stringify(record, null, 2));

Conclusion

HTML is usually selected for fidelity and re-processing.
Cleaned text is usually selected for embeddings and retrieval.
Markdown is usually selected when readable structure is valuable, especially for docs.
A mixed approach is often used: HTML for storage, cleaned text (or Markdown) for RAG.

If prompt input formats are being chosen too, Best Prompt Data should be read alongside these output guides.
