HTML vs Cleaned Text vs Markdown: Which Should Be Used for RAG?

A practical guide to choosing HTML, cleaned text, or Markdown for RAG ingestion from crawled pages, including tradeoffs and a simple decision flow.

Written by Andrii

When RAG is built on top of crawled pages, the choice of output format tends to shape the whole pipeline. HTML, cleaned text, and Markdown can all work, but each carries different costs.


Quick comparison

Topic | HTML | Cleaned Text | Markdown
Best for | Fidelity and re-processing | Embeddings and retrieval | Readable structure for humans
Keeps links (targets) | Yes | Usually no | Sometimes (depends on conversion)
Keeps structure | High (DOM) | Low | Medium
Token cost | High | Low | Medium
RAG chunking | Harder (needs parsing) | Simple | Simple (headings help)

What should be optimized for in RAG

In real pipelines, three goals are usually competing:

  1. Retrieval quality (what gets found)
  2. Answer quality (what gets used)
  3. Traceability (what was the source and where)

Those goals are affected by how much structure is preserved and how much noise is carried.

If extracted structured fields are required too, prompt data formats are covered in Best Prompt Data.

If parsing rules are expected to change, HTML is often stored as the source of truth. Cleaned text and Markdown can be re-generated later.

If embeddings are the core, cleaned text is usually the default. It reduces noise and token cost.

If humans must read chunks, Markdown is often used because headings and lists remain scannable.

Practical patterns that tend to work

Pattern A: Store HTML, embed cleaned text

This pattern is common because both traceability and retrieval are supported.

  • HTML is stored for evidence and re-processing.
  • Cleaned text is chunked and embedded.
  • URLs and titles are stored as metadata.

Pattern B: Convert to Markdown, then chunk by headings

This pattern is common for docs and knowledge bases.

  • HTML is converted to Markdown.
  • ## headings are used as chunk boundaries.
  • Lists and code blocks are preserved.
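
The heading-based chunking step above can be sketched as a small function. This is a minimal illustration, assuming chunks are cut at ## headings; the function name and sample document are illustrative:

```javascript
// Minimal sketch: split a Markdown document into chunks at "## " headings.
// Content before the first "## " heading becomes its own chunk.
function chunkByHeadings(markdown) {
  const lines = markdown.split("\n");
  const chunks = [];
  let current = [];
  for (const line of lines) {
    if (line.startsWith("## ") && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks.filter((c) => c.length > 0);
}

const doc = "# Title\nIntro text.\n\n## Setup\nInstall steps.\n\n## Usage\nRun it.";
console.log(chunkByHeadings(doc).length); // 3 chunks: intro, Setup, Usage
```

Each chunk keeps its heading, so the heading text can also be stored as chunk metadata for retrieval.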

Markdown conversion tradeoffs are covered in HTML vs Markdown.

Pattern C: Cleaned text only (fast path)

This pattern is used when:

  • The site is mostly prose
  • Links and tables are not critical
  • Cost and simplicity are prioritized

The downside is that structure and link targets can be lost.
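
A minimal sketch of this fast path, using regex-based cleaning. Real-world HTML is messy enough that a proper parser is preferable in production; this is only an illustration of what is kept and lost:

```javascript
// Minimal sketch of the "cleaned text only" fast path.
// Regex-based cleaning is fragile on real-world HTML; treat this
// as an illustration, not a production cleaner.
function cleanHtml(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ") // drop scripts and styles
    .replace(/<[^>]+>/g, " ")                        // strip remaining tags
    .replace(/&nbsp;/g, " ")                         // minimal entity handling
    .replace(/\s+/g, " ")                            // collapse whitespace
    .trim();
}

const html = '<html><body><h1>Hello</h1><p>See <a href="/x">this</a>.</p></body></html>';
console.log(cleanHtml(html)); // "Hello See this ." — note the link target "/x" is gone
```

The output illustrates the downside directly: the anchor text survives, but the link target does not.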

Common RAG edge cases

Tables

If tables carry meaning (specs, pricing), cleaned text can flatten them into nonsense. HTML can preserve them, but additional parsing is required. Markdown tables can work, but generation is not always stable.
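
One way to keep table meaning is to convert simple HTML tables to Markdown during ingestion. This is a minimal sketch, assuming a flat table with bare <tr>/<th>/<td> tags (no nesting, attributes, or colspans); the function name is illustrative:

```javascript
// Minimal sketch: convert a simple, flat HTML table into Markdown rows.
// Assumes bare <tr>/<th>/<td> tags with no attributes or nesting.
function tableToMarkdown(html) {
  const rows = [...html.matchAll(/<tr>([\s\S]*?)<\/tr>/gi)].map((m) =>
    [...m[1].matchAll(/<t[hd]>([\s\S]*?)<\/t[hd]>/gi)].map((c) => c[1].trim())
  );
  if (rows.length === 0) return "";
  const lines = rows.map((r) => `| ${r.join(" | ")} |`);
  // Insert the Markdown separator row after the header row.
  lines.splice(1, 0, `| ${rows[0].map(() => "---").join(" | ")} |`);
  return lines.join("\n");
}

const table =
  "<table><tr><th>Plan</th><th>Price</th></tr><tr><td>Pro</td><td>$20</td></tr></table>";
console.log(tableToMarkdown(table));
// | Plan | Price |
// | --- | --- |
// | Pro | $20 |
```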

Link-heavy pages

If a page is mostly links, cleaned text can lose targets. HTML keeps them. Markdown can keep them if links are preserved as [text](url).

Boilerplate-heavy pages

HTML often includes repeated headers, footers, cookie banners, and navigation. If not removed, embeddings can be polluted. Cleaned text usually reduces this problem.
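
A minimal sketch of dropping common boilerplate containers before extraction. Real pages nest these elements unpredictably, so this is an illustration rather than a robust cleaner:

```javascript
// Minimal sketch: drop common boilerplate containers before extracting text.
// Non-greedy matching breaks on nested same-name tags; real pipelines
// should use a DOM parser instead.
function stripBoilerplate(html) {
  return html.replace(/<(nav|header|footer|aside)[\s\S]*?<\/\1>/gi, "");
}

const page =
  "<header>Logo</header><nav>Home | About</nav>" +
  "<main>Actual article content.</main>" +
  "<footer>© 2024</footer>";
console.log(stripBoilerplate(page)); // "<main>Actual article content.</main>"
```

Running a step like this before cleaning or conversion keeps repeated navigation and footers out of the embedded text.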

Node.js snippet: A simple "store HTML + embed cleaned text" record

This example shows a practical envelope for storage. No product-specific features are implied.

// Node 18+
// Create an ingestion record that keeps HTML for traceability
// and keeps cleaned text for embedding.

const record = {
  url: "https://example.com/page",
  title: "Example Page",
  fetched_at: new Date().toISOString(),
  html: "<html>...</html>",
  cleaned_text: "Readable content goes here...",
};

console.log(JSON.stringify(record, null, 2));
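
From a record like this, the cleaned text would then be split into chunks for embedding. A minimal sketch of a fixed-size character chunker with overlap (the sizes and function name are illustrative, and the overlap must stay smaller than the chunk size):

```javascript
// Minimal sketch: fixed-size character chunks with overlap.
// Assumes overlap < size; otherwise the loop would not advance.
function chunkText(text, size = 800, overlap = 100) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}

const sample = "a".repeat(2000);
console.log(chunkText(sample).map((c) => c.length)); // [ 800, 800, 600 ]
```

Each chunk can then be stored alongside the record's url and title so every retrieved passage traces back to its source page.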

Conclusion

  • HTML is usually selected for fidelity and re-processing.
  • Cleaned text is usually selected for embeddings and retrieval.
  • Markdown is usually selected when readable structure is valuable, especially for docs.
  • A mixed approach is often used: HTML for storage, cleaned text (or Markdown) for RAG.

If prompt input formats are being chosen too, Best Prompt Data should be read alongside these output guides.


About the Author

Andrii Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. He founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and has been shipping it every day since.