Table of Contents
- Quick comparison
- What should be optimized for in RAG
- Recommended decision flow
  - Step 1: Is re-processing expected?
  - Step 2: Is retrieval being done over full content?
  - Step 3: Is human review part of the workflow?
- Practical patterns that tend to work
  - Pattern A: Store HTML, embed cleaned text
  - Pattern B: Convert to Markdown, then chunk by headings
  - Pattern C: Cleaned text only (fast path)
- Common RAG edge cases
  - Tables
  - Link directories
  - Boilerplate-heavy pages
- Node.js snippet: A simple "store HTML + embed cleaned text" record
- Conclusion
When RAG is built on top of crawled pages, the choice of output format tends to shape the whole pipeline. HTML, cleaned text, and Markdown can all work, but each carries different costs.
Quick comparison
| Topic | HTML | Cleaned Text | Markdown |
|---|---|---|---|
| Best for | Fidelity and re-processing | Embeddings and retrieval | Readable structure for humans |
| Keeps links (targets) | Yes | Usually no | Sometimes (depends on conversion) |
| Keeps structure | High (DOM) | Low | Medium |
| Token cost | High | Low | Medium |
| RAG chunking | Harder (needs parsing) | Simple | Simple (headings help) |
What should be optimized for in RAG
In real pipelines, three goals usually compete:
- Retrieval quality (what gets found)
- Answer quality (what gets used)
- Traceability (what was the source and where)
All three goals depend on how much structure the format preserves and how much noise it carries along.
If extracted structured fields are required too, prompt data formats are covered in Best Prompt Data.
Recommended decision flow
Step 1: Is re-processing expected?
If parsing rules are expected to change, HTML is often stored as the source of truth. Cleaned text and Markdown can then be regenerated from it whenever the rules change.
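A minimal sketch of that regeneration step follows. The `cleaner_version` field, the `CLEANER_VERSION` constant, and the placeholder `cleanHtml` function are all hypothetical names introduced for illustration; the only idea taken from the text is that derived text can be rebuilt from stored HTML.

```javascript
// Node 18+
// Hypothetical sketch: regenerate derived text when the cleaning rules change.
const CLEANER_VERSION = 2; // assumed counter, bumped whenever the rules change

// Placeholder cleaner for illustration only: strips tags, collapses whitespace.
function cleanHtml(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

function reprocess(record) {
  // Records already produced by the current cleaner are returned unchanged.
  if (record.cleaner_version === CLEANER_VERSION) return record;
  return {
    ...record,
    cleaned_text: cleanHtml(record.html),
    cleaner_version: CLEANER_VERSION,
  };
}

const stale = {
  url: "https://example.com/page",
  html: "<html><body><h1>Title</h1><p>Body text.</p></body></html>",
  cleaned_text: "",
  cleaner_version: 1,
};

console.log(reprocess(stale).cleaned_text); // prints "Title Body text."
```

Because the HTML is never modified, the same record can be reprocessed again the next time the version is bumped.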
Step 2: Is retrieval being done over full content?
If embedding-based retrieval is the core of the pipeline, cleaned text is usually the default. It carries less noise into the embeddings and costs fewer tokens.
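Cleaned text is simple to chunk because no parsing is needed. The sketch below splits text into overlapping word windows; the function name `chunkText` and the default sizes are assumptions chosen for illustration, not recommended values.

```javascript
// Node 18+
// Hypothetical sketch: split cleaned text into overlapping word windows.
// chunkSize and overlap are counted in words; both defaults are assumed values.
function chunkText(text, chunkSize = 200, overlap = 40) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // this window reached the end
  }
  return chunks;
}

const sample = Array.from({ length: 10 }, (_, i) => `w${i}`).join(" ");
console.log(chunkText(sample, 4, 1));
// prints [ 'w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9' ]
```

The overlap repeats a few words across neighboring chunks so that a sentence cut at a boundary still appears whole in at least one of them.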
Step 3: Is human review part of the workflow?
If humans must read chunks, Markdown is often used because headings and lists remain scannable.
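Taken together, the three steps can be condensed into a small helper. This is only a sketch: `chooseFormats` and its option names are hypothetical, and letting human review take precedence over embeddings is an assumption rather than a rule stated in this guide.

```javascript
// Node 18+
// Hypothetical helper: maps the three questions above to format choices.
function chooseFormats({ expectReprocessing, embeddingRetrieval, humanReview }) {
  const formats = { store: [], retrieve: null };

  // Step 1: keep raw HTML when parsing rules may change later.
  if (expectReprocessing) formats.store.push("html");

  // Step 3 is checked before Step 2 so that a human-review requirement
  // wins when both apply (an assumption made for this sketch).
  if (humanReview) formats.retrieve = "markdown";
  else if (embeddingRetrieval) formats.retrieve = "cleaned_text";

  return formats;
}

console.log(
  chooseFormats({ expectReprocessing: true, embeddingRetrieval: true, humanReview: false })
);
// prints { store: [ 'html' ], retrieve: 'cleaned_text' }
```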
Practical patterns that tend to work
Pattern A: Store HTML, embed cleaned text
This pattern is common because it supports both traceability and retrieval.
- HTML is stored for evidence and re-processing.
- Cleaned text is chunked and embedded.
- URLs and titles are stored as metadata.
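The three points above can be sketched as follows. The function name `toChunkRecords`, the `id` scheme, and the `chunk_index` field are hypothetical; only the idea of attaching URL and title to each embedded chunk comes from the pattern itself.

```javascript
// Node 18+
// Hypothetical sketch of Pattern A: the raw HTML stays on the page record,
// while each chunk of cleaned text carries url and title as metadata.
function toChunkRecords(page, chunks) {
  return chunks.map((text, index) => ({
    id: `${page.url}#chunk-${index}`, // assumed id scheme
    text,
    metadata: { url: page.url, title: page.title, chunk_index: index },
  }));
}

const page = {
  url: "https://example.com/page",
  title: "Example page",
  html: "<html>...</html>", // kept for evidence and re-processing
};

const records = toChunkRecords(page, ["First chunk.", "Second chunk."]);
console.log(JSON.stringify(records, null, 2));
```

With the URL on every chunk, a retrieved passage can be traced back to the stored HTML it came from.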
Pattern B: Convert to Markdown, then chunk by headings
This pattern is common for docs and knowledge bases.
- HTML is converted to Markdown.
- ## headings are used as chunk boundaries.
- Lists and code blocks are preserved.
Markdown conversion tradeoffs are covered in HTML vs Markdown.
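A minimal sketch of heading-based chunking is shown below. It assumes the Markdown has already been produced by a converter; `chunkByHeadings` is a hypothetical name, and content before the first `##` heading is kept as a chunk with a `null` heading.

```javascript
// Node 18+
// Hypothetical sketch: split Markdown into chunks at "## " headings.
// Fenced code is tracked so that a "## " line inside a code block is not
// mistaken for a chunk boundary.
const FENCE = "`".repeat(3);

function chunkByHeadings(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };
  let inFence = false;

  const flush = () => {
    const hasText = current.lines.some((l) => l.trim() !== "");
    if (current.heading !== null || hasText) chunks.push(current);
  };

  for (const line of markdown.split("\n")) {
    if (line.startsWith(FENCE)) inFence = !inFence;
    if (!inFence && line.startsWith("## ")) {
      flush();
      current = { heading: line.slice(3).trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();

  return chunks.map((c) => ({ heading: c.heading, text: c.lines.join("\n").trim() }));
}

const doc = ["# Guide", "Intro.", "## Install", "- step one", "## Usage", "Run it."].join("\n");
console.log(chunkByHeadings(doc));
```

For the sample input this yields three chunks: the introduction (heading `null`), `Install`, and `Usage`. Deeper headings such as `###` stay inside their parent chunk.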
Pattern C: Cleaned text only (fast path)
This pattern is used when:
- The site is mostly prose
- Links and tables are not critical
- Cost and simplicity are prioritized
The downside is that structure and link targets can be lost.
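The sketch below illustrates both the fast path and its downside. It is a deliberately naive, regex-based cleaner written for this example (`htmlToCleanedText` is a hypothetical name); regular expressions are not a robust way to parse HTML, and a real pipeline would more likely rely on an HTML parser.

```javascript
// Node 18+
// Naive, regex-based cleaner for illustration only.
function htmlToCleanedText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script and style blocks
    .replace(/<\/(p|div|h[1-6]|li|tr)>|<br\s*\/?>/gi, "\n") // block ends become line breaks
    .replace(/<[^>]*>/g, " ") // drop all remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&") // decoded last to avoid double decoding
    .replace(/[ \t]+/g, " ")
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean)
    .join("\n");
}

const html =
  '<h1>Pricing</h1><script>track()</script><p>See the <a href="/plans">plans</a> page.</p>';
console.log(htmlToCleanedText(html));
// prints:
// Pricing
// See the plans page.
```

The output reads well, but the `/plans` link target and the heading level are gone, which is exactly the loss described above.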
Common RAG edge cases
Tables
If tables carry meaning (specs, pricing), cleaned text can flatten them into a run of values with no row or column association. HTML preserves them, but additional parsing is required. Markdown tables can work, but converters do not always produce them reliably.
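One way to keep the association between cells and their columns is sketched below. It assumes the table has already been parsed into a header array and row arrays; both function names are hypothetical, and cells containing a `|` character would need escaping that this sketch does not do.

```javascript
// Node 18+
// Hypothetical sketch: keep row/column association when a table becomes text.

// One "Header: value" sentence per row, suitable for embedding.
function tableToSentences(headers, rows) {
  return rows.map((row) =>
    headers.map((header, i) => `${header}: ${row[i] ?? ""}`).join("; ")
  );
}

// A Markdown table, suitable for human review.
function tableToMarkdown(headers, rows) {
  const line = (cells) => `| ${cells.join(" | ")} |`;
  return [line(headers), line(headers.map(() => "---")), ...rows.map(line)].join("\n");
}

const headers = ["Plan", "Price", "Seats"];
const rows = [
  ["Basic", "$10", "1"],
  ["Team", "$40", "5"],
];

console.log(tableToSentences(headers, rows).join("\n"));
console.log(tableToMarkdown(headers, rows));
```

The sentence form repeats the header names on every row, which costs tokens but lets a single retrieved row make sense on its own.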
Link directories
If a page is mostly links, cleaned text can lose targets. HTML keeps them. Markdown can keep them if links are preserved as [text](url).
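A naive sketch of preserving link targets is shown below. `anchorsToMarkdown` is a hypothetical helper that handles only double-quoted `href` attributes; it uses the standard `URL` class to resolve relative links against an assumed base URL.

```javascript
// Node 18+
// Naive sketch: rewrite <a href="..."> anchors as [text](url) so targets
// survive later tag stripping. An HTML parser would be more robust.
function anchorsToMarkdown(html, baseUrl) {
  return html.replace(
    /<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi,
    (match, href, inner) => {
      const text = inner.replace(/<[^>]*>/g, "").trim();
      let url;
      try {
        url = new URL(href, baseUrl).href; // resolves relative targets
      } catch {
        return text; // unparseable target: keep the link text only
      }
      return `[${text}](${url})`;
    }
  );
}

const listing =
  '<ul><li><a href="/docs">Docs</a></li><li><a href="https://example.org/blog">Blog</a></li></ul>';
console.log(anchorsToMarkdown(listing, "https://example.com/"));
```

Running this step before the remaining tags are stripped leaves `[Docs](https://example.com/docs)` and `[Blog](https://example.org/blog)` in the text.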
Boilerplate-heavy pages
HTML often includes repeated headers, footers, cookie banners, and navigation. If these are not removed before embedding, every chunk carries the same noise and the embeddings are polluted. Cleaned text usually reduces this problem.
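One simple heuristic for this, sketched below, treats any line that appears on most pages of a site as boilerplate. The function name `removeRepeatedLines` and the 0.8 threshold are assumptions for illustration, and the approach only makes sense when several pages from the same site are available.

```javascript
// Node 18+
// Hypothetical heuristic: a line that appears on most pages of a site is
// treated as boilerplate (header, footer, cookie banner) and removed.
function removeRepeatedLines(pages, threshold = 0.8) {
  const splitLines = (text) => text.split("\n").map((l) => l.trim()).filter(Boolean);

  // Count on how many pages each distinct line occurs (once per page).
  const pageCount = new Map();
  for (const text of pages) {
    for (const line of new Set(splitLines(text))) {
      pageCount.set(line, (pageCount.get(line) ?? 0) + 1);
    }
  }

  const isBoilerplate = (line) => pageCount.get(line) / pages.length >= threshold;
  return pages.map((text) =>
    splitLines(text).filter((line) => !isBoilerplate(line)).join("\n")
  );
}

const pages = [
  "Home | Docs | Pricing\nInstall the CLI first.\nWe use cookies.",
  "Home | Docs | Pricing\nPlans start at $10.\nWe use cookies.",
  "Home | Docs | Pricing\nContact support by email.\nWe use cookies.",
];
console.log(removeRepeatedLines(pages));
// prints [ 'Install the CLI first.', 'Plans start at $10.', 'Contact support by email.' ]
```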
Node.js snippet: A simple "store HTML + embed cleaned text" record
This example shows a practical envelope for storage. No product-specific features are implied.
// Node 18+
// Create an ingestion record that keeps HTML for traceability
// and keeps cleaned text for embedding.
const record = {
  url: "https://example.com/page",
  fetched_at: new Date().toISOString(),
  html: "<html>...</html>",
  cleaned_text: "Readable content goes here...",
};

console.log(JSON.stringify(record, null, 2));
Conclusion
- HTML is usually selected for fidelity and re-processing.
- Cleaned text is usually selected for embeddings and retrieval.
- Markdown is usually selected when readable structure is valuable, especially for docs.
- A mixed approach is often used: HTML for storage, cleaned text (or Markdown) for RAG.
If prompt input formats are being chosen too, Best Prompt Data should be read alongside these output guides.