WebCrawler API
    Comparison, RAG, HTML, Markdown

    HTML vs Cleaned Text vs Markdown: Which Should Be Used for RAG?

    A practical guide to choosing HTML, cleaned text, or Markdown for RAG ingestion from crawled pages, including tradeoffs and a simple decision flow.

    Written by Andrew
    Published on Feb 1, 2026

    Table of Contents

    • Quick comparison
    • What should be optimized for in RAG
    • Recommended decision flow
    • Step 1: Is re-processing expected?
    • Step 2: Is retrieval being done over full content?
    • Step 3: Is human review part of the workflow?
    • Practical patterns that tend to work
    • Pattern A: Store HTML, embed cleaned text
    • Pattern B: Convert to Markdown, then chunk by headings
    • Pattern C: Cleaned text only (fast path)
    • Common RAG edge cases
    • Tables
    • Link directories
    • Boilerplate-heavy pages
    • Node.js snippet: A simple "store HTML + embed cleaned text" record
    • Conclusion

    When RAG is built on top of crawled pages, the choice of output format tends to shape the whole pipeline. HTML, cleaned text, and Markdown can all work, but each comes with different costs.

    Pairwise guides are available in:

    • HTML vs Markdown
    • Cleaned Text vs Markdown
    • HTML vs Cleaned Text

    Quick comparison

    | Topic | HTML | Cleaned Text | Markdown |
    | --- | --- | --- | --- |
    | Best for | Fidelity and re-processing | Embeddings and retrieval | Readable structure for humans |
    | Keeps links (targets) | Yes | Usually no | Sometimes (depends on conversion) |
    | Keeps structure | High (DOM) | Low | Medium |
    | Token cost | High | Low | Medium |
    | RAG chunking | Harder (needs parsing) | Simple | Simple (headings help) |

    What should be optimized for in RAG

    In real pipelines, three goals are usually competing:

    1. Retrieval quality (what gets found)
    2. Answer quality (what gets used)
    3. Traceability (what was the source and where)

    Those goals are affected by how much structure is preserved and how much noise is carried.

    If extracted structured fields are required too, prompt data formats are covered in Best Prompt Data.

    Recommended decision flow

    Step 1: Is re-processing expected?

    If parsing rules are expected to change, HTML is often stored as the source of truth. Cleaned text and Markdown can be re-generated later.

    Step 2: Is retrieval being done over full content?

    If retrieval relies mainly on embeddings of the full page content, cleaned text is usually the default. It reduces noise and token cost.

    Step 3: Is human review part of the workflow?

    If humans must read chunks, Markdown is often used because headings and lists remain scannable.
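
The three steps can be sketched as a small function. This is only a rough illustration: the function name, the option names (`expectReprocessing`, `embeddingsAreCore`, `humansReadChunks`), and the fallback to cleaned text are assumptions, and the steps are treated as independent questions.

```javascript
// Hypothetical helper that maps the three questions to output formats.
// Option names and the fallback are assumptions for illustration only.
function chooseFormats({ expectReprocessing, embeddingsAreCore, humansReadChunks }) {
  const formats = [];
  if (expectReprocessing) formats.push("html"); // Step 1: keep a source of truth
  if (embeddingsAreCore) formats.push("cleaned_text"); // Step 2: low noise, low token cost
  if (humansReadChunks) formats.push("markdown"); // Step 3: scannable chunks
  // Assumption: fall back to cleaned text when no question applies.
  return formats.length > 0 ? formats : ["cleaned_text"];
}

console.log(
  chooseFormats({ expectReprocessing: true, embeddingsAreCore: true, humansReadChunks: false })
);
```

More than one format can come back, which matches the mixed approach of storing one format and indexing another.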

    Practical patterns that tend to work

    Pattern A: Store HTML, embed cleaned text

    This pattern is common because both traceability and retrieval are supported.

    • HTML is stored for evidence and re-processing.
    • Cleaned text is chunked and embedded.
    • URLs and titles are stored as metadata.
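
A minimal sketch of the chunking side of Pattern A follows. It assumes paragraphs are separated by blank lines, and the `maxChars` limit and the chunk field names (`chunk_index`, `text`) are hypothetical, not a fixed schema. The embedding call itself is left out.

```javascript
// Split cleaned text into chunks of roughly maxChars, breaking on blank lines,
// and attach the URL and title to every chunk as metadata.
// Note: a single paragraph longer than maxChars is kept whole.
function chunkCleanedText({ url, title, cleanedText }, maxChars = 1000) {
  const paragraphs = cleanedText
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const texts = [];
  let current = "";
  for (const paragraph of paragraphs) {
    // Start a new chunk when adding this paragraph would exceed the limit.
    if (current && current.length + paragraph.length + 2 > maxChars) {
      texts.push(current);
      current = "";
    }
    current = current ? `${current}\n\n${paragraph}` : paragraph;
  }
  if (current) texts.push(current);

  return texts.map((text, index) => ({ url, title, chunk_index: index, text }));
}

const exampleChunks = chunkCleanedText(
  {
    url: "https://example.com/page",
    title: "Example page",
    cleanedText: "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.",
  },
  40
);
console.log(exampleChunks);
```

Because every chunk carries the URL, a retrieved chunk can be traced back to the stored HTML for that page.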

    Pattern B: Convert to Markdown, then chunk by headings

    This pattern is common for docs and knowledge bases.

    • HTML is converted to Markdown.
    • ## headings are used as chunk boundaries.
    • Lists and code blocks are preserved.

    Markdown conversion tradeoffs are covered in HTML vs Markdown.
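
Heading-based chunking can be sketched as below, assuming the Markdown has already been produced by some converter. The function name `chunkByH2` is hypothetical, and the fence tracking is simplified: it toggles on any line that starts with three backticks or tildes, which is looser than what a real Markdown parser does.

```javascript
// Three backticks, built with repeat() so this example stays easy to embed.
const FENCE = "`".repeat(3);

// Split Markdown into chunks at "## " headings.
// Lines inside fenced code blocks are never treated as headings.
function chunkByH2(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };
  let inFence = false;

  for (const line of markdown.split("\n")) {
    const trimmed = line.trimStart();
    if (trimmed.startsWith(FENCE) || trimmed.startsWith("~~~")) inFence = !inFence;
    if (!inFence && line.startsWith("## ")) {
      if (current.lines.length > 0) chunks.push(current);
      current = { heading: line.slice(3).trim(), lines: [] };
    }
    current.lines.push(line);
  }
  if (current.lines.length > 0) chunks.push(current);

  return chunks
    .map(({ heading, lines }) => ({ heading, text: lines.join("\n").trim() }))
    .filter((chunk) => chunk.text !== "");
}

const sample = ["# Guide", "Intro text.", "## Install", "- step one", "## Usage", "Run it."].join("\n");
console.log(chunkByH2(sample).map((chunk) => chunk.heading));
// prints [ null, 'Install', 'Usage' ]
```

Content before the first `##` heading is kept as a chunk with a `null` heading, so page titles and intros are not dropped.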

    Pattern C: Cleaned text only (fast path)

    This pattern is used when:

    • The site is mostly prose
    • Links and tables are not critical
    • Cost and simplicity are prioritized

    The downside is that structure and link targets can be lost.

    Common RAG edge cases

    Tables

    If tables carry meaning (specs, pricing), cleaned text can flatten them into nonsense. HTML can preserve them, but additional parsing is required. Markdown tables can work, but converters do not always generate them consistently.
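
One way to make Markdown table output more predictable is to render it from rows that have already been parsed out of the HTML. The sketch below assumes the first row is the header and that cells are plain strings. The function name and the sample rows are made up for illustration, and the HTML parsing step is not shown.

```javascript
// Render already-parsed table rows as a Markdown table.
// Assumption: the first row is the header; short rows are padded with empty cells.
function rowsToMarkdownTable(rows) {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((row) => row.length));
  const formatRow = (row) => {
    const cells = Array.from({ length: width }, (_, i) =>
      // Collapse whitespace and escape pipes so each cell stays on one line.
      String(row[i] ?? "").replace(/\s+/g, " ").trim().replace(/\|/g, "\\|")
    );
    return `| ${cells.join(" | ")} |`;
  };
  const [header, ...body] = rows;
  const separator = `| ${Array(width).fill("---").join(" | ")} |`;
  return [formatRow(header), separator, ...body.map(formatRow)].join("\n");
}

const tableMarkdown = rowsToMarkdownTable([
  ["Plan", "Price"],
  ["Basic", "$10 | month"],
  ["Pro"],
]);
console.log(tableMarkdown);
```

Escaping pipes and padding short rows keeps the column count stable, which is where ad hoc table generation often breaks.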

    Link directories

    If a page is mostly links, cleaned text can lose targets. HTML keeps them. Markdown can keep them if links are preserved as [text](url).
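
As a rough illustration of keeping link targets, the sketch below pulls anchors out of HTML and writes them as `[text](url)` list items. It assumes simple, well-formed anchors with quoted `href` values. A regular expression is not an HTML parser, so a real parser is the safer choice for production pages. The function name is hypothetical, and `]` characters in link text are not escaped.

```javascript
// Extract anchors from HTML and render them as Markdown links.
// Assumption: anchors are simple and well formed; this is not a full HTML parser.
function linksToMarkdown(html, baseUrl) {
  const anchorPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi;
  const lines = [];
  for (const [, href, inner] of html.matchAll(anchorPattern)) {
    // Strip nested tags and collapse whitespace in the link text.
    const text = inner.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    let target;
    try {
      target = new URL(href, baseUrl).href; // resolves relative hrefs
    } catch {
      continue; // skip hrefs that cannot be resolved
    }
    lines.push(`- [${text || target}](${target})`);
  }
  return lines.join("\n");
}

const sampleHtml =
  '<ul><li><a href="/docs">Docs</a></li>' +
  '<li><a class="x" href="https://example.org/a">  <b>Guide</b> A </a></li></ul>';
console.log(linksToMarkdown(sampleHtml, "https://example.com/"));
// prints:
// - [Docs](https://example.com/docs)
// - [Guide A](https://example.org/a)
```

Resolving relative links against the page URL matters here, because a stored `/docs` target is useless once the chunk is separated from its page.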

    Boilerplate-heavy pages

    HTML often includes repeated headers, footers, cookie banners, and navigation. If not removed, embeddings can be polluted. Cleaned text usually reduces this problem.
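
One simple heuristic for reducing boilerplate before embedding is to drop lines that repeat across many pages of the same site. This is a sketch of that heuristic only, not a description of how any particular cleaner works. The function name, the `0.6` threshold, and the three-page minimum are assumptions that would need tuning.

```javascript
// Drop lines that appear on more than `threshold` of the pages.
// Assumption: boilerplate (navigation, footers, cookie text) shows up as
// identical lines on many pages of the same site.
function stripRepeatedLines(pages, threshold = 0.6) {
  // With very few pages, every line looks "repeated", so leave them alone.
  if (pages.length < 3) return [...pages];

  const pageCounts = new Map();
  for (const text of pages) {
    // Count each distinct line once per page.
    const unique = new Set(text.split("\n").map((l) => l.trim()).filter(Boolean));
    for (const line of unique) pageCounts.set(line, (pageCounts.get(line) ?? 0) + 1);
  }

  const limit = pages.length * threshold;
  return pages.map((text) =>
    text
      .split("\n")
      .filter((line) => {
        const key = line.trim();
        return key === "" || pageCounts.get(key) <= limit;
      })
      .join("\n")
  );
}

const strippedPages = stripRepeatedLines([
  "Home | Pricing | Docs\nHTML keeps link targets.\nAccept cookies",
  "Home | Pricing | Docs\nMarkdown keeps headings.\nAccept cookies",
  "Home | Pricing | Docs\nCleaned text lowers token cost.\nAccept cookies",
]);
console.log(strippedPages);
```

The heuristic can also remove legitimate lines that happen to repeat, such as a shared disclaimer, so the threshold is worth checking against real pages.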

    Node.js snippet: A simple "store HTML + embed cleaned text" record

    This example shows a practical envelope for storage. No product-specific features are implied.

    // Node 18+
    // Create an ingestion record that keeps HTML for traceability
    // and keeps cleaned text for embedding.
    
    const record = {
      url: "https://example.com/page",
      fetched_at: new Date().toISOString(),
      html: "<html>...</html>",
      cleaned_text: "Readable content goes here...",
    };
    
    console.log(JSON.stringify(record, null, 2));
    

    Conclusion

    • HTML is usually selected for fidelity and re-processing.
    • Cleaned text is usually selected for embeddings and retrieval.
    • Markdown is usually selected when readable structure is valuable, especially for docs.
    • A mixed approach is often used: HTML for storage, cleaned text (or Markdown) for RAG.

    If prompt input formats are being chosen too, Best Prompt Data should be read alongside these output guides.