HTML vs Cleaned Text: Choosing the Right Output Format

HTML vs cleaned text for web crawling and RAG: what is preserved, what is lost, and which output format is safer for real pipelines.

Written by Andrii

HTML and cleaned text sit at opposite ends of the output spectrum. HTML keeps almost everything (including markup). Cleaned text keeps only readable text (and usually drops most structure).

If Markdown is also under consideration, see HTML vs Markdown and Cleaned Text vs Markdown.

Quick comparison

Topic            HTML                               Cleaned Text
Best for         Fidelity and re-processing later   RAG, embeddings, fast reading
Keeps links      Yes (as <a href> etc.)             Usually no (or links are flattened)
Keeps structure  Yes (DOM)                          Limited
Size             Larger                             Smaller
Common failure   Noise: scripts, nav, repeated UI   Context loss: lists, tables, link targets

What HTML is good at

HTML is usually preferred when:

  • Maximum fidelity is needed
  • The page must be re-parsed later with different rules
  • Link targets, attributes, and DOM structure matter

Typical crawling cases:

  • Product pages where microdata or attributes are needed
  • Pages where selectors will be applied later
  • Audits where evidence must be preserved

If extracted fields are the end goal, parse the HTML and emit a structured format instead, as covered in Best Prompt Data.

What cleaned text is good at

Cleaned text is usually preferred when:

  • The content will be embedded for RAG
  • Token cost should be reduced
  • Navigation and boilerplate should be removed
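The token-cost point can be made concrete with a rough estimate. The sample strings and the ~4-characters-per-token heuristic below are illustrative assumptions, not measurements from a real page:

```javascript
// Rough token estimate: ~4 characters per token is a common heuristic.
// rawHtml and cleanedText are made-up samples standing in for a real page.
const rawHtml = `<html><head><style>.nav{color:red}</style></head>
<body><nav><a href="/a">Home</a></nav>
<p>Hello world, this is the article body.</p></body></html>`;
const cleanedText = "Hello world, this is the article body.";

const estimateTokens = (s) => Math.ceil(s.length / 4);

console.log("HTML tokens (approx):", estimateTokens(rawHtml));
console.log("Cleaned tokens (approx):", estimateTokens(cleanedText));
```

The exact ratio varies by site; pages heavy on markup and inline CSS shift it further in favor of cleaned text.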

For a direct comparison with Markdown, see Cleaned Text vs Markdown.

Use cases for crawling and RAG ingestion

When HTML should be used

HTML is usually the safer choice when:

  • Re-processing is expected (parsing rules will change)
  • Link URLs must be preserved exactly
  • Tables and lists must be reconstructed later

A practical downside is that HTML often includes a lot of noise. Boilerplate must be removed in a second step.
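That second step can be sketched with a regex pass that drops common boilerplate containers. The tag list and sample markup below are assumptions; on real pages a DOM parser is the safer tool:

```javascript
// Second-pass cleanup: drop common boilerplate containers from stored HTML.
// Regexes miss malformed or deeply nested markup; this is a rough sketch.
const html = `<body><nav><a href="/">Home</a></nav>
<article><p>The actual content.</p></article>
<footer>© Example</footer></body>`;

const stripped = html.replace(
  /<(nav|header|footer|aside)\b[\s\S]*?<\/\1>/gi,
  ""
);

console.log(stripped);
```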

When cleaned text should be used

Cleaned text is usually the safer choice when:

  • The primary goal is retrieval over the readable content
  • Chunking will be done without relying on DOM structure
  • Storage and token costs must be kept down
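Chunking without DOM structure typically falls back to fixed-size windows. A minimal sketch, with illustrative sizes (tune chunkSize and overlap for the embedder in use):

```javascript
// Fixed-size character chunking with overlap; no DOM or heading cues needed.
function chunkText(text, chunkSize = 40, overlap = 10) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = "Cleaned text is chunked by size, not by headings or DOM nodes.";
console.log(chunkText(sample));
```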

A practical downside is that important structure can be lost, especially:

  • Tables (column meaning is lost)
  • Lists (nesting can be flattened)
  • Links (anchor text remains but target URLs can be dropped)
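The link loss in particular can be mitigated before tags are stripped, by inlining each href next to its anchor text. A regex-based sketch on an assumed sample (a real cleaner should use a parser):

```javascript
// Rewrite <a href="...">text</a> to "text (href)" before stripping tags,
// so target URLs survive in the cleaned text. Rough, regex-based sketch.
const html = `<p>See the <a href="https://example.com/docs">docs</a> for details.</p>`;

let text = html.replace(
  /<a\b[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi,
  "$2 ($1)"
);
text = text.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();

console.log(text);
```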

If structure must be preserved for readability, Markdown can be considered in HTML vs Markdown.

Practical tradeoffs (what tends to break)

Link-heavy pages

If a page is mostly a set of links (directories, documentation sidebars), cleaned text can become hard to use because the target URLs are lost. HTML keeps them.
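For such pages, one option is to keep the HTML and extract the links explicitly. A regex-based sketch over assumed sample markup (a DOM parser is safer on real pages):

```javascript
// Pull out the link targets that cleaned text would drop.
const html = `<ul>
<li><a href="/install">Install</a></li>
<li><a href="/config">Configure</a></li>
</ul>`;

const links = [...html.matchAll(/<a\b[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi)]
  .map(([, href, text]) => ({ text: text.trim(), href }));

console.log(links);
```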

Layout-heavy pages

If a page is mostly layout (menus, cards, footers), HTML can be too noisy. Cleaned text usually performs better for RAG, because the noise is removed.

Node.js snippet: Strip HTML tags into rough cleaned text

This is intentionally rough. It is only suitable as a fallback or a quick test.

// Node 18+
// Rough HTML to text conversion without external deps.

import { readFile } from "node:fs/promises";

const html = await readFile("page.html", "utf8");

// Remove script/style blocks
let text = html
  .replace(/<script[\s\S]*?<\/script>/gi, "")
  .replace(/<style[\s\S]*?<\/style>/gi, "");

// Replace tags with spaces
text = text.replace(/<[^>]+>/g, " ");

// Decode a few common HTML entities (rough; decode &amp; last to avoid
// double-decoding sequences like &amp;lt;)
text = text
  .replace(/&nbsp;/g, " ")
  .replace(/&lt;/g, "<")
  .replace(/&gt;/g, ">")
  .replace(/&quot;/g, '"')
  .replace(/&amp;/g, "&");

// Normalize whitespace
text = text.replace(/\s+/g, " ").trim();

console.log(text.slice(0, 600));

Conclusion

  • HTML is usually selected when fidelity and re-processing matter.
  • Cleaned text is usually selected when RAG and readable content are the goal.
  • A common pattern is: HTML is stored for traceability, and cleaned text is produced for embeddings.

If a single best default for RAG is needed, HTML vs Cleaned Text vs Markdown can serve as the tie-breaker.


About the Author

Andrii Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. He founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and has been shipping it every day since.