Table of Contents
- Quick comparison
- What Markdown is good at
- What plain text is good at
- Use cases in web crawling, scraping, and RAG
  - When Markdown should be used
  - When plain text should be used
- Practical tradeoffs
  - Markdown can inflate tokens
  - Plain text can hide hierarchy
- Node.js snippet: Create simple RAG chunks from Markdown headings
- Conclusion
Markdown and plain text can look similar, but they set different expectations. Markdown signals structure (headings, lists) that readers and tools can rely on. Plain text signals that no structure is present and none should be assumed.
A broader guide to prompt data formats is provided in Best Prompt Data.
Quick comparison
| Topic | Markdown | Plain Text |
|---|---|---|
| Best for | Readable structured docs | Raw content and simple prompts |
| Parsing reliability | Medium | Low (no explicit structure) |
| Human readability | High | High (but less scannable) |
| RAG chunking | Good (headings help) | Good (simpler, fewer tokens) |
| Common failure | Inconsistent formatting | Missing boundaries, ambiguous sections |
What Markdown is good at
Markdown is usually selected when:
- Sections should be clear (H2/H3 headings)
- Lists should remain lists
- Code examples should be fenced and preserved
Markdown output tradeoffs are covered in Cleaned Text vs Markdown.
What plain text is good at
Plain text is usually selected when:
- The smallest possible surface area is wanted (no markup at all)
- The content is already clean and should not be restructured
- Prompt tokens should be reduced by removing formatting
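Removing formatting can be sketched with a few regular expressions. The `stripMarkdown` helper below is hypothetical, not a library function, and it only covers headings, bullets, links, bold, and inline code. A production pipeline would more likely rely on a real Markdown parser.

```javascript
// Hypothetical helper: a rough, regex-based Markdown stripper.
// It handles a few common markers only, not the full Markdown syntax.
function stripMarkdown(md) {
  return md
    .replace(/^#{1,6}[ \t]+/gm, "")           // heading markers
    .replace(/^[ \t]*[-*+][ \t]+/gm, "")      // bullet markers
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")  // links: keep only the label
    .replace(/(\*\*|__)(.+?)\1/g, "$2")       // bold
    .replace(/`([^`]+)`/g, "$1");             // inline code
}

const sample =
  "## Pricing\n- **Basic**: see [plans](https://example.com/plans)\n- Use `npm install`";
const plain = stripMarkdown(sample);
console.log(plain);
```

The link URL is dropped on purpose here. If URLs matter for retrieval, that rule would need to keep them.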
If the source is HTML, the output decision is covered in HTML vs Cleaned Text.
Use cases in web crawling, scraping, and RAG
When Markdown should be used
Markdown is usually preferred when:
- The output will be read by humans
- Chunk boundaries should follow headings
- Quotes, bullet points, and code blocks matter for meaning
When plain text should be used
Plain text is usually preferred when:
- The text is being embedded and retrieved by similarity search
- Formatting noise should be removed
- Only simple extraction is needed now, with a second pass planned for later
For strict extraction into fields, plain text is usually not enough. JSON is usually chosen instead, as covered in Markdown vs JSON.
Practical tradeoffs
Markdown can inflate tokens
Headings and bullet syntax add tokens. That cost can matter when large crawls are processed. Plain text can be cheaper to store and embed.
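A rough way to see this overhead is to count the characters spent on heading and bullet markers. The sketch below uses character counts as a proxy only, since real token counts depend on the tokenizer in use. `markupOverhead` is a hypothetical helper.

```javascript
// Hypothetical helper: counts characters used by heading and bullet markers.
// Character counts are only a proxy; token counts depend on the tokenizer.
function markupOverhead(md) {
  let markerChars = 0;
  for (const line of md.split("\n")) {
    // Match a leading heading marker ("## ") or bullet marker ("- ").
    const m = line.match(/^(#{1,6}[ \t]+|[ \t]*[-*+][ \t]+)/);
    if (m) markerChars += m[0].length;
  }
  const total = md.length;
  return { total, markerChars, ratio: total ? markerChars / total : 0 };
}

const page = "## Specs\n- Weight: 2 kg\n- Color: black\n";
const overhead = markupOverhead(page);
console.log(overhead);
```

Short, list-heavy pages like this sample show the highest ratio. Long paragraphs under a single heading show far less.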
Plain text can hide hierarchy
When a page has multiple sections (pricing, terms, specs), headings mark where one topic ends and the next begins. Without them, chunks can straddle unrelated sections, and retrieval can get worse.
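The effect shows up with a fixed-size chunker, which is a common fallback when no headings are available. This is a minimal sketch, assuming chunk size is measured in characters. `chunkBySize` is a hypothetical helper.

```javascript
// Hypothetical helper: fixed-size chunking for plain text (size in characters).
function chunkBySize(text, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out = [];
  for (let start = 0; start < text.length; start += size) {
    out.push(text.slice(start, start + size));
  }
  return out;
}

const text =
  "Pricing starts at $10 per month. Terms: cancel anytime. Specs: 2 GB storage.";
const sized = chunkBySize(text, 40);
console.log(sized);
```

With this sample, the first chunk contains the pricing sentence and the start of the terms sentence, because the cut falls wherever the character count says it should rather than at a topic boundary.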
Node.js snippet: Create simple RAG chunks from Markdown headings
This chunker is intentionally simple. It splits on H2 headings (##) and keeps each heading with its chunk. Any text before the first heading becomes its own chunk.
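```javascript
// Node 18+. Top-level await requires an ES module (for example, a .mjs file).
// Split Markdown into chunks by H2 headings.
import { readFile } from "node:fs/promises";

const md = await readFile("page.md", "utf8");

// Split before each line that starts with "## "; the split consumes the marker.
const parts = md.split(/\n##\s+/);
const chunks = [];
for (let i = 0; i < parts.length; i++) {
  // parts[0] is whatever precedes the first split; later parts get "## " restored.
  const text = i === 0 ? parts[i] : "## " + parts[i];
  const trimmed = text.trim();
  if (trimmed) chunks.push(trimmed);
}

console.log("Chunks:", chunks.length);
console.log("First chunk preview:\n", chunks[0]?.slice(0, 300));
```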
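One limitation of the regex split is that it also splits on ## lines inside fenced code blocks, such as shell comments. A line-based variant can track fence state and record the heading text as metadata. This is a sketch, assuming fences are three backticks or three tildes at the start of a line. `chunkByH2` is a hypothetical name.

```javascript
// Hypothetical variant: split on H2 headings, but ignore "## " lines
// that appear inside fenced code blocks (three backticks or three tildes).
function chunkByH2(md) {
  const result = [];
  let current = { heading: null, lines: [] };
  let inFence = false;

  // Push the chunk collected so far, skipping empty ones.
  const flush = () => {
    const text = current.lines.join("\n").trim();
    if (text) result.push({ heading: current.heading, text });
  };

  for (const line of md.split(/\r?\n/)) {
    if (/^\s*(`{3}|~{3})/.test(line)) inFence = !inFence; // toggle on fence lines
    const h2 = inFence ? null : /^##\s+(.*)$/.exec(line);
    if (h2) {
      flush(); // close the previous chunk before starting a new one
      current = { heading: h2[1].trim(), lines: [line] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return result;
}

const fence = "`".repeat(3);
const doc = [
  "Intro text",
  "## Setup",
  fence + "sh",
  "## this is a shell comment, not a heading",
  fence,
  "## Usage",
  "Run it.",
].join("\n");

const sections = chunkByH2(doc);
console.log(sections.map((c) => c.heading));
```

Keeping the heading as a separate field makes it possible to filter or label retrieved chunks without re-parsing the text.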
Conclusion
- Markdown is usually selected when readable structure helps.
- Plain text is usually selected when simplicity and lower overhead are more important than structure.
- For many RAG pipelines, plain text is used for embeddings and Markdown is used for human review outputs.
If the decision is really about tables, see the CSV comparison in Markdown vs CSV.