Markdown vs Plain Text: Choosing the Right Format for LLM Prompts

Markdown vs plain text for prompts and scraped content: structure, readability, chunking for RAG, and practical tradeoffs.

Written by Andrii

Markdown and plain text can look similar, but they set different expectations. Markdown signals structure (headings, lists) that a reader or parser can rely on. Plain text signals that structure is absent or should not be relied on.

A broader guide to prompt data formats is provided in Best Prompt Data.

Quick comparison

Topic               | Markdown                 | Plain Text
Best for            | Readable structured docs | Raw content and simple prompts
Parsing reliability | Medium                   | Low (no explicit structure)
Human readability   | High                     | High (but less scannable)
RAG chunking        | Good (headings help)     | Good (simpler, fewer tokens)
Common failure      | Inconsistent formatting  | Missing boundaries, ambiguous sections

What Markdown is good at

Markdown is usually selected when:

  • Sections should be clear (H2/H3 headings)
  • Lists should remain lists
  • Code examples should be fenced and preserved

Markdown output tradeoffs are covered in Cleaned Text vs Markdown.

What plain text is good at

Plain text is usually selected when:

  • A minimum surface area is wanted (no markup)
  • The content is already clean and should not be restructured
  • Prompt tokens should be reduced by removing formatting
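
As a rough illustration of removing formatting, the sketch below reduces common Markdown syntax to plain text. The `stripMarkdown` helper is hypothetical (it is not part of any library mentioned here), it relies on regular expressions rather than a real Markdown parser, and it is intentionally lossy.

```javascript
// Hypothetical helper: reduce common Markdown syntax to plain text.
// Regex-based and lossy; it does not handle every Markdown construct.
function stripMarkdown(md) {
  return md
    .replace(/^`{3}.*$/gm, "")                  // drop code fence lines, keep the code
    .replace(/^#{1,6}[ \t]+/gm, "")             // heading markers
    .replace(/^[ \t]*[-*+][ \t]+/gm, "")        // bullet markers
    .replace(/^[ \t]*>[ \t]?/gm, "")            // blockquote markers
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1")  // links and images: keep the label
    .replace(/(\*\*|__)(.+?)\1/g, "$2")         // bold delimiters
    .replace(/`([^`]+)`/g, "$1")                // inline code delimiters
    .replace(/\n{3,}/g, "\n\n")                 // collapse extra blank lines
    .trim();
}

const sample = "## Pricing\n\n- **Free** tier\n- See [docs](https://example.com)\n";
console.log(stripMarkdown(sample));
```

Link targets are discarded in this sketch, so it only fits cases where the URLs are not needed downstream.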

If the source is HTML, the output decision is covered in HTML vs Cleaned Text.

Use cases in web crawling, scraping, and RAG

When Markdown should be used

Markdown is usually preferred when:

  • The output will be read by humans
  • Chunk boundaries should follow headings
  • Quotes, bullet points, and code blocks matter for meaning

When plain text should be used

Plain text is usually preferred when:

  • The text is being embedded and retrieved by similarity search
  • Formatting noise should be removed
  • Only simple extraction is needed now, with a second pass planned for later

For strict extraction into fields, plain text is usually not enough. JSON is usually chosen, as covered in Markdown vs JSON.

Practical tradeoffs

Markdown can inflate tokens

Headings and bullet syntax add tokens. That cost can matter when large crawls are processed. Plain text can be cheaper to store and embed.
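
One way to get a feel for this cost is to count how many characters in a page are Markdown syntax. The sketch below is a rough proxy only: `markupOverhead` is a hypothetical helper, it counts characters rather than tokens (tokenization is model-specific), and it only looks at a few common constructs.

```javascript
// Hypothetical helper: estimate how much of a Markdown string is syntax.
// Counts characters, not tokens, and covers only a few common constructs.
function markupOverhead(md) {
  const patterns = [
    /^#{1,6}[ \t]+/gm,       // heading markers
    /^[ \t]*[-*+][ \t]+/gm,  // bullet markers
    /\*\*|__/g,              // bold delimiters
    /^`{3}.*$/gm,            // code fence lines
  ];
  let markupChars = 0;
  for (const re of patterns) {
    for (const match of md.matchAll(re)) markupChars += match[0].length;
  }
  const totalChars = md.length;
  return { totalChars, markupChars, ratio: totalChars ? markupChars / totalChars : 0 };
}

console.log(markupOverhead("## Specs\n- **Fast**\n- Small\n"));
```

Short, list-heavy pages tend to show a higher ratio than long prose pages, so the saving from plain text depends on the content being crawled.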

Plain text can hide hierarchy

If multiple sections exist (pricing, terms, specs), headings can be valuable. Without them, chunking and retrieval can get worse.
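
To illustrate the difference, here is a minimal sketch of what a chunker can fall back on when no headings exist: blank-line paragraph boundaries and a size budget. The `chunkPlainText` name and the `maxChars` default are assumptions made for this example.

```javascript
// Hypothetical fallback chunker for text without headings.
// Groups blank-line-separated paragraphs into chunks that aim to stay
// under maxChars. A single paragraph longer than maxChars is kept whole.
function chunkPlainText(text, maxChars = 1000) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? current + "\n\n" + p : p;
    if (candidate.length > maxChars && current) {
      chunks.push(current); // budget exceeded: close the current chunk
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkPlainText("aaaa\n\nbbbb\n\ncccc", 10));
```

Because the boundaries here depend only on size, the end of a pricing section and the start of a terms section can land in the same chunk, which is the retrieval risk described above.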

Node.js snippet: Create simple RAG chunks from Markdown headings

This chunker is intentionally simple. It splits on ## and keeps the heading with the chunk.

// Node 18+, run as an ES module (for example a .mjs file) so top-level await works.
// Split Markdown into chunks by H2 headings.

import { readFile } from "node:fs/promises";

const md = await readFile("page.md", "utf8");

// Split before each line that starts with "## ". The lookahead keeps the
// heading attached to its chunk; "###" and deeper headings do not match.
const chunks = md
  .split(/^(?=##[ \t])/m)
  .map((part) => part.trim())
  .filter(Boolean);

console.log("Chunks:", chunks.length);
console.log("First chunk preview:\n", chunks[0]?.slice(0, 300));
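
One limitation of splitting on every line that starts with ## is that such lines can also appear inside fenced code blocks (a shell comment, for example). The variant below is a sketch that tracks fence state line by line. The `splitByH2` name is hypothetical, and only backtick fences are handled, not tilde fences.

```javascript
// Hypothetical variant: split Markdown on H2 headings while ignoring
// "## " lines that appear inside backtick-fenced code blocks.
function splitByH2(md) {
  const chunks = [];
  let current = [];
  let inFence = false;

  for (const line of md.split(/\r?\n/)) {
    // A line starting with three backticks opens or closes a fence.
    if (/^\s*`{3}/.test(line)) inFence = !inFence;

    // Start a new chunk at an H2 heading outside of any fence.
    if (!inFence && /^##[ \t]/.test(line) && current.length) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  chunks.push(current.join("\n").trim());
  return chunks.filter(Boolean);
}
```

For pages without code blocks, this should behave the same as the simpler split; the extra state only matters when fenced content contains heading-like lines.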

Conclusion

  • Markdown is usually selected when readable structure helps.
  • Plain text is usually selected when simplicity and lower overhead are more important than structure.
  • For many RAG pipelines, plain text is used for embeddings and Markdown is used for human review outputs.

If the decision is really about tables, CSV should be compared in Markdown vs CSV.


About the Author

Andrii Mazurian
Andrew Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. Founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and have been shipping it every day since.