CSV vs Plain Text: Choosing the Right Format for LLM Prompts

CSV vs plain text for scraped outputs and prompt data: when a dataset is needed, when narrative text is enough, and what to avoid.

Written by Andrii
Published on

CSV and plain text are easy to confuse because both look "simple". The difference is that CSV implies a dataset with a schema (columns). Plain text implies that the content is the product.

A broader overview of formats is provided in Best Prompt Data.

Quick comparison

Topic | CSV | Plain Text
Best for | Flat tabular datasets | Raw page content and simple outputs
Parsing reliability | High (with correct quoting) | Low
Human editing | High (spreadsheets) | High
RAG fit | Not great as-is | Good for embeddings and chunking
Common failure | Broken quoting with real-world text | Ambiguity and missing fields

What CSV is good at

CSV is usually selected when:

  • One row per page/product is needed
  • A stable set of columns exists
  • Export to spreadsheet tools is important

If nested structures are needed, JSON is often preferred, as covered in JSON vs CSV.

What plain text is good at

Plain text is usually selected when:

  • The main value is the content itself
  • Embeddings and retrieval are planned
  • Formatting noise should be minimized

If light structure is helpful, Markdown is compared with plain text in Markdown vs Plain Text.

Use cases in web crawling, scraping, and RAG

When CSV should be used

CSV is usually preferred when:

  • A list, directory, or catalog is being extracted
  • Data will be filtered, sorted, and joined
  • Audits are being done in spreadsheet tools

When plain text should be used

Plain text is usually preferred when:

  • Page content is being indexed for RAG
  • Summaries are being generated without strict fields
  • The pipeline is text-first and extraction is optional
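
For the RAG case, plain text can be split into chunks before embedding. The following is a minimal sketch, not a recommended chunking strategy: the function name chunkText, the 1000-character default, and the choice to break on blank lines are all assumptions made for illustration.

```javascript
// Split plain text into chunks of roughly maxChars, breaking only on blank lines.
// chunkText and its default size are illustrative assumptions.
function chunkText(text, maxChars = 1000) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";

  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length <= maxChars || !current) {
      // Either it fits, or it is a single oversized paragraph that is kept whole.
      current = candidate;
    } else {
      chunks.push(current);
      current = p;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}

const sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.";
console.log(chunkText(sample, 40));
```

Because no quoting or column layout is involved, each chunk can be passed to an embedding step unchanged.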

If the output starts as HTML, conversion choices are covered in HTML vs Cleaned Text and HTML vs Markdown.

Practical tradeoffs

CSV is a poor container for long content

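Long descriptions often contain commas, quotes, and newlines. CSV can carry them through quoting (the field is wrapped in double quotes and embedded quotes are doubled), but every writer and reader in the pipeline has to apply those rules consistently. If the primary goal is content, plain text is usually simpler.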
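
As a rough illustration of what handling commas, quotes, and newlines involves, the sketch below quotes a field in the RFC 4180 style. The helper names escapeCsvField and toCsvRow and the sample product data are hypothetical.

```javascript
// Quote a single CSV field in the RFC 4180 style.
// escapeCsvField and toCsvRow are illustrative helper names.
function escapeCsvField(value) {
  const s = String(value);
  // Quoting is needed only if the field has a comma, double quote, CR, or LF.
  if (/[",\r\n]/.test(s)) {
    return `"${s.replaceAll('"', '""')}"`;
  }
  return s;
}

function toCsvRow(fields) {
  return fields.map(escapeCsvField).join(",");
}

// Hypothetical long description with all three problem characters.
const description = 'Sturdy, "all-weather" jacket.\nMachine washable.';
const row = toCsvRow(["SKU-1", description]);
console.log(row);
```

The resulting row spans two physical lines, so any reader that splits the file on newlines before parsing the quotes will break it.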

Plain text does not provide a schema

If a dataset is expected, plain text will require a second pass to extract fields. That can work, but the complexity is only shifted from the export step to the extraction step.
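
A second pass of this kind might look like the sketch below. The field names, the regular expressions, and the sample text are hypothetical; real pages rarely label their fields this cleanly, which is where the shifted complexity shows up.

```javascript
// Second pass: pull a fixed set of fields out of free-form text.
// The field names and patterns are hypothetical examples.
const FIELD_PATTERNS = {
  title: /^Title:\s*(.+)$/im,
  price: /^Price:\s*\$?([\d.]+)/im,
  sku: /^SKU:\s*(\S+)/im,
};

function extractFields(text) {
  const record = {};
  for (const [name, pattern] of Object.entries(FIELD_PATTERNS)) {
    const match = text.match(pattern);
    // Missing fields become null so that gaps stay visible downstream.
    record[name] = match ? match[1].trim() : null;
  }
  return record;
}

const page = [
  "Title: Trail Jacket",
  "Great for wet weather.",
  "Price: $89.50",
].join("\n");

console.log(extractFields(page));
```

The sample text has no SKU line, so that field comes back as null: the "missing fields" failure that plain text is prone to.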

Node.js snippet: Turn extracted lines into a simple CSV

This example turns "key: value" lines into a CSV with two columns.

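// Node 18+. Run as an ES module (.mjs file or "type": "module"),
// because the script uses import and top-level await.
// Convert simple "key: value" lines into CSV. Only the first ":" on a line
// separates key from value; lines without a ":" are skipped.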

import { readFile } from "node:fs/promises";

const text = await readFile("pairs.txt", "utf8");
const rows = [];

for (const line of text.split("\n")) {
  const trimmed = line.trim();
  if (!trimmed) continue;
  const idx = trimmed.indexOf(":");
  if (idx === -1) continue;
  const key = trimmed.slice(0, idx).trim();
  const value = trimmed.slice(idx + 1).trim();
  rows.push({ key, value });
}

const out = ["key,value"];
for (const r of rows) {
  const k = `"${r.key.replaceAll('"', '""')}"`;
  const v = `"${r.value.replaceAll('"', '""')}"`;
  out.push(`${k},${v}`);
}

console.log(out.join("\n"));

Conclusion

  • CSV is usually selected for flat datasets with stable columns.
  • Plain text is usually selected for content-first outputs and RAG ingestion.
  • If both are needed, a common approach is to store plain text for the content and generate CSV only for specific exports.
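
The combined approach can be sketched as follows, assuming pages are kept as records with url, title, and text properties. That record shape, the function names, and the choice of exported columns are assumptions for illustration only.

```javascript
// Hypothetical stored records: the full content stays as plain text.
const pages = [
  { url: "https://example.com/a", title: "Jackets, coats", text: "Waterproof shell with taped seams." },
  { url: "https://example.com/b", title: 'The "Trail" line', text: "Lightweight." },
];

// Always quote, doubling embedded quotes (same approach as the snippet above).
function csvField(value) {
  return `"${String(value).replaceAll('"', '""')}"`;
}

// Export only short, flat columns; the long text is summarized by its length.
function exportSummaryCsv(records) {
  const lines = ["url,title,chars"];
  for (const r of records) {
    lines.push([r.url, r.title, r.text.length].map(csvField).join(","));
  }
  return lines.join("\n");
}

console.log(exportSummaryCsv(pages));
```

Keeping the long text out of the CSV avoids most quoting problems while still giving spreadsheet users a row per page.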

If a readable structured document is preferred over plain text, Markdown vs CSV is a useful next comparison.


About the Author

Andrii Mazurian
Andrew Mazurian @andriixzvf

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. Founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and have been shipping it every day since.