Table of Contents
- Quick comparison
- What CSV is good at
- What plain text is good at
- Use cases in web crawling, scraping, and RAG
  - When CSV should be used
  - When plain text should be used
- Practical tradeoffs
  - CSV is a poor container for long content
  - Plain text does not provide a schema
- Node.js snippet: Turn extracted lines into a simple CSV
- Conclusion
CSV and plain text are easy to confuse because both look "simple". The difference is that CSV implies a dataset with a schema (columns). Plain text implies that the content is the product.
A broader overview of formats is provided in Best Prompt Data.
Quick comparison
| Topic | CSV | Plain Text |
|---|---|---|
| Best for | Flat tabular datasets | Raw page content and simple outputs |
| Parsing reliability | High (with correct quoting) | Low (no delimiters or schema) |
| Human editing | High (spreadsheets) | High |
| RAG fit | Not great as-is | Good for embeddings and chunking |
| Common failure | Broken quoting with real-world text | Ambiguity and missing fields |
What CSV is good at
CSV is usually selected when:
- One row per page/product is needed
- A stable set of columns exists
- Export to spreadsheet tools is important
If nested structures are needed, JSON is often preferred, as covered in JSON vs CSV.
What plain text is good at
Plain text is usually selected when:
- The main value is the content itself
- Embeddings and retrieval are planned
- Formatting noise should be minimized
If light structure is helpful, Markdown can be compared in Markdown vs Plain Text.
Use cases in web crawling, scraping, and RAG
When CSV should be used
CSV is usually preferred when:
- A list, directory, or catalog is being extracted
- Data will be filtered, sorted, and joined
- Audits are being done in spreadsheet tools
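To make "filtered, sorted, and joined" concrete, here is a minimal sketch of those operations on flat rows. The row shapes, SKUs, and column names are hypothetical, and the rows are assumed to have already been parsed from two CSV exports.

```javascript
// Hypothetical rows, as they might look after parsing two CSV exports.
const products = [
  { sku: "KT-500", name: "Travel Kettle", price: 24.99 },
  { sku: "MG-120", name: "Steel Mug", price: 9.5 },
  { sku: "TP-300", name: "Teapot", price: 31.0 },
];
const stock = [
  { sku: "KT-500", inStock: true },
  { sku: "MG-120", inStock: false },
  { sku: "TP-300", inStock: true },
];

// Join on sku, keep in-stock items, then sort by price ascending.
const stockBySku = new Map(stock.map((s) => [s.sku, s.inStock]));
const available = products
  .map((p) => ({ ...p, inStock: stockBySku.get(p.sku) ?? false }))
  .filter((p) => p.inStock)
  .sort((a, b) => a.price - b.price);

console.log(available.map((p) => p.sku));
```

This only works cleanly because every row has the same columns, which is the property that makes CSV a good fit here.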
When plain text should be used
Plain text is usually preferred when:
- Page content is being indexed for RAG
- Summaries are being generated without strict fields
- The pipeline is text-first and extraction is optional
If the output starts as HTML, conversion choices are covered in HTML vs Cleaned Text and HTML vs Markdown.
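For the RAG indexing case, plain text can be split into chunks before embedding. The sketch below is one simple approach, not a recommended standard: the function name `chunkText` and the `maxChars` limit are assumptions chosen for illustration, and paragraphs are assumed to be separated by blank lines.

```javascript
// Split plain text into chunks that aim to stay under maxChars characters,
// breaking only on blank lines (paragraph boundaries).
function chunkText(text, maxChars = 500) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = p; // a single oversized paragraph is kept whole in this sketch
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.";
const chunks = chunkText(sample, 40);
console.log(chunks);
```

With a 40-character limit, the first two paragraphs fit into one chunk and the third starts a new one. No schema or quoting rules are involved, which is why plain text tends to suit this step.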
Practical tradeoffs
CSV is a poor container for long content
Long descriptions often contain commas, quotes, and newlines. These can be handled with RFC 4180-style quoting, but every writer and reader in the pipeline has to apply it consistently. If the primary goal is content, plain text is usually simpler.
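As an illustration of what that quoting involves, the sketch below wraps a field in double quotes when it contains a comma, quote, or line break, and doubles any embedded quotes. The helper names `toCsvField` and `toCsvRow` and the sample record are hypothetical.

```javascript
// Quote a single CSV field following RFC 4180 conventions:
// wrap in double quotes only when needed, and double any embedded quotes.
function toCsvField(value) {
  const s = String(value);
  if (/[",\r\n]/.test(s)) {
    return `"${s.replaceAll('"', '""')}"`;
  }
  return s;
}

// Join already-escaped fields into one CSV record.
function toCsvRow(fields) {
  return fields.map(toCsvField).join(",");
}

// Hypothetical scraped record with a long, messy description.
const description = 'Compact, "travel-size" kettle.\nHolds 0.5 L.';
const row = toCsvRow(["https://example.com/p/1", description]);
console.log(row);
```

Note that the resulting record spans two physical lines, so any consumer that splits the file on newlines before parsing will break it.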
Plain text does not provide a schema
If a dataset is expected, plain text requires a second pass to extract fields. That can work, but the complexity is only moved downstream, not removed.
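A second pass of this kind might look like the sketch below. The page text, the `Price:` and `SKU:` labels, and the `extractFields` helper are all assumptions made for illustration; real pages rarely label fields this consistently.

```javascript
// Hypothetical plain-text page content; the "Price" and "SKU" labels are assumptions.
const pageText = [
  "Acme Travel Kettle",
  "Price: $24.99",
  "SKU: KT-500",
  "A compact kettle for small kitchens.",
].join("\n");

// Second pass: pull labelled fields out of free text with one regex per field.
// Fields that cannot be found are returned as null rather than guessed.
function extractFields(text) {
  const price = text.match(/^Price:\s*\$?([\d.]+)/m);
  const sku = text.match(/^SKU:\s*(\S+)/m);
  return {
    title: text.split("\n")[0].trim(),
    price: price ? Number(price[1]) : null,
    sku: sku ? sku[1] : null,
  };
}

const fields = extractFields(pageText);
console.log(fields);
```

Each new field means another pattern to write and maintain, which is the cost of not having a schema up front.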
Node.js snippet: Turn extracted lines into a simple CSV
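This example reads "key: value" lines from a file and turns them into a two-column CSV. Every field is quoted and embedded double quotes are doubled, so values that contain commas stay intact. Lines without a colon are skipped.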
```javascript
// Node 18+ (ES module, so top-level await is available)
// Convert simple "key: value" lines into CSV.
import { readFile } from "node:fs/promises";

const text = await readFile("pairs.txt", "utf8");

// Parse each non-empty line into a { key, value } pair.
const rows = [];
for (const line of text.split("\n")) {
  const trimmed = line.trim();
  if (!trimmed) continue;

  // Split on the first colon only, so values may contain colons.
  const idx = trimmed.indexOf(":");
  if (idx === -1) continue;

  const key = trimmed.slice(0, idx).trim();
  const value = trimmed.slice(idx + 1).trim();
  rows.push({ key, value });
}

// Emit a header, then one quoted record per pair.
const out = ["key,value"];
for (const r of rows) {
  const k = `"${r.key.replaceAll('"', '""')}"`;
  const v = `"${r.value.replaceAll('"', '""')}"`;
  out.push(`${k},${v}`);
}

console.log(out.join("\n"));
```
Conclusion
- CSV is usually selected for flat datasets with stable columns.
- Plain text is usually selected for content-first outputs and RAG ingestion.
- If both are needed, a common approach is to store plain text for content and generate CSV only for specific exports.
If a readable structured document is preferred over plain text, Markdown vs CSV can be compared next.