JSON and plain text usually serve different goals. JSON is used when fields must be extracted and parsed. Plain text is used when content must be read, embedded, or searched without strict structure.
A broader overview is available in Best Prompt Data.
Quick comparison
| Topic | JSON | Plain Text |
|---|---|---|
| Best for | Structured extraction | Raw content and simple inputs |
| Parsing reliability | High | Low |
| Human readability | Medium | High |
| RAG embeddings | Good (as metadata) | Good (as embedded content) |
| Common failure | Invalid JSON | Ambiguous boundaries and missing fields |
What JSON is good at
JSON is usually selected when:
- Product, article, or directory fields must be extracted
- Downstream systems expect predictable keys
- Validation and schema constraints are required
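As a rough illustration of the validation point, the sketch below parses a raw string and checks it against a small schema. The `PRODUCT_SCHEMA` fields and the `parseAndValidate` helper are hypothetical names chosen for this example. A real pipeline would more likely rely on a dedicated schema validator.

```javascript
// Hypothetical product schema: field name -> expected `typeof` result.
const PRODUCT_SCHEMA = { sku: "string", title: "string", price: "number" };

// Parse raw output and check it against the schema.
// Returns { ok: true, value } or { ok: false, errors }.
function parseAndValidate(raw, schema = PRODUCT_SCHEMA) {
  let value;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    // The "invalid JSON" failure mode from the comparison table.
    return { ok: false, errors: [`invalid JSON: ${err.message}`] };
  }
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return { ok: false, errors: ["expected a JSON object"] };
  }
  const errors = [];
  for (const [key, type] of Object.entries(schema)) {
    if (!(key in value)) {
      errors.push(`missing field: ${key}`);
    } else if (typeof value[key] !== type) {
      errors.push(`wrong type for ${key}: expected ${type}`);
    }
  }
  return errors.length ? { ok: false, errors } : { ok: true, value };
}

console.log(parseAndValidate('{"sku":"A1","title":"Widget","price":9.5}'));
console.log(parseAndValidate('{"sku":"A1","price":"9.5"}'));
```

The second call fails with two errors (a missing `title` and a string `price`), which is the kind of problem that is hard to detect once the same data is flattened into plain text.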
If a readable report is needed, Markdown vs JSON can be a better fit.
What plain text is good at
Plain text is usually selected when:
- Source content is being fed into embeddings
- Formatting is unnecessary or harmful
- A later step will perform extraction
If the source is HTML, output choices are covered in HTML vs Cleaned Text and Cleaned Text vs Markdown.
Use cases in web crawling, scraping, and RAG
When JSON should be used
JSON is usually preferred when:
- A database insert will happen
- Deduping is done by keys (sku, url, canonical_url)
- Multiple fields must be extracted per page
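Key-based deduping can be sketched as follows. The precedence order (`sku`, then `canonical_url`, then `url`) and the helper names `dedupeKey` and `dedupeRecords` are assumptions made for illustration, and the sample records are invented.

```javascript
// Pick the dedupe key for a record: prefer sku, then canonical_url, then url.
// The precedence order is an assumption; adjust it to the dataset.
function dedupeKey(record) {
  for (const field of ["sku", "canonical_url", "url"]) {
    // Prefix with the field name so a sku never collides with a URL.
    if (record[field]) return `${field}:${record[field]}`;
  }
  return null;
}

// Keep the first record seen for each key; records with no key are kept as-is.
function dedupeRecords(records) {
  const seen = new Set();
  const unique = [];
  for (const record of records) {
    const key = dedupeKey(record);
    if (key !== null) {
      if (seen.has(key)) continue;
      seen.add(key);
    }
    unique.push(record);
  }
  return unique;
}

// Invented sample data: two crawled URLs that share one canonical URL.
const pages = [
  { url: "https://example.com/p?ref=a", canonical_url: "https://example.com/p", title: "P" },
  { url: "https://example.com/p?ref=b", canonical_url: "https://example.com/p", title: "P (dup)" },
  { url: "https://example.com/q", title: "Q" },
];
console.log(dedupeRecords(pages).map((p) => p.title));
```

This only works because the keys exist as fields. With plain text output there is nothing stable to key on without first running an extraction step.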
When plain text should be used
Plain text is usually preferred when:
- The goal is semantic search over page content
- Chunking and embedding are the next steps
- "Good enough" extraction is acceptable, or extraction is deferred
If headings are useful for chunking, Markdown can be used instead, as covered in Markdown vs Plain Text.
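For plain text, where no headings are available, chunking often falls back to paragraph boundaries. The sketch below is one simple way to do that, assuming paragraphs are separated by blank lines and that a character limit is a good enough proxy for chunk size. The `chunkText` name and the default of 1000 characters are illustrative choices, and the embedding call itself is left out because it depends on the provider.

```javascript
// Split plain text into chunks of at most maxChars characters, breaking on
// paragraph boundaries (blank lines) where possible.
function chunkText(text, maxChars = 1000) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    // Hard-split a paragraph that is longer than the limit on its own.
    if (para.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
      continue;
    }
    // Otherwise pack paragraphs together until the limit would be exceeded.
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText("First paragraph.\n\nSecond paragraph.\n\nThird.", 40));
```

Each returned chunk can then be passed to whichever embedding step comes next.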
Practical tradeoffs
Plain text makes QA harder
Without fields, it is harder to check whether "price" or "author" was extracted correctly. Every check becomes a text search problem.
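The difference can be illustrated with a small comparison. Both helpers are hypothetical, and the regular expression only recognizes prices written like `$19.99`, which is itself an assumption about the source text.

```javascript
// With JSON, QA is a direct check on a named field.
function checkPriceField(record) {
  return typeof record.price === "number" && record.price > 0;
}

// With plain text, the same check becomes a pattern search.
// This regex only matches dollar amounts such as "$5" or "$19.99".
function findPricesInText(text) {
  return [...text.matchAll(/\$(\d+(?:\.\d{2})?)/g)].map((m) => Number(m[1]));
}

// Invented sample sentence containing three dollar amounts.
const sampleText = "Was $24.99, now $19.99. Shipping from $5.";
console.log(checkPriceField({ price: 19.99 }));
console.log(findPricesInText(sampleText));
```

The field check gives a yes or no answer, while the text search returns three candidate amounts and no indication of which one is the product price.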
JSON can lose nuance
If the entire page is forced into JSON fields, nuance can be lost unless a raw text field is included too.
A common compromise is:
- Plain text (or Markdown) is stored as content
- JSON metadata is stored as meta
Node.js snippet: Attach metadata to plain text for RAG
This pattern keeps the chunk text clean while keeping metadata separate.
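// Node 18+. Run as an ES module (e.g. a .mjs file) so top-level await works.
// Wrap plain text content with a JSON metadata envelope.
import { readFile } from "node:fs/promises";

// Plain text (or Markdown) produced by an earlier cleaning step.
const content = await readFile("content.txt", "utf8");

const record = {
  // The chunk text stays clean and is what gets embedded.
  content,
  // Metadata is kept separate under its own key.
  meta: {
    url: "https://example.com/page",
    title: "Example Page",
  },
};

console.log(JSON.stringify(record, null, 2));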
Conclusion
- JSON is usually selected for extraction and reliable parsing.
- Plain text is usually selected for content-first RAG ingestion and low overhead.
- A hybrid is often used: plain text for content and JSON for metadata.
If the decision is between human-friendly structure and raw text, Markdown vs Plain Text is the next comparison to read.