JSON vs Plain Text: Choosing the Right Format for LLM Prompts

JSON vs plain text for scraping and RAG pipelines: when strict fields are needed, when raw text is enough, and how to choose safely.

Written by Andrii

JSON and plain text usually serve different goals. JSON is used when fields must be extracted and parsed. Plain text is used when content must be read, embedded, or searched without strict structure.

A broader overview is available in Best Prompt Data.

Quick comparison

Topic                | JSON                  | Plain Text
Best for             | Structured extraction | Raw content and simple inputs
Parsing reliability  | High                  | Low
Human readability    | Medium                | High
RAG embeddings       | Good (metadata)       | Good (content)
Common failure       | Invalid JSON          | Ambiguous boundaries and missing fields

What JSON is good at

JSON is usually selected when:

  • Product, article, or directory fields must be extracted
  • Downstream systems expect predictable keys
  • Validation and schema constraints are required
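As a minimal sketch of what "predictable keys" and "validation" can look like in practice, the function below parses a raw string and checks a small set of required fields. The field names (sku, title, price) and the result shape are assumptions made for illustration; a real pipeline would more likely use a schema library.

```javascript
// Hypothetical required keys for a product record; adjust to your own schema.
const REQUIRED_FIELDS = { sku: "string", title: "string", price: "number" };

// Parse a raw string and check required keys and their types.
// Returns { ok: true, value } or { ok: false, errors }.
function parseRecord(raw) {
  let value;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    // The "invalid JSON" failure mode: the output cannot be parsed at all.
    return { ok: false, errors: [`invalid JSON: ${err.message}`] };
  }
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return { ok: false, errors: ["expected a JSON object"] };
  }
  const errors = [];
  for (const [key, type] of Object.entries(REQUIRED_FIELDS)) {
    if (!(key in value)) {
      errors.push(`missing field: ${key}`);
    } else if (typeof value[key] !== type) {
      errors.push(`wrong type for ${key}: expected ${type}`);
    }
  }
  return errors.length === 0 ? { ok: true, value } : { ok: false, errors };
}

console.log(parseRecord('{"sku":"A-1","title":"Widget","price":9.5}').ok); // true
console.log(parseRecord('{"sku":"A-1","title":"Widget"}').errors); // one error: missing field: price
console.log(parseRecord('{"sku": "A-1",').ok); // false
```

Returning a result object instead of throwing lets a crawler log the rejected page and continue with the next one.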

If a readable report is needed, Markdown vs JSON can be a better fit.

What plain text is good at

Plain text is usually selected when:

  • Source content is being fed into embeddings
  • Formatting is unnecessary or harmful
  • A later step will perform extraction

If the source is HTML, output choices are covered in HTML vs Cleaned Text and Cleaned Text vs Markdown.

Use cases in web crawling, scraping, and RAG

When JSON should be used

JSON is usually preferred when:

  • A database insert will happen
  • Deduping is done by keys (sku, url, canonical_url)
  • Multiple fields must be extracted per page
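Key-based deduping can be sketched roughly as follows. The key names come from the list above; the priority order (canonical_url, then url, then sku) and the "keep the first record seen" rule are assumptions, not a prescribed approach.

```javascript
// Pick the first available identity key, in an assumed priority order.
function dedupeKey(record) {
  return record.canonical_url ?? record.url ?? record.sku ?? null;
}

// Keep the first record seen for each key.
// Records without any usable key are kept, since they cannot be compared.
function dedupe(records) {
  const seen = new Set();
  const unique = [];
  for (const record of records) {
    const key = dedupeKey(record);
    if (key !== null && seen.has(key)) continue;
    if (key !== null) seen.add(key);
    unique.push(record);
  }
  return unique;
}

const pages = [
  { url: "https://example.com/p?ref=a", canonical_url: "https://example.com/p", title: "P" },
  { url: "https://example.com/p?ref=b", canonical_url: "https://example.com/p", title: "P" },
  { url: "https://example.com/q", title: "Q" },
];
console.log(dedupe(pages).length); // 2
```

This only works because the fields exist as separate keys; with plain text, the same check would first require finding the URL or SKU inside the text.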

When plain text should be used

Plain text is usually preferred when:

  • The goal is semantic search over page content
  • Chunking and embedding are the next steps
  • "Good enough" extraction is acceptable, or extraction is deferred
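To make the chunking step concrete, here is a minimal sketch that splits plain text into fixed-size, overlapping chunks. Character-based sizes and the default values are assumptions chosen for simplicity; production pipelines often split by tokens, sentences, or paragraphs instead.

```javascript
// Split text into chunks of at most `size` characters, repeating
// `overlap` characters between neighbouring chunks.
function chunkText(text, size = 1000, overlap = 200) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

Each chunk would then be passed to whatever embedding model the pipeline uses; that call is left out here because it depends on the provider.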

If headings are useful for chunking, Markdown can be used instead, as covered in Markdown vs Plain Text.

Practical tradeoffs

Plain text makes QA harder

Without fields, it becomes harder to check if "price" or "author" was extracted correctly. Everything becomes a text search problem.
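The difference can be illustrated with a small, hypothetical example. The sample text, the dollar-amount pattern, and the price field are all assumptions; the point is only that the text search returns several candidates, while the field check is a single assertion.

```javascript
// With plain text, a check has to guess which number is "the price".
function findPricesInText(text) {
  // Matches amounts such as $49.99 or $50; the currency format is an assumption.
  return [...text.matchAll(/\$(\d+(?:\.\d{2})?)/g)].map((m) => Number(m[1]));
}

// With JSON, the same check is a direct assertion on one field.
function hasValidPrice(record) {
  return (
    typeof record.price === "number" &&
    Number.isFinite(record.price) &&
    record.price >= 0
  );
}

const sampleText = "Was $49.99, now $39.99. Free shipping over $50.";
console.log(findPricesInText(sampleText)); // [ 49.99, 39.99, 50 ]
console.log(hasValidPrice({ title: "Widget", price: 39.99 })); // true
```

Three candidates come back from the text, and nothing in the text alone says which one is the current price.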

JSON can lose nuance

If the entire page is forced into JSON fields, nuance can be lost unless a raw text field is included too.

A common compromise is:

  • Plain text (or Markdown) is stored as content
  • JSON metadata is stored as meta

Node.js snippet: Attach metadata to plain text for RAG

This pattern keeps the chunk text clean while keeping metadata separate.

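// Node 18+ (ES module: save as .mjs or set "type": "module", because top-level await is used)
// Wrap plain text content with a JSON metadata envelope.

import { readFile } from "node:fs/promises";

// Plain text produced by an earlier cleaning step.
const content = await readFile("content.txt", "utf8");

// Metadata lives under `meta`; the text to chunk and embed lives under `content`.
const record = {
  meta: {
    url: "https://example.com/page",
    title: "Example Page",
  },
  content,
};

console.log(JSON.stringify(record, null, 2));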
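As a possible next step, the sketch below turns one page record into one record per chunk, copying the metadata onto each so that a search hit can be traced back to its source page. It assumes a record shaped as { meta, content }, following the content/meta compromise described earlier; the paragraph-based split, the id format, and the chunk_index field are illustrative assumptions.

```javascript
// Turn one page record into one record per chunk, copying metadata onto each.
// Assumes page = { meta: { url, ... }, content: "plain text" }.
function toChunkRecords(page) {
  return page.content
    .split(/\n\s*\n/) // split on blank lines (paragraph boundaries)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0)
    .map((text, index) => ({
      id: `${page.meta.url}#${index}`, // hypothetical id scheme
      meta: { ...page.meta, chunk_index: index },
      text, // only this field would be embedded
    }));
}

const page = {
  meta: { url: "https://example.com/page", title: "Example Page" },
  content: "First paragraph.\n\nSecond paragraph.\n",
};
console.log(JSON.stringify(toChunkRecords(page), null, 2));
```

Keeping the metadata out of the text field means the embedding reflects only the page content, while filters on url or title remain available at query time.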

Conclusion

  • JSON is usually selected for extraction and reliable parsing.
  • Plain text is usually selected for content-first RAG ingestion and low overhead.
  • A hybrid is often used: plain text for content and JSON for metadata.

If the decision is between human-friendly structure and raw text, Markdown vs Plain Text should be compared next.


About the Author

Andrii Mazurian
Andrew Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. Founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and have been shipping it every day since.