JSON and YAML solve the same general problem: representing structured data. The practical difference is that JSON is strict, while YAML is flexible and human-friendly, and that flexibility is where most of the surprises come from.
A broader guide is available in Best Prompt Data.
Quick comparison
| Topic | JSON | YAML |
|---|---|---|
| Best for | Machine parsing, APIs, validation | Human-edited config and small manifests |
| Schema validation | Strong | Possible, but less common in practice |
| Comments | Not supported | Supported |
| Typing surprises | Fewer | More (implicit types) |
| Common failure | Trailing commas, quoting | Indentation, implicit booleans/dates |
What JSON is good at
JSON is usually preferred when:
- A downstream parser must not guess
- A contract is required (keys, types, required fields)
- Data is being stored as objects or sent over APIs
JSON paired with Markdown is covered in Markdown vs JSON.
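As a sketch of what that "contract" can look like before a full schema validator is introduced, the snippet below checks one extracted record against a hand-rolled list of required keys and expected types. The field names (url, title, price) are placeholders, not a recommended schema.

```js
// Minimal hand-rolled contract for one extracted record.
// Field names (url, title, price) are placeholders, not a real schema.
const requiredFields = {
  url: "string",
  title: "string",
  price: "number",
};

function checkContract(record) {
  const errors = [];
  for (const [field, expectedType] of Object.entries(requiredFields)) {
    if (!(field in record)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof record[field] !== expectedType) {
      errors.push(`${field} should be ${expectedType}, got ${typeof record[field]}`);
    }
  }
  return errors;
}

const record = JSON.parse('{"url": "https://example.com", "title": "Example", "price": "19.99"}');
console.log(checkContract(record)); // [ 'price should be number, got string' ]
```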
What YAML is good at
YAML is usually preferred when:
- Humans will edit the output
- Comments are useful
- Config-like nesting is needed, but strictness is not
If a readable document is needed instead of config, Markdown is often selected, as covered in Markdown vs YAML.
Use cases in web crawling, scraping, and RAG
When JSON should be used
JSON is usually the safer choice when:
- Page extractions will be inserted into a database
- A batch crawl produces many records that must be merged or deduped
- RAG metadata must be consistent across all chunks
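As a sketch of the consistency point, the snippet below checks that every chunk record carries the same metadata keys before ingestion, so a missing field is caught at merge time rather than at query time. The key names are placeholders chosen for illustration.

```js
// Check that every RAG chunk carries the same metadata keys before ingestion.
// The key names below are placeholders for illustration.
const requiredMeta = ["source_url", "crawl_date", "chunk_index"];

function findInconsistentChunks(chunks) {
  return chunks
    .map((chunk, index) => ({
      index,
      missing: requiredMeta.filter((key) => !(key in (chunk.metadata ?? {}))),
    }))
    .filter((result) => result.missing.length > 0);
}

const chunks = [
  { text: "first chunk", metadata: { source_url: "https://example.com", crawl_date: "2026-02-01", chunk_index: 0 } },
  { text: "second chunk", metadata: { source_url: "https://example.com", chunk_index: 1 } },
];

console.log(findInconsistentChunks(chunks)); // [ { index: 1, missing: [ 'crawl_date' ] } ]
```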
When YAML should be used
YAML is usually a fit when:
- Extraction rules are being generated and edited manually
- A "job spec" is being passed around by humans
- Small manifests are being produced where a strict validator is not needed
For tabular datasets, CSV can be compared in JSON vs CSV.
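A common pattern that fits this list is to let humans edit a YAML job spec and convert it to strict JSON before it enters the pipeline, so the flexibility stays at the human-facing edge. The sketch below assumes the js-yaml package is installed and uses hypothetical file names.

```js
// Assumes: npm install js-yaml
// Humans edit crawl-job.yaml; everything downstream only ever sees JSON.
import { readFile, writeFile } from "node:fs/promises";
import { load } from "js-yaml";

// File names here are hypothetical.
const spec = load(await readFile("crawl-job.yaml", "utf8"));

// Normalize to strict JSON at the edge of the pipeline.
await writeFile("crawl-job.json", JSON.stringify(spec, null, 2));
console.log("Wrote crawl-job.json");
```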
Practical failure modes
YAML implicit types
In YAML, unquoted values like `on`, `yes`, `2026-02-01`, and `123` can be interpreted as booleans, dates, and numbers rather than strings, depending on the parser and the YAML version it implements. In scraping output, that can silently change the meaning of a field.
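The snippet below makes that concrete with a hand-written, unquoted scrape config. It assumes js-yaml is installed; exactly which values get coerced depends on the parser and the YAML version it targets, which is the point.

```js
// Assumes: npm install js-yaml
import { load } from "js-yaml";

// A plausible hand-written scrape config; none of the values are quoted.
const doc = `
follow_redirects: on
respect_robots: yes
start_date: 2026-02-01
max_pages: 123
country_code: no
`;

const parsed = load(doc);
for (const [key, value] of Object.entries(parsed)) {
  // Whether "on"/"yes"/"no" come back as booleans and the date as a Date
  // object differs between parsers and YAML versions; inspect, don't assume.
  const kind = value instanceof Date ? "Date" : typeof value;
  console.log(`${key}: ${JSON.stringify(value)} (${kind})`);
}
```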
JSON strictness is a feature
That strictness is usually annoying for humans, but it is valuable for pipelines: if the model emits invalid JSON, the failure is immediate and detectable.
Node.js snippet: Enforce "JSON only" output in a pipeline
The simplest enforcement is to attempt a parse and fail the job when parsing fails. Failing fast in this way tends to tighten model behavior over time.
```js
// Node 18+
// Fail fast if JSON is invalid.
import { readFile } from "node:fs/promises";

const text = await readFile("output.json", "utf8");

let data;
try {
  data = JSON.parse(text);
} catch (e) {
  console.error("Invalid JSON output:", e.message);
  process.exit(1);
}

console.log("OK:", Array.isArray(data) ? "array" : "object");
```
Conclusion
- JSON is usually selected for reliability, validation, and downstream parsing.
- YAML is usually selected for human-edited config-like content and comments.
- For most scraping and RAG ingestion pipelines, JSON is usually the default unless human editing is a core requirement.
If minimal text is desired instead of structured data, JSON vs Plain Text should be read next.