Table of Contents
- Quick comparison
- What YAML is good at
- What plain text is good at
- Use cases in web crawling, scraping, and RAG
- When YAML should be used
- When plain text should be used
- Practical tradeoffs
- YAML is not ideal for large generated datasets
- Plain text makes structured QA difficult
- Node.js snippet: Combine YAML-like config with plain text content
- Conclusion
YAML and plain text usually appear at different stages of a crawling or RAG pipeline. YAML tends to hold structured manifests and small records, while plain text tends to hold the page content that is later chunked and embedded.
A broader overview is available in Best Prompt Data.
Quick comparison
| Topic | YAML | Plain Text |
|---|---|---|
| Best for | Config-like data and manifests | Raw content and simple outputs |
| Parsing reliability | Medium (indentation matters) | Low for field extraction (no structure to parse) |
| Human readability | High | High |
| RAG fit | Good for metadata | Good for content |
| Common failure | Indentation and implicit types | Missing boundaries and ambiguity |
What YAML is good at
YAML is usually selected when:
- A job manifest is being created (rules, filters, selectors)
- Humans will tweak values
- Nested config is needed and comments matter
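As a minimal sketch of such a manifest, the snippet below renders a small job description as YAML text with one comment per value. The field names (`start_url`, `selectors`, `max_depth`, `country`) are hypothetical, and keys are assumed to be simple identifiers. Values are written with `JSON.stringify`, which relies on YAML 1.2 accepting JSON-style strings, numbers, and flow sequences; it is an illustration, not a general YAML emitter.

```javascript
// Hypothetical manifest: field names and values are illustrative only.
const manifest = [
  { key: "start_url", value: "https://example.com/blog", comment: "Entry point for the crawl" },
  { key: "selectors", value: ["article h1", "article p"], comment: "CSS selectors to extract" },
  { key: "max_depth", value: 2, comment: "Raise with care: page count can grow quickly" },
  { key: "country", value: "no", comment: "Quoted so it stays a string" },
];

// Render each entry as an optional comment line followed by `key: value`.
// JSON.stringify produces JSON values, which are also valid YAML 1.2 values.
function renderManifest(entries) {
  return entries
    .map(({ key, value, comment }) => {
      const line = `${key}: ${JSON.stringify(value)}`;
      return comment ? `# ${comment}\n${line}` : line;
    })
    .join("\n");
}

const manifestYaml = renderManifest(manifest);
console.log(manifestYaml);
```

Because every string is quoted, a human editing the file later sees which values are text and which are numbers, and the comments travel with the values they explain.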
If strict parsing is required, JSON can be preferred, as covered in JSON vs YAML.
What plain text is good at
Plain text is usually selected when:
- The focus is on content, not fields
- Embeddings will be created for RAG
- Formatting should be minimized
If some structure would help chunking, see the comparison in Markdown vs Plain Text.
Use cases in web crawling, scraping, and RAG
When YAML should be used
YAML is usually preferred when:
- Extraction rules are being passed between humans
- A small record is being stored, and a schema is not enforced
- Comments are needed to explain choices
When plain text should be used
Plain text is usually preferred when:
- The goal is search and retrieval over page content
- Chunking will be done later
- The output must be resilient to minor formatting issues
If the output is coming from HTML, the "raw vs cleaned" decision is covered in HTML vs Cleaned Text.
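The "chunking will be done later" step can be sketched as a fixed-size splitter with overlap. The function name `chunkText` and the default sizes are assumptions chosen for illustration; real pipelines often split on sentence or token boundaries instead of raw character counts.

```javascript
// Split plain text into fixed-size chunks that overlap, so text near a
// boundary appears in both neighboring chunks.
function chunkText(text, size = 500, overlap = 50) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("expected size > 0 and 0 <= overlap < size");
  }
  const chunks = [];
  const step = size - overlap; // how far each chunk start advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// Placeholder content standing in for a crawled page.
const pageText = "Long page text goes here. ".repeat(40).trim();
const chunks = chunkText(pageText, 200, 20);
console.log(`chunks: ${chunks.length}, first chunk length: ${chunks[0].length}`);
```

Because plain text has no markup to break, a chunk boundary can fall anywhere without producing an invalid document, which is part of why it tolerates minor formatting issues.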
Practical tradeoffs
YAML is not ideal for large generated datasets
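If thousands of YAML records are emitted by a model, indentation mistakes and implicit typing surprises become frequent. For example, many YAML 1.1 parsers read an unquoted `no` as the boolean false, and an unquoted version such as `1.10` as the number 1.1. JSON or CSV is usually safer at that scale.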
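One way to catch typing surprises before they reach a parser is a lint pass over generated records. The sketch below is an assumption-heavy illustration: `findRiskyScalars` is a hypothetical helper, it only inspects simple single-line `key: value` pairs, and its keyword list covers common YAML 1.1 booleans and nulls without being exhaustive.

```javascript
// Unquoted scalars that many YAML 1.1 parsers resolve to booleans or null.
const YAML11_KEYWORDS = /^(yes|no|on|off|true|false|null|~)$/i;
// Unquoted scalars that resolve to numbers, e.g. a version "1.10" becoming 1.1.
const NUMERIC = /^[-+]?(\d+\.?\d*|\.\d+)(e[-+]?\d+)?$/i;

// Report unquoted values that a parser may not keep as strings.
// Only handles simple `key: value` lines; nested or multi-line values are skipped.
function findRiskyScalars(yamlText) {
  const findings = [];
  yamlText.split("\n").forEach((line, index) => {
    const match = line.match(/^\s*(?:-\s+)?([\w-]+):\s+(.*\S)\s*$/);
    if (!match) return;
    const [, key, value] = match;
    if (/^["']/.test(value)) return; // already quoted, so it stays a string
    if (YAML11_KEYWORDS.test(value)) {
      findings.push({ line: index + 1, key, value, reason: "boolean or null keyword" });
    } else if (NUMERIC.test(value)) {
      findings.push({ line: index + 1, key, value, reason: "parsed as a number" });
    }
  });
  return findings;
}

const sample = [
  "country: no",
  "version: 1.10",
  'title: "Quoted is safe"',
  "author: Jane Doe",
].join("\n");

const findings = findRiskyScalars(sample);
console.log(findings);
```

A finding is only a warning: a numeric `max_depth: 2` is usually intended, so the check is most useful on fields that are expected to be strings.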
Plain text makes structured QA difficult
If a "price" field is required, plain text alone gives a validator nothing to check: the value may be missing, repeated, or buried in a sentence. For a structured alternative, see JSON vs Plain Text.
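The difference can be sketched with a hypothetical product listing. In plain text, a regular expression finds every price-like token but cannot tell which one is the current price; in a structured record, a single field either passes a check or fails it. The sample text and the `validatePrice` helper are illustrative assumptions.

```javascript
// Plain text: several price-like tokens, no indication of which one counts.
const listingText = "Was $49.99, now $39.99. Shipping from $4.50.";
const candidates = listingText.match(/\$\d+(?:\.\d{2})?/g) ?? [];
console.log(candidates.length > 1 ? "ambiguous price in text" : "single price found");

// Structured record: the price field can be checked directly.
function validatePrice(record) {
  const { price } = record;
  return typeof price === "number" && Number.isFinite(price) && price >= 0;
}

console.log(validatePrice({ title: "Widget", price: 39.99 })); // valid number
console.log(validatePrice({ title: "Widget", price: "39.99" })); // string, rejected
console.log(validatePrice({ title: "Widget" })); // missing, rejected
```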
Node.js snippet: Combine YAML-like config with plain text content
A common pattern is to keep the config in YAML and the content as plain text, then wrap both into a single JSON record for ingestion. In the snippet below, the config is written as a plain object standing in for a parsed YAML file.
// Node 18+
// Wrap plain text content with a config object.
const config = {
  extract: ["title", "author", "date"],
  language: "en",
};
const content = "Long page text goes here...";
const record = { config, content };
console.log(JSON.stringify(record, null, 2));
Conclusion
- YAML is usually selected for human-edited manifests and config-like data.
- Plain text is usually selected for content-first outputs and embeddings.
- In crawling and RAG pipelines, YAML often describes what should be extracted, while plain text carries the actual page content.
If a tabular export is needed, see YAML vs CSV.