YAML vs Plain Text: Choosing the Right Format for LLM Prompts

YAML vs plain text for prompt data and scraping workflows: when structured manifests help and when raw text is the safer choice.

Written by Andrii

YAML and plain text usually serve different stages of a scraping or RAG pipeline. YAML suits structured manifests and small records. Plain text suits page content that will be chunked and embedded.

A broader overview is available in Best Prompt Data.

Quick comparison

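| Topic               | YAML                             | Plain Text                       |
|---------------------|----------------------------------|----------------------------------|
| Best for            | Config-like data and manifests   | Raw content and simple outputs   |
| Parsing reliability | Medium (indentation matters)     | Low (no structure)               |
| Human readability   | High                             | High                             |
| RAG fit             | Good for metadata                | Good for content                 |
| Common failure      | Indentation and implicit types   | Missing boundaries and ambiguity |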

What YAML is good at

YAML is usually selected when:

  • A job manifest is being created (rules, filters, selectors)
  • Humans will tweak values
  • Nested config is needed and comments matter

If strict parsing is required, JSON is usually the better choice, as covered in JSON vs YAML.
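As an illustration of the manifest case, the sketch below serializes a small, flat crawl manifest to YAML with per-key comments. The helper name `toYamlManifest` and the keys `start_url`, `max_depth`, and `selectors` are hypothetical and not part of any library. Strings are always double-quoted via `JSON.stringify`, on the assumption that the consuming YAML parser accepts JSON-style double-quoted scalars.

```javascript
// Hypothetical helper: serialize a flat manifest (scalars and arrays of
// scalars only) to YAML text, with an optional comment line above each key.
function toYamlManifest(manifest, comments = {}) {
  // Always quote strings so values like "no" or "1.10" stay strings.
  const scalar = (v) => (typeof v === "string" ? JSON.stringify(v) : String(v));
  const lines = [];
  for (const [key, value] of Object.entries(manifest)) {
    if (comments[key]) lines.push(`# ${comments[key]}`);
    if (Array.isArray(value)) {
      lines.push(`${key}:`);
      for (const item of value) lines.push(`  - ${scalar(item)}`);
    } else {
      lines.push(`${key}: ${scalar(value)}`);
    }
  }
  return lines.join("\n") + "\n";
}

// Example manifest; every key and value here is illustrative.
const manifestYaml = toYamlManifest(
  {
    start_url: "https://example.com/blog",
    max_depth: 2,
    selectors: ["article h1", "article p"],
  },
  { max_depth: "Raise with care: crawl size grows quickly with depth." }
);
console.log(manifestYaml);
```

Nested objects are deliberately not handled here. Once a manifest needs them, a real YAML library is likely the safer route.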

What plain text is good at

Plain text is usually selected when:

  • The focus is on content, not fields
  • Embeddings will be created for RAG
  • Formatting should be minimized

If some structure would help with chunking, Markdown is worth considering; see Markdown vs Plain Text.

Use cases in web crawling, scraping, and RAG

When YAML should be used

YAML is usually preferred when:

  • Extraction rules are being passed between humans
  • A small record is being stored, and a schema is not enforced
  • Comments are needed to explain choices

When plain text should be used

Plain text is usually preferred when:

  • The goal is search and retrieval over page content
  • Chunking will be done later
  • The output must be resilient to minor formatting issues
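The "chunking will be done later" case can be sketched as below: plain text is split on blank lines, whitespace inside each paragraph is collapsed, and paragraphs are packed into chunks up to a character budget. The function name `chunkPlainText` and the 70-character limit are assumptions for illustration. Real pipelines often budget by tokens instead of characters.

```javascript
// Hypothetical helper: pack paragraphs into chunks of at most maxChars.
function chunkPlainText(text, maxChars = 500) {
  const paragraphs = text
    .split(/\n\s*\n/) // blank lines separate paragraphs
    .map((p) => p.replace(/\s+/g, " ").trim()) // tolerate messy whitespace
    .filter(Boolean);

  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    // A single oversized paragraph is hard-split at maxChars.
    let rest = para;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}

const sampleText =
  "First paragraph about pricing.\n\nSecond paragraph   about shipping.\n\n\nThird paragraph about returns.";
const chunks = chunkPlainText(sampleText, 70);
console.log(chunks);
```

Stray spaces or extra blank lines in the input do not change the result. That tolerance is the resilience plain text offers over an indentation-sensitive format.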

If the output is coming from HTML, the "raw vs cleaned" decision is covered in HTML vs Cleaned Text.

Practical tradeoffs

YAML is not ideal for large generated datasets

If a model emits thousands of YAML records, indentation mistakes and implicit-typing surprises become frequent (for example, an unquoted no is read as a boolean by YAML 1.1 parsers). JSON or CSV is usually safer at that scale.
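One way to reduce typing surprises when YAML must still be emitted is to quote any string that a parser might reinterpret. The sketch below is a conservative and non-exhaustive check, assuming YAML 1.1-style resolution of booleans and numbers. The names `needsQuoting` and `quoteIfNeeded` are hypothetical, and the rules over-quote on purpose.

```javascript
// Plain scalars that YAML 1.1-style parsers may resolve to booleans or null.
const YAML11_KEYWORDS = new Set([
  "y", "n", "yes", "no", "on", "off", "true", "false", "null", "~",
]);

// Hypothetical helper: true when a string should be quoted to stay a string.
function needsQuoting(value) {
  const v = value.trim();
  if (v === "" || v !== value) return true; // empty or padded with whitespace
  if (YAML11_KEYWORDS.has(v.toLowerCase())) return true; // e.g. NO, on
  if (/^[-+]?(\d[\d_]*)?\.?\d+([eE][-+]?\d+)?$/.test(v)) return true; // e.g. 1.10, 0123
  if (/: |:$| #/.test(v)) return true; // looks like a mapping or a comment
  if (/^[-?:,\[\]{}#&*!|>'"%@`]/.test(v)) return true; // leading indicator character
  return false;
}

// JSON string syntax is used for quoting (assumed acceptable to the parser).
const quoteIfNeeded = (s) => (needsQuoting(s) ? JSON.stringify(s) : s);

const samples = ["NO", "1.10", "on", "Amsterdam", "note: draft"];
const checked = samples.map(quoteIfNeeded);
console.log(checked);
```

Over-quoting is harmless, while under-quoting silently changes types. The check therefore errs toward quoting.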

Plain text makes structured QA difficult

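If a "price" field is required, plain text alone makes validation hard, because the value must first be located in the text before it can be checked. A JSON record with a typed field is easier to validate; see JSON vs Plain Text.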
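The difference can be sketched as follows. With plain text, a price has to be guessed with a pattern, and the first match is not necessarily the right one. With a structured record, the check is a simple type test. The regex, the currency list, and the function names are assumptions for illustration, not a production-grade price parser.

```javascript
// Hypothetical pattern: first currency-looking amount, e.g. "$19.99" or "EUR 1,299.00".
const PRICE_PATTERN = /(?:[$€£]|\b(?:USD|EUR|GBP)\s?)(\d+(?:,\d{3})*(?:\.\d+)?)/;

// Plain text: best-effort extraction, returns null when nothing matches.
function extractPriceFromText(text) {
  const match = PRICE_PATTERN.exec(text);
  return match ? Number(match[1].replace(/,/g, "")) : null;
}

// Structured record: validation is a direct check on a named field.
function validateRecord(record) {
  const errors = [];
  if (typeof record.price !== "number" || !Number.isFinite(record.price) || record.price < 0) {
    errors.push("price must be a non-negative number");
  }
  return errors;
}

const ambiguousText = "Now only $1,299.00, down from $1,499.00 last week.";
const guessedPrice = extractPriceFromText(ambiguousText); // takes the first amount it sees
const missingPrice = extractPriceFromText("Contact us for pricing.");

console.log(guessedPrice, missingPrice);
console.log(validateRecord({ price: 1299 }));
console.log(validateRecord({ price: "1,299.00" }));
```

The text-based path returns a number for the first sentence only because the current price happens to come first. Reordering the sentence would return the old price without any error.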

Node.js snippet: Combine YAML-like config with plain text content

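A common pattern keeps the config in YAML and the content as plain text, then wraps both in a single JSON record for ingestion. In the snippet below, the config is written as a plain object standing in for an already-parsed YAML manifest.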

// Node 18+
// Wrap plain text content with a config object.

const config = {
  extract: ["title", "author", "date"],
  language: "en",
};

const content = "Long page text goes here...";

const record = { config, content };
console.log(JSON.stringify(record, null, 2));

Conclusion

  • YAML is usually selected for human-edited manifests and config-like data.
  • Plain text is usually selected for content-first outputs and embeddings.
  • In crawling and RAG pipelines, YAML often describes what should be extracted, while plain text carries the actual page content.

If a tabular export is needed, see YAML vs CSV.


About the Author

Andrii Mazurian
Andrew Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. Founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and have been shipping it every day since.