How dom_smoothie, a Rust alternative to Mozilla Readability, works

A practical, step-by-step explanation of how dom_smoothie (Rust) works as a Mozilla Readability alternative for main-content extraction.

Written by Andrii

Hi, I'm Andrew. I work on scraping and crawling systems every day at WebCrawlerAPI.

If you have already used Mozilla Readability, dom_smoothie will feel familiar. It follows the same core idea: score the DOM, find the main container, clean it, and return the article content. But dom_smoothie adds extra controls for retries, candidate selection, and output shaping.

If you want the Mozilla baseline explanation first, read: Mozilla Readability Algorithm (Readability.js) explanation. If you want a code-first JavaScript integration guide, read: Extracting article or blogpost content with Mozilla Readability.

In this post, I will explain how it works in real life, where it fails, and how to tune it without guessing.

The most important heuristics in plain English

These heuristics do most of the heavy lifting:

  1. Unlikely blocks are removed early. Sidebars, banners, menus, modal UI, and hidden nodes are filtered out.
  2. Class and id names are weighted. Positive names get a score bonus; negative names get a penalty.
  3. Text-like blocks are scored, not full pages. Paragraph-ish nodes are used as the signal source.
  4. Scores are propagated to ancestors. The real article is usually a parent container, not a single <p>.
  5. Link density is used as a penalty. Navigation-heavy blocks usually contain many links and little writing.
  6. Siblings around the winner are merged. This recovers paragraphs split by ads, widgets, or template wrappers.
  7. Conditional cleanup runs after extraction. Tables, forms, junk embeds, empty tags, and noisy blocks are removed.

That is the practical core. It is not magic. It is a layered heuristic pipeline.

High-level end-to-end flow (parse() stages)

At a high level, Readability::parse() in dom_smoothie does this:

  1. Check parse budget (max_elements_to_parse).
  2. Parse metadata from JSON-LD (optional).
  3. Parse metadata from <meta> and <title>, then merge.
  4. Prepare DOM (remove scripts/styles/comments, normalize messy HTML).
  5. Pre-filter obvious noise (hidden/dialog/byline/duplicate title nodes).
  6. Run main content extraction and pick best candidate.
  7. Clean extracted content with prep_article().
  8. Post-process URLs/links/classes.
  9. Build final Article object.

If extraction fails, a GrabFailed error is returned.

Real-life note: the default parse() can retry with relaxed heuristics if the output is too short. This costs CPU, but it improves the success rate on weird pages.
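Before going deeper, here is a minimal usage sketch based on dom_smoothie's documented API surface. Treat it as illustrative: the constructor signature, Config, and the Article fields (title, text_content) are assumptions that may differ between crate versions.

```rust
// Illustrative sketch, assuming dom_smoothie's documented API surface.
// Cargo.toml: dom_smoothie = "<version>"
use dom_smoothie::{Article, Config, Readability};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html><body><article><p>Some long article text...</p></article></body></html>";
    let url = "https://example.com/article";

    // Default config runs the staged-fallback parse described above.
    let mut readability = Readability::new(html, Some(url), Some(Config::default()))?;
    let article: Article = readability.parse()?;

    println!("title: {}", article.title);
    println!("text length: {}", article.text_content.len());
    Ok(())
}
```

Use parse_with_policy (covered below) instead of parse() when you want a single deterministic pass without retries.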

Inside each extraction attempt, the flow is simple and strict.

The body is walked in document order.

  • Unlikely candidates can be removed
  • Empty structural nodes can be removed
  • Score-eligible tags are collected (section, h2-h6, p, td, pre)
  • div blocks are normalized into paragraph-like structure when needed

This normalization step is important because many pages use <div> for everything.

Very short text blocks are ignored (< 25 chars).

For valid text blocks, a base score is built from:

  • constant base (2)
  • punctuation signal (comma-like count)
  • text length bonus (capped)

That score is propagated to ancestors, up to depth 5:

  • parent gets full share
  • grandparent gets half
  • higher levels get smaller fraction

This is how wrapper containers win, not leaf text nodes.
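The scoring and propagation above can be sketched in plain Rust. This is a simplified illustration, not the crate's actual code; the divisor scheme for levels beyond the grandparent (level * 3) is one plausible Readability-style choice.

```rust
/// Base score for a paragraph-like block, per the description above:
/// a constant base (2), a comma-count signal, and a capped length bonus.
fn base_score(text: &str) -> f64 {
    let commas = text.matches(',').count() as f64;
    let length_bonus = ((text.len() as f64) / 100.0).floor().min(3.0);
    2.0 + commas + length_bonus
}

/// Share of the score each ancestor level receives (level 0 = parent).
/// Parent gets the full share, grandparent half, higher levels a
/// smaller fraction (here: level * 3, Readability-style).
fn ancestor_share(score: f64, level: usize) -> f64 {
    let divider = match level {
        0 => 1.0,
        1 => 2.0,
        n => (n * 3) as f64,
    };
    score / divider
}

fn main() {
    let text = "A long enough paragraph, with some commas, and real sentences.";
    let s = base_score(text);
    // Push the score up through five ancestor levels.
    for level in 0..5 {
        println!("level {level}: +{:.2}", ancestor_share(s, level));
    }
}
```

Because every paragraph in an article pushes score into the same wrappers, the shared container accumulates more than any single leaf node.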

Ancestors also get intrinsic score by tag type:

  • div gets positive prior
  • pre, td, blockquote get smaller positive prior
  • lists/forms/headings/tables can get penalties depending on type

If class/id weighting is enabled, positive patterns add points and negative patterns subtract points.

For candidates above a minimum score threshold, score is adjusted:

adjusted = score * (1 - linkDensity)

This prevents nav-like blocks from winning when link text dominates.
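The penalty can be sketched as follows, assuming link density is defined as link text length divided by total text length (the usual Readability definition):

```rust
/// Fraction of a block's text that sits inside links.
fn link_density(link_text_len: usize, total_text_len: usize) -> f64 {
    if total_text_len == 0 {
        return 0.0;
    }
    link_text_len as f64 / total_text_len as f64
}

/// Link-density penalty from the formula above:
/// adjusted = score * (1 - linkDensity).
fn adjusted_score(score: f64, density: f64) -> f64 {
    score * (1.0 - density)
}

fn main() {
    // A nav-like block: 90 of 100 characters are link text.
    let nav = adjusted_score(50.0, link_density(90, 100));
    // An article-like block: 5 of 100 characters are link text.
    let article = adjusted_score(50.0, link_density(5, 100));
    println!("nav: {nav:.1}, article: {article:.1}"); // nav: 5.0, article: 47.5
}
```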

Only the top n_top_candidates are kept (default: 5). The highest-scoring node starts as the top candidate, then the selection-mode logic can promote a better ancestor.

Candidate selection modes: Readability vs DomSmoothie

dom_smoothie supports two selection modes with slightly different behavior.

CandidateSelectMode::Readability

  • Mozilla-like behavior
  • Looks for common ancestor among high-scoring alternatives
  • Uses overlap and relative strength checks
  • Usually safer for classic article templates

CandidateSelectMode::DomSmoothie

  • Intersects ancestor sets of strong alternatives
  • Tries to choose the strongest meaningful common ancestor
  • Can be better on fragmented modern layouts with wrappers

In practice:

  • Start with Readability if you need conservative behavior
  • Switch to DomSmoothie if content is often split or wrapper-heavy

Sibling merge and why it matters

After the top candidate is chosen, extraction is not finished.

The algorithm also checks siblings under the same parent and appends the ones that look article-like. Without this step, many articles lose intro or trailing paragraphs.

Sibling inclusion uses:

  • threshold based on top score (max(10, topScore * 0.2))
  • class-name bonus if sibling class matches top candidate class
  • paragraph heuristics for unscored siblings:
    • enough length
    • sentence-like text
    • low link density

This single step fixes a lot of real pages that inject ads between paragraphs.
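The inclusion threshold above is simple enough to sketch directly (illustrative; the floor of 10 and the 0.2 factor follow the description above):

```rust
/// Sibling inclusion threshold from the description above:
/// max(10, topScore * 0.2). Siblings scoring above this bar are
/// appended to the extracted content.
fn sibling_threshold(top_score: f64) -> f64 {
    (top_score * 0.2).max(10.0)
}

fn main() {
    // Weak top candidates keep the floor of 10, so junk siblings
    // around a weak winner are still filtered out.
    println!("top=30  -> threshold {}", sibling_threshold(30.0));
    // Strong top candidates raise the bar proportionally.
    println!("top=100 -> threshold {}", sibling_threshold(100.0));
}
```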

Cleanup pipeline (prep_article) and practical effect

prep_article() is a cleanup pass over extracted content. Order matters.

Main actions:

  1. Remove tiny share/social blocks
  2. Mark data tables vs layout tables
  3. Repair lazy images (data-* to real src/srcset)
  4. Conditionally clean forms/fieldsets
  5. Remove junk nodes (footer, aside, noisy embeds, inputs)
  6. Remove negative-weight headings
  7. Conditionally clean table, ul, div
  8. Rename h1 to h2
  9. Strip presentational attributes/styles
  10. Remove empty paragraphs and extra <br>
  11. Flatten single-cell tables

Practical effect:

  • Output becomes much more stable for markdown conversion
  • Placeholder images and layout artifacts are reduced
  • Link and text quality gets better for LLM/RAG pipelines

Retry strategy and policies (Strict/Moderate/Clean/Raw)

dom_smoothie exposes fixed policies:

  • Strict = StripUnlikelys + WeightClasses + CleanConditionally
  • Moderate = WeightClasses + CleanConditionally
  • Clean = CleanConditionally
  • Raw = no heuristic flags

The default parse() does a staged fallback if the extracted text is below char_threshold:

  1. run strict
  2. disable StripUnlikelys
  3. disable WeightClasses
  4. disable CleanConditionally

If the threshold is never reached, the longest attempt is returned.

Use parse_with_policy(policy) when you want one deterministic pass.
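The staged fallback can be modeled as a loop over flag sets. This is a simplified sketch of the behavior described above, not the crate's code; extract stands in for a single extraction attempt.

```rust
/// Heuristic flags toggled between retries (simplified model).
#[derive(Clone, Copy)]
struct Flags {
    strip_unlikelys: bool,
    weight_classes: bool,
    clean_conditionally: bool,
}

/// Staged fallback: start strict, then relax one flag per retry.
/// `extract` stands in for one extraction attempt returning text.
fn parse_with_fallback(extract: impl Fn(Flags) -> String, char_threshold: usize) -> String {
    let stages = [
        Flags { strip_unlikelys: true,  weight_classes: true,  clean_conditionally: true  },
        Flags { strip_unlikelys: false, weight_classes: true,  clean_conditionally: true  },
        Flags { strip_unlikelys: false, weight_classes: false, clean_conditionally: true  },
        Flags { strip_unlikelys: false, weight_classes: false, clean_conditionally: false },
    ];
    let mut longest = String::new();
    for flags in stages {
        let text = extract(flags);
        if text.len() >= char_threshold {
            return text; // good enough: stop retrying
        }
        if text.len() > longest.len() {
            longest = text;
        }
    }
    longest // threshold never reached: return the longest attempt
}

fn main() {
    // Simulated page where the strict pass over-cleans the content,
    // so the second (relaxed) stage succeeds.
    let out = parse_with_fallback(
        |f| if f.strip_unlikelys { "short".into() } else { "long enough article text".into() },
        20,
    );
    println!("{out}");
}
```

This is also why Strict can be slower on hostile pages: each failed stage is a full extraction pass.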

Preflight is_probably_readable() and when to use it

is_probably_readable() is a cheap gate before full extraction.

It checks nodes like p, pre, article, and some div-related patterns, then:

  • skips hidden/unlikely/list-like paragraph nodes
  • requires minimum text per node (default 140 chars)
  • accumulates score using sqrt(textLen - minContentLen)
  • returns true when threshold is reached (default 20)
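That accumulation can be sketched as follows (illustrative; the defaults of 140 chars per node and a score threshold of 20 follow the numbers above, and node selection/filtering is abstracted away into a list of text lengths):

```rust
/// Cheap readability preflight, per the description above: accumulate
/// sqrt(textLen - minContentLen) over qualifying nodes and return true
/// once the score threshold is reached.
fn is_probably_readable_sketch(node_text_lens: &[usize]) -> bool {
    const MIN_CONTENT_LEN: usize = 140; // default min text per node
    const MIN_SCORE: f64 = 20.0; // default acceptance threshold

    let mut score = 0.0;
    for &len in node_text_lens {
        if len < MIN_CONTENT_LEN {
            continue; // too short to count as content
        }
        score += ((len - MIN_CONTENT_LEN) as f64).sqrt();
        if score >= MIN_SCORE {
            return true; // early exit: page is probably readable
        }
    }
    false
}

fn main() {
    // A few long paragraphs clear the bar quickly.
    println!("{}", is_probably_readable_sketch(&[300, 250, 400]));
    // Short, list-like nodes never accumulate enough score.
    println!("{}", is_probably_readable_sketch(&[50, 80, 120]));
}
```

The sqrt dampening is the important design choice: one giant node cannot dominate, so the gate favors pages with several substantial text blocks.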

Use it when:

  • you crawl many non-article pages
  • you need to save CPU on extraction
  • you want fast early rejection for nav/search/list pages

Do not use it as a quality guarantee. It is a preflight only.

Tuning knobs and tradeoffs

Most useful options in practice:

  • char_threshold Higher value reduces false positives, but can drop short valid articles.

  • n_top_candidates More candidates helps hard pages, but increases processing cost.

  • min_score_to_adjust Changes when link-density penalty starts.

  • candidate_select_mode Readability is conservative, DomSmoothie can recover fragmented content better.

  • max_elements_to_parse Protects CPU/memory on huge DOMs, but can fail on giant pages if set too low.

  • disable_json_ld Faster and simpler metadata path, but you may lose high-quality structured metadata.

  • keep_classes and classes_to_preserve Better styling compatibility vs cleaner output.

  • text_mode (Raw, Formatted, Markdown) Choose based on downstream pipeline, not personal preference.
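Put together, tuning might look like the fragment below. Field names are taken from the options listed above; the exact names, types, and enum variants are assumptions that may differ between crate versions.

```rust
// Illustrative Config fragment; field names follow the options listed
// above and may not match the crate's exact API in every version.
use dom_smoothie::{CandidateSelectMode, Config, TextMode};

let cfg = Config {
    char_threshold: 500,            // raise to reject thin pages
    n_top_candidates: 5,            // more candidates = more CPU
    candidate_select_mode: CandidateSelectMode::DomSmoothie,
    max_elements_to_parse: 0,       // assumed convention: 0 = no limit
    disable_json_ld: false,         // keep structured metadata
    text_mode: TextMode::Markdown,  // match your downstream pipeline
    ..Default::default()
};
```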

Failure cases and debugging checklist

No extractor works on all pages. These are common failure modes.

  • Content loaded by JS after initial HTML (empty shell problem)
  • List/search pages that look text-heavy but are not articles
  • Very short pages that cannot pass thresholds
  • Pages with extreme link-heavy templates
  • Broken HTML where wrappers are malformed
  • Aggressive cleanup removing valid blocks

My quick checklist:

  1. Inspect fetched HTML first. Is real content present server-side?
  2. Check is_probably_readable() result before parse.
  3. Compare Strict vs Raw output lengths.
  4. Try alternative candidate_select_mode.
  5. Lower char_threshold for short-form sources.
  6. Inspect sibling merge impact (lost intro/outro is a common signal).
  7. Verify lazy image normalization if media looks empty.
  8. Log top candidate score, link density, and final text length.

Practical WebCrawlerAPI usage (main_content_only)

If you do not want to run extraction infra yourself, use main_content_only in WebCrawlerAPI.

// Node 18+
const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    main_content_only: true,
    scrape_type: "markdown",
  }),
});

const data = await response.json();
console.log(data);

For quick experiments, you can also use the free tool: HTML Main Content Readability.

If you want the Mozilla baseline first, read this guide: Mozilla Readability Algorithm (Readability.js) explanation. If you want the JS implementation tutorial, read: Extracting article or blogpost content with Mozilla Readability.


About the Author

Andrii Mazurian (@andriixzvf)

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

An engineer with 15 years of experience in APIs, big data, and infrastructure, he founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and has been shipping it every day since.