Table of Contents
- The most important heuristics in plain English
- High-level end-to-end flow (`parse()` stages)
- Core extraction model: candidates, scoring, ancestors, links, winners
- 1) Candidate collection
- 2) Base scoring
- 3) Ancestor propagation
- 4) Intrinsic score and class weighting
- 5) Link density adjustment
- 6) Top candidates
- Candidate selection modes: Readability vs DomSmoothie
- `CandidateSelectMode::Readability`
- `CandidateSelectMode::DomSmoothie`
- Sibling merge and why it matters
- Cleanup pipeline (`prep_article`) and practical effect
- Retry strategy and policies (`Strict/Moderate/Clean/Raw`)
- Preflight `is_probably_readable()` and when to use
- Tuning knobs and tradeoffs
- Failure cases and debugging checklist
- Practical WebCrawlerAPI usage (`main_content_only`)
Hi, I'm Andrew. I work on scraping and crawling systems every day in WebCrawlerAPI.
If you already used Mozilla Readability, dom_smoothie will feel familiar. It uses the same core idea: score the DOM, find the main container, clean it, and return the article content. On top of that, dom_smoothie adds extra controls for retries, candidate selection, and output shaping.
If you want the Mozilla baseline explanation first, read: Mozilla Readability Algorithm (Readability.js) explanation. If you want a code-first JavaScript integration guide, read: Extracting article or blogpost content with Mozilla Readability.
In this post, I will explain how it works in real life, where it fails, and how to tune it without guessing.
The most important heuristics in plain English
These heuristics do most of the heavy lifting:
- Unlikely blocks are removed early. Sidebars, banners, menus, modal UI, and hidden nodes are filtered out.
- Class and id names are weighted. Positive names get a score bonus, negative names get a penalty.
- Text-like blocks are scored, not full pages. Paragraph-ish nodes are used as the signal source.
- Scores are propagated to ancestors. The real article is usually a parent container, not a single <p>.
- Link density is used as a penalty. Navigation-heavy blocks usually contain many links and little writing.
- Siblings around the winner are merged. This recovers paragraphs split by ads, widgets, or template wrappers.
- Conditional cleanup runs after extraction. Tables, forms, junk embeds, empty tags, and noisy blocks are removed.
That is the practical core. It is not magic. It is a layered heuristic pipeline.
High-level end-to-end flow (parse() stages)
At a high level, Readability::parse() in dom_smoothie does this:
- Check parse budget (max_elements_to_parse).
- Parse metadata from JSON-LD (optional).
- Parse metadata from <meta> and <title>, then merge.
- Prepare DOM (remove scripts/styles/comments, normalize messy HTML).
- Pre-filter obvious noise (hidden/dialog/byline/duplicate title nodes).
- Run main content extraction and pick best candidate.
- Clean extracted content with prep_article().
- Post-process URLs/links/classes.
- Build final Article object.
If extraction fails, GrabFailed is returned.
Real-life note: the default parse() can retry with relaxed heuristics if the output is too short. This costs CPU, but it improves the success rate on weird pages.
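Before going deeper into the heuristics, here is what a basic call looks like in Rust. Treat it as a minimal sketch: the constructor shape (raw HTML, optional document URL, optional Config) and the Article field names follow the crate's docs as I remember them, so verify against the version you install.

```rust
use dom_smoothie::{Config, Readability};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<html><body><article><p>Some long article text goes here...</p></article></body></html>"#;

    // Default config mirrors the staged pipeline described above.
    let cfg = Config::default();

    // Constructor: raw HTML, optional document URL (used to absolutize links),
    // optional config.
    let mut readability = Readability::new(html, Some("https://example.com/post"), Some(cfg))?;

    // parse() may retry with relaxed flags if the result is shorter than char_threshold.
    let article = readability.parse()?;

    println!("title: {}", article.title);
    println!("text length: {}", article.text_content.len());
    Ok(())
}
```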
Core extraction model: candidates, scoring, ancestors, links, winners
Inside each extraction attempt, the flow is simple and strict.
1) Candidate collection
The body is walked in document order.
- Unlikely candidates can be removed
- Empty structural nodes can be removed
- Score-eligible tags are collected (section, h2-h6, p, td, pre)
- div blocks are normalized into paragraph-like structure when needed
This normalization step is important because many pages use <div> for everything.
2) Base scoring
Very short text blocks are ignored (< 25 chars).
For valid text blocks, base score is built from:
- constant base (2)
- punctuation signal (comma-like count)
- text length bonus (capped)
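A compact sketch of that base score in Rust. The plain ASCII comma count and the cap of 3 on the length bonus are simplifications for illustration; the real "comma-like" signal covers more punctuation variants.

```rust
// Sketch of the per-block base score described above.
fn base_score(text: &str) -> f32 {
    let len = text.trim().len();
    if len < 25 {
        return 0.0; // too short to be a signal; ignored in the real pipeline
    }
    let commas = text.matches(',').count() as f32; // simplified "comma-like" punctuation signal
    let length_bonus = ((len / 100) as f32).min(3.0); // length bonus, capped
    2.0 + commas + length_bonus // constant base of 2
}
```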
3) Ancestor propagation
That score is pushed up through the ancestors, up to a depth of 5:
- parent gets full share
- grandparent gets half
- higher levels get smaller fraction
This is how wrapper containers win, not leaf text nodes.
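As a sketch, the propagation is just a share function over the ancestor level. The divisor for deeper levels (level * 3) is the Mozilla Readability choice and is an assumption here.

```rust
// How a block's score is shared with its ancestors, up to depth 5.
fn ancestor_share(block_score: f32, level: usize) -> f32 {
    match level {
        0 => block_score,                             // parent: full share
        1 => block_score / 2.0,                       // grandparent: half
        n if n < 5 => block_score / (n as f32 * 3.0), // deeper levels: smaller fraction
        _ => 0.0,                                     // beyond depth 5: nothing
    }
}
```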
4) Intrinsic score and class weighting
Ancestors also get intrinsic score by tag type:
- div gets positive prior
- pre, td, blockquote get smaller positive prior
- lists/forms/headings/tables can get penalties depending on type
If class/id weighting is enabled, positive patterns add points and negative patterns subtract points.
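The sketch below shows the shape of both rules. The concrete constants (+5/+3/-3/-5 tag priors, ±25 class weight) and the pattern lists are Mozilla Readability values used for illustration; dom_smoothie's exact numbers and regexes may differ.

```rust
// Intrinsic prior by tag type.
fn tag_prior(tag: &str) -> f32 {
    match tag {
        "div" => 5.0,
        "pre" | "td" | "blockquote" => 3.0,
        "address" | "ol" | "ul" | "dl" | "dd" | "dt" | "li" | "form" => -3.0,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "th" => -5.0,
        _ => 0.0,
    }
}

// Class/id weighting: positive patterns add points, negative patterns subtract.
fn class_weight(class_and_id: &str) -> f32 {
    let s = class_and_id.to_lowercase();
    let negative = ["comment", "sidebar", "footer", "banner", "sponsor", "promo"];
    let positive = ["article", "body", "content", "entry", "main", "post", "text"];
    let mut weight = 0.0;
    if negative.iter().any(|&p| s.contains(p)) {
        weight -= 25.0; // negative pattern penalty
    }
    if positive.iter().any(|&p| s.contains(p)) {
        weight += 25.0; // positive pattern bonus
    }
    weight
}
```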
5) Link density adjustment
For candidates above a minimum score threshold, the score is adjusted:
adjusted = score * (1 - linkDensity)
This prevents nav-like blocks from winning when link text dominates.
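In code, the adjustment is just the formula above, with link density defined as the share of the candidate's text that sits inside links:

```rust
// Link density penalty: the more of a block's text lives inside <a> tags,
// the more its score is scaled down.
fn adjust_for_links(score: f32, total_text_len: usize, link_text_len: usize) -> f32 {
    if total_text_len == 0 {
        return 0.0;
    }
    let link_density = link_text_len as f32 / total_text_len as f32;
    score * (1.0 - link_density)
}
```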
6) Top candidates
Only the top n_top_candidates are kept (default 5). The highest-scoring node starts as the top candidate, and the selection mode logic can then promote a better ancestor.
Candidate selection modes: Readability vs DomSmoothie
dom_smoothie supports two selection modes with slightly different behavior.
CandidateSelectMode::Readability
- Mozilla-like behavior
- Looks for common ancestor among high-scoring alternatives
- Uses overlap and relative strength checks
- Usually safer for classic article templates
CandidateSelectMode::DomSmoothie
- Intersects ancestor sets of strong alternatives
- Tries to choose the strongest meaningful common ancestor
- Can be better on fragmented modern layouts with wrappers
In practice:
- Start with Readability if you need conservative behavior
- Switch to DomSmoothie if content is often split or wrapper-heavy
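A hedged config sketch for both modes; the candidate_select_mode field name follows this post's option list, so double-check it against the current Config docs.

```rust
use dom_smoothie::{CandidateSelectMode, Config};

fn selection_configs() -> (Config, Config) {
    // Conservative, Mozilla-like selection for classic article templates.
    let conservative = Config {
        candidate_select_mode: CandidateSelectMode::Readability,
        ..Default::default()
    };
    // Ancestor-intersection selection for fragmented, wrapper-heavy layouts.
    let aggressive = Config {
        candidate_select_mode: CandidateSelectMode::DomSmoothie,
        ..Default::default()
    };
    (conservative, aggressive)
}
```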
Sibling merge and why it matters
After the top candidate is chosen, extraction is not finished.
The algorithm also checks siblings under the same parent and appends the ones that look article-like. Without this step, many articles lose intro or trailing paragraphs.
Sibling inclusion uses:
- a threshold based on the top score (max(10, topScore * 0.2))
- a class-name bonus if the sibling's class matches the top candidate's class
- paragraph heuristics for unscored siblings:
  - enough length
  - sentence-like text
  - low link density
This single step fixes a lot of real pages that inject ads between paragraphs.
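A sketch of the sibling-inclusion rule. The 80-character and 0.25 link-density cutoffs in the fallback paragraph heuristic are Mozilla Readability values used here for illustration.

```rust
// Threshold a sibling must clear to be merged into the winner.
fn sibling_threshold(top_score: f32) -> f32 {
    (top_score * 0.2).max(10.0)
}

// Decide whether a sibling of the top candidate gets appended.
fn include_sibling(
    sibling_score: f32,
    top_score: f32,
    same_class_as_top: bool,
    text_len: usize,
    link_density: f32,
    looks_like_sentence: bool,
) -> bool {
    // Class-name bonus: siblings styled like the winner get a head start.
    let bonus = if same_class_as_top { top_score * 0.2 } else { 0.0 };
    if sibling_score + bonus >= sibling_threshold(top_score) {
        return true;
    }
    // Fallback paragraph heuristic for unscored siblings: long enough,
    // sentence-like, and not dominated by links.
    text_len > 80 && link_density < 0.25 && looks_like_sentence
}
```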
Cleanup pipeline (prep_article) and practical effect
prep_article() is a cleanup pass over extracted content. Order matters.
Main actions:
- Remove tiny share/social blocks
- Mark data tables vs layout tables
- Repair lazy images (data-* to real src/srcset)
- Conditionally clean forms/fieldsets
- Remove junk nodes (footer, aside, noisy embeds, inputs)
- Remove negative-weight headings
- Conditionally clean table, ul, div
- Rename h1 to h2
- Strip presentational attributes/styles
- Remove empty paragraphs and extra <br>
- Flatten single-cell tables
Practical effect:
- Output becomes much more stable for markdown conversion
- Placeholder images and layout artifacts are reduced
- Link and text quality gets better for LLM/RAG pipelines
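As one concrete example, the lazy-image repair step boils down to copying data-* attributes into the real ones. The Img struct below is a stand-in for a DOM element, not a dom_smoothie type.

```rust
use std::collections::HashMap;

// Stand-in for an <img> element's attribute map.
struct Img {
    attrs: HashMap<String, String>,
}

// Copy data-src / data-srcset into src / srcset so images survive cleanup
// and downstream markdown conversion.
fn repair_lazy_image(img: &mut Img) {
    for (lazy, real) in [("data-src", "src"), ("data-srcset", "srcset")] {
        if !img.attrs.contains_key(real) {
            if let Some(value) = img.attrs.get(lazy).cloned() {
                img.attrs.insert(real.to_string(), value);
            }
        }
    }
}
```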
Retry strategy and policies (Strict/Moderate/Clean/Raw)
dom_smoothie exposes fixed policies:
- Strict = StripUnlikelys + WeightClasses + CleanConditionally
- Moderate = WeightClasses + CleanConditionally
- Clean = CleanConditionally
- Raw = no heuristic flags
The default parse() does a staged fallback if the extracted text is below char_threshold:
- run strict
- disable StripUnlikelys
- disable WeightClasses
- disable CleanConditionally
If the threshold is never reached, the longest attempt is returned.
Use parse_with_policy(policy) when you want one deterministic pass.
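To make the fallback order concrete, here is a standalone sketch of that loop (not dom_smoothie internals); the flag names mirror the policies above.

```rust
// Heuristic flags, progressively relaxed by the staged fallback.
#[derive(Clone, Copy)]
struct Flags {
    strip_unlikelys: bool,
    weight_classes: bool,
    clean_conditionally: bool,
}

// Try strict first, then drop heuristics one by one; stop as soon as the
// result clears char_threshold, otherwise keep the longest attempt.
fn staged_extract(extract: impl Fn(Flags) -> String, char_threshold: usize) -> String {
    let stages = [
        Flags { strip_unlikelys: true, weight_classes: true, clean_conditionally: true },   // Strict
        Flags { strip_unlikelys: false, weight_classes: true, clean_conditionally: true },  // drop StripUnlikelys
        Flags { strip_unlikelys: false, weight_classes: false, clean_conditionally: true }, // drop WeightClasses
        Flags { strip_unlikelys: false, weight_classes: false, clean_conditionally: false },// drop CleanConditionally (Raw)
    ];
    let mut best = String::new();
    for flags in stages {
        let text = extract(flags);
        if text.len() >= char_threshold {
            return text; // good enough, stop retrying
        }
        if text.len() > best.len() {
            best = text; // remember the longest attempt
        }
    }
    best
}
```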
Preflight is_probably_readable() and when to use
is_probably_readable() is a cheap gate before full extraction.
It checks nodes like p, pre, article, and some div-related patterns, then:
- skips hidden/unlikely/list-like paragraph nodes
- requires minimum text per node (default 140 chars)
- accumulates score using sqrt(textLen - minContentLen)
- returns true when threshold is reached (default 20)
Use it when:
- you crawl many non-article pages
- you need to save CPU on extraction
- you want fast early rejection for nav/search/list pages
Do not use it as a quality guarantee. It is a preflight only.
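The scoring loop itself is tiny; here is a self-contained sketch using the defaults quoted above.

```rust
// Preflight sketch: each candidate node with enough text adds
// sqrt(text_len - min_content_len) to a running score; the page counts as
// "probably readable" once the score clears the threshold.
fn is_probably_readable(node_text_lens: &[usize], min_content_len: usize, min_score: f64) -> bool {
    let mut score = 0.0;
    for &len in node_text_lens {
        if len < min_content_len {
            continue; // node too short to count
        }
        score += ((len - min_content_len) as f64).sqrt();
        if score >= min_score {
            return true; // early exit: cheap gate, not a quality guarantee
        }
    }
    false
}
```

For example, two qualifying nodes of 300 and 500 characters with the default 140-char minimum contribute roughly 12.6 + 19.0 = 31.6, which clears the default threshold of 20.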
Tuning knobs and tradeoffs
Most useful options in practice:
- char_threshold: Higher value reduces false positives, but can drop short valid articles.
- n_top_candidates: More candidates helps hard pages, but increases processing cost.
- min_score_to_adjust: Changes when the link-density penalty starts.
- candidate_select_mode: Readability is conservative, DomSmoothie can recover fragmented content better.
- max_elements_to_parse: Protects CPU/memory on huge DOMs, but can fail giant pages if too low.
- disable_json_ld: Faster and simpler metadata path, but you may lose high-quality structured metadata.
- keep_classes and classes_to_preserve: Better styling compatibility vs cleaner output.
- text_mode (Raw, Formatted, Markdown): Choose based on downstream pipeline, not personal preference.
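A hedged configuration sketch tying a few of these knobs together; field and enum names follow the option names in this post, and the values are examples, not recommendations.

```rust
use dom_smoothie::{Config, TextMode};

fn crawl_config() -> Config {
    Config {
        char_threshold: 400,            // stricter bar before accepting an extraction
        n_top_candidates: 5,            // default-sized candidate pool
        max_elements_to_parse: 200_000, // parse budget to protect CPU/memory on huge DOMs
        disable_json_ld: false,         // keep structured metadata when present
        text_mode: TextMode::Markdown,  // markdown output for LLM/RAG pipelines
        ..Default::default()
    }
}
```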
Failure cases and debugging checklist
No extractor works on all pages. These are common failure modes.
- Content loaded by JS after initial HTML (empty shell problem)
- List/search pages that look text-heavy but are not articles
- Very short pages that cannot pass thresholds
- Pages with extreme link-heavy templates
- Broken HTML where wrappers are malformed
- Aggressive cleanup removing valid blocks
My quick checklist:
- Inspect fetched HTML first. Is real content present server-side?
- Check is_probably_readable() result before parse.
- Compare Strict vs Raw output lengths.
- Try alternative candidate_select_mode.
- Lower char_threshold for short-form sources.
- Inspect sibling merge impact (lost intro/outro is a common signal).
- Verify lazy image normalization if media looks empty.
- Log top candidate score, link density, and final text length.
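A small A/B helper covers a few of these checklist items: run the same HTML through different configs and compare the extracted text lengths. As before, the constructor shape and field names are assumptions to verify against the current docs.

```rust
use dom_smoothie::{CandidateSelectMode, Config, Readability};

fn debug_compare(html: &str) -> Result<(), Box<dyn std::error::Error>> {
    let variants = [
        ("defaults", Config::default()),
        ("low char_threshold", Config { char_threshold: 100, ..Default::default() }),
        ("dom-smoothie select", Config {
            candidate_select_mode: CandidateSelectMode::DomSmoothie,
            ..Default::default()
        }),
    ];
    for (label, cfg) in variants {
        let mut r = Readability::new(html, None, Some(cfg))?;
        let article = r.parse()?;
        println!("{label}: {} chars", article.text_content.len());
    }
    Ok(())
}
```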
Practical WebCrawlerAPI usage (main_content_only)
If you do not want to run extraction infra yourself, use main_content_only in WebCrawlerAPI.
```js
// Node 18+
const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    main_content_only: true,
    scrape_type: "markdown",
  }),
});

const data = await response.json();
console.log(data);
```
For quick experiments, you can also use the free tool: HTML Main Content Readability.
If you want the Mozilla baseline first, read this guide: Mozilla Readability Algorithm (Readability.js) explanation. If you want the JS implementation tutorial, read: Extracting article or blogpost content with Mozilla Readability.