Table of Contents
- The most important heuristics in plain English
- High-level end-to-end flow (`parse()` stages)
- Core extraction model: candidates, scoring, ancestors, links, winners
- 1) Candidate collection
- 2) Base scoring
- 3) Ancestor propagation
- 4) Intrinsic score and class weighting
- 5) Link density adjustment
- 6) Top candidates
- Candidate selection modes: Readability vs DomSmoothie
- `CandidateSelectMode::Readability`
- `CandidateSelectMode::DomSmoothie`
- Sibling merge and why it matters
- Cleanup pipeline (`prep_article`) and practical effect
- Retry strategy and policies (`Strict/Moderate/Clean/Raw`)
- Preflight `is_probably_readable()` and when to use
- Tuning knobs and tradeoffs
- Failure cases and debugging checklist
- Practical WebCrawlerAPI usage (`main_content_only`)
Hi, I'm Andrew. I work on scraping and crawling systems every day in WebCrawlerAPI.
If you already used Mozilla Readability, dom_smoothie will feel familiar. It uses the same core idea: score the DOM, find the main container, clean it, and return the article content. On top of that, dom_smoothie adds extra controls for retries, candidate selection, and output shaping.
If you want the Mozilla baseline explanation first, read: Mozilla Readability Algorithm (Readability.js) explanation. If you want a code-first JavaScript integration guide, read: Extracting article or blogpost content with Mozilla Readability.
In this post, I will explain how it works in real life, where it fails, and how to tune it without guessing.
The most important heuristics in plain English
These heuristics do most of the heavy lifting:
- Unlikely blocks are removed early. Sidebars, banners, menus, modal UI, and hidden nodes are filtered out.
- Class and id names are weighted. Positive names get a score bonus, negative names get a penalty.
- Text-like blocks are scored, not full pages. Paragraph-ish nodes are used as the signal source.
- Scores are propagated to ancestors. The real article is usually a parent container, not a single <p>.
- Link density is used as a penalty. Navigation-heavy blocks usually contain many links and little writing.
- Siblings around the winner are merged. This recovers paragraphs split by ads, widgets, or template wrappers.
- Conditional cleanup runs after extraction. Tables, forms, junk embeds, empty tags, and noisy blocks are removed.
That is the practical core. It is not magic. It is a layered heuristic pipeline.
High-level end-to-end flow (parse() stages)
At a high level, Readability::parse() in dom_smoothie does this:
- Check parse budget (max_elements_to_parse).
- Parse metadata from JSON-LD (optional).
- Parse metadata from <meta> and <title>, then merge.
- Prepare DOM (remove scripts/styles/comments, normalize messy HTML).
- Pre-filter obvious noise (hidden/dialog/byline/duplicate title nodes).
- Run main content extraction and pick best candidate.
- Clean extracted content with prep_article().
- Post-process URLs/links/classes.
- Build final Article object.
If extraction fails, GrabFailed is returned.
Real-life note: the default parse() can retry with relaxed heuristics if the output is too short. This costs CPU, but it improves the success rate on weird pages.
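Before going deeper into the heuristics, here is what a basic call looks like in Rust. Treat it as a minimal sketch: the constructor shape (raw HTML, optional document URL, optional Config) and the Article field names follow the crate's docs as I remember them, so verify against the version you install.

```rust
use dom_smoothie::{Config, Readability};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<html><body><article><p>Some long article text goes here...</p></article></body></html>"#;

    // Default config mirrors the staged pipeline described above.
    let cfg = Config::default();

    // Constructor: raw HTML, optional document URL (used to absolutize links),
    // optional config.
    let mut readability = Readability::new(html, Some("https://example.com/post"), Some(cfg))?;

    // parse() may retry with relaxed flags if the result is shorter than char_threshold.
    let article = readability.parse()?;

    println!("title: {}", article.title);
    println!("text length: {}", article.text_content.len());
    Ok(())
}
```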
Core extraction model: candidates, scoring, ancestors, links, winners
Inside each extraction attempt, the flow is simple and strict.
1) Candidate collection
The body is walked in document order.
- Unlikely candidates can be removed
- Empty structural nodes can be removed
- Score-eligible tags are collected (section, h2-h6, p, td, pre)
- div blocks are normalized into paragraph-like structure when needed
This normalization step is important because many pages use <div> for everything.
2) Base scoring
Very short text blocks are ignored (< 25 chars).
For valid text blocks, base score is built from:
- constant base (2)
- punctuation signal (comma-like count)
- text length bonus (capped)
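A compact sketch of that base score in Rust. The plain ASCII comma count and the cap of 3 on the length bonus are simplifications for illustration; the real "comma-like" signal covers more punctuation variants.

```rust
// Sketch of the per-block base score described above.
fn base_score(text: &str) -> f32 {
    let len = text.trim().len();
    if len < 25 {
        return 0.0; // too short to be a signal; ignored in the real pipeline
    }
    let commas = text.matches(',').count() as f32; // simplified "comma-like" punctuation signal
    let length_bonus = ((len / 100) as f32).min(3.0); // length bonus, capped
    2.0 + commas + length_bonus // constant base of 2
}
```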
3) Ancestor propagation
That score is pushed up through the ancestors, up to a depth of 5:
- parent gets full share
- grandparent gets half
- higher levels get smaller fraction
This is how wrapper containers win, not leaf text nodes.
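As a sketch, the propagation is just a share function over the ancestor level. The divisor for deeper levels (level * 3) is the Mozilla Readability choice and is an assumption here.

```rust
// How a block's score is shared with its ancestors, up to depth 5.
fn ancestor_share(block_score: f32, level: usize) -> f32 {
    match level {
        0 => block_score,                             // parent: full share
        1 => block_score / 2.0,                       // grandparent: half
        n if n < 5 => block_score / (n as f32 * 3.0), // deeper levels: smaller fraction
        _ => 0.0,                                     // beyond depth 5: nothing
    }
}
```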
4) Intrinsic score and class weighting
Ancestors also get intrinsic score by tag type:
- div gets positive prior
- pre, td, blockquote get smaller positive prior
- lists/forms/headings/tables can get penalties depending on type
If class/id weighting is enabled, positive patterns add points and negative patterns subtract points.
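The sketch below shows the shape of both rules. The concrete constants (+5/+3/-3/-5 tag priors, ±25 class weight) and the pattern lists are Mozilla Readability values used for illustration; dom_smoothie's exact numbers and regexes may differ.

```rust
// Intrinsic prior by tag type.
fn tag_prior(tag: &str) -> f32 {
    match tag {
        "div" => 5.0,
        "pre" | "td" | "blockquote" => 3.0,
        "address" | "ol" | "ul" | "dl" | "dd" | "dt" | "li" | "form" => -3.0,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "th" => -5.0,
        _ => 0.0,
    }
}

// Class/id weighting: positive patterns add points, negative patterns subtract.
fn class_weight(class_and_id: &str) -> f32 {
    let s = class_and_id.to_lowercase();
    let negative = ["comment", "sidebar", "footer", "banner", "sponsor", "promo"];
    let positive = ["article", "body", "content", "entry", "main", "post", "text"];
    let mut weight = 0.0;
    if negative.iter().any(|&p| s.contains(p)) {
        weight -= 25.0; // negative pattern penalty
    }
    if positive.iter().any(|&p| s.contains(p)) {
        weight += 25.0; // positive pattern bonus
    }
    weight
}
```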
5) Link density adjustment
For candidates above a minimum score threshold, the score is adjusted:
adjusted = score * (1 - linkDensity)
This prevents nav-like blocks from winning when link text dominates.
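In code, the adjustment is just the formula above, with link density defined as the share of the candidate's text that sits inside links:

```rust
// Link density penalty: the more of a block's text lives inside <a> tags,
// the more its score is scaled down.
fn adjust_for_links(score: f32, total_text_len: usize, link_text_len: usize) -> f32 {
    if total_text_len == 0 {
        return 0.0;
    }
    let link_density = link_text_len as f32 / total_text_len as f32;
    score * (1.0 - link_density)
}
```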
6) Top candidates
Only the top n_top_candidates are kept (default 5). The highest-scoring node starts as the top candidate, and the selection mode logic can then promote a better ancestor.
Candidate selection modes: Readability vs DomSmoothie
dom_smoothie supports two selection modes with slightly different behavior.
CandidateSelectMode::Readability
- Mozilla-like behavior
- Looks for common ancestor among high-scoring alternatives
- Uses overlap and relative strength checks
- Usually safer for classic article templates
CandidateSelectMode::DomSmoothie
- Intersects ancestor sets of strong alternatives
- Tries to choose the strongest meaningful common ancestor
- Can be better on fragmented modern layouts with wrappers
In practice:
- Start with Readability if you need conservative behavior
- Switch to DomSmoothie if content is often split or wrapper-heavy
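A hedged config sketch for both modes; the candidate_select_mode field name follows this post's option list, so double-check it against the current Config docs.

```rust
use dom_smoothie::{CandidateSelectMode, Config};

fn selection_configs() -> (Config, Config) {
    // Conservative, Mozilla-like selection for classic article templates.
    let conservative = Config {
        candidate_select_mode: CandidateSelectMode::Readability,
        ..Default::default()
    };
    // Ancestor-intersection selection for fragmented, wrapper-heavy layouts.
    let aggressive = Config {
        candidate_select_mode: CandidateSelectMode::DomSmoothie,
        ..Default::default()
    };
    (conservative, aggressive)
}
```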
Sibling merge and why it matters
After the top candidate is chosen, extraction is not finished.
The algorithm also checks siblings under the same parent and appends the ones that look article-like. Without this step, many articles lose intro or trailing paragraphs.
Sibling inclusion uses:
- a threshold based on the top score (max(10, topScore * 0.2))
- a class-name bonus if the sibling's class matches the top candidate's class
- paragraph heuristics for unscored siblings:
  - enough length
  - sentence-like text
  - low link density
This single step fixes a lot of real pages that inject ads between paragraphs.
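A sketch of the sibling-inclusion rule. The 80-character and 0.25 link-density cutoffs in the fallback paragraph heuristic are Mozilla Readability values used here for illustration.

```rust
// Threshold a sibling must clear to be merged into the winner.
fn sibling_threshold(top_score: f32) -> f32 {
    (top_score * 0.2).max(10.0)
}

// Decide whether a sibling of the top candidate gets appended.
fn include_sibling(
    sibling_score: f32,
    top_score: f32,
    same_class_as_top: bool,
    text_len: usize,
    link_density: f32,
    looks_like_sentence: bool,
) -> bool {
    // Class-name bonus: siblings styled like the winner get a head start.
    let bonus = if same_class_as_top { top_score * 0.2 } else { 0.0 };
    if sibling_score + bonus >= sibling_threshold(top_score) {
        return true;
    }
    // Fallback paragraph heuristic for unscored siblings: long enough,
    // sentence-like, and not dominated by links.
    text_len > 80 && link_density < 0.25 && looks_like_sentence
}
```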
Cleanup pipeline (prep_article) and practical effect
prep_article() is a cleanup pass over extracted content. Order matters.
Main actions:
- Remove tiny share/social blocks
- Mark data tables vs layout tables
- Repair lazy images (data-* to real src/srcset)
- Conditionally clean forms/fieldsets
- Remove junk nodes (footer, aside, noisy embeds, inputs)
- Remove negative-weight headings
- Conditionally clean table, ul, div
- Rename h1 to h2
- Strip presentational attributes/styles
- Remove empty paragraphs and extra <br>
- Flatten single-cell tables
Practical effect:
- Output becomes much more stable for markdown conversion
- Placeholder images and layout artifacts are reduced
- Link and text quality gets better for LLM/RAG pipelines
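As one concrete example, the lazy-image repair step boils down to copying data-* attributes into the real ones. The Img struct below is a stand-in for a DOM element, not a dom_smoothie type.

```rust
use std::collections::HashMap;

// Stand-in for an <img> element's attribute map.
struct Img {
    attrs: HashMap<String, String>,
}

// Copy data-src / data-srcset into src / srcset so images survive cleanup
// and downstream markdown conversion.
fn repair_lazy_image(img: &mut Img) {
    for (lazy, real) in [("data-src", "src"), ("data-srcset", "srcset")] {
        if !img.attrs.contains_key(real) {
            if let Some(value) = img.attrs.get(lazy).cloned() {
                img.attrs.insert(real.to_string(), value);
            }
        }
    }
}
```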
Retry strategy and policies (Strict/Moderate/Clean/Raw)
dom_smoothie exposes fixed policies:
- Strict = StripUnlikelys + WeightClasses + CleanConditionally
- Moderate = WeightClasses + CleanConditionally
- Clean = CleanConditionally
- Raw = no heuristic flags
The default parse() does a staged fallback if the extracted text is below char_threshold:
- run strict
- disable StripUnlikelys
- disable WeightClasses
- disable CleanConditionally
If the threshold is never reached, the longest attempt is returned.
Use parse_with_policy(policy) when you want one deterministic pass.
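To make the fallback order concrete, here is a standalone sketch of that loop (not dom_smoothie internals); the flag names mirror the policies above.

```rust
// Heuristic flags, progressively relaxed by the staged fallback.
#[derive(Clone, Copy)]
struct Flags {
    strip_unlikelys: bool,
    weight_classes: bool,
    clean_conditionally: bool,
}

// Try strict first, then drop heuristics one by one; stop as soon as the
// result clears char_threshold, otherwise keep the longest attempt.
fn staged_extract(extract: impl Fn(Flags) -> String, char_threshold: usize) -> String {
    let stages = [
        Flags { strip_unlikelys: true, weight_classes: true, clean_conditionally: true },   // Strict
        Flags { strip_unlikelys: false, weight_classes: true, clean_conditionally: true },  // drop StripUnlikelys
        Flags { strip_unlikelys: false, weight_classes: false, clean_conditionally: true }, // drop WeightClasses
        Flags { strip_unlikelys: false, weight_classes: false, clean_conditionally: false },// drop CleanConditionally (Raw)
    ];
    let mut best = String::new();
    for flags in stages {
        let text = extract(flags);
        if text.len() >= char_threshold {
            return text; // good enough, stop retrying
        }
        if text.len() > best.len() {
            best = text; // remember the longest attempt
        }
    }
    best
}
```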
Preflight is_probably_readable() and when to use
is_probably_readable() is a cheap gate before full extraction.
It checks nodes like p, pre, article, and some div-related patterns, then:
- skips hidden/unlikely/list-like paragraph nodes
- requires minimum text per node (default 140 chars)
- accumulates score using sqrt(textLen - minContentLen)
- returns true when threshold is reached (default 20)
Use it when:
- you crawl many non-article pages
- you need to save CPU on extraction
- you want fast early rejection for nav/search/list pages
Do not use it as a quality guarantee. It is a preflight only.
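The scoring loop itself is tiny; here is a self-contained sketch using the defaults quoted above.

```rust
// Preflight sketch: each candidate node with enough text adds
// sqrt(text_len - min_content_len) to a running score; the page counts as
// "probably readable" once the score clears the threshold.
fn is_probably_readable(node_text_lens: &[usize], min_content_len: usize, min_score: f64) -> bool {
    let mut score = 0.0;
    for &len in node_text_lens {
        if len < min_content_len {
            continue; // node too short to count
        }
        score += ((len - min_content_len) as f64).sqrt();
        if score >= min_score {
            return true; // early exit: cheap gate, not a quality guarantee
        }
    }
    false
}
```

For example, two qualifying nodes of 300 and 500 characters with the default 140-char minimum contribute roughly 12.6 + 19.0 = 31.6, which clears the default threshold of 20.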
Tuning knobs and tradeoffs
Most useful options in practice:
- char_threshold: Higher value reduces false positives, but can drop short valid articles.
- n_top_candidates: More candidates helps hard pages, but increases processing cost.
- min_score_to_adjust: Changes when the link-density penalty starts.
- candidate_select_mode: Readability is conservative, DomSmoothie can recover fragmented content better.
- max_elements_to_parse: Protects CPU/memory on huge DOMs, but can fail giant pages if too low.
- disable_json_ld: Faster and simpler metadata path, but you may lose high-quality structured metadata.
- keep_classes and classes_to_preserve: Better styling compatibility vs cleaner output.
- text_mode (Raw, Formatted, Markdown): Choose based on downstream pipeline, not personal preference.
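A hedged configuration sketch tying a few of these knobs together; field and enum names follow the option names in this post, and the values are examples, not recommendations.

```rust
use dom_smoothie::{Config, TextMode};

fn crawl_config() -> Config {
    Config {
        char_threshold: 400,            // stricter bar before accepting an extraction
        n_top_candidates: 5,            // default-sized candidate pool
        max_elements_to_parse: 200_000, // parse budget to protect CPU/memory on huge DOMs
        disable_json_ld: false,         // keep structured metadata when present
        text_mode: TextMode::Markdown,  // markdown output for LLM/RAG pipelines
        ..Default::default()
    }
}
```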
Failure cases and debugging checklist
No extractor works on all pages. These are common failure modes.
- Content loaded by JS after initial HTML (empty shell problem)
- List/search pages that look text-heavy but are not articles
- Very short pages that cannot pass thresholds
- Pages with extreme link-heavy templates
- Broken HTML where wrappers are malformed
- Aggressive cleanup removing valid blocks
My quick checklist:
- Inspect fetched HTML first. Is real content present server-side?
- Check is_probably_readable() result before parse.
- Compare Strict vs Raw output lengths.
- Try alternative candidate_select_mode.
- Lower char_threshold for short-form sources.
- Inspect sibling merge impact (lost intro/outro is a common signal).
- Verify lazy image normalization if media looks empty.
- Log top candidate score, link density, and final text length.
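A small A/B helper covers a few of these checklist items: run the same HTML through different configs and compare the extracted text lengths. As before, the constructor shape and field names are assumptions to verify against the current docs.

```rust
use dom_smoothie::{CandidateSelectMode, Config, Readability};

fn debug_compare(html: &str) -> Result<(), Box<dyn std::error::Error>> {
    let variants = [
        ("defaults", Config::default()),
        ("low char_threshold", Config { char_threshold: 100, ..Default::default() }),
        ("dom-smoothie select", Config {
            candidate_select_mode: CandidateSelectMode::DomSmoothie,
            ..Default::default()
        }),
    ];
    for (label, cfg) in variants {
        let mut r = Readability::new(html, None, Some(cfg))?;
        let article = r.parse()?;
        println!("{label}: {} chars", article.text_content.len());
    }
    Ok(())
}
```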
Practical WebCrawlerAPI usage (main_content_only)
If you do not want to run extraction infra yourself, use main_content_only in WebCrawlerAPI.
```js
// Node 18+
const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    main_content_only: true,
    scrape_type: "markdown",
  }),
});

const data = await response.json();
console.log(data);
```
For quick experiments, you can also use the free tool: HTML Main Content Readability.
If you want the Mozilla baseline first, read this guide: Mozilla Readability Algorithm (Readability.js) explanation. If you want the JS implementation tutorial, read: Extracting article or blogpost content with Mozilla Readability.