Complete article content extraction code
If you want to understand how it works inside (scoring, candidates, cleanup), read: Mozilla Readability Algorithm (Readability.js), Step by Step. If you want the Rust alternative with extra policy and candidate-selection controls, read: How dom_smoozie Rust Mozilla Readability alternative works.
Here's a standalone JavaScript function that combines HTML cleaning with Mozilla's Readability parser:
// npm install @mozilla/readability jsdom
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

function extractArticleContent(url, html) {
  try {
    // Create a JSDOM document from the HTML
    const dom = new JSDOM(html, {
      url: url,
      contentType: "text/html",
    });
    const document = dom.window.document;

    // Optional: clean unwanted elements first
    const unwantedElements = document.querySelectorAll(
      "script, style, noscript, iframe, footer, header, nav, .advertisement, .sidebar, .menu"
    );
    unwantedElements.forEach((element) => element.remove());

    // Use Readability to extract article content
    const reader = new Readability(document);
    const article = reader.parse();

    if (!article) {
      return null;
    }

    return {
      title: article.title || "",
      content: article.content || "",
      textContent: article.textContent || "",
      length: article.length || 0,
      excerpt: article.excerpt || "",
      byline: article.byline || "",
      dir: article.dir || "",
      siteName: article.siteName || "",
      lang: article.lang || "",
    };
  } catch (error) {
    console.error("Error extracting article content:", error.message);
    return null;
  }
}
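Two details are easy to miss here: the url option is what lets Readability resolve relative links and image paths in the extracted content, and Readability modifies the document it is given, so create a fresh JSDOM instance per extraction rather than reusing one.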
Want to try it without coding? Use the Readability tool to extract main content from any HTML. If you're extracting content from third-party sites, read: Web Scraping Ethics: A Complete Guide to Responsible Data Collection.
Removing unwanted HTML elements
Before using Readability, it's often helpful to clean up the HTML by removing elements that are definitely not article content. This improves extraction accuracy and reduces false positives.
Here's how to remove common unwanted elements using a simple cleaning function:
import { JSDOM } from "jsdom";

function cleanHtml(
  html,
  unwantedTags = "script, style, noscript, iframe, img, footer, header, nav, head"
) {
  const dom = new JSDOM(html);
  const document = dom.window.document;

  // Remove unwanted elements
  const elementsToRemove = document.querySelectorAll(unwantedTags);
  elementsToRemove.forEach((element) => element.remove());

  return dom.serialize();
}
This removes:
- Scripts and styles: JavaScript code and CSS that aren't content
- Navigation elements: Headers, footers, and navigation menus
- Media: Images and iframes that might interfere with text extraction
- Metadata: Head elements and other non-visible content
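Here's a minimal sketch of the two steps working together, using the cleanHtml and extractArticleContent functions defined above (the sample HTML and URL are just placeholders). Note that because the default unwantedTags list includes img, this variant produces text-only output; drop img from the list if you want images preserved in the extracted HTML.

const rawHtml = `<html><body>
  <nav>Home | About</nav>
  <article><h1>Hello</h1><p>Some article text goes here...</p></article>
  <footer>Copyright</footer>
</body></html>`;

const cleaned = cleanHtml(rawHtml);
const article = extractArticleContent("https://example.com/post", cleaned);
console.log(article ? article.textContent : "No article found");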
Using the extraction function
Here's how to use the function with a complete example:
// Node 18+ ships a global fetch; the node-fetch import is only needed on older versions
import fetch from "node-fetch";

async function scrapeArticleContent(url) {
  try {
    // Fetch the webpage
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const html = await response.text();

    // Extract article content
    const articleContent = extractArticleContent(url, html);

    if (articleContent) {
      console.log("Title:", articleContent.title);
      console.log("Author:", articleContent.byline);
      console.log("Content length:", articleContent.length);
      console.log("Excerpt:", articleContent.excerpt);
      console.log(
        "\nArticle content:\n",
        articleContent.textContent.substring(0, 500) + "..."
      );
    } else {
      console.log("Could not extract article content from this page");
    }
  } catch (error) {
    console.error("Error scraping content:", error.message);
  }
}

// Example usage
scrapeArticleContent("https://example-blog.com/article");
What Readability extracts
The Readability parser returns several useful properties:
- title: The article's main title
- content: Clean HTML content without navigation and ads
- textContent: Plain text version of the article content
- length: Character count of the article content
- excerpt: Short summary or first few sentences
- byline: Author information if found
- dir: Text direction (ltr/rtl)
- siteName: Name of the website
- lang: Language of the content
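As a small illustration of consuming these properties, here's a hypothetical saveArticle helper (the function name and file layout are just one possible choice) that persists the clean HTML, the plain text, and the metadata using Node's built-in fs module:

import { writeFile } from "node:fs/promises";

async function saveArticle(article, basename) {
  // Clean HTML and plain-text versions of the article body
  await writeFile(`${basename}.html`, article.content, "utf8");
  await writeFile(`${basename}.txt`, article.textContent, "utf8");

  // Metadata kept alongside for later search or indexing
  const meta = {
    title: article.title,
    byline: article.byline,
    excerpt: article.excerpt,
    siteName: article.siteName,
    lang: article.lang,
  };
  await writeFile(`${basename}.json`, JSON.stringify(meta, null, 2), "utf8");
}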
Handling edge cases
Not all pages work perfectly with Readability. Here's a wrapper that falls back to common content selectors when extraction fails or returns too little text:
function extractArticleContentRobust(url, html) {
  const result = extractArticleContent(url, html);

  // Fallback if Readability fails or returns very little text
  if (!result || result.length < 100) {
    console.log("Readability extraction failed, trying fallback...");

    // Simple fallback: extract text from common article content areas
    const dom = new JSDOM(html);
    const document = dom.window.document;
    const contentSelectors = [
      "article",
      "main",
      '[role="main"]',
      ".post",
      ".article-body",
      ".content",
    ];

    for (const selector of contentSelectors) {
      const element = document.querySelector(selector);
      if (element && element.textContent.length > 100) {
        return {
          title: document.title || "",
          textContent: element.textContent.trim(),
          content: element.innerHTML,
          length: element.textContent.length,
        };
      }
    }
  }

  return result;
}
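Another useful guard: the @mozilla/readability package also exports isProbablyReaderable, a fast heuristic that estimates whether a document contains an extractable article before you pay for a full parse. A minimal sketch (the looksLikeArticle wrapper is just for illustration):

import { isProbablyReaderable } from "@mozilla/readability";

function looksLikeArticle(url, html) {
  const dom = new JSDOM(html, { url });
  // Cheap heuristic check; avoids running the full parser on
  // pages that clearly aren't articles (home pages, listings, etc.)
  return isProbablyReaderable(dom.window.document);
}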
WebCrawlerAPI's main_content_only parameter
If you're looking for a ready-made solution that handles all the complexity of article content extraction, WebCrawlerAPI provides a simple parameter that does this automatically.
Just add main_content_only=true to your API request:
const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    main_content_only: true,
    scrape_type: "markdown",
  }),
});
This automatically extracts only the main article content using advanced algorithms, saving you from having to implement and maintain the extraction logic yourself.
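The shape of the response body depends on the request options and is described in the documentation linked below; a format-agnostic way to inspect what comes back is simply:

console.log(await response.text());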
Learn more about the main_content_only parameter in the WebCrawlerAPI documentation.
