Extracting article or blogpost content with Mozilla Readability

Extract clean article content from any web page using Mozilla's Readability library—the same algorithm that powers Firefox Reader View. Complete JavaScript code examples with HTML cleaning and error handling.

Written by Andrii

Complete article content extraction code

If you want to understand how it works inside (scoring, candidates, cleanup), read: Mozilla Readability Algorithm (Readability.js), Step by Step. If you want the Rust alternative with extra policy and candidate-selection controls, read: How dom_smoozie Rust Mozilla Readability alternative works.

Here's a standalone JavaScript function that combines HTML cleaning with Mozilla's Readability parser:

// npm install @mozilla/readability jsdom

import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

function extractArticleContent(url, html) {
  try {
    // Create a JSDOM document from the HTML
    const dom = new JSDOM(html, {
      url: url,
      contentType: "text/html",
    });

    const document = dom.window.document;

    // Optional: Clean unwanted elements first
    const unwantedElements = document.querySelectorAll(
      "script, style, noscript, iframe, footer, header, nav, .advertisement, .sidebar, .menu"
    );
    unwantedElements.forEach((element) => element.remove());

    // Use Readability to extract article content
    const reader = new Readability(document);
    const article = reader.parse();

    if (!article) {
      return null;
    }

    return {
      title: article.title || "",
      content: article.content || "",
      textContent: article.textContent || "",
      length: article.length || 0,
      excerpt: article.excerpt || "",
      byline: article.byline || "",
      dir: article.dir || "",
      siteName: article.siteName || "",
      lang: article.lang || "",
    };
  } catch (error) {
    console.error("Error extracting article content:", error.message);
    return null;
  }
}

Want to try it without coding? Use the Readability tool to extract main content from any HTML. If you're extracting content from third-party sites, read: Web Scraping Ethics: A Complete Guide to Responsible Data Collection.

Removing unwanted HTML elements

Before running Readability, it often helps to clean up the HTML by removing elements that are definitely not article content. This can improve extraction accuracy and reduce false positives.

Here's how to remove common unwanted elements using a simple cleaning function:

import { JSDOM } from "jsdom";

function cleanHtml(
  html,
  unwantedTags = "script, style, noscript, iframe, img, footer, header, nav, head"
) {
  const dom = new JSDOM(html);
  const document = dom.window.document;

  // Remove unwanted elements
  const elementsToRemove = document.querySelectorAll(unwantedTags);
  elementsToRemove.forEach((element) => element.remove());

  return dom.serialize();
}

This removes:

  • Scripts and styles: JavaScript code and CSS that aren't content
  • Navigation elements: Headers, footers, and navigation menus
  • Media: Images and iframes that might interfere with text extraction
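  • Metadata: The head element and other non-visible content. Readability reads the page title and meta tags from head, so drop head from the list if you clean the HTML before parsing and need the title, excerpt, or byline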
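
Which entries are safe to remove depends on what you do with the cleaned HTML afterwards, so it can be convenient to build the selector string from options instead of hard-coding it. The following is a minimal sketch: buildUnwantedSelector and its keepHead, keepImages, and extra options are hypothetical names introduced for illustration, and keeping head and images by default is an assumption, not the behavior of cleanHtml above.

```javascript
// Hypothetical helper: builds the selector string given to querySelectorAll.
// Tags that are rarely part of article content.
const BASE_TAGS = ["script", "style", "noscript", "iframe", "footer", "header", "nav"];

function buildUnwantedSelector({ keepHead = true, keepImages = true, extra = [] } = {}) {
  const selectors = [...BASE_TAGS];

  // Assumption: head and img are kept unless the caller opts out.
  if (!keepHead) selectors.push("head");
  if (!keepImages) selectors.push("img");

  // Add caller-supplied selectors, skipping blanks and duplicates.
  for (const selector of extra) {
    const trimmed = String(selector).trim();
    if (trimmed && !selectors.includes(trimmed)) {
      selectors.push(trimmed);
    }
  }

  return selectors.join(", ");
}

console.log(buildUnwantedSelector());
// prints "script, style, noscript, iframe, footer, header, nav"
console.log(buildUnwantedSelector({ keepImages: false, extra: [".sidebar", " .menu "] }));
// prints "script, style, noscript, iframe, footer, header, nav, img, .sidebar, .menu"
```

The returned string can be passed as the unwantedTags argument of cleanHtml.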

Using the extraction function

Here's how to use the function with a complete example:

import fetch from "node-fetch";

async function scrapeArticleContent(url) {
  try {
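    // Fetch the webpage and stop early on HTTP errors (404, 500, ...)
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const html = await response.text();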

    // Extract article content
    const articleContent = extractArticleContent(url, html);

    if (articleContent) {
      console.log("Title:", articleContent.title);
      console.log("Author:", articleContent.byline);
      console.log("Content length:", articleContent.length);
      console.log("Excerpt:", articleContent.excerpt);
      console.log(
        "\nArticle content:\n",
        articleContent.textContent.substring(0, 500) + "..."
      );
    } else {
      console.log("Could not extract article content from this page");
    }
  } catch (error) {
    console.error("Error scraping content:", error.message);
  }
}

// Example usage
scrapeArticleContent("https://example-blog.com/article");
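
scrapeArticleContent passes whatever string it receives straight to fetch. A small guard can reject malformed or non-HTTP URLs before any request is made. The sketch below uses the standard URL class; normalizeArticleUrl is a hypothetical helper name, and dropping the fragment (the part after #) is an assumption about which URLs you want to treat as the same page.

```javascript
// Hypothetical helper: returns a normalized http(s) URL string, or null if
// the input is not a usable absolute URL.
function normalizeArticleUrl(input) {
  let parsed;
  try {
    parsed = new URL(String(input).trim());
  } catch {
    return null; // not an absolute URL (for example, the scheme is missing)
  }

  // Only http and https pages can be fetched and parsed as articles.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return null;
  }

  // Assumption: the fragment does not identify a different article.
  parsed.hash = "";
  return parsed.href;
}

console.log(normalizeArticleUrl("https://example-blog.com/article#comments"));
// prints "https://example-blog.com/article"
console.log(normalizeArticleUrl("example-blog.com/article"));
// prints null
```

A caller could skip the request entirely when the helper returns null.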

What Readability extracts

The Readability parser returns several useful properties:

  • title: The article's main title
  • content: Clean HTML content without navigation and ads
  • textContent: Plain text version of the article content
  • length: Character count of the article content
  • excerpt: Short summary or first few sentences
  • byline: Author information if found
  • dir: Text direction (ltr/rtl)
  • siteName: Name of the website
  • lang: Language of the content
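
As an illustration of how these properties might be consumed, the sketch below turns a parse result into a compact record for logging or storage. summarizeArticle, the wordCount and isRtl fields, and the 200-character preview limit are assumptions made for this example; they are not part of Readability.

```javascript
// Hypothetical helper: condenses a Readability-style result object.
function summarizeArticle(article, previewLength = 200) {
  if (!article) return null;

  // Collapse runs of whitespace so counting and slicing behave predictably.
  const text = (article.textContent || "").replace(/\s+/g, " ").trim();
  const wordCount = text ? text.split(" ").length : 0;

  // Prefer the extracted excerpt; otherwise fall back to the start of the text.
  const preview =
    article.excerpt ||
    (text.length > previewLength ? text.slice(0, previewLength) + "..." : text);

  return {
    title: article.title || "(untitled)",
    byline: article.byline || null,
    isRtl: article.dir === "rtl",
    wordCount,
    preview,
  };
}

console.log(
  summarizeArticle({ title: "Hello", textContent: "  one two\n three  ", dir: "rtl" })
);
```

For the sample input, the record has a wordCount of 3, a preview of "one two three", isRtl set to true, and a null byline.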

Handling edge cases

Not every page works well with Readability. One way to get better results is to fall back to common content selectors when extraction fails or returns very little text:

function extractArticleContentRobust(url, html) {
  const result = extractArticleContent(url, html);

  // Fallback if Readability fails
  if (!result || result.length < 100) {
    console.log("Readability extraction failed, trying fallback...");

    // Simple fallback: extract text from article content areas
    const dom = new JSDOM(html);
    const document = dom.window.document;

    const contentSelectors = [
      "article",
      "main",
      '[role="main"]',
      ".post",
      ".article-body",
      ".content",
    ];

    for (const selector of contentSelectors) {
      const element = document.querySelector(selector);
      if (element && element.textContent.length > 100) {
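        const text = element.textContent.trim();
        // Return the same shape as extractArticleContent so callers can
        // treat both results alike; fields the fallback cannot fill stay empty
        return {
          title: document.title || "",
          content: element.innerHTML,
          textContent: text,
          length: text.length,
          excerpt: "",
          byline: "",
          dir: "",
          siteName: "",
          lang: "",
        };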
      }
    }
  }

  return result;
}
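
The fallback above takes its text from element.textContent, which typically carries over the indentation and blank lines of the source HTML. One possible cleanup step, shown here as a sketch, collapses whitespace while keeping paragraph breaks. normalizeText is a hypothetical helper, and treating one or more blank lines as a paragraph break is an assumption.

```javascript
// Hypothetical helper: tidies raw textContent taken straight from the DOM.
function normalizeText(raw) {
  return String(raw ?? "")
    .replace(/\r\n?/g, "\n") // normalize Windows and old Mac line endings
    .split(/\n\s*\n/) // assumption: blank line(s) separate paragraphs
    .map((paragraph) => paragraph.replace(/\s+/g, " ").trim())
    .filter((paragraph) => paragraph.length > 0)
    .join("\n\n");
}

const sample = "  First   line\n   continues\n\n\n   Second\tparagraph  \r\n";
console.log(JSON.stringify(normalizeText(sample)));
// prints "First line continues\n\nSecond paragraph"
```

Applied to the fallback result, this could replace the plain trim() call on textContent.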

WebCrawlerAPI's main_content_only parameter

If you're looking for a ready-made solution that handles all the complexity of article content extraction, WebCrawlerAPI provides a simple parameter that does this automatically.

Set main_content_only to true in the body of your API request:

const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    main_content_only: true,
    scrape_type: "markdown",
  }),
});

The API then returns only the main article content, so you don't have to implement and maintain the extraction logic yourself.

Learn more about the main_content_only parameter in the WebCrawlerAPI documentation.


About the Author

Andrii Mazurian
Andrew Mazurian · @andriixzvf

Founder, WebCrawlerAPI · 🇳🇱 Netherlands

Engineer with 15 years of experience in APIs, big data, and infrastructure. Founded WebCrawlerAPI in 2024 with a single goal: to build the best data API, and have been shipping it every day since.