    Extracting article or blogpost content with Mozilla Readability

    Extract clean article content from any web page using Mozilla's Readability library—the same algorithm that powers Firefox Reader View. Complete JavaScript code examples with HTML cleaning and error handling.

    Written by Andrew
    Published on Feb 7, 2026

    Table of Contents

    • Complete article content extraction code
    • Removing unwanted HTML elements
    • Using the extraction function
    • What Readability extracts
    • Handling edge cases
    • WebCrawlerAPI's main_content_only parameter

    Complete article content extraction code

    If you want to understand how it works inside (scoring, candidates, cleanup), read: Mozilla Readability Algorithm (Readability.js), Step by Step. If you want the Rust alternative with extra policy and candidate-selection controls, read: How dom_smoozie Rust Mozilla Readability alternative works.

    Here's a standalone JavaScript function that combines HTML cleaning with Mozilla's Readability parser:

    // npm install @mozilla/readability jsdom
    
    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";
    
    function extractArticleContent(url, html) {
      try {
        // Create a JSDOM document from the HTML
        const dom = new JSDOM(html, {
          url: url,
          contentType: "text/html",
        });
    
        const document = dom.window.document;
    
        // Optional: Clean unwanted elements first
        const unwantedElements = document.querySelectorAll(
          "script, style, noscript, iframe, footer, header, nav, .advertisement, .sidebar, .menu"
        );
        unwantedElements.forEach((element) => element.remove());
    
        // Use Readability to extract article content
        const reader = new Readability(document);
        const article = reader.parse();
    
        if (!article) {
          return null;
        }
    
        return {
          title: article.title || "",
          content: article.content || "",
          textContent: article.textContent || "",
          length: article.length || 0,
          excerpt: article.excerpt || "",
          byline: article.byline || "",
          dir: article.dir || "",
          siteName: article.siteName || "",
          lang: article.lang || "",
        };
      } catch (error) {
        console.error("Error extracting article content:", error.message);
        return null;
      }
    }
    

    Want to try it without coding? Use the Readability tool to extract main content from any HTML. If you're extracting content from third-party sites, read: Web Scraping Ethics: A Complete Guide to Responsible Data Collection.

    Removing unwanted HTML elements

    Before using Readability, it's often helpful to clean up the HTML by removing elements that are definitely not article content. This improves extraction accuracy and reduces false positives.

    Here's how to remove common unwanted elements using a simple cleaning function:

    import { JSDOM } from "jsdom";
    
    function cleanHtml(
      html,
      unwantedTags = "script, style, noscript, iframe, img, footer, header, nav, head"
    ) {
      const dom = new JSDOM(html);
      const document = dom.window.document;
    
      // Remove unwanted elements
      const elementsToRemove = document.querySelectorAll(unwantedTags);
      elementsToRemove.forEach((element) => element.remove());
    
      return dom.serialize();
    }
    

    This removes:

    • Scripts and styles: JavaScript code and CSS that aren't content
    • Navigation elements: Headers, footers, and navigation menus
    • Media: Images and iframes that might interfere with text extraction
    • Metadata: Head elements and other non-visible content
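    The selector string passed to cleanHtml can get unwieldy as a single literal. One way to keep it configurable is to build it from separate tag and class-name lists (an illustrative helper, not part of Readability or jsdom):

```javascript
// Illustrative helper: build the selector string for cleanHtml from
// separate lists of tag names and class names, so the unwanted-element
// list stays easy to extend per site.
function buildUnwantedSelector(tags = [], classNames = []) {
  return [...tags, ...classNames.map((name) => `.${name}`)].join(", ");
}

const selector = buildUnwantedSelector(
  ["script", "style", "noscript", "iframe"],
  ["advertisement", "sidebar"]
);
// "script, style, noscript, iframe, .advertisement, .sidebar"
```

    You can then call cleanHtml(html, selector) with a per-site list instead of editing the default string.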

    Using the extraction function

    Here's how to use the function with a complete example:

    import fetch from "node-fetch";
    
    async function scrapeArticleContent(url) {
      try {
        // Fetch the webpage
        const response = await fetch(url);
        const html = await response.text();
    
        // Extract article content
        const articleContent = extractArticleContent(url, html);
    
        if (articleContent) {
          console.log("Title:", articleContent.title);
          console.log("Author:", articleContent.byline);
          console.log("Content length:", articleContent.length);
          console.log("Excerpt:", articleContent.excerpt);
          console.log(
            "\nArticle content:\n",
            articleContent.textContent.substring(0, 500) + "..."
          );
        } else {
          console.log("Could not extract article content from this page");
        }
      } catch (error) {
        console.error("Error scraping content:", error.message);
      }
    }
    
    // Example usage
    scrapeArticleContent("https://example-blog.com/article");
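    One small wart in the example above: substring(0, 500) can cut a word in half. A tiny helper (illustrative, not part of Readability) that trims at the nearest word boundary instead:

```javascript
// Truncate plain text at a word boundary rather than mid-word.
function excerptAt(text, maxLength = 500) {
  if (text.length <= maxLength) return text;
  const cut = text.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(" ");
  // Fall back to a hard cut if the text has no spaces at all.
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "...";
}

excerptAt("hello world foo", 11); // "hello..."
```

    Swap it in for the substring call when logging or previewing articleContent.textContent.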
    

    What Readability extracts

    The Readability parser returns several useful properties:

    • title: The article's main title
    • content: Clean HTML content without navigation and ads
    • textContent: Plain text version of the article content
    • length: Character count of the article content
    • excerpt: Short summary or first few sentences
    • byline: Author information if found
    • dir: Text direction (ltr/rtl)
    • siteName: Name of the website
    • lang: Language of the content
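    These properties compose well into derived metadata. As a sketch, here is a post-processing function that computes a word count and an estimated reading time from the result object returned by extractArticleContent above (the 200 words-per-minute figure is a rough, commonly used reading-speed estimate, not something Readability provides):

```javascript
// Derive summary metadata from a Readability extraction result.
// Assumes the object shape returned by extractArticleContent above.
function summarizeArticle(article) {
  const words = article.textContent.trim().split(/\s+/).filter(Boolean);
  return {
    title: article.title,
    wordCount: words.length,
    // ~200 words per minute is a rough reading-speed assumption.
    readingTimeMinutes: Math.max(1, Math.ceil(words.length / 200)),
  };
}
```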

    Handling edge cases

    Not all pages work perfectly with Readability. A useful pattern is to fall back to common content selectors when extraction fails or returns too little text:

    function extractArticleContentRobust(url, html) {
      const result = extractArticleContent(url, html);
    
      // Fallback if Readability fails
      if (!result || result.length < 100) {
        console.log("Readability extraction failed, trying fallback...");
    
        // Simple fallback: extract text from article content areas
        const dom = new JSDOM(html);
        const document = dom.window.document;
    
        const contentSelectors = [
          "article",
          "main",
          '[role="main"]',
          ".post",
          ".article-body",
          ".content",
        ];
    
        for (const selector of contentSelectors) {
          const element = document.querySelector(selector);
          if (element && element.textContent.length > 100) {
            return {
              title: document.title || "",
              textContent: element.textContent.trim(),
              content: element.innerHTML,
              length: element.textContent.length,
            };
          }
        }
      }
    
      return result;
    }
    

    WebCrawlerAPI's main_content_only parameter

    If you're looking for a ready-made solution that handles all the complexity of article content extraction, WebCrawlerAPI provides a simple parameter that does this automatically.

    Just add main_content_only=true to your API request:

    const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
      method: "POST",
      headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        url: "https://example.com/article",
        main_content_only: true,
        scrape_type: "markdown",
      }),
    });
    

    This automatically extracts only the main article content, saving you from implementing and maintaining the extraction logic yourself.

    Learn more about the main_content_only parameter in the WebCrawlerAPI documentation.