Complete article content extraction code
If you want to understand how it works inside (scoring, candidates, cleanup), read: Mozilla Readability Algorithm (Readability.js), Step by Step. If you want the Rust alternative with extra policy and candidate-selection controls, read: How dom_smoozie Rust Mozilla Readability alternative works.
Here's a standalone JavaScript function that combines HTML cleaning with Mozilla's Readability parser:
// npm install @mozilla/readability jsdom
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

function extractArticleContent(url, html) {
  try {
    // Create a JSDOM document from the HTML
    const dom = new JSDOM(html, {
      url: url,
      contentType: "text/html",
    });
    const document = dom.window.document;

    // Optional: clean unwanted elements first
    const unwantedElements = document.querySelectorAll(
      "script, style, noscript, iframe, footer, header, nav, .advertisement, .sidebar, .menu"
    );
    unwantedElements.forEach((element) => element.remove());

    // Use Readability to extract article content
    const reader = new Readability(document);
    const article = reader.parse();

    if (!article) {
      return null;
    }

    return {
      title: article.title || "",
      content: article.content || "",
      textContent: article.textContent || "",
      length: article.length || 0,
      excerpt: article.excerpt || "",
      byline: article.byline || "",
      dir: article.dir || "",
      siteName: article.siteName || "",
      lang: article.lang || "",
    };
  } catch (error) {
    console.error("Error extracting article content:", error.message);
    return null;
  }
}
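Two details are easy to miss here: the url option is what lets Readability resolve relative links and image paths in the extracted content, and Readability modifies the document it is given, so create a fresh JSDOM instance per extraction rather than reusing one.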
Want to try it without coding? Use the Readability tool to extract main content from any HTML. If you're extracting content from third-party sites, read: Web Scraping Ethics: A Complete Guide to Responsible Data Collection.
Removing unwanted HTML elements
Before using Readability, it's often helpful to clean up the HTML by removing elements that are definitely not article content. This improves extraction accuracy and reduces false positives.
Here's how to remove common unwanted elements using a simple cleaning function:
import { JSDOM } from "jsdom";

function cleanHtml(
  html,
  unwantedTags = "script, style, noscript, iframe, img, footer, header, nav, head"
) {
  const dom = new JSDOM(html);
  const document = dom.window.document;

  // Remove unwanted elements
  const elementsToRemove = document.querySelectorAll(unwantedTags);
  elementsToRemove.forEach((element) => element.remove());

  return dom.serialize();
}
This removes:
- Scripts and styles: JavaScript code and CSS that aren't content
- Navigation elements: Headers, footers, and navigation menus
- Media: Images and iframes that might interfere with text extraction
- Metadata: Head elements and other non-visible content
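Here's a minimal sketch of the two steps working together, using the cleanHtml and extractArticleContent functions defined above (the sample HTML and URL are just placeholders). Note that because the default unwantedTags list includes img, this variant produces text-only output; drop img from the list if you want images preserved in the extracted HTML.

const rawHtml = `<html><body>
  <nav>Home | About</nav>
  <article><h1>Hello</h1><p>Some article text goes here...</p></article>
  <footer>Copyright</footer>
</body></html>`;

const cleaned = cleanHtml(rawHtml);
const article = extractArticleContent("https://example.com/post", cleaned);
console.log(article ? article.textContent : "No article found");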
Using the extraction function
Here's how to use the function with a complete example:
// Node 18+ ships a global fetch; the node-fetch import is only needed on older versions
import fetch from "node-fetch";

async function scrapeArticleContent(url) {
  try {
    // Fetch the webpage
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const html = await response.text();

    // Extract article content
    const articleContent = extractArticleContent(url, html);

    if (articleContent) {
      console.log("Title:", articleContent.title);
      console.log("Author:", articleContent.byline);
      console.log("Content length:", articleContent.length);
      console.log("Excerpt:", articleContent.excerpt);
      console.log(
        "\nArticle content:\n",
        articleContent.textContent.substring(0, 500) + "..."
      );
    } else {
      console.log("Could not extract article content from this page");
    }
  } catch (error) {
    console.error("Error scraping content:", error.message);
  }
}

// Example usage
scrapeArticleContent("https://example-blog.com/article");
What Readability extracts
The Readability parser returns several useful properties:
- title: The article's main title
- content: Clean HTML content without navigation and ads
- textContent: Plain text version of the article content
- length: Character count of the article content
- excerpt: Short summary or first few sentences
- byline: Author information if found
- dir: Text direction (ltr/rtl)
- siteName: Name of the website
- lang: Language of the content
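As a small illustration of consuming these properties, here's a hypothetical saveArticle helper (the function name and file layout are just one possible choice) that persists the clean HTML, the plain text, and the metadata using Node's built-in fs module:

import { writeFile } from "node:fs/promises";

async function saveArticle(article, basename) {
  // Clean HTML and plain-text versions of the article body
  await writeFile(`${basename}.html`, article.content, "utf8");
  await writeFile(`${basename}.txt`, article.textContent, "utf8");

  // Metadata kept alongside for later search or indexing
  const meta = {
    title: article.title,
    byline: article.byline,
    excerpt: article.excerpt,
    siteName: article.siteName,
    lang: article.lang,
  };
  await writeFile(`${basename}.json`, JSON.stringify(meta, null, 2), "utf8");
}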
Handling edge cases
Not all pages work perfectly with Readability. Here's a wrapper that falls back to common content selectors when extraction fails or returns too little text:
function extractArticleContentRobust(url, html) {
  const result = extractArticleContent(url, html);

  // Fallback if Readability fails or returns very little text
  if (!result || result.length < 100) {
    console.log("Readability extraction failed, trying fallback...");

    // Simple fallback: extract text from common article content areas
    const dom = new JSDOM(html);
    const document = dom.window.document;
    const contentSelectors = [
      "article",
      "main",
      '[role="main"]',
      ".post",
      ".article-body",
      ".content",
    ];

    for (const selector of contentSelectors) {
      const element = document.querySelector(selector);
      if (element && element.textContent.length > 100) {
        return {
          title: document.title || "",
          textContent: element.textContent.trim(),
          content: element.innerHTML,
          length: element.textContent.length,
        };
      }
    }
  }

  return result;
}
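Another useful guard: the @mozilla/readability package also exports isProbablyReaderable, a fast heuristic that estimates whether a document contains an extractable article before you pay for a full parse. A minimal sketch (the looksLikeArticle wrapper is just for illustration):

import { isProbablyReaderable } from "@mozilla/readability";

function looksLikeArticle(url, html) {
  const dom = new JSDOM(html, { url });
  // Cheap heuristic check; avoids running the full parser on
  // pages that clearly aren't articles (home pages, listings, etc.)
  return isProbablyReaderable(dom.window.document);
}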
WebCrawlerAPI's main_content_only parameter
If you're looking for a ready-made solution that handles all the complexity of article content extraction, WebCrawlerAPI provides a simple parameter that does this automatically.
Just add main_content_only=true to your API request:
const response = await fetch("https://api.webcrawlerapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    main_content_only: true,
    scrape_type: "markdown",
  }),
});
This automatically extracts only the main article content using advanced algorithms, saving you from having to implement and maintain the extraction logic yourself.
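The shape of the response body depends on the request options and is described in the documentation linked below; a format-agnostic way to inspect what comes back is simply:

console.log(await response.text());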
Learn more about the main_content_only parameter in the WebCrawlerAPI documentation.
