Table of Contents
- How to Convert HTML to Clean Markdown in JavaScript
- Why You Need HTML to Markdown Conversion
- The Naive Approach and Why It Fails
- The Unified Pipeline Solution
- Install the Dependencies
- The Full Solution
- Cleaning Up Unwanted Tags with Cheerio
- Why rehype-minify-whitespace Is the Critical Step
- Handling Edge Cases
- Why This Matters for Real-World Scraping
- Use WebCrawlerAPI If You Don't Want to Maintain This
- Summary
How to Convert HTML to Clean Markdown in JavaScript
I'm Andrew. I build WebCrawlerAPI, and one thing I deal with constantly is HTML to Markdown conversion. Not simple clean HTML. Real-world HTML: pretty-printed, full of whitespace, scripts, navbars, cookie banners, and broken link text.
If you're building an LLM pipeline, a RAG system, or any kind of web scraper, at some point you need to turn HTML into Markdown. Markdown compresses well, it strips most of the noise, and LLMs handle it better than raw HTML.
The problem is that most converters give you garbage output when the HTML isn't perfectly formatted. This post shows you why that happens and how to fix it properly.
Why You Need HTML to Markdown Conversion
When you feed raw HTML into an LLM, you're wasting tokens. You're sending <div class="wrapper">, <span aria-hidden="true">, nav menus, footer links, cookie notices - all of it. The model has to work through all that noise to find the actual content.
Markdown strips that down to what matters: headings, paragraphs, lists, links, code blocks. It's readable, compact, and easy for models to process.
This matters for:
- RAG pipelines - when you're embedding web content into a vector store
- LLM context - when you need to pass page content to a model without blowing your token budget
- Web scrapers - when you need structured content, not HTML soup
- Documentation tools - when you're ingesting third-party docs
The Naive Approach and Why It Fails
The most popular library for this is turndown. It's simple, and it works for clean HTML. You install it and do something like this:
import TurndownService from 'turndown';
const turndown = new TurndownService();
const markdown = turndown.turndown(html);
For simple HTML it's fine. But real-world HTML is never simple.
Here's a common pattern you'll see in pretty-printed HTML:
<a href="https://example.com">
  OpenClaw
</a>
Looks harmless. But run it through turndown and you get:
[ OpenClaw ](https://example.com)
That's a broken link. The newlines inside the <a> tag become part of the link text. Any Markdown renderer will either break on it or produce something ugly. And this isn't an edge case - pretty-printed HTML is everywhere. Most CMS systems, most documentation generators, most web frameworks produce exactly this kind of output.
The same problem happens with inline elements and paragraph text. Whitespace-heavy HTML creates messy Markdown with extra blank lines, broken formatting, and weird indentation.
You can try to fix it with regex after the fact. I've been there. It's a bad idea. You end up playing whack-a-mole with edge cases forever.
The right fix is to clean the whitespace at the HTML level, before conversion.
The Unified Pipeline Solution
The unified ecosystem is a JavaScript toolkit for processing content. It works as a pipeline: parse input, transform it through plugins, stringify output. For HTML to Markdown, the pipeline looks like this:
- rehype-parse - parse HTML into an AST (Abstract Syntax Tree)
- rehype-minify-whitespace - collapse whitespace in text nodes
- rehype-remark - convert the HTML AST to a Markdown AST
- remark-stringify - serialize the Markdown AST to a string
The key plugin here is rehype-minify-whitespace. It runs before the conversion and collapses all the extra whitespace inside text nodes. So \n OpenClaw\n becomes OpenClaw. The link text is clean before it ever gets converted to Markdown.
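To make the effect concrete, here is a simplified sketch in plain JavaScript of what that collapsing step does to a single text node. This is an illustration only; the real plugin works on the AST and is context-aware, distinguishing inline from block elements and leaving whitespace-sensitive content like <pre> alone:

```javascript
// Simplified illustration of what rehype-minify-whitespace does to a
// text node: collapse every run of whitespace to one space, then trim.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const messyLinkText = '\n        OpenClaw\n      ';
console.log(collapseWhitespace(messyLinkText)); // 'OpenClaw'
```

The real plugin applies this idea per text node with knowledge of the surrounding elements, which is why it is safe to run on a full document.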
Install the Dependencies
npm install unified rehype-parse rehype-remark remark-stringify rehype-minify-whitespace cheerio
Or with pnpm:
pnpm add unified rehype-parse rehype-remark remark-stringify rehype-minify-whitespace cheerio
These are all ESM packages. Your package.json needs "type": "module", or your files need to use the .mjs extension.
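A minimal package.json for an ESM project might look like the following. The version ranges here are illustrative assumptions, not pinned recommendations; check npm for the current majors:

```json
{
  "type": "module",
  "dependencies": {
    "unified": "^11.0.0",
    "rehype-parse": "^9.0.0",
    "rehype-remark": "^10.0.0",
    "remark-stringify": "^11.0.0",
    "rehype-minify-whitespace": "^6.0.0",
    "cheerio": "^1.0.0"
  }
}
```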
The Full Solution
import { unified } from 'unified';
import rehypeParse from 'rehype-parse';
import rehypeRemark from 'rehype-remark';
import remarkStringify from 'remark-stringify';
import rehypeMinifyWhitespace from 'rehype-minify-whitespace';
import * as cheerio from 'cheerio';
async function htmlToMarkdown(html, cleanTags = 'script, style, noscript, iframe, img, footer, header, nav, head') {
  // Step 1: remove tags we don't want in the output
  const $ = cheerio.load(html);
  $(cleanTags).remove();
  const cleanedHtml = $('body').html();
  if (!cleanedHtml) return '';

  // Step 2: run through the unified pipeline
  const result = await unified()
    .use(rehypeParse, { fragment: true }) // the cleaned HTML is a body fragment
    .use(rehypeMinifyWhitespace) // <-- this is the critical step
    .use(rehypeRemark)
    .use(remarkStringify)
    .process(cleanedHtml);

  return String(result).trim();
}
// Example usage
const html = `
<html>
  <head><title>My Page</title></head>
  <body>
    <nav><a href="/">Home</a></nav>
    <article>
      <h1>Hello World</h1>
      <p>This is a <a href="https://example.com">
        link with messy whitespace
      </a> inside a paragraph.</p>
      <ul>
        <li>Item one</li>
        <li>Item two</li>
      </ul>
    </article>
    <footer>Copyright 2026</footer>
  </body>
</html>
`;
const markdown = await htmlToMarkdown(html);
console.log(markdown);
Output:
# Hello World
This is a [link with messy whitespace](https://example.com) inside a paragraph.
- Item one
- Item two
Clean link text, clean structure, no nav, no footer. That's exactly what you want for LLM input.
Cleaning Up Unwanted Tags with Cheerio
Before conversion, you need to strip the parts of the page you don't want. That's where cheerio comes in. It's a server-side jQuery-like library for manipulating HTML.
The default tag list I use removes the most common noise:
script, style, noscript, iframe, img, footer, header, nav, head
- script and style - obvious, you never want these
- noscript and iframe - usually third-party content, tracking, embeds
- img - images convert to ![alt](url) Markdown, which is usually noise in LLM context. Remove unless you specifically need them
- footer, header, nav - site chrome, not content
You can customize this list depending on your use case. If you're scraping articles, you might also want to remove .sidebar, .comments, .related-posts and other structural elements. Cheerio accepts any CSS selector:
// Remove more aggressively for article extraction
const aggressiveClean = 'script, style, noscript, iframe, img, footer, header, nav, head, aside, .sidebar, .comments, form, button';
const markdown = await htmlToMarkdown(html, aggressiveClean);
One thing to be careful about: cheerio.load(html) wraps content in <html><body> if it isn't already wrapped. That's why I use $('body').html() to get just the body content back. If you use $.html() you'll get the full document including the <html> wrapper, which still converts but can leave extra artifacts in the Markdown.
Why rehype-minify-whitespace Is the Critical Step
This is worth explaining clearly because it's the non-obvious part.
HTML text nodes preserve whitespace as written in the source. So if a developer formatted their HTML like this:
<a href="/page">
  Link Text
</a>
The text node inside that <a> is \n Link Text\n, not Link Text.
When rehype-remark converts this to Markdown, it takes the text node as-is. So you get [\n Link Text\n](/page) which is an invalid Markdown link.
rehype-minify-whitespace runs a pass over the AST before conversion and collapses all whitespace sequences to a single space, trims leading/trailing whitespace from text nodes, and handles inline vs block context correctly. After it runs, that same text node becomes Link Text, and the converted Markdown link is [Link Text](/page).
Without this plugin, you need post-processing regex to fix up link text, and regex is fragile. With it, the problem is solved at the right layer.
Handling Edge Cases
A few things worth noting from real-world use:
Tables: rehype-remark converts HTML tables to Markdown tables by default. This mostly works but can break on complex tables with merged cells. If you hit this, you can configure rehype-remark to stringify tables as HTML, or just strip table elements before conversion if you don't need them.
Code blocks: <pre><code> converts cleanly to fenced code blocks. If the original HTML has language hints via a class like class="language-javascript", rehype-remark will carry that through to the fenced code block.
Large pages: The pipeline is async and memory-efficient for most pages. But if you're processing very large documents (1MB+ of HTML), consider streaming or chunking. For typical web pages this is not an issue.
Character encoding: Make sure your HTML string is already decoded to UTF-8 before passing it in. If you're fetching HTML with Node's fetch, response.text() handles decoding. If you're reading files, use { encoding: 'utf-8' }.
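For example, when reading HTML from disk, pass the encoding explicitly so you get a string rather than a Buffer. The file path below is illustrative:

```javascript
import { writeFileSync, readFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Write a page containing non-ASCII characters, then read it back.
const file = join(tmpdir(), 'sample.html');
writeFileSync(file, '<p>café and naïve</p>', 'utf-8');

// Without { encoding: 'utf-8' } this would return a Buffer, not a string.
const html = readFileSync(file, { encoding: 'utf-8' });
console.log(html.includes('café')); // true
```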
Why This Matters for Real-World Scraping
If you're building anything with LLMs that touches the web, HTML to Markdown conversion is not optional. It's the difference between feeding your model useful content and feeding it noise.
The pipeline approach scales well:
- It's pure JavaScript, no binary dependencies
- It runs in serverless environments without issues
- Processing time is fast, typically under 50ms for a normal page
- The output is consistent and predictable
I use a version of this exact pipeline inside WebCrawlerAPI when customers request Markdown output format. The whitespace normalization step was something I added after seeing broken link text in production. It looked like a small issue until I realized it was affecting every page crawled from any CMS or documentation site.
Use WebCrawlerAPI If You Don't Want to Maintain This
If you're building a product and you just need clean Markdown from URLs at scale, you don't have to manage this yourself.
WebCrawlerAPI handles the whole pipeline: fetching with JavaScript rendering, cleaning, and converting to Markdown. You send a URL, you get back clean Markdown ready for your LLM or vector store. It handles anti-bot, retries, encoding issues, and all the edge cases that come up in production.
The API returns clean Markdown with a single call:
const { WebcrawlerClient } = require('webcrawlerapi-js');
const client = new WebcrawlerClient('your-api-key');
const result = await client.scrape({
  url: 'https://example.com/article',
});
console.log(result.markdown);
That's the same result as the pipeline above, but without managing dependencies, hosting, or edge cases.
Summary
Converting HTML to Markdown in JavaScript is simple on the surface and annoying in practice. The core issue is whitespace in HTML text nodes - pretty-printed HTML produces broken link text and messy formatting when converted naively.
The solution:
- Strip noise with cheerio before conversion
- Use the unified pipeline: rehypeParse -> rehypeMinifyWhitespace -> rehypeRemark -> remarkStringify
- rehype-minify-whitespace is the critical step - it fixes whitespace at the AST level before conversion
This gives you clean, consistent Markdown output from any real-world HTML page.