
    How to Convert Any Website to an RSS Feed

    Need updates from a site you do not control? Create a WebCrawlerAPI feed for any URL, then read changes as JSON Feed or Atom (RSS-style) from simple endpoints.

    Written by Andrew
    Published on Feb 6, 2026

    Table of Contents

    • How to Convert Any Website to an RSS Feed
    • What is being built
    • Choose the right source URL (this matters more than tools)
    • Create the feed (POST /v2/feed)
    • Receive updates as JSON Feed and RSS
    • Scope the crawl so it does not run away
    • Why query params can break everything
    • Pick the content format you actually need
    • Polling vs webhooks
    • What the webhook sends
    • Resending a webhook
    • Common failure modes (and what usually fixes them)
    • Related reading

    How to Convert Any Website to an RSS Feed

    Sometimes you need updates from a website you do not own. A vendor blog. A changelog. A docs page that quietly changes. There is no official feed, but you still want to know about new posts and changelog entries. In practice, that is when you convert the site to an RSS feed so updates can be pulled programmatically via an API.

    WebCrawlerAPI feeds handle this: create a feed with POST https://api.webcrawlerapi.com/v2/feed, then read changes as JSON Feed 1.1 from GET https://api.webcrawlerapi.com/v2/feed/:id/json, or as Atom 1.0 (RSS-style) from GET https://api.webcrawlerapi.com/v2/feed/:id/rss.

    # 1) Create feed
    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed" \
      -H "Authorization: Bearer <YOUR_API_KEY>" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/changelog",
        "name": "Example Changelog",
        "scrape_type": "markdown",
        "items_limit": 10,
        "max_depth": 1,
        "respect_robots_txt": false,
        "main_content_only": true
      }'
    
    # 2) Receive updates as JSON Feed 1.1
    # Content-Type: application/feed+json; charset=utf-8
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/json?page=1&page_size=50" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    
    # 3) Receive updates as Atom 1.0 (RSS-style)
    # Content-Type: application/atom+xml; charset=utf-8
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/rss?page=1&page_size=50" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    

    Tiny response examples:

    {
      "version": "https://jsonfeed.org/version/1.1",
      "title": "WebCrawlerAPI Feed: example.com",
      "items": [
        {
          "id": "item123",
          "url": "https://example.com/changelog/v1-2-3",
          "title": "v1.2.3",
          "summary": "New page discovered",
          "date_modified": "2026-02-06T10:00:00Z",
          "_webcrawlerapi": {
            "change_type": "new",
            "content_url": "https://cdn.webcrawlerapi.com/content/..."
          }
        }
      ]
    }
    
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>WebCrawlerAPI Feed: example.com</title>
      <entry>
        <id>urn:webcrawlerapi:feeditem:item123</id>
        <title>New: v1.2.3</title>
        <updated>2026-02-06T10:00:00Z</updated>
        <link href="https://example.com/changelog/v1-2-3" rel="alternate" />
        <summary type="text">New page discovered</summary>
      </entry>
    </feed>
    

    What is being built

    Once the feed is set up, it produces a stable output that can be plugged into:

    • an RSS reader (Atom endpoint)
    • a Slack bot (webhook)
    • a cron job (poll JSON feed)
    • a database sync (store item IDs and change types)

    This is the practical way to get RSS-style output without relying on the site owner.
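
    For example, the cron-job path can be a few lines of Python. This is a minimal sketch, not an official client: it assumes the JSON Feed endpoint and Bearer auth shown above, uses placeholder environment variables for the API key and feed id, and keeps seen item IDs in a local file so each run only reports what is new.

    # poll_feed.py - minimal cron-driven poller (sketch)
    import json
    import os
    import requests

    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]   # placeholder env var
    FEED_ID = os.environ["FEED_ID"]             # id returned by POST /v2/feed
    SEEN_FILE = "seen_items.json"

    def load_seen() -> set:
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                return set(json.load(f))
        return set()

    resp = requests.get(
        f"https://api.webcrawlerapi.com/v2/feed/{FEED_ID}/json",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()

    seen = load_seen()
    for item in resp.json().get("items", []):
        if item["id"] not in seen:
            print(f'New or changed: {item.get("title")} -> {item.get("url")}')
            seen.add(item["id"])
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)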

    Choose the right source URL (this matters more than tools)

    Most failures are caused by choosing the wrong URL.

    If the goal is to convert a web page to an RSS feed, the “page” should be a listing page that changes over time, not the homepage.

    What usually works:

    • Blog updates: /blog, not /
    • Changelog updates: /changelog, /releases, /updates
    • Docs notes: a “What’s new” index page
    • Security advisories: advisory index page, not a single CVE page

    What should be avoided:

    • pages with infinite filters and sorts (faceted navigation)
    • internal search pages that change per user/session
    • URLs with tracking parameters (utm_*, fbclid, etc.)

    In other words: start from the cleanest “index of updates” page you can find.

    Create the feed (POST /v2/feed)

    Only one field is required: url.

    Useful optional fields:

    • scrape_type: markdown (default), cleaned, or html
    • items_limit: max pages crawled per run (default: 10)
    • max_depth: link-follow depth from the seed URL (0-10)
    • whitelist_regexp: only URLs that match are crawled
    • blacklist_regexp: URLs that match are skipped
    • respect_robots_txt: robots.txt is respected when set to true (default: false)
    • main_content_only: boilerplate is removed when set to true (default: false)
    • webhook_url: changes are pushed to your server when set

    This is a practical starting payload for a changelog feed:

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed" \
      -H "Authorization: Bearer <YOUR_API_KEY>" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/changelog",
        "name": "Example Changelog",
        "scrape_type": "markdown",
        "items_limit": 20,
        "max_depth": 1,
        "respect_robots_txt": true,
        "main_content_only": true,
        "webhook_url": "https://yourserver.com/webhook"
      }'
    

    The returned id is the only thing needed for the read endpoints.
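
    If you prefer to create the feed from code, here is a rough Python equivalent of the curl call above. It assumes the response body contains the new feed's id (as described above); error handling is kept minimal.

    # create_feed.py - POST /v2/feed from Python (sketch)
    import os
    import requests

    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]   # placeholder env var

    payload = {
        "url": "https://example.com/changelog",
        "name": "Example Changelog",
        "scrape_type": "markdown",
        "items_limit": 20,
        "max_depth": 1,
        "respect_robots_txt": True,
        "main_content_only": True,
        "webhook_url": "https://yourserver.com/webhook",
    }

    resp = requests.post(
        "https://api.webcrawlerapi.com/v2/feed",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    feed_id = resp.json()["id"]  # all you need for the read endpoints
    print("Feed created:", feed_id)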

    Receive updates as JSON Feed and RSS

    Two formats are supported:

    • JSON Feed 1.1: easiest for code
    • Atom 1.0 (RSS-style): easiest for RSS readers

    # JSON Feed
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/json" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    
    # Atom (RSS-style)
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/rss" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    
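
    To consume the JSON Feed in code, iterate over items and read the _webcrawlerapi extension for the change type and stored content. A rough sketch, assuming the item shape from the example response above; whether content_url needs authentication is not covered here, so treat that part as an assumption.

    # read_feed.py - parse the JSON Feed response (sketch)
    import os
    import requests

    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]
    FEED_ID = os.environ["FEED_ID"]

    resp = requests.get(
        f"https://api.webcrawlerapi.com/v2/feed/{FEED_ID}/json",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"page": 1, "page_size": 50},
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        extra = item.get("_webcrawlerapi", {})
        print(item.get("date_modified"), extra.get("change_type"), item.get("url"))
        if extra.get("content_url"):
            # Assumed to be directly downloadable; holds the stored page
            # content in the format chosen by scrape_type.
            content = requests.get(extra["content_url"], timeout=30).text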

    Scope the crawl so it does not run away

    In practice, this usually means: “watch a small slice of the site, and only the pages that matter”.

    Scoping is where a feed succeeds or fails:

    • keep max_depth low (often 0 or 1)
    • use whitelist_regexp to keep only the pages that matter
    • use blacklist_regexp to block traps (search, tags, query strings)

    Why query params can break everything

    Many sites generate endless URL variants:

    • ?page=2
    • ?sort=newest
    • ?tag=security
    • ?price_min=...&price_max=...

    That creates an infinite URL space: crawl budget is wasted, duplicates appear, and “new item” detection gets noisy.

    Practical approach:

    1. Start with max_depth: 0 or 1.
    2. Add a whitelist_regexp that matches only the pages you want.
    3. Add a blacklist_regexp that blocks obvious traps.
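
    Put together, a tightly scoped payload might look like the sketch below. The regular expressions are only illustrative; adjust them to the real URL structure of the site you are watching.

    # A scoped payload for the changelog example (illustrative regexps)
    scoped_payload = {
        "url": "https://example.com/changelog",
        "scrape_type": "markdown",
        "items_limit": 20,
        "max_depth": 1,                                           # step 1: stay shallow
        "whitelist_regexp": r"^https://example\.com/changelog/",  # step 2: only changelog entries
        "blacklist_regexp": r"\?(page|sort|tag|utm_)",            # step 3: skip query-string traps
    }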

    Pick the content format you actually need

    scrape_type controls the stored content format:

    • markdown (default): good for reading and diffing
    • cleaned: good when “just the text” is needed
    • html: good when structure is needed (tables, code blocks, rich layout)

    main_content_only can be enabled when nav/footers are too noisy. It is helpful, but it is not magic.

    If you are building a crawler yourself, this decision usually lives in the Parser stage.

    Polling vs webhooks

    Updates can be consumed by polling the feed endpoints, or they can be pushed to your server via a webhook.

    webhook_url is useful when:

    • latency matters (alerts should arrive quickly)
    • many feeds are tracked and fewer cron jobs are desired
    • a webhook receiver already exists

    Even with webhooks, make item processing idempotent: store the JSON Feed item id, then reconcile by id on every fetch.
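
    One way to get that idempotency is a small SQLite table keyed by the item id, so reprocessing the same item is a no-op. This is a sketch, not part of the API; the table layout is an assumption.

    # idempotent_items.py - idempotent item processing with SQLite (sketch)
    import sqlite3

    conn = sqlite3.connect("feed_items.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               id TEXT PRIMARY KEY,
               url TEXT,
               change_type TEXT,
               date_modified TEXT
           )"""
    )

    def process_items(items):
        """Process each item at most once, keyed by its JSON Feed id."""
        for item in items:
            already = conn.execute(
                "SELECT 1 FROM items WHERE id = ?", (item["id"],)
            ).fetchone()
            if already:
                continue  # seen before: reprocessing is a no-op
            conn.execute(
                "INSERT INTO items (id, url, change_type, date_modified) VALUES (?, ?, ?, ?)",
                (
                    item["id"],
                    item.get("url"),
                    item.get("_webcrawlerapi", {}).get("change_type"),
                    item.get("date_modified"),
                ),
            )
            print("new item:", item.get("title"))
        conn.commit()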

    What the webhook sends

    When webhook_url is set, an HTTP POST request is sent after a feed run completes. The request body is JSON Feed 1.1 (the same shape as GET /v2/feed/:id/json), so the same parser can be reused.

    Two practical details are worth knowing:

    • Only new and changed items are pushed to the webhook.
    • unavailable items are tracked in the feed, but are not pushed to the webhook.

    This is why the webhook should be treated as a trigger, not as a database. When the webhook fires, the JSON feed can be fetched and reconciled by item.id.
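
    A receiver can therefore stay very small. Here is a sketch using Flask (any web framework works): accept the POST as a signal, then fetch the feed and reconcile by id instead of trusting the pushed body as the source of truth.

    # webhook_receiver.py - treat the webhook as a trigger (sketch)
    import os
    import requests
    from flask import Flask, request

    app = Flask(__name__)
    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]
    FEED_ID = os.environ["FEED_ID"]

    @app.route("/webhook", methods=["POST"])
    def webhook():
        pushed = request.get_json(silent=True) or {}
        print("webhook fired, pushed items:", len(pushed.get("items", [])))

        # Trigger, not database: re-fetch the feed and reconcile by item id.
        resp = requests.get(
            f"https://api.webcrawlerapi.com/v2/feed/{FEED_ID}/json",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            pass  # store/process by item["id"] (see the SQLite sketch above)
        return "", 204

    if __name__ == "__main__":
        app.run(port=8000)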

    Resending a webhook

    If your endpoint was down, the last completed feed run can be replayed:

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/webhook/resend" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    

    Common failure modes (and what usually fixes them)

    • JavaScript rendering: choose a different source URL if possible; otherwise a rendering approach is required
    • Missing dates: prefer list pages with visible dates; detail pages may need to be crawled
    • Duplicates: block query params and whitelist canonical paths
    • Pagination: use a low depth and a whitelist; avoid mirroring the full history
    • 403/429 blocks: reduce scope and slow the crawl; respect robots.txt when needed

    Related reading

    • How to crawl the website with Python
    • How to Build a Web Crawler