
    How to Convert Any Website to an RSS Feed

    Need updates from a site you do not control? Create a WebCrawlerAPI feed for any URL, then read changes as JSON Feed or Atom (RSS-style) from simple endpoints.

    Written by Andrew
    Published on Feb 6, 2026

    Table of Contents

    • How to Convert Any Website to an RSS Feed
    • What is being built
    • Choose the right source URL (this matters more than tools)
    • Create the feed (POST /v2/feed)
    • Receive updates as JSON Feed and RSS
    • Scope the crawl so it does not run away
    • Why query params can break everything
    • Pick the content format you actually need
    • Polling vs webhooks
    • What the webhook sends
    • Resending a webhook
    • Common failure modes (and what usually fixes them)
    • Related reading

    How to Convert Any Website to an RSS Feed

    Sometimes you need updates from a website you do not own. A vendor blog. A changelog. A docs page that quietly changes. There is no official feed, but you still want to know about new posts and changelog entries. In practice, that is when you convert the site to an RSS feed so updates can be pulled programmatically via an API.

    WebCrawlerAPI feeds handle this: create a feed with POST https://api.webcrawlerapi.com/v2/feed, then read changes as JSON Feed 1.1 from GET https://api.webcrawlerapi.com/v2/feed/:id/json, or as Atom 1.0 (RSS-style) from GET https://api.webcrawlerapi.com/v2/feed/:id/rss.

    # 1) Create feed
    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed" \
      -H "Authorization: Bearer <YOUR_API_KEY>" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/changelog",
        "name": "Example Changelog",
        "scrape_type": "markdown",
        "items_limit": 10,
        "max_depth": 1,
        "respect_robots_txt": false,
        "main_content_only": true
      }'
    
    # 2) Receive updates as JSON Feed 1.1
    # Content-Type: application/feed+json; charset=utf-8
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/json?page=1&page_size=50" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    
    # 3) Receive updates as Atom 1.0 (RSS-style)
    # Content-Type: application/atom+xml; charset=utf-8
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/rss?page=1&page_size=50" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    

    Tiny response examples:

    {
      "version": "https://jsonfeed.org/version/1.1",
      "title": "WebCrawlerAPI Feed: example.com",
      "items": [
        {
          "id": "item123",
          "url": "https://example.com/changelog/v1-2-3",
          "title": "v1.2.3",
          "summary": "New page discovered",
          "date_modified": "2026-02-06T10:00:00Z",
          "_webcrawlerapi": {
            "change_type": "new",
            "content_url": "https://cdn.webcrawlerapi.com/content/..."
          }
        }
      ]
    }
    
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>WebCrawlerAPI Feed: example.com</title>
      <entry>
        <id>urn:webcrawlerapi:feeditem:item123</id>
        <title>New: v1.2.3</title>
        <updated>2026-02-06T10:00:00Z</updated>
        <link href="https://example.com/changelog/v1-2-3" rel="alternate" />
        <summary type="text">New page discovered</summary>
      </entry>
    </feed>
    

    What is being built

    Once the feed is set up, it produces a stable output that can be plugged into:

    • an RSS reader (Atom endpoint)
    • a Slack bot (webhook)
    • a cron job (poll JSON feed)
    • a database sync (store item IDs and change types)

    This is the practical way to get RSS-style output without relying on the site owner.
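
    For example, the cron-job path can be a few lines of Python. This is a minimal sketch, not an official client: it assumes the JSON Feed endpoint and Bearer auth shown above, uses placeholder environment variables for the API key and feed id, and keeps seen item IDs in a local file so each run only reports what is new.

    # poll_feed.py - minimal cron-driven poller (sketch)
    import json
    import os
    import requests

    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]   # placeholder env var
    FEED_ID = os.environ["FEED_ID"]             # id returned by POST /v2/feed
    SEEN_FILE = "seen_items.json"

    def load_seen() -> set:
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                return set(json.load(f))
        return set()

    resp = requests.get(
        f"https://api.webcrawlerapi.com/v2/feed/{FEED_ID}/json",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()

    seen = load_seen()
    for item in resp.json().get("items", []):
        if item["id"] not in seen:
            print(f'New or changed: {item.get("title")} -> {item.get("url")}')
            seen.add(item["id"])
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)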

    Choose the right source URL (this matters more than tools)

    Most failures are caused by choosing the wrong URL.

    If the goal is to convert a web page to an RSS feed, the “page” should be a listing page that changes over time, not the homepage.

    What usually works:

    • Blog updates: /blog, not /
    • Changelog updates: /changelog, /releases, /updates
    • Docs notes: a “What’s new” index page
    • Security advisories: advisory index page, not a single CVE page

    What should be avoided:

    • pages with infinite filters and sorts (faceted navigation)
    • internal search pages that change per user/session
    • URLs with tracking parameters (utm_*, fbclid, etc.)

    In other words: start from the cleanest “index of updates” page you can find.

    Create the feed (POST /v2/feed)

    Only one field is required: url.

    Useful optional fields:

    • scrape_type: markdown (default), cleaned, or html
    • items_limit: max pages crawled per run (default: 10)
    • max_depth: link-follow depth from the seed URL (0-10)
    • whitelist_regexp: only URLs that match are crawled
    • blacklist_regexp: URLs that match are skipped
    • respect_robots_txt: robots.txt is respected when set to true (default: false)
    • main_content_only: boilerplate is removed when set to true (default: false)
    • webhook_url: changes are pushed to your server when set

    This is a practical starting payload for a changelog feed:

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed" \
      -H "Authorization: Bearer <YOUR_API_KEY>" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/changelog",
        "name": "Example Changelog",
        "scrape_type": "markdown",
        "items_limit": 20,
        "max_depth": 1,
        "respect_robots_txt": true,
        "main_content_only": true,
        "webhook_url": "https://yourserver.com/webhook"
      }'
    

    The returned id is the only thing needed for the read endpoints.
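
    If you prefer to create the feed from code, here is a rough Python equivalent of the curl call above. It assumes the response body contains the new feed's id (as described above); error handling is kept minimal.

    # create_feed.py - POST /v2/feed from Python (sketch)
    import os
    import requests

    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]   # placeholder env var

    payload = {
        "url": "https://example.com/changelog",
        "name": "Example Changelog",
        "scrape_type": "markdown",
        "items_limit": 20,
        "max_depth": 1,
        "respect_robots_txt": True,
        "main_content_only": True,
        "webhook_url": "https://yourserver.com/webhook",
    }

    resp = requests.post(
        "https://api.webcrawlerapi.com/v2/feed",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    feed_id = resp.json()["id"]  # all you need for the read endpoints
    print("Feed created:", feed_id)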

    Receive updates as JSON Feed and RSS

    Two formats are supported:

    • JSON Feed 1.1: easiest for code
    • Atom 1.0 (RSS-style): easiest for RSS readers

    # JSON Feed
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/json" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    
    # Atom (RSS-style)
    curl -sS "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/rss" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    
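
    To consume the JSON Feed in code, iterate over items and read the _webcrawlerapi extension for the change type and stored content. A rough sketch, assuming the item shape from the example response above; whether content_url needs authentication is not covered here, so treat that part as an assumption.

    # read_feed.py - parse the JSON Feed response (sketch)
    import os
    import requests

    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]
    FEED_ID = os.environ["FEED_ID"]

    resp = requests.get(
        f"https://api.webcrawlerapi.com/v2/feed/{FEED_ID}/json",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"page": 1, "page_size": 50},
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json().get("items", []):
        extra = item.get("_webcrawlerapi", {})
        print(item.get("date_modified"), extra.get("change_type"), item.get("url"))
        if extra.get("content_url"):
            # Assumed to be directly downloadable; holds the stored page
            # content in the format chosen by scrape_type.
            content = requests.get(extra["content_url"], timeout=30).text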

    Scope the crawl so it does not run away

    In practice, this usually means: “watch a small slice of the site, and only the pages that matter”.

    Scoping is where a feed succeeds or fails:

    • keep max_depth low (often 0 or 1)
    • use whitelist_regexp to keep only the pages that matter
    • use blacklist_regexp to block traps (search, tags, query strings)

    Why query params can break everything

    Many sites generate endless URL variants:

    • ?page=2
    • ?sort=newest
    • ?tag=security
    • ?price_min=...&price_max=...

    That creates an infinite URL space: crawl budget is wasted, duplicates appear, and “new item” detection gets noisy.

    Practical approach:

    1. Start with max_depth: 0 or 1.
    2. Add a whitelist_regexp that matches only the pages you want.
    3. Add a blacklist_regexp that blocks obvious traps.
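
    Put together, a tightly scoped payload might look like the sketch below. The regular expressions are only illustrative; adjust them to the real URL structure of the site you are watching.

    # A scoped payload for the changelog example (illustrative regexps)
    scoped_payload = {
        "url": "https://example.com/changelog",
        "scrape_type": "markdown",
        "items_limit": 20,
        "max_depth": 1,                                           # step 1: stay shallow
        "whitelist_regexp": r"^https://example\.com/changelog/",  # step 2: only changelog entries
        "blacklist_regexp": r"\?(page|sort|tag|utm_)",            # step 3: skip query-string traps
    }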

    Pick the content format you actually need

    scrape_type controls the stored content format:

    • markdown (default): good for reading and diffing
    • cleaned: good when “just the text” is needed
    • html: good when structure is needed (tables, code blocks, rich layout)

    main_content_only can be enabled when nav/footers are too noisy. It is helpful, but it is not magic.

    If you are building a crawler yourself, this decision usually lives in the Parser stage.

    Polling vs webhooks

    Updates can be consumed by polling the feed endpoints, or they can be pushed to your server via a webhook.

    webhook_url is useful when:

    • latency matters (alerts should arrive quickly)
    • many feeds are tracked and fewer cron jobs are desired
    • a webhook receiver already exists

    Even with webhooks, make item processing idempotent: store the JSON Feed item id, then reconcile by id on every fetch.
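
    One way to get that idempotency is a small SQLite table keyed by the item id, so reprocessing the same item is a no-op. This is a sketch, not part of the API; the table layout is an assumption.

    # idempotent_items.py - idempotent item processing with SQLite (sketch)
    import sqlite3

    conn = sqlite3.connect("feed_items.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               id TEXT PRIMARY KEY,
               url TEXT,
               change_type TEXT,
               date_modified TEXT
           )"""
    )

    def process_items(items):
        """Process each item at most once, keyed by its JSON Feed id."""
        for item in items:
            already = conn.execute(
                "SELECT 1 FROM items WHERE id = ?", (item["id"],)
            ).fetchone()
            if already:
                continue  # seen before: reprocessing is a no-op
            conn.execute(
                "INSERT INTO items (id, url, change_type, date_modified) VALUES (?, ?, ?, ?)",
                (
                    item["id"],
                    item.get("url"),
                    item.get("_webcrawlerapi", {}).get("change_type"),
                    item.get("date_modified"),
                ),
            )
            print("new item:", item.get("title"))
        conn.commit()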

    What the webhook sends

    When webhook_url is set, an HTTP POST request is sent after a feed run completes. The request body is JSON Feed 1.1 (the same shape as GET /v2/feed/:id/json), so the same parser can be reused.

    Two practical details are worth knowing:

    • Only new and changed items are pushed to the webhook.
    • unavailable items are tracked in the feed, but are not pushed to the webhook.

    This is why the webhook should be treated as a trigger, not as a database. When the webhook fires, the JSON feed can be fetched and reconciled by item.id.
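
    A receiver can therefore stay very small. Here is a sketch using Flask (any web framework works): accept the POST as a signal, then fetch the feed and reconcile by id instead of trusting the pushed body as the source of truth.

    # webhook_receiver.py - treat the webhook as a trigger (sketch)
    import os
    import requests
    from flask import Flask, request

    app = Flask(__name__)
    API_KEY = os.environ["WEBCRAWLERAPI_KEY"]
    FEED_ID = os.environ["FEED_ID"]

    @app.route("/webhook", methods=["POST"])
    def webhook():
        pushed = request.get_json(silent=True) or {}
        print("webhook fired, pushed items:", len(pushed.get("items", [])))

        # Trigger, not database: re-fetch the feed and reconcile by item id.
        resp = requests.get(
            f"https://api.webcrawlerapi.com/v2/feed/{FEED_ID}/json",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            pass  # store/process by item["id"] (see the SQLite sketch above)
        return "", 204

    if __name__ == "__main__":
        app.run(port=8000)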

    Resending a webhook

    If your endpoint was down, the last completed feed run can be replayed:

    curl -sS -X POST "https://api.webcrawlerapi.com/v2/feed/<FEED_ID>/webhook/resend" \
      -H "Authorization: Bearer <YOUR_API_KEY>"
    

    Common failure modes (and what usually fixes them)

    • JavaScript rendering: choose a different source URL if possible; otherwise a rendering approach is required
    • Missing dates: prefer list pages with visible dates; detail pages may need to be crawled
    • Duplicates: block query params and whitelist canonical paths
    • Pagination: use a low depth and a whitelist; avoid mirroring the full history
    • 403/429 blocks: reduce scope and slow the crawl; respect robots.txt when needed

    Related reading

    • How to crawl the website with Python
    • How to Build a Web Crawler