
GET /job/:id/markdown

Download the combined markdown output for a completed markdown crawl job

Retrieves all markdown content from a completed crawl job as a single combined file. The endpoint downloads every successful markdown job item in parallel, joins them with URL separators, caches the result in Cloudflare R2, and serves the combined markdown file.

Method: GET

Request example

curl --request GET \
  --url https://api.webcrawlerapi.com/v1/job/46c7b8ff-eb5e-4ebb-96f1-2685334c07d7/markdown \
  --header 'Authorization: Bearer <YOUR TOKEN>' \
  --output combined.md
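
The same request as a Python sketch, using the requests library (the token and job ID are placeholders to substitute with your own values):

import requests

API_TOKEN = "<YOUR TOKEN>"  # placeholder
JOB_ID = "46c7b8ff-eb5e-4ebb-96f1-2685334c07d7"  # placeholder

resp = requests.get(
    f"https://api.webcrawlerapi.com/v1/job/{JOB_ID}/markdown",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()  # surfaces the 4xx/5xx errors documented below

# Save the combined markdown exactly as served (text/markdown; charset=utf-8).
with open("combined.md", "wb") as f:
    f.write(resp.content)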

Response format

  • Content-Type: text/markdown; charset=utf-8
  • Each page is preceded by a header block showing its source URL:
    • Separator: ----
    • URL line: url: <page_url>
    • Separator: ----
    • Followed by a blank line and the markdown content for that page
  • Pages are included only if the item finished successfully and has a markdown_content_url (see the parsing sketch after the example response below)

Example response

----
url: https://docs.example.com/getting-started
----

# Getting Started

Welcome to the docs...


----
url: https://docs.example.com/faq
----

# FAQ

Common questions and answers...
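
Because every page begins with the documented url: header block, the combined file can be split back into per-page sections. A minimal Python sketch (the regex and the split_pages function are illustrative, not part of the API):

import re

# Matches the documented separator block: ----\nurl: <page_url>\n----\n\n
SEPARATOR = re.compile(r"^----\nurl: (?P<url>[^\n]+)\n----\n\n", re.MULTILINE)

def split_pages(combined: str) -> list[tuple[str, str]]:
    """Split combined markdown into (url, content) pairs."""
    pages = []
    matches = list(SEPARATOR.finditer(combined))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(combined)
        pages.append((m.group("url"), combined[start:end].strip()))
    return pages

with open("combined.md", encoding="utf-8") as f:
    for url, content in split_pages(f.read()):
        print(url, len(content), "chars")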

Requirements

For this endpoint to work successfully, the job must meet these requirements:

  1. Markdown Type: The job must have been created with scrape_type: "markdown" (default)
  2. Completed Status: The job status must be done
  3. Successful Items: At least one job item must have completed successfully with markdown content

Error responses

400 Bad Request

{
  "error": "Job is not a markdown type",
  "message": "This endpoint only supports jobs with markdown scrape type"
}

The job was created with a different scrape type (e.g., html or cleaned).

401 Unauthorized

{
  "error": "Access denied"
}

The job does not belong to your organization.

404 Not Found

{
  "error": "Job not found"
}

The job ID does not exist. A 404 is also returned when the job exists but produced no usable markdown:

{
  "error": "No markdown content available",
  "message": "No successful items with markdown content found"
}

The job exists but has no successful items with markdown content.

422 Unprocessable Entity

{
  "error": "Job not finished",
  "message": "Job must be in 'done' status to generate markdown file",
  "status": "in_progress"
}

The job is still processing. Wait for it to complete before requesting the combined markdown.
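
One way to wait is to poll the job detail endpoint until the status field reads done, as in this sketch (the 5-second interval is an arbitrary choice):

import time
import requests

API_TOKEN = "<YOUR TOKEN>"  # placeholder
JOB_ID = "job-id-here"  # placeholder
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Poll /v1/job/{id} until the job reports status "done".
while True:
    job = requests.get(
        f"https://api.webcrawlerapi.com/v1/job/{JOB_ID}", headers=headers
    ).json()
    if job["status"] == "done":
        break
    time.sleep(5)  # arbitrary polling interval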

500 Internal Server Error

{
  "error": "Failed to download markdown content",
  "errors": [
    "https://example.com/page1: connection timeout",
    "https://example.com/page2: network error"
  ]
}

All markdown downloads failed; individual error messages are provided in the errors array. (If at least one download succeeds, the endpoint returns the available content instead; see Partial Success below.)
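
Putting the cases together, a client might branch on the status code like this (a sketch; the retry policy is up to you):

import requests

API_TOKEN = "<YOUR TOKEN>"  # placeholder
JOB_ID = "job-id-here"  # placeholder

resp = requests.get(
    f"https://api.webcrawlerapi.com/v1/job/{JOB_ID}/markdown",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)

if resp.status_code == 200:
    with open("combined.md", "wb") as f:
        f.write(resp.content)
elif resp.status_code == 422:
    # Job not finished yet: poll /v1/job/{id} and retry once status is "done".
    print("Still processing:", resp.json().get("status"))
elif resp.status_code in (400, 401, 404):
    # Wrong scrape type, access denied, or missing job/content: not retryable.
    print("Request failed:", resp.json().get("error"))
else:
    # 500: all downloads failed; individual reasons are in the errors array.
    print("Server error:", resp.json().get("errors"))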

Performance & Caching

  • Parallel Downloads: The endpoint downloads all job items in parallel for optimal performance
  • Caching: The combined markdown is cached in Cloudflare R2 at content/{org_id}/{job_id}-raw.md
  • Cache Hit: Subsequent requests serve the cached file immediately without re-downloading
  • Partial Success: If some items fail to download but at least one succeeds, the endpoint returns the available content

Use Cases

This endpoint is useful for:

  1. Batch Processing: Get all crawled content in a single request instead of downloading individual items
  2. Data Analysis: Process entire website content at once for analysis or indexing
  3. Backup: Archive the complete crawl results in a single file
  4. RAG Applications: Feed combined content into vector databases or AI models (see the sketch after this list)
  5. Documentation Extraction: Extract and combine documentation from multiple pages
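
For the RAG use case, the per-page sections map naturally onto ingestion records. A sketch that reuses the documented separator format to emit JSON Lines (the record shape is illustrative):

import json
import re

SEPARATOR = re.compile(r"^----\nurl: (?P<url>[^\n]+)\n----\n\n", re.MULTILINE)

# Write one {"url": ..., "text": ...} record per crawled page, ready for
# chunking and embedding in whatever vector store you use.
with open("combined.md", encoding="utf-8") as src, open("pages.jsonl", "w", encoding="utf-8") as out:
    combined = src.read()
    matches = list(SEPARATOR.finditer(combined))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(combined)
        record = {"url": m.group("url"), "text": combined[m.end():end].strip()}
        out.write(json.dumps(record) + "\n")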

Example Workflow

# Step 1: Create a crawl job
curl --request POST \
  --url https://api.webcrawlerapi.com/v1/crawl \
  --header 'Authorization: Bearer <YOUR TOKEN>' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "scrape_type": "markdown",
    "items_limit": 10
  }'

# Response: {"id": "job-id-here"}

# Step 2: Wait for job to complete (poll /v1/job/{id} until status is "done")

# Step 3: Download combined markdown
curl --request GET \
  --url https://api.webcrawlerapi.com/v1/job/job-id-here/markdown \
  --header 'Authorization: Bearer <YOUR TOKEN>' \
  --output website-content.md
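
The same workflow as a single Python sketch (the payload mirrors the curl example; the polling interval and filenames are arbitrary):

import time
import requests

API_TOKEN = "<YOUR TOKEN>"  # placeholder
BASE = "https://api.webcrawlerapi.com/v1"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Step 1: create a markdown crawl job (same payload as the curl example).
job_id = requests.post(
    f"{BASE}/crawl",
    headers=headers,
    json={"url": "https://example.com", "scrape_type": "markdown", "items_limit": 10},
).json()["id"]

# Step 2: poll until the job reports status "done".
while requests.get(f"{BASE}/job/{job_id}", headers=headers).json()["status"] != "done":
    time.sleep(5)  # arbitrary polling interval

# Step 3: download the combined markdown.
resp = requests.get(f"{BASE}/job/{job_id}/markdown", headers=headers)
resp.raise_for_status()
with open("website-content.md", "wb") as f:
    f.write(resp.content)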

Notes

  • Only items with status: "done", is_success: true, and a valid markdown_content_url are included
  • Failed items are silently skipped and do not appear in the combined output
  • The separator format (----\nurl: {url}\n----\n\n) makes it easy to split the combined file back into individual pages (see the parsing sketch under Example response)
  • Large jobs may take a moment on the first request while the parallel downloads run; subsequent requests are served directly from the cache
  • The markdown preserves the formatting, links, and structure of the original pages