GET /job/:id/markdown
Download combined markdown output for a completed markdown crawl
Retrieves all markdown content from a completed crawl job as a single combined file. The endpoint downloads every successful markdown job item in parallel, joins them with URL separators, caches the result in Cloudflare R2, and serves the combined markdown file.
Method: GET
Request example
```bash
curl --request GET \
  --url https://api.webcrawlerapi.com/v1/job/46c7b8ff-eb5e-4ebb-96f1-2685334c07d7/markdown \
  --header 'Authorization: Bearer <YOUR TOKEN>' \
  --output combined.md
```

Response format
- Content-Type: `text/markdown; charset=utf-8`
- Each page is separated by a block showing the source URL:
  - Separator: `----`
  - URL line: `url: <page_url>`
  - Separator: `----`
  - Followed by a blank line and the markdown content for that page
- Pages are included only if the item finished successfully and has `markdown_content_url`.
Example response
```markdown
----
url: https://docs.example.com/getting-started
----

# Getting Started
Welcome to the docs...

----
url: https://docs.example.com/faq
----

# FAQ
Common questions and answers...
```

Requirements
For this endpoint to work successfully, the job must meet these requirements:
- Markdown Type: The job must have been created with `scrape_type: "markdown"` (the default)
- Completed Status: The job status must be `done`
- Successful Items: At least one job item must have completed successfully with markdown content
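Before requesting the combined file, you can verify the first two requirements via the job detail endpoint used in the workflow below. A minimal sketch, assuming `jq` is installed and that the job response exposes `scrape_type` and `status` fields (`status` matches the 422 example below; `scrape_type` mirrors the crawl request body):

```bash
# Sketch: inspect the job's type and status before requesting the combined
# markdown. The scrape_type/status field names are assumptions noted above.
curl -s \
  --url https://api.webcrawlerapi.com/v1/job/46c7b8ff-eb5e-4ebb-96f1-2685334c07d7 \
  --header 'Authorization: Bearer <YOUR TOKEN>' | jq '{scrape_type, status}'
```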
Error responses
400 Bad Request
```json
{
  "error": "Job is not a markdown type",
  "message": "This endpoint only supports jobs with markdown scrape type"
}
```

The job was created with a different scrape type (e.g., `html` or `cleaned`).
401 Unauthorized
```json
{
  "error": "Access denied"
}
```

The job does not belong to your organization.
404 Not Found
```json
{
  "error": "Job not found"
}
```

The job ID does not exist.

```json
{
  "error": "No markdown content available",
  "message": "No successful items with markdown content found"
}
```

The job exists but has no successful items with markdown content.
422 Unprocessable Entity
```json
{
  "error": "Job not finished",
  "message": "Job must be in 'done' status to generate markdown file",
  "status": "in_progress"
}
```

The job is still processing. Wait for it to complete before requesting the combined markdown.
500 Internal Server Error
```json
{
  "error": "Failed to download markdown content",
  "errors": [
    "https://example.com/page1: connection timeout",
    "https://example.com/page2: network error"
  ]
}
```

All markdown downloads failed. Individual error messages are provided in the `errors` array.
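When scripting against this endpoint, the cases above can be told apart by HTTP status code. A minimal shell sketch, assuming a 200 code on success (the success code is not stated on this page) and an arbitrary 5-second retry delay:

```bash
# Fetch the combined markdown; retry while the job is still processing (422)
# and treat any other non-200 code as fatal. Error bodies are JSON, as above.
JOB_ID="46c7b8ff-eb5e-4ebb-96f1-2685334c07d7"
while :; do
  CODE=$(curl -s -o combined.md -w '%{http_code}' \
    --url "https://api.webcrawlerapi.com/v1/job/$JOB_ID/markdown" \
    --header 'Authorization: Bearer <YOUR TOKEN>')
  [ "$CODE" = "200" ] && break                         # combined.md is ready
  [ "$CODE" = "422" ] || { cat combined.md; exit 1; }  # print the JSON error
  sleep 5                                              # still processing
done
```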
Performance & Caching
- Parallel Downloads: The endpoint downloads all job items in parallel for optimal performance
- Caching: The combined markdown is cached in Cloudflare R2 at `content/{org_id}/{job_id}-raw.md`
- Cache Hit: Subsequent requests serve the cached file immediately without re-downloading
- Partial Success: If some items fail to download but at least one succeeds, the endpoint returns the available content
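One way to observe the caching behavior is to time the same request twice: the first call builds and caches the file, the second should be served straight from R2.

```bash
# First call: parallel download + combine + cache write. Second call: cache hit.
URL=https://api.webcrawlerapi.com/v1/job/46c7b8ff-eb5e-4ebb-96f1-2685334c07d7/markdown
time curl -s -o /dev/null "$URL" --header 'Authorization: Bearer <YOUR TOKEN>'
time curl -s -o /dev/null "$URL" --header 'Authorization: Bearer <YOUR TOKEN>'
```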
Use Cases
This endpoint is useful for:
- Batch Processing: Get all crawled content in a single request instead of downloading individual items
- Data Analysis: Process entire website content at once for analysis or indexing
- Backup: Archive the complete crawl results in a single file
- RAG Applications: Feed combined content into vector databases or AI models
- Documentation Extraction: Extract and combine documentation from multiple pages
Example Workflow
```bash
# Step 1: Create a crawl job
curl --request POST \
  --url https://api.webcrawlerapi.com/v1/crawl \
  --header 'Authorization: Bearer <YOUR TOKEN>' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "scrape_type": "markdown",
    "items_limit": 10
  }'
# Response: {"id": "job-id-here"}

# Step 2: Wait for job to complete (poll /v1/job/{id} until status is "done")
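# A minimal polling sketch: assumes jq is installed; the "status" field name
# follows the 422 error example above, and the 5-second delay is arbitrary.
JOB_ID="job-id-here"
until [ "$(curl -s \
  --url "https://api.webcrawlerapi.com/v1/job/$JOB_ID" \
  --header 'Authorization: Bearer <YOUR TOKEN>' | jq -r '.status')" = "done" ]; do
  sleep 5
done
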
# Step 3: Download combined markdown
curl --request GET \
  --url https://api.webcrawlerapi.com/v1/job/job-id-here/markdown \
  --header 'Authorization: Bearer <YOUR TOKEN>' \
  --output website-content.md
```

Notes
- Only items with `status: "done"`, `is_success: true`, and a valid `markdown_content_url` are included
- Failed items are silently skipped and do not appear in the combined output
- The separator format (`----\nurl: {url}\n----\n\n`) makes it easy to parse and split the combined file back into individual pages (see the sketch after this list)
- Large jobs may take a moment on the first request due to parallel downloads, but subsequent requests are instant thanks to caching
- The markdown format preserves all formatting, links, and structure from the original pages
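As an illustration of the separator note above, here is a small `awk` sketch that splits the combined file back into per-page files. It is a sketch, not part of the API: the `page-NNN.md` output naming is invented, and it assumes page content never contains a line that is exactly `----`.

```bash
# Split combined.md into one file per page using the documented
# "----\nurl: {url}\n----" separator. Output names are illustrative.
awk '
  /^----$/ { dash++; next }            # separator lines are not copied
  /^url: / && dash % 2 == 1 {          # the url: line between two ---- lines
    n++
    out = sprintf("page-%03d.md", n)
    print "<!-- " $0 " -->" > out      # keep the source URL as an HTML comment
    next
  }
  n { print >> out }                   # everything else is page content
' combined.md
```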