WebCrawlerAPI

Crawling Agent (Wagent)

AI-powered agent that autonomously browses websites and extracts structured data based on a prompt.

Wagent is an AI-powered service that autonomously browses websites, follows links, and extracts structured data — all from a plain-language prompt. You describe what you want; the agent figures out how to get it.

Unlike standard scraping that fetches a single URL, the agent acts like a researcher: it reads pages, decides which links are relevant, follows them, and returns structured JSON.

Use cases

  • Find customers, partners, or contact info across a company's website or public directories
  • Extract product lists, pricing tiers, feature sets, or press mentions from competitor sites
  • Gather articles, job listings, events, or announcements from sites that lack an API
  • Verify that published information (pricing, availability, specs) matches what you expect
  • Collect industry data, reviews, or survey results across many pages in one request

Quick start

Find customers on a website

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/agent \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "Find all customers listed on the website",
    "urls": ["https://www.mintlify.com/"],
    "max_spend_usd": 0.5
  }'

Response when done:

{
  "id": "ar_abc123",
  "status": "done",
  "data": {
    "customers": [
      "Coinbase", "Anaconda", "Anthropic", "AT&T", "Browserbase",
      "Fidelity", "Cognition", "Decagon", "Dub", "Kalshi",
      "HubSpot", "Loops", "Lovable", "Meter", "Metronome",
      "Laravel", "Mirage", "Ollama", "PayPal", "Perplexity",
      "Layers", "Pinecone", "Planetscale", "Replit", "Resend",
      "Zapier", "Together AI", "Vercel", "Worldcoin", "X"
    ]
  }
}
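Once `status` is `done`, the `data` field can be consumed directly as JSON. A minimal Python sketch, using a truncated copy of the sample response above:

```python
import json

# Sample response body as returned for a finished run (truncated copy
# of the example above).
raw = '''
{
  "id": "ar_abc123",
  "status": "done",
  "data": {
    "customers": ["Coinbase", "Anaconda", "Anthropic"]
  }
}
'''

run = json.loads(raw)
customers = []
if run["status"] == "done":
    # data is always JSON, so it can be read like any parsed object
    customers = run["data"]["customers"]
    print(f"{len(customers)} customers found")
```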
Extract job listings from seed URLs only

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/agent \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "Extract all job listings with title, location, and department for Anthropic",
    "urls": ["https://www.anthropic.com/"],
    "max_spend_usd": 1,
    "seed_urls_only": true
  }'
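The same request can be built from a client language. A dependency-free Python sketch, assuming only the endpoint and headers shown in the curl examples above (the commented-out send returns the run id immediately, since runs are asynchronous):

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # replace with a real key

payload = {
    "prompt": "Extract all job listings with title, location, and department for Anthropic",
    "urls": ["https://www.anthropic.com/"],
    "max_spend_usd": 1,
    "seed_urls_only": True,  # process only the seed URLs, do not follow links
}

req = urllib.request.Request(
    "https://api.webcrawlerapi.com/v1/agent",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Sending the request returns the run id right away:
# with urllib.request.urlopen(req) as resp:
#     run_id = json.load(resp)["id"]
```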

Extract with a strict schema

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/agent \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "gpt-5.4-mini",
  "prompt": "Extract all job listings with title, location, and department for Anthropic",
  "urls": ["https://example.com/careers"],
  "max_spend_usd": 1.0,
  "output_schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "jobs": {
        "type": "array",
        "items": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "title": { "type": "string" },
            "location": { "type": "string" },
            "department": { "type": "string" }
          },
          "required": ["title", "location", "department"]
        }
      }
    },
    "required": ["jobs"]
  }
}'

Parameters

Parameter        Required  Description
prompt           yes       Natural-language instruction describing what to extract or find
max_spend_usd    yes       Maximum budget in USD the agent may spend on this run. Must be > 0
urls             no        Seed URLs the agent starts from. If omitted, the agent works from the prompt alone
seed_urls_only   no        When true, processes only the provided URLs without following links. Default: false
output_schema    no        JSON Schema describing the expected shape of the result
model            no        LLM model to use. See available models
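The required/optional split above can be checked client-side before a request spends any budget. A sketch with a hypothetical helper (the API performs its own validation; this only catches obvious mistakes early):

```python
def validate_agent_params(params):
    """Check a request body against the parameter table before submitting.

    Hypothetical client-side helper, not part of the official API.
    """
    if not params.get("prompt"):
        raise ValueError("prompt is required")
    spend = params.get("max_spend_usd")
    if not isinstance(spend, (int, float)) or spend <= 0:
        raise ValueError("max_spend_usd is required and must be > 0")
    if "urls" in params and not isinstance(params["urls"], list):
        raise ValueError("urls must be a list of seed URLs")


# A well-formed request body passes silently:
validate_agent_params({"prompt": "Find all customers", "max_spend_usd": 0.5})
```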

Writing good prompts

The agent output is always JSON. This means prompts should describe a data structure, not a free-text answer.

Examples

Good — specific, structured, tells the agent exactly what fields to return:

Find all customers listed on the website. Return a JSON object with a 'customers' field containing an array of company names.

Good — uses output_schema to enforce shape instead of relying on the prompt alone:

Prompt: "Find all pricing plans" + output_schema with plans[].name, plans[].price_usd, plans[].features[]

Good — instructs the agent on how to handle uncertainty:

Extract all blog post titles and their publication dates. If the date is not visible, set the 'date' field to null.

Bad — vague, no structure, agent will guess a response shape:

Summarize the website

Bad — asks for a plain-text yes/no instead of a JSON-safe value:

Is there a free plan? Put only 'yes' or 'no' in the response

Good version of the above — uses a field with enumerated options:

Check if there is a free plan. Respond with a JSON object with a 'free_plan_available' field set to either 'yes' or 'no'.

Bad — too broad a scope without a budget or focus:

Get everything from the website

The more specific the prompt, the better the result. Tell the agent what fields you expect, what to do when data is missing, and which pages are most relevant.
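The guidance above (name the fields, state the missing-data behavior) can be baked into a small prompt builder. A hypothetical helper, shown only to illustrate the pattern:

```python
def build_prompt(task, fields, missing="null"):
    """Compose a prompt that names the expected JSON fields and the
    missing-data behavior, following the guidance above. Hypothetical helper.
    """
    field_list = ", ".join(f"'{f}'" for f in fields)
    return (
        f"{task} Return a JSON object with the fields {field_list}. "
        f"If a field is not visible on the page, set it to {missing}."
    )


prompt = build_prompt(
    "Extract all blog post titles and publication dates.",
    ["title", "date"],
)
print(prompt)
```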

Controlling output shape with output_schema

Use output_schema when you need a guaranteed response structure. The agent will fill in your schema rather than inventing its own shape.

Example — enforce a list of jobs under a named field:

{
  "output_schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "jobs": {
        "type": "array",
        "items": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "title": { "type": "string" },
            "location": { "type": "string" },
            "department": { "type": "string" }
          },
          "required": ["title", "location", "department"]
        }
      }
    },
    "required": ["jobs"]
  }
}

This guarantees the response data field contains { "jobs": [...] } instead of a free-form object. Read more about Structured Outputs.

"jobs": [
    {
        "department": "AI Research & Engineering",
        "location": "San Francisco, CA",
        "title": "[Expression of Interest] Research Manager, Interpretability"
    }
]
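The shape guaranteed by `output_schema` can be spot-checked client-side. A minimal sketch that verifies each job object carries the required keys (a full JSON Schema validator such as the third-party `jsonschema` package would be the robust choice):

```python
# Required keys, mirroring the "required" list in the schema above.
REQUIRED = {"title", "location", "department"}

data = {
    "jobs": [
        {
            "department": "AI Research & Engineering",
            "location": "San Francisco, CA",
            "title": "[Expression of Interest] Research Manager, Interpretability",
        }
    ]
}

assert "jobs" in data
# Collect any job entry missing a required key.
bad = [job for job in data["jobs"] if not REQUIRED <= job.keys()]
print(f"{len(data['jobs'])} jobs, {len(bad)} missing required fields")
```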

Response fields

The data field is populated once status is done. All other fields are present immediately.

Field             Description
id                Unique identifier of the agent run (ar_...)
status            One of queued, processing, done, or failed
prompt            Prompt from the request
model             LLM model used
urls              Seed URLs provided
max_spend_usd     Spending cap set for this run
balance_used_usd  Actual amount spent
data              Extracted result (always JSON); present when status is done
success           true when the run completed with non-empty data
error_reason      Human-readable, agent-generated error message if the run failed
trace             Agent reasoning trace (if available)
llm_requests      List of individual LLM calls made during the run
created_at        ISO 8601 creation timestamp
updated_at        ISO 8601 last-updated timestamp

Async flow

Agent runs are asynchronous. After submitting a run you receive an id immediately. Poll for results:

curl --request GET \
  --url https://api.webcrawlerapi.com/v1/agent/job/<RUN_ID> \
  --header 'Authorization: Bearer <YOUR_API_KEY>'

Keep polling until status is done or failed.
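The poll-until-terminal loop can be sketched as below. The fetch function is injected (for example, a small wrapper around `urllib` hitting the GET endpoint above) so the loop itself stays testable; this is a hypothetical helper, not part of an official client:

```python
import time


def wait_for_run(fetch, run_id, interval=2.0, timeout=120.0):
    """Poll GET /v1/agent/job/<run_id> until the run is done or failed.

    fetch(run_id) must return the parsed JSON body of one poll request.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = fetch(run_id)
        # done and failed are the two terminal statuses
        if run["status"] in ("done", "failed"):
            return run
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} still pending after {timeout}s")
```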

API reference