WebCrawlerAPI

Crawling Agent (Wagent)

AI-powered agent that autonomously browses websites and extracts structured data based on a prompt.

Wagent is an AI-powered service that autonomously browses websites, follows links, and extracts structured data — all from a plain-language prompt. You describe what you want; the agent figures out how to get it.

Unlike standard scraping that fetches a single URL, the agent acts like a researcher: it reads pages, decides which links are relevant, follows them, and returns structured JSON.

Use cases

  • Find customers, partners, or contact info across a company's website or public directories
  • Extract product lists, pricing tiers, feature sets, or press mentions from competitor sites
  • Gather articles, job listings, events, or announcements from sites that lack an API
  • Verify that published information (pricing, availability, specs) matches what you expect
  • Collect industry data, reviews, or survey results across many pages in one request

Quick start

Find customers on a website

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/agent \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "Find all customers listed on the website",
    "urls": ["https://www.mintlify.com/"],
    "max_spend_usd": 0.5
  }'

Response when done:

{
  "id": "ar_abc123",
  "status": "done",
  "data": {
    "customers": [
      "Coinbase", "Anaconda", "Anthropic", "AT&T", "Browserbase",
      "Fidelity", "Cognition", "Decagon", "Dub", "Kalshi",
      "HubSpot", "Loops", "Lovable", "Meter", "Metronome",
      "Laravel", "Mirage", "Ollama", "PayPal", "Perplexity",
      "Layers", "Pinecone", "Planetscale", "Replit", "Resend",
      "Zapier", "Together AI", "Vercel", "Worldcoin", "X"
    ]
  }
}
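Once `status` is `done`, the `data` field can be consumed directly as JSON. A minimal Python sketch, using a truncated copy of the sample response above:

```python
import json

# Sample response body as returned for a finished run (truncated copy
# of the example above).
raw = '''
{
  "id": "ar_abc123",
  "status": "done",
  "data": {
    "customers": ["Coinbase", "Anaconda", "Anthropic"]
  }
}
'''

run = json.loads(raw)
customers = []
if run["status"] == "done":
    # data is always JSON, so it can be read like any parsed object
    customers = run["data"]["customers"]
    print(f"{len(customers)} customers found")
```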
Extract job listings from seed URLs only

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/agent \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "Extract all job listings with title, location, and department for Anthropic",
    "urls": ["https://www.anthropic.com/"],
    "max_spend_usd": 1,
    "seed_urls_only": true
  }'
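The same request can be built from a client language. A dependency-free Python sketch, assuming only the endpoint and headers shown in the curl examples above (the commented-out send returns the run id immediately, since runs are asynchronous):

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # replace with a real key

payload = {
    "prompt": "Extract all job listings with title, location, and department for Anthropic",
    "urls": ["https://www.anthropic.com/"],
    "max_spend_usd": 1,
    "seed_urls_only": True,  # process only the seed URLs, do not follow links
}

req = urllib.request.Request(
    "https://api.webcrawlerapi.com/v1/agent",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Sending the request returns the run id right away:
# with urllib.request.urlopen(req) as resp:
#     run_id = json.load(resp)["id"]
```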

Extract with a strict schema

curl --request POST \
  --url https://api.webcrawlerapi.com/v1/agent \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "gpt-5.4-mini",
  "prompt": "Extract all job listings with title, location, and department for Anthropic",
  "urls": ["https://example.com/careers"],
  "max_spend_usd": 1.0,
  "output_schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "jobs": {
        "type": "array",
        "items": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "title": { "type": "string" },
            "location": { "type": "string" },
            "department": { "type": "string" }
          },
          "required": ["title", "location", "department"]
        }
      }
    },
    "required": ["jobs"]
  }
}'

Parameters

Parameter        Required  Description
prompt           yes       Natural-language instruction describing what to extract or find
max_spend_usd    yes       Maximum budget in USD the agent may spend on this run. Must be > 0
urls             no        Seed URLs the agent starts from. If omitted, the agent works from the prompt alone
seed_urls_only   no        When true, processes only the provided URLs without following links. Default: false
output_schema    no        JSON Schema describing the expected shape of the result
model            no        LLM model to use. See available models
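The required/optional split above can be checked client-side before a request spends any budget. A sketch with a hypothetical helper (the API performs its own validation; this only catches obvious mistakes early):

```python
def validate_agent_params(params):
    """Check a request body against the parameter table before submitting.

    Hypothetical client-side helper, not part of the official API.
    """
    if not params.get("prompt"):
        raise ValueError("prompt is required")
    spend = params.get("max_spend_usd")
    if not isinstance(spend, (int, float)) or spend <= 0:
        raise ValueError("max_spend_usd is required and must be > 0")
    if "urls" in params and not isinstance(params["urls"], list):
        raise ValueError("urls must be a list of seed URLs")


# A well-formed request body passes silently:
validate_agent_params({"prompt": "Find all customers", "max_spend_usd": 0.5})
```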

Writing good prompts

The agent output is always JSON. This means prompts should describe a data structure, not a free-text answer.

Examples

Good — specific, structured, tells the agent exactly what fields to return:

Find all customers listed on the website. Return a JSON object with a 'customers' field containing an array of company names.

Good — uses output_schema to enforce shape instead of relying on the prompt alone:

Prompt: "Find all pricing plans" + output_schema with plans[].name, plans[].price_usd, plans[].features[]

Good — instructs the agent on how to handle uncertainty:

Extract all blog post titles and their publication dates. If the date is not visible, set the 'date' field to null.

Bad — vague, no structure, agent will guess a response shape:

Summarize the website

Bad — asks for a plain-text yes/no instead of a JSON-safe value:

Is there a free plan? Put only 'yes' or 'no' in the response

Good version of the above — uses a field with enumerated options:

Check if there is a free plan. Respond with a JSON object with a 'free_plan_available' field set to either 'yes' or 'no'.

Bad — too broad a scope without a budget or focus:

Get everything from the website

The more specific the prompt, the better the result. Tell the agent what fields you expect, what to do when data is missing, and which pages are most relevant.
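The guidance above (name the fields, state the missing-data behavior) can be baked into a small prompt builder. A hypothetical helper, shown only to illustrate the pattern:

```python
def build_prompt(task, fields, missing="null"):
    """Compose a prompt that names the expected JSON fields and the
    missing-data behavior, following the guidance above. Hypothetical helper.
    """
    field_list = ", ".join(f"'{f}'" for f in fields)
    return (
        f"{task} Return a JSON object with the fields {field_list}. "
        f"If a field is not visible on the page, set it to {missing}."
    )


prompt = build_prompt(
    "Extract all blog post titles and publication dates.",
    ["title", "date"],
)
print(prompt)
```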

Controlling output shape with output_schema

Use output_schema when you need a guaranteed response structure. The agent will fill in your schema rather than inventing its own shape.

Example — enforce a list of jobs under a named field:

{
  "output_schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "jobs": {
        "type": "array",
        "items": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "title": { "type": "string" },
            "location": { "type": "string" },
            "department": { "type": "string" }
          },
          "required": ["title", "location", "department"]
        }
      }
    },
    "required": ["jobs"]
  }
}

This guarantees the response data field contains { "jobs": [...] } instead of a free-form object. Read more about Structured Outputs.

"jobs": [
    {
        "department": "AI Research & Engineering",
        "location": "San Francisco, CA",
        "title": "[Expression of Interest] Research Manager, Interpretability"
    }
]
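The shape guaranteed by `output_schema` can be spot-checked client-side. A minimal sketch that verifies each job object carries the required keys (a full JSON Schema validator such as the third-party `jsonschema` package would be the robust choice):

```python
# Required keys, mirroring the "required" list in the schema above.
REQUIRED = {"title", "location", "department"}

data = {
    "jobs": [
        {
            "department": "AI Research & Engineering",
            "location": "San Francisco, CA",
            "title": "[Expression of Interest] Research Manager, Interpretability",
        }
    ]
}

assert "jobs" in data
# Collect any job entry missing a required key.
bad = [job for job in data["jobs"] if not REQUIRED <= job.keys()]
print(f"{len(data['jobs'])} jobs, {len(bad)} missing required fields")
```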

Response fields

The data field is populated once status is done. All other fields are present immediately.

Field             Description
id                Unique identifier of the agent run (ar_...)
status            One of queued, processing, done, or failed
prompt            Prompt from the request
model             LLM model used
urls              Seed URLs provided
max_spend_usd     Spending cap set for this run
balance_used_usd  Actual amount spent
data              Extracted result (always JSON); present when status is done
success           true when the run completed with non-empty data
error_reason      Human-readable, agent-generated error message if the run failed
trace             Agent reasoning trace (if available)
llm_requests      List of individual LLM calls made during the run
created_at        ISO 8601 creation timestamp
updated_at        ISO 8601 last-updated timestamp

Async flow

Agent runs are asynchronous. After submitting a run you receive an id immediately. Poll for results:

curl --request GET \
  --url https://api.webcrawlerapi.com/v1/agent/job/<RUN_ID> \
  --header 'Authorization: Bearer <YOUR_API_KEY>'

Keep polling until status is done or failed.
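The poll-until-terminal loop can be sketched as below. The fetch function is injected (for example, a small wrapper around `urllib` hitting the GET endpoint above) so the loop itself stays testable; this is a hypothetical helper, not part of an official client:

```python
import time


def wait_for_run(fetch, run_id, interval=2.0, timeout=120.0):
    """Poll GET /v1/agent/job/<run_id> until the run is done or failed.

    fetch(run_id) must return the parsed JSON body of one poll request.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = fetch(run_id)
        # done and failed are the two terminal statuses
        if run["status"] in ("done", "failed"):
            return run
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} still pending after {timeout}s")
```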

API reference