Crawling Agent (Wagent)
AI-powered agent that autonomously browses websites and extracts structured data based on a prompt.
Wagent is an AI-powered service that autonomously browses websites, follows links, and extracts structured data — all from a plain-language prompt. You describe what you want; the agent figures out how to get it.
Unlike standard scraping that fetches a single URL, the agent acts like a researcher: it reads pages, decides which links are relevant, follows them, and returns structured JSON.
Use cases
- Find customers, partners, or contact info across a company's website or public directories
- Extract product lists, pricing tiers, feature sets, or press mentions from competitor sites
- Gather articles, job listings, events, or announcements from sites that lack an API
- Verify that published information (pricing, availability, specs) matches what you expect
- Collect industry data, reviews, or survey results across many pages in one request
Quick start
Find customers on a website
curl --request POST \
--url https://api.webcrawlerapi.com/v1/agent \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "Find all customers listed on the website",
"urls": ["https://www.mintlify.com/"],
"max_spend_usd": 0.5
}'

Response when done:
{
"id": "ar_abc123",
"status": "done",
"data": {
"customers": [
"Coinbase", "Anaconda", "Anthropic", "AT&T", "Browserbase",
"Fidelity", "Cognition", "Decagon", "Dub", "Kalshi",
"HubSpot", "Loops", "Lovable", "Meter", "Metronome",
"Laravel", "Mirage", "Ollama", "PayPal", "Perplexity",
"Layers", "Pinecone", "Planetscale", "Replit", "Resend",
"Zapier", "Together AI", "Vercel", "Worldcoin", "X"
]
}
}

Scrape only the provided URLs (no link following)
curl --request POST \
--url https://api.webcrawlerapi.com/v1/agent \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "Extract all job listings with title, location, and department for Anthropic",
"urls": ["https://www.anthropic.com/"],
"max_spend_usd": 1,
"seed_urls_only": true
}'

Extract with a strict schema
curl --request POST \
--url https://api.webcrawlerapi.com/v1/agent \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--header 'Content-Type: application/json' \
--data '{
"model": "gpt-5.4-mini",
"prompt": "Extract all job listings with title, location, and department for Anthropic",
"urls": ["https://example.com/careers"],
"max_spend_usd": 1.0,
"output_schema": {
"type": "object",
"additionalProperties": false,
"properties": {
"jobs": {
"type": "array",
"items": {
"type": "object",
"additionalProperties": false,
"properties": {
"title": { "type": "string" },
"location": { "type": "string" },
"department": { "type": "string" }
},
"required": ["title", "location", "department"]
}
}
},
"required": ["jobs"]
}
}'

Parameters
| Parameter | Required | Description |
|---|---|---|
| prompt | yes | Natural-language instruction describing what to extract or find |
| max_spend_usd | yes | Maximum budget in USD the agent may spend on this run. Must be > 0 |
| urls | no | Seed URLs the agent starts from. If omitted, the agent works from the prompt alone |
| seed_urls_only | no | When true, processes only the provided URLs without following links. Default: false |
| output_schema | no | JSON Schema describing the expected shape of the result |
| model | no | LLM model to use. See available models |
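As a minimal sketch of assembling a request body from the parameters above (the helper name and its validation are illustrative, not part of any official SDK), the required/optional split can be enforced before sending:

```python
import json

def build_agent_request(prompt, max_spend_usd, urls=None, seed_urls_only=False,
                        output_schema=None, model=None):
    """Assemble a JSON body for POST /v1/agent, checking the required fields."""
    if not prompt:
        raise ValueError("prompt is required")
    if max_spend_usd <= 0:
        raise ValueError("max_spend_usd must be > 0")
    body = {"prompt": prompt, "max_spend_usd": max_spend_usd}
    # Optional parameters are included only when set, matching the table above.
    if urls:
        body["urls"] = urls
    if seed_urls_only:
        body["seed_urls_only"] = True
    if output_schema:
        body["output_schema"] = output_schema
    if model:
        body["model"] = model
    return json.dumps(body)

payload = build_agent_request(
    "Find all customers listed on the website",
    max_spend_usd=0.5,
    urls=["https://www.mintlify.com/"],
)
```

The resulting string is what you would pass as the `--data` argument in the curl examples above.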
Writing good prompts
The agent output is always JSON. This means prompts should describe a data structure, not a free-text answer.
Examples
✅ Good — specific, structured, tells the agent exactly what fields to return:
Find all customers listed on the website. Return a JSON object with a 'customers' field containing an array of company names.
✅ Good — uses output_schema to enforce shape instead of relying on the prompt alone:
Prompt:
"Find all pricing plans"+output_schemawithplans[].name,plans[].price_usd,plans[].features[]
✅ Good — instructs the agent on how to handle uncertainty:
Extract all blog post titles and their publication dates. If the date is not visible, set the 'date' field to null.
❌ Bad — vague, no structure, agent will guess a response shape:
Summarize the website
❌ Bad — asks for a plain-text yes/no instead of a JSON-safe value:
Is there a free plan? Put only 'yes' or 'no' in the response
✅ Good version of the above — uses a field with enumerated options:
Check if there is a free plan. Respond with a JSON object with a 'free_plan_available' field set to either 'yes' or 'no'.
❌ Bad — too broad a scope without a budget or focus:
Get everything from the website
The more specific the prompt, the better the result. Tell the agent what fields you expect, what to do when data is missing, and which pages are most relevant.
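Putting those pieces together, a request can pair a missing-data instruction in the prompt with a schema that makes the same rule explicit. This is an illustrative payload, not a fixed contract: the field names (posts, title, date) are assumptions for the example, and the "null" handling relies on JSON Schema's union type:

```python
# Prompt tells the agent what to do when data is missing; the schema's
# 'date' field explicitly permits null, so both layers agree.
request_body = {
    "prompt": (
        "Extract all blog post titles and their publication dates. "
        "If the date is not visible, set the 'date' field to null."
    ),
    "max_spend_usd": 1,
    "output_schema": {
        "type": "object",
        "properties": {
            "posts": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "date": {"type": ["string", "null"]},
                    },
                    "required": ["title", "date"],
                },
            }
        },
        "required": ["posts"],
    },
}
```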
Controlling output shape with output_schema
Use output_schema when you need a guaranteed response structure. The agent fills your schema rather than inventing its own shape.
Example — enforce a list of jobs under a named field:
{
"output_schema": {
"type": "object",
"additionalProperties": false,
"properties": {
"jobs": {
"type": "array",
"items": {
"type": "object",
"additionalProperties": false,
"properties": {
"title": { "type": "string" },
"location": { "type": "string" },
"department": { "type": "string" }
},
"required": ["title", "location", "department"]
}
}
},
"required": ["jobs"]
}
}

This guarantees the response data field contains { "jobs": [...] } instead of a free-form object.
Read more about Structured Outputs.
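On the client side, a quick sanity check that a run honoured the schema is to verify the required keys yourself. This is a minimal sketch, not full JSON Schema validation — a dedicated validator library would also cover types and nesting:

```python
def check_required(data, schema):
    """Return the keys listed in the schema's 'required' array that are
    missing from the returned object (empty list means the check passed)."""
    return [k for k in schema.get("required", []) if k not in data]

schema = {"type": "object", "required": ["jobs"]}
print(check_required({"jobs": []}, schema))       # []
print(check_required({"positions": []}, schema))  # ['jobs']
```

For the schema above, a successful run's data field looks like: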
"jobs": [
{
"department": "AI Research & Engineering",
"location": "San Francisco, CA",
"title": "[Expression of Interest] Research Manager, Interpretability"
}
]Response fields
The data field is populated once status is done. All other fields are present immediately.
| Field | Description |
|---|---|
| id | Unique identifier of the agent run (ar_...) |
| status | queued → processing → done or failed |
| prompt | Prompt from the request |
| model | LLM model used |
| urls | Seed URLs provided |
| max_spend_usd | Spending cap set for this run |
| balance_used_usd | Actual amount spent |
| data | Extracted result — always JSON, present when status is done |
| success | true when the run completed with non-empty data |
| error_reason | Human-readable, agent-generated error message when the run failed |
| trace | Agent reasoning trace (if available) |
| llm_requests | List of individual LLM calls made during the run |
| created_at | ISO 8601 creation timestamp |
| updated_at | ISO 8601 last-updated timestamp |
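A typical consumer branches on status and success before touching data. The sketch below uses a hand-written sample response (the balance_used_usd value is illustrative), with field names taken from the table above:

```python
# Sample completed run, shaped like the response fields documented above.
run = {
    "id": "ar_abc123",
    "status": "done",
    "success": True,
    "max_spend_usd": 0.5,
    "balance_used_usd": 0.12,  # illustrative value, not a real quote
    "data": {"customers": ["Coinbase", "Anthropic"]},
}

if run["status"] == "failed":
    # error_reason carries the agent-generated explanation on failure.
    raise RuntimeError(run.get("error_reason", "unknown error"))
elif run["status"] == "done" and run["success"]:
    customers = run["data"]["customers"]
```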
Async flow
Agent runs are asynchronous. After submitting a run you receive an id immediately. Poll for results:
curl --request GET \
--url https://api.webcrawlerapi.com/v1/agent/job/<RUN_ID> \
--header 'Authorization: Bearer <YOUR_API_KEY>'

Keep polling until status is done or failed.
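The polling loop can be sketched as follows. The fetch_status callable stands in for whatever HTTP client issues the GET above (it is an assumption of this sketch, not an SDK function), which also makes the loop easy to test with a stub:

```python
import time

def poll_run(fetch_status, run_id, interval_s=2.0, timeout_s=300.0):
    """Poll until the run reaches 'done' or 'failed'.

    fetch_status: callable returning the parsed JSON response for
    GET /v1/agent/job/<RUN_ID> (e.g. a thin wrapper around your HTTP client).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = fetch_status(run_id)
        if run["status"] in ("done", "failed"):
            return run
        time.sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish within {timeout_s}s")

# Stubbed fetcher: reports 'processing' twice, then 'done'.
states = iter(["processing", "processing", "done"])
result = poll_run(lambda rid: {"id": rid, "status": next(states)},
                  "ar_abc123", interval_s=0.01)
```

In production you would also back off the interval and surface error_reason when the terminal status is failed.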
API reference
- POST /v1/agent — start an agent run
- GET /v1/agent/job/{id} — get run status and results
- GET /v1/agent/jobs — list all agent runs