Structured Outputs with Prompts

Define JSON schemas to structure AI responses when using prompts for data extraction

Structured Outputs ensure that AI-generated responses adhere to a JSON schema you define. Because the response is constrained to that schema, you don't need to validate its format or retry malformed responses, which makes the feature well suited to extracting structured data from web pages.

Benefits

  • Reliable type safety: Fields come back with the types your schema declares, so there is no format validation or retry logic to write
  • Consistent formatting: The AI output will always match your defined structure
  • Simpler implementation: Define your schema once and get predictable results every time

How It Works

When you provide a prompt, the /v2/scrape endpoint returns a JSON object in structured_data instead of markdown or HTML. Add an optional response_schema to enforce a strict JSON schema for the response. The schema follows the JSON Schema format used by OpenAI Structured Outputs.

Basic Example

Extract product information with a guaranteed structure:

curl --request POST \
  --url https://api.webcrawlerapi.com/v2/scrape \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/product/widget",
    "prompt": "Extract product details from this page",
    "response_schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "description": {"type": "string"}
      },
      "required": ["product_name", "price", "in_stock"],
      "additionalProperties": false
    }
  }'

Response:

{
  "success": true,
  "status": "done",
  "page_status_code": 200,
  "page_title": "Premium Widget",
  "structured_data": {
    "product_name": "Premium Widget",
    "price": 29.99,
    "in_stock": true,
    "description": "A high-quality widget for all your needs"
  }
}
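
As a rough Python equivalent of the curl call above, the sketch below builds the same POST request with the standard library only. The helper name build_scrape_request is illustrative and not part of any SDK; the request is constructed here but not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.webcrawlerapi.com/v2/scrape"


def build_scrape_request(api_key, url, prompt, response_schema=None):
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper for illustration only.
    """
    payload = {"url": url, "prompt": prompt}
    if response_schema is not None:
        # response_schema is optional; it only applies alongside a prompt.
        payload["response_schema"] = response_schema
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "description": {"type": "string"},
    },
    "required": ["product_name", "price", "in_stock"],
    "additionalProperties": False,
}

request = build_scrape_request(
    "YOUR_API_KEY",
    "https://example.com/product/widget",
    "Extract product details from this page",
    product_schema,
)

# To send it:
#   with urllib.request.urlopen(request) as resp:
#       body = json.load(resp)
```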

Schema Format

Your response_schema must be a valid JSON Schema object. OpenAI structured outputs are strict, so we recommend following these conventions to avoid schema validation errors:

  • type: Use "object" at the root level
  • properties: Define the structure of your data
  • required: Include required property names for predictable output
  • additionalProperties: Set to false to keep the output strict

Supported Types

  • string - Text data
  • number - Numeric values (integers or decimals)
  • boolean - True/false values
  • object - Nested objects
  • array - Lists of items
  • enum - Predefined set of values (a keyword added to a typed property rather than a type of its own; see Enum Constraints)

Advanced Examples

Nested Objects

Extract business information with address details:

{
  "type": "object",
  "properties": {
    "business_name": {"type": "string"},
    "phone": {"type": "string"},
    "address": {
      "type": "object",
      "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "postal_code": {"type": "string"}
      },
      "required": ["street", "city"],
      "additionalProperties": false
    }
  },
  "required": ["business_name", "address"],
  "additionalProperties": false
}

Arrays of Objects

Extract multiple products from a listing page:

{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"},
          "rating": {"type": "number"}
        },
        "required": ["name", "price"],
        "additionalProperties": false
      }
    }
  },
  "required": ["products"],
  "additionalProperties": false
}
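
Once structured_data arrives in this shape, it can be mapped onto typed records. The sketch below assumes a response matching the schema above; the Product class, the parse_products helper, and the sample values are hypothetical. Because rating is not listed in required, the code treats it as possibly absent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    name: str
    price: float
    rating: Optional[float] = None  # not in "required", so it may be missing


def parse_products(structured_data):
    """Convert the "products" array into a list of Product records."""
    return [
        Product(
            name=item["name"],
            price=item["price"],
            rating=item.get("rating"),
        )
        for item in structured_data["products"]
    ]


# Hypothetical structured_data, shaped like the schema above.
sample = {
    "products": [
        {"name": "Widget", "price": 29.99, "rating": 4.5},
        {"name": "Gadget", "price": 9.5},
    ]
}

products = parse_products(sample)
for product in products:
    print(product)
```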

Enum Constraints

Restrict values to predefined options:

{
  "type": "object",
  "properties": {
    "product_name": {"type": "string"},
    "category": {
      "type": "string",
      "enum": ["electronics", "clothing", "books", "home"]
    },
    "condition": {
      "type": "string",
      "enum": ["new", "used", "refurbished"]
    }
  },
  "required": ["product_name", "category", "condition"],
  "additionalProperties": false
}

Optional Fields

Use null union types for optional fields:

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "email": {"type": ["string", "null"]},
    "phone": {"type": ["string", "null"]}
  },
  "required": ["name", "email", "phone"],
  "additionalProperties": false
}

Even though all fields are in the required array, email and phone can be null if the information isn't available.
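
On the client side, a JSON null arrives as None in Python, so the key is always present but the value may be empty. A small sketch, using a made-up record and a hypothetical present_fields helper:

```python
import json


def present_fields(record):
    """Return only the fields that carry a value, dropping nulls."""
    return {key: value for key, value in record.items() if value is not None}


# Hypothetical structured_data for the schema above: email was not found.
record = json.loads(
    '{"name": "Ada Lovelace", "email": null, "phone": "+1-555-0100"}'
)

# The key exists even though the value is null.
print("email" in record, record["email"])
print(present_fields(record))
```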

Schema Constraints

To ensure performance and reliability, structured outputs have these limitations:

  • Maximum properties: 5,000 object properties total
  • Nesting depth: Maximum 10 levels of nested objects
  • Enum values: Maximum 1,000 enum values across all enum properties
  • String length: Total string length of all property names, enum values, and const values cannot exceed 120,000 characters
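
For large schemas, these limits can be estimated locally before sending a request. The sketch below is an approximation under stated assumptions: it counts the root object as nesting level 1, ignores keywords other than properties, items, enum, and const, and uses hypothetical helper names (schema_stats, exceeded_limits). The service's own counting rules may differ.

```python
# Limits copied from the list above.
LIMITS = {
    "properties": 5000,
    "depth": 10,
    "enum_values": 1000,
    "string_length": 120000,
}


def schema_stats(schema, level=0):
    """Tally properties, object nesting depth, enum values, and string length."""
    props = schema.get("properties")
    if not isinstance(props, dict):
        props = {}
    else:
        level += 1  # assumption: the root object counts as level 1

    enum = schema.get("enum", [])
    stats = {
        "properties": len(props),
        "depth": level,
        "enum_values": len(enum),
        "string_length": sum(len(v) for v in enum if isinstance(v, str)),
    }
    stats["string_length"] += sum(len(name) for name in props)
    if isinstance(schema.get("const"), str):
        stats["string_length"] += len(schema["const"])

    # Recurse into nested property schemas and array item schemas.
    children = [schema_stats(sub, level) for sub in props.values()]
    if isinstance(schema.get("items"), dict):
        children.append(schema_stats(schema["items"], level))

    for child in children:
        for key in ("properties", "enum_values", "string_length"):
            stats[key] += child[key]
        stats["depth"] = max(stats["depth"], child["depth"])
    return stats


def exceeded_limits(schema):
    """Return the names of any limits the schema appears to exceed."""
    stats = schema_stats(schema)
    return [name for name, limit in LIMITS.items() if stats[name] > limit]


example_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["new", "used"]},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
    "required": ["category", "address"],
    "additionalProperties": False,
}

print(schema_stats(example_schema))
print(exceeded_limits(example_schema))
```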

Error Handling

Invalid Schema

If your schema is invalid, the request fails and the response carries an invalid_schema error:

{
  "success": false,
  "error_code": "invalid_schema",
  "error_message": "Invalid response schema format"
}
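
One way to handle this in client code is to check the success flag before reading structured_data. A minimal sketch based only on the response fields shown in this guide; ScrapeError and unwrap_structured_data are hypothetical names, not SDK features.

```python
class ScrapeError(Exception):
    """Raised when the API response reports success: false."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap_structured_data(body):
    """Return structured_data from a parsed response body, or raise ScrapeError."""
    if not body.get("success"):
        raise ScrapeError(
            body.get("error_code", "unknown"),
            body.get("error_message", ""),
        )
    return body["structured_data"]


# The error response shown above, as a Python dict.
error_body = {
    "success": False,
    "error_code": "invalid_schema",
    "error_message": "Invalid response schema format",
}

try:
    unwrap_structured_data(error_body)
except ScrapeError as err:
    print(err.code, "-", err.message)
```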

No Prompt Provided

The response_schema parameter only works when a prompt is also provided. If you include a schema without a prompt, it will be ignored.

Prompt Without a Schema

If you send a prompt without response_schema, the API still returns structured_data, but uses JSON-object mode instead of strict schema validation.
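
Without a schema, the shape of structured_data is not enforced, so a light client-side check can be worthwhile. The sketch below assumes you expect the product fields from the basic example; EXPECTED and find_mismatches are illustrative names.

```python
# Keys you expect and the Python types they should have (an assumption
# for this example; adapt it to whatever your prompt asks for).
EXPECTED = {
    "product_name": str,
    "price": (int, float),
    "in_stock": bool,
}


def find_mismatches(data, expected):
    """List missing keys and wrongly typed values in structured_data."""
    problems = []
    for key, expected_type in expected.items():
        if key not in data:
            problems.append(f"missing key: {key}")
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        stray_bool = isinstance(value, bool) and expected_type is not bool
        if stray_bool or not isinstance(value, expected_type):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems


# Hypothetical output: price came back as text and in_stock is absent.
data = {"product_name": "Premium Widget", "price": "29.99"}
print(find_mismatches(data, EXPECTED))
```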

LLM Refusal

In rare cases, the AI may refuse to process content for safety reasons. You'll receive a refusal message explaining why.

Pricing

Structured outputs cost the same as regular prompts: $0.002 per request that includes a prompt, charged in addition to the base crawling cost.
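
As a quick worked example of the prompt fee alone, 1,000 prompt requests add $2.00 on top of the base crawling cost, which this guide does not specify. The helper name prompt_surcharge is made up for the sketch.

```python
from decimal import Decimal

# Fee per request that includes a prompt, from the pricing note above.
PROMPT_FEE = Decimal("0.002")


def prompt_surcharge(request_count):
    """Total prompt fee in USD for a number of requests (base cost excluded)."""
    return PROMPT_FEE * request_count


print(f"${prompt_surcharge(1000):.2f}")
```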

SDK Support

JavaScript/TypeScript

import WebcrawlerAPI from 'webcrawlerapi';

const client = new WebcrawlerAPI({ apiKey: 'YOUR_API_KEY' });

const response = await client.scrapeUrl({
  url: 'https://example.com/product',
  prompt: 'Extract product details',
  response_schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      in_stock: { type: 'boolean' }
    },
    required: ['name', 'price', 'in_stock'],
    additionalProperties: false
  }
});

console.log(response.structured_data);

Python

from webcrawlerapi import WebcrawlerAPI

client = WebcrawlerAPI(api_key='YOUR_API_KEY')

response = client.scrape_url(
    url='https://example.com/product',
    prompt='Extract product details',
    response_schema={
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'price': {'type': 'number'},
            'in_stock': {'type': 'boolean'}
        },
        'required': ['name', 'price', 'in_stock'],
        'additionalProperties': False
    }
)

print(response['structured_data'])

Best Practices

  1. Clear property names: Use descriptive, self-documenting property names
  2. Specific prompts: Combine schemas with clear, specific prompts for best results
  3. Start simple: Begin with basic schemas and add complexity as needed
  4. Test iteratively: Test your schemas with sample pages to refine the structure
  5. Handle nulls: Use null unions for optional data that may not always be present