S3 Upload
How to upload crawled data directly to Amazon S3 or compatible storage
The S3 Upload action allows you to automatically upload the crawled data to your Amazon S3 bucket or any S3-compatible storage service. This is particularly useful for integrating the crawl results directly into your data pipeline without requiring an additional step to download and then upload the data.
Usage
Security Warning: We temporarily store your S3 credentials while the job is processing. All credentials are automatically removed immediately after the job completes.
To use the S3 upload action, include an actions array in your request with an action of type upload_s3. This action requires several parameters to authenticate and specify the destination in your S3 bucket.
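For orientation, the request body then has roughly the following shape. This is a minimal sketch with placeholder values (URL, path, bucket, and credentials are illustrative); full, runnable examples follow below.

```python
# Sketch of a crawl request body with an upload_s3 action attached.
# Mirrors the JSON payload sent to the API; all values are placeholders.
request_body = {
    "url": "https://example.com/",
    "scrape_type": "markdown",
    "actions": [
        {
            "type": "upload_s3",
            "path": "/crawl-results",
            "access_key_id": "<ACCESS_KEY>",
            "secret_access_key": "<SECRET_KEY>",
            "bucket": "<BUCKET_NAME>",
            "endpoint": "https://s3.<your-region>.amazonaws.com",
        }
    ],
}
```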
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| `type` | string | Must be set to `upload_s3` |
| `path` | string | The file path/key where the data will be stored in your bucket |
| `access_key_id` | string | Your S3 access key ID |
| `secret_access_key` | string | Your S3 secret access key |
| `bucket` | string | The name of your S3 bucket |
| `endpoint` | string | The S3 endpoint URL (especially needed for S3-compatible services) |
If your bucket was not created in the us-east-1 AWS region, specify the bucket's region through the endpoint, in the format https://s3.{your-region}.amazonaws.com (for example, https://s3.eu-west-1.amazonaws.com for a bucket in eu-west-1).
Example Request

cURL:

```bash
curl -i --request POST \
  --url https://api.webcrawlerapi.com/v1/crawl \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --data '{
    "url": "https://books.toscrape.com/",
    "scrape_type": "markdown",
    "items_limit": 20,
    "actions": [
      {
        "type": "upload_s3",
        "path": "/testupload",
        "access_key_id": "<ACCESS_KEY>",
        "secret_access_key": "<SECRET_KEY>",
        "bucket": "mybucket",
        "endpoint": "https://s3.eu-west-1.amazonaws.com"
      }
    ]
  }'
```

Node.js:

```javascript
const s3Upload = {
  "type": "upload_s3",
  "path": "/testupload",
  "access_key_id": "<ACCESS_KEY>",
  "secret_access_key": "<SECRET_KEY>",
  "bucket": "mybucket",
  "endpoint": "https://s3.eu-west-1.amazonaws.com"
};

try {
  // async way - the promise will be resolved with all the data
  const syncJob = await client.crawl({
    "url": "https://books.toscrape.com/",
    "scrape_type": "markdown",
    "items_limit": 20,
    "allow_subdomains": false,
  }, s3Upload);
  console.log(`Job ID: ${syncJob.id}`);
} catch (error) {
  console.error("Error uploading to S3:", error);
}
```

Python:

```python
s3_action = UploadS3Action(
    path="/testupload",
    access_key_id="<ACCESS_KEY>",
    secret_access_key="<SECRET_KEY>",
    bucket="mybucket",
    endpoint="https://s3.eu-west-1.amazonaws.com"
)

# Start a synchronous crawling job (blocks until completion)
print("Starting crawling job...")
job = crawler.crawl(
    url="https://books.toscrape.com/",
    scrape_type="markdown",
    items_limit=20,
    allow_subdomains=True,
    actions=s3_action,  # Add the S3 upload action
    max_polls=100       # Maximum number of status checks
)
print(f"Job completed with ID: {job.id}")
```

Response
When the S3 upload action is successfully executed, the response will include information about the upload:
```json
{
  "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b",
  "actions": [
    {
      "type": "upload_s3",
      "status": "success",
      "path": "/testupload"
    }
  ]
}
```

Compatible Storage Services
This action works with:
- Amazon S3
- Cloudflare R2
- DigitalOcean Spaces
- Backblaze B2
- Any other S3-compatible storage service
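For S3-compatible providers, the only parts that change are the `endpoint` and the credentials issued by that provider. As a sketch, an `upload_s3` action pointed at Cloudflare R2 might look like the following; the account ID, keys, and bucket name are placeholders, and R2's S3-compatible endpoint takes the form `https://<account-id>.r2.cloudflarestorage.com`.

```python
# Sketch: the same upload_s3 action, pointed at Cloudflare R2 instead of AWS S3.
# <ACCOUNT_ID>, the credentials, and the bucket name are placeholders.
r2_upload = {
    "type": "upload_s3",
    "path": "/testupload",
    "access_key_id": "<R2_ACCESS_KEY>",
    "secret_access_key": "<R2_SECRET_KEY>",
    "bucket": "mybucket",
    # Cloudflare R2 exposes its S3-compatible API at this endpoint form:
    "endpoint": "https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
}
```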
Error Handling
If there's an error with the S3 upload, the action's status will be set to error with a message explaining the issue:
```json
{
  "error_code": "invalid_request",
  "error_message": "invalid S3 credentials: operation error S3: PutObject, https response error StatusCode: 403, api error InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records."
}
```

If you upload files to a private bucket, subsequent attempts to retrieve the content via the file URL may fail. Make sure the appropriate permissions are in place if you need to access the uploaded files later.
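As a rough illustration, the snippet below inspects a parsed response using the shapes shown above: it surfaces an API-level error (`error_code` / `error_message`) and otherwise reports the status of each `upload_s3` action. The `result` dict and the helper name are assumptions for the sketch, not part of any SDK.

```python
# Minimal sketch for checking the S3 upload outcome in a crawl response.
# Assumes `result` is the parsed JSON body returned by the API; field names
# follow the example responses documented above.

def report_s3_upload(result: dict) -> None:
    # API-level errors come back with error_code / error_message.
    if "error_code" in result:
        raise RuntimeError(
            f"{result['error_code']}: {result.get('error_message', 'unknown error')}"
        )

    # Otherwise, each action reports its own status.
    for action in result.get("actions", []):
        if action.get("type") != "upload_s3":
            continue
        if action.get("status") == "success":
            print(f"Uploaded crawl data to {action.get('path')}")
        else:
            print(f"S3 upload finished with status: {action.get('status')}")

# Example usage with the success response shown earlier:
report_s3_upload({
    "id": "5f7b1b7b-7b7b-4b7b-8b7b-7b7b7b7b7b7b",
    "actions": [{"type": "upload_s3", "status": "success", "path": "/testupload"}],
})
```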