🦜🔗 WebCrawlerAPI LangChain Integration
The WebCrawlerAPI LangChain integration converts websites and webpages into markdown or cleaned text, making it well suited for LLM data processing pipelines. It requires no subscription and provides a straightforward way to add web crawling to your LangChain document processing workflow.
Installation
First, obtain your API key from WebCrawlerAPI, then install the package using pip:
```bash
pip install webcrawlerapi-langchain
```
Usage
Basic Loading
The simplest way to use the WebCrawlerAPI loader is through the basic loading method:
```python
from webcrawlerapi_langchain import WebCrawlerAPILoader

# Initialize the loader
loader = WebCrawlerAPILoader(
    url="https://example.com",
    api_key="your-api-key",
    scrape_type="markdown",
    items_limit=10
)

# Load documents
documents = loader.load()

# Use documents in your LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])
    print(doc.metadata)
```
Advanced Loading Methods
The SDK supports multiple loading patterns to suit different use cases:
Async Loading
For asynchronous operations:
```python
# Async loading
documents = await loader.aload()
```
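Outside an existing event loop, the async call can be driven with asyncio. The snippet below is a minimal sketch; the URL and key are placeholders:

```python
import asyncio

from webcrawlerapi_langchain import WebCrawlerAPILoader

async def main() -> None:
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key",
        scrape_type="markdown",
    )
    # aload() performs the crawl without blocking the event loop
    documents = await loader.aload()
    print(f"Loaded {len(documents)} documents")

asyncio.run(main())
```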
Lazy Loading
When dealing with large datasets:
```python
# Lazy loading
for doc in loader.lazy_load():
    print(doc.page_content[:100])
```
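Lazy loading pairs well with streaming-style processing. As a rough sketch (the text splitter and chunking parameters are illustrative assumptions, not part of the loader), each page can be chunked as soon as it is yielded instead of holding the whole crawl in memory:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from webcrawlerapi_langchain import WebCrawlerAPILoader

loader = WebCrawlerAPILoader(
    url="https://example.com",
    api_key="your-api-key",
    scrape_type="markdown",
    items_limit=100,
)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Chunk each crawled page as it arrives instead of materializing the full crawl
for doc in loader.lazy_load():
    chunks = splitter.split_documents([doc])
    print(f"Split one page into {len(chunks)} chunks")
```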
Async Lazy Loading
Combining asynchronous and lazy loading:
```python
# Async lazy loading
async for doc in loader.alazy_load():
    print(doc.page_content[:100])
```
Configuration Options
The WebCrawlerAPILoader accepts the following configuration parameters:
| Parameter | Type | Description |
|---|---|---|
| url | string | The target URL to crawl |
| api_key | string | Your WebCrawlerAPI API key |
| scrape_type | string | Type of scraping: "html", "cleaned", or "markdown" |
| items_limit | integer | Maximum number of pages to crawl |
| whitelist_regexp | string | Regex pattern for URL whitelist |
| blacklist_regexp | string | Regex pattern for URL blacklist |
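Putting the parameters together, a fully configured loader looks like the sketch below; the regex values and limits are illustrative only:

```python
from webcrawlerapi_langchain import WebCrawlerAPILoader

loader = WebCrawlerAPILoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    scrape_type="cleaned",           # "html", "cleaned", or "markdown"
    items_limit=50,                  # stop after 50 pages
    whitelist_regexp=r".*/docs/.*",  # only follow documentation URLs
    blacklist_regexp=r".*/blog/.*",  # skip blog pages
)
documents = loader.load()
```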
Best Practices
- Rate Limiting: Be mindful of API rate limits when crawling multiple pages (a retry sketch follows this list).
- Error Handling: Always implement proper error handling for network issues and API responses.
- Content Type: Choose the appropriate scrape_type based on your LLM's requirements:
  - Use "markdown" for structured content
  - Use "cleaned" for plain text
  - Use "html" for raw HTML content
Example Use Cases
Document QA System
```python
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from webcrawlerapi_langchain import WebCrawlerAPILoader

# Load documents from a website
loader = WebCrawlerAPILoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    scrape_type="markdown"
)
documents = loader.load()

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
```
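Once built, the chain can answer questions over the crawled documentation. The question below is just an example:

```python
# Ask a question over the crawled site (example question)
answer = qa_chain.run("How do I authenticate with the API?")
print(answer)
```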
Error Handling
The loader implements robust error handling for common scenarios:
```python
from webcrawlerapi_langchain import WebCrawlerAPILoader

# Note: WebCrawlerAPIError and ValidationError are assumed to be exposed by the
# SDK; adjust the exception imports to match your installed version.
try:
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key"
    )
    documents = loader.load()
except WebCrawlerAPIError as e:
    print(f"API Error: {e}")
except ValidationError as e:
    print(f"Configuration Error: {e}")
except Exception as e:
    print(f"Unexpected Error: {e}")
```
Support
For additional support or to report issues, please visit the WebCrawlerAPI documentation or the GitHub repository.