
🦜🔗 WebCrawlerAPI LangChain Integration

The WebCrawlerAPI LangChain integration converts websites and webpages into markdown or cleaned text, making it a good fit for LLM data-processing pipelines. It requires no subscription and provides a straightforward way to add web crawling to your LangChain document-processing workflow.

Installation

First, obtain your API key from WebCrawlerAPI, then install the package using pip:

pip install webcrawlerapi-langchain

Usage

Basic Loading

The simplest way to use the WebCrawlerAPI loader is through the basic loading method:

from webcrawlerapi_langchain import WebCrawlerAPILoader
 
# Initialize the loader
loader = WebCrawlerAPILoader(
    url="https://example.com",
    api_key="your-api-key",
    scrape_type="markdown",
    items_limit=10
)
 
# Load documents
documents = loader.load()
 
# Use documents in your LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])
    print(doc.metadata)

Advanced Loading Methods

The SDK supports multiple loading patterns to suit different use cases:

Async Loading

For asynchronous operations:

# Async loading
documents = await loader.aload()
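
In a standalone script, aload() needs a running event loop. A minimal, self-contained sketch (the URL and key are placeholders):

import asyncio
from webcrawlerapi_langchain import WebCrawlerAPILoader
 
async def main():
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key",
        scrape_type="markdown"
    )
    # Awaiting aload() runs the crawl without blocking the event loop
    documents = await loader.aload()
    print(f"Loaded {len(documents)} documents")
 
asyncio.run(main())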

Lazy Loading

For large crawls, lazy loading yields documents one at a time instead of holding the full result set in memory:

# Lazy loading
for doc in loader.lazy_load():
    print(doc.page_content[:100])

Async Lazy Loading

Combining asynchronous and lazy loading:

# Async lazy loading
async for doc in loader.alazy_load():
    print(doc.page_content[:100])

Configuration Options

The WebCrawlerAPILoader accepts the following configuration parameters:

Parameter        | Type    | Description
url              | string  | The target URL to crawl
api_key          | string  | Your WebCrawlerAPI API key
scrape_type      | string  | Type of scraping: "html", "cleaned", or "markdown"
items_limit      | integer | Maximum number of pages to crawl
whitelist_regexp | string  | Regex pattern for URL whitelist
blacklist_regexp | string  | Regex pattern for URL blacklist
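
For example, the two regex filters can be combined to restrict a crawl to a site's blog while skipping archive pages. The patterns and URL below are illustrative, not part of the API:

loader = WebCrawlerAPILoader(
    url="https://example.com",
    api_key="your-api-key",
    scrape_type="markdown",
    items_limit=50,
    # Only follow URLs that contain /blog/
    whitelist_regexp=r".*/blog/.*",
    # ...but skip tag and category archive pages
    blacklist_regexp=r".*/(tag|category)/.*"
)
documents = loader.load()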

Best Practices

  1. Rate Limiting: Be mindful of API rate limits when crawling multiple pages; a simple retry-with-backoff sketch follows this list.
  2. Error Handling: Always implement proper error handling for network issues and API responses.
  3. Content Type: Choose the appropriate scrape_type based on your LLM's requirements:
    • Use "markdown" for structured content
    • Use "cleaned" for plain text
    • Use "html" for raw HTML content

Example Use Cases

Document QA System

from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from webcrawlerapi_langchain import WebCrawlerAPILoader
 
# Load documents from a website
loader = WebCrawlerAPILoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    scrape_type="markdown"
)
documents = loader.load()
 
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
 
# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
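
With the chain assembled, you can query the crawled content directly. The question string is just an example:

answer = qa_chain.run("How do I install the package?")
print(answer)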

Error Handling

Wrap loading in error handling to cover common failure scenarios such as network errors and invalid configuration:

# The exception class names below follow the original example; check the
# webcrawlerapi_langchain package for the exact import paths in your version.
try:
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key"
    )
    documents = loader.load()
except WebCrawlerAPIError as e:
    print(f"API Error: {e}")
except ValidationError as e:
    print(f"Configuration Error: {e}")
except Exception as e:
    print(f"Unexpected Error: {e}")

Support

For additional support or to report issues, please visit the WebCrawlerAPI documentation or the GitHub repository.