🦜🔗 WebCrawlerAPI LangChain Integration
The WebCrawlerAPI LangChain integration converts websites and webpages into markdown or cleaned text, making it well suited for LLM data processing pipelines. It requires no subscription and provides a straightforward way to add web crawling to your LangChain document processing workflow.
Installation
First, obtain your API key from WebCrawlerAPI, then install the package using pip:
```bash
pip install webcrawlerapi-langchain
```
Usage
Basic Loading
The simplest way to use the WebCrawlerAPI loader is through the basic loading method:
```python
from webcrawlerapi_langchain import WebCrawlerAPILoader

# Initialize the loader
loader = WebCrawlerAPILoader(
    url="https://example.com",
    api_key="your-api-key",
    scrape_type="markdown",
    items_limit=10
)

# Load documents
documents = loader.load()

# Use documents in your LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])
    print(doc.metadata)
```
Advanced Loading Methods
The SDK supports multiple loading patterns to suit different use cases:
Async Loading
For asynchronous operations:
```python
# Async loading
documents = await loader.aload()
```
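Outside an existing event loop, the async call can be driven with asyncio. The snippet below is a minimal sketch; the URL and key are placeholders:

```python
import asyncio

from webcrawlerapi_langchain import WebCrawlerAPILoader

async def main() -> None:
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key",
        scrape_type="markdown",
    )
    # aload() performs the crawl without blocking the event loop
    documents = await loader.aload()
    print(f"Loaded {len(documents)} documents")

asyncio.run(main())
```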
Lazy Loading
When dealing with large datasets:
```python
# Lazy loading
for doc in loader.lazy_load():
    print(doc.page_content[:100])
```
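Lazy loading pairs well with streaming-style processing. As a rough sketch (the text splitter and chunking parameters are illustrative assumptions, not part of the loader), each page can be chunked as soon as it is yielded instead of holding the whole crawl in memory:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from webcrawlerapi_langchain import WebCrawlerAPILoader

loader = WebCrawlerAPILoader(
    url="https://example.com",
    api_key="your-api-key",
    scrape_type="markdown",
    items_limit=100,
)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Chunk each crawled page as it arrives instead of materializing the full crawl
for doc in loader.lazy_load():
    chunks = splitter.split_documents([doc])
    print(f"Split one page into {len(chunks)} chunks")
```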
Async Lazy Loading
Combining asynchronous and lazy loading:
```python
# Async lazy loading
async for doc in loader.alazy_load():
    print(doc.page_content[:100])
```
Configuration Options
The WebCrawlerAPILoader accepts the following configuration parameters:
| Parameter | Type | Description |
|---|---|---|
| url | string | The target URL to crawl |
| api_key | string | Your WebCrawlerAPI API key |
| scrape_type | string | Type of scraping: "html", "cleaned", or "markdown" |
| items_limit | integer | Maximum number of pages to crawl |
| whitelist_regexp | string | Regex pattern for URL whitelist |
| blacklist_regexp | string | Regex pattern for URL blacklist |
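Putting the parameters together, a fully configured loader looks like the sketch below; the regex values and limits are illustrative only:

```python
from webcrawlerapi_langchain import WebCrawlerAPILoader

loader = WebCrawlerAPILoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    scrape_type="cleaned",           # "html", "cleaned", or "markdown"
    items_limit=50,                  # stop after 50 pages
    whitelist_regexp=r".*/docs/.*",  # only follow documentation URLs
    blacklist_regexp=r".*/blog/.*",  # skip blog pages
)
documents = loader.load()
```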
Best Practices
- Rate Limiting: Be mindful of API rate limits when crawling multiple pages (a retry sketch follows this list).
- Error Handling: Always implement proper error handling for network issues and API responses.
- Content Type: Choose the appropriate scrape_type based on your LLM's requirements:
  - Use "markdown" for structured content
  - Use "cleaned" for plain text
  - Use "html" for raw HTML content
Example Use Cases
Document QA System
```python
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from webcrawlerapi_langchain import WebCrawlerAPILoader

# Load documents from a website
loader = WebCrawlerAPILoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    scrape_type="markdown"
)
documents = loader.load()

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
```
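Once built, the chain can answer questions over the crawled documentation. The question below is just an example:

```python
# Ask a question over the crawled site (example question)
answer = qa_chain.run("How do I authenticate with the API?")
print(answer)
```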
Error Handling
The loader implements robust error handling for common scenarios:
```python
from webcrawlerapi_langchain import WebCrawlerAPILoader

# Note: WebCrawlerAPIError and ValidationError are assumed to be exposed by the
# SDK; adjust the exception imports to match your installed version.
try:
    loader = WebCrawlerAPILoader(
        url="https://example.com",
        api_key="your-api-key"
    )
    documents = loader.load()
except WebCrawlerAPIError as e:
    print(f"API Error: {e}")
except ValidationError as e:
    print(f"Configuration Error: {e}")
except Exception as e:
    print(f"Unexpected Error: {e}")
```
Support
For additional support or to report issues, please visit the WebCrawlerAPI documentation or the GitHub repository.