    10 min read

    Clean crawled or scraped data with BeautifulSoup in Python

    After crawling or scraping a webpage, the data may need to be cleaned. In this article, we provide a solution and code for using BeautifulSoup to remove unneeded content.

    Written by Andrew
    Published on May 27, 2024

    Table of Contents

    • Write clean-up function using BeautifulSoup in Python
    • Use BeautifulSoup in Docker

    Write clean-up function using BeautifulSoup in Python

    After crawling or scraping a webpage, the data may need to be cleaned. Below is a solution, with code, that uses BeautifulSoup to remove unneeded content. In other words, it lets you turn this:

    <html>
        <head><title>Sample HTML</title></head>
        <body>
            <p>This is a <b>sample</b> HTML content.</p>
            <p>It has multiple lines and <i>tags</i>.</p>
        </body>
        <script>some javascript</script>
    </html>
    

    Into this:

    Sample HTML
    This is a sample HTML content.
    It has multiple lines and tags.
    

    Why clean data is important

    Dirty data can lead to incorrect analysis, false insights, and wasted time. Here are some reasons to clean crawled content:

    1. Remove irrelevant information: webpages usually contain HTML tags, styles, scripts, media, and other markup. You rarely need all of this; what you actually need is the valuable data buried inside it.
    2. Reduce noise: without cleaning, the data contains a lot of noise, which can reduce the accuracy of your analysis or of AI models trained on it.
    3. Reduce size: cleaning the data can also significantly reduce the storage it requires (see the sketch after this list).
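
    As a rough illustration of the size reduction mentioned in point 3, here is a minimal sketch that compares the character count of a raw HTML snippet with the count of its extracted text. The raw_html value is just a placeholder; with real crawled pages the savings are usually much larger.

    from bs4 import BeautifulSoup
    
    # Placeholder HTML standing in for a crawled page
    raw_html = "<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
    
    # Strip the markup and compare sizes before and after
    soup = BeautifulSoup(raw_html, 'html.parser')
    text_only = soup.get_text()
    print(f"Raw HTML:  {len(raw_html)} characters")
    print(f"Text only: {len(text_only)} characters")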

    Cleaning Data with BeautifulSoup

    BeautifulSoup is a powerful and easy-to-use Python library for parsing and cleaning HTML and XML documents. It is particularly great for scraping and cleaning crawled data. Here's an example of how to clean crawled data with BeautifulSoup:

    First, install BeautifulSoup4:

    pip install beautifulsoup4
    

    Write the clean-up function. It removes script and style tags before extracting the text, since get_text() would otherwise include their contents, and then drops empty lines:

    from bs4 import BeautifulSoup
    
    def clean_html():
        soup = BeautifulSoup(HTML_CONTENT, 'html.parser')
        # Remove script and style tags so their contents don't end up in the text
        for tag in soup(['script', 'style']):
            tag.decompose()
        # Extract the visible text and drop empty lines
        clean_text = soup.get_text()
        clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])
        return clean_text
    
    

    Add some test data:

    HTML_CONTENT = """
        <html>
        <head><title>Sample HTML</title></head>
        <body>
            <p>This is a <b>sample</b> HTML content.</p>
            <p>It has multiple lines and <i>tags</i>.</p>
        </body>
        <script>some javascript</script>
        </html>
        """
    

    And a runner to test:

    def main():
        cleaned = clean_html()
        print(cleaned)
    

    The final code looks like this:

    #clean.py
    from bs4 import BeautifulSoup
    
    
    def clean_html():
        HTML_CONTENT = """
        <html>
        <head><title>Sample HTML</title></head>
        <body>
            <p>This is a <b>sample</b> HTML content.</p>
            <p>It has multiple lines and <i>tags</i>.</p>
        </body>
        <script>some javascript</script>
        </html>
        """
    
        soup = BeautifulSoup(HTML_CONTENT, 'html.parser')
        # Remove script and style tags so their contents don't end up in the text
        for tag in soup(['script', 'style']):
            tag.decompose()
        # Extract the visible text and drop empty lines
        clean_text = soup.get_text()
        clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])
        return clean_text
    
    
    def main():
        cleaned = clean_html()
        print(cleaned)
    
    
    if __name__ == '__main__':
        main()
    

    Now you can run it and see the cleaned test data:

    python clean.py
    
    # Output:
    # Sample HTML
    # This is a sample HTML content.
    # It has multiple lines and tags.
    
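
    In a real pipeline you would feed the function HTML that your crawler fetched, rather than a hard-coded string. Here is a minimal sketch of that, assuming you download a page with the requests library and pass the HTML in as a parameter; the URL is only a placeholder.

    import requests
    from bs4 import BeautifulSoup
    
    
    def clean_html(html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        # Same clean-up as above: drop scripts and styles, keep non-empty text lines
        for tag in soup(['script', 'style']):
            tag.decompose()
        clean_text = soup.get_text()
        return '\n'.join(line for line in clean_text.split('\n') if line.strip())
    
    
    # Placeholder URL: replace it with a page your crawler fetched
    response = requests.get('https://example.com')
    print(clean_html(response.text))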

    Use BeautifulSoup in Docker

    If you don't want to write the code yourself, or you are using a language other than Python, you can run a Docker container with BeautifulSoup and send requests to it.

    Run the Docker container:

    docker pull n10ty/beautifulsoup-api
    docker run -p5000:5000 n10ty/beautifulsoup-api
    

    Make a request:

    curl --request POST \
      --url http://localhost:5000/clean \
      --data '<html>
        <head><title>Sample HTML</title></head>
        <body>
            <p>This is a <b>sample</b> HTML content.</p>
            <p>It has multiple lines and <i>tags</i>.</p>
        </body>
        <script>some javascript</script>
        </html>'
    
    # Output:
    # Sample HTML
    # This is a sample HTML content.
    # It has multiple lines and tags.
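
    If you prefer to call the container from code instead of curl, any HTTP client works. Here is a minimal sketch using Python's requests library, assuming the container is running locally on port 5000 and the /clean endpoint accepts raw HTML as the request body, exactly as in the curl example above.

    import requests
    
    html = """<html>
        <head><title>Sample HTML</title></head>
        <body>
            <p>This is a <b>sample</b> HTML content.</p>
            <p>It has multiple lines and <i>tags</i>.</p>
        </body>
        <script>some javascript</script>
        </html>"""
    
    # POST the raw HTML to the container's /clean endpoint
    response = requests.post('http://localhost:5000/clean', data=html)
    print(response.text)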