Clean crawled or scraped data with BeautifulSoup in Python

After crawling or scraping a webpage, the data often needs to be cleaned. In this article, we provide a solution and code for using BeautifulSoup to remove unneeded content.

Write a clean-up function using BeautifulSoup in Python

A small clean-up function built on BeautifulSoup strips away the markup, scripts, and empty lines. In other words, it allows you to turn this:

<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>

Into this:

Sample HTML
This is a sample HTML content.
It has multiple lines and tags.

Why clean data is important

Dirty data can lead to incorrect analysis, false insights, and wasted time. Here are some reasons to clean crawled content:

  1. Remove irrelevant information: webpages usually contain HTML tags, styles, scripts, media, and other markup. In most cases you do not need any of that; you need the valuable content buried inside it.
  2. Reduce noise: uncleaned data carries a lot of noise, which lowers the accuracy of your analysis or of AI models trained on that data.
  3. Reduce size: cleaning the data also significantly reduces the storage space it requires.

Cleaning Data with BeautifulSoup

BeautifulSoup is a powerful and easy-to-use Python library for parsing HTML and XML documents, which makes it a great fit for cleaning crawled or scraped data. Here’s how to do it step by step.

First, install BeautifulSoup 4:

Terminal window
pip install beautifulsoup4
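
You can quickly verify the installation; the one-liner below simply prints the installed bs4 version:

Terminal window
python -c "import bs4; print(bs4.__version__)"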

Write the clean-up function:

from bs4 import BeautifulSoup

def clean_html():
    soup = BeautifulSoup(HTML_CONTENT, 'html.parser')
    # Drop <script> and <style> blocks so their contents don't end up in the text
    for tag in soup(['script', 'style']):
        tag.decompose()
    # Extract the plain text and drop the empty lines left behind
    clean_text = soup.get_text()
    clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])
    return clean_text
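
get_text() returns the text of every element on the page. If you only need certain parts, you can restrict extraction to specific tags instead. A minimal sketch, assuming paragraph text is what you are after (the tag choice here is only an illustration):

def clean_paragraphs(html):
    # Keep only the text found inside <p> tags
    soup = BeautifulSoup(html, 'html.parser')
    return '\n'.join(p.get_text(strip=True) for p in soup.find_all('p'))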

Add some test data:

HTML_CONTENT = """
<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>
"""

And a runner to test it:

def main():
    cleaned = clean_html()
    print(cleaned)
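
In a real crawl you would pass the fetched HTML into the function rather than hard-coding it. A sketch of that variant, assuming the requests library and a placeholder URL:

import requests
from bs4 import BeautifulSoup

def clean_fetched_html(html):
    # Same clean-up logic, but the HTML arrives as a parameter
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = soup.get_text()
    return '\n'.join(line for line in text.split('\n') if line.strip())

# Hypothetical URL, used only for illustration
response = requests.get('https://example.com')
print(clean_fetched_html(response.text))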

The final code looks like this:

clean.py
from bs4 import BeautifulSoup

HTML_CONTENT = """
<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>
"""

def clean_html():
    soup = BeautifulSoup(HTML_CONTENT, 'html.parser')
    # Drop <script> and <style> blocks so their contents don't end up in the text
    for tag in soup(['script', 'style']):
        tag.decompose()
    # Extract the plain text and drop the empty lines left behind
    clean_text = soup.get_text()
    clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])
    return clean_text

def main():
    cleaned = clean_html()
    print(cleaned)

if __name__ == '__main__':
    main()

Now you can run it and see the cleaned test data:

Terminal window
python clean.py
# Output:
# Sample HTML
# This is a sample HTML content.
# It has multiple lines and tags.
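
If your crawler saves pages to disk, the same clean-up logic can run over all the files in a loop. A minimal sketch, assuming the crawled pages live in a crawl/ directory (a hypothetical path):

from pathlib import Path
from bs4 import BeautifulSoup

def clean(html):
    # Strip scripts and styles, then keep only non-empty text lines
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    return '\n'.join(line for line in soup.get_text().split('\n') if line.strip())

for path in Path('crawl').glob('*.html'):
    cleaned = clean(path.read_text(encoding='utf-8'))
    # Write the plain text next to the original file; it is usually far smaller
    path.with_suffix('.txt').write_text(cleaned, encoding='utf-8')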

Use BeautifulSoup in Docker

If you don’t want to write the code yourself, or you are using a language other than Python, you can run a Docker container with BeautifulSoup and send your requests to it.

Run the Docker container:

Terminal window
docker pull n10ty/beautifulsoup-api
docker run -p 5000:5000 n10ty/beautifulsoup-api

Make a request:

Terminal window
curl --request POST \
--url http://localhost:5000/clean \
--data '<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>'
# Output:
# Sample HTML
# This is a sample HTML content.
# It has multiple lines and tags.
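
The same endpoint can be called from any HTTP client or language. As an illustration, here is the equivalent request sketched in Python with the requests library, assuming the container is running locally as shown above:

import requests

html = '<html><body><p>This is a <b>sample</b> HTML content.</p><script>some javascript</script></body></html>'

# POST the raw HTML to the container's /clean endpoint, as in the curl example
response = requests.post('http://localhost:5000/clean', data=html)
print(response.text)  # the cleaned plain text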