Write a clean-up function using BeautifulSoup in Python
After crawling or scraping a webpage, the data often needs to be cleaned. In this article, we show how to use BeautifulSoup to strip unneeded markup and keep only the text. In other words, it lets you turn this:
<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>
Into this:
Sample HTML
This is a sample HTML content.
It has multiple lines and tags.
Why clean data is important
Dirty data can lead to incorrect analysis, false insights, and wasted time. Here are some reasons to clean crawled content:
- Remove irrelevant information: webpages usually contain HTML tags, styles, scripts, media, and so on. You rarely need all of it; what you usually want is the actual content.
- Reduce noise: uncleaned data is noisy, which can reduce the accuracy of your analysis or of AI models trained on it.
- Reduce size: cleaning can also significantly reduce the storage the data requires.
Cleaning Data with BeautifulSoup
BeautifulSoup is a powerful and easy-to-use Python library for parsing and cleaning HTML and XML documents. It is particularly great for scraping and cleaning crawled data. Here's an example of how to clean crawled data with BeautifulSoup:
First, install BeautifulSoup4:
pip install beautifulsoup4
Write the clean-up function:
from bs4 import BeautifulSoup

def clean_html(html):
    """Return the text of an HTML document with blank lines removed."""
    # Parse the markup with Python's built-in HTML parser.
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() joins the document's text and drops the tags.
    clean_text = soup.get_text()
    # Keep only the lines that contain something other than whitespace.
    clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])
    return clean_text
Add some test data:
HTML_CONTENT = """
<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>
"""
And a runner to test:
def main():
    cleaned = clean_html(HTML_CONTENT)
    print(cleaned)
The final code looks like this:
# clean.py
from bs4 import BeautifulSoup

# Sample input used by main().
HTML_CONTENT = """
<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>
"""

def clean_html(html):
    """Return the text of an HTML document with blank lines removed."""
    soup = BeautifulSoup(html, 'html.parser')
    clean_text = soup.get_text()
    # Keep only the lines that contain something other than whitespace.
    clean_text = '\n'.join([line for line in clean_text.split('\n') if line.strip()])
    return clean_text

def main():
    cleaned = clean_html(HTML_CONTENT)
    print(cleaned)

if __name__ == '__main__':
    main()
Now you can run it and see the cleaned test data:
python clean.py
# Output:
# Sample HTML
# This is a sample HTML content.
# It has multiple lines and tags.
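Whether get_text() leaves out the contents of script and style tags can depend on the BeautifulSoup version and on the parser in use, so it may be safer not to rely on it. The sketch below is one possible variant that removes such tags explicitly before extracting the text. The function name clean_html_strict and the NOISE_TAGS list are assumptions made for illustration; they are not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical list of tags whose contents we treat as noise.
# Adjust it to the pages you crawl.
NOISE_TAGS = ["script", "style", "noscript"]


def clean_html_strict(html):
    """Return the text of an HTML document after removing noise tags."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() removes the tag and everything inside it from the tree.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()
    text = soup.get_text()
    # Same blank-line filter as in the clean-up function above.
    return "\n".join(line for line in text.split("\n") if line.strip())
```

With this variant the result no longer depends on how get_text() treats embedded JavaScript or CSS, at the cost of one extra pass over the parsed tree.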
Use BeautifulSoup in Docker
If you don't want to write the code yourself, or you are working in a language other than Python, you can run a Docker container with BeautifulSoup and send requests to it.
Run the container:
docker pull n10ty/beautifulsoup-api
docker run -p5000:5000 n10ty/beautifulsoup-api
Make a request:
curl --request POST \
--url http://localhost:5000/clean \
--data '<html>
<head><title>Sample HTML</title></head>
<body>
<p>This is a <b>sample</b> HTML content.</p>
<p>It has multiple lines and <i>tags</i>.</p>
</body>
<script>some javascript</script>
</html>'
# Output:
# Sample HTML
# This is a sample HTML content.
# It has multiple lines and tags.
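The same request can be sent from a script instead of curl. The sketch below uses only Python's standard library and mirrors the curl call above: a POST to http://localhost:5000/clean with the raw HTML as the body and the content type that curl --data sends by default. The helper names build_clean_request and clean_via_api are hypothetical, and the assumption that the service replies with UTF-8 plain text is based only on the output shown above.

```python
import urllib.request

# Endpoint taken from the curl example; change it if you map another port.
API_URL = "http://localhost:5000/clean"


def build_clean_request(html, url=API_URL):
    """Build a POST request that carries the raw HTML as its body."""
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        method="POST",
        # curl --data sends this content type by default.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def clean_via_api(html, url=API_URL, timeout=10):
    """Send the HTML to the running container and return the response text.

    Assumes the service answers with UTF-8 encoded plain text.
    """
    request = build_clean_request(html, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, calling clean_via_api("<p>Hello <b>world</b></p>") should return the cleaned text for that fragment.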