Clean and normalize HTML while preserving embedded content (e.g. Twitter or Instagram posts).
Quick start
Installation
Install the library with pip:
pip install clear-html
Usage
Example usage with lxml:
from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

html = """
<div style="color:blue" id="main_content">
Some text to be
<div>cleaned up!</div>
</div>
"""
# Parse the markup into an lxml element, clean it, and serialize the result.
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)
Example usage with Parsel:
from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html
selector = Selector(text="""<html>
<body>
<h1>Hello!</h1>
<div style="color:blue" id="main_content">
Some text to be
<div>cleaned up!</div>
</div>
</body>
</html>""")
# Select the element to clean and pass its underlying lxml element (.root) to clean_node.
main_content = selector.css("#main_content")
cleaned_node = clean_node(main_content[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)
Both approaches print the following:
<article>
<p>Some text to be</p>
<p>cleaned up!</p>
</article>
Other useful functions:
- cleaned_node_to_text: converts the cleaned node to plain text
- formatted_text.clean_doc: low-level function that gives finer control over the cleanup
Source Distribution
clear_html-0.4.1.tar.gz (23.9 kB)
Built Distribution
clear_html-0.4.1-py3-none-any.whl (24.7 kB)
File details
Details for the file clear_html-0.4.1.tar.gz.
File metadata
- Download URL: clear_html-0.4.1.tar.gz
- Upload date:
- Size: 23.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 711957bb03b0729caa257679e15881f9e0eeea27236b5c18eac1e75b8af06b06
MD5 | f9bcf9d2d62dc0724fab546af717b67d
BLAKE2b-256 | 7a28d08437394b1b28e46fd804a99b3ba2e6dc3a1103ac14b097f04ea442bb26
File details
Details for the file clear_html-0.4.1-py3-none-any.whl.
File metadata
- Download URL: clear_html-0.4.1-py3-none-any.whl
- Upload date:
- Size: 24.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | a270ed4d78bda7f8d9e308c7c4fa5ebe2bdcf39280730a448064ad677a0a76cf
MD5 | 55f9c42f64099028b74c08742ed731da
BLAKE2b-256 | d11c349aa7cf8ac99c27a9afd1b27f4c1e5a9a913ae0b6f3fdc988e60b56116c