clear-html cleans and normalizes HTML while preserving embedded content (e.g. Twitter and Instagram embeds).
Quick start
Installation
Install the library with pip:
pip install clear-html
Usage
Example usage with lxml:
from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html
html = """
<div style="color:blue" id="main_content">
Some text to be
<div>cleaned up!</div>
</div>
"""
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)
Example usage with Parsel:
from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html
selector = Selector(text="""<html>
<body>
<h1>Hello!</h1>
<div style="color:blue" id="main_content">
Some text to be
<div>cleaned up!</div>
</div>
</body>
</html>""")
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)
Both approaches above print the following:
<article>
<p>Some text to be</p>
<p>cleaned up!</p>
</article>
Other useful functions:
cleaned_node_to_text: converts the cleaned node to plain text
formatted_text.clean_doc: low-level function that gives finer control over the cleaning process