Clean and normalize HTML.

Clean and normalize HTML while preserving embeddings (e.g. Twitter, Instagram).

Quick start

Installation

Install the library with pip:

pip install clear-html

Usage

Example usage with lxml:

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

html = """
        <div style="color:blue" id="main_content">
            Some text to be
            <div>cleaned up!</div>
        </div>
     """
# Parse the markup into an lxml element, clean it, then serialize it back to HTML.
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Example usage with Parsel:

from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html

selector = Selector(text="""<html>
                            <body>
                                <h1>Hello!</h1>
                                <div style="color:blue" id="main_content">
                                    Some text to be
                                    <div>cleaned up!</div>
                                </div>
                            </body>
                            </html>""")
# Select the element to clean; .root exposes the underlying lxml element.
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Both approaches print the following:

<article>

<p>Some text to be</p>

<p>cleaned up!</p>

</article>

Other useful functions:

  • cleaned_node_to_text: converts the cleaned node to plain text

  • formatted_text.clean_doc: low-level function that gives finer control over the cleanup

Download files

Source Distribution

clear-html-0.4.0.tar.gz (23.9 kB)

Uploaded Source

Built Distribution

clear_html-0.4.0-py3-none-any.whl (24.8 kB)

Uploaded Python 3

File details

Details for the file clear-html-0.4.0.tar.gz.

File metadata

  • Download URL: clear-html-0.4.0.tar.gz
  • Upload date:
  • Size: 23.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.5

File hashes

Hashes for clear-html-0.4.0.tar.gz
Algorithm    Hash digest
SHA256       3231ffcaf660a6417c743b4a7a349976eef97984f7c189b3993e8020333165cf
MD5          eed8bcec6c539aff7b3a0e1ed9ee2e09
BLAKE2b-256  4250c6ea475787c40fd0248cc8697aa8e44d63b685502d7ab944a87ba7e4259f
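
A downloaded archive can be checked against the digests listed above. The following is a minimal sketch using only the Python standard library; the helper names sha256_of and matches_published_hash are hypothetical and not part of clear-html, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for clear-html-0.4.0.tar.gz.
EXPECTED_SHA256 = "3231ffcaf660a6417c743b4a7a349976eef97984f7c189b3993e8020333165cf"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Hypothetical helper: True if the file's digest equals the expected one."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded source distribution.
    archive = Path("clear-html-0.4.0.tar.gz")
    if archive.exists():
        print("hash OK" if matches_published_hash(archive) else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform the same check at install time through its hash-checking mode, which compares each download against hashes pinned in a requirements file.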

File details

Details for the file clear_html-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: clear_html-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 24.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.5

File hashes

Hashes for clear_html-0.4.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       c89ece4d989e54e008457cefec3daeb4127f14e690bbe3ace31588a54fa57157
MD5          c71c3b49ba301f0c72dba0e70942bf56
BLAKE2b-256  665e8d7a2983e7ad5d355c6778429860d5e8f886d9f0f1643802d1432a05b4d7
