A small library to smartly extract text from HTML and then rebuild the HTML

Boned html is a small Python library.

It helps you extract text from an HTML document (in the form of an lxml tree), process that text (for example, to classify it), and reinject it into the HTML with specific CSS classes.

The typical use is annotating an HTML document with classes. For example, if you are categorizing text, you may want the user to visualize those categories on the original HTML.

The text is extracted in a smart way: a chunk does not stop at semantic tags (<i>, <em>, etc.), only at other tags (<h1>, <p>, etc.).
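
To make the chunking rule concrete, here is a minimal sketch using only the standard library's html.parser. It is not the library's implementation: the INLINE_TAGS set, the BlockTextCollector class and the block_chunks helper are all hypothetical names chosen for this illustration, and the real list of semantic tags may differ.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative set of inline ("semantic") tags.
# The library's real list may differ.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong", "u"}


class BlockTextCollector(HTMLParser):
    """Collect text chunks, closing the current chunk at every non-inline tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._current = []

    def _flush(self):
        # Store the accumulated text as one chunk, skipping whitespace-only runs.
        text = "".join(self._current).strip()
        if text:
            self.chunks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text inside inline tags simply joins the current chunk.
        self._current.append(data)


def block_chunks(html):
    """Return the list of text chunks found in an HTML string."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.chunks


chunks = block_chunks(
    "<h1>Contact</h1><p><b>call</b> <em>(country) +33</em> 00 00 00 00</p>"
)
print(chunks)
```

The <b> and <em> tags do not split the second chunk, so the whole phone number ends up in a single piece of text that a classifier can inspect.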

When you reinject the text, the semantic tags are added back to it, and the general HTML layout is preserved.

Installation

pip install boned-html

Usage

The functionality is provided by the class boned_html.Chunker, with methods:

  • chunk_tree to get text chunks from an lxml tree.

  • unchunk to put the chunks back together, providing CSS classes for pieces of text.

A quick example: imagine we have a function that detects a telephone number in a sentence:

>>> import re
>>> from itertools import cycle
>>> def get_tel(text):
...    splits = re.split(r"(\+?(?:\d\s*){8,13})", text)
...    return list(zip(splits, cycle([None, "tel"])))
>>> get_tel("call +33 00 00 00 00")
[('call ', None), ('+33 00 00 00 00', 'tel'), ('', None)]
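
The zip/cycle pairing above relies on a property of re.split: when the pattern has a single capturing group, the result alternates between text outside a match and the matched text. The sketch below shows this on a sentence with two numbers; label_tel and TEL_PATTERN are hypothetical names used only for this illustration.

```python
import re
from itertools import cycle

# Same pattern as get_tel above, compiled once.
TEL_PATTERN = re.compile(r"(\+?(?:\d\s*){8,13})")


def label_tel(text):
    # With one capturing group, split() returns
    # [outside, match, outside, match, ..., outside],
    # so even positions get None and odd positions get "tel".
    pieces = TEL_PATTERN.split(text)
    return list(zip(pieces, cycle([None, "tel"])))


sentence = "office 0102030405, mobile 0607080910."
labelled = label_tel(sentence)
for piece, css_class in labelled:
    print(repr(piece), css_class)

# The pieces always concatenate back to the original sentence.
assert "".join(piece for piece, _ in labelled) == sentence
```

Because each digit in the pattern is followed by \s*, a number followed by a space will include that trailing space in the matched piece.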

And an HTML document:

>>> html = '''
... <html>
...   <head><title>call +33 00 00 00 00</title></head>
...   <body>
...     <p>To get an operator <em>call</em></p>
...     <p><b>call</b> <em>(country) +33</em> 00 00 00 00</p>
...   </body>
... </html>
... '''

We chunk:

>>> import lxml.html
>>> from boned_html import HtmlBoner
>>> tree = lxml.html.fromstring(html)
>>> boned = HtmlBoner(tree)

We evaluate each text chunk and assign the “tel” class to any piece that contains a telephone number:

>>> for i, text in enumerate(boned):
...     if text is not None:
...         boned.set_classes(i, get_tel(text))

We now rebuild the tree:

>>> boned.tree
<Element html ...>
>>> print(boned)
<html>
  <head><title>call +33 00 00 00 00</title></head>
  <body>
    <p>To get an operator <em>call</em></p>
    <p><b>call</b> <em>(country) </em><span class="tel" id="chunk-6-1"><em>+33</em> 00 00 00 00</span></p>
  </body>
</html>

We get a dedicated span around our number, the opening and closing of the em tag were handled, and the phone number in head/title remains unchanged.
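
As a rough illustration of what reinjection produces for plain text, the sketch below turns a list of (text, class) pairs into markup with one span per labelled piece. It deliberately ignores inline tags such as em, which the library handles and this sketch does not. The render_pieces helper is hypothetical, and the id scheme (chunk index, then piece index) is an assumption modelled on the id seen in the output above.

```python
from html import escape


def render_pieces(chunk_index, pieces):
    """Render (text, css_class) pairs as HTML, wrapping labelled pieces in spans."""
    out = []
    for piece_index, (text, css_class) in enumerate(pieces):
        if not text:
            # Empty pieces (for example the trailing '' from re.split) add nothing.
            continue
        if css_class is None:
            out.append(escape(text))
        else:
            # Assumption: ids follow the "chunk-<chunk>-<piece>" shape seen above.
            out.append(
                '<span class="{}" id="chunk-{}-{}">{}</span>'.format(
                    escape(css_class, quote=True),
                    chunk_index,
                    piece_index,
                    escape(text),
                )
            )
    return "".join(out)


rendered = render_pieces(
    6, [("call ", None), ("+33 00 00 00 00", "tel"), ("", None)]
)
print(rendered)
```

Unlabelled pieces pass through as escaped text, so only the pieces a classifier tagged gain extra markup.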
