inscriptis - HTML to text converter.
inscriptis
A Python-based HTML to text conversion library, command line client and Web service with support for nested tables and a subset of CSS. Please take a look at the Rendering document for a demonstration of inscriptis' conversion quality.
Table of Contents
- Requirements and installation
- Command line client
- Python library
- Web service
- Fine tuning
- Testing, benchmarking and evaluation
- Changelog
Requirements and installation
Requirements
- Python 3.5+ (preferred) or Python 2.7+
- lxml
- requests
Installation
sudo python3 setup.py install
Command line client
The command line client converts HTML files or HTML retrieved from Web pages to the corresponding text representation.
Command line parameters
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input
Converts HTML from file or url to a clean text version
positional arguments:
  input                 Html input either from a file or an url (default:stdin)
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  --indentation
                        How to handle indentation (extended or standard; default: extended)
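For readers who want to wrap or mirror this interface in their own scripts, the options listed above could be declared with argparse roughly as follows. This is a sketch reconstructed from the help text, not the actual source of inscript.py; the function name build_parser and the handling of a missing input as None are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="inscript.py",
        description="Converts HTML from file or url to a clean text version")
    # optional positional argument: None is taken to mean "read from stdin"
    parser.add_argument("input", nargs="?", default=None,
                        help="HTML input from a file or a URL (default: stdin)")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file (default: stdout)")
    parser.add_argument("-e", "--encoding", default="utf-8",
                        help="Content encoding for files (default: utf-8)")
    parser.add_argument("-i", "--display-image-captions", action="store_true",
                        help="Display image captions (default: false)")
    parser.add_argument("-l", "--display-link-targets", action="store_true",
                        help="Display link targets (default: false)")
    parser.add_argument("-d", "--deduplicate-image-captions",
                        action="store_true",
                        help="Deduplicate image captions (default: false)")
    parser.add_argument("--indentation", choices=("extended", "standard"),
                        default="extended",
                        help="How to handle indentation (default: extended)")
    return parser


# parse an example command line without touching sys.argv
args = build_parser().parse_args(["fhgr.html", "-o", "fhgr.txt", "-l"])
print(args.input, args.output, args.display_link_targets, args.indentation)
```

Restricting --indentation with choices makes argparse reject any value other than the two documented ones.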
Examples
# convert the given page to text and output the result to the screen
inscript.py https://www.fhgr.ch
# convert the file to text and save the output to fhgr.txt
inscript.py fhgr.html -o fhgr.txt
# convert the HTML provided via stdin and save the output to output.txt
echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt
Python library
Embedding inscriptis into your code is easy, as outlined below:
import urllib.request
from inscriptis import get_text
url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
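The example above assumes that the page is encoded in UTF-8. For pages that announce a different encoding, a small helper along the following lines may be useful. It is a sketch only: decode_html is a hypothetical name that is not part of inscriptis, and it relies solely on the charset given in the HTTP Content-Type header.

```python
from email.message import Message


def decode_html(raw, content_type=None, fallback="utf-8"):
    """Decode raw HTML bytes using the charset announced in a Content-Type
    header value; use `fallback` when no (or an unknown) charset is given."""
    charset = None
    if content_type:
        message = Message()
        message["Content-Type"] = content_type
        charset = message.get_content_charset()  # lower-cased, or None
    try:
        return raw.decode(charset or fallback, errors="replace")
    except LookupError:
        # unknown charset label: fall back rather than fail
        return raw.decode(fallback, errors="replace")


# Possible use together with the example above:
#   response = urllib.request.urlopen(url)
#   html = decode_html(response.read(), response.headers.get("Content-Type"))
#   text = get_text(html)
```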
Flask Web Service
The Flask Web Service translates HTML pages to the corresponding plain text.
Additional Requirements
- python3-flask
Startup
export FLASK_APP="web-service.py"
python3 -m flask run
Usage
The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified in the Content-Type header (UTF-8 in the example below).
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
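The same request can be issued from Python using only the standard library. The sketch below mirrors the curl call; the function names are hypothetical, and decoding the response as UTF-8 is an assumption about the service's output.

```python
import urllib.request


def build_get_text_request(html, url="http://localhost:5000/get_text",
                           encoding="UTF8"):
    """Prepare a POST request carrying `html` in the body, with the encoding
    declared in the Content-Type header as in the curl example."""
    return urllib.request.Request(
        url,
        data=html.encode(encoding),
        headers={"Content-Type": "text/html; encoding=" + encoding},
        method="POST")


def fetch_text(html, **kwargs):
    """Send the request and return the response body as a string.

    Assumption: the service answers with UTF-8 encoded plain text.
    """
    request = build_get_text_request(html, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the Web service running locally, fetch_text('<body><p>Make it so!</p></body>') would return the converted text.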
Fine tuning
The following options are available for fine tuning the way inscriptis translates HTML to text.
- More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also use indentation for tags such as <div> and <span> that do not provide indentation in their standard definition. This strategy is the default in inscript.py and many other tools such as lynx. If you do not want extended indentation, you can use the parameter indentation='standard' instead.
- Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:
from lxml.html import fromstring
from inscriptis.css import DEFAULT_CSS, HtmlElement
from inscriptis.html_properties import Display
# assumption: the Inscriptis class is importable from inscriptis.html_engine
from inscriptis.html_engine import Inscriptis
# html, display_images, deduplicate_captions and display_links are expected
# to be defined by the caller
# create a custom CSS based on the default style sheet and change the rendering of `div` and `span` elements
css = DEFAULT_CSS.copy()
css['div'] = HtmlElement('div', display=Display.block, padding=2)
css['span'] = HtmlElement('span', prefix=' ', suffix=' ')
html_tree = fromstring(html)
# create a parser using the custom css
parser = Inscriptis(html_tree,
                    display_images=display_images,
                    deduplicate_captions=deduplicate_captions,
                    display_links=display_links,
                    css=css)
text = parser.get_text()
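Since the indentation parameter accepts exactly the two values 'extended' and 'standard', it can be convenient to validate the value before delegating to inscriptis.get_text(). The wrapper below is a minimal sketch: the name convert and its converter argument are hypothetical and exist only to keep the example testable without inscriptis installed.

```python
VALID_INDENTATION = ("extended", "standard")


def convert(html, indentation="extended", converter=None):
    """Convert `html` to text with an explicitly validated indentation mode.

    `converter` is a hypothetical injection point; when it is omitted,
    inscriptis.get_text is imported lazily and used.
    """
    if indentation not in VALID_INDENTATION:
        raise ValueError(
            "indentation must be one of {!r}".format(VALID_INDENTATION))
    if converter is None:
        from inscriptis import get_text as converter
    return converter(html, indentation=indentation)
```

Calling convert(html, indentation='standard') then corresponds to the stricter rendering described in the first option.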
Testing, benchmarking and evaluation
Unit tests
Test cases concerning the HTML to text conversion are located in the tests/html directory. Each test case consists of two files, test-name.html and test-name.txt, the latter containing the reference text output for the given HTML file.
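The pairing of HTML input and reference output can be checked with a few lines of Python. The following is a sketch of such a comparison and not the project's actual test runner; run_reference_tests is a hypothetical name, the converter is passed in as a callable, and ignoring trailing whitespace is an assumption of this sketch.

```python
from pathlib import Path


def run_reference_tests(test_dir, convert):
    """Compare convert(html) with the reference text for every
    test-name.html / test-name.txt pair found in `test_dir`.

    Returns the sorted names of the failing test cases.
    """
    failures = []
    for html_file in sorted(Path(test_dir).glob("*.html")):
        reference_file = html_file.with_suffix(".txt")
        if not reference_file.exists():
            continue  # no reference output available for this case
        html = html_file.read_text(encoding="utf-8")
        reference = reference_file.read_text(encoding="utf-8")
        # trailing whitespace is ignored (an assumption of this sketch)
        if convert(html).rstrip() != reference.rstrip():
            failures.append(html_file.stem)
    return failures
```

With inscriptis installed, run_reference_tests('tests/html', get_text) would list the test cases whose output differs from the reference.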
Text conversion output comparison and speed benchmarking
inscriptis offers a small benchmarking script that can compare different HTML to text conversion approaches.
The script will run the different approaches on a list of URLs, url_list.txt, and save the text output into a time stamped folder in benchmarking/benchmarking_results for manual comparison.
Additionally the processing speed of every approach per URL is measured and saved in a text file called speed_comparisons.txt in the respective time stamped folder.
To run the benchmarking script execute run_benchmarking.py from within the folder benchmarking.
In def pipeline(), select which HTML -> text algorithms are executed by modifying the following flags:
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
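How such switches, the per-URL timing and the time stamped result folder fit together can be sketched as follows. This is an illustration under stated assumptions and not the code of run_benchmarking.py: the function name benchmark, the output file naming and the tab-separated report format are all hypothetical, and the pages are assumed to be downloaded already.

```python
import time
from datetime import datetime
from pathlib import Path


def benchmark(pages, converters, enabled, result_root):
    """Run each enabled converter on each page and record the timings.

    pages:      dict mapping a URL to its already downloaded HTML
    converters: dict mapping a name to a callable html -> text
    enabled:    dict of boolean switches, e.g. {"inscriptis": True}
    """
    out_dir = Path(result_root) / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = []
    for number, (url, html) in enumerate(pages.items()):
        for name, convert in converters.items():
            if not enabled.get(name, False):
                continue  # switched off, like setting a run_* flag to False
            start = time.perf_counter()
            text = convert(html)
            elapsed = time.perf_counter() - start
            # one output file per URL and approach, for manual comparison
            (out_dir / "{}_{}.txt".format(number, name)).write_text(
                text, encoding="utf-8")
            lines.append("{}\t{}\t{:.6f}s".format(url, name, elapsed))
    (out_dir / "speed_comparisons.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8")
    return out_dir
```

Timing only the conversion call, and not the download, keeps network latency out of the speed comparison.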
In url_list.txt the URLs to be parsed can be specified by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://), e.g.
http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
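The format rules for url_list.txt (one URL per line, complete with scheme) are easy to check before a benchmarking run. The helper below is a sketch and not part of the benchmarking script; read_url_list is a hypothetical name, and skipping blank lines is an assumption.

```python
def read_url_list(path):
    """Return the URLs listed in `path`, one per line.

    Blank lines are skipped; an entry without an http:// or https://
    prefix raises a ValueError naming the offending line.
    """
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            url = line.strip()
            if not url:
                continue
            if not url.startswith(("http://", "https://")):
                raise ValueError(
                    "line {}: incomplete URL {!r}".format(line_number, url))
            urls.append(url)
    return urls
```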
Changelog
see Release notes.