inscriptis - HTML to text converter.
A Python-based HTML to text conversion library, command line client and Web service with support for nested tables, a subset of CSS and optional annotated output. Please take a look at the Rendering document for a demonstration of inscriptis’ conversion quality.
A Java port of inscriptis 1.x is available here.
This document provides a short introduction to Inscriptis. The full documentation is built automatically and published on Read the Docs.
Installation
At the command line:
$ pip install inscriptis
Or, if you don’t have pip installed:
$ easy_install inscriptis
If you want to install from the latest sources, you can do:
$ git clone https://github.com/weblyzard/inscriptis.git
$ cd inscriptis
$ python setup.py install
Python library
Embedding inscriptis into your code is easy, as outlined below:
import urllib.request
from inscriptis import get_text
url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
Standalone command line client
The command line client converts HTML files or text retrieved from Web pages to the corresponding text representation.
Command line parameters
The inscript.py command line client supports the following parameters:
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a]
                   [-r ANNOTATION_RULES] [-p POSTPROCESSOR]
                   [--indentation INDENTATION] [-v]
                   [input]

Convert the given HTML document to text.

positional arguments:
  input                 Html input either from a file or a URL
                        (default:stdin).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Input encoding to use (default:utf-8 for files;
                        detected server encoding for Web URLs).
  -i, --display-image-captions
                        Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -a, --display-anchor-urls
                        Display anchor URLs (default:false).
  -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                        Path to an optional JSON file containing rules for
                        annotating the retrieved text.
  -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                        Optional component for postprocessing the result
                        (html, surface, xml).
  --indentation INDENTATION
                        How to handle indentation (extended or strict;
                        default: extended).
  -v, --version         display version information
Examples
HTML to text conversion
convert the given page to text and output the result to the screen:
$ inscript.py https://www.fhgr.ch
convert the file to text and save the output to fhgr.txt:
$ inscript.py fhgr.html -o fhgr.txt
convert HTML provided via stdin and save the output to output.txt:
$ echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt
HTML to annotated text conversion
convert and annotate HTML from a Web page using the provided annotation rules:
$ inscript.py https://www.fhgr.ch -r ./examples/annotation-profile.json
The annotation rules are specified in annotation-profile.json:
{
"h1": ["heading", "h1"],
"h2": ["heading", "h2"],
"b": ["emphasis"],
"div#class=toc": ["table-of-contents"],
"#class=FactBox": ["fact-box"],
"#cite": ["citation"]
}
The dictionary maps an HTML tag and/or attribute to the annotations inscriptis should provide for them. In the example above, for instance, the tag h1 yields the annotations heading and h1, a div tag with a class that contains the value toc results in the annotation table-of-contents, and all tags with a cite attribute are annotated with citation.
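To make the rule syntax concrete, the sketch below splits a rule key into its tag, attribute and value parts. The helper `parse_rule_key` and the `RuleKey` type are hypothetical names written for illustration, not part of the inscriptis API, and the sketch assumes the key grammar `[tag][#attribute[=value]]` implied by the examples above.

```python
from typing import NamedTuple, Optional


class RuleKey(NamedTuple):
    """Components of an annotation rule key (illustrative only)."""
    tag: Optional[str]        # None means "any tag"
    attribute: Optional[str]  # None means "no attribute constraint"
    value: Optional[str]      # None means "any attribute value"


def parse_rule_key(key: str) -> RuleKey:
    """Split a key such as 'div#class=toc' into tag, attribute and value.

    Hypothetical helper: assumes the grammar [tag][#attribute[=value]].
    """
    tag, _, attr_part = key.partition('#')
    attribute, _, value = attr_part.partition('=')
    # Empty strings denote missing parts and are normalised to None.
    return RuleKey(tag or None, attribute or None, value or None)


for key in ('h1', 'div#class=toc', '#class=FactBox', '#cite'):
    print(key, '->', parse_rule_key(key))
```

How inscriptis matches these parts against a document (for example, whether the value must equal or merely be contained in the attribute) is not modelled here.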
Given these annotation rules the HTML file
<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the
Grisons and lies in the Grisonian Rhine Valley.
yields the following JSONL output
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
of the Grisons and lies in the Grisonian Rhine Valley.",
"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}
The provided list of labels contains all annotated text elements with their start index, end index and the assigned label.
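The offsets can be checked with a few lines of Python. The sketch below uses only the standard library and the record shown above; it relies on the end index being exclusive, which is what the example data suggests (the span 0 to 4 covers the four characters of "Chur").

```python
import json

# One JSONL record, copied from the example output above.
line = ('{"text": "Chur\\n\\nChur is the capital and largest town of the '
        'Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", '
        '"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}')

record = json.loads(line)
text = record['text']

# Each label is [start, end, name]; the annotated span is text[start:end].
spans = [(name, text[start:end]) for start, end, name in record['label']]
for name, surface in spans:
    print(f'{name}: {surface!r}')
```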
Annotation postprocessors
Annotation postprocessors enable the post processing of annotations to formats that are suitable for your particular application. Postprocessors can be specified with the -p or --postprocessor command line argument:
$ inscript.py https://www.fhgr.ch \
      -r ./examples/annotation-profile.json \
      -p tag
Output:
{"text": "  Chur\n\n  Chur is the capital and largest town of the Swiss
canton of the Grisons and lies in the Grisonian Rhine Valley.",
"label": [[0, 6, "heading"], [8, 14, "emphasis"]],
"tag": "<heading>Chur</heading>\n\n<emphasis>Chur</emphasis> is the
capital and largest town of the Swiss canton of the Grisons and
lies in the Grisonian Rhine Valley."}
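A tagged representation like the one in the output above can be approximated in a few lines. The function `tag_text` is a hypothetical re-implementation for illustration, not the code inscriptis ships; it assumes that the labelled spans do not overlap and it does not escape XML special characters.

```python
def tag_text(text, labels):
    """Wrap each labelled span in <label>...</label> tags.

    Illustrative sketch only; assumes the spans do not overlap.
    """
    # Work from the last span to the first so that inserting tags
    # does not shift the offsets of spans still to be processed.
    for start, end, name in sorted(labels, reverse=True):
        text = (text[:start] + f'<{name}>' + text[start:end]
                + f'</{name}>' + text[end:])
    return text


text = 'Chur\n\nChur is the capital of the Grisons.'
labels = [[0, 4, 'heading'], [6, 10, 'emphasis']]
tagged = tag_text(text, labels)
print(tagged)
```

Processing the spans back to front is the main design choice: every insertion happens to the right of the spans that remain, so their start and end indices stay valid.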
Currently, inscriptis supports the following postprocessors:
surface: returns an additional mapping between the annotation’s surface form and its label:
['heading': 'Chur', 'emphasis': 'Chur']
xml: returns an additional annotated text version:
<?xml version="1.0" encoding="UTF-8" ?>
<heading>Chur</heading>

<emphasis>Chur</emphasis> is the capital and largest town of the Swiss
canton of the Grisons and lies in the Grisonian Rhine Valley.
html: creates an HTML file which contains the converted text and highlights all annotations.
Web Service
The Flask Web Service translates HTML pages to the corresponding plain text.
Additional Requirements
python3-flask
Startup
Start the inscriptis Web service with the following command:
$ export FLASK_APP="web-service.py"
$ python3 -m flask run
Usage
The Web service receives the HTML file in the request body and returns the corresponding text. The file’s encoding needs to be specified in the Content-Type header (UTF-8 in the example below):
$ curl -X POST -H "Content-Type: text/html; encoding=UTF8" \
      --data-binary @test.html http://localhost:5000/get_text
The service also supports a version call:
$ curl http://localhost:5000/version
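The same call can be made from Python with the standard library. The sketch below mirrors the curl command above; the function name `build_get_text_request` is hypothetical, and the default host and port are the ones used in the curl examples. The request is only built here, since sending it requires a running service.

```python
import urllib.request


def build_get_text_request(html: bytes,
                           base_url: str = 'http://localhost:5000',
                           encoding: str = 'UTF8') -> urllib.request.Request:
    """Build the POST request for the get_text endpoint (not yet sent)."""
    return urllib.request.Request(
        url=f'{base_url}/get_text',
        data=html,
        # Same header as in the curl example above.
        headers={'Content-Type': f'text/html; encoding={encoding}'},
        method='POST',
    )


request = build_get_text_request('<p>Make it so!</p>'.encode('utf-8'))
print(request.get_method(), request.full_url)

# Sending the request requires a running service, for example:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode('utf-8'))
```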
Advanced topics
Annotated text
Inscriptis can provide annotations alongside the extracted text, which allows downstream components to draw upon semantics that are otherwise only available in the original HTML file.
The extracted text and annotations can be exported in different formats, including the popular JSONL format which is used by doccano.
Example output:
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
of the Grisons and lies in the Grisonian Rhine Valley.",
"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}
The output above is produced if inscriptis is run with the following annotation rules:
{
"h1": ["heading", "h1"],
"b": ["emphasis"]
}
The code below demonstrates how inscriptis’ annotation capabilities can be used within a program:
import urllib.request
from inscriptis import get_annotated_text, ParserConfig
url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
rules = {'h1': ['heading', 'h1'],
'h2': ['heading', 'h2'],
'b': ['emphasis'],
'table': ['table']
}
output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])
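To hand such results to a tool that reads JSONL, each result dictionary can be written as one JSON document per line. The following is a minimal sketch using only the standard library; `write_jsonl` is a hypothetical helper, and the record is an abridged stand-in for the dictionary returned by get_annotated_text().

```python
import io
import json

# Abridged stand-in for the dictionaries returned by get_annotated_text().
records = [
    {'text': 'Chur\n\nChur is the capital of the Grisons.',
     'label': [[0, 4, 'heading'], [0, 4, 'h1'], [6, 10, 'emphasis']]},
]


def write_jsonl(records, stream):
    """Write one JSON document per line (the JSONL convention)."""
    for record in records:
        # json.dumps escapes newlines inside strings, so every record
        # occupies exactly one line of output.
        stream.write(json.dumps(record) + '\n')


buffer = io.StringIO()  # use open('annotations.jsonl', 'w') to write a file
write_jsonl(records, buffer)

# Reading the lines back yields the original records.
restored = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(restored == records)
```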
Fine tuning
The following options are available for fine tuning inscriptis’ HTML rendering:
More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also use indentation for tags such as <div> and <span> that do not provide indentation in their standard definition. This strategy is the default in inscript.py and many other tools such as lynx. If you do not want extended indentation, you can use the parameter indentation='standard' instead.
Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
from inscriptis.html_properties import Display
from inscriptis.model.config import ParserConfig

# create a custom CSS based on the default style sheet and change the
# rendering of `div` and `span` elements
css = CSS_PROFILES['strict'].copy()
css['div'] = HtmlElement(display=Display.block, padding=2)
css['span'] = HtmlElement(prefix=' ', suffix=' ')

# `html` is assumed to hold the HTML document as a string
html_tree = fromstring(html)
# create a parser using a custom css
config = ParserConfig(css=css)
parser = Inscriptis(html_tree, config)
text = parser.get_text()
Changelog
A full list of changes can be found in the release notes.