
inscriptis - HTML to text converter.



A Python-based HTML to text conversion library, command line client and Web service with support for nested tables, a subset of CSS, and optional annotated output.

Inscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.

Please take a look at the Rendering document for a demonstration of inscriptis’ conversion quality.

A Java port of inscriptis 1.x has been published by x28.

This document provides a short introduction to Inscriptis.

Statement of need - why inscriptis?

  1. Inscriptis provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements.

    Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.

    Beautiful Soup’s get_text() function, for example, converts the following HTML enumeration to the string firstsecond.

    <ul>
      <li>first</li>
      <li>second</li>
    </ul>

    Inscriptis, in contrast, not only returns the correct output

    * first
    * second

    but also supports much more complex constructs such as nested tables, and interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align, etc.) attributes that determine text alignment. Whenever the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis), an accurate HTML to text conversion is essential.

  2. Inscriptis supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to

    • provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.

    • assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). Inscriptis supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool doccano.

    • enable the use of Inscriptis for tasks such as content extraction (i.e., extracting task-specific relevant content from a Web page), which rely on information about the HTML document’s structure.

Installation

At the command line:

$ pip install inscriptis

Or, if you don’t have pip installed:

$ easy_install inscriptis

If you want to install from the latest sources, you can do:

$ git clone https://github.com/weblyzard/inscriptis.git
$ cd inscriptis
$ python setup.py install

Python library

Embedding inscriptis into your code is easy, as outlined below:

import urllib.request
from inscriptis import get_text

url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

Standalone command line client

The command line client converts HTML files or text retrieved from Web pages to the corresponding text representation.

Command line parameters

The inscript.py command line client supports the following parameters:

usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
                   [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                   [input]

Convert the given HTML document to text.

positional arguments:
  input                 Html input either from a file or a URL (default:stdin).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).
  -i, --display-image-captions
                        Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -a, --display-anchor-urls
                        Display anchor URLs (default:false).
  -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                        Path to an optional JSON file containing rules for annotating the retrieved text.
  -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                        Optional component for postprocessing the result (html, surface, xml).
  --indentation INDENTATION
                        How to handle indentation (extended or strict; default: extended).
  --table-cell-separator TABLE_CELL_SEPARATOR
                        Separator to use between table cells (default: three spaces).
  -v, --version         display version information

HTML to text conversion

Convert the given page to text and output the result to the screen:

$ inscript.py https://www.fhgr.ch

Convert the file to text and save the output to fhgr.txt:

$ inscript.py fhgr.html -o fhgr.txt

Convert the file using strict indentation (i.e., minimize indentation and extra spaces) and save the output to fhgr-layout-optimized.txt:

$ inscript.py --indentation strict fhgr.html -o fhgr-layout-optimized.txt

Convert HTML provided via stdin and save the output to output.txt:

$ echo "<body><p>Make it so!</p></body>" | inscript.py -o output.txt

HTML to annotated text conversion

Convert and annotate HTML from a Web page using the provided annotation rules.

Download the example annotation-profile.json and save it to your working directory:

$ inscript.py https://www.fhgr.ch -r annotation-profile.json

The annotation rules are specified in annotation-profile.json:

{
 "h1": ["heading", "h1"],
 "h2": ["heading", "h2"],
 "b": ["emphasis"],
 "div#class=toc": ["table-of-contents"],
 "#class=FactBox": ["fact-box"],
 "#cite": ["citation"]
}

The dictionary maps an HTML tag and/or attribute to the annotations inscriptis should provide for them. In the example above, for instance, the tag h1 yields the annotations heading and h1, a div tag with a class that contains the value toc results in the annotation table-of-contents, and all tags with a cite attribute are annotated with citation.
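
The key syntax described above (an optional tag, optionally followed by # and an attribute, optionally followed by = and a value) can be illustrated with a small standalone sketch. The names RuleKey and parse_rule_key are hypothetical helpers introduced here for illustration only; they are not part of the inscriptis API, and the library's own rule parsing may differ in detail.

```python
from typing import NamedTuple, Optional


class RuleKey(NamedTuple):
    """Components of an annotation rule key (illustrative, not inscriptis API)."""
    tag: Optional[str]    # None means "any tag"
    attr: Optional[str]   # None means "no attribute condition"
    value: Optional[str]  # None means "the attribute only has to be present"


def parse_rule_key(key: str) -> RuleKey:
    """Split a rule key such as 'div#class=toc' into tag, attribute and value."""
    tag, _, attr_part = key.partition("#")
    if not attr_part:
        # Plain tag rule, e.g. 'h1'
        return RuleKey(tag or None, None, None)
    attr, _, value = attr_part.partition("=")
    return RuleKey(tag or None, attr, value or None)


for key in ("h1", "div#class=toc", "#class=FactBox", "#cite"):
    print(key, "->", parse_rule_key(key))
```

Reading the rules this way makes the three cases in the example profile explicit: a tag-only rule (h1), a tag plus attribute value rule (div#class=toc), and attribute-only rules that apply to any tag (#class=FactBox, #cite).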

Given these annotation rules the HTML file

<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the
Grisons and lies in the Grisonian Rhine Valley.

yields the following JSONL output

{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
          of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The provided list of labels contains all annotated text elements with their start index, end index and the assigned label.
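
As a minimal sketch of how a downstream component might consume such a record, the snippet below parses one JSONL line and recovers the text span behind each label. It assumes that the end index is exclusive, which matches the sample above (0 to 4 covers "Chur"). The record is a shortened, single-line variant of the example, and surface_forms is a hypothetical helper name.

```python
import json

# One JSONL record in the shape shown above (text shortened for brevity).
line = ('{"text": "Chur\\n\\nChur is the capital of the Grisons.", '
        '"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}')

record = json.loads(line)


def surface_forms(record: dict) -> list:
    """Return (label, covered text) pairs for every annotation in the record."""
    text = record["text"]
    return [(label, text[start:end]) for start, end, label in record["label"]]


for label, surface in surface_forms(record):
    print(f"{label}: {surface!r}")
```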

Annotation postprocessors

Annotation postprocessors convert annotations into formats that are suitable for your particular application. Postprocessors can be specified with the -p or --postprocessor command line argument:

$ inscript.py https://www.fhgr.ch \
        -r ./examples/annotation-profile.json \
        -p surface

Output:

{"text": "  Chur\n\n  Chur is the capital and largest town of the Swiss
          canton of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 6, "heading"], [8, 14, "emphasis"]],
 "tag": "<heading>Chur</heading>\n\n<emphasis>Chur</emphasis> is the
        capital and largest town of the Swiss canton of the Grisons and
        lies in the Grisonian Rhine Valley."}

Currently, inscriptis supports the following postprocessors:

  • surface: returns a list of mappings between each annotation’s surface form and its label:

    [
       ['heading', 'Chur'],
       ['emphasis', 'Chur']
    ]
  • xml: returns an additional annotated text version:

    <?xml version="1.0" encoding="UTF-8" ?>
    <heading>Chur</heading>
    
    <emphasis>Chur</emphasis> is the capital and largest town of the Swiss
    canton of the Grisons and lies in the Grisonian Rhine Valley.
  • html: creates an HTML file which contains the converted text and highlights all annotations as outlined below:

Annotations extracted from the Wikipedia entry for Chur with the --postprocessor html postprocessor.

Snippet of the rendered HTML file created with the following command line options and annotation rules:

inscript.py --annotation-rules ./wikipedia.json \
            --postprocessor html \
            https://en.wikipedia.org/wiki/Chur.html

Annotation rules encoded in the wikipedia.json file:

{
  "h1": ["heading"],
  "h2": ["heading"],
  "h3": ["subheading"],
  "h4": ["subheading"],
  "h5": ["subheading"],
  "i": ["emphasis"],
  "b": ["bold"],
  "table": ["table"],
  "th": ["tableheading"],
  "a": ["link"]
}

Web Service

The Flask Web Service translates HTML pages to the corresponding plain text.

Run the Web Service on your host system

Install the additional requirement python3-flask, then start the inscriptis Web service with the following commands:

$ export FLASK_APP="inscriptis.service.web"
$ python3 -m flask run

Run the Web Service with Docker

The docker definition can be found here:

$ docker pull ghcr.io/weblyzard/inscriptis:latest
$ docker run --name inscriptis ghcr.io/weblyzard/inscriptis:latest

Run as Kubernetes Deployment

The helm chart for deployment on a kubernetes cluster is located in the inscriptis-helm repository.

Use the Web Service

The Web service receives the HTML file in the request body and returns the corresponding text. The file’s encoding needs to be specified in the Content-Type header (UTF-8 in the example below):

$ curl -X POST  -H "Content-Type: text/html; encoding=UTF8"  \
        --data-binary @test.html  http://localhost:5000/get_text
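
The same request can be sketched from Python with the standard library. The helper names build_get_text_request and fetch_text are hypothetical, the base URL is taken from the curl example, and decoding the response as UTF-8 is an assumption about the service's output rather than documented behavior.

```python
import urllib.request


def build_get_text_request(html: bytes,
                           base_url: str = "http://localhost:5000",
                           encoding: str = "UTF8") -> urllib.request.Request:
    """Build a POST request that mirrors the curl call above."""
    return urllib.request.Request(
        url=f"{base_url}/get_text",
        data=html,  # raw HTML bytes as the request body
        headers={"Content-Type": f"text/html; encoding={encoding}"},
        method="POST",
    )


def fetch_text(html: bytes, base_url: str = "http://localhost:5000") -> str:
    """Send the request; this requires a running inscriptis Web service."""
    request = build_get_text_request(html, base_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")


request = build_get_text_request(b"<body><p>Make it so!</p></body>")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the header and URL construction easy to inspect without a running service; call fetch_text() once the service is up.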

The service also supports a version call:

$ curl http://localhost:5000/version

Example annotation profiles

The following section provides a number of example annotation profiles illustrating the use of Inscriptis’ annotation support. Each example presents the annotation rules used and an image that highlights a snippet of the annotated text on the converted Web page, created using the HTML postprocessor as outlined in Section annotation postprocessors.

Wikipedia tables and table metadata

The following annotation rules extract tables from Wikipedia pages, and annotate table headings that are typically used to indicate column or row headings.

{
   "table": ["table"],
   "th": ["tableheading"],
   "caption": ["caption"]
}

The figure below outlines an example table from Wikipedia that has been annotated using these rules.

Table and table metadata annotations extracted from the Wikipedia entry for Chur.

References to entities, missing entities and citations from Wikipedia

This profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn’t perfect, since it also annotates [ edit ] links.

{
   "a#title": ["entity"],
   "a#class=new": ["missing"],
   "#class=reference": ["citation"]
}

The figure shows entities and citations that have been identified on a Wikipedia page using these rules.

Metadata on entries, missing entries and citations extracted from the Wikipedia entry for Chur.

Posts and post metadata from the XDA developer forum

The annotation rules below extract posts with metadata on the post’s time, user and the user’s job title from the XDA developer forum.

{
    "article#class=message-body": ["article"],
    "li#class=u-concealed": ["time"],
    "#itemprop=name": ["user-name"],
    "#itemprop=jobTitle": ["user-title"]
}

The figure illustrates the annotated metadata on posts from the XDA developer forum.

Posts and post metadata extracted from the XDA developer forum.

Code and metadata from Stackoverflow pages

The rules below extract code and metadata on users and comments from Stackoverflow pages.

{
   "code": ["code"],
   "#itemprop=dateCreated": ["creation-date"],
   "#class=user-details": ["user"],
   "#class=reputation-score": ["reputation"],
   "#class=comment-date": ["comment-date"],
   "#class=comment-copy": ["comment-comment"]
}

Applying these rules to a Stackoverflow page on text extraction from HTML yields the following snippet:

Code and metadata from Stackoverflow pages.

Advanced topics

Annotated text

Inscriptis can provide annotations alongside the extracted text, which allows downstream components to draw upon semantics that would otherwise only be available in the original HTML file.

The extracted text and annotations can be exported in different formats, including the popular JSONL format which is used by doccano.

Example output:

{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
          of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The output above is produced if inscriptis is run with the following annotation rules:

{
 "h1": ["heading", "h1"],
 "b": ["emphasis"]
}

The code below demonstrates how inscriptis’ annotation capabilities can be used within a program:

import urllib.request
from inscriptis import get_annotated_text, ParserConfig

url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

rules = {'h1': ['heading', 'h1'],
         'h2': ['heading', 'h2'],
         'b': ['emphasis'],
         'table': ['table']
        }

output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])

Fine tuning

The following options are available for fine tuning inscriptis’ HTML rendering:

  1. More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also use indentation for tags such as <div> and <span> that do not provide indentation in their standard definition. This strategy is the default in inscript.py and many other tools such as Lynx. If you do not want extended indentation, use the parameter indentation='strict' instead.

  2. Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:

    from lxml.html import fromstring
    from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
    from inscriptis.html_properties import Display
    from inscriptis.model.config import ParserConfig
    from inscriptis.html_engine import Inscriptis

    # create a custom CSS based on the default style sheet and change the
    # rendering of `div` and `span` elements
    css = CSS_PROFILES['strict'].copy()
    css['div'] = HtmlElement(display=Display.block, padding=2)
    css['span'] = HtmlElement(prefix=' ', suffix=' ')

    html_tree = fromstring(html)
    # create a parser using a custom css
    config = ParserConfig(css=css)
    parser = Inscriptis(html_tree, config)
    text = parser.get_text()

Custom HTML tag handling

If the fine-tuning options discussed above are not sufficient, you may even override Inscriptis’ handling of start and end tags as outlined below:

inscriptis = Inscriptis(html_tree, config)

inscriptis.start_tag_handler_dict['a'] = my_handle_start_a
inscriptis.end_tag_handler_dict['a'] = my_handle_end_a
text = inscriptis.get_text()

In the example the standard HTML handlers for the a tag are overwritten with custom versions (i.e., my_handle_start_a and my_handle_end_a). You may define custom handlers for any tag, regardless of whether it already exists in start_tag_handler_dict or end_tag_handler_dict.

Optimizing memory consumption

Inscriptis uses the Python lxml library, which prefers to reuse memory rather than release it to the operating system. This behavior might lead to increased memory consumption if you use inscriptis within a Web service that parses very complex HTML pages.

The following code mitigates this problem on Unix systems by manually requesting that the allocated memory be released (it relies on glibc's malloc_trim, loaded from libc.so.6):

import ctypes


def trim_memory() -> int:
    # malloc_trim(0) asks glibc to return free heap memory to the OS
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)
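
A guarded variant may be useful when the same code also has to run on platforms without glibc. The sketch below is an illustration only: trim_memory_if_supported and convert_batch are hypothetical names, the trimming interval is an arbitrary choice, and convert stands in for whatever conversion function you use (for example inscriptis' get_text).

```python
import ctypes


def trim_memory_if_supported() -> int:
    """Call malloc_trim(0) when glibc is available; otherwise do nothing.

    Returns malloc_trim's result (1 if memory was released, 0 if not),
    or 0 when the call is unavailable on this platform.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")
        return int(libc.malloc_trim(0))
    except (OSError, AttributeError):
        # libc.so.6 or malloc_trim is missing (e.g., macOS, Windows, musl).
        return 0


def convert_batch(documents, convert, trim_every=100):
    """Apply `convert` to each document, trimming every `trim_every` items."""
    results = []
    for count, document in enumerate(documents, start=1):
        results.append(convert(document))
        if count % trim_every == 0:
            trim_memory_if_supported()
    return results


# str.upper is only a placeholder for a real HTML to text conversion function.
print(convert_batch(["<p>a</p>", "<p>b</p>"], convert=str.upper, trim_every=1))
```

Trimming after every request is usually unnecessary; calling it periodically keeps the overhead low while still bounding memory growth.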

Citation

There is a Journal of Open Source Software paper you can cite for Inscriptis:

@article{Weichselbraun2021,
  doi = {10.21105/joss.03557},
  url = {https://doi.org/10.21105/joss.03557},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {66},
  pages = {3557},
  author = {Albert Weichselbraun},
  title = {Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web},
  journal = {Journal of Open Source Software}
}

Changelog

A full list of changes can be found in the release notes.
