inscriptis - HTML to text converter.

inscriptis

A Python-based HTML to text conversion library, command line client and Web service with support for nested tables and a subset of CSS. Please take a look at the Rendering document for a demonstration of inscriptis' conversion quality.

Table of Contents
  1. Requirements and installation
  2. Command line client
  3. Python library
  4. Web service
  5. Fine tuning
  6. Testing, benchmarking and evaluation
  7. Changelog

Requirements and installation

Requirements

  • Python 3.5+ (preferred) or Python 2.7+
  • lxml
  • requests

Installation

sudo python3 setup.py install

Command line client

The command line client converts HTML files or HTML retrieved from Web pages to the corresponding text representation.

Command line parameters

usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url (default:stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  --indentation
                        How to handle indentation (extended or standard; default: extended)
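
The options and defaults listed above can be summarised with a small argparse sketch. This is a hypothetical reconstruction from the help text only, not the actual source of inscript.py; the function name build_parser and the handling of a missing input argument are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Return a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="inscript.py",
        description="Converts HTML from file or url to a clean text version",
    )
    # nargs="?" makes the positional optional, so stdin can be used when omitted.
    parser.add_argument("input", nargs="?", default=None,
                        help="HTML input from a file or a URL (default: stdin)")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file (default: stdout)")
    parser.add_argument("-e", "--encoding", default="utf-8",
                        help="Content encoding for files (default: utf-8)")
    parser.add_argument("-i", "--display-image-captions", action="store_true",
                        help="Display image captions (default: false)")
    parser.add_argument("-l", "--display-link-targets", action="store_true",
                        help="Display link targets (default: false)")
    parser.add_argument("-d", "--deduplicate-image-captions", action="store_true",
                        help="Deduplicate image captions (default: false)")
    parser.add_argument("--indentation", choices=("extended", "standard"),
                        default="extended",
                        help="How to handle indentation (default: extended)")
    return parser


# Parse the arguments of the second example from the Examples section, plus -l.
args = build_parser().parse_args(["fhgr.html", "-o", "fhgr.txt", "-l"])
print(args.input, args.output, args.display_link_targets, args.indentation)
```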

Examples

# convert the given page to text and output the result to the screen
inscript.py https://www.fhgr.ch

# convert the file to text and save the output to fhgr.txt
inscript.py fhgr.html -o fhgr.txt

# convert the HTML provided via stdin and save the output to output.txt
echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt

Python library

Embedding inscriptis into your code is easy, as outlined below:

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)
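
The example above assumes that the page is encoded in UTF-8. A minimal sketch of a more defensive download step, using only the standard library, could read the charset from the Content-Type header and fall back to UTF-8. The helper names decode_body and fetch_html are hypothetical and not part of inscriptis.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str], fallback: str = "utf-8") -> str:
    """Decode raw response bytes using the declared charset, or a fallback."""
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: use the fallback
        # encoding and replace bytes that cannot be decoded.
        return raw.decode(fallback, errors="replace")


def fetch_html(url: str) -> str:
    """Download a page and decode it with the charset from its headers."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset()  # None if not declared
        return decode_body(response.read(), charset)
```

The string returned by fetch_html(url) can then be passed to get_text() in place of the manually decoded html variable.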

Flask Web Service

The Flask Web Service translates HTML pages to the corresponding plain text.

Additional Requirements

  • python3-flask

Startup

export FLASK_APP="web-service.py"
python3 -m flask run

Usage

The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified in the Content-Type header (UTF-8 in the example below).

curl -X POST  -H "Content-Type: text/html; encoding=UTF8" -d @test.html  http://localhost:5000/get_text
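
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names build_request and convert are hypothetical, and decoding the response as UTF-8 is an assumption about the service's output.

```python
import urllib.request

SERVICE_URL = "http://localhost:5000/get_text"  # endpoint from the curl example


def build_request(html: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Build a POST request that carries the HTML in the request body."""
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        # Same header value as in the curl example above.
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def convert(html: str, url: str = SERVICE_URL) -> str:
    """Send the HTML to a running Web service and return the text it replies with."""
    with urllib.request.urlopen(build_request(html, url)) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

Calling convert('<p>Make it so!</p>') requires the Flask service to be running locally as described under Startup.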

Fine tuning

The following options are available for fine tuning the way inscriptis translates HTML to text.

  1. More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also use indentation for tags such as <div> and <span> that do not provide indentation in their standard definition. This strategy is the default in inscript.py and many other tools such as lynx. If you do not want extended indentation you can use the parameter indentation='standard' instead.

  2. Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:

    from lxml.html import fromstring

    from inscriptis.css import DEFAULT_CSS, HtmlElement
    # assumption: the Inscriptis parser class is provided by inscriptis.html_engine
    from inscriptis.html_engine import Inscriptis
    from inscriptis.html_properties import Display

    # example input; replace with the HTML document to convert
    html = '<div>Make <span>it</span> so!</div>'

    # create a custom CSS based on the default style sheet and change the rendering of `div` and `span` elements
    css = DEFAULT_CSS.copy()
    css['div'] = HtmlElement('div', display=Display.block, padding=2)
    css['span'] = HtmlElement('span', prefix=' ', suffix=' ')

    html_tree = fromstring(html)
    # create a parser using the custom css
    parser = Inscriptis(html_tree,
                        display_images=False,
                        deduplicate_captions=False,
                        display_links=False,
                        css=css)
    text = parser.get_text()

Testing, benchmarking and evaluation

Unit tests

Test cases for the HTML to text conversion are located in the tests/html directory. Each test case consists of two files:

  1. test-name.html and
  2. test-name.txt

The latter contains the reference text output for the given HTML file.
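
A minimal sketch of how such file pairs could be collected and checked is given below. It is not the project's actual test runner: the function names are hypothetical, the converter is passed in as a callable, and ignoring trailing whitespace in the comparison is an assumption.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def find_test_cases(test_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (html_file, reference_txt_file) pairs found in test_dir."""
    pairs = []
    for html_file in sorted(test_dir.glob("*.html")):
        reference = html_file.with_suffix(".txt")
        if reference.exists():  # HTML files without a reference are skipped
            pairs.append((html_file, reference))
    return pairs


def failing_cases(test_dir: Path, convert: Callable[[str], str]) -> List[str]:
    """Return the names of test cases whose output differs from the reference."""
    failures = []
    for html_file, reference in find_test_cases(test_dir):
        actual = convert(html_file.read_text(encoding="utf-8"))
        expected = reference.read_text(encoding="utf-8")
        # Assumption: trailing whitespace is not significant for the comparison.
        if actual.rstrip() != expected.rstrip():
            failures.append(html_file.stem)
    return failures
```

With inscriptis installed, failing_cases(Path('tests/html'), get_text) would list the test cases whose conversion deviates from the reference.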

Text conversion output comparison and speed benchmarking

inscriptis offers a small benchmarking script that can compare different HTML to text conversion approaches. The script runs the different approaches on a list of URLs, url_list.txt, and saves the text output into a timestamped folder in benchmarking/benchmarking_results for manual comparison. Additionally, the processing speed of every approach is measured per URL and saved in a text file called speed_comparisons.txt in the respective timestamped folder.

To run the benchmarking script, execute run_benchmarking.py from within the benchmarking folder. In pipeline(), select which HTML to text algorithms are executed by modifying the following flags:

run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
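
The effect of these flags can be illustrated with a small, self-contained sketch that runs only the enabled converters and times each one. It is not taken from run_benchmarking.py: the function run_enabled, the dictionary layout and the stand-in converters are all hypothetical.

```python
import time
from typing import Callable, Dict


def run_enabled(converters: Dict[str, Callable[[str], str]],
                enabled: Dict[str, bool],
                html: str) -> Dict[str, float]:
    """Run every enabled converter on html and return its runtime in seconds."""
    timings = {}
    for name, convert in converters.items():
        if not enabled.get(name, False):
            continue  # skipped, comparable to setting run_<name> = False
        start = time.perf_counter()
        convert(html)
        timings[name] = time.perf_counter() - start
    return timings


# Hypothetical stand-ins for the real converters (lynx, justext, ...).
converters = {
    "inscriptis": lambda html: html.upper(),
    "lynx": lambda html: html.lower(),
}
enabled = {"inscriptis": True, "lynx": False}

timings = run_enabled(converters, enabled, "<p>Make it so!</p>")
print(sorted(timings))  # only the enabled converter is listed
```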

In url_list.txt the URLs to be parsed can be specified by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://) e.g.

http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
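
Before starting a benchmark run, it can be useful to check that every entry in url_list.txt is complete. The following sketch shows one possible check; the function read_url_list is hypothetical and not part of the benchmarking script.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def read_url_list(lines: Iterable[str]) -> List[str]:
    """Return the complete http(s) URLs from lines, skipping blank lines."""
    urls = []
    for line in lines:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        # A complete URL needs both a scheme (http/https) and a host name.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(
                "incomplete URL (missing http:// or https://): {!r}".format(url))
        urls.append(url)
    return urls
```

Since an open file yields its lines when iterated, the function can be called as read_url_list(open('url_list.txt')).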

Changelog

See the release notes.
