
inscriptis - HTML to text converter.

A Python-based HTML-to-text conversion library, command line client and Web service with support for nested tables and a subset of CSS. Please take a look at the Rendering document for a demonstration of inscriptis’ conversion quality.

Documentation

The full documentation is built automatically and published on Read the Docs.

Table of Contents

  1. Installation

  2. Python library

  3. Standalone command line client

  4. Web service

  5. Fine tuning

  6. Changelog

Installation

At the command line:

$ pip install inscriptis

Or, if you don’t have pip installed:

$ easy_install inscriptis

If you want to install from the latest sources, you can do:

$ git clone https://github.com/weblyzard/inscriptis.git
$ cd inscriptis
$ python setup.py install

Python library

Embedding inscriptis into your code is easy, as outlined below:

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)
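
The snippet above assumes that the page is encoded as UTF-8. The following is a minimal sketch of a more defensive variant that uses the charset declared in the HTTP response headers, if any, and only then falls back to UTF-8. The helper names `decode_html` and `url_to_text` are hypothetical and not part of inscriptis; only `get_text` comes from the library.

```python
import urllib.request


def decode_html(raw, declared_charset=None, fallback="utf-8"):
    """Decode raw bytes with the declared charset, falling back if needed."""
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # unknown codec name or undecodable bytes: use the fallback and
        # replace offending bytes rather than failing
        return raw.decode(fallback, errors="replace")


def url_to_text(url):
    """Fetch a page and convert it to text (hypothetical helper)."""
    # imported here so that decode_html stays usable without inscriptis
    from inscriptis import get_text

    with urllib.request.urlopen(url) as response:
        # charset from the Content-Type header, or None if not declared
        charset = response.headers.get_content_charset()
        html = decode_html(response.read(), charset)
    return get_text(html)
```

Calling `url_to_text("http://www.informationscience.ch")` would then replace the fetch, decode and convert steps shown above.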

Standalone command line client

The command line client converts HTML files or text retrieved from Web pages to the corresponding text representation.

Command line parameters

The inscript.py command line client supports the following parameters:

usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url (default:stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  --indentation
                        How to handle indentation (extended or standard; default: extended)

Examples

Convert the given page to text and output the result to the screen:

$ inscript.py https://www.fhgr.ch

Convert the file to text and save the output to fhgr.txt:

$ inscript.py fhgr.html -o fhgr.txt

Convert HTML provided via stdin and save the output to output.txt:

$ echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt
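
If you would rather drive the command line client from a Python script, the calls above can be sketched with `subprocess`. This is an illustration only: it assumes that `inscript.py` is on your `PATH`, and the helpers `build_inscript_command` and `html_to_text_via_cli` are hypothetical names. The flags are the ones listed in the help output above.

```python
import subprocess


def build_inscript_command(source=None, output=None, encoding=None,
                           link_targets=False, image_captions=False):
    """Assemble an inscript.py command line from the documented flags."""
    cmd = ["inscript.py"]
    if output:
        cmd += ["-o", output]
    if encoding:
        cmd += ["-e", encoding]
    if image_captions:
        cmd.append("-i")
    if link_targets:
        cmd.append("-l")
    if source:
        # file name or URL; when omitted, inscript.py reads from stdin
        cmd.append(source)
    return cmd


def html_to_text_via_cli(html):
    """Pipe an HTML string through inscript.py and return its stdout."""
    result = subprocess.run(build_inscript_command(), input=html, text=True,
                            capture_output=True, check=True)
    return result.stdout
```

For instance, `build_inscript_command("fhgr.html", output="fhgr.txt")` yields the argument list for the second example above.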

Web service

The Flask Web Service translates HTML pages to the corresponding plain text.

Additional Requirements

  • python3-flask

Startup

Start the inscriptis Web service with the following command:

$ export FLASK_APP="web-service.py"
$ python3 -m flask run

Usage

The Web service receives the HTML file in the request body and returns the corresponding text. The file’s encoding needs to be specified in the Content-Type header (UTF-8 in the example below):

$ curl -X POST  -H "Content-Type: text/html; encoding=UTF8" -d @test.html  http://localhost:5000/get_text
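
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above, assuming the service runs on `localhost:5000` and returns UTF-8 encoded text (the response encoding is an assumption, not something stated here). The function names are hypothetical.

```python
import urllib.request


def build_get_text_request(html, base_url="http://localhost:5000",
                           encoding="UTF8"):
    """Build a POST request for the /get_text endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/get_text",
        data=html.encode(encoding),
        # same header format as in the curl example
        headers={"Content-Type": f"text/html; encoding={encoding}"},
        method="POST",
    )


def html_to_text_via_service(html):
    """Send the HTML to a running Web service and return the text."""
    with urllib.request.urlopen(build_get_text_request(html)) as response:
        # assumption: the service answers with UTF-8 encoded text
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it easy to inspect the URL, headers and body without a running service.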

Fine tuning

The following options are available for fine tuning inscriptis’ HTML rendering:

  1. More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also use indentation for tags such as <div> and <span> that do not provide indentation in their standard definition. This strategy is the default in inscript.py and in many other tools such as lynx. If you do not want extended indentation, use the parameter indentation='standard' instead.

  2. Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:

    from lxml.html import fromstring
    from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
    # assumption: the Inscriptis class is importable from inscriptis.html_engine
    from inscriptis.html_engine import Inscriptis
    from inscriptis.html_properties import Display

    # example input; replace with your own HTML
    html = '<div>first</div><span>second</span>'

    # create a custom CSS based on the `strict` profile and change the rendering of `div` and `span` elements
    css = CSS_PROFILES['strict'].copy()
    css['div'] = HtmlElement('div', display=Display.block, padding=2)
    css['span'] = HtmlElement('span', prefix=' ', suffix=' ')

    html_tree = fromstring(html)
    # create a parser using the custom css; the display options are set
    # explicitly here and can be adjusted as needed
    parser = Inscriptis(html_tree,
                        display_images=False,
                        deduplicate_captions=False,
                        display_links=False,
                        css=css)
    text = parser.get_text()

Changelog

A full list of changes can be found in the release notes.
