inscriptis - HTML to text converter.

# inscriptis

A Python-based HTML-to-text converter with minimal support for CSS.

### Requirements
* Python 3.4+ (preferred) or Python 2.7+
* lxml

### Usage

#### Command line
The command line client converts HTML files or HTML retrieved from Web pages to the
corresponding text representation.

***Installation***
```bash
sudo python3 setup.py install
```

***Command line parameters***
```bash
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url (default:stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
```
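
For readers who want to script against the same options, the help text above could be produced by an `argparse` definition along the following lines. This is only a sketch reconstructed from the help output, not the project's actual source; `build_parser` is a hypothetical name, and making `input` optional (`nargs="?"`) is an assumption based on the documented stdin default.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Converts HTML from file or url to a clean text version")
    # Assumption: the input may be omitted, in which case stdin is read.
    parser.add_argument("input", nargs="?", default=None,
                        help="Html input either from a file or an url (default:stdin)")
    parser.add_argument("-o", "--output",
                        help="Output file (default:stdout).")
    parser.add_argument("-e", "--encoding", default="utf-8",
                        help="Content encoding for files (default:utf-8)")
    # The three switches are plain boolean flags that default to False.
    parser.add_argument("-i", "--display-image-captions", action="store_true",
                        help="Display image captions (default:false).")
    parser.add_argument("-l", "--display-link-targets", action="store_true",
                        help="Display link targets (default:false).")
    parser.add_argument("-d", "--deduplicate-image-captions", action="store_true",
                        help="Deduplicate image captions (default:false).")
    return parser
```

`argparse` converts the dashes in the long option names to underscores, so the flags are available as `args.display_link_targets` and so on.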

***Examples***
```bash
# convert the given page to text and output the result to the screen
inscript.py http://www.htwchur.ch

# convert the file to text and save the output to htwchur.txt
inscript.py htwchur.html -o htwchur.txt

# convert the HTML provided via stdin and save the output to htwchur.txt
echo '<body><p>Make it so!</p></body>' | inscript.py -o htwchur.txt
```


#### Library

```python
import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)
```
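
The snippet above assumes that the page is UTF-8 encoded. A minimal sketch of a more defensive variant follows, assuming you want to honor the charset declared in the HTTP response and fall back to UTF-8 otherwise. `decode_html` and `fetch_text` are hypothetical helper names and not part of inscriptis; only `get_text` comes from the library.

```python
import urllib.request


def decode_html(raw, charset=None, fallback="utf-8"):
    """Decode raw bytes with the declared charset, falling back to UTF-8."""
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: decode leniently instead of failing.
        return raw.decode(fallback, errors="replace")


def fetch_text(url):
    """Download a page and convert it to text (hypothetical helper)."""
    # Imported lazily so that decode_html can be used without inscriptis.
    from inscriptis import get_text

    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None if the server declares no charset.
        charset = response.headers.get_content_charset()
        html = decode_html(response.read(), charset)
    return get_text(html)
```

With these helpers, `print(fetch_text("http://www.informationscience.ch"))` would replace the three steps of the original example.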

### Unit tests

Test cases for the HTML-to-text conversion are located in the `tests/html` directory. Each test case consists of two files:

1. `test-name.html` and
2. `test-name.txt`

The latter contains the reference text output for the given HTML file.
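
The pairing of HTML input and reference output can be checked with a few lines of Python. The following is a rough sketch rather than the project's own test runner: `find_test_cases`, `read_file` and `failing_test_cases` are hypothetical names, and ignoring trailing whitespace in the comparison is an assumption.

```python
import glob
import os


def find_test_cases(test_dir):
    """Return (html_path, txt_path) pairs for every complete test case."""
    cases = []
    for html_path in sorted(glob.glob(os.path.join(test_dir, "*.html"))):
        txt_path = os.path.splitext(html_path)[0] + ".txt"
        # Skip HTML files that have no reference output next to them.
        if os.path.exists(txt_path):
            cases.append((html_path, txt_path))
    return cases


def read_file(path):
    """Read a UTF-8 encoded text file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def failing_test_cases(test_dir, convert):
    """Return the HTML file names whose converted text differs from the reference."""
    failed = []
    for html_path, txt_path in find_test_cases(test_dir):
        expected = read_file(txt_path)
        actual = convert(read_file(html_path))
        # Assumption: trailing whitespace is not significant.
        if actual.rstrip() != expected.rstrip():
            failed.append(os.path.basename(html_path))
    return failed
```

Passing the converter as an argument keeps the sketch independent of the library; with inscriptis installed, `failing_test_cases("tests/html", get_text)` would run it against the bundled test cases.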

### Text conversion output comparison and speed benchmarking
inscriptis offers a small benchmarking script that compares different HTML-to-text conversion approaches.
The script runs each approach on a list of URLs, ```url_list.txt```, and saves the text output into a time-stamped folder in ```benchmarking/benchmarking_results``` for manual comparison.
Additionally, the processing speed of every approach per URL is measured and saved in a text file called ```speed_comparisons.txt``` in the respective time-stamped folder.
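
The general idea of such a benchmark can be sketched as follows. This is not the code of `run_benchmarking.py`: the function name `benchmark`, the timestamp format, the per-page output file names and the tab-separated layout of `speed_comparisons.txt` are all assumptions made for illustration, and the pages are passed in as already downloaded HTML.

```python
import os
import time
from datetime import datetime


def benchmark(converters, pages, result_root):
    """Run every converter on every page and record the elapsed time.

    converters: dict mapping a name to a function html -> text
    pages: dict mapping a URL to its already downloaded HTML
    """
    # Assumption: one folder per run, named after the start time.
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = os.path.join(result_root, stamp)
    os.makedirs(run_dir)

    timings = []
    for page_no, (url, html) in enumerate(sorted(pages.items())):
        for name, convert in sorted(converters.items()):
            start = time.perf_counter()
            text = convert(html)
            elapsed = time.perf_counter() - start
            timings.append((url, name, elapsed))
            # One output file per page and converter for manual comparison.
            out_path = os.path.join(run_dir, "{}_{}.txt".format(page_no, name))
            with open(out_path, "w", encoding="utf-8") as f:
                f.write(text)

    speed_path = os.path.join(run_dir, "speed_comparisons.txt")
    with open(speed_path, "w", encoding="utf-8") as f:
        for url, name, elapsed in timings:
            f.write("{}\t{}\t{:.6f}\n".format(url, name, elapsed))
    return run_dir, timings
```

Timing only the conversion call, not the download, keeps network latency out of the comparison.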

To run the benchmarking script, execute ```run_benchmarking.py``` from within the folder ```benchmarking```.
In ```def pipeline()```, select which HTML-to-text algorithms are executed by modifying
```python
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
```

In ```url_list.txt```, specify the URLs to be parsed by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://),
e.g.
```
http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
```
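
Since incomplete URLs are easy to overlook, it can help to check the list before starting a run. The following sketch shows one way to do that with the standard library; `load_url_list` is a hypothetical helper and not part of the benchmarking script, and skipping blank lines is an assumption.

```python
from urllib.parse import urlparse


def load_url_list(path):
    """Read one URL per line, skipping blank lines and rejecting incomplete URLs."""
    urls = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            url = line.strip()
            if not url:
                continue
            parsed = urlparse(url)
            # A complete URL needs an http(s) scheme and a host name.
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                raise ValueError(
                    "line {}: incomplete URL {!r}".format(line_no, url))
            urls.append(url)
    return urls
```

A line such as `www.informationscience.ch` would be rejected because it lacks the scheme.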

### Flask Web Service

The Flask Web Service translates HTML pages to the corresponding plain text.

#### Requirements

* python3-flask

#### Startup

```bash
export FLASK_APP="web-service.py"
python3 -m flask run
```

#### Usage
The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
in the `Content-Type` header (`UTF8` in the example below).

```bash
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
```
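
The same request can be sent from Python with the standard library. The sketch below mirrors the `curl` call above, including its `encoding=` parameter in the `Content-Type` header; `build_get_text_request` and `html_to_text` are hypothetical helper names, and decoding the reply as UTF-8 is an assumption about the service's response.

```python
import urllib.request


def build_get_text_request(html, encoding="UTF8",
                           endpoint="http://localhost:5000/get_text"):
    """Build a POST request that carries the HTML in the request body."""
    headers = {"Content-Type": "text/html; encoding={}".format(encoding)}
    return urllib.request.Request(endpoint, data=html.encode(encoding),
                                  headers=headers, method="POST")


def html_to_text(html, encoding="UTF8"):
    """Send the HTML to a running web service and return the text reply."""
    request = build_get_text_request(html, encoding)
    with urllib.request.urlopen(request) as response:
        # Assumption: the service replies with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

With the service running locally, `html_to_text("<body><p>Make it so!</p></body>")` would return the converted text.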

### Changelog

See the [release notes](https://github.com/weblyzard/inscriptis/releases).

