inscriptis - HTML to text converter.
# inscriptis
A Python-based HTML-to-text converter with minimal support for CSS.
### Requirements
* Python 3.4+ (preferred) or Python 2.7+
* lxml
### Usage
#### Command line
The command line client converts HTML files or HTML retrieved from Web pages to the
corresponding text representation.
***Installation***
```bash
sudo python3 setup.py install
```
***Command line parameters***
```bash
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url (default:stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
```
***Examples***
```
# convert the given page to text and output the result to the screen
inscript.py http://www.htwchur.ch

# convert the file to text and save the output to htwchur.txt
inscript.py htwchur.html -o htwchur.txt

# convert the HTML provided via stdin and save the output to htwchur.txt
echo '<body><p>Make it so!</p></body>' | inscript.py -o htwchur.txt
```
#### Library
```python
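import urllib.request

from inscriptis import get_text

# Download the page and decode it (the example assumes a UTF-8 encoded page).
url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

# Convert the HTML to plain text and print it.
text = get_text(html)
print(text)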
```
### Unit tests
Test cases for the HTML-to-text conversion are located in the `tests/html` directory. Each test case consists of two files:
1. `test-name.html` and
2. `test-name.txt`

The latter contains the reference text output for the given HTML file.
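
The pairing of input and reference files can be checked with a few lines of Python. The following is a minimal sketch rather than the project's actual test runner: the function name `check_test_pairs` is hypothetical, the files are assumed to be UTF-8 encoded, and ignoring trailing whitespace in the comparison is a choice made for this sketch only. The converter is passed in as a function, so any HTML-to-text function can be checked.

```python
from pathlib import Path


def check_test_pairs(test_dir, convert):
    """Return the names of test cases whose output differs from the reference.

    `convert` is any function that maps an HTML string to a text string.
    """
    failures = []
    for html_file in sorted(Path(test_dir).glob("*.html")):
        reference_file = html_file.with_suffix(".txt")
        if not reference_file.exists():
            # A test case without a reference file counts as a failure here.
            failures.append(html_file.stem)
            continue
        html = html_file.read_text(encoding="utf-8")
        expected = reference_file.read_text(encoding="utf-8")
        # Trailing whitespace is ignored (an assumption of this sketch).
        if convert(html).rstrip() != expected.rstrip():
            failures.append(html_file.stem)
    return failures
```

With inscriptis installed, `check_test_pairs("tests/html", get_text)` would list the test cases whose output deviates from the stored reference.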
### Text conversion output comparison and speed benchmarking
inscriptis offers a small benchmarking script that compares different HTML-to-text conversion approaches.
The script runs the different approaches on a list of URLs, ```url_list.txt```, and saves the text output into a time-stamped folder in ```benchmarking/benchmarking_results``` for manual comparison.
Additionally, the processing speed of every approach per URL is measured and saved in a text file called ```speed_comparisons.txt``` in the respective time-stamped folder.
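
The per-approach speed measurement can be illustrated as follows. This is a minimal sketch and not the code of the benchmarking script itself: the function `time_converters` and the dictionary of converters are hypothetical, and each converter is simply assumed to be a function that takes an HTML string and returns text.

```python
from time import perf_counter


def time_converters(converters, html):
    """Run every converter once on `html`.

    `converters` maps an approach name to a function (HTML string -> text).
    Returns a dict mapping each name to a (text, seconds) tuple.
    """
    results = {}
    for name, convert in converters.items():
        start = perf_counter()
        text = convert(html)
        # Elapsed wall-clock time for this single conversion.
        results[name] = (text, perf_counter() - start)
    return results
```

A single run per page gives only a rough figure; repeating the conversion and averaging would reduce noise.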
To run the benchmarking script, execute ```run_benchmarking.py``` from within the ```benchmarking``` folder.
In ```def pipeline()```, select which HTML-to-text algorithms are executed by modifying the following flags:
```python
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
```
List the URLs to be parsed in ```url_list.txt```, one per line with no additional formatting. URLs need to be complete (including http:// or https://), e.g.
```
http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
```
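
The format rules for the URL list can be checked before a benchmarking run. The sketch below is an illustration under stated assumptions, not part of the benchmarking script: the function name `load_url_list` is hypothetical, the file is assumed to be UTF-8 encoded, and skipping blank lines is a convenience of this sketch.

```python
from urllib.parse import urlparse


def load_url_list(path="url_list.txt"):
    """Read one URL per line and reject entries that are not complete URLs."""
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            url = line.strip()
            if not url:
                # Blank lines are skipped (an assumption of this sketch).
                continue
            parsed = urlparse(url)
            # A complete URL needs an http(s) scheme and a host name.
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                raise ValueError(
                    "line {}: incomplete URL {!r}".format(line_number, url)
                )
            urls.append(url)
    return urls
```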
### Flask Web Service
The Flask Web Service translates HTML pages to the corresponding plain text.
#### Requirements
* python3-flask
#### Startup
```bash
export FLASK_APP="web-service.py"
python3 -m flask run
```
#### Usage
The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
in the `Content-Type` header (`UTF-8` in the example below).
```bash
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
```
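
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above: the helper names `build_get_text_request` and `fetch_text` are hypothetical, the service is assumed to run on `localhost:5000` as in the example, and decoding the response as UTF-8 is an assumption.

```python
import urllib.request


def build_get_text_request(html_bytes, url="http://localhost:5000/get_text"):
    """Build a POST request carrying the HTML in the request body."""
    return urllib.request.Request(
        url,
        data=html_bytes,
        # Same header value as in the curl example above.
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def fetch_text(html_bytes, url="http://localhost:5000/get_text"):
    """Send the request and return the response body as a string."""
    request = build_get_text_request(html_bytes, url)
    with urllib.request.urlopen(request) as response:
        # The response encoding is assumed to be UTF-8 in this sketch.
        return response.read().decode("utf-8")
```

With the web service running, `fetch_text(open("test.html", "rb").read())` would return the converted text.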
### Changelog
See the [Release notes](https://github.com/weblyzard/inscriptis/releases).