inscriptis - HTML to text converter.
# inscriptis
A Python-based HTML-to-text converter with minimal support for CSS.
### Requirements
* Python 3.4+ (preferred) or Python 2.7+
* lxml
### Usage
#### Command line
The command line client converts HTML files or HTML retrieved from Web pages to the
corresponding text representation.
***Installation***
```bash
sudo python3 setup.py install
```
***Command line parameters***
```bash
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url (default:stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --display-image-captions
                        Display image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
```
***Examples***
```
# convert the given page to text and output the result to the screen
inscript.py http://www.htwchur.ch
# convert the file to text and save the output to htwchur.txt
inscript.py htwchur.html -o htwchur.txt
# convert the text provided via stdin and save the output to htwchur.txt
echo '<body><p>Make it so!</p></body>' | inscript.py -o htwchur.txt
```
#### Library
```python
import urllib.request
from inscriptis import get_text
url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
```
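The example above assumes that the page is encoded in UTF-8. As a minimal sketch for pages that declare a different encoding, the helpers below read the charset from the HTTP response headers and fall back to UTF-8. The names `decode_html` and `fetch_html` are hypothetical and not part of inscriptis; only the Python standard library is used.

```python
import urllib.request
from typing import Optional


def decode_html(raw: bytes, charset: Optional[str] = None,
                fallback: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset, else with `fallback`."""
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: degrade gracefully.
        return raw.decode(fallback, errors="replace")


def fetch_html(url: str) -> str:
    """Download a page and decode it using the charset from the headers."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset()
        return decode_html(response.read(), charset)
```

The decoded string can then be passed to `get_text` as in the example above, e.g. `get_text(fetch_html(url))`.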
### Unit tests
Test cases for the HTML-to-text conversion are located in the `tests/html` directory. Each test case consists of two files:
1. `test-name.html` and
2. `test-name.txt`

The latter contains the reference text output for the given HTML file.
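The pairing of input and reference files can be sketched as follows. This is an illustration, not the project's actual test runner: `find_failing_cases` is a hypothetical helper, the converter is passed in as a callable (for example `get_text`), and both the UTF-8 file encoding and the ignoring of trailing whitespace are assumptions of this sketch.

```python
from pathlib import Path
from typing import Callable, List


def find_failing_cases(test_dir: Path,
                       convert: Callable[[str], str]) -> List[str]:
    """Return the names of test cases whose output differs from the reference."""
    failures = []
    for html_file in sorted(test_dir.glob("*.html")):
        reference_file = html_file.with_suffix(".txt")
        if not reference_file.exists():
            continue  # no reference output, nothing to compare
        html = html_file.read_text(encoding="utf-8")
        expected = reference_file.read_text(encoding="utf-8")
        # Trailing whitespace is ignored here; that is a choice of this sketch.
        if convert(html).rstrip() != expected.rstrip():
            failures.append(html_file.stem)
    return failures
```

With inscriptis installed, a call such as `find_failing_cases(Path("tests/html"), get_text)` would list the test cases that do not match their reference output.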
### Text conversion output comparison and speed benchmarking
inscriptis offers a small benchmarking script that compares different HTML-to-text conversion approaches.
The script runs the different approaches on a list of URLs, `url_list.txt`, and saves the text output into a time-stamped folder in `benchmarking/benchmarking_results` for manual comparison.
Additionally, the processing speed of every approach per URL is measured and saved in a text file called `speed_comparisons.txt` in the respective time-stamped folder.
To run the benchmarking script, execute `run_benchmarking.py` from within the folder `benchmarking`.
In `def pipeline()`, select which HTML-to-text algorithms are executed by modifying:
```python
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
```
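How such flags could drive the timing of each approach is sketched below. This is not the code of `run_benchmarking.py`: the function `time_converters` and its dictionary-based interface are assumptions made for illustration, with each approach represented by a plain callable.

```python
import time
from typing import Callable, Dict


def time_converters(html: str,
                    converters: Dict[str, Callable[[str], str]],
                    enabled: Dict[str, bool]) -> Dict[str, float]:
    """Run each enabled converter once and return elapsed seconds per name."""
    timings = {}
    for name, convert in converters.items():
        if not enabled.get(name, False):
            continue  # skipped, comparable to setting run_<name> = False
        start = time.perf_counter()
        convert(html)
        timings[name] = time.perf_counter() - start
    return timings
```

A single run per URL gives only a rough measurement; repeating each conversion several times and averaging would reduce noise.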
List the URLs to be parsed in `url_list.txt`, one per line with no additional formatting. URLs need to be complete (including `http://` or `https://`), e.g.:
```
http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
```
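The file format described above can be read with a few lines of Python. The helper `load_urls` is hypothetical, and rejecting incomplete URLs with a `ValueError` is an assumption of this sketch; the benchmarking script may handle such lines differently.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def load_urls(lines: Iterable[str]) -> List[str]:
    """Return the complete http(s) URLs from `lines`, one URL per line."""
    urls = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("incomplete URL: " + url)
        urls.append(url)
    return urls
```

Since an open file iterates over its lines, the function can be used as `with open("url_list.txt") as f: urls = load_urls(f)`.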
### Flask Web Service
The Flask Web Service translates HTML pages to the corresponding plain text.
#### Requirements
* python3-flask
#### Startup
```bash
export FLASK_APP="web-service.py"
python3 -m flask run
```
#### Usage
The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified
in the `Content-Type` header (`UTF-8` in the example below).
```bash
curl -X POST -H "Content-Type: text/html; encoding=UTF8" -d @test.html http://localhost:5000/get_text
```
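The same request can be issued from Python with the standard library. The sketch below mirrors the curl call above, including its `Content-Type` value; the function names are hypothetical, and decoding the response as UTF-8 is an assumption, since the response encoding is not documented here.

```python
import urllib.request


def build_get_text_request(
        html: str,
        endpoint: str = "http://localhost:5000/get_text",
) -> urllib.request.Request:
    """Build a POST request carrying `html` as UTF-8 in the request body."""
    return urllib.request.Request(
        endpoint,
        data=html.encode("utf-8"),
        # Same header value as in the curl example above.
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def get_text_via_service(html: str) -> str:
    """Send `html` to a locally running service and return the response body."""
    with urllib.request.urlopen(build_get_text_request(html)) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

Building the request separately from sending it keeps the header and body logic testable without a running service.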
### Changelog
See the [release notes](https://github.com/weblyzard/inscriptis/releases).