Easy-to-use text extractor for PDF, DOC, DOCX and other document types, built on the awesome Textract, with OCR (via Tesseract) when necessary.

This library can extract text from any file type supported by Textract.
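
As an illustration of the "OCR if necessary" idea, here is a minimal sketch of a direct extraction pass followed by an OCR pass when the first one yields no text. It is not easytextract's own code: `extract_text`, `extractor` and `min_chars` are hypothetical names introduced for this example. It assumes Textract's `textract.process` function and its `method="tesseract"` option, which Textract provides for PDF files; other file types may need a different fallback.

```python
def _to_text(raw):
    """Decode Textract's byte output to text; pass text through unchanged."""
    return raw.decode("utf-8", "replace") if isinstance(raw, bytes) else raw


def extract_text(path, extractor=None, min_chars=1):
    """Extract text from `path`, retrying with OCR if too little text comes back.

    `extractor` is a hypothetical hook (defaults to textract.process) so the
    fallback logic can be exercised without Textract installed.
    """
    if extractor is None:
        import textract  # third-party; assumed to be installed
        extractor = textract.process

    # First pass: let Textract pick its default method for the file type.
    text = _to_text(extractor(path))

    # Second pass: a scanned document usually yields (almost) no text,
    # so ask Textract to go through Tesseract instead.
    if len(text.strip()) < min_chars:
        text = _to_text(extractor(path, method="tesseract"))
    return text
```

With Textract and Tesseract installed, `extract_text("scan.pdf")` would try the default extraction first and only run OCR if nothing usable was found.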

This library only exists because of the awesome work of the Textract team and Tesseract.

It runs under Python 2.7. It was neither developed nor tested with Python 3 compatibility in mind, although it might work with some slight changes.

INSTALL

Before using this library, please pip install textract and install Tesseract v3 (not v4!) for your platform.
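
To check which Tesseract major version is installed, something like the following sketch could be used. The function names are hypothetical, and it assumes the `tesseract` executable is on your PATH and that `tesseract --version` prints a banner starting with `tesseract 3.x` or `tesseract v4.x` (the exact banner may differ between builds).

```python
import re
import subprocess


def parse_tesseract_major(banner):
    """Return the major version found in a `tesseract --version` banner, or None."""
    match = re.search(r"tesseract\s+v?(\d+)", banner)
    return int(match.group(1)) if match else None


def installed_tesseract_major(command="tesseract"):
    """Run `<command> --version` and return the major version, or None if unavailable."""
    try:
        # Some builds print the banner on stderr, so merge both streams.
        proc = subprocess.Popen(
            [command, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
    except OSError:
        return None  # executable not found
    output = proc.communicate()[0].decode("utf-8", "replace")
    return parse_tesseract_major(output)
```

A result of `3` matches the requirement above; `None` means Tesseract could not be found or its banner was not recognised.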

For DOC support (DOCX is already supported natively), you will also need antiword installed at C:\antiword\antiword.exe. On Linux and Mac, you will need to change the path inside the script.
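
If you do edit the path, one possible approach is to pick it per platform. The sketch below is an assumption, not what the script currently does: it reads the Windows location above as `C:\antiword\antiword.exe`, and on other platforms searches the PATH for an `antiword` executable. `WINDOWS_ANTIWORD`, `find_on_path` and `antiword_path` are hypothetical names.

```python
import os
import sys

# Windows location from this README (assumed to mean C:\antiword\antiword.exe).
WINDOWS_ANTIWORD = r"C:\antiword\antiword.exe"


def find_on_path(name, path_env=None):
    """Return the full path of executable `name` found on PATH, or None."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    for folder in path_env.split(os.pathsep):
        if not folder:
            continue  # skip empty PATH entries
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def antiword_path(platform=None, path_env=None):
    """Pick the antiword location: fixed path on Windows, PATH lookup elsewhere."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return WINDOWS_ANTIWORD
    return find_on_path("antiword", path_env)
```

On Linux or Mac, `antiword_path()` returns `None` when antiword is not installed, which is a convenient point to report that DOC support is unavailable.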

LICENSE

easytextract was initially made by Stephen Larroque <LRQ3000> for the Coma Science Group - GIGA Consciousness - CHU de Liege, Belgium. The application is licensed under the MIT License.
