Smart text extraction from PDF documents
EDS-PDF
EDS-PDF provides a modular framework to extract text information from PDF documents.
You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:
- 📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
- 🎯 Classifiers to perform text box classification, in order to segment PDFs
- 🧩 Aggregators to produce an aggregated output from the detected text boxes
- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)
Visit the documentation for more information!
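To make the extractor / classifier / aggregator split concrete, here is a toy sketch in plain Python. It does not use EDS-PDF: `ToyPipeline`, `Box`, `Doc` and the three stage functions are hypothetical names invented for illustration, and the "extractor" returns hard-coded text instead of parsing a PDF.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Box:
    """A piece of text found on a page, with an optional label."""
    text: str
    label: str = ""


@dataclass
class Doc:
    """Toy document: raw boxes plus the aggregated text per label."""
    boxes: List[Box] = field(default_factory=list)
    aggregated: Dict[str, str] = field(default_factory=dict)


Component = Callable[[Doc], Doc]


class ToyPipeline:
    """Applies components in the order they were added."""

    def __init__(self) -> None:
        self.components: List[Component] = []

    def add(self, component: Component) -> "ToyPipeline":
        self.components.append(component)
        return self

    def __call__(self, doc: Doc) -> Doc:
        for component in self.components:
            doc = component(doc)
        return doc


def extract(doc: Doc) -> Doc:
    # Stand-in for a real extractor: hard-coded boxes, no PDF parsing.
    doc.boxes = [Box("Page 1/2"), Box("Dear Sir,"), Box("Thank you.")]
    return doc


def classify(doc: Doc) -> Doc:
    # Toy rule: boxes starting with "Page" are headers, the rest is body.
    for box in doc.boxes:
        box.label = "header" if box.text.startswith("Page") else "body"
    return doc


def aggregate(doc: Doc) -> Doc:
    # Join the text of all boxes sharing a label, in reading order.
    grouped: Dict[str, List[str]] = {}
    for box in doc.boxes:
        grouped.setdefault(box.label, []).append(box.text)
    doc.aggregated = {label: "\n".join(texts) for label, texts in grouped.items()}
    return doc


toy = ToyPipeline().add(extract).add(classify).add(aggregate)
result = toy(Doc())
```

Each stage only reads and enriches the shared document object, which is what lets one component be swapped without touching the others.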
Getting started
Installation
Install the library with pip:
pip install edspdf
Extracting text
Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this, either by using the configuration system or by using the pipeline API.
Create a configuration file:
config.cfg
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]
[components.extractor]
@factory = "pdfminer-extractor"
[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1
[components.aggregator]
@factory = "simple-aggregator"
and load it from Python:
import edspdf
from pathlib import Path
model = edspdf.load("config.cfg")
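If loading fails, a quick check that every name listed under `[pipeline]` has a matching `[components.<name>]` section can help narrow things down. The sketch below uses only the standard library (`configparser` and `json`); it is not how EDS-PDF parses its configuration, and `check_config` is a hypothetical helper written for this example.

```python
import configparser
import json

CONFIG_TEXT = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def check_config(text: str) -> dict:
    """Return {component name: factory name}, or raise if a section is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    # The pipeline value is written as a JSON-style list of names.
    names = json.loads(parser["pipeline"]["pipeline"])

    missing = [name for name in names if f"components.{name}" not in parser]
    if missing:
        raise ValueError(f"No [components.<name>] section for: {missing}")

    # Factory names are quoted strings, so json.loads strips the quotes.
    return {
        name: json.loads(parser[f"components.{name}"]["@factory"])
        for name in names
    }


factories = check_config(CONFIG_TEXT)
```

To check the file on disk instead, pass `Path("config.cfg").read_text()` to `check_config`.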
Or create a pipeline directly from Python:
from edspdf import Pipeline
model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
This pipeline can then be applied (for instance with this PDF):
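from pathlib import Path

# Get a PDF as raw bytes (adjust the path to your own file)
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
# Run the pipeline on the raw bytes
pdf = model(pdf)
# Retrieve the aggregated text for the "body" label
body = pdf.aggregated_texts["body"]
# The extracted text and its associated properties
text, style = body.text, body.properties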
See the rule-based recipe for a step-by-step explanation of what is happening.
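As a rough intuition for the mask parameters used above, the sketch below labels a text box as `body` when at least `threshold` of its area falls inside the rectangle `(x0, y0, x1, y1)`. This is an assumption about the rule, written in plain Python for illustration; it is not EDS-PDF's implementation. The `Box` class, both helper functions, the `pollution` label and the use of page-relative coordinates between 0 and 1 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page-relative coordinates (assumed 0 to 1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(self.x1 - self.x0, 0.0) * max(self.y1 - self.y0, 0.0)


def overlap_fraction(box: Box, mask: Box) -> float:
    """Fraction of the box's area that lies inside the mask."""
    width = min(box.x1, mask.x1) - max(box.x0, mask.x0)
    height = min(box.y1, mask.y1) - max(box.y0, mask.y0)
    if width <= 0 or height <= 0 or box.area == 0:
        return 0.0
    return (width * height) / box.area


def classify(box: Box, mask: Box, threshold: float = 0.1) -> str:
    """Label the box 'body' if enough of it overlaps the mask."""
    return "body" if overlap_fraction(box, mask) >= threshold else "pollution"


# Same mask values as in the configuration above.
MASK = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)

inside = Box(0.25, 0.35, 0.5, 0.4)   # fully inside the mask
outside = Box(0.0, 0.0, 0.1, 0.05)   # no overlap with the mask
partial = Box(0.1, 0.3, 0.3, 0.4)    # about half of the box is inside

labels = [classify(b, MASK) for b in (inside, outside, partial)]
```

Under this rule, a low threshold such as 0.1 keeps boxes that only graze the mask, while a threshold close to 1 keeps only boxes almost entirely inside it.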
Citation
If you use EDS-PDF, please cite us as below.
@software{edspdf,
author = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
doi = {10.5281/zenodo.6902977},
license = {BSD-3-Clause},
title = {{EDS-PDF: Smart text extraction from PDF documents}},
url = {https://github.com/aphp/edspdf}
}
Acknowledgement
We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.