Smart text extraction from PDF documents

These details have not been verified by PyPI

EDS-PDF

EDS-PDF provides a modular framework to extract text information from PDF documents.

You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:

📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
🎯 Classifiers to perform text box classification, in order to segment PDFs
🧩 Aggregators to produce an aggregated output from the detected text boxes
🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)
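
To make the division of labour between these components concrete, here is a minimal plain-Python sketch of the extract, classify, aggregate flow. It does not use EDS-PDF at all: `Box`, `classify`, `aggregate` and the `header_limit` rule are hypothetical names invented for illustration, and the extraction step is replaced by hard-coded boxes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Box:
    """A text box as an extractor might return it (hypothetical structure)."""

    text: str
    y0: float  # top edge, as a fraction of the page height
    label: str = ""


def classify(boxes, header_limit=0.1):
    """Rule-based classification: boxes near the top of the page are headers."""
    for box in boxes:
        box.label = "header" if box.y0 < header_limit else "body"
    return boxes


def aggregate(boxes):
    """Join the text of the boxes that share a label, from top to bottom."""
    grouped = defaultdict(list)
    for box in sorted(boxes, key=lambda b: b.y0):
        grouped[box.label].append(box.text)
    return {label: "\n".join(texts) for label, texts in grouped.items()}


# Stand-in for an extractor's output (a real extractor would parse a PDF)
boxes = [
    Box("Dear Sir,", y0=0.30),
    Box("Hospital letterhead", y0=0.05),
    Box("Please find the report attached.", y0=0.35),
]
aggregated = aggregate(classify(boxes))
print(aggregated["body"])
```

Keeping each stage a separate function mirrors the modular idea described above: any stage can be swapped without touching the others.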

Visit the 📖 documentation for more information!

Getting started

Installation

Install the library with pip:

pip install edspdf

Extracting text

Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this: with the configuration system or with the pipeline API.

Create a configuration file:

`config.cfg`

[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"

and load it from Python:

import edspdf
from pathlib import Path

model = edspdf.load("config.cfg")
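
To see how the sections of `config.cfg` relate to each other, the sketch below reads the same text with the standard library only. This is not how `edspdf.load` parses the file; it simply assumes that every value is a JSON-compatible literal and that each name listed under `[pipeline]` has a matching `[components.<name>]` section. The `CONFIG`, `order` and `components` names are illustrative.

```python
import configparser
import json

CONFIG = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names exactly as written
parser.read_string(CONFIG)

# The [pipeline] section gives the order in which components run
order = json.loads(parser["pipeline"]["pipeline"])

# Each component section holds a factory name plus its parameters
components = {
    name: {
        key: json.loads(value)
        for key, value in parser[f"components.{name}"].items()
    }
    for name in order
}

for name in order:
    print(name, components[name])
```

Read this way, the configuration file and the Python API describe the same thing: an ordered list of named components, each built from a factory name and a set of parameters.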

Or create a pipeline directly from Python:

from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")

This pipeline can then be applied (for instance with this PDF):

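from pathlib import Path

# Read a PDF as bytes (replace this path with the location of your own file)
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
# Apply the pipeline to the raw bytes
pdf = model(pdf)

# Retrieve the aggregated text stored under the "body" label
body = pdf.aggregated_texts["body"]

# Plain text and the associated style properties
text, style = body.text, body.properties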

See the rule-based recipe for a step-by-step explanation of what is happening.
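
The mask parameters used above can be read as a rectangle on the page. The sketch below is an assumption about the rule, not EDS-PDF's implementation: it treats `x0`, `x1`, `y0` and `y1` as relative page coordinates between 0 and 1, and labels a box `body` when the fraction of its area inside the mask reaches `threshold`. The helper names `overlap_fraction` and `label_box`, and the `pollution` label for rejected boxes, are hypothetical.

```python
def overlap_fraction(box, mask):
    """Fraction of the box's area that lies inside the mask.

    Both arguments are (x0, y0, x1, y1) tuples in relative page coordinates.
    """
    bx0, by0, bx1, by1 = box
    mx0, my0, mx1, my1 = mask
    # Width and height of the intersection rectangle (zero if disjoint)
    width = max(0.0, min(bx1, mx1) - max(bx0, mx0))
    height = max(0.0, min(by1, my1) - max(by0, my0))
    area = (bx1 - bx0) * (by1 - by0)
    return (width * height) / area if area > 0 else 0.0


def label_box(box, mask, threshold):
    """Label a box 'body' if enough of it falls inside the mask."""
    return "body" if overlap_fraction(box, mask) >= threshold else "pollution"


MASK = (0.2, 0.3, 0.9, 0.6)  # x0, y0, x1, y1 from the example configuration

inside = (0.25, 0.35, 0.75, 0.45)   # fully within the mask
outside = (0.0, 0.0, 0.1, 0.05)     # top-left corner, no overlap
partial = (0.0, 0.3, 0.4, 0.6)      # straddles the left edge of the mask

for box in (inside, outside, partial):
    print(box, label_box(box, MASK, threshold=0.1))
```

Under this reading, a low threshold such as 0.1 keeps boxes that only partly overlap the mask, while a threshold of 1.0 would keep only boxes that lie entirely inside it.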

Citation

If you use EDS-PDF, please cite it as follows:

@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and the AP-HP Foundation for funding this project.

Project details

Release history

0.9.1: Mar 19, 2024
0.9.0: Feb 26, 2024
0.8.1: Sep 26, 2023 (this version)
0.8.0: Sep 7, 2023
0.7.0: Jun 9, 2023
0.5.3: Aug 31, 2022
0.5.2: Aug 30, 2022
0.5.1: Jul 26, 2022
0.5.0: Jul 25, 2022
0.5.0b0 (pre-release): Jul 25, 2022

Download files

Download the file for your platform.

Source Distribution

edspdf-0.8.1.tar.gz (1.7 MB)

Uploaded Sep 26, 2023 Source

Built Distribution

edspdf-0.8.1-py3-none-any.whl (74.5 kB)

Uploaded Sep 26, 2023 Python 3

File details

Details for the file edspdf-0.8.1.tar.gz.

File metadata

Download URL: edspdf-0.8.1.tar.gz
Upload date: Sep 26, 2023
Size: 1.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for edspdf-0.8.1.tar.gz
Algorithm	Hash digest
SHA256	`c0002fee7c50a524e74cbba213c044b83df8bf258b5a2c69bd50800faa1647dc`
MD5	`481ea71a0e91b6a84e9b7d711df22bca`
BLAKE2b-256	`8089120a495d63439dfb015ac775b3b38971c47e7a6c096a2a3c7486c33112c1`

File details

Details for the file edspdf-0.8.1-py3-none-any.whl.

File metadata

Download URL: edspdf-0.8.1-py3-none-any.whl
Upload date: Sep 26, 2023
Size: 74.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for edspdf-0.8.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`18f0689c8a24e38e5202e8c3e58befafb3f2d6bfb614d1f86b351ed6cce9ada0`
MD5	`cc78646fd5281dafc6fbb0797352e93e`
BLAKE2b-256	`28f728e0b426738714bdc179ea2b42a5d95bf05b222baa778ce6e263479e2f50`
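
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches` are illustrative, and the file is assumed to sit in the current directory under its original name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with a published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest listed above for edspdf-0.8.1-py3-none-any.whl
EXPECTED_WHEEL_SHA256 = (
    "18f0689c8a24e38e5202e8c3e58befafb3f2d6bfb614d1f86b351ed6cce9ada0"
)

wheel = Path("edspdf-0.8.1-py3-none-any.whl")  # assumed download location
if wheel.exists():
    print("digest matches:", matches(wheel, EXPECTED_WHEEL_SHA256))
else:
    print("file not found, nothing to verify")
```

Reading in chunks keeps memory use constant, which matters little for a 74.5 kB wheel but is a sensible default for larger archives.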