
Tools for extracting tabular data from PDFs, using pdfminer


Have some tabular data locked away in PDF format? Like the financial information at my esteemed place of current employment, which looks roughly like this:

example/ui-financials.png

Tabula not practical for the volume of information you need to extract? (Thousands of pages in my case.) This package may be what you’re looking for. I should note that this is a simple tool aimed at very structured data. Tabula can handle far messier situations than this package. Misaligned cell heights? Word-wrapped cells? Spanning cells? You’re better off with Tabula. Computer-generated report PDFs that urgently want to be in a SQLite database? You’ve come to the right place.

Reading UIUC Financials

If you came here looking to read financial statements at UIUC, there’s a page just for you.

Package Overview

This package builds on pdfminer to make it easy to absorb computer-generated tabular data in PDF form and produce JSON-like lists of row dictionaries. The basic workflow is as follows:

# identify top of table
top_y0 = find_attr_group_matching(
        ["Last Name", "First Name"], "y0", page_it.lines)

# extract text snippets making up table body
table_lines = [l for l in page_it.lines if l.y0 < top_y0]

# extract header text snippets
headers = [l for l in page_it.lines if abs(l.y0 - top_y0) < 5]

# extract table
rows = find_row_table(headers, table_lines)
rows = merge_overlapping_rows(rows, "y0", "y1")
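
To make the geometric idea behind these steps easier to picture, here is a minimal, self-contained sketch that does not use pdf2data at all. The Snippet class and the two helper functions are hypothetical stand-ins invented for this illustration; they only mimic the idea of locating the header line by its y0 coordinate and then sorting body text into columns by horizontal overlap with the headers.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """Hypothetical stand-in for a positioned piece of text on a page."""
    text: str
    x0: float
    x1: float
    y0: float


def find_header_y0(snippets, header_texts, tol=1.0):
    """Return the y0 of a line that contains every requested header text."""
    wanted = set(header_texts)
    for cand in snippets:
        if cand.text not in wanted:
            continue
        same_line = {s.text for s in snippets if abs(s.y0 - cand.y0) <= tol}
        if wanted <= same_line:
            return cand.y0
    raise ValueError("no line contains all requested headers")


def overlap(a, b):
    """Horizontal overlap of two snippets (negative if they are disjoint)."""
    return min(a.x1, b.x1) - max(a.x0, b.x0)


def assign_to_columns(headers, body):
    """Group body snippets into rows by y0 and into columns by x-overlap."""
    by_y = {}
    for s in body:
        best = max(headers, key=lambda h: overlap(h, s))
        if overlap(best, s) <= 0:
            continue  # sits under no header; ignored in this sketch
        by_y.setdefault(round(s.y0), {})[best.text] = s.text
    # PDF y coordinates grow upward, so a larger y0 is higher on the page.
    return [by_y[y] for y in sorted(by_y, reverse=True)]


page = [
    Snippet("Report", 50, 120, 700),
    Snippet("Last Name", 50, 110, 650),
    Snippet("First Name", 150, 215, 650),
    Snippet("Strom", 50, 85, 630),
    Snippet("Pam", 150, 172, 630),
    Snippet("Abrams", 50, 95, 610),
    Snippet("Marjorie", 150, 200, 610),
]

top_y0 = find_header_y0(page, ["Last Name", "First Name"])
headers = [s for s in page if abs(s.y0 - top_y0) < 5]
body = [s for s in page if s.y0 < top_y0]
table = assign_to_columns(headers, body)
```

In the pdf2data workflow above, the package's own find_row_table and merge_overlapping_rows take on this job; the sketch is only meant to show why well-aligned, computer-generated tables are the easy case.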

The pdf2data workflow above leaves rows as a list of dictionaries, one per table row, roughly like the following:

{'Amount ': TL('           60.00 '), 'Last Name': TL('Lidstad'), 'Address': TL('62\xa0Mississippi\xa0River\xa0Blvd\xa0N'), 'First Name': TL('Dick\xa0&\xa0Peg'), 'City': TL('Saint\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55104'), 'Occupation': TL('retired'), 'Date': TL('10/12/2012')}
{'Amount ': TL('           60.00 '), 'Last Name': TL('Strom'), 'Address': TL('1229\xa0Hague\xa0Ave'), 'First Name': TL('Pam'), 'City': TL('St.\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55104'), 'Date': TL('9/12/2012')}
{'Amount ': TL('           60.00 '), 'Last Name': TL('Seeba'), 'Address': TL('1399\xa0Sheldon\xa0St'), 'First Name': TL('Louise\xa0&\xa0Paul'), 'City': TL('Saint\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55108'), 'Occupation': TL('BOE'), 'Employer': TL('City\xa0of\xa0Saint\xa0Paul'), 'Date': TL('10/12/2012')}
{'Amount ': TL('           60.00 '), 'Last Name': TL('Schumacher\xa0/\xa0Bales'), 'First Name': TL('Douglas\xa0L.\xa0/\xa0Patricia\xa0948\xa0County\xa0Rd.\xa0D\xa0W'), 'City': TL('Saint\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55126'), 'Date': TL('10/13/2012')}
{'Amount ': TL('           75.00 '), 'Last Name': TL('Abrams'), 'Address': TL('238\xa08th\xa0St\xa0east'), 'First Name': TL('Marjorie'), 'City': TL('St\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55101'), 'Occupation': TL('Retired'), 'Employer': TL('Retired'), 'Date': TL('8/8/2012')}
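
As the output shows, cell text can carry non-breaking spaces (\xa0) and padding, one header key ('Amount ') has a trailing space, and not every row has every column. A small cleanup pass along the following lines may help before further processing. This sketch assumes the cell text has already been pulled out of the TL objects into plain strings (how to do that is not shown here), and the helper names are made up for the example.

```python
def clean_cell(text):
    """Turn non-breaking spaces into ordinary ones and trim padding."""
    return text.replace("\xa0", " ").strip()


def clean_row(row):
    """Clean both the header keys and the cell values of one row."""
    return {clean_cell(key): clean_cell(value) for key, value in row.items()}


# Plain-string version of part of the first row shown above.
raw = {
    "Amount ": "           60.00 ",
    "Last Name": "Lidstad",
    "Address": "62\xa0Mississippi\xa0River\xa0Blvd\xa0N",
    "City": "Saint\xa0Paul",
}

cleaned = clean_row(raw)

# Columns missing from a row can be read with a default instead of a KeyError.
occupation = cleaned.get("Occupation", "")
```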

See this demo for a minimal, fully functional example. There is some documentation in the source code. In addition, there are some (sparsely documented) facilities for inserting the obtained data into a SQLite3 database and the full script I use to make my financial info digestible.
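
If you would rather not dig into the package's own database helpers, rows like these can also be loaded with nothing but Python's standard sqlite3 module. The following is a sketch under assumptions, not the package's facility: the function name, the table name, and the column list are invented for the example, and the rows are assumed to be plain dictionaries of strings.

```python
import sqlite3


def rows_to_sqlite(conn, table, columns, rows):
    """Create a text-only table and insert one record per row dictionary.

    Table and column names are interpolated into the SQL, so they must come
    from trusted code, never from the PDF contents.
    """
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        # row.get() yields None (stored as NULL) for columns a row lacks.
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
columns = ["Last Name", "City", "Amount", "Occupation"]
rows = [
    {"Last Name": "Strom", "City": "St. Paul", "Amount": "60.00"},
    {"Last Name": "Abrams", "City": "St Paul", "Amount": "75.00",
     "Occupation": "Retired"},
]
rows_to_sqlite(conn, "contributions", columns, rows)

result = conn.execute(
    'SELECT "Last Name", "Occupation" FROM contributions ORDER BY "Last Name"'
).fetchall()
```

Storing everything as TEXT keeps the sketch simple; converting amounts and dates to proper types is left to whatever suits your data.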

The package is Python 3-only. Install using:

pip install pdf2data

https://github.com/inducer/pdf2data

Copyright 2019 Andreas Kloeckner

Released under the MIT License

In terms of support, if this doesn’t do what you need, you’re likely to be on your own. I’m happy to take patches, but I’m unlikely to have the time to fix your use case.
