`grobid_tei_xml`: Python parser and transforms for GROBID-flavor TEI-XML

This is a simple Python library for parsing the TEI-XML structured documents returned by GROBID, a machine learning tool for extracting text and bibliographic metadata from research article PDFs.

TEI-XML is a standard format, and other libraries exist to parse entire documents and work with annotated text. This library focuses specifically on extracting "header" metadata from documents (e.g., title, authors, journal name, volume, issue), content in flattened text form (full abstract and body text as single strings, for uses like search indexing), and structured citation metadata.
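
To make the idea of "header" metadata concrete, here is a minimal sketch that pulls a title and author names out of a TEI-XML fragment using only the standard library. The fragment is hand-written and heavily simplified, and `extract_header` is a hypothetical helper for illustration; it is not how grobid_tei_xml is implemented, and real GROBID output contains many more elements.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of GROBID output (an assumption,
# not a real GROBID response).
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename type="first">Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

# TEI elements live in this XML namespace, so every lookup needs the prefix mapping.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_header(xml_text: str) -> dict:
    """Hypothetical helper: return the title and author names from a TEI header."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", default=None, namespaces=NS)
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", NS):
        parts = [
            pers.findtext("tei:forename", default="", namespaces=NS),
            pers.findtext("tei:surname", default="", namespaces=NS),
        ]
        # Join only the name parts that are present
        authors.append(" ".join(p for p in parts if p))
    return {"title": title, "authors": authors}


header = extract_header(TEI_SAMPLE)
print(header)
```

The library handles this kind of namespace-aware traversal, plus citations and body text, so callers do not have to write it themselves.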

Quickstart

grobid_tei_xml works with Python 3, using only the standard library. It does not talk to the GROBID HTTP API or read files off disk on its own, but see the examples below. The library is packaged on pypi.org.

Install using pip, usually within a virtualenv:

pip install grobid_tei_xml

The main entry points are the functions parse_document_xml(xml_text) and parse_citations_xml(xml_text), which return Python dataclass objects (a list of them, in the case of citations). The helper method .to_dict() can be useful for, e.g., serializing these objects to JSON.
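
As a rough illustration of the dataclass-plus-to_dict() pattern, the sketch below defines two stand-in dataclasses and serializes one to JSON. `ExampleAuthor` and `ExampleHeader` are hypothetical classes invented for this example; the field names `title`, `doi`, `authors`, and `full_name` are borrowed from the usage examples in this README, and the library's real classes may carry more fields or build their dictionaries differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExampleAuthor:
    # Hypothetical stand-in, not the library's author class
    full_name: Optional[str] = None


@dataclass
class ExampleHeader:
    # Hypothetical stand-in, not the library's header class
    title: Optional[str] = None
    doi: Optional[str] = None
    authors: List[ExampleAuthor] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses and lists,
        # producing plain dicts and lists that json can serialize
        return asdict(self)


header = ExampleHeader(
    title="An Example Article",
    authors=[ExampleAuthor(full_name="Ada Lovelace")],
)
as_json = json.dumps(header.to_dict(), indent=2, sort_keys=True)
print(as_json)
```

Fields that were not extracted stay as None and appear as null in the JSON output, so downstream code should be prepared for missing values.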

Usage Examples

Read an XML file from disk, parse it, and print to stdout as JSON:

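import json
import grobid_tei_xml

xml_path = "./tests/files/small.xml"

# Read the TEI-XML as text; the library itself does no file I/O
with open(xml_path, 'r', encoding='utf-8') as xml_file:
    doc = grobid_tei_xml.parse_document_xml(xml_file.read())

# Convert the parsed document to a dict, then dump it as JSON
print(json.dumps(doc.to_dict(), indent=2))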

Use requests to download a PDF from the web, submit to GROBID (via HTTP API), parse the TEI-XML response with grobid_tei_xml, and print some metadata fields:

import requests
import grobid_tei_xml

# Download the PDF to process
pdf_resp = requests.get("https://arxiv.org/pdf/1802.01168v3", timeout=60.0)
pdf_resp.raise_for_status()

# Upload the PDF as a multipart file, with GROBID options as form fields
grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processFulltextDocument",
    files={'input': pdf_resp.content},
    data={
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=60.0,
)
grobid_resp.raise_for_status()

doc = grobid_tei_xml.parse_document_xml(grobid_resp.text)

# Some fields may be missing, so convert with str() before concatenating
print("title: " + str(doc.header.title))
print("authors: " + ", ".join([a.full_name or "" for a in doc.header.authors]))
print("doi: " + str(doc.header.doi))
print("citation count: " + str(len(doc.citations)))
print("abstract: " + str(doc.abstract))

Use requests to submit a "raw" citation string to GROBID for extraction, parse the response with grobid_tei_xml, and print the structured output to stdout:

import requests
import grobid_tei_xml

raw_citation = "Kvenvolden K.A. and Field M.E. 1981. Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California. AAPG Bulletin 65:1642-1646"

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processCitation",
    data={
        'citations': raw_citation,
        'consolidateCitations': 0,
    },
    timeout=10.0,
)
grobid_resp.raise_for_status()

citation = grobid_tei_xml.parse_citations_xml(grobid_resp.text)[0]
print(citation)

License

This library is available under the permissive MIT License. See LICENSE.txt for a copy.

Release history

0.1.3 (Nov 5, 2021)
0.1.2 (Oct 28, 2021), this version
0.1.1 (Oct 27, 2021)
0.1.0 (Oct 26, 2021)

Download files

Source Distribution

grobid_tei_xml-0.1.2.tar.gz (10.0 kB), uploaded Oct 28, 2021

Built Distribution

grobid_tei_xml-0.1.2-py2.py3-none-any.whl (14.1 kB), uploaded Oct 28, 2021, tagged Python 2 and Python 3

File details

Details for the file grobid_tei_xml-0.1.2.tar.gz.

File metadata

Download URL: grobid_tei_xml-0.1.2.tar.gz
Upload date: Oct 28, 2021
Size: 10.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.3

File hashes

Hashes for grobid_tei_xml-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`8a7fec9a9646ab887c8e8dbb48d0e39e2bc18e933db69957392fb9e872939f27`
MD5	`7e80defe1efabf6a11f5844c631675e0`
BLAKE2b-256	`36dcf981bd7de13cb68177d095d09d67374345179f7e36045550bccaadbf6dbb`

File details

Details for the file grobid_tei_xml-0.1.2-py2.py3-none-any.whl.

File metadata

Download URL: grobid_tei_xml-0.1.2-py2.py3-none-any.whl
Upload date: Oct 28, 2021
Size: 14.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.3

File hashes

Hashes for grobid_tei_xml-0.1.2-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`96670ae6e1e31e281622391f111d32a0c4c73c539134775dfee0b7c7f26bacee`
MD5	`e63fb8ac2ce2edd77f744de1c1953ab7`
BLAKE2b-256	`7920114e60e030b97511345266078d30e75d7c20fd78e7181f813077b66818ce`
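
One way to use the digests listed above is to compare a downloaded file's SHA256 against the published value. The sketch below is a minimal standard-library check; `sha256_hexdigest` and `matches_expected` are hypothetical helper names, and the demo runs on in-memory bytes rather than the real wheel so that it works without a download.

```python
import hashlib
import hmac

# Published SHA256 for grobid_tei_xml-0.1.2-py2.py3-none-any.whl, copied from the table above
EXPECTED_SHA256 = "96670ae6e1e31e281622391f111d32a0c4c73c539134775dfee0b7c7f26bacee"


def sha256_hexdigest(data: bytes) -> str:
    """Return the SHA256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_expected(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of data against an expected hex digest."""
    return hmac.compare_digest(sha256_hexdigest(data), expected_hex.lower())


# Demo on placeholder bytes (not the real wheel contents)
demo = b"placeholder bytes, not the real wheel"
print(matches_expected(demo, EXPECTED_SHA256))
print(matches_expected(demo, sha256_hexdigest(demo)))
```

For a real check, read the downloaded file in binary mode (`open(path, "rb").read()`) and pass the bytes to `matches_expected`. pip can also enforce digests at install time through `--hash=sha256:...` entries in a requirements file together with `--require-hashes`.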
