Fuzzy matching utilities for scholarly metadata

fuzzycat: bibliographic fuzzy matching for fatcat.wiki

https://pypi-hypernode.com/project/fuzzycat/

This Python library contains routines for finding near-duplicate bibliographic entities (primarily research papers), and estimating whether two metadata records describe the same work (or variations of the same work). Some routines are designed to work "offline" with batches of billions of sorted metadata records, and others are designed to work "online" making queries against hosted web services and catalogs.

fuzzycat was originally developed by Martin Czygan at the Internet Archive, and is used in the construction of a citation graph and to identify duplicate records in the fatcat.wiki catalog and scholar.archive.org search index.

DISCLAIMER: this tool is still under development, as indicated by the "0" major version. The interface, semantics, and behavior are likely to be tweaked.

Quickstart

Inside a virtualenv (or similar), install with pip:

pip install fuzzycat

The fuzzycat.simple module contains high-level helpers which query Internet Archive hosted services:

import elasticsearch
from fuzzycat.simple import (
    close_fuzzy_biblio_matches,
    closest_fuzzy_unstructured_match,
)

es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")

# parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
# then queries Elasticsearch (at https://search.fatcat.wiki),
# then scores candidates against latest catalog record fetched from
#  https://api.fatcat.wiki
best_match = closest_fuzzy_unstructured_match(
    """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
    es_client=es_client)

print(best_match)
# FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})

# same as above, but without the GROBID parsing, and returns multiple results
matches = close_fuzzy_biblio_matches(
    dict(
        title="Mesh migration following abdominal hernia repair: a comprehensive review",
        first_author="Cunningham",
        year=2019,
        journal="Hernia",
    ),
    es_client=es_client,
)
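
The printed result above shows three fields: a status, a reason, and the matched release. A minimal sketch of keeping only confident results from a list of matches follows. It uses local stand-in classes so that it runs without network access or fuzzycat installed; only the `EXACT` status and `DOI` reason appear in the output above, so the `WEAK` and `UNKNOWN` members, the `MatchResult` class, the `confident_matches` helper, and the use of plain dicts for the release are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


# Stand-ins mirroring the repr shown above; not imported from fuzzycat.
class Status(str, Enum):
    EXACT = "exact"
    WEAK = "weak"  # assumed member, for illustration only


class Reason(str, Enum):
    DOI = "doi"
    UNKNOWN = "unknown"  # assumed member, for illustration only


@dataclass
class MatchResult:
    status: Status
    reason: Reason
    release: dict  # placeholder for the matched catalog record


def confident_matches(results, accepted=(Status.EXACT,)):
    """Keep only results whose status is in the accepted set."""
    return [r for r in results if r.status in accepted]


results = [
    MatchResult(Status.EXACT, Reason.DOI, {"title": "Mesh migration following abdominal hernia repair: a comprehensive review"}),
    MatchResult(Status.WEAK, Reason.UNKNOWN, {"title": "Mesh migration"}),
]
kept = confident_matches(results)
print([(r.status.value, r.reason.value) for r in kept])
```

With real results, the same filter would presumably be applied to the list returned by close_fuzzy_biblio_matches.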

A CLI tool is included for processing records in UNIX stdin/stdout pipelines:

# print usage
python -m fuzzycat

Features and Use-Cases

The refcat project builds on top of this library to construct a citation graph by processing billions of structured and unstructured reference records extracted from scholarly papers (note: for performance-critical parts, some code has been ported to Go, although the test suite is shared between the Python and Go implementations).

Automated imports of metadata records into the fatcat catalog use fuzzycat to filter new metadata which look like duplicates of existing records from other sources.
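
As a rough illustration of what such a duplicate filter could look like, the sketch below compares an incoming record against an existing one. It is not fuzzycat's verification logic: the `looks_like_duplicate` and `normalize` helpers, the record field names, the rules, and the 0.9 threshold are all assumptions, and the title comparison uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def looks_like_duplicate(a, b, threshold=0.9):
    """Return (is_duplicate, reason) for two metadata dicts.

    Hypothetical rules: an identical DOI decides immediately; a year
    mismatch rules the pair out; otherwise normalized titles are
    compared by similarity ratio against an arbitrary threshold.
    """
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b and doi_a.lower() == doi_b.lower():
        return True, "doi"
    if a.get("year") and b.get("year") and a["year"] != b["year"]:
        return False, "year"
    ratio = SequenceMatcher(
        None, normalize(a.get("title", "")), normalize(b.get("title", ""))
    ).ratio()
    return ratio >= threshold, "title"


existing = {
    "title": "Mesh migration following abdominal hernia repair: a comprehensive review",
    "year": 2019,
    "doi": "10.1007/s10029-019-01898-9",
}
incoming = {
    "title": "Mesh Migration Following Abdominal Hernia Repair: A Comprehensive Review.",
    "year": 2019,
}
unrelated = {"title": "Hernia repair outcomes in rural hospitals", "year": 2019}

print(looks_like_duplicate(existing, incoming))
print(looks_like_duplicate(existing, unrelated))
```

An import pipeline would skip or flag incoming records for which the first element is true. A production filter needs many more rules than this, for example to handle preprints and published versions that differ in year or DOI.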

In conjunction with standard command-line tools (like sort), fatcat bulk metadata snapshots can be clustered and reduced into groups to flag duplicate records for merging.
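
The sort-then-group idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's clustering code: the `title_key` function, the `cluster` helper, and the `ident` and `title` field names are hypothetical, and the in-memory sort stands in for an external `sort` over a key column, which is what makes the approach workable for very large snapshots.

```python
import itertools
import json
import re


def title_key(record):
    """Hypothetical clustering key: lowercased title, ASCII letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", record.get("title", "").lower())


def cluster(lines):
    """Yield (key, records) for every key shared by two or more records.

    `lines` is an iterable of JSON strings, one record per line.
    """
    records = [json.loads(line) for line in lines]
    # A real pipeline would sort on disk (e.g. with `sort`) instead.
    records.sort(key=title_key)
    for key, group in itertools.groupby(records, key=title_key):
        group = list(group)
        if key and len(group) > 1:
            yield key, group


lines = [
    '{"ident": "a1", "title": "Mesh migration following abdominal hernia repair"}',
    '{"ident": "b2", "title": "Fuzzy matching of citations"}',
    '{"ident": "c3", "title": "Mesh Migration following Abdominal Hernia Repair."}',
]
for key, group in cluster(lines):
    print(key, [record["ident"] for record in group])
```

Because records with the same key end up adjacent after sorting, each group can be emitted in a single pass; candidate groups would then be verified pairwise before any records are merged.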

Extracted reference strings from any source (webpages, books, papers, wikis, databases, etc.) can be resolved against the fatcat catalog of scholarly papers.

Support and Acknowledgements

Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the "Ensuring the Persistent Access of Open Access Journal Literature" project (see original announcement).

Additional acknowledgements at fatcat.wiki.

Download files

Source Distribution

fuzzycat-0.1.22.tar.gz (76.4 kB)

Uploaded Source

Built Distribution

fuzzycat-0.1.22-py3-none-any.whl (79.5 kB)

Uploaded Python 3
