Fuzzy matching utilities for scholarly metadata

fuzzycat (wip)

Fuzzy matching publications for fatcat.

Motivation

Most of the results on sites like Google Scholar group publications into clusters. Each cluster represents one publication, abstracted from its concrete representation as a link to a PDF.

We call the abstract publication a work and a concrete instance a release. The goal is to group releases under works and to implement a versions feature.

This repository contains both generic matching code and fatcat-specific code that uses the fatcat openapi client.

Approach

There are probably a few assumptions we can make:

  • If two strings are given, an exact string match does not mean equality (at all), e.g. "Acta geographica" currently has eight associated ISSNs, and a title like "Buchbesprechungen" appears many hundreds of times.
  • ...
  • ...

Datasets

Matching approaches

Performance data point

Candidate generation via Elasticsearch with 40 parallel queries sustained about 17857 queries per hour, that is around 5 queries/s.

$ time cat ~/data/researchgate/x04 | \
    parallel -j40 --pipe -N 1 ./fatcatx_rg_unmatched.py - \
    > ~/data/researchgate/x04_results.ndj
...
real    3409m16.442s
user    29177m5.516s
sys     4927m3.277s
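As a quick sanity check, the figures above can be converted into common units. This is only arithmetic on the numbers already quoted; the variable names are made up for illustration.

```python
# Figures copied from the data point above.
queries_per_hour = 17857
real_minutes, real_seconds = 3409, 16.442
user_minutes = 29177

# Sustained rate in queries per second.
queries_per_second = queries_per_hour / 3600

# Wall-clock duration of the run in hours.
wall_clock_hours = (real_minutes * 60 + real_seconds) / 3600

# Rough ratio of user CPU time to wall-clock time (whole minutes only).
user_to_real_ratio = user_minutes / real_minutes

print(f"{queries_per_second:.2f} queries/s")
print(f"{wall_clock_hours:.1f} hours wall clock")
print(f"{user_to_real_ratio:.1f}x user CPU time relative to wall clock")
```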

Data issues

A republished article

There is "student BMJ" and "BMJ". This (HTML) article (an interview) was first published on "sbmj" (Published 07 July 2011), then on "bmj" (Published 10 August 2011).

Notes; Originally published as: Student BMJ 2011;19:d3983

It is essentially the same text, same title, author, just different DOI and probably a different recorded date.

Generic pattern "republication" duplicate:

  • metadata mostly the same, except date and DOI
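The republication pattern could be checked roughly as follows. This is a minimal sketch, not the fuzzycat implementation: the function names and the dictionary keys ("title", "authors", "doi") are assumptions made for illustration, and real release metadata will need more careful normalization.

```python
from typing import Optional


def normalize(text: Optional[str]) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join((text or "").lower().split())


def looks_like_republication(a: dict, b: dict) -> bool:
    """Return True if title and authors agree while both DOIs exist and differ.

    Hypothetical helper: the date is deliberately ignored, since a
    republished article usually carries a different recorded date.
    """
    title_a = normalize(a.get("title"))
    same_title = bool(title_a) and title_a == normalize(b.get("title"))
    same_authors = (
        [normalize(name) for name in a.get("authors", [])]
        == [normalize(name) for name in b.get("authors", [])]
    )
    doi_a, doi_b = normalize(a.get("doi")), normalize(b.get("doi"))
    different_doi = bool(doi_a) and bool(doi_b) and doi_a != doi_b
    return same_title and same_authors and different_doi
```

A positive result only marks a pair as a candidate; whether the two releases belong to the same work still needs a separate decision.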

Common title

Probably a few thousand very common short titles.

Some authors do this regularly:

Different DOI, so we know it is different.

More examples:

Title with extra data

Another example:

  • too long, original suggested citation seems to be:

Parker, S. and Kerrod, R. (2002), "Children’s) Space Busters (1st) Looking at Stars (2nd)", Reference Reviews, Vol. 16 No. 5, pp. 26-27. https://doi.org/10.1108/rr.2002.16.5.26.252

Sometimes a title will be ambiguous

For example, given the title "Shakespeare in Tokyo", we would always have to return "ambiguous", as there are at least two separate publications with that name:

This is similar to journal names, where some journal names will always be ambiguous.
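One way to surface such titles is to count how often each normalized title occurs in a reference list and refuse to decide when it occurs more than once. The sketch below assumes a plain list of title strings as input; `Verdict`, `build_title_counts` and `verdict_for_title` are hypothetical names, not part of the fuzzycat API.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    AMBIGUOUS = "ambiguous"  # title occurs more than once
    CANDIDATE = "candidate"  # title occurs exactly once
    UNKNOWN = "unknown"      # title does not occur at all


def _key(title: str) -> str:
    """Normalize a title by lowercasing and collapsing whitespace."""
    return " ".join(title.lower().split())


def build_title_counts(titles: Iterable[str]) -> Counter:
    """Count occurrences of each normalized title."""
    return Counter(_key(title) for title in titles)


def verdict_for_title(title: str, counts: Counter) -> Verdict:
    """Decide whether a title alone can identify a publication."""
    n = counts.get(_key(title), 0)
    if n == 0:
        return Verdict.UNKNOWN
    if n > 1:
        return Verdict.AMBIGUOUS
    return Verdict.CANDIDATE
```

The same counting approach would apply to journal names, with a list of names in place of titles.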

Versions

Sometimes, we have a couple of preprint versions, plus a published version (with a slightly different title):

Almost same

Duplication by different granularity

Partial titles

A metadata title might differ from the full title.

Here, the release points to two PDFs, one is an article, the other a weekly report (summary).

Exact duplicates

Difference in Subtitle (invisible)

The subtitle is not visible in the metadata; everything else is the same, except for the DOI and the page number. These are different releases.

The "what a difference a char makes" case

Typically a yearly report, or "part 1", "part 2", like this:

The DOI differs, and we could hard-code some patterns.
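One such pattern could be "titles that are identical except for their numbers". The following is a minimal sketch under that assumption; the function name is hypothetical, and it only handles Arabic digits, so "Part I" versus "Part II" is not covered.

```python
import re

_NUMBER = re.compile(r"\d+")


def differ_only_in_numbers(a: str, b: str) -> bool:
    """Return True if two titles match once digit runs are masked, but the digit runs differ."""
    if a == b:
        return False
    # Replace every run of digits with a placeholder before comparing.
    masked_a = _NUMBER.sub("#", a.lower())
    masked_b = _NUMBER.sub("#", b.lower())
    return masked_a == masked_b and _NUMBER.findall(a) != _NUMBER.findall(b)
```

A pair flagged by this check would be treated as different releases, for example a yearly report from two different years.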

Published to two sites

An article can have multiple DOIs, e.g. when it is republished by a site that assigns its own DOIs, such as researchgate. Example:

https://doi.org/10.11648/j.ijmsa.s.2015040201.15, https://doi.org/10.13140/rg.2.1.2398.3606

Probably many releases with a "10.13140"-prefixed DOI have at least one other DOI.

Some might be "rg-only", like this: https://fatcat.wiki/release/search?q=%22Marco+de+trabajo+basado+en+los+datos+enlazados+para%22
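Checking for the "10.13140" prefix is straightforward. The sketch below uses hypothetical helper names and accepts bare DOIs as well as doi.org URLs; it only identifies the prefix and does not tell whether another DOI exists for the same article.

```python
RG_PREFIX = "10.13140"  # prefix mentioned above


def doi_prefix(doi: str) -> str:
    """Return the part of a DOI before the first slash.

    Accepts bare DOIs, "doi:" prefixed values and doi.org URLs.
    """
    doi = doi.strip().lower()
    for head in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(head):
            doi = doi[len(head):]
            break
    return doi.split("/", 1)[0]


def is_researchgate_doi(doi: str) -> bool:
    """Return True if the DOI carries the researchgate prefix."""
    return doi_prefix(doi) == RG_PREFIX
```

Applied to the two DOIs from the example, only the second one is flagged.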
