Fuzzy matching utilities for scholarly metadata
fuzzycat (wip)
Fuzzy matching publications for fatcat.
Motivation
Most of the results on sites like Google Scholar group publications into clusters. Each cluster represents one publication, abstracted from its concrete representation as a link to a PDF.
We call the abstract publication a work and the concrete instance a release. The goal is to group releases under works and to implement a versions feature.
This repository contains both generic matching code and fatcat-specific code that uses the fatcat openapi client.
Approach
We can probably make a few assumptions:
- An exact match between two strings does not mean the underlying entities are equal (at all), e.g. "Acta geographica" currently has eight associated ISSNs, and a title like "Buchbesprechungen" appears many hundreds of times.
- ...
- ...
Datasets
- release and container metadata from: https://archive.org/details/fatcat_bulk_exports_2020-08-05.
- issn journal level data, via issnlister
- abbreviation lists
Matching approaches
Performance data point
Candidate generation via elasticsearch with 40 parallel queries sustained about 17857 queries per hour, which is around 5 queries/s.
$ time cat ~/data/researchgate/x04 | \
parallel -j40 --pipe -N 1 ./fatcatx_rg_unmatched.py - \
> ~/data/researchgate/x04_results.ndj
...
real 3409m16.442s
user 29177m5.516s
sys 4927m3.277s
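As a rough sanity check on the numbers above, the timings can be converted to seconds and compared. This is only a sketch: the helper name `to_seconds` is made up for illustration, and the implied query count assumes the quoted rate held for the whole run.

```python
import re


def to_seconds(duration: str) -> float:
    """Convert a `time`-style duration such as '3409m16.442s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if m is None:
        raise ValueError(f"unexpected duration: {duration!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


real = to_seconds("3409m16.442s")
user = to_seconds("29177m5.516s")
sys_ = to_seconds("4927m3.277s")

queries_per_hour = 17857
queries_per_second = queries_per_hour / 3600  # close to 5

# Average number of cores kept busy: CPU time divided by wall-clock time.
busy_cores = (user + sys_) / real

# Only meaningful if the quoted rate was constant over the whole run.
implied_queries = queries_per_hour * real / 3600

print(f"{queries_per_second:.2f} queries/s")
print(f"{busy_cores:.1f} cores busy on average")
print(f"{implied_queries:,.0f} queries implied")
```

With 40 parallel workers but only about 10 cores busy on average, the run may have spent much of its time waiting on elasticsearch, although the timings alone cannot confirm that.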
Data issues
A republished article
There is "student BMJ" and "BMJ": this (html) article (an interview) was first published on "sbmj" (Published 07 July 2011), then on "bmj" (Published 10 August 2011).
Notes; Originally published as: Student BMJ 2011;19:d3983
It is essentially the same text, same title, author, just different DOI and probably a different recorded date.
Generic pattern for a "republication" duplicate:
- metadata mostly same, except date and doi
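The pattern above could be checked roughly as follows. This is a minimal sketch, not the fuzzycat implementation: the field names (`doi`, `release_date`, `title`, `authors`) and the sample records are invented for illustration, and the check requires the remaining fields to be exactly equal, which is stricter than "mostly same".

```python
IGNORED_FIELDS = {"doi", "release_date"}  # hypothetical field names


def looks_like_republication(a: dict, b: dict) -> bool:
    """True if two records agree on all fields except DOI and date,
    and the DOI actually differs."""
    keys = (set(a) | set(b)) - IGNORED_FIELDS
    same_rest = all(a.get(k) == b.get(k) for k in keys)
    return same_rest and a.get("doi") != b.get("doi")


# Invented placeholder records, not real fatcat data.
first = {
    "title": "Example interview",
    "authors": ["A. Author"],
    "doi": "10.0000/example.1",
    "release_date": "2011-07-07",
}
second = dict(first, doi="10.0000/example.2", release_date="2011-08-10")

print(looks_like_republication(first, second))
```

A real check would probably compare normalized titles and author lists rather than raw values.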
Common title
There are probably a few thousand very common short titles.
Some authors do this regularly:
The DOI differs, so we know the releases are different.
More examples:
- https://fatcat.wiki/release/search?q=%22errata%22 (37680)
- https://fatcat.wiki/release/search?q=%22Einleitung%22 (68005)
- https://fatcat.wiki/release/search?q=%22Notes%22 (1507705)
- https://fatcat.wiki/release/search?q=%22Letters+to+the+Editor%22 (30976)
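One way to handle such titles is to keep a list of known common titles and refuse to match on title alone when a title is on it. The sketch below assumes a small hand-written set built from the examples in this document; the names `COMMON_TITLES`, `normalize` and `is_common_title` are illustrative, and a real list would be derived from title frequencies in the dataset.

```python
# Seeded with examples from this document; a real list would be much longer.
COMMON_TITLES = {
    "errata",
    "einleitung",
    "notes",
    "letters to the editor",
    "buchbesprechungen",
}


def normalize(title: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def is_common_title(title: str) -> bool:
    """True if the title alone is too generic to identify a publication."""
    return normalize(title) in COMMON_TITLES


print(is_common_title("Letters  to the Editor"))
print(is_common_title("Shakespeare in Tokyo"))
```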
Title with extra data
- like ISBN, ISSN, price and all kinds of extra metadata
- https://fatcat.wiki/release/search?q=title%3A%22ISBN%22
- titles typically get longer: https://fatcat.wiki/release/olxswrilxfci3ibb3bg5xhstr4
- some of these are actually "reviews", e.g. https://fatcat.wiki/release/4blc5mfc5bfaxkofuletqxuzp4
Another example:
- too long, original suggested citation seems to be:
Parker, S. and Kerrod, R. (2002), "Children’s) Space Busters (1st) Looking at Stars (2nd)", Reference Reviews, Vol. 16 No. 5, pp. 26-27. https://doi.org/10.1108/rr.2002.16.5.26.252
Sometimes a title will be ambiguous
For example, given the title "Shakespeare in Tokyo" we would always have to return "ambiguous", as there are at least two separate publications with that name:
This is similar to journal names, where some journal names will always be ambiguous.
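A title lookup that can answer "ambiguous" might look like the sketch below. The index layout, the status strings and the work identifiers (`work-a`, `work-b`, `work-c`) are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(pairs):
    """Map each normalized title to the set of work ids that use it."""
    index = defaultdict(set)
    for title, work_id in pairs:
        index[title.strip().lower()].add(work_id)
    return index


def lookup(index, title):
    """Return a work id, or 'ambiguous' / 'unknown' if no single work fits."""
    works = index.get(title.strip().lower(), set())
    if not works:
        return "unknown"
    if len(works) > 1:
        return "ambiguous"
    return next(iter(works))


# Hypothetical work ids; only the first title is taken from the text above.
index = build_index([
    ("Shakespeare in Tokyo", "work-a"),
    ("Shakespeare in Tokyo", "work-b"),
    ("A unique example title", "work-c"),
])

print(lookup(index, "Shakespeare in Tokyo"))
print(lookup(index, "A unique example title"))
```

The same structure would apply to journal names, with ISSNs in place of work ids.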
Versions
- same title, same authors, "vX" doi
- https://fatcat.wiki/release/search?q=%22Self-similarity+analysis+of+the+non-linear%22
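To group such versions, the version suffix can be split off the DOI so that the remainder serves as a grouping key. This sketch assumes the suffix has the shape `.vN` or `/vN` at the very end of the DOI; other publishers may use different conventions, and the DOIs below are placeholders.

```python
import re

# Assumed suffix shape: ".v2" or "/v2" at the end of the DOI.
VERSION_SUFFIX = re.compile(r"[./]v(\d+)$", re.IGNORECASE)


def split_versioned_doi(doi: str):
    """Return (base DOI, version number), or (doi, None) without a suffix."""
    m = VERSION_SUFFIX.search(doi)
    if m is None:
        return doi, None
    return doi[: m.start()], int(m.group(1))


# Placeholder DOIs, for illustration only.
print(split_versioned_doi("10.0000/example.123.v2"))
print(split_versioned_doi("10.0000/example.123"))
```

Releases that share a base DOI, title and authors would then be candidates for the same work.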
Sometimes, we have a couple of preprint versions, plus a published version (with a slightly different title):
Almost same
- same author, maybe year
- different DOI
- title almost the same, e.g. MassIVE MSV000085583 - Aedes aegypti protein profile and proteome analysis
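For "almost the same" titles, a string similarity score is one possible signal. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not necessarily what fuzzycat uses, and the variant title is generated by changing one digit of the identifier purely for illustration.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


title = "MassIVE MSV000085583 - Aedes aegypti protein profile and proteome analysis"
# Made-up variant: the identifier below is invented, not a real dataset id.
variant = title.replace("MSV000085583", "MSV000085584")

score = title_similarity(title, variant)
print(f"{score:.3f}")
```

A high score alone does not decide whether two releases belong together; here the DOI differs, so the score would only be one input next to author and year.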
Duplication by different granularity
- https://fatcat.wiki/release/search?q=%22Volkshochschule+Leipzig%22 (20308)
- contains both yearly entries, as well as "DOI per page", https://fatcat.wiki/release/r734v367nza4tl37j6d74rfqo4; could group pages under "container" of yearly release?
- We have one container per release, currently.
Partial titles
A metadata title might differ from the full title.
Here, the release points to two PDFs, one is an article, the other a weekly report (summary).
Exact duplicates
Difference in Subtitle (invisible)
The subtitle is not visible in the metadata; everything else is the same, except for the DOI and the page number. These are different releases.
The "what a difference a char makes" case
Typically a yearly report, or "part 1", "part 2", like this:
The DOI differs; we could hard-code some patterns.
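One such pattern could be: two titles that become identical once every run of digits is masked, but that differ before masking, are treated as distinct items in a series. This is a sketch of that single rule under that assumption; the function name and the sample titles are invented.

```python
import re


def differs_only_in_numbers(a: str, b: str) -> bool:
    """True if the titles differ, but only inside runs of digits."""
    if a == b:
        return False

    def mask(s: str) -> str:
        return re.sub(r"\d+", "#", s)

    return mask(a) == mask(b)


# Invented sample titles.
print(differs_only_in_numbers("Annual report 2019", "Annual report 2020"))
print(differs_only_in_numbers("Example study, part 1", "Example study, part 2"))
print(differs_only_in_numbers("Example study, part 1", "Other study, part 1"))
```

Roman numerals or spelled-out numbers ("part one") would need additional patterns.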
Published to two sites
An article can have multiple DOIs, e.g. when it is republished by a site that assigns its own DOIs, such as researchgate. Example:
https://doi.org/10.11648/j.ijmsa.s.2015040201.15, https://doi.org/10.13140/rg.2.1.2398.3606
Many DOIs with the "10.13140" prefix probably have at least one other DOI.
Some might be "rg-only", like this: https://fatcat.wiki/release/search?q=%22Marco+de+trabajo+basado+en+los+datos+enlazados+para%22
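Flagging these DOIs by prefix is straightforward. The sketch below takes the "10.13140" prefix from the text above; the helper names and the URL handling are assumptions for illustration.

```python
RG_PREFIX = "10.13140/"  # prefix mentioned above for researchgate DOIs


def bare_doi(value: str) -> str:
    """Strip a leading doi.org resolver URL, if present, and lowercase."""
    value = value.strip().lower()
    for resolver in ("https://doi.org/", "http://doi.org/"):
        if value.startswith(resolver):
            return value[len(resolver):]
    return value


def is_researchgate_doi(value: str) -> bool:
    """True if the DOI carries the researchgate prefix."""
    return bare_doi(value).startswith(RG_PREFIX)


print(is_researchgate_doi("https://doi.org/10.13140/rg.2.1.2398.3606"))
print(is_researchgate_doi("https://doi.org/10.11648/j.ijmsa.s.2015040201.15"))
```

A release flagged this way could then be checked for a second release with the same title and authors under a different prefix.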