Fuzzy matching utilities for scholarly metadata
Project description
fuzzycat (wip)
Fuzzy matching publications for fatcat.
Example Run
Run any clustering algorithm.
$ time python -m fuzzycat cluster -t tsandcrawler < data/sample10m.json | \
zstd -c9 > sample_cluster.json.zst
2020-11-18 00:19:48.194 DEBUG __main__ - run_cluster:
{"key_fail": 0, "key_ok": 9999938, "key_empty": 62, "key_denylist": 0, "num_clusters": 9040789}
real 75m23.045s
user 95m14.455s
sys 3m39.121s
Run verification.
$ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt
real 7m56.713s
user 8m50.703s
sys 0m29.262s
Example results over 10M docs:
{
"miss.appendix": 176,
"miss.blacklisted": 12124,
"miss.blacklisted_fragment": 9,
"miss.book_chapter": 46733,
"miss.component": 2173,
"miss.contrib_intersection_empty": 73592,
"miss.dataset_doi": 30806,
"miss.num_diff": 1,
"miss.release_type": 19767,
"miss.short_title": 16737,
"miss.subtitle": 11975,
"miss.title_filename": 87,
"miss.year": 123288,
"ok.arxiv_version": 90726,
"ok.dummy": 106196,
"ok.preprint_published": 10495,
"ok.slug_title_author_match": 47285,
"ok.title_author_match": 65685,
"ok.tokenized_authors": 7592,
"skip.container_name_blacklist": 20,
"skip.publisher_blacklist": 456,
"skip.too_large": 7430,
"skip.unique": 8808462,
"total": 9481815
}
A full run
Single threaded, 42h.
$ time zstdcat -T0 release_export_expanded.json.zst | \
TMPDIR=/bigger/tmp python -m fuzzycat cluster --tmpdir /bigger/tmp -t tsandcrawler | \
zstd -c9 > cluster_tsandcrawler.json.zst
{
"key_fail": 0,
"key_ok": 154202433,
"key_empty": 942,
"key_denylist": 0,
"num_clusters": 124321361
}
real 2559m7.880s
user 2605m41.347s
sys 118m38.141s
So, 29881072 (about 20%) docs in the potentially duplicated set.
Verification (about 15h):
$ time zstdcat -T0 cluster_tsandcrawler.json.zst | python -m fuzzycat verify | \
zstd -c9 > cluster_tsandcrawler_verified_3c7378.tsv.zst
...
real 927m28.631s
user 939m32.761s
sys 36m47.602s
Use cases
- take a release entity database dump as JSON lines and cluster releases (according to various algorithms)
- take cluster information and run a verification step (misc algorithms)
- create a dataset that contains grouping of releases under works
- command line tools to generate cache keys, e.g. to match reference strings to release titles (this needs some transparent setup, e.g. filling of a cache before ops)
Usage
Release clusters start with release entities json lines.
$ cat data/sample.json | python -m fuzzycat cluster -t title > out.json
Clustering 1M records (single core) takes about 64s (15K docs/s).
$ head -1 out.json
{
"k": "裏表紙",
"v": [
...
]
}
Using GNU parallel to make it faster.
$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title
Interestingly, the parallel variants detects fewer clusters (because data is split and clusters are searched within each batch). TODO(miku): sort out sharding bug.
QA
10M release dataset
Notes on cadd28a version clustering (nysiis) and verification.
- 10M docs
- 9040789 groups
- 665447 verification pairs
3578378 OK.TITLE_AUTHOR_MATCH
2989618 Miss.CONTRIB_INTERSECTION_EMPTY
2731528 OK.SLUG_TITLE_AUTHOR_MATCH
2654787 Miss.YEAR
2434532 OK.WORK_ID
2050468 OK.DUMMY
1619330 Miss.SHARED_DOI_PREFIX
1145571 Miss.BOOK_CHAPTER
1023925 Miss.DATASET_DOI
934075 OK.DATACITE_RELATED_ID
868951 OK.DATACITE_VERSION
704154 OK.FIGSHARE_VERSION
682784 Miss.RELEASE_TYPE
607117 OK.TOKENIZED_AUTHORS
298928 OK.PREPRINT_PUBLISHED
270658 Miss.SUBTITLE
227537 Miss.SHORT_TITLE
196402 Miss.COMPONENT
163158 Miss.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
122614 Miss.CUSTOM_PREFIX_10_7916
79687 OK.CUSTOM_IEEE_ARXIV
69648 OK.PMID_DOI_PAIR
46649 Miss.CUSTOM_PREFIX_10_14288
38598 OK.CUSTOM_BSI_UNDATED
15465 OK.DOI
13393 Miss.CUSTOM_IOP_MA_PATTERN
10378 Miss.CONTAINER
3045 Miss.BLACKLISTED
2504 Miss.BLACKLISTED_FRAGMENT
1574 Miss.TITLE_FILENAME
1273 Miss.APPENDIX
104 Miss.NUM_DIFF
4 OK.ARXIV_VERSION
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fuzzycat-0.1.9.tar.gz
.
File metadata
- Download URL: fuzzycat-0.1.9.tar.gz
- Upload date:
- Size: 67.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.2.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.7.8
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f064eb04eaef859101aad7eac2a5427c56c35374cedc2cffd1aa8f525395fdfd |
|
MD5 | 3fad6d9112f253d4cfdfec1e03088c45 |
|
BLAKE2b-256 | 4d530837c7808bf88bd730f3ca458ca882b56a1de79ff6109c036b15fe5ecd29 |
File details
Details for the file fuzzycat-0.1.9-py3-none-any.whl
.
File metadata
- Download URL: fuzzycat-0.1.9-py3-none-any.whl
- Upload date:
- Size: 68.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.2.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.7.8
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2c138aa710f5ab653b8c1cca02d506a123ba2a084dda2c7524e382f1c5e61228 |
|
MD5 | 0b0cda2b8e92cf26148272d036c00cbb |
|
BLAKE2b-256 | 86d16f58c4bb2a96e659095c9e9b7dabd3148d7e97993568ae0ed8cf4011e7ce |