Make record linkages in followthemoney data.
nomenklatura
Nomenklatura de-duplicates and integrates different Follow the Money entities. It serves to clean up messy data and to find links between different datasets.
Design
This package will offer an in-memory data deduplication framework centered around the Follow the Money (FtM) data model. The intended workflow is:
- Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
- Build an in-memory inverted index of the entities for blocking
- Generate merge candidates using the blocking index and FtM compare
- Provide a file-based storage format for merge challenges and decisions
- Provide a text-based user interface to let users make merge decisions
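The blocking and candidate-generation steps above can be illustrated with a minimal sketch. It is not the package's implementation: the entity shape (plain dicts with `id`, `schema` and `properties`), the helper names `tokenize`, `build_index` and `candidate_pairs`, and the use of a shared-token count as the score are all assumptions made for illustration. A real pipeline would presumably score candidates with FtM compare rather than by counting tokens.

```python
import re
from collections import defaultdict
from itertools import combinations
from typing import Any, Dict, Iterable, List, Set, Tuple

# Assumed entity shape: {"id": str, "schema": str, "properties": {name: [values]}}
Entity = Dict[str, Any]


def tokenize(value: str) -> Set[str]:
    """Lower-case a value and split it into alphanumeric tokens."""
    return set(re.findall(r"\w+", value.lower()))


def build_index(entities: Iterable[Entity]) -> Dict[str, Set[str]]:
    """Inverted index: map each token to the IDs of entities containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for entity in entities:
        for values in entity["properties"].values():
            for value in values:
                for token in tokenize(value):
                    index[token].add(entity["id"])
    return index


def candidate_pairs(index: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count shared tokens for each pair of entities that meet in a block.

    The count is only a stand-in for a proper comparison score.
    """
    scores: Dict[Tuple[str, str], int] = defaultdict(int)
    for ids in index.values():
        for left, right in combinations(sorted(ids), 2):
            scores[(left, right)] += 1
    return scores


entities: List[Entity] = [
    {"id": "a", "schema": "Person", "properties": {"name": ["John Smith"]}},
    {"id": "b", "schema": "Person", "properties": {"name": ["Smith, John"]}},
    {"id": "c", "schema": "Person", "properties": {"name": ["Jane Doe"]}},
]
index = build_index(entities)
pairs = candidate_pairs(index)
```

Only entities that share at least one token are ever compared, which is the point of blocking: the expensive pairwise comparison runs on a small subset of all possible pairs.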
Later on, the following might be added:
- A web application to let users make merge decisions on the web
- An implementation of the OpenRefine Reconciliation API based on the blocking index
This will be done in typed Python 3.
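As a rough sketch of what a typed, file-based store for merge decisions from the workflow above might look like, the example below appends one JSON object per line and replays the file on load. The `Decision` type, the judgement values, the file name and the line-based JSON layout are hypothetical choices for illustration, not the format the package defines.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, NamedTuple, Tuple


class Decision(NamedTuple):
    left_id: str
    right_id: str
    judgement: str  # hypothetical values: "positive", "negative", "unsure"


def append_decision(path: Path, decision: Decision) -> None:
    """Append a decision as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(decision._asdict(), sort_keys=True) + "\n")


def load_decisions(path: Path) -> Dict[Tuple[str, str], str]:
    """Replay the log; a later line overrides an earlier one for the same pair."""
    decisions: Dict[Tuple[str, str], str] = {}
    if not path.exists():
        return decisions
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = Decision(**json.loads(line))
            # Store pairs in sorted order so (a, b) and (b, a) share one key.
            left, right = sorted((record.left_id, record.right_id))
            decisions[(left, right)] = record.judgement
    return decisions


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "decisions.jsonl"
    append_decision(path, Decision("a", "b", "unsure"))
    append_decision(path, Decision("c", "a", "negative"))
    append_decision(path, Decision("b", "a", "positive"))
    loaded = load_decisions(path)
```

An append-only log keeps every judgement a user has made, so a text-based interface can write decisions as they happen and the current state can be rebuilt at any time by reading the file from the top.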
Reading
- https://dedupe.readthedocs.org/en/latest/
- https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
- https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
- https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
Contact, contributions etc.
This codebase is licensed under the terms of an MIT license (see LICENSE).
We welcome contributions, bug fixes and feature suggestions. Please use the GitHub issue tracker for this repository.
Nomenklatura is currently developed thanks to a Prototypefund grant for OpenSanctions. Previous iterations of the package were developed with support from Knight-Mozilla OpenNews and the Open Knowledge Foundation Labs.