Entity Embed
Transform entities like companies, products, etc. into vectors to support scalable Entity Resolution using Approximate Nearest Neighbors.
Using Entity Embed, you can train a deep learning model to transform entities into vectors in an N-dimensional embedding space. Thanks to a contrastive loss, those vectors are organized to keep similar entities close and dissimilar entities far apart in this embedding space. Embedding entities enables scalable ANN search, which means finding thousands of candidate duplicate pairs of entities per second per CPU.
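To make the idea of candidate pairs from an embedding space concrete, here is a minimal sketch in plain Python. It does not use the Entity Embed API: the toy vectors, the `candidate_pairs` helper, and the similarity threshold are all hypothetical, and the exhaustive comparison stands in for the approximate index that a real ANN library would provide.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def candidate_pairs(vectors, threshold):
    """Return the id pairs whose vectors have cosine similarity >= threshold.

    This is an exhaustive O(n^2) scan used only for illustration.
    An ANN index would approximate the same result without comparing every pair.
    """
    pairs = set()
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.add((id_a, id_b))
    return pairs


# Hypothetical 3-dimensional embeddings: entities 1 and 2 are near each other,
# as are entities 3 and 4, as a trained model would aim to place duplicates.
toy_vectors = {
    1: [0.9, 0.1, 0.0],
    2: [0.88, 0.12, 0.01],
    3: [0.0, 0.2, 0.95],
    4: [0.02, 0.18, 0.97],
}

found = candidate_pairs(toy_vectors, threshold=0.95)
print(sorted(found))
```

Only the pairs of nearby vectors survive the threshold, which is the property the contrastive loss is meant to encourage.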
Entity Embed achieves Recall of ~0.99 with a Pair-Entity ratio below 100 on a variety of datasets. It aims for high recall at the expense of precision, so this library is suited to the Blocking/Indexing stage of an Entity Resolution pipeline. A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality in Entity Resolution pipelines, and this library aims to solve that. Note that the ANN search on embedded entities returns several candidate pairs that must be filtered to find the best matching pairs, possibly with a pairwise classifier.
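The two figures quoted above can be computed from a set of candidate pairs and a set of known true pairs. The sketch below assumes that the Pair-Entity ratio means the number of candidate pairs divided by the number of entities, which the text does not define explicitly. The function names and the toy data are hypothetical and are not taken from the library.

```python
def recall(found_pairs, true_pairs):
    """Fraction of true duplicate pairs that appear among the candidates."""
    if not true_pairs:
        return 0.0
    return len(found_pairs & true_pairs) / len(true_pairs)


def precision(found_pairs, true_pairs):
    """Fraction of candidate pairs that are true duplicate pairs."""
    if not found_pairs:
        return 0.0
    return len(found_pairs & true_pairs) / len(found_pairs)


def pair_entity_ratio(found_pairs, entity_count):
    """Candidate pairs per entity (assumed definition, see the note above)."""
    return len(found_pairs) / entity_count


# Hypothetical toy data: 6 entities, 2 true duplicate pairs, 5 candidate pairs.
true_pairs = {(1, 2), (3, 4)}
found_pairs = {(1, 2), (3, 4), (1, 3), (2, 5), (4, 6)}

print(recall(found_pairs, true_pairs))
print(precision(found_pairs, true_pairs))
print(pair_entity_ratio(found_pairs, entity_count=6))
```

In this toy case every true pair is found while most candidates are false positives, which mirrors the trade-off described above: high recall from blocking, with precision recovered later by a filtering step.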
Entity Embed is based on, and is a special case of, the AutoBlock model described by Amazon.
⚠️ Warning: this project is under heavy development.
Documentation
https://entity-embed.readthedocs.io
Requirements
System
- macOS or Linux (tested on latest macOS and Ubuntu via GitHub Actions).
- Entity Embed can train and run on a powerful laptop. Tested on a system with 32 GB of RAM, an RTX 2070 Mobile (8 GB VRAM), and an i7-10750H (12 threads). With batch sizes smaller than 32 and few field types, it's possible to train and run even with 2 GB of VRAM.
Libraries
- Python: >= 3.6
- NumPy: >= 1.19.0
- PyTorch: >= 1.7.1
- PyTorch Lightning: >= 1.1.6
- N2: >= 0.1.7
And others, see requirements.txt.
Installation
pip install entity-embed
Examples
Run:
pip install -r requirements-examples.txt
Then check the example Jupyter Notebooks:
- Deduplication, when you have a single dirty dataset with duplicates: notebooks/Deduplication-Example.ipynb
- Record Linkage, when you have multiple clean datasets you need to link: notebooks/Record-Linkage-Example.ipynb
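The difference between the two scenarios can be seen in which pairs are even eligible as candidates. The following sketch is an illustration only, not the notebooks' code. It assumes that "clean" means a dataset has no duplicates within itself, so record linkage only considers pairs that cross two datasets. The helper names and ids are hypothetical.

```python
from itertools import combinations, product


def dedup_pair_space(ids):
    """All unordered pairs within one dataset (deduplication)."""
    return set(combinations(sorted(ids), 2))


def linkage_pair_space(left_ids, right_ids):
    """All pairs with one entity from each dataset (record linkage).

    Pairs within the same dataset are excluded, under the assumption
    that each dataset is already free of internal duplicates.
    """
    return set(product(sorted(left_ids), sorted(right_ids)))


# Hypothetical ids for illustration.
dirty_dataset = [1, 2, 3, 4]
left_dataset = ["a1", "a2"]
right_dataset = ["b1", "b2", "b3"]

print(len(dedup_pair_space(dirty_dataset)))
print(len(linkage_pair_space(left_dataset, right_dataset)))
```

Either pair space grows quickly with dataset size, which is why a blocking stage that proposes only a small set of candidates is useful in both scenarios.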
Releases
See CHANGELOG.md.
Credits
This project is maintained by open-source contributors and Vinta Software.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Commercial Support
Vinta Software is always looking for exciting work, so if you need any commercial support, feel free to get in touch: contact@vinta.com.br
Citations
If you use Entity Embed in your research, please consider citing it.
BibTeX entry:
@software{entity-embed,
title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},
author = {Juvenal, Flávio and Vieira, Renato},
url = {https://github.com/vintasoftware/entity-embed},
version = {0.0.1},
date = {2021-03-30},
year = {2021}
}