No project description provided
Project description
yente
yente
is an open source data match-making API. The service exposes HTTP endpoints to search, retrieve or match FollowTheMoney entities, including people, companies and vessels that are subject to international sanctions.
The API is built to provide access to OpenSanctions data, it can also be used to search and match other data, such as the ICIJ OffshoreLeaks.
While yente
is the open source core code base for the OpenSanctions API, it can also be run on-premises as a KYC appliance so that no customer data leaves the deployment context. The software is distributed as a Docker image with a pre-defined docker-compose.yml
configuration that also provisions the requisite ElasticSearch index.
- Self-hosted OpenSanctions API
- Documentation
- Commercial data licensing: https://www.opensanctions.org/licensing/
- Why is it called yente?
Usage
If you prefer a hosted (Software-as-a-Service, SaaS) API, check out the OpenSanctions API.
In order to deploy yente
on your own servers, we recommend you use docker-compose
(or another Docker orchestration tool) to pull and run the pre-built containers. For example, you can download the docker-compose.yml
in the repository and use it to boot an instance of the system:
mkdir -p yente && cd yente
wget https://raw.githubusercontent.com/opensanctions/yente/main/docker-compose.yml
docker-compose up
This will make the service available on Port 8000 of the local machine.
If you run the container in a cluster management system like Kubernetes, you will need to run both of the services defined in the compose file (the API and ElasticSearch instance). You may also need to assign the API container network policy permissions to fetch data from data.opensanctions.org
once every hour so that it can update itself.
Managing data updates
By default, yente
will check for an updated build of the OpenSanctions database published at data.opensanctions.org
every 30 minutes. If a fresh version is found, an indexing process will be spawned and load the data into the ElasticSearch index.
You can change this behaviour in two ways:
-
Specify a crontab for
YENTE_SCHEDULE
in your environment in order to run the auto-update process at a different interval. Setting the environment variableYENTE_AUTO_REINDEX
tofalse
will disable automatic data updates entirely. -
If you wish to manually run an indexing process, you can do so by calling the script
yente/reindex.py
. This command must be invoked inside the application container. For example, in a docker-compose based environment, the full command would be:docker-compose run app python3 yente/reindex.py
.
The production settings for api.ppensanctions.org use these two options in conjunction to move reindexing to a separate Kubernetes CronJob that allows for stricter resource management.
Settings
The API server has a few operations-related settings, which are passed as environment variables. The settings include:
YENTE_ENDPOINT_URL
the URL which should be used to generate external links back to the API server, e.g.https://yente.mycompany.com
.YENTE_MANIFEST
specify the path of themanifest.yml
that defines the datasets exposed by the service.YENTE_SCHEDULE
gives the frequency at which new data will be indexed as a a crontab.YENTE_AUTO_REINDEX
can be set tofalse
to disable automatic data updates and force data to be re-indexed only via the command line (yente reindex
).YENTE_UPDATE_TOKEN
should be set to a secret string. The token is used with aPOST
request to the/updatez
endpoint to force an immediate re-indexing of the data.YENTE_ELASTICSEARCH_URL
: Elasticsearch URL, defaults tohttp://localhost:9200
.YENTE_ELASTICSEARCH_INDEX
: Elasticsearch index, defaults toyente
.YENTE_ELASTICSEARCH_CLOUD_ID
: If you are using Elastic Cloud and want to use the ID rather than endpoint URL.YENTE_ELASTICSEARCH_USERNAME
: Elasticsearch username. Required if connection usingYENTE_ES_CLOUD_ID
.YENTE_ELASTICSEARCH_PASSWORD
: Elasticsearch password. Required if connection usingYENTE_ES_CLOUD_ID
.
Adding custom datasets
The default configuration of yente
will index and expose the datasets published by OpenSanctions every time they change. By adding a manifest file, you can change this behaviour in several ways:
- Index additional datasets that should be checked by the matching API. This might include large public datasets, or in-house data (such as a customer blocklist, local PEPs list, etc.) that you wish to vet alongside the OpenSanctions data.
- Index only a part of the OpenSanctions data, e.g. only the
sanctions
collection.
Side note: A dataset in yente
contains a set of entities. However, some datasets instead reference a list of other datasets which should be included in their scope. Datasets that contain other datasets are called collections. For example, the dataset us_ofac_sdn
(the US sanctions list) is included in the collections sanctions
and default
.
Defining these extra indexing options is handled via a YAML file you can supply for yente
. (The file needs to be accessible to the application, which may require the use of a Docker volume or a Kubernetes ConfigMap). The manifest file can also be configured as a HTTP/HTTPS URL which the yente application will download upon startup. An example manifest might look like this:
# Import external dataset specifications from OpenSanctions. This will fetch the dataset
# metadata from the given index and make them available to yente.
catalogs:
- # nb. replace `latest` with a date stamp (e.g. 20220419) to fetch historical
# OpenSanctions data for a particular day:
url: "https://data.opensanctions.org/datasets/latest/index.json"
# Limit the dataset scope of the entities which will be indexed into yente. Useful
# values include `default`, `sanctions` or `peps`.
scope: all
resource_name: entities.ftm.json
# The next section begins to specify non-OpenSanctions datasets that should be exposed
# in the API:
datasets:
# Example A: fetch a public dataset from a URL and include it in the default search
# scope defined upstream in OpenSanctions:
- name: offshoreleaks
title: ICIJ OffshoreLeaks
url: https://data.opensanctions.org/contrib/icij-offshoreleaks/full-oldb.json
# children:
# - all
# - offshore
# Example B: a local dataset from a path that is visible within the container used
# to run the service:
- name: blocklist
title: Customer Blocklist
path: /data/customer-blocklist.json
# Incrementing the version will force a re-indexing of the data. It must be
# given as a monotonic increasing number.
version: "20220419001"
# Example C: a combined collection that allows querying all entities in its member
# datasets at the same time:
- name: full
title: Full index
datasets:
- offshoreleaks
- blocklist
# include OpenSanctions collections:
- sanctions
# Example D: an extra sanctions list that appends itself to the members of the
# `sanctions` collection defined upstream in OpenSanctions:
- name: zz_extra_sanctions
title: Extra sanctions data
path: /data/extra-sanctions.json
collections:
- sanctions
In order for yente
to import a custom dataset, it must be formatted as a line-based JSON feed of FollowTheMoney entities. There are various ways to produce FtM data, but the most convenient is importing structured data via a mapping specification using the ftm
set of command-line tools. This allows reading data from a CSV file or SQL database and converting each row into entities. Don't forget to ftm aggregate
your custom data before indexing it in yente
!
Development
yente
is implemented in asynchronous, typed Python using the FastAPI framework. We're happy to see any bug fixes, improvements or extensions from the community. For local development without Docker, install the package into a fresh virtual Python environment like this:
git clone https://github.com/opensanctions/yente.git
cd yente
pip install -e .
This will install a broad range of dependencies, including numpy
, scikit-learn
and pyicu
, which are binary packages that may require a local build environment. For pyicu
in particular, refer to the package documentation.
Running the server
Once you've set the YENTE_ELASTICSEARCH_URL
environment variable to point to a running instance of ElasticSearch, you can run the web server like this:
python yente/server.py
License and Support
yente
is licensed according to the MIT license terms documented in LICENSE
. Using the service in a commercial context may require a data license for OpenSanctions data.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file yente-3.2.0.tar.gz
.
File metadata
- Download URL: yente-3.2.0.tar.gz
- Upload date:
- Size: 39.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7b33810c03a9bf1cbf35f9f2c3b67a5854c0865c46338c4b1628423013fcfa35 |
|
MD5 | 3d3afe60e14085d75cd8ed4466919f72 |
|
BLAKE2b-256 | 9ced7552e168185326d924c194b7497cc9c03e1ac805358061cdd7d6effad62b |
File details
Details for the file yente-3.2.0-py3-none-any.whl
.
File metadata
- Download URL: yente-3.2.0-py3-none-any.whl
- Upload date:
- Size: 40.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f094ffcc120fc4b9bff2e72d5f9497b5f5f41ce46baf9d564f9d3ab213a32573 |
|
MD5 | a0705ea81057101d53f1a81485f76bfb |
|
BLAKE2b-256 | 19e661ab53ac91490c0602eab9bc71358dd4a7ae1b65eb61d73b740768b297fc |