Fast semantic search and comparison
PumpkinPy - Semantic similarity implemented in Python
About
PumpkinPy uses bitmaps ordered by information content (IC) for fast ranking of genes and diseases: phenotypes are sorted by descending frequency and one-hot encoded. This is useful for larger ontologies such as uPheno and for large tasks such as ranking all mouse genes given a set of input HPO terms. This approach was first used in OWLTools and OWLSim-v3.
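As a rough illustration of the idea, not PumpkinPy's actual code, the sketch below sorts a toy set of terms by descending frequency, assigns each term a bit position, and finds the most informative shared term of two profiles as the highest set bit of their intersection. The term names, counts, and helper functions are hypothetical, and a plain Python integer stands in for a roaring bitmap.

```python
import math

# Hypothetical toy data: term -> number of annotated entities (not real HPO counts)
term_counts = {"root": 100, "A": 40, "B": 10, "C": 2}
total = 100

# Sort by descending frequency, so a higher bit index means a higher IC
ordered = sorted(term_counts, key=lambda t: term_counts[t], reverse=True)
index = {term: i for i, term in enumerate(ordered)}
ic = [-math.log2(term_counts[term] / total) for term in ordered]


def to_bitmap(terms):
    """One-hot encode a set of terms (including ancestors) as an integer bitset."""
    bits = 0
    for term in terms:
        bits |= 1 << index[term]
    return bits


def max_ic_common(bits_a, bits_b):
    """IC of the most informative term shared by two bitsets (0.0 if none)."""
    common = bits_a & bits_b
    if common == 0:
        return 0.0
    # Terms are IC ordered, so the highest set bit is the most informative one
    return ic[common.bit_length() - 1]


profile_1 = to_bitmap({"root", "A", "B"})
profile_2 = to_bitmap({"root", "A", "C"})
shared_ic = max_ic_common(profile_1, profile_2)
print(round(shared_ic, 4))
```

Because the bit positions follow the IC ordering, the lookup needs only one bitwise AND and one highest-bit query rather than a scan over all shared terms.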
The goal of this project was to build an implementation of the PhenoDigm algorithm in Python. It also implements common distance and similarity measures (Euclidean, cosine, Jiang-Conrath, Resnik, Jaccard).
Disclaimer: This is a side project and needs more documentation and testing.
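For readers unfamiliar with these measures, here is a minimal sketch of the textbook definitions over plain Python sets. It is not taken from PumpkinPy's source; the function names, the toy IC values, and the ancestor sets are all hypothetical.

```python
import math


def jaccard(a, b):
    """Size of the intersection over size of the union of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def resnik(ic, ancestors_a, ancestors_b):
    """IC of the most informative common ancestor of two terms."""
    common = ancestors_a & ancestors_b
    return max((ic[term] for term in common), default=0.0)


def jiang_conrath_distance(ic, term_a, term_b, ancestors_a, ancestors_b):
    """IC(a) + IC(b) - 2 * IC(most informative common ancestor)."""
    return ic[term_a] + ic[term_b] - 2 * resnik(ic, ancestors_a, ancestors_b)


# Hypothetical IC values and reflexive ancestor sets for a four-term ontology
ic = {"root": 0.0, "A": 1.0, "B": 3.0, "C": 4.0}
anc_b = {"root", "A", "B"}
anc_c = {"root", "A", "C"}

print(jaccard(anc_b, anc_c))
print(resnik(ic, anc_b, anc_c))
print(jiang_conrath_distance(ic, "B", "C", anc_b, anc_c))
```

Jaccard and cosine only look at which terms are shared, while Resnik and Jiang-Conrath weight the shared ancestor by how informative it is.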
Getting Started
Requires Python 3.8+; python3-dev is needed to install pyroaring
Installing from PyPI
pip install pumpkin_py
Building locally
To build locally, first install Poetry: https://python-poetry.org/docs/#installation
Then run make:
make
Usage
Get a list of implemented similarity measures
from pumpkin_py import get_methods
get_methods()
['jaccard', 'cosine', 'phenodigm', 'symmetric_phenodigm', 'resnik', 'symmetric_resnik', 'ic_cosine', 'sim_gic']
Load closures and annotations
import gzip
from pathlib import Path
from pumpkin_py import build_ic_graph_from_closures, flat_to_annotations, search
closures = Path('.') / 'data' / 'hpo' / 'hp-closures.tsv.gz'
annotations = Path('.') / 'data' / 'hpo' / 'phenotype-annotations.tsv.gz'
root = "HP:0000118"
with gzip.open(annotations, 'rt') as annot_file:
    annot_map = flat_to_annotations(annot_file)
with gzip.open(closures, 'rt') as closure_file:
    graph = build_ic_graph_from_closures(closure_file, root, annot_map)
Search for the best matching disease given a phenotype profile
import pprint
from pumpkin_py import search
profile_a = (
    "HP:0000403,HP:0000518,HP:0000565,HP:0000767,"
    "HP:0000872,HP:0001257,HP:0001263,HP:0001290,"
    "HP:0001629,HP:0002019,HP:0002072".split(',')
)
search_results = search(profile_a, annot_map, graph, 'phenodigm')
pprint.pprint(search_results.results[0:5])
[SimMatch(id='ORPHA:94125', rank=1, score=72.67599348696685),
SimMatch(id='ORPHA:79137', rank=2, score=71.57368233248252),
SimMatch(id='OMIM:619352', rank=3, score=70.98305459477629),
SimMatch(id='OMIM:618624', rank=4, score=70.94596234638497),
SimMatch(id='OMIM:617106', rank=5, score=70.83097366257857)]
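The ranked list can be post-processed like any Python list. The sketch below keeps only matches from one source (for example OMIM) while preserving rank order. It uses a stand-in `SimMatch` namedtuple that assumes only the three fields visible in the output above; the real class may differ, and `top_by_prefix` is a hypothetical helper, not part of PumpkinPy.

```python
from collections import namedtuple

# Stand-in for the result type: only the fields shown in the output are assumed
SimMatch = namedtuple("SimMatch", ["id", "rank", "score"])

# The five results printed above
results = [
    SimMatch(id='ORPHA:94125', rank=1, score=72.67599348696685),
    SimMatch(id='ORPHA:79137', rank=2, score=71.57368233248252),
    SimMatch(id='OMIM:619352', rank=3, score=70.98305459477629),
    SimMatch(id='OMIM:618624', rank=4, score=70.94596234638497),
    SimMatch(id='OMIM:617106', rank=5, score=70.83097366257857),
]


def top_by_prefix(matches, prefix, limit=3):
    """Return up to `limit` matches whose identifier starts with `prefix:`."""
    hits = [m for m in matches if m.id.startswith(prefix + ":")]
    return hits[:limit]


omim_hits = top_by_prefix(results, "OMIM")
for match in omim_hits:
    print(f"{match.rank}\t{match.id}\t{match.score:.2f}")
```

With real search output, `results` would be replaced by `search_results.results`, provided the objects expose the same `id`, `rank`, and `score` attributes.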
Example scripts for fetching Monarch annotations and closures
Uses robot and sparql to generate closures and class labels
Annotation data is fetched from the latest Monarch release
- Requires >Java 8
cd data/monarch/ && make
PhenoDigm Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3649640/
Exomiser: https://github.com/exomiser/Exomiser
OWLTools: https://github.com/owlcollab/owltools
OWLSim-v3: https://github.com/monarch-initiative/owlsim-v3