A pipeline for protein embedding generation and visualization
Bio Embeddings
The project includes:
- A pipeline that embeds the sequences in a FASTA file using one of several embedders (see below), then projects and visualizes the embeddings on 3D plots.
- A web server that takes in sequences, embeds them, and either returns the embeddings or visualizes the embedding spaces on interactive plots online.
- A general purpose library to embed protein sequences in any Python app.
Important information
- The albert model weights are not publicly available yet. You can request early access by opening an issue.
- Please help us out by opening issues and submitting PRs as you see fit; this repository is actively being developed.
Install guides
You can install the package via pip like so:
pip install bio-embeddings
Or directly from the source (e.g. to have the latest features):
pip install -U git+https://github.com/sacdallago/bio_embeddings.git
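To confirm which release ended up in your environment after either install route, a small check like the following may help. It is only a sketch: it relies on the standard library's importlib.metadata (Python 3.8+), and the helper name installed_version is made up for this illustration and is not part of bio_embeddings.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of bio_embeddings.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "bio-embeddings" is the distribution name used in the pip command above.
version = installed_version("bio-embeddings")
if version is None:
    print("bio-embeddings is not installed in this environment")
else:
    print(f"bio-embeddings {version} is installed")
```

Note that an install from the git source reports whatever version string the repository declares at that commit, which may match the latest release even though the code is newer.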
Additional dependencies and steps to run the webserver
Running the webserver locally requires some Python backend deployment experience. You will also need a few additional dependencies:
pip install dash celery pymongo flask-restx pyyaml
Additionally, you need to run two instances of the app (the backend and at least one celery worker), and both instances must have access to a MongoDB instance and to a RabbitMQ or Redis store for celery.
Examples
We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing of pipeline runs and general purpose use of the embedders.
After having installed the package, you can:
- Use the pipeline like:
  bio_embeddings config.yml
  A blueprint of the configuration file and an example setup can be found in the examples directory of this repository.
- Use the general purpose embedder objects via Python, e.g.:
  from bio_embeddings import SeqVecEmbedder
  embedder = SeqVecEmbedder()
  embedding = embedder.embed("SEQVENCE")
  More examples can be found in the notebooks folder of this repository.
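If you would rather drive an embedder object over a whole FASTA file yourself instead of using the pipeline, the loop could look roughly like the sketch below. The only thing taken from this README is the embed(sequence) call; read_fasta, embed_fasta and LengthEmbedder are hypothetical names written for this example, and LengthEmbedder is a stand-in so the snippet runs without any model weights.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    identifier = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Use the first whitespace-separated token of the header as the ID.
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            # Sequences may be wrapped over several lines; collect them all.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


class LengthEmbedder:
    """Hypothetical stand-in with the same embed(sequence) call as above.

    NOT part of bio_embeddings; swap in e.g. SeqVecEmbedder() for real use.
    """

    def embed(self, sequence: str) -> List[float]:
        return [float(len(sequence))]


def embed_fasta(lines: Iterable[str], embedder) -> Dict[str, object]:
    """Embed every sequence in the FASTA input, keyed by identifier."""
    return {identifier: embedder.embed(seq) for identifier, seq in read_fasta(lines)}


example = """>protein_a first test protein
SEQ
VENCE
>protein_b
MKT
""".splitlines()

embeddings = embed_fasta(example, LengthEmbedder())
print(embeddings)
```

For a file on disk, pass the open file handle as lines. Keying the results by identifier assumes the identifiers in your FASTA file are unique.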
Development status
- Pipeline stages
  - embed:
    - SeqVec v1/v2 (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
    - FastText
    - GloVe
    - Word2Vec
    - UniRep (https://www.nature.com/articles/s41592-019-0598-1?sfns=mo)
    - Albert (unpublished)
  - project:
    - t-SNE
    - UMAP
- Web server:
  - SeqVec
  - Albert (unpublished)
- General purpose objects:
  - SeqVec
  - FastText
  - GloVe
  - Word2Vec
  - UniRep
  - Albert (unpublished)
Building a Distribution
The packages are best built using invoke.
If you manage your dependencies with poetry, invoke should already be installed.
Simply run
poetry run invoke clean build
to update your requirements according to your current status and to generate the dist files.
Contributors
- Christian Dallago (lead)
- Tobias Olenyi
- Michael Heinzinger