A pipeline for protein embedding generation and visualization

Bio Embeddings

Resources to learn about bio_embeddings:

Quickly predict protein structure and function from sequence via embeddings: embed.protein.properties.
Read the current documentation: docs.bioembeddings.com.
Chat with us: chat.bioembeddings.com.
We presented the bio_embeddings pipeline as a talk at ISMB 2020 and LMRL 2020. The talk is available on YouTube, the poster on F1000, and the accompanying manuscript in Current Protocols.
Check out the examples of pipeline configurations and notebooks.

Project aims:

Facilitate the use of language-model-based biological sequence representations for transfer learning by providing a single, consistent interface with close-to-zero friction
Reproducible workflows
Depth of representation (different models from different labs, trained on different datasets for different purposes)
Extensive examples, complexity handled for users (e.g. CUDA OOM abstraction), and well-documented warnings and error messages
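
One way to picture the "CUDA OOM abstraction" mentioned above is a retry wrapper: try a whole batch and, if the device runs out of memory, fall back to embedding one sequence at a time. The sketch below is illustrative only and does not reproduce the package's internals. `OutOfMemory`, `embed_with_fallback`, and `toy_embed_batch` are hypothetical names, and the out-of-memory condition is simulated so the example runs without a GPU.

```python
from typing import Callable, List, Sequence


class OutOfMemory(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical, for illustration)."""


def embed_with_fallback(
    sequences: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
) -> List[List[float]]:
    """Try the whole batch first; on OOM, retry one sequence at a time."""
    try:
        return embed_batch(sequences)
    except OutOfMemory:
        # Smaller inputs need less memory, so process sequences individually.
        results: List[List[float]] = []
        for seq in sequences:
            results.extend(embed_batch([seq]))
        return results


def toy_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    """Toy embedder: fails when a batch holds more than 10 residues in total."""
    if sum(len(s) for s in batch) > 10:
        raise OutOfMemory("simulated CUDA OOM")
    # One-dimensional "embedding": the sequence length.
    return [[float(len(s))] for s in batch]


embeddings = embed_with_fallback(["MKT", "AAAAAA", "GGGG"], toy_embed_batch)
print(embeddings)
```

The caller never sees the simulated error: the batch of 13 residues fails, and the three sequences are then embedded individually.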

The project includes:

General-purpose Python embedders based on open models trained on biological sequence representations (SeqVec, ProtTrans, UniRep, ...)
A pipeline which:
- embeds sequences into matrix-representations (per-amino-acid) or vector-representations (per-sequence) that can be used to train learning models or for analytical purposes
- projects per-sequence embeddings into lower dimensional representations using UMAP or t-SNE (for lightweight data handling and visualizations)
- visualizes low dimensional sets of per-sequence embeddings onto 2D and 3D interactive plots (with and without annotations)
- extracts annotations from per-sequence and per-amino-acid embeddings using supervised (when available) and unsupervised approaches (e.g. by network analysis)
A webserver that wraps the pipeline into a distributed API for scalable and consistent workflows

Installation

You can install bio_embeddings via pip or use it via Docker.

Pip

Install the pipeline like so:

```
pip install "bio-embeddings[all]"
```

To install the unstable version directly from the repository:

```
pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"
```
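
After installing, you may want to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper, not part of bio_embeddings, and it only inspects package metadata without importing the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("bio-embeddings")
print(found if found else "bio-embeddings is not installed in this environment")
```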

Docker

We provide a Docker image at ghcr.io/bioembeddings/bio_embeddings. Simple usage example:

```
docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml
```

See the Docker example in the examples folder for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest, which is built from the latest commit.

Installation notes

bio_embeddings was developed for Unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly reduced without a GPU and CUDA). For Windows users, we strongly recommend using Windows Subsystem for Linux.

Which model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint, ...). There is no one-size-fits-all model, and we encourage you to try at least two different models when starting a new exploratory project.

The models prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_bert_bfd, followed by seqvec, which has been established for longer and uses a different principle (LSTM vs Transformer).

Usage and examples

We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing pipeline runs and general-purpose use of the embedders.

After having installed the package, you can:

Use the pipeline like:
```
bio_embeddings config.yml
```
A blueprint of the configuration file and an example setup can be found in the examples directory of this repository.
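
To give a feel for what such a configuration might contain, the sketch below writes a small `config.yml` from Python. The key names (`global`, `sequences_file`, `prefix`, `type`, `protocol`, `reduce`, `depends_on`) are recalled from the project's examples and should be treated as assumptions; the stage names, the FASTA file name, and the prefix are hypothetical. Check the blueprint in the examples directory before relying on any of them.

```python
import tempfile
from pathlib import Path

# Assumed layout: one "global" section plus one section per pipeline stage.
# All key names are assumptions; verify them against the blueprint file.
CONFIG_TEXT = """\
global:
  sequences_file: sequences.fasta
  prefix: my_run
seqvec_embeddings:
  type: embed
  protocol: seqvec
  reduce: True
umap_projections:
  type: project
  protocol: umap
  depends_on: seqvec_embeddings
"""


def write_config(directory: Path) -> Path:
    """Write the example configuration into `directory` and return its path."""
    path = directory / "config.yml"
    path.write_text(CONFIG_TEXT)
    return path


config_path = write_config(Path(tempfile.mkdtemp()))
print(config_path.read_text())
```

The resulting file would then be passed to the `bio_embeddings` command shown above.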

Use the general-purpose embedder objects via Python, e.g.:

```
from bio_embeddings.embed import SeqVecEmbedder

embedder = SeqVecEmbedder()
embedding = embedder.embed("SEQVENCE")
```

More examples can be found in the notebooks folder of this repository.
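
The embedder call above produces a per-amino-acid representation, whereas many downstream tasks want a single vector per sequence. A common way to get one is to average over residues. The sketch below shows that idea in plain Python on toy data. `mean_pool` is a hypothetical helper, and averaging is an assumption made for illustration; the embedders may reduce their output differently.

```python
from typing import List


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Average a (length x dims) matrix over the length axis."""
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    length = len(per_residue)
    dims = len(per_residue[0])
    # One mean per embedding dimension, taken across all residues.
    return [sum(row[d] for row in per_residue) / length for d in range(dims)]


# Toy matrix: 3 residues, 2 embedding dimensions.
toy_embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
per_sequence = mean_pool(toy_embedding)
print(per_sequence)
```

The output has one value per embedding dimension regardless of sequence length, which is what makes per-sequence vectors convenient for projection and distance calculations.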

Cite

Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., Yang, K. K., Min, S., Yoon, S., Morton, J. T., & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113

The corresponding bibtex:

@article{https://doi.org/10.1002/cpz1.113,
author = {Dallago, Christian and Schütze, Konstantin and Heinzinger, Michael and Olenyi, Tobias and Littmann, Maria and Lu, Amy X. and Yang, Kevin K. and Min, Seonwoo and Yoon, Sungroh and Morton, James T. and Rost, Burkhard},
title = {Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets},
journal = {Current Protocols},
volume = {1},
number = {5},
pages = {e113},
keywords = {deep learning embeddings, machine learning, protein annotation pipeline, protein representations, protein visualization},
doi = {https://doi.org/10.1002/cpz1.113},
url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpz1.113},
eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpz1.113},
year = {2021}
}

Contributors

Christian Dallago (lead)
Konstantin Schütze
Tobias Olenyi
Michael Heinzinger

Non-exhaustive list of tools available (see following section for more details):

fastText
GloVe
Word2Vec
SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
- SeqVecSec and SeqVecLoc for secondary structure and subcellular localization prediction
ProtTrans (ProtBert, ProtAlbert, ProtT5) (https://doi.org/10.1101/2020.07.12.199554)
- ProtBertSec and ProtBertLoc for secondary structure and subcellular localization prediction
UniRep (https://www.nature.com/articles/s41592-019-0598-1)
ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
PLUS (https://github.com/mswzeus/PLUS/)
CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
PB-Tucker (https://www.biorxiv.org/content/10.1101/2021.01.21.427551v1)
GoPredSim (https://www.nature.com/articles/s41598-020-80786-0)
DeepBlast (https://www.biorxiv.org/content/10.1101/2020.11.03.365932v1)

Tools by category

Pipeline

align:
- DeepBlast (https://www.biorxiv.org/content/10.1101/2020.11.03.365932v1)
embed:
- ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
- ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans T5 trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans T5 trained on BFD and fine-tuned on UniRef50 (in-house)
- UniRep (https://www.nature.com/articles/s41592-019-0598-1)
- ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
- PLUS (https://github.com/mswzeus/PLUS/)
- CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
project:
- t-SNE
- UMAP
- PB-Tucker (https://www.biorxiv.org/content/10.1101/2021.01.21.427551v1)
visualize:
- 2D/3D sequence embedding space
extract:
- supervised:
  - SeqVec: DSSP3, DSSP8, disorder, subcellular location and membrane-boundness as in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8
  - ProtBertSec and ProtBertLoc as reported in https://doi.org/10.1101/2020.07.12.199554
- unsupervised:
  - via sequence-level embeddings (reduced_embeddings) and pairwise distance (Euclidean as in goPredSim; other metrics, e.g. cosine, are available)
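
The unsupervised extraction listed above can be pictured as nearest-neighbour annotation transfer: compute the distance between a query's per-sequence embedding and each annotated reference embedding, then copy the annotation of the closest reference. The sketch below is a minimal pure-Python illustration under that assumption, not the pipeline's implementation. The function names, reference identifiers, vectors, and labels are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; undefined for zero-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm


def transfer_annotation(
    query: Vector,
    references: Dict[str, Tuple[Vector, str]],
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> Tuple[str, str]:
    """Return (reference id, annotation) of the reference closest to the query."""
    nearest = min(references, key=lambda rid: distance(query, references[rid][0]))
    return nearest, references[nearest][1]


# Hypothetical annotated reference set: id -> (embedding, label).
references = {
    "ref_A": ([2.0, 0.0], "cytoplasm"),
    "ref_B": ([1.0, 1.0], "nucleus"),
    "ref_C": ([0.0, 3.0], "membrane"),
}

query = [0.9, 1.2]
hit_euclidean = transfer_annotation(query, references)
hit_cosine = transfer_annotation(query, references, distance=cosine_distance)
print(hit_euclidean, hit_cosine)
```

Passing the metric as a parameter keeps the transfer logic unchanged when switching between Euclidean and cosine distance.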

General purpose embedders

ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
ProtTrans T5 trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
ProtTrans T5 trained on BFD + fine-tuned on UniRef50 (https://doi.org/10.1101/2020.07.12.199554)
fastText
GloVe
Word2Vec
UniRep (https://www.nature.com/articles/s41592-019-0598-1)
ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
PLUS (https://github.com/mswzeus/PLUS/)
CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
