Generating dense embeddings for proteins using kernel PCA
This tool generates low-dimensional, continuous, distributed vector representations for non-numeric entities such as text or biological sequences (e.g. DNA or proteins) via kernel PCA with rational kernels.
The current implementation accepts any input dataset that can be read as a list of strings.
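To make the idea concrete, here is a minimal sketch of kernel PCA on a list of strings. It is not RatVec's API or its implementation: the function names (`ngram_counts`, `spectrum_kernel`, `kernel_pca`) are hypothetical, the n-gram spectrum kernel is used only as a simple stand-in for a rational kernel, and the toy sequences are made up for illustration. It assumes NumPy is installed and that every string has at least `n` characters.

```python
from collections import Counter

import numpy as np


def ngram_counts(sequence, n=2):
    """Count the overlapping n-grams of a string."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def spectrum_kernel(a, b, n=2):
    """n-gram spectrum kernel: dot product of the two n-gram count vectors."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    return float(sum(count * cb[gram] for gram, count in ca.items()))


def kernel_pca(strings, n_components=2, n=2):
    """Embed strings via kernel PCA on a cosine-normalized spectrum kernel."""
    m = len(strings)
    gram = np.array([[spectrum_kernel(s, t, n) for t in strings] for s in strings])
    # Cosine-normalize so that sequence length does not dominate similarity.
    # Assumption: every string has at least n characters (non-zero diagonal).
    diag = np.sqrt(np.diag(gram))
    gram = gram / np.outer(diag, diag)
    # Center the kernel matrix in feature space.
    ones = np.full((m, m), 1.0 / m)
    centered = gram - ones @ gram - gram @ ones + ones @ gram @ ones
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    # Scale each eigenvector by sqrt(eigenvalue) to get the projections.
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy input: any dataset that can be read as a list of strings.
sequences = ["MKTAYIAKQR", "MKTAYIAKQL", "GSHMLEDPVD", "GSHMLEDPAA"]
embeddings = kernel_pca(sequences, n_components=2)
print(embeddings.shape)  # (4, 2)
```

Each row of `embeddings` is the low-dimensional vector for the corresponding input string. In this toy example, the two sequences that share most of their 2-grams end up close together, while unrelated sequences land far apart.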
Installation
RatVec can be installed on Python 3.6+ from PyPI with the following code in your favorite terminal:
$ pip install ratvec
or from the latest code on GitHub with:
$ pip install git+https://github.com/ratvec/ratvec.git
It can be installed in development mode with:
$ git clone https://github.com/ratvec/ratvec.git
$ cd ratvec
$ pip install -e .
The -e dynamically links the code in the git repository to the Python site-packages so your changes get reflected immediately.
How to Use
Installing RatVec also installs a command line interface. Check it out with:
$ ratvec --help
RatVec has three main commands: generate, train, and evaluate.
Generate. Download and prepare the SwissProt data set that is showcased in the RatVec paper.
$ ratvec generate
Train. Compute KPCA embeddings on a given data set. Please run the following command to see the arguments:
$ ratvec train --help
Evaluate. Evaluate and optimize KPCA embeddings. Please run the following command to see the arguments:
$ ratvec evaluate --help
Showcase Dataset
The dataset for the application presented in the paper (the SwissProt dataset [1] used by Boutet et al. [2]) can be downloaded directly from here or by running the following command:
$ ratvec generate
References