Install functions to simulate gene expression compendia
Project description
ponyo
Alexandra J. Lee and Casey S. Greene 2020
University of Pennsylvania
This repository is named after the character Ponyo, from Hayao Miyazaki's animated film Ponyo, who uses her magic to simulate a human appearance after getting a sample of human blood. The method simulates new gene expression data by training a generative neural network on existing gene expression data to learn a representation of gene expression patterns.
Installation
This package can be installed using pip:
pip install ponyo
Types of simulations
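ponyo implements 3 types of simulations: random sampling, plus two flavors of latent transformation (one that shifts many template experiments and one that shifts a single template experiment):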
Name | Description |
---|---|
Simulation by random sampling | Simulates gene expression data by randomly sampling from the latent space distribution. Two functions implement this approach: simulate_by_random_sampling is a wrapper that loads trained VAE models from the directory <root>/<analysis name>/"models"/<NN_architecture>, and run_sample_simulation runs the simulation. Note: simulate_by_random_sampling assumes the files are organized as described above. If your directory layout differs, call run_sample_simulation directly and pass in the trained VAE models. An example of how to use this can be found here. |
Simulation by latent transformation | Simulates gene expression data by encoding experiments into the latent space and then shifting the samples from each experiment within that space. Unlike "Simulation by random sampling", this method accounts for experiment-level information by shifting samples from the same experiment together. Two functions implement this approach: simulate_by_latent_transformation is a wrapper that loads trained VAE models from the directory <root>/<analysis name>/"models"/<NN_architecture>, and run_latent_transformation_simulation runs the simulation. Note: simulate_by_latent_transformation assumes the files are organized as described above. If your directory layout differs, call run_latent_transformation_simulation directly and pass in the VAE models trained using run_tybalt_training in vae.py. There are 2 flavors of this approach. The first, simulate_by_latent_transformation, takes a dataset with multiple experiments (your template experiments) and outputs the same number of new simulated experiments, each created by shifting one of the input template experiments. An example of how to use this can be found here. The second, shift_template_experiment, takes a single template experiment and can output multiple simulated experiments by shifting that template to different locations in the latent space. An example of how to use this can be found here. |
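The two ideas in the table can be illustrated with a toy sketch. This is not ponyo's implementation: the random linear `encode`/`decode` functions are hypothetical stand-ins for a trained VAE, and the helper names `simulate_random_samples` and `shift_experiment` are made up for illustration. The sketch assumes that "shifting samples together" means adding one shared vector to every encoded sample of an experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, latent_dim = 50, 5

# Hypothetical stand-in for a trained VAE: random linear maps, illustration only.
encoder_weights = rng.normal(size=(n_genes, latent_dim))
decoder_weights = rng.normal(size=(latent_dim, n_genes))


def encode(expression):
    return expression @ encoder_weights


def decode(latent):
    # Sigmoid keeps outputs in the 0-1 range of normalized expression data.
    return 1.0 / (1.0 + np.exp(-(latent @ decoder_weights)))


def simulate_random_samples(num_samples):
    """Random sampling: draw latent vectors from N(0, I) and decode them."""
    latent = rng.normal(size=(num_samples, latent_dim))
    return decode(latent)


def shift_experiment(template_expression, latent_means, latent_stds):
    """Latent transformation: move all samples of one experiment by one shared vector."""
    encoded = encode(template_expression)
    centroid = encoded.mean(axis=0)
    # Pick a new location for the experiment in the latent space.
    new_centroid = rng.normal(loc=latent_means, scale=latent_stds)
    shifted = encoded + (new_centroid - centroid)
    return decode(shifted), encoded, shifted


simulated_compendium = simulate_random_samples(100)

template = rng.uniform(size=(6, n_genes))  # one experiment with 6 samples
simulated_experiment, encoded, shifted = shift_experiment(
    template, np.zeros(latent_dim), np.ones(latent_dim)
)
```

Because every sample receives the same shift, the samples keep their positions relative to each other in the latent space, which is how the experiment-level structure is preserved in this sketch.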
How to use
Example notebooks using ponyo on test data can be found in human_tests.
Additionally, this method has been used in the simulate-expression-compendia and generic-expression-patterns repositories.
Setting random seeds
To keep the VAE training deterministic, you will need to set multiple random seeds:
- numpy random
- python random
- tensorflow random
For an example of this, see human_tests.
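As a rough sketch of what setting the three seeds can look like, the helper below (a hypothetical name, not part of ponyo) seeds Python, NumPy and, if it is installed, TensorFlow. The TensorFlow call differs between major versions, so the sketch checks which one is available.

```python
import random

import numpy as np


def set_all_seeds(seed):
    """Seed the Python, NumPy and (if installed) TensorFlow random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
    except ImportError:
        return  # TensorFlow is not installed; nothing more to seed
    if hasattr(tf.random, "set_seed"):
        tf.random.set_seed(seed)  # TensorFlow 2.x
    else:
        tf.set_random_seed(seed)  # TensorFlow 1.x


set_all_seeds(42)
first_draw = (random.random(), np.random.rand())

set_all_seeds(42)
second_draw = (random.random(), np.random.rand())
```

After reseeding with the same value, the two draws match. Call the helper once at the top of a notebook, before any model is built or trained.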
Configuration file
The table lists the core parameters required to generate simulated data using modules from ponyo. Parameters marked with * vary depending on the simulation approach.
Name | Description |
---|---|
local_dir | str: Parent directory on local machine to store intermediate results |
dataset_name | str: Name for analysis directory containing notebooks using ponyo |
raw_data_filename | str: File storing raw gene expression data |
normalized_data_filename | str: File storing normalized gene expression data. This file is generated by normalize_expression_data(). |
metadata_filename* | str: File containing metadata associated with data |
experiment_ids_filename* | str: File containing list of experiment ids that have gene expression data available |
scaler_transform_filename | str: Python pickle file to store mapping from normalized to raw gene expression range. This file is generated by normalize_expression_data(). |
simulation_type | str: Name of simulation approach directory to store results locally |
NN_architecture | str: Name of neural network architecture to use. Format NN_<intermediate layer>_<latent layer> |
learning_rate | float: Step size used for gradient descent. In other words, how quickly the method learns |
batch_size | int: Training is performed in batches. This determines the number of samples to consider at a given time |
epochs | int: Number of times to train over the entire input dataset |
kappa | float: How fast to linearly ramp up KL loss |
intermediate_dim | int: Size of the hidden layer |
latent_dim | int: Size of the bottleneck layer |
epsilon_std | float: Standard deviation of Normal distribution to sample latent space |
validation_frac | float: Fraction of input samples to use to validate for VAE training |
num_simulated_samples* | int: If using random sampling approach, simulate a compendium with this many samples |
num_simulated_experiments* | int: If using latent-transformation approach, simulate a compendium with this many experiments |
num_simulated* | int: If using template-based approach, simulate this many experiments |
metadata_delimiter* | str: Delimiter to parse metadata file |
metadata_experiment_colname* | str: Column header that contains experiment id that maps expression data and metadata |
metadata_sample_colname* | str: Column header that contains sample id that maps expression data and metadata |
project_id* | int: If using template-based approach, experiment id to use as template experiment |
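To make the training-related parameters concrete, here is a small sketch that holds a few of them in a Python dictionary and checks that NN_architecture agrees with the layer sizes. The values are made up for illustration, the dictionary is not ponyo's configuration file format, and `parse_architecture` is a hypothetical helper. It assumes the two placeholders in `NN_<intermediate layer>_<latent layer>` are integer layer sizes.

```python
import re

# Hypothetical values for illustration; these are not defaults recommended by ponyo.
config = {
    "NN_architecture": "NN_2500_30",
    "learning_rate": 0.001,
    "batch_size": 10,
    "epochs": 10,
    "kappa": 0.01,
    "intermediate_dim": 2500,
    "latent_dim": 30,
    "epsilon_std": 1.0,
    "validation_frac": 0.25,
}


def parse_architecture(name):
    """Split 'NN_<intermediate layer>_<latent layer>' into two integers."""
    match = re.fullmatch(r"NN_(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"Unexpected NN_architecture format: {name!r}")
    return int(match.group(1)), int(match.group(2))


intermediate_dim, latent_dim = parse_architecture(config["NN_architecture"])

# The directory name should agree with the layer sizes used for training.
sizes_match = (intermediate_dim, latent_dim) == (
    config["intermediate_dim"],
    config["latent_dim"],
)
```

A check like this catches the case where models are saved under a directory name that no longer matches the dimensions they were trained with.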
For guidance on setting VAE training parameters, see the configurations used in the simulate-expression-compendia and generic-expression-patterns repositories.
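One way to read the kappa parameter ("how fast to linearly ramp up KL loss") is as a per-epoch increment to the weight on the KL term, capped at 1. The sketch below assumes that schedule; the source does not spell it out, and `kl_weight_schedule` is a hypothetical helper rather than a ponyo function.

```python
def kl_weight_schedule(kappa, epochs, start=0.0):
    """Return the KL weight used at each epoch under a linear ramp capped at 1."""
    weights = []
    beta = start
    for _ in range(epochs):
        weights.append(beta)
        beta = min(1.0, beta + kappa)
    return weights


schedule = kl_weight_schedule(kappa=0.25, epochs=6)
# schedule is [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
```

Under this assumption, a small kappa lets the model focus on reconstruction for many epochs before the KL term reaches full weight, while a kappa of 1 applies the full KL weight from the second epoch onward.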
Acknowledgements
We would like to thank Marvin Thielk for adding coverage to tests and Ben Heil for contributing code to add more flexibility.