PROFILE methodology for the binarisation and normalisation of RNA-seq data
profile_binr
The PROFILE methodology for the binarisation and normalisation of RNA-seq data.
This is a Python interface to a set of normalisation and binarisation functions for RNA-seq data originally written in R.
This software package is based on the methodology developed by Beal, Jonas; Montagud, Arnau; Traynard, Pauline; Barillot, Emmanuel; and Calzone, Laurence of the Computational Systems Biology of Cancer team at Institut Curie (contact-sysbio@curie.fr).
This is the repository containing the original implementation in Rmarkdown notebooks.
Installation
This software has only been tested on Debian-based GNU/Linux distributions; in principle, it should work on any *nix system.
Prerequisites
system dependencies
- R, version 4.0.2 (2020-06-22) -- "Taking Off Again"
  - A newer R version may work, but this has not been tested.
- Build tools needed to install the R dependencies:
  - make
  - g++
  - gfortran
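As a quick sanity check before installing, the following minimal sketch looks for the system dependencies listed above on the `PATH`. The executable name `Rscript` is an assumption (it normally ships with R); adjust the names if your distribution packages them differently.

```python
import shutil

# Executables taken from the prerequisites list above.
# "Rscript" is an assumption: it is normally installed together with R.
REQUIRED_TOOLS = ("Rscript", "make", "g++", "gfortran")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All system dependencies were found on the PATH.")
```

This only confirms that the executables exist; it does not check their versions.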
R dependencies
- mclust
- diptest
- moments
- magrittr
- tidyr
- dplyr
- tibble
- bigmemory
- doSNOW
- foreach
- glue
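One possible way to install the R dependencies listed above is to build a single `install.packages` call and run it through `Rscript`. This is only a sketch: the helper `build_install_command` and the choice of CRAN mirror are illustrative assumptions, not part of profile_binr, and it assumes every package is installed from CRAN.

```python
# R packages copied from the list above.
R_PACKAGES = [
    "mclust", "diptest", "moments", "magrittr", "tidyr", "dplyr",
    "tibble", "bigmemory", "doSNOW", "foreach", "glue",
]


def build_install_command(packages, repos="https://cloud.r-project.org"):
    """Build an Rscript command line that installs the given CRAN packages.

    Hypothetical helper for illustration; the mirror URL is an assumption.
    """
    quoted = ", ".join(f'"{name}"' for name in packages)
    expression = f'install.packages(c({quoted}), repos = "{repos}")'
    return ["Rscript", "-e", expression]


command = build_install_command(R_PACKAGES)
print(command[2])

# To actually run the installation (requires R and the build tools):
# import subprocess
# subprocess.run(command, check=True)
```

The `subprocess.run` call is left commented out so that the snippet can be inspected without modifying your R library.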
Using pip
This is a barebones functional example. We recommend installing within a Python virtual environment.
pip install git+https://github.com/bnediction/profile_binr
Usage
Once again, this is a minimal example:
from profile_binr import ProfileBin
import pandas as pd
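# your data is assumed to contain observations (cells) as
# rows and genes as columns; the first CSV column is assumed
# to hold the cell identifiers and is used as the index
data = pd.read_csv("path/to/your/data.csv", index_col=0)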
data.head()
| cell_id | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b |
|---|---|---|---|---|---|---|---|---|
| HSPC_025 | 0.0 | 4.891604 | 1.426148 | 0.0 | 0.0 | 2.599758 | 2.954035 | 6.357369 |
| HSPC_031 | 0.0 | 6.877725 | 0.000000 | 0.0 | 0.0 | 2.423483 | 1.804914 | 0.000000 |
| HSPC_037 | 0.0 | 0.000000 | 6.913384 | 0.0 | 0.0 | 2.051659 | 8.265465 | 0.000000 |
| LT-HSC_001 | 0.0 | 0.000000 | 8.178374 | 0.0 | 0.0 | 6.419817 | 3.453502 | 2.579528 |
| HSPC_001 | 0.0 | 0.000000 | 9.475577 | 0.0 | 0.0 | 7.733370 | 1.478900 | 0.000000 |
# create the binarisation instance using the dataframe
# with the index containing the cell identifier
# and the columns being the gene names
probin = ProfileBin(data)
# compute the criteria used to binarise/normalise the data.
# This method runs in parallel; you can specify the
# number of workers with an integer
probin.fit(8)  # fit using 8 workers
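If you would rather not hard-code the number of workers, a small sketch such as the following derives it from the machine. Leaving one core free is a common convention and an assumption here, not a requirement of profile_binr.

```python
import os

# os.cpu_count() may return None, so fall back to a single worker.
available_cores = os.cpu_count() or 1

# Keep one core free for the rest of the system (assumption, see above),
# but never go below one worker.
n_workers = max(1, available_cores - 1)
print(f"Using {n_workers} worker(s)")

# The result can then be passed to the fit method shown above:
# probin.fit(n_workers)
```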
# Look at the computed criteria
probin.criteria.head(8)
| | Dip | BI | Kurtosis | DropOutRate | MeanNZ | DenPeak | Amplitude | Category |
|---|---|---|---|---|---|---|---|---|
| Clec1b | 0.358107 | 1.635698 | 54.017736 | 0.876208 | 1.520978 | -0.007249 | 8.852181 | ZeroInf |
| Kdm3a | 0.000000 | 2.407548 | -0.784019 | 0.326087 | 3.847940 | 0.209239 | 10.126676 | Bimodal |
| Coro2b | 0.000000 | 2.320060 | 7.061604 | 0.658213 | 2.383819 | 0.004597 | 9.475577 | ZeroInf |
| 8430408G22Rik | 0.684454 | 3.121069 | 21.729044 | 0.884058 | 2.983472 | 0.005663 | 9.067857 | ZeroInf |
| Clec9a | 1.000000 | 2.081717 | 140.089285 | 0.965580 | 2.280293 | -0.009361 | 9.614233 | Discarded |
| Phf6 | 0.000000 | 1.988667 | -1.389024 | 0.035628 | 5.025501 | 2.017547 | 10.135226 | Bimodal |
| Usp14 | 0.000000 | 2.208080 | -1.224987 | 0.007850 | 6.109964 | 8.245570 | 11.088750 | Bimodal |
| Tmem167b | 0.000000 | 2.430813 | 0.093023 | 0.393720 | 3.448331 | 0.072982 | 9.486826 | Bimodal |
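To give an intuition for some of the columns above, here is a pure-Python sketch of three of the simpler criteria. The definitions are assumptions inferred from the column names (`DropOutRate` as the fraction of zero values, `MeanNZ` as the mean of the non-zero values, `Amplitude` as maximum minus minimum); the package's own computation is done in R and may differ in detail.

```python
def dropout_rate(values):
    """Fraction of observations with zero expression (assumed definition)."""
    return sum(1 for v in values if v == 0) / len(values)


def mean_nonzero(values):
    """Mean of the non-zero observations (assumed definition)."""
    nonzero = [v for v in values if v != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


def amplitude(values):
    """Range of the observations: maximum minus minimum (assumed definition)."""
    return max(values) - min(values)


# Kdm3a expression in the five cells shown by data.head() above.
kdm3a = [4.891604, 6.877725, 0.0, 0.0, 0.0]

print("DropOutRate:", dropout_rate(kdm3a))
print("MeanNZ:", mean_nonzero(kdm3a))
print("Amplitude:", amplitude(kdm3a))
```

These numbers will not match the Kdm3a row of the criteria table, which was computed on the full dataset rather than on the five cells used here.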
# get binarised data (alternatively .binarise()):
my_bin = probin.binarize()
my_bin.head()
| | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b |
|---|---|---|---|---|---|---|---|---|
| HSPC_025 | NaN | 1.0 | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 |
| HSPC_031 | NaN | 1.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 |
| HSPC_037 | NaN | 0.0 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 |
| LT-HSC_001 | NaN | 0.0 | 1.0 | NaN | NaN | 1.0 | 0.0 | 0.0 |
| HSPC_001 | NaN | 0.0 | 1.0 | NaN | NaN | 1.0 | 0.0 | 0.0 |
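The binarised output contains `NaN` wherever no binary value was assigned. As an illustration of how one might separate genes that carry at least one binarised value from genes that are `NaN` everywhere, here is a standard-library sketch using the five rows shown above. The helper `split_by_coverage` is hypothetical and not part of profile_binr.

```python
import math

nan = float("nan")

# Columns of the binarised table above (five cells per gene).
binarised_head = {
    "Clec1b": [nan, nan, nan, nan, nan],
    "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0],
    "Coro2b": [nan, nan, 1.0, 1.0, 1.0],
    "8430408G22Rik": [nan, nan, nan, nan, nan],
    "Clec9a": [nan, nan, nan, nan, nan],
    "Phf6": [0.0, 0.0, 0.0, 1.0, 1.0],
    "Usp14": [0.0, 0.0, 1.0, 0.0, 0.0],
    "Tmem167b": [1.0, 0.0, 0.0, 0.0, 0.0],
}


def split_by_coverage(columns):
    """Split genes into those with some binarised value and all-NaN genes."""
    usable, empty = [], []
    for gene, values in columns.items():
        if all(math.isnan(v) for v in values):
            empty.append(gene)
        else:
            usable.append(gene)
    return usable, empty


usable, empty = split_by_coverage(binarised_head)
print("Genes with binarised values:", usable)
print("Genes that are NaN in every cell:", empty)
```

On the actual DataFrame, the pandas call `my_bin.dropna(axis=1, how="all")` drops the all-`NaN` columns in one step.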
# likewise for normalised data:
my_norm = probin.normalize()
my_norm.head()
| | Clec1b | Kdm3a | Coro2b | 8430408G22Rik | Clec9a | Phf6 | Usp14 | Tmem167b |
|---|---|---|---|---|---|---|---|---|
| HSPC_025 | 0.0 | 9.786196e-01 | 0.184102 | 0.0 | NaN | 0.000801 | 8.318176e-05 | 9.999970e-01 |
| HSPC_031 | 0.0 | 9.999981e-01 | 0.000000 | 0.0 | NaN | 0.000462 | 8.084114e-07 | 6.874397e-11 |
| HSPC_037 | 0.0 | 4.408417e-09 | 0.892449 | 0.0 | NaN | 0.000145 | 9.999940e-01 | 6.874397e-11 |
| LT-HSC_001 | 0.0 | 4.408417e-09 | 1.000000 | 0.0 | NaN | 0.991865 | 6.230178e-04 | 1.599753e-04 |
| HSPC_001 | 0.0 | 4.408417e-09 | 1.000000 | 0.0 | NaN | 0.999865 | 2.171153e-07 | 6.874397e-11 |
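The normalised values shown above all lie between 0 and 1, with `NaN` for the discarded gene. Before feeding such values into a downstream model, it can be useful to verify this. The following is a minimal sketch; the helper `check_unit_interval` is hypothetical and not part of profile_binr.

```python
import math


def check_unit_interval(values):
    """Count valid and missing values; fail if any value lies outside [0, 1]."""
    n_valid = 0
    n_missing = 0
    for value in values:
        if math.isnan(value):
            n_missing += 1
        elif 0.0 <= value <= 1.0:
            n_valid += 1
        else:
            raise ValueError(f"value outside [0, 1]: {value}")
    return n_valid, n_missing


# First row (HSPC_025) of the normalised table above.
hspc_025 = [0.0, 9.786196e-01, 0.184102, 0.0, float("nan"),
            0.000801, 8.318176e-05, 9.999970e-01]

n_valid, n_missing = check_unit_interval(hspc_025)
print(f"{n_valid} valid value(s), {n_missing} missing")
```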
References
Please cite the original authors' work using the following BibTeX:
@article{Beal2019a,
abstract = {Logical models of cancer pathways are typically built by mining the literature for relevant experimental observations. They are usually generic as they apply for large cohorts of individuals. As a consequence, they generally do not capture the heterogeneity of patient tumors and their therapeutic responses. We present here a novel framework, referred to as PROFILE, to tailor logical models to a particular biological sample such as a patient tumor. This methodology permits to compare the model simulations to individual clinical data, i.e., survival time. Our approach focuses on integrating mutation data, copy number alterations (CNA), and expression data (transcriptomics or proteomics) to logical models. These data need first to be either binarized or set between 0 and 1, and can then be incorporated in the logical model by modifying the activity of the node, the initial conditions or the state transition rates. The use of MaBoSS, a tool based on Monte-Carlo kinetic algorithm to perform stochastic simulations on logical models results in model state probabilities, and allows for a semi-quantitative study of the model phenotypes and perturbations. As a proof of concept, we use a published generic model of cancer signaling pathways and molecular data from METABRIC breast cancer patients. For this example, we test several combinations of data incorporation and discuss that, with these data, the most comprehensive patient-specific cancer models are obtained by modifying the nodes' activity of the model with mutations, in combination or not with CNA data, and altering the transition rates with RNA expression. We conclude that these model simulations show good correlation with clinical data such as patients' Nottingham prognostic index (NPI) subgrouping and survival time. 
We observe that two highly relevant cancer phenotypes derived from personalized models, Proliferation and Apoptosis, are biologically consistent prognostic factors: patients with both high proliferation and low apoptosis have the worst survival rate, and conversely. Our approach aims to combine the mechanistic insights of logical modeling with multi-omics data integration to provide patient-relevant models. This work leads to the use of logical modeling for precision medicine and will eventually facilitate the choice of patient-specific drug treatments by physicians.},
author = {Beal, Jonas and Montagud, Arnau and Traynard, Pauline and Barillot, Emmanuel and Calzone, Laurence},
doi = {10.3389/fphys.2018.01965},
issn = {1664042X},
journal = {Frontiers in Physiology},
keywords = {Breast cancer,Data discretization,Logical models,Personalized mechanistic models,Personalized medicine,Stochastic simulations},
number = {JAN},
pages = {1--23},
title = {{Personalization of logical models with multi-omics data allows clinical stratification of patients}},
volume = {10},
year = {2019}
}