scRNA-Seq data binarisation and synthetic generation from Boolean dynamics.
scBoolSeq
Installation
We recommend installing scBoolSeq via conda, but we also provide a standard pip installation (which requires installing R and a set of R packages beforehand).
Conda
conda install -c conda-forge -c colomoto scboolseq
Pip
You need R installed; see the specification of the R dependencies below.
pip install scboolseq
Docker
The scBoolSeq command can be invoked using the bnediction/scboolseq image:
docker run --rm -it -v $PWD:/data -w /data bnediction/scboolseq scBoolSeq ...
Usage
Command line
scBoolSeq provides a rich CLI allowing programmatic access to its main functionalities, namely the binarization
of RNA-Seq data and the generation of synthetic RNA-Seq data
reflecting activation states from Boolean network simulations. Once the tool is correctly installed,
its help message and that of each subcommand explain all the possible parameters. Some minimal examples are presented here.
Main CLI
$ scBoolSeq -h
usage: scBoolSeq <command> [<args>]
Available commands:
* binarize Binarize a RNA-Seq dataset.
* synthesize Simulate a RNA-Seq experiment from Boolean dynamics.
* from_file Repeat a binarization or synthethic generation experiment, based on a config file.
NOTE on TSV/CSV file specs:
* If '.csv', the file is assumed to use the standard separator for columns ','.
* The index (gene or sample identifiers) is assumed to be the first column.
* The scBoolSeq is designed with consistency in mind.
The `output` (binarized or synthetic expression frame) will have the same disposition
(genes x observations | observations x genes) as the `input`.
If a `reference` is specified, its disposition must match the `input`'s.
scBoolSeq: bulk and single-cell RNA-Seq data binarization and synthetic generation from Boolean dynamics.
positional arguments:
command Subcommand to run
optional arguments:
-h, --help show this help message and exit
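The file conventions in the note above can be mirrored when preparing inputs in Python. The sketch below uses a hypothetical helper, `infer_separator`, which is not part of scBoolSeq. It assumes that any file not ending in `.csv` is tab-separated (the help text only states the `.csv` case) and that a trailing `.gz` should be ignored when checking the extension.

```python
from pathlib import Path


def infer_separator(path: str) -> str:
    """Guess the column separator from a file name (hypothetical helper)."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    # '.csv' uses ','; everything else is ASSUMED to be tab-separated.
    return "," if suffixes and suffixes[-1] == ".csv" else "\t"


print(repr(infer_separator("Nestorowa_binarized.csv")))
print(repr(infer_separator("data_Nestorowa.tsv.gz")))
```

The returned value can be passed as `sep` to `pandas.read_csv`, together with `index_col=0`, since the index is expected in the first column.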
Binarization
Minimal example of binarization, specifying some optional parameters.
curl -fOL https://github.com/pinellolab/STREAM/raw/master/stream/tests/datasets/Nestorowa_2016/data_Nestorowa.tsv.gz
ls
# data_Nestorowa.tsv.gz
time scBoolSeq binarize data_Nestorowa.tsv.gz --genes-are-rows \
    --output Nestorowa_binarized.csv --n-threads 10 --dump-config --dump-criteria
# ________________________________________________________
# Executed in 34.49 secs fish external
# usr time 30.04 secs 1211.00 micros 30.04 secs
# sys time 3.90 secs 171.00 micros 3.89 secs
ls
# data_Nestorowa.tsv.gz scBoolSeq_criteria_data_Nestorowa_2022-04-27_15h14m27.tsv
# Nestorowa_binarized.csv scBoolSeq_experiment_config_2022-04-27_15h14m27.toml
# Visualize the binarized expression frame.
# Note that some entries are undefined (NaN)
# These might be discarded genes for which no binarization or synthesis can occur,
# or observations which did not pass the thresholds to be set to 0 or 1.
python -c 'import pandas as pd; print(pd.read_csv("Nestorowa_binarized.csv", index_col=0).iloc[0:7, 0:7])'
# Clec1b Kdm3a Coro2b 8430408G22Rik Clec9a Phf6 Usp14
# HSPC_025 NaN 1.0 NaN NaN NaN 0.0 0.0
# HSPC_031 NaN 1.0 NaN NaN NaN 0.0 0.0
# HSPC_037 NaN 0.0 1.0 NaN NaN 0.0 1.0
# LT-HSC_001 NaN 0.0 1.0 NaN NaN 1.0 0.0
# HSPC_001 NaN 0.0 1.0 NaN NaN 1.0 0.0
# HSPC_008 1.0 1.0 NaN NaN NaN 1.0 0.0
# HSPC_014 NaN 0.0 NaN NaN NaN 0.0 1.0
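Since undefined entries are expected in the binarized frame, it can help to quantify them before going further. The following is a minimal sketch using plain pandas on a hand-copied excerpt of the output above (four observations, four genes). It does not call scBoolSeq, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First rows of the binarized frame shown above (observations x genes).
binarized = pd.DataFrame(
    {
        "Clec1b": [np.nan, np.nan, np.nan, np.nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0],
        "Coro2b": [np.nan, np.nan, 1.0, 1.0],
        "Phf6": [0.0, 0.0, 0.0, 1.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001"],
)

# Fraction of undetermined (NaN) entries per gene.
nan_fraction = binarized.isna().mean()

# Genes with no determined value in this excerpt.
all_nan_genes = nan_fraction[nan_fraction == 1.0].index.tolist()

# Keep only genes with at least one determined observation.
determined = binarized.dropna(axis="columns", how="all")

print(nan_fraction)
print(all_nan_genes)
```

A gene that is entirely NaN in a small excerpt is not necessarily a discarded gene; the same check on the full frame gives a more reliable picture.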
Synthetic generation from Boolean states
cat minimal_boolean_example.csv
# the output is not commented out so that it can be copied
# and perhaps be read with `x = pandas.read_clipboard(sep=',', index_col=0)`
,HSPC_025,HSPC_031,HSPC_037,LT-HSC_001,HSPC_001,HSPC_008,HSPC_014,HSPC_020,HSPC_026,HSPC_038,LT-HSC_002,HSPC_002,HSPC_009,HSPC_015,HSPC_021
Kdm3a,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
Coro2b,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
8430408G22Rik,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
Clec9a,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
Phf6,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
# Generate 20 samples per Boolean state, using 12 threads.
# Setting the random number generator's seed ensures reproducibility.
time scBoolSeq synthesize --genes-are-rows minimal_boolean_example.csv --reference data_Nestorowa.tsv.gz \
    --n-samples 20 --output new_synthetic.tsv --n-threads 12 --rng-seed 1234
# ________________________________________________________
# Executed in 43.85 secs fish external
# usr time 22.08 secs 0.00 millis 22.08 secs
# sys time 3.65 secs 3.31 millis 3.65 secs
# visualize the newly generated synthetic scRNA-Seq experiment
python -c 'import pandas as pd; print(pd.read_csv("new_synthetic.tsv", index_col=0, sep="\t").iloc[0:3, 0:7])'
# HSPC_025 HSPC_031 HSPC_037 LT-HSC_001 HSPC_001 HSPC_008 HSPC_014
# Kdm3a 7.328819 8.536391 0.000000 0.000000 0.821561 7.030519 1.891949
# Coro2b 0.000000 0.000000 6.457878 5.479887 0.000000 0.000000 5.503554
# 8430408G22Rik 0.000000 0.005110 0.000000 0.000000 0.000000 6.428994 0.000000
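The Boolean states file used above stores genes as rows, which is why the command passes `--genes-are-rows`. As a rough sketch of how such a file can be inspected from Python, the snippet below parses a small excerpt of it with pandas, checks that it only contains 0 and 1, and transposes it to the observations x genes layout. The excerpt is read from an in-memory string rather than from disk, and the variable names are illustrative.

```python
import io

import pandas as pd

# Excerpt of the Boolean states file shown above (genes x observations).
csv_text = """,HSPC_025,HSPC_031,HSPC_037
Kdm3a,1.0,1.0,0.0
Coro2b,1.0,1.0,1.0
"""

# The first column holds the gene identifiers, so it becomes the index.
states = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Sanity check: a Boolean state frame should only contain 0 and 1.
assert states.isin([0.0, 1.0]).all().all()

# Transpose to observations x genes.
states_T = states.T
print(states_T.shape)
```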
Python API
Here a minimal example is presented, using the same dataset as the CLI usage guide. For further information, please check the documentation.
import pandas as pd
from scboolseq import scBoolSeq
# read in the normalized expression data
nestorowa = pd.read_csv("data_Nestorowa.tsv.gz", index_col=0, sep="\t")
nestorowa.iloc[1:5, 1:5]
# HSPC_031 HSPC_037 LT-HSC_001 HSPC_001
# Kdm3a 6.877725 0.000000 0.000000 0.000000
# Coro2b 0.000000 6.913384 8.178374 9.475577
# 8430408G22Rik 0.000000 0.000000 0.000000 0.000000
# Clec9a 0.000000 0.000000 0.000000 0.000000
#
# NOTE : here, genes are rows and observations are columns
# scBoolSeq expects genes to be columns, thus we transpose the DataFrame.
scbool_nestorowa = scBoolSeq(data=nestorowa.T, r_seed=1234)
##
## Binarization
##
scbool_nestorowa.fit() # compute binarization criteria
binarized = scbool_nestorowa.binarize(nestorowa.T)
binarized.iloc[1:5, 1:5]
# Kdm3a Coro2b 8430408G22Rik Phf6
# HSPC_031 1.0 NaN NaN 0.0
# HSPC_037 0.0 1.0 NaN 0.0
# LT-HSC_001 0.0 1.0 NaN 1.0
# HSPC_001 0.0 1.0 NaN 1.0
##
## Synthetic RNA-Seq generation from Boolean states
##
scbool_nestorowa.simulation_fit() # compute simulation criteria
# we generate Boolean states by randomly (equiprobably) binarizing the
# undetermined values from the previous binarization.
from scboolseq.simulation import random_nan_binariser
fully_bin = binarized.iloc[1:5, 1:5].pipe(random_nan_binariser)
fully_bin
# Kdm3a Coro2b 8430408G22Rik Phf6
# HSPC_031 1.0 0.0 1.0 0.0
# HSPC_037 0.0 1.0 1.0 0.0
# LT-HSC_001 0.0 1.0 0.0 1.0
# HSPC_001 0.0 1.0 1.0 1.0
# create a synthetic frame, with two samples per boolean state,
# fixing the rng's seed for reproducibility
scbool_nestorowa.simulate(fully_bin, n_threads=4, seed=1234, n_samples=2)
# Kdm3a Coro2b 8430408G22Rik Phf6
# HSPC_031 7.328819 0.000000 8.087928 0.923352
# HSPC_037 1.003712 6.843611 7.003577 0.000000
# LT-HSC_001 0.000000 0.000000 0.000000 5.174053
# HSPC_001 1.672793 0.000000 0.000000 4.481709
# HSPC_031 8.536391 1.060373 0.000000 3.267464
# HSPC_037 1.055816 5.479887 0.000000 3.836276
# LT-HSC_001 0.000000 0.000000 0.000000 8.131221
# HSPC_001 2.451340 0.000000 0.000000 9.969012
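To make the random binarization of undetermined values more concrete, here is a minimal sketch of the idea: every NaN is replaced by 0 or 1 with equal probability, while determined values are left untouched. The function `fill_nan_equiprobably` is hypothetical and written only for illustration. It is not the implementation of `random_nan_binariser`, whose internals are not described here, and it will not reproduce the values shown above.

```python
import numpy as np
import pandas as pd


def fill_nan_equiprobably(frame: pd.DataFrame, seed: int = 1234) -> pd.DataFrame:
    """Replace each NaN with 0.0 or 1.0, each with probability 1/2 (illustrative)."""
    rng = np.random.default_rng(seed)
    # Draw an independent fair coin flip for every cell of the frame.
    coin_flips = pd.DataFrame(
        rng.integers(0, 2, size=frame.shape).astype(float),
        index=frame.index,
        columns=frame.columns,
    )
    # fillna only uses the coin flip where the original value is NaN.
    return frame.fillna(coin_flips)


# Two rows and three genes from the binarized excerpt shown above.
partially_bin = pd.DataFrame(
    {
        "Kdm3a": [1.0, 0.0],
        "Coro2b": [np.nan, 1.0],
        "8430408G22Rik": [np.nan, np.nan],
    },
    index=["HSPC_031", "HSPC_037"],
)
fully_bin = fill_nan_equiprobably(partially_bin)
print(fully_bin)
```

Fixing the seed makes the filled-in states reproducible, in the same spirit as the `seed` argument passed to `simulate` above.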