scRNA-Seq data binarisation and synthetic generation from Boolean dynamics.

Project description

scBoolSeq

Installation

Conda

conda install -c conda-forge -c colomoto scboolseq

Pip

You need R installed.

pip install scboolseq

TODO: how to install R dependencies

Usage

Command line

scBoolSeq provides a rich CLI allowing programmatic access to its main functionalities: the binarization of RNA-Seq data and the generation of synthetic RNA-Seq data reflecting activation states from Boolean network simulations. Once the tool is correctly installed, the help messages of the main command and of each subcommand explain all available parameters. Some minimal examples are presented here.

Main CLI

$ scBoolSeq -h
usage: scBoolSeq <command> [<args>]

Available commands:
	* binarize	 Binarize a RNA-Seq dataset.
	* synthesize	 Simulate a RNA-Seq experiment from Boolean dynamics.
	* from_file	 Repeat a binarization or synthethic generation experiment, based on a config file.

NOTE on TSV/CSV file specs:
* If '.csv', the file is assumed to use the standard separator for columns ','.
* The index (gene or sample identifiers) is assumed to be the first column.
* The scBoolSeq is designed with consistency in mind. 
  The `output` (binarized or synthetic expression frame) will have the same disposition 
  (genes x observations | observations x genes) as the `input`. 
  If a `reference` is specified, its disposition must match the `input`'s.

scBoolSeq: bulk and single-cell RNA-Seq data binarization and synthetic generation from Boolean dynamics.

positional arguments:
  command     Subcommand to run

optional arguments:
  -h, --help  show this help message and exit
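
The file conventions in the note above (comma separator for '.csv', index in the first column) can be mirrored when loading the same files with pandas. The following is a minimal sketch, not part of scBoolSeq: `guess_separator` and `read_expression_frame` are hypothetical helper names, and treating every non-'.csv' file as tab-separated is an assumption.

```python
from pathlib import Path

import pandas as pd


def guess_separator(filename: str) -> str:
    """Return ',' for '.csv' files; otherwise assume a tab-separated file."""
    # Path.suffixes also sees compressed names such as 'data.tsv.gz'.
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    return "," if ".csv" in suffixes else "\t"


def read_expression_frame(filename: str) -> pd.DataFrame:
    """Hypothetical helper: read a frame whose first column is the index."""
    # pandas infers gzip compression from the '.gz' extension.
    return pd.read_csv(filename, sep=guess_separator(filename), index_col=0)


if __name__ == "__main__":
    for name in ("data_Nestorowa.tsv.gz", "Nestorowa_binarized.csv"):
        print(name, repr(guess_separator(name)))
```

Whether genes are rows or columns is not inferred here; as the note says, the output keeps the disposition of the input.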

Binarization

Minimal example of binarization, specifying some optional parameters.

curl -fOL https://github.com/pinellolab/STREAM/raw/master/stream/tests/datasets/Nestorowa_2016/data_Nestorowa.tsv.gz

ls
# data_Nestorowa.tsv.gz
time scBoolSeq binarize data_Nestorowa.tsv.gz --genes-are-rows \
  --output Nestorowa_binarized.csv --n-threads 10 --dump-config --dump-criteria
# ________________________________________________________
# Executed in   34.49 secs   fish           external 
#   usr time   30.04 secs  1211.00 micros   30.04 secs 
#   sys time    3.90 secs  171.00 micros    3.89 secs 

ls
# data_Nestorowa.tsv.gz    scBoolSeq_criteria_data_Nestorowa_2022-04-27_15h14m27.tsv
# Nestorowa_binarized.csv  scBoolSeq_experiment_config_2022-04-27_15h14m27.toml

# Visualize the binarized expression frame. 
# Note that some entries are undefined (NaN)
# These might be discarded genes for which no binarization or synthesis can occur,
# or observations which did not pass the thresholds to be set to 0 or 1.
python -c 'import pandas as pd; print(pd.read_csv("Nestorowa_binarized.csv", index_col=0).iloc[0:7, 0:7])'
#             Clec1b  Kdm3a  Coro2b  8430408G22Rik  Clec9a  Phf6  Usp14
# HSPC_025       NaN    1.0     NaN            NaN     NaN   0.0    0.0
# HSPC_031       NaN    1.0     NaN            NaN     NaN   0.0    0.0
# HSPC_037       NaN    0.0     1.0            NaN     NaN   0.0    1.0
# LT-HSC_001     NaN    0.0     1.0            NaN     NaN   1.0    0.0
# HSPC_001       NaN    0.0     1.0            NaN     NaN   1.0    0.0
# HSPC_008       1.0    1.0     NaN            NaN     NaN   1.0    0.0
# HSPC_014       NaN    0.0     NaN            NaN     NaN   0.0    1.0
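
To see how many entries stay undetermined after binarization, the share of NaN values per gene can be computed with pandas. This is a minimal sketch that rebuilds the excerpt printed above by hand, so that it runs without the dataset; with the real output one would instead load `Nestorowa_binarized.csv`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Excerpt of the binarized frame shown above (observations x genes).
excerpt = pd.DataFrame(
    {
        "Clec1b": [nan, nan, nan, nan, nan, 1.0, nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
        "Coro2b": [nan, nan, 1.0, 1.0, 1.0, nan, nan],
        "8430408G22Rik": [nan] * 7,
        "Clec9a": [nan] * 7,
        "Phf6": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
        "Usp14": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001",
           "HSPC_001", "HSPC_008", "HSPC_014"],
)

# Fraction of undetermined (NaN) entries per gene.
undetermined_fraction = excerpt.isna().mean(axis=0)

# Genes with no determined value at all in this excerpt.
fully_undetermined = undetermined_fraction[undetermined_fraction == 1.0].index.tolist()

print(undetermined_fraction)
print(fully_undetermined)
```

On a seven-cell excerpt this only illustrates the computation; a gene that is entirely NaN here may still have determined values elsewhere in the full frame.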

Synthetic generation from Boolean states

cat minimal_boolean_example.csv 
# the output is not commented out so that it can be copied
# and perhaps be read with `x = pandas.read_clipboard(sep=',', index_col=0)`
,HSPC_025,HSPC_031,HSPC_037,LT-HSC_001,HSPC_001,HSPC_008,HSPC_014,HSPC_020,HSPC_026,HSPC_038,LT-HSC_002,HSPC_002,HSPC_009,HSPC_015,HSPC_021
Kdm3a,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
Coro2b,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
8430408G22Rik,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
Clec9a,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
Phf6,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0


# Generate 20 samples per Boolean state, using 12 threads.
# Setting the random number generator's seed ensures reproducibility.
time scBoolSeq synthesize --genes-are-rows minimal_boolean_example_T.csv --reference data_Nestorowa.tsv.gz \
  --n-samples 20 --output new_synthetic.tsv --n-threads 12 --rng-seed 1234
# ________________________________________________________
# Executed in   43.85 secs   fish           external 
#    usr time   22.08 secs    0.00 millis   22.08 secs 
#    sys time    3.65 secs    3.31 millis    3.65 secs 

# visualize the newly generated synthetic scRNA-Seq experiment
python -c 'import pandas as pd; print(pd.read_csv("new_synthetic.tsv", index_col=0, sep="\t").iloc[0:3, 0:7])'
#                HSPC_025  HSPC_031  HSPC_037  LT-HSC_001  HSPC_001  HSPC_008  HSPC_014
# Kdm3a          7.328819  8.536391  0.000000    0.000000  0.821561  7.030519  1.891949
# Coro2b         0.000000  0.000000  6.457878    5.479887  0.000000  0.000000  5.503554
# 8430408G22Rik  0.000000  0.005110  0.000000    0.000000  0.000000  6.428994  0.000000
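
A Boolean state table like the one shown with `cat` above can also be prepared and checked from Python before it is passed to the CLI. The sketch below uses only pandas and the first four columns of that table; it verifies that every entry is 0 or 1 and shows how to switch between the two dispositions (genes x observations, observations x genes). The variable names are illustrative.

```python
import io

import pandas as pd

# First four columns of the Boolean state table shown above (genes are rows).
boolean_csv = """,HSPC_025,HSPC_031,HSPC_037,LT-HSC_001
Kdm3a,1.0,1.0,0.0,0.0
Coro2b,1.0,1.0,1.0,1.0
8430408G22Rik,1.0,0.0,0.0,1.0
Clec9a,1.0,0.0,0.0,1.0
Phf6,0.0,0.0,0.0,1.0
"""

states = pd.read_csv(io.StringIO(boolean_csv), index_col=0)

# Every entry must be a fully determined Boolean value (0 or 1).
all_boolean = bool(states.isin([0.0, 1.0]).all().all())

# Transposing switches to the observations x genes disposition.
states_T = states.T

# Serialise back to CSV text; pass a filename to to_csv() to write a file.
csv_text = states.to_csv()

print(all_boolean, states.shape, states_T.shape)
```

Whichever disposition is written to disk, it has to match the `--genes-are-rows` flag and the disposition of the `--reference` file, as stated in the CLI help.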

Python API

A minimal example is presented here, using the same dataset as the CLI usage guide. For further information, please check the documentation.

import pandas as pd
from scboolseq import scBoolSeq
from scboolseq.simulation import random_nan_binariser

# read in the normalized expression data
nestorowa = pd.read_csv("data_Nestorowa.tsv.gz", index_col=0, sep="\t")
nestorowa.iloc[1:5, 1:5] 
#                HSPC_031  HSPC_037  LT-HSC_001  HSPC_001
# Kdm3a          6.877725  0.000000    0.000000  0.000000
# Coro2b         0.000000  6.913384    8.178374  9.475577
# 8430408G22Rik  0.000000  0.000000    0.000000  0.000000
# Clec9a         0.000000  0.000000    0.000000  0.000000
#
# NOTE : here, genes are rows and observations are columns

# scBoolSeq expects genes to be columns, thus we transpose the DataFrame.
scbool_nestorowa = scBoolSeq(data=nestorowa.T, r_seed=1234)
scbool_nestorowa
# scBoolSeq(has_data=True, can_binarize=False, can_simulate=False)
scbool_nestorowa.fit() # compute binarization criteria
# scBoolSeq(has_data=True, can_binarize=True, can_simulate=False)

scbool_nestorowa.simulation_fit() # compute simulation criteria
# scBoolSeq(has_data=True, can_binarize=True, can_simulate=True)

binarized = scbool_nestorowa.binarize(nestorowa.T)
binarized.iloc[1:5, 1:5] 
#             Kdm3a  Coro2b  8430408G22Rik  Phf6
# HSPC_031      1.0     NaN            NaN   0.0
# HSPC_037      0.0     1.0            NaN   0.0
# LT-HSC_001    0.0     1.0            NaN   1.0
# HSPC_001      0.0     1.0            NaN   1.0

# randomly (equiprobably) binarize undetermined values
# note that scboolseq.simulation.random_nan_binariser has no seeding mechanism
# so it is not reproducible
fully_bin = binarized.iloc[1:5, 1:5].pipe(random_nan_binariser) 
fully_bin 
#             Kdm3a  Coro2b  8430408G22Rik  Phf6
# HSPC_031      1.0     0.0            1.0   0.0
# HSPC_037      0.0     1.0            1.0   0.0
# LT-HSC_001    0.0     1.0            0.0   1.0
# HSPC_001      0.0     1.0            1.0   1.0

# create a synthetic frame with two samples per Boolean state,
# fixing the rng's seed for reproducibility
# and specifying the number of threads to use
scbool_nestorowa.simulate(fully_bin, n_threads=4, seed=1234, n_samples=2) 
#               Kdm3a    Coro2b  8430408G22Rik      Phf6
# HSPC_031    7.328819  0.000000       8.087928  0.923352
# HSPC_037    1.003712  6.843611       7.003577  0.000000
# LT-HSC_001  0.000000  0.000000       0.000000  5.174053
# HSPC_001    1.672793  0.000000       0.000000  4.481709
# HSPC_031    8.536391  1.060373       0.000000  3.267464
# HSPC_037    1.055816  5.479887       0.000000  3.836276
# LT-HSC_001  0.000000  0.000000       0.000000  8.131221
# HSPC_001    2.451340  0.000000       0.000000  9.969012
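
As noted in the example above, `random_nan_binariser` has no seeding mechanism. When a reproducible fill of undetermined values is needed, one option is to do the equiprobable replacement by hand with a seeded NumPy generator. This is a minimal sketch under that assumption: `seeded_nan_binariser` is a hypothetical helper, not part of scBoolSeq, and it is not claimed to match the library function's internals.

```python
import numpy as np
import pandas as pd


def seeded_nan_binariser(frame: pd.DataFrame, seed: int) -> pd.DataFrame:
    """Hypothetical helper: replace NaN entries by 0.0 or 1.0 with equal probability."""
    rng = np.random.default_rng(seed)
    # Draw one candidate value per cell; only NaN cells will use theirs.
    random_states = rng.integers(0, 2, size=frame.shape).astype(float)
    # Keep determined values, take the random draw where the entry is NaN.
    return frame.where(frame.notna(), random_states)


nan = np.nan
# Binarized excerpt from the example above (observations x genes).
binarized_excerpt = pd.DataFrame(
    {
        "Kdm3a": [1.0, 0.0, 0.0, 0.0],
        "Coro2b": [nan, 1.0, 1.0, 1.0],
        "8430408G22Rik": [nan, nan, nan, nan],
        "Phf6": [0.0, 0.0, 1.0, 1.0],
    },
    index=["HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

filled = seeded_nan_binariser(binarized_excerpt, seed=1234)
print(filled)
```

Calling the helper twice with the same seed returns the same frame, so the result could be passed to `simulate` in place of `fully_bin`.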
