Single-Cell Analysis in Python.
Getting started | Features | Installation | References
Scanpy – Single-Cell Analysis in Python
Efficient tools for analyzing and simulating large-scale single-cell data. The draft by Wolf, Angerer & Theis (2017) explains the conceptual ideas behind the package. Any comments are appreciated!
Getting started
Get releases on PyPI via:
pip install scanpy
To work with the latest updates on GitHub: clone the repository – green button on top of the page – and cd into its root directory. With Python 3.5 or 3.6 (preferably Miniconda) installed, type:
pip install --editable .
You can now import scanpy.api as sc anywhere on your system and work with the command scanpy on the command line (see Installation for more information).
Then go through the use cases compiled in scanpy_usage, in particular, the recent additions
- 17-05-05
We reproduce most of the Guided Clustering tutorial of Seurat [Macosko15].
- 17-05-03
Analyzing 68 000 cells from [Zheng17], we find that Scanpy is about 5 to 16 times faster and more memory efficient than the Cell Ranger R kit for secondary analysis.
- 17-05-02
We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]. Note that DPT has recently been very favorably discussed by the authors of Monocle.
Features
We first give an Overview of the top-level user functions, followed by a few words on Scanpy’s Basic Features and more details.
Overview
Scanpy user functions are grouped into the following modules:
- sc.tools
Machine learning and statistics tools. Abbreviation sc.tl.
- sc.preprocessing
Preprocessing. Abbreviation sc.pp.
- sc.plotting
Plotting. Abbreviation sc.pl.
- sc.settings
Settings.
Preprocessing
- pp.*
Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization, preprocessing recipes.
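As an illustration of what per-cell normalization means, here is a minimal NumPy sketch. The function name normalize_per_cell and its target_sum parameter are assumptions made for this example; the text above does not specify the signature of the corresponding pp function, and the actual Scanpy implementation may differ.

```python
import numpy as np


def normalize_per_cell(X, target_sum=None):
    """Scale each row (cell) so that its total count equals `target_sum`.

    If `target_sum` is None, the median total count across cells is used.
    This is an illustrative sketch, not Scanpy's implementation.
    """
    X = np.asarray(X, dtype=float)
    counts = X.sum(axis=1)
    if target_sum is None:
        target_sum = np.median(counts)
    # Cells with zero total count keep a scale of 0 to avoid division by zero.
    safe_counts = np.where(counts > 0, counts, 1.0)
    scale = np.where(counts > 0, target_sum / safe_counts, 0.0)
    return X * scale[:, None]


# Toy count matrix: 3 cells x 3 genes with total counts 4, 40 and 8.
counts_matrix = np.array([[1, 3, 0],
                          [10, 30, 0],
                          [2, 2, 4]])
normalized = normalize_per_cell(counts_matrix)
```

After normalization every cell has the same total count (here the median, 8), so the first two cells, which differ only in sequencing depth, end up with identical profiles.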
Visualizations
- tl.pca
PCA [Pedregosa11].
- tl.diffmap
Diffusion Maps [Coifman05] [Haghverdi15] [Wolf17].
- tl.tsne
t-SNE [Maaten08] [Amir13] [Pedregosa11].
- tl.draw_graph
Force-directed graph drawing [Csardi06] [Weinreb17].
Branching trajectories and pseudotime, clustering, differential expression
- tl.dpt
Infer progression of cells, identify branching subgroups [Haghverdi16] [Wolf17].
- tl.louvain
Cluster cells into subgroups [Blondel08] [Traag17].
- tl.rank_genes_groups
Rank genes according to differential expression [Wolf17].
Simulations
- tl.sim
Simulate dynamic gene expression data [Wittmann09] [Wolf17].
Basic Features
The typical workflow consists of subsequent calls of data analysis tools of the form:
sc.tl.tool(adata, **params)
where adata is an AnnData object and params is a dictionary that stores optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate in place and return None. If you want to work on a copy of the AnnData object instead, pass the copy argument:
adata_copy = sc.tl.tool(adata, copy=True, **params)
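The calling convention can be illustrated with a small, self-contained sketch. The class ToyData and the function toy_tool are hypothetical stand-ins, not part of Scanpy; they only mimic the behavior described above (annotate in place and return None, or return an annotated copy when copy=True).

```python
from copy import deepcopy


class ToyData:
    """Hypothetical stand-in for an annotated data object (not Scanpy's AnnData)."""

    def __init__(self, X):
        self.X = X      # data matrix as a list of rows
        self.add = {}   # unstructured annotation


def toy_tool(data, copy=False, **params):
    """Annotate `data` with per-row sums, scaled by an optional `scale` parameter."""
    target = deepcopy(data) if copy else data
    scale = params.get("scale", 1.0)
    target.add["row_sums"] = [scale * sum(row) for row in target.X]
    # Mirror the convention: return the copy if requested, otherwise None.
    return target if copy else None


# In-place call: the input is annotated and nothing is returned.
data = ToyData([[1, 2], [3, 4]])
result = toy_tool(data)

# Copying call: the input stays untouched and the annotated copy is returned.
fresh = ToyData([[1, 2], [3, 4]])
annotated = toy_tool(fresh, copy=True, scale=2.0)
```

Returning None from the in-place variant makes it hard to accidentally mix the two styles, since assigning the result of an in-place call fails quickly.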
Reading and writing data files and AnnData objects
One usually calls:
adata = sc.read(filename)
to initialize an AnnData object, possibly adds further annotation using, e.g., np.genfromtxt or pd.read_csv:
annotation = pd.read_csv(filename_annotation)
adata.smp['cell_groups'] = annotation['cell_groups']  # categorical annotation of type str or int
adata.smp['time'] = annotation['time']  # numerical annotation of type float
and uses:
sc.write(filename, adata)
to save adata to a file. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension.
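The distinction between a filename and a filekey can be sketched as follows. The helper names is_filename and describe are hypothetical, and the extension set contains only the extensions named above; how Scanpy itself resolves a filekey is not described here.

```python
import os

# Extensions named in the text above; the real list is longer ("and others").
KNOWN_EXTENSIONS = {"h5", "xlsx", "mtx", "txt", "csv"}


def is_filename(name):
    """Return True if `name` ends in one of the known file extensions."""
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return ext in KNOWN_EXTENSIONS


def describe(name):
    """Classify a string as a filename or a filekey."""
    return "filename" if is_filename(name) else "filekey"


kinds = [describe(name) for name in ("data/pbmc.h5", "pbmc", "paul15.v2")]
```

Note that a dot alone does not make a string a filename: "paul15.v2" counts as a filekey because "v2" is not a recognized extension.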
AnnData objects
An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class.
Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R’s ExpressionSet [Huber15]; the latter, however, is not implemented for sparse data.
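To make the layout concrete, the following sketch mimics it with plain NumPy and Pandas objects rather than with AnnData itself. All sample names, gene names and values are invented for illustration; only the slicing idiom by gene names is taken from the text above.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 3 samples (cells) x 4 variables (genes).
X = np.array([[1.0, 0.0, 2.0, 5.0],
              [0.0, 3.0, 1.0, 0.0],
              [4.0, 1.0, 0.0, 2.0]])
smp_names = ["cell_a", "cell_b", "cell_c"]
var_names = ["gene_1", "gene_2", "gene_3", "gene_4"]

# Data matrix with sample and variable names attached.
data = pd.DataFrame(X, index=smp_names, columns=var_names)

# Per-sample annotation, aligned with the rows of the data matrix.
smp = pd.DataFrame({"cell_groups": ["A", "B", "A"],
                    "time": [0.0, 1.5, 3.0]}, index=smp_names)

# Slice by variable names, analogous to adata[:, list_of_gene_names].
list_of_gene_names = ["gene_2", "gene_4"]
subset = data.loc[:, list_of_gene_names]

# Select samples by an annotation value.
group_a = data.loc[smp["cell_groups"] == "A"]
```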
Plotting
For each tool, there is an associated plotting function:
sc.pl.tool(adata)
that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). Scanpy’s plotting module can be viewed as similar to Seaborn: an extension of matplotlib that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration is done via matplotlib functions, which is easy because Scanpy’s plotting functions accept and return a matplotlib Axes object.
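The accept-and-return pattern for Axes objects can be sketched with matplotlib alone. The function toy_scatter is a hypothetical example, not a Scanpy function, and the ax keyword name is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def toy_scatter(coords, ax=None):
    """Hypothetical plotting function: draw on `ax` (or a new Axes) and return it."""
    if ax is None:
        _, ax = plt.subplots()
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    ax.scatter(xs, ys)
    return ax


# Place the plot into the first panel of a two-panel figure.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
ax = toy_scatter([(0, 0), (1, 2), (2, 1)], ax=axes[0])

# Detailed configuration uses ordinary matplotlib calls on the returned Axes.
ax.set_title("toy embedding")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
```

Accepting an existing Axes lets several plots share one figure, while returning it keeps every matplotlib customization available after the call.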
Visualization
pca
Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa11].
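For orientation, the three quantities named above can be obtained directly from scikit-learn as sketched below. This uses random toy data and calls scikit-learn itself; how Scanpy wraps these calls and where it stores the results is not shown. Loadings are taken here as the transposed principal axes, which is one of several conventions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))  # toy data: 100 cells x 20 genes

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)                    # PCA coordinates, shape (100, 5)
loadings = pca.components_.T                    # gene loadings, shape (20, 5)
variance_ratio = pca.explained_variance_ratio_  # fraction of variance per component
```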
tsne
t-distributed stochastic neighbor embedding (t-SNE) [Maaten08] has been proposed for single-cell data by [Amir13]. By default, Scanpy uses the implementation of scikit-learn [Pedregosa11]. You can achieve a large speedup if you install Multicore-tSNE by [Ulyanov16], which is detected automatically by Scanpy.
diffmap
Diffusion maps [Coifman05] have been proposed for visualizing single-cell data by [Haghverdi15]. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]. Uses the implementation of [Wolf17].
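The basic idea of a diffusion map can be sketched in a few lines of NumPy. Note the simplifications: this sketch uses a plain Gaussian kernel with a single global width sigma on a dense distance matrix, not the adapted kernel of [Haghverdi16], and it is not the Scanpy implementation. The function name and parameters are chosen for this example only.

```python
import numpy as np


def diffusion_map(X, sigma=1.0, n_comps=2):
    """Return leading eigenvalues and non-trivial diffusion components of X."""
    # Pairwise squared Euclidean distances between all samples.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))  # plain Gaussian kernel
    d = K.sum(axis=1)                            # row sums (degrees)
    # Symmetric normalization D^{-1/2} K D^{-1/2}; it shares its eigenvalues
    # with the row-normalized transition matrix T = D^{-1} K.
    M = K / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # Right eigenvectors of T are D^{-1/2} times the eigenvectors of M.
    psi = evecs / np.sqrt(d)[:, None]
    # The first eigenvector (eigenvalue 1) is constant and is skipped.
    return evals[:n_comps + 1], psi[:, 1:n_comps + 1]


rng = np.random.RandomState(0)
# Toy data: two noisy groups of 20 points each in 2 dimensions.
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(3.0, 0.5, size=(20, 2))])
evals, coords = diffusion_map(X, sigma=1.0, n_comps=2)
```

The returned coordinates are the unscaled eigenvectors; some formulations additionally weight each component by its eigenvalue.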
draw_graph
Force-directed graph drawing describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Weinreb17]. Here, by default, the Fruchterman & Reingold [Fruchterman91] algorithm is used; many other layouts are available. Uses the igraph implementation [Csardi06].
Discrete clustering of subgroups, continuous progression through subgroups, differential expression
dpt
Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by [Haghverdi16]. Here, we use a further developed version, which is able to detect multiple branching events [Wolf17].
The possibilities of diffmap and dpt are similar to those of the R package destiny of [Angerer16]. The Scanpy tools, however, run faster and scale to much higher cell numbers.
Examples: See this use case.
louvain
Cluster cells using the Louvain algorithm [Blondel08] in the implementation of [Traag17]. The Louvain algorithm has been proposed for single-cell analysis by [Levine15].
Examples: See this use case.
rank_genes_groups
Rank genes by differential expression.
Examples: See this use case.
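The text does not state which statistic is used for the ranking. As one possible illustration, the sketch below ranks genes by a Welch-type t-like score of one group against all remaining cells. The function rank_genes, its signature and the toy data are assumptions for this example and do not describe the Scanpy implementation.

```python
import numpy as np


def rank_genes(X, groups, group, gene_names):
    """Rank genes by a Welch-type t-like score of `group` versus all other cells."""
    groups = np.asarray(groups)
    mask = groups == group
    in_group, rest = X[mask], X[~mask]
    mean_diff = in_group.mean(axis=0) - rest.mean(axis=0)
    # Standard error of the difference in means (sample variances, ddof=1).
    se = np.sqrt(in_group.var(axis=0, ddof=1) / in_group.shape[0]
                 + rest.var(axis=0, ddof=1) / rest.shape[0])
    scores = mean_diff / (se + 1e-12)  # small constant avoids division by zero
    order = np.argsort(scores)[::-1]   # highest score first
    return [(gene_names[i], float(scores[i])) for i in order]


# Toy data: 6 cells x 3 genes; gene_1 is high in group A, gene_3 in group B.
X = np.array([[5.0, 1.0, 0.0],
              [6.0, 0.0, 1.0],
              [4.0, 1.0, 0.0],
              [0.0, 1.0, 5.0],
              [1.0, 0.0, 6.0],
              [0.0, 1.0, 4.0]])
groups = ["A", "A", "A", "B", "B", "B"]
ranking = rank_genes(X, groups, "A", ["gene_1", "gene_2", "gene_3"])
```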
Simulation
sim
Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]. The Scanpy implementation is due to [Wolf17].
The tool is similar to the Matlab tool Odefy of [Krumsiek10].
Examples: See this use case.
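To give an impression of the approach, the sketch below turns a two-gene Boolean network (each gene is switched on when the other is off) into a stochastic differential equation and integrates it with the Euler-Maruyama scheme. The Hill-function form, all parameter values and the function names are assumptions for this toy example; the text does not state the exact model equations used by the tool.

```python
import numpy as np


def hill_inhibit(x, threshold=0.5, n=4):
    """Continuous version of Boolean NOT: near 1 for low x, near 0 for high x."""
    return threshold ** n / (threshold ** n + x ** n)


def simulate_toggle(n_steps=1000, dt=0.01, sigma=0.05, seed=0):
    """Euler-Maruyama simulation of two mutually inhibiting genes.

    Boolean rules: x := NOT y, y := NOT x.
    Continuous model: dx = (hill_inhibit(y) - x) dt + sigma dW, likewise for y.
    """
    rng = np.random.RandomState(seed)
    traj = np.empty((n_steps + 1, 2))
    traj[0] = [0.6, 0.4]  # slightly asymmetric initial expression levels
    for t in range(n_steps):
        x, y = traj[t]
        drift = np.array([hill_inhibit(y) - x, hill_inhibit(x) - y])
        noise = sigma * np.sqrt(dt) * rng.normal(size=2)
        # Clip at zero so that expression levels stay non-negative.
        traj[t + 1] = np.clip(traj[t] + drift * dt + noise, 0.0, None)
    return traj


traj = simulate_toggle()
```

Each row of traj is one time point, so the result has the same samples-by-variables layout as an expression matrix.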
Installation
If you use Windows or Mac OS X and do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda (see below). If you use Linux, use your package manager to obtain a current Python distribution.
Get releases on PyPI via:
pip install scanpy
To work with the latest updates on GitHub: clone the repository – green button on top of the page – and cd into its root directory. To install with symbolic links (stay up to date with your cloned version after you update with git pull) call:
pip install --editable .
You can now import scanpy.api as sc anywhere on your system and work with the command scanpy on the command-line.
Installing Miniconda
After downloading Miniconda, in a unix shell (Linux, Mac), run
cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh
and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.
References
Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology.
Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.
Blondel et al. (2008), Fast unfolding of communities in large networks, J. Stat. Mech.
Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS.
Csardi et al. (2006), The igraph software package for complex network research, InterJournal Complex Systems.
Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR.
Fruchterman & Reingold (1991), Graph drawing by force-directed placement, Software: Practice & Experience.
Hagberg et al. (2008), Exploring Network Structure, Dynamics, and Function using NetworkX, Scipy Conference.
Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics.
Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods.
Huber et al. (2015), Orchestrating high-throughput genomic analysis with Bioconductor, Nature Methods.
Krumsiek et al. (2010), Odefy – From discrete to continuous models, BMC Bioinformatics.
Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE.
Levine et al. (2015), Data-Driven Phenotypic Dissection of AML Reveals Progenitor–like Cells that Correlate with Prognosis, Cell.
Macosko et al. (2015), Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets, Cell.
Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology.
Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell.
Weinreb et al. (2016), SPRING: a kinetic interface for visualizing high dimensional single-cell expression data, bioRXiv.
Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology.
Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells, Nature Communications.