Scanpy – Single-Cell Analysis in Python
Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The Python-based implementation efficiently deals with data sets of more than one million cells and enables easy integration of advanced machine learning algorithms.
Getting started
With Python 3.5 or 3.6 installed, get releases from PyPI (see Installation below for more information):
pip install scanpy
To work with the latest version on GitHub, clone the repository (green button at the top of the GitHub page), cd into its root directory and type:
pip install --editable .
You can now import scanpy.api as sc anywhere on your system and use the scanpy command on the command line.
Then go through the use cases compiled in scanpy_usage, in particular the recent additions:
- 17-05-05
We reproduce most of the Guided Clustering tutorial of Seurat [Satija15].
- 17-05-03
Analyzing 68 000 cells from [Zheng17], we find that Scanpy is about 5 to 16 times faster and more memory efficient than the Cell Ranger R kit for secondary analysis.
- 17-05-02
We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]. Note that DPT has recently been very favorably discussed by the authors of Monocle.
Features
We first give an Overview of the top-level user functions, followed by a few words on Scanpy’s Basic Features and more details.
Overview
Scanpy user functions are grouped into the following modules:
- sc.tools
Machine Learning and statistics tools. Abbreviation sc.tl.
- sc.preprocessing
Preprocessing. Abbreviation sc.pp.
- sc.plotting
Plotting. Abbreviation sc.pl.
- sc.settings
Settings.
Preprocessing
- pp.*
Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization, preprocessing recipes.
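As a rough illustration of what per-cell normalization means, the following sketch rescales every cell so that its total count equals a common value. The helper normalize_per_cell and its target_sum parameter are hypothetical names chosen for this example; they are not taken from Scanpy, whose own preprocessing functions may differ in defaults and details.

```python
import numpy as np

def normalize_per_cell(counts, target_sum=None):
    """Scale each cell (row) so that its total count equals target_sum.

    Hypothetical helper for illustration only, not Scanpy's own function.
    If target_sum is None, the median of the per-cell totals is used.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)  # total counts per cell
    if target_sum is None:
        target_sum = np.median(totals)
    # Replace zero totals by 1 so that empty cells do not cause division by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return counts / safe_totals[:, None] * target_sum

# Three cells (rows) measured for four genes (columns).
counts = np.array([[1, 3, 0, 6],
                   [2, 2, 2, 2],
                   [0, 0, 5, 15]])
normalized = normalize_per_cell(counts, target_sum=10.0)
print(normalized.sum(axis=1))  # every cell now has the same total count
```

After this step, differences between cells reflect relative expression rather than sequencing depth.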
Visualizations
- tl.pca
PCA [Pedregosa11].
- tl.diffmap
Diffusion Maps [Coifman05] [Haghverdi15] [Wolf17].
- tl.tsne
t-SNE [Maaten08] [Amir13] [Pedregosa11].
- tl.draw_graph
Force-directed graph drawing [Csardi06] [Weinreb17].
Branching trajectories and pseudotime, clustering, differential expression
- tl.dpt
Infer progression of cells, identify branching subgroups [Haghverdi16] [Wolf17].
- tl.louvain
Cluster cells into subgroups [Blondel08] [Traag17].
- tl.rank_genes_groups
Rank genes according to differential expression [Wolf17].
Simulations
- tl.sim
Simulate dynamic gene expression data [Wittmann09] [Wolf17].
Basic Features
The typical workflow consists of subsequent calls of data analysis tools of the form:
sc.tl.tool(adata, **params)
where adata is an AnnData object and params is a dictionary of optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate in place and return None. If you want the tool to work on a copy of the AnnData object and return it, pass the copy argument:
adata_copy = sc.tl.tool(adata, copy=True, **params)
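The calling convention described above can be sketched without Scanpy. In the example below, toy_tool and the plain dictionary standing in for an AnnData object are hypothetical; the sketch only mirrors the pattern "modify in place and return None by default, return a modified copy when copy=True".

```python
from copy import deepcopy

def toy_tool(adata, copy=False, scale=1.0):
    """Hypothetical tool: stores a scaled per-row sum of X as annotation."""
    # Work on a deep copy only if the caller asked for one.
    target = deepcopy(adata) if copy else adata
    target["annotation"] = [scale * sum(row) for row in target["X"]]
    # Mirror the convention from the text: None for in-place calls.
    return target if copy else None

adata = {"X": [[1, 2], [3, 4]]}  # plain dict as a stand-in for an AnnData object

# In-place call: adata itself gains the annotation, nothing is returned.
result = toy_tool(adata, scale=2.0)

# Copy call: the original keeps its annotation, the copy gets a new one.
adata_copy = toy_tool(adata, copy=True, scale=0.5)
```

Operating in place by default avoids duplicating a large expression matrix for every analysis step.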
Reading and writing data files and AnnData objects
One usually calls:
adata = sc.read(filename)
to initialize an AnnData object, possibly adds further annotation using, e.g., np.genfromtxt or pd.read_csv:
annotation = pd.read_csv(filename_annotation)
adata.smp['cell_groups'] = annotation['cell_groups']  # categorical annotation of type str or int
adata.smp['time'] = annotation['time']  # numerical annotation of type float
and uses:
sc.write(filename, adata)
to save adata to a file. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension.
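The distinction between a filename and a filekey can be made concrete with a small check. The helper is_filekey and the constant KNOWN_EXTENSIONS are hypothetical names for this sketch; the extension set contains only the extensions named above, whereas Scanpy itself accepts further ones.

```python
import os

# Extensions named in the text; the list accepted by Scanpy is longer.
KNOWN_EXTENSIONS = {"h5", "xlsx", "mtx", "txt", "csv"}

def is_filekey(name):
    """Return True if name does not end in one of the known file extensions."""
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return ext not in KNOWN_EXTENSIONS

print(is_filekey("data/example.h5"))  # ends in a known extension: a filename
print(is_filekey("example"))          # no extension: treated as a filekey
```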
AnnData objects
An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class.
Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R’s ExpressionSet [Huber15]; the latter, however, is not implemented for sparse data.
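To make the layout of such an object tangible, here is a minimal sketch of a container with the attributes described above. ToyAnnData is a hypothetical stand-in written for this example and is not the real AnnData class: it uses plain dictionaries for smp and var and supports only slicing of the form adata[:, list_of_gene_names].

```python
import numpy as np

class ToyAnnData:
    """Hypothetical, minimal stand-in for AnnData; not the real class."""

    def __init__(self, X, smp_names, var_names, smp=None, var=None, add=None):
        self.X = np.asarray(X, dtype=float)  # samples x variables
        self.smp_names = list(smp_names)     # one name per row (cell)
        self.var_names = list(var_names)     # one name per column (gene)
        self.smp = dict(smp or {})           # per-sample annotation
        self.var = dict(var or {})           # per-variable annotation
        self.add = dict(add or {})           # unstructured annotation

    def __getitem__(self, index):
        rows, gene_names = index
        if rows != slice(None):
            raise NotImplementedError("this sketch only supports adata[:, genes]")
        # Translate gene names into column positions.
        cols = [self.var_names.index(g) for g in gene_names]
        # Subset the variable annotation consistently with the columns.
        var = {key: [values[c] for c in cols] for key, values in self.var.items()}
        return ToyAnnData(self.X[:, cols], self.smp_names, list(gene_names),
                          smp=self.smp, var=var, add=self.add)

adata = ToyAnnData(
    X=[[1, 0, 2], [0, 3, 4]],
    smp_names=["cell_0", "cell_1"],
    var_names=["gene_a", "gene_b", "gene_c"],
    smp={"time": [0.0, 1.5]},
    var={"n_cells": [1, 1, 2]},
)
subset = adata[:, ["gene_c", "gene_a"]]
print(subset.var_names, subset.X.shape)
```

Keeping the annotation next to the matrix means that a slice carries its matching variable annotation along.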
Plotting
For each tool, there is an associated plotting function:
sc.pl.tool(adata)
that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). Scanpy’s plotting module can be viewed as similar to Seaborn: an extension of matplotlib that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy as Scanpy’s plotting functions accept and return a matplotlib Axes object.
Visualization
pca
Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa11].
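The three outputs named above (coordinates, loadings and variance decomposition) can be illustrated with a short NumPy sketch based on the singular value decomposition. This is a swapped-in technique for illustration and not the scikit-learn implementation that the tool uses; pca_sketch and n_comps are hypothetical names.

```python
import numpy as np

def pca_sketch(X, n_comps=2):
    """PCA via SVD of the centered data matrix (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # center each variable (gene)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = U[:, :n_comps] * S[:n_comps]  # coordinates of the samples
    loadings = Vt[:n_comps].T              # variables x components
    variance = S**2 / (X.shape[0] - 1)     # variance along each component
    variance_ratio = variance[:n_comps] / variance.sum()
    return coords, loadings, variance_ratio

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # 50 samples, 5 variables of synthetic data
coords, loadings, variance_ratio = pca_sketch(X, n_comps=2)
print(coords.shape, loadings.shape, variance_ratio)
```

The coordinates are the projection of the centered data onto the loadings, and the variance ratios state how much of the total variance each component explains.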
tsne
t-distributed stochastic neighbor embedding (t-SNE) [Maaten08] has been proposed for single-cell data by [Amir13]. By default, Scanpy uses the implementation of scikit-learn [Pedregosa11]. You can achieve a large speedup if you install Multicore-tSNE by [Ulyanov16], which Scanpy detects automatically.
diffmap
Diffusion maps [Coifman05] have been proposed for visualizing single-cell data by [Haghverdi15]. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]. Uses the implementation of [Wolf17].
draw_graph
Force-directed graph drawing describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Weinreb17]. Here, by default, the Fruchterman & Reingold [Fruchterman91] algorithm is used; many other layouts are available. Uses the igraph implementation [Csardi06].
Discrete clustering of subgroups, continuous progression through subgroups, differential expression
dpt
Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis was introduced by [Haghverdi16]. Here, we use a further developed version, which is able to detect multiple branching events [Wolf17].
The possibilities of diffmap and dpt are similar to those of the R package destiny of [Angerer16]. The Scanpy tools, however, run faster and scale to much higher cell numbers.
Examples: See this use case.
louvain
Cluster cells using the Louvain algorithm [Blondel08] in the implementation of [Traag17]. The Louvain algorithm has been proposed for single-cell analysis by [Levine15].
Examples: See this use case.
rank_genes_groups
Rank genes by differential expression.
Examples: See this use case.
Simulation
sim
Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]. The Scanpy implementation is due to [Wolf17].
The tool is similar to the Matlab tool Odefy of [Krumsiek10].
Examples: See this use case.
Installation
If you use Windows or Mac OS X and do not have Python 3.5 or 3.6, download and install Miniconda (see below). If you use Linux, use your package manager to obtain a current Python distribution.
Get releases on PyPI via:
pip install scanpy
To work with the latest version on GitHub, clone the repository (green button at the top of the GitHub page) and cd into its root directory. To install with symbolic links, so that your installation stays up to date with your clone whenever you run git pull, call:
pip install --editable .
You can now import scanpy.api as sc anywhere on your system and use the scanpy command on the command line.
Installing Miniconda
After downloading Miniconda, run the following in a Unix shell (Linux, Mac):
cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh
and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.
Troubleshooting
If you have both Python 2 and Python 3 installed:
pip3 install scanpy
If you do not have sudo rights (you get a "Permission denied" error):
pip install --user scanpy
On macOS, you probably need to install the C core of igraph via Homebrew first:
brew install igraph
If python-igraph still fails to install, see here or consider installing gcc via brew install gcc --without-multilib and setting export CC="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X"; export CXX="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X", where X and x refer to the version of gcc; in my case, the path reads /usr/local/Cellar/gcc/6.3.0_1/bin/gcc-6.
References
Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology.
Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.
Blondel et al. (2008), Fast unfolding of communities in large networks, J. Stat. Mech.
Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS.
Csardi et al. (2006), The igraph software package for complex network research, InterJournal Complex Systems.
Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR.
Fruchterman & Reingold (1991), Graph drawing by force-directed placement, Software: Practice & Experience.
Hagberg et al. (2008), Exploring Network Structure, Dynamics, and Function using NetworkX, Scipy Conference.
Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics.
Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods.
Huber et al. (2015), Orchestrating high-throughput genomic analysis with Bioconductor, Nature Methods.
Krumsiek et al. (2010), Odefy – From discrete to continuous models, BMC Bioinformatics.
Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE.
Levine et al. (2015), Data-Driven Phenotypic Dissection of AML Reveals Progenitor–like Cells that Correlate with Prognosis, Cell.
Satija et al. (2015), Spatial reconstruction of single-cell gene expression data, Nature Biotechnology.
Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology.
Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell.
Weinreb et al. (2016), SPRING: a kinetic interface for visualizing high dimensional single-cell expression data, bioRxiv.
Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology.
Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells, Nature Communications.