Getting started | Features | Installation | References

Scanpy – Single-Cell Analysis in Python

High-performance tools for analyzing and simulating large-scale single-cell data. The draft by Wolf, Angerer & Theis (2017) explains the conceptual ideas behind the package. Any comments are appreciated!

Getting started

Get releases on PyPI via:

pip install scanpy

To work with the latest updates on GitHub: clone the repository – green button on top of the page – and cd into its root directory. With Python 3.5 or 3.6 (preferably Miniconda) installed, type:

pip install --editable .

You can now import scanpy.api as sc anywhere on your system and work with the command scanpy on the command line (see Installation for more information).

Then go through the use cases compiled in scanpy_usage, in particular the recent additions:

17-05-05

We reproduce most of the Guided Clustering tutorial of Seurat [Macosco15].

17-05-03

Analyzing 68 000 cells from [Zheng17], we find that Scanpy is about 5 to 16 times faster, and more memory efficient, than the Cell Ranger R kit for secondary analysis.

17-05-02

We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]. Note that DPT has recently been very favorably discussed by the authors of Monocle.

Features

Let us give an Overview of the top-level user functions, followed by a few words on Scanpy’s Basic Features and more details.

Overview

Scanpy user functions are grouped into the following modules:

sc.tools

Machine Learning and statistics tools. Abbreviation sc.tl.

sc.preprocessing

Preprocessing. Abbreviation sc.pp.

sc.plotting

Plotting. Abbreviation sc.pl.

sc.settings

Settings.

Preprocessing
pp.*

Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization, preprocessing recipes.

Visualizations
tl.pca

PCA [Pedregosa11].

tl.diffmap

Diffusion Maps [Coifman05] [Haghverdi15] [Wolf17].

tl.tsne

t-SNE [Maaten08] [Amir13] [Pedregosa11].

tl.draw_graph

Force-directed graph drawing [Csardi06] [Weinreb17].

Branching trajectories and pseudotime, clustering, differential expression
tl.dpt

Infer progression of cells, identify branching subgroups [Haghverdi16] [Wolf17].

tl.louvain

Cluster cells into subgroups [Blondel08] [Traag17].

tl.rank_genes_groups

Rank genes according to differential expression [Wolf17].

Simulations
tl.sim

Simulate dynamic gene expression data [Wittmann09] [Wolf17].

Basic Features

The typical workflow consists of subsequent calls of data analysis tools of the form:

sc.tl.tool(adata, **params)

where adata is an AnnData object and params is a dictionary that stores optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate in place and return None. To obtain a modified copy of the AnnData object instead, pass the copy argument:

adata_copy = sc.tl.tool(adata, copy=True, **params)
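
The in-place versus copy convention can be illustrated with a minimal sketch. ToyData and toy_tool below are hypothetical stand-ins invented for this example; they are not part of Scanpy and only mimic the calling convention described above.

```python
from copy import deepcopy


class ToyData:
    """Hypothetical stand-in for an annotated data object (not AnnData)."""

    def __init__(self, X):
        self.X = X              # data matrix as a list of rows
        self.annotation = {}    # results written by tools


def toy_tool(data, copy=False, scale=1.0):
    """Hypothetical tool: annotate the object with scaled row sums.

    Modifies `data` in place and returns None, unless copy=True, in which
    case a modified deep copy is returned and `data` is left untouched.
    """
    target = deepcopy(data) if copy else data
    target.annotation["row_sums"] = [scale * sum(row) for row in target.X]
    return target if copy else None


data = ToyData([[1, 2], [3, 4]])

result = toy_tool(data)                          # in place, returns None
data_copy = toy_tool(data, copy=True, scale=2.0)  # returns a modified copy
```

After these calls, result is None, data carries the annotation written by the first call, and data_copy is a separate object carrying the annotation from the second call.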

Reading and writing data files and AnnData objects

One usually calls:

adata = sc.read(filename)

to initialize an AnnData object, possibly adds further annotation using, e.g., np.genfromtxt or pd.read_csv:

annotation = pd.read_csv(filename_annotation)
adata.smp['cell_groups'] = annotation['cell_groups']  # categorical annotation of type str or int
adata.smp['time'] = annotation['time']                # numerical annotation of type float

and uses:

sc.write(filename, adata)

to save adata to a file. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension.
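
The distinction between a filename and a filekey can be sketched as a simple extension check. The helper is_filekey is hypothetical, and its extension set contains only the extensions named above; since the text says "and others", the real list is presumably longer.

```python
import os

# Extensions named in the text above. This set is an illustrative
# assumption and not Scanpy's actual list of supported extensions.
KNOWN_EXTENSIONS = {"h5", "xlsx", "mtx", "txt", "csv"}


def is_filekey(name):
    """Return True if `name` does not end in one of the known extensions."""
    extension = os.path.splitext(name)[1].lstrip(".").lower()
    return extension not in KNOWN_EXTENSIONS


key_result = is_filekey("paul15")         # no extension: treated as a filekey
file_result = is_filekey("data/pbmc.h5")  # known extension: a filename
```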

AnnData objects

An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class.

Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R’s ExpressionSet [Huber15]; the latter, however, is not implemented for sparse data.
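
To make the layout concrete, here is a deliberately simplified sketch of a container with the same attribute names. ToyAnnData is hypothetical and built from plain lists and dicts; it is not Scanpy's AnnData class, it supports only the adata[:, list_of_gene_names] form of slicing, and the gene and cell names are made up for illustration.

```python
class ToyAnnData:
    """Hypothetical, simplified stand-in for AnnData (not Scanpy's class)."""

    def __init__(self, X, smp_names, var_names):
        self.X = [list(row) for row in X]   # samples x variables
        self.smp_names = list(smp_names)    # one name per row (cell)
        self.var_names = list(var_names)    # one name per column (gene)
        self.smp = {}                       # per-sample annotation
        self.var = {}                       # per-variable annotation
        self.add = {}                       # unstructured annotation

    def __getitem__(self, index):
        """Support `adata[:, list_of_gene_names]` only."""
        rows, genes = index
        if rows != slice(None):
            raise NotImplementedError("this sketch only slices variables")
        cols = [self.var_names.index(g) for g in genes]
        subset = ToyAnnData(
            [[row[c] for c in cols] for row in self.X],
            self.smp_names,
            [self.var_names[c] for c in cols],
        )
        # Sample annotation is kept; variable annotation follows the columns.
        subset.smp = {k: list(v) for k, v in self.smp.items()}
        subset.var = {k: [v[c] for c in cols] for k, v in self.var.items()}
        subset.add = dict(self.add)
        return subset


adata = ToyAnnData(
    X=[[1, 0, 5], [2, 3, 0]],
    smp_names=["cell_a", "cell_b"],
    var_names=["Gata1", "Spi1", "Cebpa"],
)
adata.smp["cell_groups"] = ["erythroid", "myeloid"]
adata.var["n_cells"] = [2, 1, 1]

adata = adata[:, ["Cebpa", "Gata1"]]  # keep two genes, in the given order
```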

Plotting

For each tool, there is an associated plotting function:

sc.pl.tool(adata)

that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). Scanpy’s plotting module can be viewed as similar to Seaborn: an extension of matplotlib that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy because Scanpy’s plotting functions accept and return a matplotlib Axes object.
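
The accept-and-return-Axes pattern can be sketched with plain matplotlib. The function toy_scatter is hypothetical and only stands in for a plotting function of this style; the Agg backend is chosen here merely so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def toy_scatter(coords, ax=None):
    """Hypothetical plotting function: accepts an Axes and returns it."""
    if ax is None:
        _, ax = plt.subplots()
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    ax.scatter(xs, ys)
    return ax


fig, axes = plt.subplots(1, 2)
ax = toy_scatter([(0, 1), (1, 3), (2, 2)], ax=axes[0])

# Detailed configuration is done through ordinary matplotlib calls.
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_title("toy embedding")
```

Because the function draws into the Axes it is given, several such plots can be placed side by side in one figure and styled individually.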

Visualization

pca

Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa11].
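
What the three outputs mean can be sketched with a plain SVD in NumPy. This is not the scikit-learn implementation that Scanpy uses; pca_svd and the random test matrix are assumptions made for illustration.

```python
import numpy as np


def pca_svd(X, n_comps=2):
    """PCA by SVD of the centered matrix (samples in rows, genes in columns)."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = U[:, :n_comps] * S[:n_comps]   # coordinates of each sample
    loadings = Vt[:n_comps].T               # weight of each gene per component
    variance = S**2 / (X.shape[0] - 1)      # variance along each component
    variance_ratio = variance / variance.sum()
    return coords, loadings, variance_ratio[:n_comps]


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 10.0  # let one gene dominate the variance
coords, loadings, variance_ratio = pca_svd(X)
```

The coordinates are the centered data projected onto the loadings, and the variance ratios describe how much of the total variance each component explains.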

tsne

t-distributed stochastic neighbor embedding (t-SNE) [Maaten08] has been proposed for single-cell data by [Amir13]. By default, Scanpy uses the implementation of scikit-learn [Pedregosa11]. You can achieve a substantial speedup if you install Multicore-tSNE by [Ulyanov16], which will be automatically detected by Scanpy.

diffmap

Diffusion maps [Coifman05] have been proposed for visualizing single-cell data by [Haghverdi15]. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]. Uses the implementation of [Wolf17].
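
The basic idea of a diffusion map can be sketched in NumPy. The sketch assumes a plain Gaussian kernel with a fixed width sigma, not the adapted kernel of [Haghverdi16], and it is not the Scanpy implementation; the function name, parameters and test data are hypothetical.

```python
import numpy as np


def diffusion_map(X, sigma=1.0, n_comps=3):
    """Diffusion-map eigen-decomposition with a plain Gaussian kernel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma**2))
    d = K.sum(axis=1)
    # Symmetric normalization D^-1/2 K D^-1/2: same eigenvalues as the
    # row-stochastic transition matrix D^-1 K, but usable with eigh.
    T_sym = K / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(T_sym)
    order = np.argsort(evals)[::-1][:n_comps]
    evals, evecs = evals[order], evecs[:, order]
    # Convert to right eigenvectors of D^-1 K.
    evecs = evecs / np.sqrt(d)[:, None]
    return evals, evecs


rng = np.random.default_rng(0)
# Toy data: cells scattered along a one-dimensional trajectory.
X = np.column_stack([np.linspace(0.0, 1.0, 30), 0.01 * rng.normal(size=30)])
evals, evecs = diffusion_map(X, sigma=0.2)
```

The first eigenvalue is 1 and its eigenvector is constant, so it carries no information; the subsequent components are the ones typically used as coordinates.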

draw_graph

Force-directed graph drawing describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Weinreb17]. Here, by default, the Fruchterman & Reingold [Fruchterman91] algorithm is used; many other layouts are available. Uses the igraph implementation [Csardi06].

Discrete clustering of subgroups, continuous progression through subgroups, differential expression

dpt

Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis was introduced by [Haghverdi16]. Here, we use a further developed version, which is able to detect multiple branching events [Wolf17].

The possibilities of diffmap and dpt are similar to those of the R package destiny of [Angerer16]. The Scanpy tools, however, run faster and scale to much higher cell numbers.

Examples: See this use case.

louvain

Cluster cells using the Louvain algorithm [Blondel08] in the implementation of [Traag17]. The Louvain algorithm has been proposed for single-cell analysis by [Levine15].

Examples: See this use case.

rank_genes_groups

Rank genes by differential expression.

Examples: See this use case.
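
The text does not say which score is used for the ranking. As a rough illustration only, the following sketch assumes a Welch-t-like statistic comparing one group of cells against all others; rank_genes_by_group, the toy matrix and the gene names are hypothetical and do not reflect the Scanpy implementation.

```python
import numpy as np


def rank_genes_by_group(X, groups, group, gene_names):
    """Rank genes by a Welch-t-like score of `group` versus all other cells."""
    mask = np.asarray(groups) == group
    inside, outside = X[mask], X[~mask]
    mean_diff = inside.mean(axis=0) - outside.mean(axis=0)
    std_err = np.sqrt(
        inside.var(axis=0, ddof=1) / inside.shape[0]
        + outside.var(axis=0, ddof=1) / outside.shape[0]
    )
    scores = mean_diff / (std_err + 1e-12)  # constant avoids division by zero
    order = np.argsort(scores)[::-1]        # highest score first
    return [(gene_names[i], float(scores[i])) for i in order]


# Toy expression matrix: 6 cells (rows) x 3 genes (columns).
X = np.array([
    [5.0, 1.0, 0.0],
    [6.0, 1.0, 1.0],
    [5.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
])
groups = ["a", "a", "a", "b", "b", "b"]
ranking = rank_genes_by_group(X, groups, "a", ["gene_1", "gene_2", "gene_3"])
```

Here gene_1 is strongly elevated in group a and therefore ranks first, while the other two genes show little or no difference between the groups.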

Simulation

sim

Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]. The Scanpy implementation is due to [Wolf17].

The tool is similar to the Matlab tool Odefy of [Krumsiek10].

Examples: See this use case.
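
The general idea of turning boolean rules into a stochastic differential equation can be sketched as follows. The two-gene network (A = NOT B, B = NOT A), the Hill-function smoothing, the parameter values and the Euler-Maruyama integration are all assumptions chosen for illustration; this is neither the model of [Wittmann09] in detail nor the Scanpy implementation.

```python
import numpy as np


def hill(x, threshold=0.5, exponent=4):
    """Smooth, monotone replacement for a boolean input in [0, 1]."""
    return x**exponent / (x**exponent + threshold**exponent)


def simulate_toggle(n_steps=500, dt=0.05, noise=0.05, seed=0):
    """Euler-Maruyama simulation of a two-gene mutual-inhibition network.

    Boolean rules assumed for illustration: A = NOT B, B = NOT A.
    """
    rng = np.random.default_rng(seed)
    x = np.empty((n_steps, 2))
    x[0] = [0.6, 0.4]  # slightly asymmetric initial expression
    for t in range(1, n_steps):
        a, b = x[t - 1]
        target = np.array([1.0 - hill(b), 1.0 - hill(a)])  # continuous NOT
        drift = target - x[t - 1]  # relax towards the output of the rule
        step = drift * dt + noise * np.sqrt(dt) * rng.normal(size=2)
        x[t] = np.clip(x[t - 1] + step, 0.0, 1.0)  # keep levels in [0, 1]
    return x


trajectory = simulate_toggle()  # array of shape (n_steps, 2)
```

Each row of the result is one time point and each column one gene, so the output has the same samples-by-variables layout as an expression matrix.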

Installation

If you use Windows or Mac OS X and do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda (see below). If you use Linux, use your package manager to obtain a current Python distribution.

Get releases on PyPI via:

pip install scanpy

To work with the latest updates on GitHub: clone the repository – green button on top of the page – and cd into its root directory. To install with symbolic links (stay up to date with your cloned version after you update with git pull) call:

pip install --editable .

You can now import scanpy.api as sc anywhere on your system and work with the command scanpy on the command-line.

Installing Miniconda

After downloading Miniconda, run the following in a Unix shell (Linux, Mac):

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

References

[Amir13]

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology.

[Angerer16]

Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.

[Blondel08]

Blondel et al. (2008), Fast unfolding of communities in large networks, J. Stat. Mech.

[Coifman05]

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS.

[Csardi06]

Csardi et al. (2006), The igraph software package for complex network research, InterJournal Complex Systems.

[Ester96]

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR.

[Fruchterman91]

Fruchterman & Reingold (1991), Graph drawing by force-directed placement, Software: Practice & Experience.

[Hagberg08]

Hagberg et al. (2008), Exploring Network Structure, Dynamics, and Function using NetworkX, Scipy Conference.

[Haghverdi15]

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics.

[Haghverdi16]

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods.

[Huber15]

Huber et al. (2015), Orchestrating high-throughput genomic analysis with Bioconductor, Nature Methods.

[Krumsiek10]

Krumsiek et al. (2010), Odefy – From discrete to continuous models, BMC Bioinformatics.

[Krumsiek11]

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE.

[Levine15]

Levine et al. (2015), Data-Driven Phenotypic Dissection of AML Reveals Progenitor–like Cells that Correlate with Prognosis, Cell.

[Maaten08]

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR.

[Macosco15]

Macosko et al. (2015), Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets, Cell.

[Moignard15]

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology.

[Pedregosa11]

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR.

[Paul15]

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell.

[Traag17]

Traag (2017), Louvain, GitHub.

[Ulyanov16]

Ulyanov (2016), Multicore t-SNE, GitHub.

[Weinreb17]

Weinreb et al. (2016), SPRING: a kinetic interface for visualizing high dimensional single-cell expression data, bioRxiv.

[Wittmann09]

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology.

[Wolf17]

Wolf et al. (2017), TBD.

[Zheng17]

Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells, Nature Communications.
