scanpy

Single-Cell Analysis in Python.

These details have not been verified by PyPI

Project links

Homepage

Project description

Getting started | Features | Installation | References

Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The Python-based implementation efficiently deals with data sets of more than one million cells and enables easy integration of advanced machine learning algorithms.

Getting started

With Python 3.5 or 3.6 installed, install the latest release from PyPI (see Installation for more details):

pip install scanpy

To work with the latest version on GitHub, clone the repository (green button at the top of the page), cd into its root directory and run:

pip install --editable .

You can now import scanpy.api as sc from anywhere on your system and run the scanpy command on the command line.

Then go through the use cases compiled in scanpy_usage, in particular the recent additions:

17-05-05: We reproduce most of the Guided Clustering tutorial of Seurat [Satija15].
17-05-03: Analyzing 68 000 cells from [Zheng17], we find that Scanpy is about 5 to 16 times faster than the Cell Ranger R kit for secondary analysis, and more memory efficient.
17-05-02: We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]. Note that DPT has recently been very favorably discussed by the authors of Monocle.

Features

This section gives an Overview of the top-level user functions, followed by a few words on Scanpy’s Basic Features and more details.

Overview

Scanpy user functions are grouped into the following modules:

sc.tools: Machine learning and statistics tools. Abbreviation sc.tl.
sc.preprocessing: Preprocessing. Abbreviation sc.pp.
sc.plotting: Plotting. Abbreviation sc.pl.
sc.settings: Settings.

Preprocessing

pp.*: Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization, preprocessing recipes.

Visualizations

tl.pca: PCA [Pedregosa11].
tl.diffmap: Diffusion Maps [Coifman05] [Haghverdi15] [Wolf17].
tl.tsne: t-SNE [Maaten08] [Amir13] [Pedregosa11].
tl.draw_graph: Force-directed graph drawing [Csardi06] [Weinreb17].

Branching trajectories and pseudotime, clustering, differential expression

tl.dpt: Infer progression of cells, identify branching subgroups [Haghverdi16] [Wolf17].
tl.louvain: Cluster cells into subgroups [Blondel08] [Traag17].
tl.rank_genes_groups: Rank genes according to differential expression [Wolf17].

Simulations

tl.sim: Simulate dynamic gene expression data [Wittmann09] [Wolf17].

Basic Features

The typical workflow consists of subsequent calls of data analysis tools of the form:

sc.tl.tool(adata, **params)

where adata is an AnnData object and params is a dictionary that stores optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate in place and return None. To get the result as a new AnnData object instead, pass the copy argument:

adata_copy = sc.tl.tool(adata, copy=True, **params)
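
The calling convention above can be illustrated with a minimal sketch. ToyData and toy_tool are hypothetical stand-ins invented for this example, not Scanpy classes or functions; they only mimic the in-place default and the copy argument.

```python
import copy as _copy


class ToyData:
    """Hypothetical stand-in for an annotated data matrix (not Scanpy's AnnData)."""

    def __init__(self, X):
        self.X = X      # n rows (samples) of d measurements each
        self.smp = {}   # per-sample annotation, keyed by name


def toy_tool(data, copy=False, scale=1.0):
    """Annotate each sample with its scaled row sum.

    Follows the convention described above: modify `data` in place and
    return None, unless copy=True, in which case an annotated copy is
    returned and `data` is left untouched.
    """
    target = _copy.deepcopy(data) if copy else data
    target.smp['row_sum'] = [scale * sum(row) for row in target.X]
    return target if copy else None


data = ToyData([[1.0, 2.0], [3.0, 4.0]])

result = toy_tool(data)                            # in place, returns None
annotated = toy_tool(data, copy=True, scale=2.0)   # data keeps its own annotation
```

Returning None from the in-place variant makes it hard to confuse the two modes: a caller who forgets copy=True gets None rather than a second reference to the same object.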

Reading and writing data files and AnnData objects

One usually calls:

adata = sc.read(filename)

to initialize an AnnData object, optionally adds further annotation, e.g., using np.genfromtxt or pd.read_csv:

annotation = pd.read_csv(filename_annotation)
adata.smp['cell_groups'] = annotation['cell_groups']  # categorical annotation of type str or int
adata.smp['time'] = annotation['time']                # numerical annotation of type float

and uses:

sc.write(filename, adata)

to save the adata to file. Reading supports files with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension.
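
The distinction between a filename and a filekey can be sketched as follows. The helper names is_filename and describe are hypothetical, and the extension set contains only the extensions named above; how Scanpy maps a filekey to an actual path is not described here and is therefore not shown.

```python
import os

# Extensions named in the text above; the real list is longer ("and others").
KNOWN_EXTENSIONS = {'h5', 'xlsx', 'mtx', 'txt', 'csv'}


def is_filename(name):
    """Return True if `name` ends in one of the known file extensions."""
    ext = os.path.splitext(name)[1].lstrip('.').lower()
    return ext in KNOWN_EXTENSIONS


def describe(name):
    """Classify a string as a 'filename' or a 'filekey'."""
    return 'filename' if is_filename(name) else 'filekey'
```

For example, describe('data/pbmc.h5') classifies the string as a filename, while describe('pbmc3k') classifies it as a filekey.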

AnnData objects

An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class.

Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R’s ExpressionSet [Huber15]; the latter, though, is not implemented for sparse data.
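
To make the layout concrete, here is a simplified sketch of the same idea using plain Python lists and dictionaries. ToyAnnData and its select_vars method are hypothetical and are not the AnnData implementation; the real class supports the adata[:, list_of_gene_names] slicing syntax and sparse matrices, which this sketch does not.

```python
class ToyAnnData:
    """Hypothetical, simplified stand-in for AnnData (dense lists only)."""

    def __init__(self, X, smp_names, var_names):
        self.X = [list(row) for row in X]   # samples x variables
        self.smp_names = list(smp_names)
        self.var_names = list(var_names)
        self.smp = {}                       # per-sample annotation
        self.var = {}                       # per-variable annotation
        self.add = {}                       # unstructured annotation

    def select_vars(self, names):
        """Return a new object restricted to the given variable names."""
        idx = [self.var_names.index(name) for name in names]
        sub = ToyAnnData(
            [[row[i] for i in idx] for row in self.X],
            self.smp_names,
            [self.var_names[i] for i in idx],
        )
        # Sample annotation is unchanged; variable annotation follows the selection.
        sub.smp = {key: list(values) for key, values in self.smp.items()}
        sub.var = {key: [values[i] for i in idx] for key, values in self.var.items()}
        sub.add = dict(self.add)
        return sub


adata = ToyAnnData(
    X=[[1, 0, 5], [0, 2, 3]],
    smp_names=['cell_a', 'cell_b'],
    var_names=['gene_1', 'gene_2', 'gene_3'],
)
adata.smp['time'] = [0.0, 1.5]
adata.var['is_marker'] = [True, False, True]

subset = adata.select_vars(['gene_3', 'gene_1'])
```

The point of the sketch is that selecting variables has to keep the data matrix and the variable annotation aligned, while the sample annotation carries over unchanged.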

Plotting

For each tool, there is an associated plotting function:

sc.pl.tool(adata)

that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). Scanpy’s plotting module can be viewed as similar to Seaborn: an extension of matplotlib that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy because Scanpy’s plotting functions accept and return a matplotlib Axes object.

Visualization

pca

Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa11].

tsne

t-distributed stochastic neighbor embedding (t-SNE) [Maaten08] has been proposed for single-cell data by [Amir13]. By default, Scanpy uses the implementation of scikit-learn [Pedregosa11]. You can achieve a large speedup if you install Multicore-tSNE by [Ulyanov16], which Scanpy detects automatically.

diffmap

Diffusion maps [Coifman05] have been proposed for visualizing single-cell data by [Haghverdi15]. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]. Uses the implementation of [Wolf17].

draw_graph

Force-directed graph drawing describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Weinreb17]. Here, by default, the Fruchterman & Reingold [Fruchterman91] algorithm is used; many other layouts are available. Uses the igraph implementation [Csardi06].

Discrete clustering of subgroups, continuous progression through subgroups, differential expression

dpt

Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis was introduced by [Haghverdi16]. Here, we use a further developed version, which is able to detect multiple branching events [Wolf17].

The possibilities of diffmap and dpt are similar to those of the R package destiny of [Angerer16]. The Scanpy tools, though, run faster and scale to much larger cell numbers.

Examples: See this use case.

louvain

Cluster cells using the Louvain algorithm [Blondel08] in the implementation of [Traag17]. The Louvain algorithm has been proposed for single-cell analysis by [Levine15].

Examples: See this use case.

rank_genes_groups

Rank genes by differential expression.

Examples: See this use case.

Simulation

sim

Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]. The Scanpy implementation is due to [Wolf17].

The tool is similar to the Matlab tool Odefy of [Krumsiek10].

Examples: See this use case.
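
As a rough illustration of what sampling from a stochastic differential equation built on a boolean network can look like, the sketch below integrates a two-gene mutual-repression model with the Euler-Maruyama scheme. Everything in it is an assumption made for this example: the boolean rules (A = NOT B, B = NOT A), the Hill-type repression term, the linear decay, the additive noise and all parameter values. It is not the model or code used by tl.sim, nor the exact transformation of [Wittmann09].

```python
import math
import random


def repression(x, threshold=0.5, steepness=4):
    """Hill-type repression: close to 1 for small x, close to 0 for large x."""
    return 1.0 / (1.0 + (x / threshold) ** steepness)


def simulate_toggle(x_a, x_b, noise=0.05, dt=0.01, n_steps=2000, seed=0):
    """Euler-Maruyama integration of a two-gene mutual-repression model.

    Boolean rules: A = NOT B, B = NOT A. Each rule is relaxed to a
    continuous production term, combined with linear decay and additive
    noise. Returns the list of (x_a, x_b) states, including the initial one.
    """
    rng = random.Random(seed)
    path = [(x_a, x_b)]
    for _ in range(n_steps):
        drift_a = repression(x_b) - x_a
        drift_b = repression(x_a) - x_b
        x_a += drift_a * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x_b += drift_b * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Expression levels cannot be negative.
        x_a, x_b = max(x_a, 0.0), max(x_b, 0.0)
        path.append((x_a, x_b))
    return path


# Without noise, the gene that starts higher wins and represses the other.
path = simulate_toggle(x_a=0.8, x_b=0.2, noise=0.0)
final_a, final_b = path[-1]
```

With noise switched on, repeated runs from the same starting point produce different trajectories, which is what makes such a model useful for generating synthetic expression data.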

Installation

If you use Windows or Mac OS X and do not have Python 3.5 or 3.6, download and install Miniconda (see below). If you use Linux, use your package manager to obtain a current Python distribution.

Get releases on PyPI via:

pip install scanpy

To work with the latest version on GitHub, clone the repository (green button at the top of the page) and cd into its root directory. To install with symbolic links, so that the installation stays up to date with your clone after each git pull, run:

pip install --editable .

You can now import scanpy.api as sc from anywhere on your system and run the scanpy command on the command line.

Installing Miniconda

After downloading Miniconda, in a Unix shell (Linux, Mac), run:

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

Troubleshooting

If you have both Python 2 and Python 3 installed, use pip3:

pip3 install scanpy

If you do not have sudo rights (you get a Permission denied error):

pip install --user scanpy

On macOS, you probably need to install the C core of igraph via Homebrew first:

brew install igraph

If python-igraph still fails to install, see here or consider installing gcc via brew install gcc --without-multilib and exporting export CC="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X"; export CXX="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X", where X and x refer to the version of gcc; in my case, the path reads /usr/local/Cellar/gcc/6.3.0_1/bin/gcc-6.

References

[Amir13]

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology.

[Angerer16]

Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.

[Blondel08]

Blondel et al. (2008), Fast unfolding of communities in large networks, J. Stat. Mech.

[Coifman05]

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS.

[Csardi06]

Csardi et al. (2006), The igraph software package for complex network research, InterJournal Complex Systems.

[Ester96]

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR.

[Fruchterman91]

Fruchterman & Reingold (1991), Graph drawing by force-directed placement, Software: Practice & Experience.

[Hagberg08]

Hagberg et al. (2008), Exploring Network Structure, Dynamics, and Function using NetworkX, Scipy Conference.

[Haghverdi15]

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics.

[Haghverdi16]

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods.

[Huber15]

Huber et al. (2015), Orchestrating high-throughput genomic analysis with Bioconductor, Nature Methods.

[Krumsiek10]

Krumsiek et al. (2010), Odefy – From discrete to continuous models, BMC Bioinformatics.

[Krumsiek11]

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE.

[Levine15]

Levine et al. (2015), Data-Driven Phenotypic Dissection of AML Reveals Progenitor–like Cells that Correlate with Prognosis, Cell.

[Maaten08]

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR.

[Satija15]

Satija et al. (2015), Spatial reconstruction of single-cell gene expression data, Nature Biotechnology.

[Moignard15]

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology.

[Pedregosa11]

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR.

[Paul15]

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell.

[Traag17]

Traag (2017), Louvain, GitHub.

[Ulyanov16]

Ulyanov (2016), Multicore t-SNE, GitHub.

[Weinreb17]

Weinreb et al. (2016), SPRING: a kinetic interface for visualizing high dimensional single-cell expression data, bioRxiv.

[Wittmann09]

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology.

[Wolf17]

Wolf et al. (2017), TBD.

[Zheng17]

Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells, Nature Communications.

Download files

Source Distribution

scanpy-0.2.5.tar.gz (204.3 kB)

Uploaded Jul 31, 2017 Source

File details

Details for the file scanpy-0.2.5.tar.gz.

File metadata

Download URL: scanpy-0.2.5.tar.gz
Upload date: Jul 31, 2017
Size: 204.3 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for scanpy-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`623608a73f2a7c05515b0964b33abe5ff9a3dabf757f4d0d50c80837d50ad964`
MD5	`37f98b249478809c6a5b7943894a9f34`
BLAKE2b-256	`414da82fcdc0fc47eca450145c9fa026a4551434bdd74d7de352536cad41cfc3`