Sparse binary format for Hi-C genomic contact heatmaps
Project description
# Cooler
[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)
## A cool place to store your Hi-C
Cooler is a **sparse, compressed, binary** persistent storage format for Hi-C contact maps based on HDF5.
See [example Jupyter notebook](https://gist.github.com/nvictus/904160bca9d0e8d5aeeb).
The `cooler` library implements a simple **schema** to store a high resolution contact matrix along with important auxiliary data such as scaffold information, genomic bin annotations, and basic metadata.
Data tables are stored in a **columnar** representation as groups of 1D HDF5 array datasets of the same length. The contact matrix itself is stored as a table containing only the **nonzero upper triangle** pixels.
The library API provides a thin Python wrapper over [h5py](http://docs.h5py.org/en/latest/) for **range queries** on the data:
- Table selections are retrieved as Pandas `DataFrame`s
- Matrix slice selections are retrieved as SciPy sparse matrices or NumPy `ndarray`s
- The metadata is retrieved as a json-serializable dictionary.
Range queries can be supplied as either integer bin indexes or genomic coordinate intervals.
### Installation
Requirements:
- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. If you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage these dependencies instead of pip.
Install from PyPI using pip.
```sh
$ pip install cooler
```
For development, clone and install in "editable" (i.e. development) mode with the `-e` option.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```
### Schema
Required attributes (metadata):
```
genome-assembly : <string> Name of genome assembly
bin-type : {"fixed" or "variable"}
bin-size : <int or null> Size of bins in bp if bin-type is "fixed"
nchroms : <int> Number of rows in scaffolds table
nbins : <int> Number of rows in bins table
nnz : <int> Number of rows in matrix table
format-url : <url> URL to page providing format details
format-version : <string> The version of the current format
generated-by : <string> Agent that created the file
creation-date : <datetime> Date the file was built
metadata : <json> custom metadata about the experiment
```
The required tables and indexes can be represented in the [Datashape](http://datashape.readthedocs.org/en/latest/) layout language:
```
{
scaffolds: {
name: typevar['Nchroms'] * string[32, 'A']
length: typevar['Nchroms'] * int64,
},
bins: {
chrom_id: typevar['Nbins'] * int32,
start: typevar['Nbins'] * int64,
end: typevar['Nbins'] * int64
},
matrix: {
bin1_id: typevar['Nnz'] * int32,
bin2_id: typevar['Nnz'] * int32,
count: typevar['Nnz'] * int32
},
indexes: {
chrom_offset: typevar['Nchroms'] * int32,
bin1_offset: typevar['Nbins'] * int32
}
}
```
Notes:
- Any number of additional optional columns can be added to each table. (e.g. quality masks, normalization vectors).
- Genomic coordinates are assumed to be 0-based and intervals half-open (1-based ends).
Matrix storage format:
- The `bins` table is lexicographically sorted by `chrom_id`, `start`, `end`.
- The `matrix` table is lexicographically sorted by `bin1_id`, then `bin2_id`.
- Offset pointers are used to facilitate matrix queries. This is effectively a [compressed sparse row](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) storage scheme for a symmetric matrix.
Rather than build on top of a more full-featured, opinionated library like PyTables (or `pandas.HDFStore` built on top of that), we provide a simple and transparent data layout on top of HDF5 that supports random access range queries and can be easily [migrated](https://github.com/blaze/odo).
See also:
- [hdf2tab](https://github.com/blajoie/hdf2tab) converts dense Hi-C matrices stored in HDF5 files to tabular text files.
- The [biom](https://github.com/biocore/biom-format) format is an HDF5-based format for metagenomic observation matrices.
[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)
## A cool place to store your Hi-C
Cooler is a **sparse, compressed, binary** persistent storage format for Hi-C contact maps based on HDF5.
See [example Jupyter notebook](https://gist.github.com/nvictus/904160bca9d0e8d5aeeb).
The `cooler` library implements a simple **schema** to store a high resolution contact matrix along with important auxiliary data such as scaffold information, genomic bin annotations, and basic metadata.
Data tables are stored in a **columnar** representation as groups of 1D HDF5 array datasets of the same length. The contact matrix itself is stored as a table containing only the **nonzero upper triangle** pixels.
The library API provides a thin Python wrapper over [h5py](http://docs.h5py.org/en/latest/) for **range queries** on the data:
- Table selections are retrieved as Pandas `DataFrame`s
- Matrix slice selections are retrieved as SciPy sparse matrices or NumPy `ndarray`s
- The metadata is retrieved as a json-serializable dictionary.
Range queries can be supplied as either integer bin indexes or genomic coordinate intervals.
### Installation
Requirements:
- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. If you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage these dependencies instead of pip.
Install from PyPI using pip.
```sh
$ pip install cooler
```
For development, clone and install in "editable" (i.e. development) mode with the `-e` option.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```
### Schema
Required attributes (metadata):
```
genome-assembly : <string> Name of genome assembly
bin-type : {"fixed" or "variable"}
bin-size : <int or null> Size of bins in bp if bin-type is "fixed"
nchroms : <int> Number of rows in scaffolds table
nbins : <int> Number of rows in bins table
nnz : <int> Number of rows in matrix table
format-url : <url> URL to page providing format details
format-version : <string> The version of the current format
generated-by : <string> Agent that created the file
creation-date : <datetime> Date the file was built
metadata : <json> custom metadata about the experiment
```
The required tables and indexes can be represented in the [Datashape](http://datashape.readthedocs.org/en/latest/) layout language:
```
{
scaffolds: {
name: typevar['Nchroms'] * string[32, 'A']
length: typevar['Nchroms'] * int64,
},
bins: {
chrom_id: typevar['Nbins'] * int32,
start: typevar['Nbins'] * int64,
end: typevar['Nbins'] * int64
},
matrix: {
bin1_id: typevar['Nnz'] * int32,
bin2_id: typevar['Nnz'] * int32,
count: typevar['Nnz'] * int32
},
indexes: {
chrom_offset: typevar['Nchroms'] * int32,
bin1_offset: typevar['Nbins'] * int32
}
}
```
Notes:
- Any number of additional optional columns can be added to each table. (e.g. quality masks, normalization vectors).
- Genomic coordinates are assumed to be 0-based and intervals half-open (1-based ends).
Matrix storage format:
- The `bins` table is lexicographically sorted by `chrom_id`, `start`, `end`.
- The `matrix` table is lexicographically sorted by `bin1_id`, then `bin2_id`.
- Offset pointers are used to facilitate matrix queries. This is effectively a [compressed sparse row](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) storage scheme for a symmetric matrix.
Rather than build on top of a more full-featured, opinionated library like PyTables (or `pandas.HDFStore` built on top of that), we provide a simple and transparent data layout on top of HDF5 that supports random access range queries and can be easily [migrated](https://github.com/blaze/odo).
See also:
- [hdf2tab](https://github.com/blajoie/hdf2tab) converts dense Hi-C matrices stored in HDF5 files to tabular text files.
- The [biom](https://github.com/biocore/biom-format) format is an HDF5-based format for metagenomic observation matrices.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
cooler-0.2.1.tar.gz
(3.4 MB
view details)
Built Distribution
File details
Details for the file cooler-0.2.1.tar.gz
.
File metadata
- Download URL: cooler-0.2.1.tar.gz
- Upload date:
- Size: 3.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3ef1b57fe58576d842a9034fec7e3d82356f0a17adf7869713835b6f1fd30d67 |
|
MD5 | 468481a0e0767090d059de8cc634ee6b |
|
BLAKE2b-256 | 6e392d52b22d31b5b7832d31b7c71b7f2d5357be1bb307ffea6441a7febbcf88 |
File details
Details for the file cooler-0.2.1-py2.py3-none-any.whl
.
File metadata
- Download URL: cooler-0.2.1-py2.py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6c39c89a0a113363abb0ed390a6364d1b13d4e597b07b98cdbe02150eb53b78f |
|
MD5 | 7c9159168640e20d1475949d92cbf87f |
|
BLAKE2b-256 | fb86a40fe488f2d8ae780634e928a4c7882dc161cb50630335a7af92b7e59931 |