Skip to main content

Sparse binary format for Hi-C genomic contact heatmaps

Project description

# Cooler

[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)

## A cool place to store your Hi-C

Cooler is a **sparse, compressed, binary** persistent storage format for Hi-C contact maps based on HDF5.

See [example Jupyter notebook](https://gist.github.com/nvictus/904160bca9d0e8d5aeeb).

The `cooler` library implements a simple **schema** to store a high resolution contact matrix along with important auxiliary data such as scaffold information, genomic bin annotations, and basic metadata.

Data tables are stored in a **columnar** representation as groups of 1D HDF5 array datasets of the same length. The contact matrix itself is stored as a table containing only the **nonzero upper triangle** pixels.

The library API provides a thin Python wrapper over [h5py](http://docs.h5py.org/en/latest/) for **range queries** on the data:
- Table selections are retrieved as Pandas `DataFrame`s
- Matrix slice selections are retrieved as SciPy sparse matrices or NumPy `ndarray`s
- The metadata is retrieved as a json-serializable dictionary.

Range queries can be supplied as either integer bin indexes or genomic coordinate intervals.

### Installation

Requirements:

- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. If you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage these dependencies instead of pip.

Install from PyPI using pip.
```sh
$ pip install cooler
```

For development, clone and install in "editable" (i.e. development) mode with the `-e` option.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```


### Schema

Required attributes (metadata):
```
genome-assembly : <string> Name of genome assembly
bin-type : {"fixed" or "variable"}
bin-size : <int or null> Size of bins in bp if bin-type is "fixed"
nchroms : <int> Number of rows in scaffolds table
nbins : <int> Number of rows in bins table
nnz : <int> Number of rows in matrix table
format-url : <url> URL to page providing format details
format-version : <string> The version of the current format
generated-by : <string> Agent that created the file
creation-date : <datetime> Date the file was built
metadata : <json> custom metadata about the experiment
```

The required tables and indexes can be represented in the [Datashape](http://datashape.readthedocs.org/en/latest/) layout language:
```
{
scaffolds: {
name: typevar['Nchroms'] * string[32, 'A']
length: typevar['Nchroms'] * int64,
},
bins: {
chrom_id: typevar['Nbins'] * int32,
start: typevar['Nbins'] * int64,
end: typevar['Nbins'] * int64
},
matrix: {
bin1_id: typevar['Nnz'] * int32,
bin2_id: typevar['Nnz'] * int32,
count: typevar['Nnz'] * int32
},
indexes: {
chrom_offset: typevar['Nchroms'] * int32,
bin1_offset: typevar['Nbins'] * int32
}
}
```

Notes:
- Any number of additional optional columns can be added to each table. (e.g. quality masks, normalization vectors).
- Genomic coordinates are assumed to be 0-based and intervals half-open (1-based ends).

Matrix storage format:
- The `bins` table is lexicographically sorted by `chrom_id`, `start`, `end`.
- The `matrix` table is lexicographically sorted by `bin1_id`, then `bin2_id`.
- Offset pointers are used to facilitate matrix queries. This is effectively a [compressed sparse row](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) storage scheme for a symmetric matrix.

Rather than build on top of a more full-featured, opinionated library like PyTables (or `pandas.HDFStore` built on top of that), we provide a simple and transparent data layout on top of HDF5 that supports random access range queries and can be easily [migrated](https://github.com/blaze/odo).

See also:
- [hdf2tab](https://github.com/blajoie/hdf2tab) converts dense Hi-C matrices stored in HDF5 files to tabular text files.
- The [biom](https://github.com/biocore/biom-format) format is an HDF5-based format for metagenomic observation matrices.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cooler-0.2.1.tar.gz (3.4 MB view details)

Uploaded Source

Built Distribution

cooler-0.2.1-py2.py3-none-any.whl (17.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file cooler-0.2.1.tar.gz.

File metadata

  • Download URL: cooler-0.2.1.tar.gz
  • Upload date:
  • Size: 3.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for cooler-0.2.1.tar.gz
Algorithm Hash digest
SHA256 3ef1b57fe58576d842a9034fec7e3d82356f0a17adf7869713835b6f1fd30d67
MD5 468481a0e0767090d059de8cc634ee6b
BLAKE2b-256 6e392d52b22d31b5b7832d31b7c71b7f2d5357be1bb307ffea6441a7febbcf88

See more details on using hashes here.

File details

Details for the file cooler-0.2.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for cooler-0.2.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 6c39c89a0a113363abb0ed390a6364d1b13d4e597b07b98cdbe02150eb53b78f
MD5 7c9159168640e20d1475949d92cbf87f
BLAKE2b-256 fb86a40fe488f2d8ae780634e928a4c7882dc161cb50630335a7af92b7e59931

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page