
Sparse binary format for Hi-C genomic contact heatmaps

# Cooler

[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)

## A cool place to store your Hi-C

Cooler is a **sparse, compressed, binary** persistent storage format for Hi-C contact maps based on HDF5.

See [example Jupyter notebook](https://gist.github.com/nvictus/904160bca9d0e8d5aeeb).

The `cooler` library implements a simple **schema** to store a high-resolution contact matrix along with important auxiliary data such as scaffold information, genomic bin annotations, and basic metadata.

Data tables are stored in a **columnar** representation as groups of 1D HDF5 array datasets of the same length. The contact matrix itself is stored as a table containing only the **nonzero upper triangle** pixels.
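
As a rough illustration of the layout described above, the following sketch converts a small symmetric matrix into three equal-length columns holding only the nonzero upper-triangle pixels. It is not part of the `cooler` API: the helper name `to_upper_triangle_pixels` is hypothetical, plain Python lists stand in for 1D HDF5 datasets, and the diagonal is assumed to count as part of the upper triangle.

```python
def to_upper_triangle_pixels(dense):
    """Return columnar (bin1_id, bin2_id, count) lists for the nonzero
    upper triangle (diagonal included) of a symmetric matrix."""
    n = len(dense)
    bin1_id, bin2_id, count = [], [], []
    for i in range(n):
        for j in range(i, n):  # j >= i keeps only the upper triangle
            if dense[i][j] != 0:
                bin1_id.append(i)
                bin2_id.append(j)
                count.append(dense[i][j])
    # One table = a group of columns that all have the same length.
    return {"bin1_id": bin1_id, "bin2_id": bin2_id, "count": count}


# Toy symmetric contact matrix (made-up counts).
dense = [
    [2, 0, 1],
    [0, 0, 3],
    [1, 3, 4],
]
pixels = to_upper_triangle_pixels(dense)
```

Because rows are visited in order and columns in order within each row, the resulting table is already sorted by `bin1_id`, then `bin2_id`.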

The library API provides a thin Python wrapper over [h5py](http://docs.h5py.org/en/latest/) for **range queries** on the data:
- Table selections are retrieved as Pandas `DataFrame`s.
- Matrix slice selections are retrieved as SciPy sparse matrices or NumPy `ndarray`s.
- Metadata is retrieved as a JSON-serializable dictionary.

Range queries can be supplied as either integer bin indexes or genomic coordinate intervals.
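
The two query styles are related by simple arithmetic when bins have a fixed size. The sketch below shows one way a genomic interval could be translated into a range of integer bin indexes. The function `region_to_bin_range` is a hypothetical helper, not the library's API, and it assumes fixed-size bins, 0-based half-open intervals, and a list giving the index of the first bin of each chromosome.

```python
def region_to_bin_range(chrom_offset, chrom_id, start, end, binsize):
    """Map the half-open interval [start, end) on one chromosome to a
    half-open range [lo, hi) of global bin indexes."""
    first = chrom_offset[chrom_id]             # first bin of this chromosome
    lo = first + start // binsize              # bin containing `start`
    hi = first + (end + binsize - 1) // binsize  # ceiling division, exclusive
    return lo, hi


# Assumed toy layout: chromosome 0 has 10 bins, so chromosome 1 starts at bin 10.
chrom_offset = [0, 10]
lo, hi = region_to_bin_range(chrom_offset, chrom_id=1, start=150, end=320, binsize=100)
```

Here the interval 150-320 on the second chromosome overlaps its bins 1, 2 and 3, which are global bins 11, 12 and 13, so the returned range is `(11, 14)`.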

### Installation

Requirements:

- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. If you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage these dependencies instead of pip.

Install from PyPI using pip.
```sh
$ pip install cooler
```

For development, clone the repository and install it in "editable" (i.e. development) mode with the `-e` option.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```


### Schema

Required attributes (metadata):
```
genome-assembly : <string> Name of genome assembly
bin-type : {"fixed" or "variable"}
bin-size : <int or null> Size of bins in bp if bin-type is "fixed"
nchroms : <int> Number of rows in scaffolds table
nbins : <int> Number of rows in bins table
nnz : <int> Number of rows in matrix table
format-url : <url> URL to page providing format details
format-version : <string> The version of the current format
generated-by : <string> Agent that created the file
creation-date : <datetime> Date the file was built
metadata : <json> Custom metadata about the experiment
```
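
For illustration, the attributes above could be held in a plain dictionary and checked before writing. This is only a sketch: `REQUIRED_ATTRS` and `missing_attrs` are hypothetical names, and every value in `info` is a made-up placeholder rather than output from a real file.

```python
import json

# Attribute names copied from the list above.
REQUIRED_ATTRS = [
    "genome-assembly", "bin-type", "bin-size", "nchroms", "nbins", "nnz",
    "format-url", "format-version", "generated-by", "creation-date", "metadata",
]


def missing_attrs(info):
    """Return the required attribute names that are absent from `info`."""
    return [key for key in REQUIRED_ATTRS if key not in info]


# Placeholder values for illustration only.
info = {
    "genome-assembly": "example-assembly",
    "bin-type": "fixed",
    "bin-size": 100,
    "nchroms": 2,
    "nbins": 5,
    "nnz": 4,
    "format-url": "https://example.org/format",
    "format-version": "0.0-example",
    "generated-by": "example-script",
    "creation-date": "2016-01-01T00:00:00",  # datetime stored as an ISO 8601 string
    "metadata": {"note": "toy example"},
}

serialized = json.dumps(info)  # succeeds because every value is JSON-serializable
```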

The required tables and indexes can be represented in the [Datashape](http://datashape.readthedocs.org/en/latest/) layout language:
```
{
  scaffolds: {
    name: typevar['Nchroms'] * string[32, 'A'],
    length: typevar['Nchroms'] * int64
  },
  bins: {
    chrom_id: typevar['Nbins'] * int32,
    start: typevar['Nbins'] * int64,
    end: typevar['Nbins'] * int64
  },
  matrix: {
    bin1_id: typevar['Nnz'] * int32,
    bin2_id: typevar['Nnz'] * int32,
    count: typevar['Nnz'] * int32
  },
  indexes: {
    chrom_offset: typevar['Nchroms'] * int32,
    bin1_offset: typevar['Nbins'] * int32
  }
}
```
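
To make the relationship between the `scaffolds` table, the `bins` table and the `chrom_offset` index concrete, here is a minimal sketch that derives the latter two from scaffold lengths. It assumes fixed-size bins, with the last bin of each scaffold truncated at the scaffold end; the helper `build_bins` and the scaffold names are hypothetical, and lists stand in for HDF5 datasets.

```python
def build_bins(chrom_lengths, binsize):
    """Build columnar `bins` and a `chrom_offset` index from scaffold lengths."""
    chrom_id, start, end, chrom_offset = [], [], [], []
    for cid, length in enumerate(chrom_lengths):
        chrom_offset.append(len(chrom_id))  # index of this scaffold's first bin
        for s in range(0, length, binsize):
            chrom_id.append(cid)
            start.append(s)
            end.append(min(s + binsize, length))  # half-open end, clipped
    bins = {"chrom_id": chrom_id, "start": start, "end": end}
    return bins, chrom_offset


# Toy scaffolds table (made-up names and lengths).
scaffolds = {"name": ["chrA", "chrB"], "length": [250, 120]}
bins, chrom_offset = build_bins(scaffolds["length"], binsize=100)
```

In this toy case `Nchroms` is 2 and `Nbins` is 5: three bins for the first scaffold and two for the second, with `chrom_offset` equal to `[0, 3]`.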

Notes:
- Any number of additional optional columns (e.g. quality masks, normalization vectors) can be added to each table.
- Genomic coordinates are assumed to be 0-based and intervals half-open, i.e. `[start, end)` (equivalently, ends are 1-based and inclusive).

Matrix storage format:
- The `bins` table is lexicographically sorted by `chrom_id`, `start`, `end`.
- The `matrix` table is lexicographically sorted by `bin1_id`, then `bin2_id`.
- Offset pointers are used to facilitate matrix queries. This is effectively a [compressed sparse row](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) storage scheme for a symmetric matrix.
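
The sketch below illustrates how offset pointers could speed up row queries on the sorted `matrix` table. It is an assumption-laden illustration rather than the library's implementation: the helpers `row_offsets` and `pixels_in_rows` are hypothetical, and since the schema gives `bin1_offset` a length of `Nbins`, the end of the last row is taken to be `nnz`.

```python
def row_offsets(bin1_id, nbins):
    """offsets[i] = index of the first pixel whose bin1_id is >= i.
    Requires `bin1_id` to be sorted in ascending order."""
    offsets = []
    k = 0
    for i in range(nbins):
        while k < len(bin1_id) and bin1_id[k] < i:
            k += 1
        offsets.append(k)
    return offsets


def pixels_in_rows(matrix, offsets, lo, hi):
    """Return (bin1_id, bin2_id, count) pixels for rows in [lo, hi)
    without scanning the whole table."""
    nnz = len(matrix["bin1_id"])
    begin = offsets[lo]
    end = offsets[hi] if hi < len(offsets) else nnz
    return [
        (matrix["bin1_id"][k], matrix["bin2_id"][k], matrix["count"][k])
        for k in range(begin, end)
    ]


# Toy upper-triangle pixel table, sorted by bin1_id then bin2_id.
matrix = {"bin1_id": [0, 0, 1, 2], "bin2_id": [0, 2, 2, 2], "count": [2, 1, 3, 4]}
bin1_offset = row_offsets(matrix["bin1_id"], nbins=3)
rows_1_to_3 = pixels_in_rows(matrix, bin1_offset, 1, 3)
```

Only the upper triangle is stored, so a query for a full symmetric block would also need the mirrored pixels; that step is omitted here.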

Rather than build on top of a more full-featured, opinionated library like PyTables (or `pandas.HDFStore` built on top of that), we provide a simple and transparent data layout on top of HDF5 that supports random access range queries and can be easily [migrated](https://github.com/blaze/odo).

See also:
- [hdf2tab](https://github.com/blajoie/hdf2tab) converts dense Hi-C matrices stored in HDF5 files to tabular text files.
- The [biom](https://github.com/biocore/biom-format) format is an HDF5-based format for metagenomic observation matrices.
