Sparse binary format for genomic interaction matrices

# Cooler

[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)
[![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/mirnylab/cooler-binder)

## A cool place to store your Hi-C

Cooler is a support library for a **sparse, compressed, binary** persistent storage format for Hi-C contact matrices, called `cool`, which is based on [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

Cooler aims to provide the following functionality:

- Generate contact matrices from contact lists at arbitrary resolutions.
- Store contact matrices efficiently in `cool` format based on the widely used HDF5 container format.
- Perform out-of-core, genome-wide contact matrix normalization (a.k.a. balancing).
- Perform fast range queries on a contact matrix.
- Convert contact matrices between formats.
- Provide a clean and well-documented Python API to work with Hi-C data.


To get started:

- Documentation is available [here](http://cooler.readthedocs.org/en/latest/).
- Walkthrough with a [Jupyter notebook](https://github.com/mirnylab/cooler-binder).
- Some published data sets are available at `ftp://cooler.csail.mit.edu/coolers`.


### Installation

Requirements:

- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. These packages have heavy binary dependencies, so if you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage them instead of pip. All other Python package dependencies are easily handled by pip.
- See the [docs](http://cooler.readthedocs.org/en/latest/) for more information.

Install from PyPI using pip.
```sh
$ pip install cooler
```


### Command line interface

The `cooler` library includes utilities for creating and querying `cool` files and for performing out-of-core contact **matrix balancing** on a cooler file of any resolution. See the [docs](http://cooler.readthedocs.org/en/latest/) for more information.

```bash
$ cooler makebins $CHROMSIZES_FILE $BINSIZE > bins.10kb.bed
$ cooler cload bins.10kb.bed $CONTACTS_FILE out.cool
$ cooler balance -p 10 out.cool
$ cooler dump -b -t pixels --header --join -r chr3:10,000,000-12,000,000 -r2 chr17 out.cool | head
```

```
chrom1 start1 end1 chrom2 start2 end2 count balanced
chr3 10000000 10010000 chr17 0 10000 1 0.810766
chr3 10000000 10010000 chr17 520000 530000 1 1.2055
chr3 10000000 10010000 chr17 640000 650000 1 0.587372
chr3 10000000 10010000 chr17 900000 910000 1 1.02558
chr3 10000000 10010000 chr17 1030000 1040000 1 0.718195
chr3 10000000 10010000 chr17 1320000 1330000 1 0.803212
chr3 10000000 10010000 chr17 1500000 1510000 1 0.925146
chr3 10000000 10010000 chr17 1750000 1760000 1 0.950326
chr3 10000000 10010000 chr17 1800000 1810000 1 0.745982
```
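
To make the binning step concrete, the sketch below shows one way fixed-width genomic bins can be derived from chromosome lengths. It is only an illustration, not the code behind `cooler makebins`; the function name `make_bins` and the toy chromosome sizes are assumptions made for this example.

```python
def make_bins(chromsizes, binsize):
    """Yield (chrom, start, end) tuples that tile each chromosome.

    The last bin of a chromosome is truncated at the chromosome end.
    """
    for chrom, length in chromsizes:
        for start in range(0, length, binsize):
            yield chrom, start, min(start + binsize, length)


# Toy chromosome sizes (hypothetical values, for illustration only).
chromsizes = [("chrA", 25000), ("chrB", 10000)]
bins = list(make_bins(chromsizes, 10000))

# Print the bins as tab-separated, BED-like rows.
for chrom, start, end in bins:
    print(chrom, start, end, sep="\t")
```

Intervals are half-open (`start` inclusive, `end` exclusive), matching the `start1`/`end1` columns in the dump output above.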

### Python API

The `cooler` [library](https://github.com/mirnylab/cooler) provides a thin wrapper over the excellent [h5py](http://docs.h5py.org/en/latest/) Python interface to HDF5. It supports creation of cooler files and the following types of **range queries** on the data:

- Tabular selections are retrieved as Pandas DataFrames and Series.
- Matrix selections are retrieved as SciPy sparse matrices.
- Metadata is retrieved as a JSON-serializable Python dictionary.
- Range queries can be supplied using either integer bin indexes or genomic coordinate intervals.

```python
>>> import cooler
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> c = cooler.Cooler('bigDataset.cool')
>>> resolution = c.info['bin-size']
>>> mat = c.matrix(balance=True).fetch('chr5:10,000,000-15,000,000')
>>> plt.matshow(np.log10(mat.toarray()), cmap='YlOrRd')
```
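
Since range queries accept either bin indexes or genomic intervals, it can help to see how the two relate. The following is a rough sketch of the conversion for a single chromosome with fixed-width bins. The helpers `parse_region` and `region_to_bin_span`, and the `chrom_offset` argument, are hypothetical names for this example and are not part of the `cooler` API.

```python
import re


def parse_region(region):
    """Parse 'chrom:start-end' (commas allowed) into (chrom, start, end)."""
    match = re.match(r"^([^:]+):([\d,]+)-([\d,]+)$", region)
    if match is None:
        raise ValueError("expected 'chrom:start-end', got %r" % region)
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def region_to_bin_span(region, binsize, chrom_offset=0):
    """Return the half-open range [lo, hi) of bin IDs overlapping the region.

    chrom_offset is the ID of the chromosome's first bin (an assumption of
    this sketch; a real file would look it up from its bin table).
    """
    _, start, end = parse_region(region)
    lo = start // binsize
    hi = -(-end // binsize)  # ceiling division
    return chrom_offset + lo, chrom_offset + hi


span = region_to_bin_span("chr5:10,000,000-15,000,000", 10000)
print(span)
```

At 10 kb resolution, the 5 Mb interval above covers 500 bins, counted from the first bin of the chromosome.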

Also see the [Jupyter notebook](https://github.com/mirnylab/cooler-binder) walkthrough.

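Matrix balancing can also be run from Python, passing a process pool's `map` to parallelize the computation:

```python
>>> import multiprocessing as mp
>>> import h5py
>>> import cooler
>>> pool = mp.Pool(8)  # eight worker processes
>>> f = h5py.File('bigDataset.cool', 'r')
>>> weights = cooler.ice.iterative_correction(f, map=pool.map, ignore_diags=3, min_nnz=10)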
```
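
For intuition about what balancing computes, here is a minimal in-memory sketch of iterative correction on a small dense symmetric matrix. It is not cooler's out-of-core implementation: `balance_dense` is a hypothetical name, the counts are made up, and the sketch assumes every row has a nonzero sum and omits filtering options such as `ignore_diags` and `min_nnz`.

```python
import numpy as np


def balance_dense(mat, tol=1e-5, max_iters=200):
    """Return weights w so that w[i] * mat[i, j] * w[j] has near-equal row sums."""
    mat = np.asarray(mat, dtype=float)
    weights = np.ones(mat.shape[0])
    for _ in range(max_iters):
        # Row sums of the matrix under the current weights.
        marg = weights * mat.dot(weights)
        # Express each row sum relative to the mean row sum.
        rel = marg / marg.mean()
        if rel.var() < tol:
            break
        # Down-weight rows that are above average, up-weight those below.
        weights = weights / rel
    # Rescale so the balanced row sums average to 1.
    marg = weights * mat.dot(weights)
    return weights / np.sqrt(marg.mean())


# Toy symmetric contact counts (made-up values).
counts = np.array([[10.0, 4.0, 2.0],
                   [4.0, 20.0, 6.0],
                   [2.0, 6.0, 5.0]])
w = balance_dense(counts)
balanced = counts * np.outer(w, w)
print(balanced.sum(axis=1))
```

Because each pixel is multiplied by the weights of both its row and its column, the balanced matrix stays symmetric.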


### Cooler Schema

The `cool` [format](http://cooler.readthedocs.io/en/latest/intro.html#data-model) implements a simple schema that stores a contact matrix in a sparse representation. Sparse storage is crucial for developing robust tools for increasingly high-resolution Hi-C data sets, including streaming and [out-of-core](https://en.wikipedia.org/wiki/Out-of-core_algorithm) algorithms.

The data tables in a `cool` file are stored in a **columnar** representation as HDF5 groups of 1D array datasets of equal length. The contact matrix itself is stored as a single table containing only the **nonzero upper triangle** pixels.
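
The idea of a columnar table of nonzero upper-triangle pixels can be sketched with plain NumPy arrays. This is an in-memory illustration only, with no HDF5 involved; the toy matrix is made up, and the column names `bin1_id`, `bin2_id` and `count` are used here as an assumption about how such a table might be labeled.

```python
import numpy as np

# Toy symmetric contact matrix over 4 bins (made-up counts).
dense = np.array([
    [5, 2, 0, 1],
    [2, 3, 0, 0],
    [0, 0, 0, 4],
    [1, 0, 4, 7],
])

# Keep only nonzero pixels on or above the diagonal.
bin1_id, bin2_id = np.nonzero(np.triu(dense))

# Columnar layout: one equal-length 1D array per column.
pixels = {
    "bin1_id": bin1_id,
    "bin2_id": bin2_id,
    "count": dense[bin1_id, bin2_id],
}

# Rebuild the full symmetric matrix from the pixel table.
restored = np.zeros_like(dense)
restored[pixels["bin1_id"], pixels["bin2_id"]] = pixels["count"]
off_diag = pixels["bin1_id"] != pixels["bin2_id"]
restored[pixels["bin2_id"][off_diag], pixels["bin1_id"][off_diag]] = pixels["count"][off_diag]

print(len(pixels["count"]), "pixels stored for", dense.size, "matrix entries")
```

Storing only the upper triangle roughly halves the number of stored pixels for a symmetric matrix, and skipping zeros matters more and more as resolution increases and the matrix becomes sparser.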


### Contributing

[Pull requests](https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/) are welcome. The current requirements for testing are `nose` and `mock`.

For development, clone and install in "editable" (i.e. development) mode with the `-e` option. This way you can also pull changes on the fly.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```
