Sparse binary format for genomic interaction matrices
Project description
# Cooler
[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)
[![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/mirnylab/cooler-binder)
## A cool place to store your Hi-C
Cooler is a support library for a **sparse, compressed, binary** persistent storage format for Hi-C contact matrices, called `cool`, which is based on [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format).
Cooler aims to provide the following functionality:
- Generate contact matrices from contact lists at arbitrary resolutions.
- Store contact matrices efficiently in `cool` format based on the widely used HDF5 container format.
- Perform out-of-core genome wide contact matrix normalization (a.k.a. balancing)
- Perform fast range queries on a contact matrix.
- Convert contact matrices between formats.
- Provide a clean and well-documented Python API to work with Hi-C data.
To get started:
- Documentation is available [here](http://cooler.readthedocs.org/en/latest/).
- Walkthrough with a [Jupyter notebook](https://github.com/mirnylab/cooler-binder).
- Some published data sets are available at `ftp://cooler.csail.mit.edu/coolers`.
### Installation
Requirements:
- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. These packages have heavy binary dependencies, so if you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage them instead of pip. All other Python package dependencies are easily handled by pip.
- See the [docs](http://cooler.readthedocs.org/en/latest/) for more information.
Install from PyPI using pip.
```sh
$ pip install cooler
```
### Command line interface
The `cooler` library includes utilities for performing out-of-core contact **matrix balancing** on a cooler file of any resolution. See the [docs](http://cooler.readthedocs.org/en/latest/) for more information.
```bash
$ cooler binnify $CHROMSIZES_FILE $BINSIZE > bins.10kb.bed
$ cooler cload bins.10kb.bed $CONTACTS_FILE out.cool
$ cooler balance -p 10 out.cool
$ cooler dump -b -t pixels --header --join -r chr3:10,000,000-12,000,000 -r2 chr17 out.cool | head
```
```
chrom1 start1 end1 chrom2 start2 end2 count balanced
chr3 10000000 10010000 chr17 0 10000 1 0.810766
chr3 10000000 10010000 chr17 520000 530000 1 1.2055
chr3 10000000 10010000 chr17 640000 650000 1 0.587372
chr3 10000000 10010000 chr17 900000 910000 1 1.02558
chr3 10000000 10010000 chr17 1030000 1040000 1 0.718195
chr3 10000000 10010000 chr17 1320000 1330000 1 0.803212
chr3 10000000 10010000 chr17 1500000 1510000 1 0.925146
chr3 10000000 10010000 chr17 1750000 1760000 1 0.950326
chr3 10000000 10010000 chr17 1800000 1810000 1 0.745982
```
### Python API
The `cooler` [library](https://github.com/mirnylab/cooler) provides a thin wrapper over the excellent [h5py](http://docs.h5py.org/en/latest/) Python interface to HDF5. It supports creation of cooler files and the following types of **range queries** on the data:
- Tabular selections are retrieved as Pandas DataFrames and Series.
- Matrix selections are retrieved as SciPy sparse matrices.
- Metadata is retrieved as a json-serializable Python dictionary.
- Range queries can be supplied using either integer bin indexes or genomic coordinate intervals.
```python
>>> import cooler
>>> import matplotlib.pyplot as plt
>>> c = cooler.Cooler('bigDataset.cool')
>>> resolution = c.info['bin-size']
>>> mat = c.matrix(balance=True).fetch('chr5:10,000,000-15,000,000')
>>> plt.matshow(np.log10(mat.toarray()), cmap='YlOrRd')
```
Also see the [Jupyter notebook](https://github.com/mirnylab/cooler-binder) walkthrough.
```python
>>> import multiprocessing as mp
>>> import h5py
>>> pool = mp.Pool(8)
>>> f = h5py.File('bigDataset.cool', 'r')
>>> weights = cooler.ice.iterative_correction(f, map=pool.map, ignore_diags=3, min_nnz=10)
```
### Cooler Schema
The `cool` [format](http://cooler.readthedocs.io/en/latest/intro.html#data-model) implements a simple schema that stores a contact matrix in a sparse representation, crucial for developing robust tools for use on increasingly high resolution Hi-C data sets, including streaming and [out-of-core](https://en.wikipedia.org/wiki/Out-of-core_algorithm) algorithms.
The data tables in a `cool` file are stored in a **columnar** representation as HDF5 groups of 1D array datasets of equal length. The contact matrix itself is stored as a single table containing only the **nonzero upper triangle** pixels.
### Contributing
[Pull requests](https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/) are welcome. The current requirements for testing are `nose` and `mock`.
For development, clone and install in "editable" (i.e. development) mode with the `-e` option. This way you can also pull changes on the fly.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```
[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)
[![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/mirnylab/cooler-binder)
## A cool place to store your Hi-C
Cooler is a support library for a **sparse, compressed, binary** persistent storage format for Hi-C contact matrices, called `cool`, which is based on [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format).
Cooler aims to provide the following functionality:
- Generate contact matrices from contact lists at arbitrary resolutions.
- Store contact matrices efficiently in `cool` format based on the widely used HDF5 container format.
- Perform out-of-core genome wide contact matrix normalization (a.k.a. balancing)
- Perform fast range queries on a contact matrix.
- Convert contact matrices between formats.
- Provide a clean and well-documented Python API to work with Hi-C data.
To get started:
- Documentation is available [here](http://cooler.readthedocs.org/en/latest/).
- Walkthrough with a [Jupyter notebook](https://github.com/mirnylab/cooler-binder).
- Some published data sets are available at `ftp://cooler.csail.mit.edu/coolers`.
### Installation
Requirements:
- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. These packages have heavy binary dependencies, so if you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage them instead of pip. All other Python package dependencies are easily handled by pip.
- See the [docs](http://cooler.readthedocs.org/en/latest/) for more information.
Install from PyPI using pip.
```sh
$ pip install cooler
```
### Command line interface
The `cooler` library includes utilities for performing out-of-core contact **matrix balancing** on a cooler file of any resolution. See the [docs](http://cooler.readthedocs.org/en/latest/) for more information.
```bash
$ cooler binnify $CHROMSIZES_FILE $BINSIZE > bins.10kb.bed
$ cooler cload bins.10kb.bed $CONTACTS_FILE out.cool
$ cooler balance -p 10 out.cool
$ cooler dump -b -t pixels --header --join -r chr3:10,000,000-12,000,000 -r2 chr17 out.cool | head
```
```
chrom1 start1 end1 chrom2 start2 end2 count balanced
chr3 10000000 10010000 chr17 0 10000 1 0.810766
chr3 10000000 10010000 chr17 520000 530000 1 1.2055
chr3 10000000 10010000 chr17 640000 650000 1 0.587372
chr3 10000000 10010000 chr17 900000 910000 1 1.02558
chr3 10000000 10010000 chr17 1030000 1040000 1 0.718195
chr3 10000000 10010000 chr17 1320000 1330000 1 0.803212
chr3 10000000 10010000 chr17 1500000 1510000 1 0.925146
chr3 10000000 10010000 chr17 1750000 1760000 1 0.950326
chr3 10000000 10010000 chr17 1800000 1810000 1 0.745982
```
### Python API
The `cooler` [library](https://github.com/mirnylab/cooler) provides a thin wrapper over the excellent [h5py](http://docs.h5py.org/en/latest/) Python interface to HDF5. It supports creation of cooler files and the following types of **range queries** on the data:
- Tabular selections are retrieved as Pandas DataFrames and Series.
- Matrix selections are retrieved as SciPy sparse matrices.
- Metadata is retrieved as a json-serializable Python dictionary.
- Range queries can be supplied using either integer bin indexes or genomic coordinate intervals.
```python
>>> import cooler
>>> import matplotlib.pyplot as plt
>>> c = cooler.Cooler('bigDataset.cool')
>>> resolution = c.info['bin-size']
>>> mat = c.matrix(balance=True).fetch('chr5:10,000,000-15,000,000')
>>> plt.matshow(np.log10(mat.toarray()), cmap='YlOrRd')
```
Also see the [Jupyter notebook](https://github.com/mirnylab/cooler-binder) walkthrough.
```python
>>> import multiprocessing as mp
>>> import h5py
>>> pool = mp.Pool(8)
>>> f = h5py.File('bigDataset.cool', 'r')
>>> weights = cooler.ice.iterative_correction(f, map=pool.map, ignore_diags=3, min_nnz=10)
```
### Cooler Schema
The `cool` [format](http://cooler.readthedocs.io/en/latest/intro.html#data-model) implements a simple schema that stores a contact matrix in a sparse representation, crucial for developing robust tools for use on increasingly high resolution Hi-C data sets, including streaming and [out-of-core](https://en.wikipedia.org/wiki/Out-of-core_algorithm) algorithms.
The data tables in a `cool` file are stored in a **columnar** representation as HDF5 groups of 1D array datasets of equal length. The contact matrix itself is stored as a single table containing only the **nonzero upper triangle** pixels.
### Contributing
[Pull requests](https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/) are welcome. The current requirements for testing are `nose` and `mock`.
For development, clone and install in "editable" (i.e. development) mode with the `-e` option. This way you can also pull changes on the fly.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
cooler-0.4.1.tar.gz
(40.0 MB
view details)
Built Distribution
File details
Details for the file cooler-0.4.1.tar.gz
.
File metadata
- Download URL: cooler-0.4.1.tar.gz
- Upload date:
- Size: 40.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | cf7a49c9cbea7d3519422bfef7d5e8c91a34158bf6f615f8af73ed743f2dc2f3 |
|
MD5 | 20ac9fc09e5b4c23172a4ba2eeedf797 |
|
BLAKE2b-256 | 902368e0c1a6c79626154f56dc20979db39f4a2334c584fe53521940ab8305d6 |
File details
Details for the file cooler-0.4.1-py2.py3-none-any.whl
.
File metadata
- Download URL: cooler-0.4.1-py2.py3-none-any.whl
- Upload date:
- Size: 50.6 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 82e84c69460be3c17953efe5363e1af353b272eb2fa41533c7046090112d52e0 |
|
MD5 | a1896c098ddb247602b0232d50fbf3f7 |
|
BLAKE2b-256 | e00786f60e9ef50ef4f0b0f8f17d93906beb453b15431affddfb9444b0f7addb |