Sparse binary format for genomic interaction matrices
Cooler
A cool place to store your Hi-C
Cooler is a support library for a sparse, compressed, binary persistent storage format, called cooler, used to store genomic interaction data, such as Hi-C contact matrices.
The cooler file format is a reference implementation of a genomic matrix data model using HDF5 as the container format.
The cooler package aims to provide the following functionality:
- Build contact matrices at any resolution from a list of contacts.
- Query a contact matrix.
- Export and visualize the data.
- Perform efficient out-of-core operations, such as aggregation and contact matrix normalization (a.k.a. balancing).
- Facilitate working with potentially larger-than-memory data.
To get started:
- Read the documentation.
- See the Jupyter Notebook walkthrough.
- cool files from published Hi-C data sets are available at ftp://cooler.csail.mit.edu/coolers.
- Many more multires (mcool) files are available on the 4DN data portal.
Related projects:
- Process Hi-C data with distiller.
- Downstream analysis with cooltools (WIP).
- Visualize your Cooler data with HiGlass!
Installation
Requirements:
- Python 2.7/3.4+
- libhdf5 and Python packages numpy, scipy, pandas, h5py.
We highly recommend using the conda package manager to install scientific packages like these. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.
Install from PyPI using pip.
$ pip install cooler
If you are using conda, you can alternatively install cooler from the bioconda channel.
$ conda install -c conda-forge -c bioconda cooler
See the docs for more information.
Command line interface
The cooler package includes command line tools for creating, querying and manipulating cooler files.
$ cooler cload pairs hg19.chrom.sizes:10000 $PAIRS_FILE out.10000.cool
$ cooler balance -p 10 out.10000.cool
$ cooler dump -b -t pixels --header --join -r chr3:10M-12M -r2 chr17 out.10000.cool | head
chrom1 start1 end1 chrom2 start2 end2 count balanced
chr3 10000000 10010000 chr17 0 10000 1 0.810766
chr3 10000000 10010000 chr17 520000 530000 1 1.2055
chr3 10000000 10010000 chr17 640000 650000 1 0.587372
chr3 10000000 10010000 chr17 900000 910000 1 1.02558
chr3 10000000 10010000 chr17 1030000 1040000 1 0.718195
chr3 10000000 10010000 chr17 1320000 1330000 1 0.803212
chr3 10000000 10010000 chr17 1500000 1510000 1 0.925146
chr3 10000000 10010000 chr17 1750000 1760000 1 0.950326
chr3 10000000 10010000 chr17 1800000 1810000 1 0.745982
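The joined pixel table printed by cooler dump above is plain delimited text, so it can be post-processed with ordinary tools. The following is a minimal Python sketch, assuming whitespace-separated columns exactly as in the sample output above and a numeric value in every balanced field. The helper name parse_dump is hypothetical and not part of cooler.

```python
def parse_dump(text):
    """Parse joined pixel rows (with a header line) into a list of dicts.

    Hypothetical helper for illustration; assumes the column names shown
    in the sample output above.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    int_cols = {"start1", "end1", "start2", "end2", "count"}
    rows = []
    for ln in lines[1:]:
        rec = dict(zip(header, ln.split()))
        for key in rec:
            if key in int_cols:
                rec[key] = int(rec[key])
            elif key == "balanced":
                rec[key] = float(rec[key])
        rows.append(rec)
    return rows


# Two rows copied from the sample output above.
sample = """\
chrom1 start1 end1 chrom2 start2 end2 count balanced
chr3 10000000 10010000 chr17 0 10000 1 0.810766
chr3 10000000 10010000 chr17 520000 530000 1 1.2055
"""
rows = parse_dump(sample)
```

For larger outputs, reading the same text with pandas is likely more convenient; the sketch only shows the column layout.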
See also:
- CLI Reference.
- Jupyter Notebook walkthrough.
Python API
The cooler library provides a thin wrapper over the excellent h5py Python interface to HDF5. It supports creation of cooler files and the following types of range queries on the data:
- Tabular selections are retrieved as Pandas DataFrames and Series.
- Matrix selections are retrieved as NumPy arrays or SciPy sparse matrices.
- Metadata is retrieved as a json-serializable Python dictionary.
- Range queries can be supplied using either integer bin indexes or genomic coordinate intervals. Note that queries with coordinate intervals that are not multiples of the bin size will return the shortest range of bins that fully contains the half-open interval [start, end).
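The rule for coordinate intervals that are not aligned to bin boundaries can be illustrated with a little arithmetic. This is a sketch only, assuming a fixed bin size and ignoring per-chromosome bin offsets and the shorter final bin of a chromosome. The function name covering_bins is hypothetical and not part of the cooler API.

```python
def covering_bins(start, end, binsize):
    """Return (lo, hi) so that bins lo..hi-1 fully contain [start, end).

    Hypothetical helper: bin k is assumed to span
    [k * binsize, (k + 1) * binsize).
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    lo = start // binsize        # round the start down to a bin boundary
    hi = -(-end // binsize)      # round the end up (ceiling division)
    return lo, hi


# A 10 kb-resolution query that starts and ends inside bins.
lo, hi = covering_bins(10004000, 10021000, 10000)
```

Here the query [10004000, 10021000) is widened to bins 1000 through 1002, which together span [10000000, 10030000).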
>>> import cooler
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> c = cooler.Cooler('bigDataset.cool')
>>> resolution = c.binsize
>>> mat = c.matrix(balance=True).fetch('chr5:10,000,000-15,000,000')
>>> plt.matshow(np.log10(mat), cmap='YlOrRd')
>>> import multiprocessing as mp
>>> import h5py
>>> pool = mp.Pool(8)
>>> c = cooler.Cooler('bigDataset.cool')
>>> weights, stats = cooler.balance_cooler(c, map=pool.map, ignore_diags=3, min_nnz=10)
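For intuition about what balancing computes, the following is a toy, in-memory sketch of iterative correction on a small dense symmetric matrix. It runs under simplifying assumptions (every entry positive, no bin filtering, no ignored diagonals, no chunked or parallel processing) and is not cooler's implementation. The function name balance_dense is hypothetical.

```python
def balance_dense(matrix, tol=1e-10, max_iters=500):
    """Rescale a symmetric positive matrix so all row sums become equal.

    Returns (balanced, bias), where
    balanced[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    Toy illustration only; assumes no row sums to zero.
    """
    n = len(matrix)
    work = [[float(x) for x in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iters):
        sums = [sum(row) for row in work]
        mean = sum(sums) / n
        # Relative deviation of each row sum from the mean row sum.
        scale = [s / mean for s in sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
        if max(abs(s - 1.0) for s in scale) < tol:
            break
    return work, bias


toy = [[10, 4, 2],
       [4, 8, 3],
       [2, 3, 6]]
balanced, bias = balance_dense(toy)
row_sums = [sum(row) for row in balanced]
```

After convergence the entries of row_sums agree to within numerical tolerance, and dividing the raw counts by the product of the two bias values reproduces the balanced matrix.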
See also:
- API Reference.
- Jupyter Notebook walkthrough.
Schema
The cool format implements a simple data model that stores a genomic matrix in a sparse representation. Sparse storage is crucial for developing robust tools, including streaming and out-of-core algorithms, for use on increasingly high resolution Hi-C data sets.
The data tables in a cooler file are stored in a columnar representation as HDF5 groups of 1D array datasets of equal length. A symmetric contact matrix is represented as a single table containing only the nonzero upper triangle pixels.
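The idea of a columnar table holding only the nonzero upper triangle can be sketched in plain Python. This is an illustration of the data model, not of the on-disk HDF5 layout: plain lists stand in for the 1D array datasets, and the column names bin1_id, bin2_id and count, like both function names, are assumptions made for the example; consult the schema documentation for the actual layout.

```python
def to_pixel_columns(dense):
    """Collect the nonzero upper-triangle entries as equal-length columns."""
    cols = {"bin1_id": [], "bin2_id": [], "count": []}
    n = len(dense)
    for i in range(n):
        for j in range(i, n):          # upper triangle, diagonal included
            if dense[i][j] != 0:
                cols["bin1_id"].append(i)
                cols["bin2_id"].append(j)
                cols["count"].append(dense[i][j])
    return cols


def to_dense(cols, n):
    """Rebuild the full symmetric matrix by mirroring each stored pixel."""
    dense = [[0] * n for _ in range(n)]
    for i, j, c in zip(cols["bin1_id"], cols["bin2_id"], cols["count"]):
        dense[i][j] = c
        dense[j][i] = c
    return dense


dense = [[5, 2, 0],
         [2, 0, 1],
         [0, 1, 3]]
pixels = to_pixel_columns(dense)
restored = to_dense(pixels, 3)
```

Only four of the nine entries are stored in this example, and the symmetric matrix is recovered exactly by mirroring.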
Contributing
Pull requests are welcome. The current requirements for testing are pytest and mock.
For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
License
BSD (New)