Sparse binary format for genomic interaction matrices

Join the chat at https://gitter.im/mirnylab/cooler

A cool place to store your Hi-C

Cooler is a support library for cool, a sparse, compressed, binary persistent storage format for Hi-C contact matrices that is based on HDF5.

Cooler aims to provide the following functionality:

  • Generate contact matrices from contact lists at arbitrary resolutions.

  • Store contact matrices efficiently in cool format based on the widely used HDF5 container format.

  • Perform out-of-core genome-wide contact matrix normalization (a.k.a. balancing).

  • Perform fast range queries on a contact matrix.

  • Convert contact matrices between formats.

  • Provide a clean and well-documented Python API to work with Hi-C data.

To get started:

  • Documentation is available here.

  • Walkthrough with a Jupyter notebook.

  • cool files from published Hi-C data sets are available at ftp://cooler.csail.mit.edu/coolers.

Installation

Requirements:

  • Python 2.7/3.4+

  • libhdf5 and Python packages numpy, scipy, pandas, h5py. We highly recommend using the conda package manager to install scientific packages like these. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

Install from PyPI using pip.

$ pip install cooler

See the docs for more information.

Command line interface

The cooler library includes utilities for creating and querying cool files and for performing contact matrix balancing on a cool file of any resolution.

$ cooler makebins $CHROMSIZES_FILE $BINSIZE > bins.10kb.bed
$ cooler cload bins.10kb.bed $CONTACTS_FILE out.cool
$ cooler balance -p 10 out.cool
$ cooler dump -b -t pixels --header --join -r chr3:10,000,000-12,000,000 -r2 chr17 out.cool | head
chrom1  start1  end1    chrom2  start2  end2    count   balanced
chr3    10000000        10010000        chr17   0       10000   1       0.810766
chr3    10000000        10010000        chr17   520000  530000  1       1.2055
chr3    10000000        10010000        chr17   640000  650000  1       0.587372
chr3    10000000        10010000        chr17   900000  910000  1       1.02558
chr3    10000000        10010000        chr17   1030000 1040000 1       0.718195
chr3    10000000        10010000        chr17   1320000 1330000 1       0.803212
chr3    10000000        10010000        chr17   1500000 1510000 1       0.925146
chr3    10000000        10010000        chr17   1750000 1760000 1       0.950326
chr3    10000000        10010000        chr17   1800000 1810000 1       0.745982
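
As a rough illustration of what the makebins and cload steps above do conceptually, the sketch below tiles chromosomes into fixed-width bins and aggregates a contact list into upper-triangle pixel counts. It is plain Python and not cooler's implementation; the helper names make_bins and bin_contacts, the tuple layout of a contact, and the toy chromosome sizes are all assumptions made for this example.

```python
from collections import Counter

def make_bins(chromsizes, binsize):
    """Tile each chromosome with fixed-width bins; the last bin may be shorter."""
    bins = []
    for chrom, length in chromsizes:
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins

def bin_contacts(contacts, chromsizes, binsize):
    """Count contacts per (bin1_id, bin2_id) pair, keeping bin1_id <= bin2_id."""
    # Genome-wide index of the first bin of each chromosome.
    offsets, n_bins = {}, 0
    for chrom, length in chromsizes:
        offsets[chrom] = n_bins
        n_bins += -(-length // binsize)  # ceiling division
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in contacts:
        i = offsets[chrom1] + pos1 // binsize
        j = offsets[chrom2] + pos2 // binsize
        # Store each pair once, in the upper triangle.
        counts[(min(i, j), max(i, j))] += 1
    return sorted(counts.items())

# Toy inputs (hypothetical, not real data).
chromsizes = [("chrA", 25000), ("chrB", 12000)]
contacts = [
    ("chrA", 500, "chrA", 15000),
    ("chrB", 100, "chrA", 700),
    ("chrA", 12000, "chrA", 900),
]
bins = make_bins(chromsizes, 10000)
pixels = bin_contacts(contacts, chromsizes, 10000)
```

Sorting the pixels by (bin1_id, bin2_id) keeps all entries for one matrix row contiguous, which is what makes range queries cheap.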

Python API

The cooler library provides a thin wrapper over the excellent h5py Python interface to HDF5. It supports creation of cool files and the following types of range queries on the data:

  • Tabular selections are retrieved as Pandas DataFrames and Series.

  • Matrix selections are retrieved as NumPy arrays or SciPy sparse matrices.

  • Metadata is retrieved as a json-serializable Python dictionary.

  • Range queries can be supplied using either integer bin indexes or genomic coordinate intervals.

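>>> import cooler
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> # Open a cool file and read its bin size from the metadata
>>> c = cooler.Cooler('bigDataset.cool')
>>> resolution = c.info['bin-size']
>>> # Fetch a balanced submatrix for a genomic region and plot it on a log scale
>>> mat = c.matrix(balance=True).fetch('chr5:10,000,000-15,000,000')
>>> plt.matshow(np.log10(mat), cmap='YlOrRd')
>>> # Compute balancing weights, distributing the work over 8 processes
>>> import multiprocessing as mp
>>> import h5py
>>> pool = mp.Pool(8)
>>> f = h5py.File('bigDataset.cool', 'r')
>>> weights, stats = cooler.ice.iterative_correction(f, map=pool.map, ignore_diags=3, min_nnz=10)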
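
To give a feel for what balancing computes, here is a minimal in-memory sketch of iterative correction on a small dense symmetric matrix. It is not cooler's out-of-core implementation: the function name balance_dense and its parameters are hypothetical, it assumes every row has a nonzero sum, and it omits filtering such as ignore_diags and min_nnz. The returned weights are multiplicative, so a balanced value is count * w[i] * w[j].

```python
def row_sums(matrix, bias):
    """Row sums of the matrix after scaling entry (i, j) by bias[i] * bias[j]."""
    n = len(matrix)
    return [
        bias[i] * sum(matrix[i][j] * bias[j] for j in range(n))
        for i in range(n)
    ]

def balance_dense(matrix, tol=1e-12, max_iters=200):
    """Find weights w such that matrix[i][j] * w[i] * w[j] has row sums near 1."""
    n = len(matrix)
    bias = [1.0] * n
    for _ in range(max_iters):
        marg = row_sums(matrix, bias)
        mean = sum(marg) / n
        # Variance of the mean-normalized marginals measures how unbalanced we are.
        var = sum((m / mean - 1.0) ** 2 for m in marg) / n
        if var < tol:
            break
        # Divide each weight by its normalized marginal.
        bias = [b * mean / m for b, m in zip(bias, marg)]
    # Rescale so the balanced row sums are approximately 1.
    marg = row_sums(matrix, bias)
    scale = (sum(marg) / n) ** 0.5
    return [b / scale for b in bias]

# Toy symmetric contact matrix (hypothetical values).
counts = [
    [10, 4, 2],
    [4, 8, 3],
    [2, 3, 6],
]
w = balance_dense(counts)
balanced = [[counts[i][j] * w[i] * w[j] for j in range(3)] for i in range(3)]
```

Because the correction is a single weight per bin, the weights can be stored as one extra column alongside the bins rather than as a second matrix.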

Schema

The cool format implements a simple schema that stores a contact matrix in a sparse representation. Sparse storage is crucial for developing robust tools, including streaming and out-of-core algorithms, for increasingly high-resolution Hi-C data sets.

The data tables in a cool file are stored in a columnar representation as HDF5 groups of 1D array datasets of equal length. The contact matrix itself is stored as a single table containing only the nonzero upper triangle pixels.
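
The layout described above can be mimicked in plain Python to see how a range query works against it. This is only a sketch under stated assumptions: the tables are dicts of equal-length lists rather than HDF5 datasets, the pixels are assumed to be sorted by (bin1_id, bin2_id), and the function fetch_dense and the toy values are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Columnar tables: each column is a 1D sequence, all of equal length.
bins = {
    "chrom": ["chrA", "chrA", "chrA", "chrB"],
    "start": [0, 10000, 20000, 0],
    "end": [10000, 20000, 25000, 10000],
}
# Nonzero upper-triangle pixels (bin1_id <= bin2_id), sorted by (bin1_id, bin2_id).
pixels = {
    "bin1_id": [0, 0, 1, 2, 2],
    "bin2_id": [0, 2, 1, 2, 3],
    "count": [5, 1, 7, 3, 2],
}

def fetch_dense(pixels, lo, hi):
    """Return the dense symmetric submatrix covering bins lo..hi-1."""
    size = hi - lo
    out = [[0] * size for _ in range(size)]
    # bin1_id is sorted, so candidate pixels form one contiguous slice.
    first = bisect_left(pixels["bin1_id"], lo)
    last = bisect_right(pixels["bin1_id"], hi - 1)
    for k in range(first, last):
        i, j = pixels["bin1_id"][k], pixels["bin2_id"][k]
        if j < hi:
            out[i - lo][j - lo] = pixels["count"][k]
            out[j - lo][i - lo] = pixels["count"][k]  # mirror into the lower triangle
    return out

sub = fetch_dense(pixels, 0, 3)
tail = fetch_dense(pixels, 2, 4)
```

Storing only the upper triangle roughly halves the pixel table for a symmetric matrix, at the cost of mirroring entries when a dense block is requested.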

Contributing

Pull requests are welcome. The current requirements for testing are nose and mock.

For development, clone and install in “editable” (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .

License

BSD (New)
