Sparse binary format for genomic interaction matrices

Cooler

A cool place to store your Hi-C

Cooler is a support library for a sparse, compressed, binary persistent storage format, called cool, used to store genomic interaction data, such as Hi-C contact matrices.

The cooler file format is a reference implementation of a genomic matrix data model using HDF5 as the container format.

The cooler package aims to provide the following functionality:

  • Build contact matrices at any resolution from a list of contacts.
  • Query a contact matrix.
  • Export and visualize the data.
  • Perform efficient out-of-core operations, such as aggregation and contact matrix normalization (a.k.a. balancing).
  • Provide a clean and well-documented Python API to facilitate working with potentially larger-than-memory data.

To get started:

  • Read the documentation.
  • See the Jupyter Notebook walkthrough.
  • cool files from published Hi-C data sets are available at ftp://cooler.csail.mit.edu/coolers.
  • Many more multires (mcool) files are available on the 4DN data portal.

Installation

Requirements:

  • Python 2.7/3.4+
  • libhdf5 and Python packages numpy, scipy, pandas, h5py. We highly recommend using the conda package manager to install scientific packages like these. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

Install from PyPI using pip.

$ pip install cooler

If you are using conda, you can alternatively install cooler from the bioconda channel.

$ conda install -c conda-forge -c bioconda cooler

See the docs for more information.

Command line interface

The cooler package includes command line tools for creating, querying and manipulating cool files.

$ cooler cload hg19.chrom.sizes:10000 $CONTACTS_FILE out.10000.cool
$ cooler balance -p 10 out.10000.cool
$ cooler dump -b -t pixels --header --join -r chr3:10M-12M -r2 chr17 out.10000.cool | head
chrom1  start1  end1    chrom2  start2  end2    count   balanced
chr3    10000000        10010000        chr17   0       10000   1       0.810766
chr3    10000000        10010000        chr17   520000  530000  1       1.2055
chr3    10000000        10010000        chr17   640000  650000  1       0.587372
chr3    10000000        10010000        chr17   900000  910000  1       1.02558
chr3    10000000        10010000        chr17   1030000 1040000 1       0.718195
chr3    10000000        10010000        chr17   1320000 1330000 1       0.803212
chr3    10000000        10010000        chr17   1500000 1510000 1       0.925146
chr3    10000000        10010000        chr17   1750000 1760000 1       0.950326
chr3    10000000        10010000        chr17   1800000 1810000 1       0.745982
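
To illustrate what building a contact matrix from a list of contacts involves, here is a minimal sketch in plain Python. It is not how cooler cload is implemented; the helper names (bin_offsets, bin_contacts), the toy chromosome sizes and the toy contacts are all assumptions made for this illustration.

```python
from collections import Counter


def bin_offsets(chromsizes, binsize):
    """Return the first genome-wide bin ID of each chromosome (hypothetical helper)."""
    offsets, total = {}, 0
    for chrom, length in chromsizes.items():
        offsets[chrom] = total
        total += -(-length // binsize)  # ceiling division: the last bin may be short
    return offsets


def bin_contacts(contacts, chromsizes, binsize):
    """Count contacts per (bin1, bin2) pair, keeping bin1 <= bin2 (hypothetical helper)."""
    offsets = bin_offsets(chromsizes, binsize)
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in contacts:
        b1 = offsets[chrom1] + pos1 // binsize
        b2 = offsets[chrom2] + pos2 // binsize
        # Store only the upper triangle of the symmetric matrix
        counts[(min(b1, b2), max(b1, b2))] += 1
    return counts


# Toy input, invented for this example
chromsizes = {"chrA": 25000, "chrB": 12000}
contacts = [
    ("chrA", 1200, "chrA", 14000),
    ("chrA", 13500, "chrA", 900),
    ("chrB", 500, "chrA", 24000),
]
pixels = bin_contacts(contacts, chromsizes, binsize=10000)
```

Each key of pixels is a pair of genome-wide bin IDs and each value is the number of contacts that fell into that pair of bins.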

Python API

The cooler library provides a thin wrapper over the excellent h5py Python interface to HDF5. It supports creation of cooler files and the following types of range queries on the data:

  • Tabular selections are retrieved as Pandas DataFrames and Series.
  • Matrix selections are retrieved as NumPy arrays or SciPy sparse matrices.
  • Metadata is retrieved as a json-serializable Python dictionary.
  • Range queries can be supplied using either integer bin indexes or genomic coordinate intervals. Note that queries with coordinate intervals that are not multiples of the bin size will return the shortest range of bins that fully contains the half-open interval [start, end).

>>> import cooler
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> # Open a cool file and read its bin size (resolution)
>>> c = cooler.Cooler('bigDataset.cool')
>>> resolution = c.binsize
>>> # Fetch the balanced matrix for a genomic region and plot it on a log scale
>>> mat = c.matrix(balance=True).fetch('chr5:10,000,000-15,000,000')
>>> plt.matshow(np.log10(mat), cmap='YlOrRd')

>>> import multiprocessing as mp
>>> # Balance the matrix in parallel using a pool of 8 worker processes
>>> pool = mp.Pool(8)
>>> c = cooler.Cooler('bigDataset.cool')
>>> weights, stats = cooler.balance_cooler(c, map=pool.map, ignore_diags=3, min_nnz=10)
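
The note above about coordinate intervals that are not multiples of the bin size can be made concrete with a little arithmetic. The sketch below is only an illustration of that rule, not cooler's own code; covering_bins is a hypothetical helper, and the bin indexes are relative to the start of a single chromosome.

```python
def covering_bins(start, end, binsize):
    """Return (lo, hi) so that bins lo..hi-1 cover the half-open interval [start, end)."""
    lo = start // binsize      # round the start down to a bin boundary
    hi = -(-end // binsize)    # round the end up (ceiling division)
    return lo, hi


# A query that does not align with the 10 kb bin boundaries
lo, hi = covering_bins(10004000, 10021000, binsize=10000)
covered_start, covered_end = lo * 10000, hi * 10000
```

Here the query is widened to whole bins, so the returned bins cover slightly more than the requested coordinates.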

Schema

The cool format implements a simple data model that stores a genomic matrix in a sparse representation, crucial for developing robust tools for use on increasingly high resolution Hi-C data sets, including streaming and out-of-core algorithms.

The data tables in a cooler file are stored in a columnar representation as HDF5 groups of 1D array datasets of equal length. A symmetric contact matrix is represented as a single table containing only the nonzero upper triangle pixels.
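
As a rough illustration of this representation, the following NumPy sketch converts a small symmetric matrix into three equal-length 1D columns holding only the nonzero upper triangle pixels, then rebuilds the full matrix from them. The toy matrix and the column names (bin1_id, bin2_id, count) are assumptions chosen for the example, and nothing here reads or writes HDF5.

```python
import numpy as np

# Toy symmetric contact matrix, invented for this example
dense = np.array([
    [0, 2, 0, 1],
    [2, 3, 0, 0],
    [0, 0, 0, 4],
    [1, 0, 4, 0],
])

# Keep the upper triangle (including the diagonal) and list its nonzero pixels
upper = np.triu(dense)
bin1_id, bin2_id = np.nonzero(upper)   # row and column index of each pixel
count = upper[bin1_id, bin2_id]        # value of each pixel

# The three columns are enough to restore the full symmetric matrix
restored = np.zeros_like(dense)
restored[bin1_id, bin2_id] = count
restored[bin2_id, bin1_id] = count
```

Storing one column per field keeps every column a plain 1D array of the same length, which is the layout described above.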

Contributing

Pull requests are welcome. The current requirements for testing are pytest and mock.

For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .

License

BSD (New)
