cooler

Sparse binary format for Hi-C genomic contact heatmaps

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Operating System
- OS Independent
Programming Language

Project description

# Cooler

[![Build Status](https://travis-ci.org/mirnylab/cooler.svg?branch=master)](https://travis-ci.org/mirnylab/cooler)
[![Documentation Status](https://readthedocs.org/projects/cooler/badge/?version=latest)](http://cooler.readthedocs.org/en/latest/)

## A cool place to store your Hi-C

Cooler is a **sparse, compressed, binary** persistent storage format for Hi-C contact maps based on HDF5.

See [example Jupyter notebook](https://gist.github.com/nvictus/904160bca9d0e8d5aeeb).

The `cooler` library implements a simple **schema** to store a high-resolution contact matrix along with important auxiliary data such as scaffold information, genomic bin annotations, and basic metadata.

Data tables are stored in a **columnar** representation as groups of 1D HDF5 array datasets of the same length. The contact matrix itself is stored as a table containing only the **nonzero upper triangle** pixels.
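As a rough illustration of the layout described above, the following pure-Python sketch converts a small symmetric dense matrix into three equal-length columns holding only the nonzero upper-triangle pixels. The function name `to_pixel_table` and the toy matrix are hypothetical, and the sketch assumes the diagonal counts as part of the upper triangle; it is not the library's own implementation.

```python
def to_pixel_table(dense):
    """Convert a symmetric dense matrix (list of lists) to columnar pixels.

    Only nonzero entries with column >= row are kept (assumption: the
    diagonal is treated as part of the upper triangle).
    """
    table = {"bin1_id": [], "bin2_id": [], "count": []}
    n = len(dense)
    for i in range(n):
        for j in range(i, n):  # upper triangle, diagonal included
            if dense[i][j] != 0:
                table["bin1_id"].append(i)
                table["bin2_id"].append(j)
                table["count"].append(dense[i][j])
    return table


# Toy symmetric contact matrix with three bins (made-up values).
dense = [
    [4, 0, 1],
    [0, 2, 0],
    [1, 0, 3],
]
pixels = to_pixel_table(dense)
print(pixels)
```

Because rows are visited in order and columns in order within each row, the resulting table is already sorted by `bin1_id`, then `bin2_id`.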

The library API provides a thin Python wrapper over [h5py](http://docs.h5py.org/en/latest/) for **range queries** on the data:
- Table selections are retrieved as Pandas `DataFrame`s.
- Matrix slice selections are retrieved as SciPy sparse matrices or NumPy `ndarray`s.
- The metadata is retrieved as a JSON-serializable dictionary.

Range queries can be supplied as either integer bin indexes or genomic coordinate intervals.
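To make the relationship between the two query styles concrete, here is a minimal sketch of how a genomic interval could be translated into a range of bin indexes. It assumes fixed-size bins and a `chrom_offset` list giving the index of the first bin of each chromosome; the helper `region_to_bin_range` is hypothetical and not part of the `cooler` API.

```python
def region_to_bin_range(chrom_offset, chrom_id, start, end, bin_size):
    """Return the half-open bin index range [lo, hi) overlapping [start, end).

    Assumes fixed-size bins and 0-based, half-open genomic coordinates.
    The end is not clamped to the chromosome length in this sketch.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    first = chrom_offset[chrom_id]  # index of the chromosome's first bin
    lo = first + start // bin_size
    hi = first + (end + bin_size - 1) // bin_size  # ceiling division
    return lo, hi


# Hypothetical layout: chromosome 0 has 5 bins, so chromosome 1 starts at bin 5.
chrom_offset = [0, 5]
lo, hi = region_to_bin_range(chrom_offset, 1, 1500, 3200, bin_size=1000)
print(lo, hi)
```

With variable-size bins the same idea would need a search over the `start` and `end` columns of the bins table rather than integer division.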

### Installation

Requirements:

- Python 2.7/3.3+
- libhdf5 and Python packages `numpy`, `scipy`, `pandas`, `h5py`. If you don't have them installed already, we recommend you use the [conda](http://conda.pydata.org/miniconda.html) package manager to manage these dependencies instead of pip.

Install from PyPI using pip.
```sh
$ pip install cooler
```

For development, clone and install in "editable" (i.e. development) mode with the `-e` option.
```sh
$ git clone https://github.com/mirnylab/cooler.git
$ cd cooler
$ pip install -e .
```

### Schema

Required attributes (metadata):
```
genome-assembly : <string> Name of genome assembly
bin-type : {"fixed" or "variable"}
bin-size : <int or null> Size of bins in bp if bin-type is "fixed"
nchroms : <int> Number of rows in scaffolds table
nbins : <int> Number of rows in bins table
nnz : <int> Number of rows in matrix table
format-url : <url> URL to page providing format details
format-version : <string> The version of the current format
generated-by : <string> Agent that created the file
creation-date : <datetime> Date the file was built
metadata : <json> custom metadata about the experiment
```
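The attribute list above maps naturally onto a JSON-serializable dictionary. The sketch below shows one possible instance and a small check that all required keys are present. Every value is a made-up placeholder (including the URL, version string, and counts), and `missing_keys` is a hypothetical helper rather than a library function.

```python
import json

# Required attribute names, copied from the list above.
REQUIRED_KEYS = (
    "genome-assembly", "bin-type", "bin-size", "nchroms", "nbins", "nnz",
    "format-url", "format-version", "generated-by", "creation-date", "metadata",
)


def missing_keys(attrs):
    """Return the required attribute names that are absent from attrs."""
    return [key for key in REQUIRED_KEYS if key not in attrs]


# Placeholder values for illustration only.
example_attrs = {
    "genome-assembly": "hg19",
    "bin-type": "fixed",
    "bin-size": 1000,
    "nchroms": 2,
    "nbins": 4,
    "nnz": 4,
    "format-url": "https://example.org/format",  # placeholder URL
    "format-version": "0.0",                     # placeholder version
    "generated-by": "example-script",
    "creation-date": "2016-02-18T00:00:00",
    "metadata": {"cell-type": "example"},
}

print(missing_keys(example_attrs))
print(json.dumps(example_attrs, indent=2))
```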

The required tables and indexes can be represented in the [Datashape](http://datashape.readthedocs.org/en/latest/) layout language:
```
{
  scaffolds: {
    name: typevar['Nchroms'] * string[32, 'A'],
    length: typevar['Nchroms'] * int64
  },
  bins: {
    chrom_id: typevar['Nbins'] * int32,
    start: typevar['Nbins'] * int64,
    end: typevar['Nbins'] * int64
  },
  matrix: {
    bin1_id: typevar['Nnz'] * int32,
    bin2_id: typevar['Nnz'] * int32,
    count: typevar['Nnz'] * int32
  },
  indexes: {
    chrom_offset: typevar['Nchroms'] * int32,
    bin1_offset: typevar['Nbins'] * int32
  }
}
```

Notes:
- Any number of additional optional columns can be added to each table (e.g. quality masks, normalization vectors).
- Genomic coordinates are assumed to be 0-based and intervals half-open: `start` is inclusive and `end` is exclusive (equivalently, a 1-based end).
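The coordinate convention in the notes can be illustrated by building a small fixed-size `bins` table in columnar form. This is a sketch under assumptions: `make_bins` is a hypothetical helper, the chromosome lengths are made up, and the last bin of each chromosome is assumed to be truncated at the chromosome end.

```python
def make_bins(chrom_lengths, bin_size):
    """Build columnar bins with 0-based, half-open [start, end) intervals.

    Assumption: the final bin of each chromosome is truncated at its length.
    """
    bins = {"chrom_id": [], "start": [], "end": []}
    for chrom_id, length in enumerate(chrom_lengths):
        for start in range(0, length, bin_size):
            bins["chrom_id"].append(chrom_id)
            bins["start"].append(start)
            bins["end"].append(min(start + bin_size, length))
    return bins


# Two hypothetical chromosomes of 2500 bp and 1000 bp, binned at 1000 bp.
bins = make_bins([2500, 1000], bin_size=1000)
print(bins)
```

Each bin's `end` equals the next bin's `start` within a chromosome, so adjacent bins neither overlap nor leave gaps.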

Matrix storage format:
- The `bins` table is lexicographically sorted by `chrom_id`, `start`, `end`.
- The `matrix` table is lexicographically sorted by `bin1_id`, then `bin2_id`.
- Offset pointers are used to facilitate matrix queries. This is effectively a [compressed sparse row](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) storage scheme for a symmetric matrix.
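The offset-pointer idea can be sketched in pure Python as follows. The sketch assumes `bin1_offset[i]` holds the index of the first pixel whose `bin1_id` is at least `i`, and that the last row ends at `nnz`, since the schema gives `bin1_offset` a length of `Nbins`. The helper names are hypothetical and this is not the library's own query code.

```python
from bisect import bisect_left


def build_bin1_offset(bin1_id, nbins):
    """offset[i] = index of the first pixel whose bin1_id is >= i."""
    return [bisect_left(bin1_id, i) for i in range(nbins)]


def row_slice(bin1_offset, nnz, i):
    """Return the [lo, hi) range of pixel rows belonging to matrix row i."""
    lo = bin1_offset[i]
    hi = bin1_offset[i + 1] if i + 1 < len(bin1_offset) else nnz
    return lo, hi


def get_count(pixels, bin1_offset, i, j):
    """Look up element (i, j) of the symmetric matrix; 0 if not stored."""
    if i > j:  # only the upper triangle is stored, so mirror the query
        i, j = j, i
    lo, hi = row_slice(bin1_offset, len(pixels["count"]), i)
    # bin2_id is sorted within each row, so a binary search suffices.
    k = bisect_left(pixels["bin2_id"], j, lo, hi)
    if k < hi and pixels["bin2_id"][k] == j:
        return pixels["count"][k]
    return 0


# Toy pixel table sorted by bin1_id, then bin2_id (made-up values).
pixels = {"bin1_id": [0, 0, 1, 2], "bin2_id": [0, 2, 1, 2], "count": [4, 1, 2, 3]}
bin1_offset = build_bin1_offset(pixels["bin1_id"], nbins=3)
print(bin1_offset)
print(get_count(pixels, bin1_offset, 2, 0))
```

Rows with no stored pixels simply get an empty slice, because consecutive offsets are then equal.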

Rather than build on top of a more full-featured, opinionated library like PyTables (or `pandas.HDFStore` built on top of that), we provide a simple and transparent data layout on top of HDF5 that supports random access range queries and can be easily [migrated](https://github.com/blaze/odo).

See also:
- [hdf2tab](https://github.com/blajoie/hdf2tab) converts dense Hi-C matrices stored in HDF5 files to tabular text files.
- The [biom](https://github.com/biocore/biom-format) format is an HDF5-based format for metagenomic observation matrices.


Release history

0.10.2

Jun 17, 2024

0.10.1

Jun 17, 2024

0.10.0

May 21, 2024

0.9.3

Sep 11, 2023

0.9.2

Jun 1, 2023

0.9.1

Jan 23, 2023

0.9.0

Jan 19, 2023

0.8.11

Apr 1, 2021

0.8.10

Sep 25, 2020

0.8.9

Jul 18, 2020

0.8.8

Jun 24, 2020

0.8.7

Jan 13, 2020

0.8.6.post0

Aug 13, 2019

0.8.6

Aug 13, 2019

0.8.5

Apr 8, 2019

0.8.4

Apr 5, 2019

0.8.3

Feb 11, 2019

0.8.2

Jan 20, 2019

0.8.1

Jan 3, 2019

0.8.0

Dec 31, 2018

0.7.11

Aug 17, 2018

0.7.10

May 7, 2018

0.7.9

Mar 30, 2018

0.7.8

Mar 18, 2018

0.7.7

Mar 16, 2018

0.7.6

Oct 31, 2017

0.7.5

Jul 13, 2017

0.7.4

May 25, 2017

0.7.3

May 23, 2017

0.7.2

May 9, 2017

0.7.1

Apr 29, 2017

0.7.0

Apr 27, 2017

0.6.6

Mar 22, 2017

0.6.5

Mar 18, 2017

0.6.4

Mar 17, 2017

0.6.3

Feb 22, 2017

0.6.2

Feb 12, 2017

0.6.1

Feb 6, 2017

0.6.0

Feb 4, 2017

0.5.3

Sep 11, 2016

0.5.2

Aug 26, 2016

0.5.1

Aug 24, 2016

0.5.0

Aug 24, 2016

0.4.1

Aug 24, 2016

0.4.0

Aug 19, 2016

This version

0.3.0

Feb 18, 2016

0.2.1

Feb 7, 2016

0.2

Jan 18, 2016

Download files


Source Distribution

cooler-0.3.0.tar.gz (3.4 MB)

Uploaded Feb 18, 2016 Source

Built Distribution

cooler-0.3.0-py2.py3-none-any.whl (17.9 kB)

Uploaded Feb 18, 2016 Python 2 Python 3

File details

Details for the file cooler-0.3.0.tar.gz.

File metadata

Download URL: cooler-0.3.0.tar.gz
Upload date: Feb 18, 2016
Size: 3.4 MB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for cooler-0.3.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `957d2a0201d0f66d7d0f2e1a7d44210212060dcd438cab86b6960a2aa9210374` |
| MD5 | `ccf1bb4e8bf11a14dfa768ee9c913812` |
| BLAKE2b-256 | `3dafd6c0fce01968dffb5fac99cdce5acf54f1b8590de0c5decc4e160a132a31` |


File details

Details for the file cooler-0.3.0-py2.py3-none-any.whl.

File metadata

Download URL: cooler-0.3.0-py2.py3-none-any.whl
Upload date: Feb 18, 2016
Size: 17.9 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for cooler-0.3.0-py2.py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `0a87850b358db1cc47913b88f9d4b05f3a4382e673f78efee4c6b486709d7976` |
| MD5 | `4293e48d92deb6d0533d4ec86cc79811` |
| BLAKE2b-256 | `9e340bf9ae99d7fe3e72d2087f16fe8b5a9f0f36f2f47ac899f9c98309d49232` |