CLI tools to process mapped Hi-C data

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Operating System
- OS Independent
Programming Language

Project description

pairtools

Process Hi-C pairs with pairtools

pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairtools process pair-end sequence alignments and perform the following operations:

detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules
sort .pairs files for downstream analyses
detect, tag and remove PCR/optical duplicates
generate extensive statistics of Hi-C datasets
select Hi-C pairs given flexibly defined criteria
restore .sam alignments from Hi-C pairs

To get started:

Take a look at a quick example
Check out the detailed documentation.

Data formats

pairtools produce and operate on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools properly manage file headers and keep track of the data processing history.

Additionally, pairtools define the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format, and can be processed by any tool that operates on .pairs files.

Installation

Requirements:

Python 3.x
Python packages cython, numpy and click.
Command-line utilities sort (the Unix version), bgzip (shipped with tabix) and samtools. If available, pairtools can compress outputs with pbgzip and lz4.

We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

With conda, you can install pairtools and all of its dependencies from the bioconda channel.

$ conda install -c conda-forge -c bioconda pairtools

Alternatively, install pairtools and only Python dependencies from PyPI using pip:

$ pip install pairtools

Quick example

Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:

$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/mirnylab/distiller-test-data/raw/master/bam/MATalpha_R1.bam

Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:

$ wget https://raw.githubusercontent.com/mirnylab/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes

With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:

$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam

Inspect the resulting table:

$ less MATalpha_R1.pairs.gz

Pipelines

We provide a simple working example of a mapping bash pipeline in /examples/.
distiller is a powerful Hi-C data analysis workflow, based on pairtools and nextflow.

Tools

parse: read .sam files produced by bwa and form Hi-C pairs
- form Hi-C pairs by reporting the outer-most mapped positions and the strand on the either side of each molecule;
- report unmapped/multimapped (ambiguous alignments)/chimeric alignments as chromosome "!", position 0, strand "-";
- identify and rescue chrimeric alignments produced by singly-ligated Hi-C molecules with a sequenced ligation junction on one of the sides;
- perform upper-triangular flipping of the sides of Hi-C molecules such that the first side has a lower sorting index than the second side;
- form hybrid pairsam output, where each line contains all available data for one Hi-C molecule (outer-most mapped positions on the either side, read ID, pair type, and .sam entries for each alignment);
- print the .sam header as #-comment lines at the start of the file.
sort: sort pairs files (the lexicographic order for chromosomes, the numeric order for the positions, the lexicographic order for pair types).
merge: merge sorted .pairs files
- merge sort .pairs;
- combine the .pairs headers from all input files;
- check that each .pairs file was mapped to the same reference genome index (by checking the identity of the @SQ sam header lines).
select: select pairs according to specified criteria
- select pairs entries according to the provided condition. A programmable interface allows for arbitrarily complex queries on specific pair types, chromosomes, positions, strands, read IDs (including matches to a wildcard/regexp/list).
- optionally print the non-matching entries into a separate file.
dedup: remove PCR duplicates from a sorted triu-flipped .pairs file
- remove PCR duplicates by finding pairs of entries with both sides mapped to similar genomic locations (+/- N bp);
- optionally output the PCR duplicate entries into a separate file.
- NOTE: in order to remove all PCR duplicates, the input must contain *all* mapped read pairs from a single experimental replicate;
maskasdup: mark all pairs in a pairsam as Hi-C duplicates
- change the field pair_type to DD;
- change the pair_type tag (Yt:Z:) for all sam alignments;
- set the PCR duplicate binary flag for all sam alignments (0x400).
split: split a .pairsam file into .pairs and .sam.
stats: calculate various statistics of .pairs files
restrict: identify the span of the restriction fragment forming a Hi-C junction

Contributing

Pull requests are welcome.

For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

$ git clone https://github.com/mirnylab/pairtools.git
$ cd pairtools
$ pip install -e .

License

MIT

Project details

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Operating System
- OS Independent
Programming Language

Release history Release notifications | RSS feed

1.1.0

Apr 23, 2024

1.0.3

Nov 20, 2023

1.0.2

Nov 8, 2022

1.0.1

Sep 7, 2022

1.0.0

Aug 24, 2022

This version

0.3.0

Jul 13, 2019

0.2.2

Jan 7, 2019

0.2.1

Dec 21, 2018

0.2.0

Aug 3, 2018

0.1.1

Jul 19, 2018

0.1.0

Jul 19, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pairtools-0.3.0.tar.gz (572.0 kB view details)

Uploaded Jul 13, 2019 Source

File details

Details for the file pairtools-0.3.0.tar.gz.

File metadata

Download URL: pairtools-0.3.0.tar.gz
Upload date: Jul 13, 2019
Size: 572.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for pairtools-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`4931da4b5fcb327d13c9564e6fdda1acf2335e16213a55d2620fe340ae918b53`
MD5	`acff3d5d2302bfd0d616867e3bee71d1`
BLAKE2b-256	`35e58efba77e40d770c936884bc5def89c8d211c27932246de5bb7ea005ac33f`