vcfnp

Load numpy arrays from a VCF (variant call file).

Load data from a VCF (variant call file) into numpy arrays or an HDF5 file.

Installation

Installation requires numpy and cython:

$ pip install vcfnp

…or:

$ git clone --recursive git://github.com/alimanfoo/vcfnp.git
$ cd vcfnp
$ python setup.py build_ext --inplace

Usage

From Python:

import sys
import vcfnp
import numpy as np
import matplotlib.pyplot as plt

filename = '/path/to/my.vcf'

# load data from fixed fields (including INFO)
V = vcfnp.variants(filename, cache=True).view(np.recarray)

# print some simple variant metrics
print('found %s variants (%s SNPs)' % (V.size, np.count_nonzero(V.is_snp)))
print('QUAL mean (std): %s (%s)' % (np.mean(V.QUAL), np.std(V.QUAL)))

# plot a histogram of variant depth
fig = plt.figure(1)
ax = fig.add_subplot(111)
ax.hist(V.DP)
ax.set_title('DP histogram')
ax.set_xlabel('DP')
plt.show()

# load data from sample columns
C2d = vcfnp.calldata_2d(filename, cache=True).view(np.recarray)

# print some simple genotype metrics
count_phased = np.count_nonzero(C2d.is_phased)
count_variant = np.count_nonzero(np.any(C2d.genotype > 0, axis=2))
count_missing = np.count_nonzero(~C2d.is_called)
print('calls (phased, variant, missing): %s (%s, %s, %s)'
      % (C2d.flatten().size, count_phased, count_variant, count_missing))

# plot a histogram of genotype quality
fig = plt.figure(2)
ax = fig.add_subplot(111)
ax.hist(C2d.GQ.flatten())
ax.set_title('GQ histogram')
ax.set_xlabel('GQ')
plt.show()
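
To see what the genotype metrics above compute without needing a VCF file, here is a minimal sketch on a tiny synthetic array. The dtype is hypothetical: it only mimics a few of the field names used above (is_called, is_phased, genotype, GQ), and the real array returned by calldata_2d() may be laid out differently.

```python
import numpy as np

# Hypothetical dtype mimicking a few calldata_2d fields (an assumption,
# not the dtype vcfnp actually produces).
dtype = [('is_called', 'b1'), ('is_phased', 'b1'),
         ('genotype', 'i1', (2,)), ('GQ', 'u1')]

# 2 variants x 3 samples, diploid; -1 marks a missing allele call.
C2d = np.zeros((2, 3), dtype=dtype).view(np.recarray)
C2d.genotype[:] = [[[0, 0], [0, 1], [-1, -1]],
                   [[1, 1], [0, 0], [0, 1]]]
C2d.is_called[:] = np.all(C2d.genotype >= 0, axis=2)
C2d.is_phased[0, 1] = True

# Same expressions as in the usage example above.
count_phased = np.count_nonzero(C2d.is_phased)
# A call counts as variant if any of its alleles is non-reference (> 0).
count_variant = np.count_nonzero(np.any(C2d.genotype > 0, axis=2))
count_missing = np.count_nonzero(~C2d.is_called)
print('calls (phased, variant, missing): %s (%s, %s, %s)'
      % (C2d.size, count_phased, count_variant, count_missing))
```

The genotype field has shape (variants, samples, ploidy), which is why the reduction runs over axis=2.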

Command-line scripts are also provided to help parallelize the conversion of a VCF file to NPY arrays split by genome region. For example, the following command creates an NPY file containing a variants array for the second 100kb on chromosome 20:

$ vcf2npy \
    --vcf /path/to/my.vcf \
    --fasta /path/to/ref.fa \
    --output-dir /path/to/npy/output \
    --array-type variants \
    --chromosome chr20 \
    --task-size 100000 \
    --task-index 2 \
    --progress 1000
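
As a rough illustration of how a task index could map to a genome region, the sketch below assumes 1-based task indices (as in Sun Grid Engine job arrays) and 1-based inclusive coordinates (as in VCF positions). The function name task_region is hypothetical, and the script itself may compute its regions differently.

```python
def task_region(task_index, task_size):
    """Return the (start, end) region for a task.

    Assumes task_index is 1-based and coordinates are 1-based inclusive;
    both are assumptions, not confirmed behaviour of vcf2npy.
    """
    if task_index < 1 or task_size < 1:
        raise ValueError('task_index and task_size must be >= 1')
    start = (task_index - 1) * task_size + 1
    end = task_index * task_size
    return start, end


# The second 100kb region, matching --task-size 100000 --task-index 2.
print(task_region(2, 100000))
```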

For those with access to a cluster running Sun Grid Engine, a script is provided to submit a job array parallelizing the conversion, e.g.:

$ qsub_vcf2npy \
    --vcf /path/to/my.vcf \
    --fasta /path/to/ref.fa \
    --output-dir /path/to/npy/output \
    --array-type variants \
    --chromosome chr20 \
    --task-size 100000 \
    --progress 1000 \
    -l h_vmem=1G \
    -N test_vcfnp \
    -j y \
    -o /path/to/sge/logs \
    -q shortrun.q

It should be straightforward to adapt this script to run on other parallel computing platforms; see the scripts folder for the source code.
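
One possible starting point for such an adaptation is to generate one vcf2npy command per task and hand the list to whatever scheduler is available. The sketch below uses only the vcf2npy options shown above; the function name and the chromosome_length argument are hypothetical (the provided script presumably obtains the length some other way, e.g. from the reference FASTA).

```python
def vcf2npy_commands(vcf, fasta, output_dir, array_type,
                     chromosome, chromosome_length, task_size):
    """Build one vcf2npy argument list per region-sized task.

    chromosome_length is supplied by the caller here; that is an
    assumption made for illustration only.
    """
    # Ceiling division: number of task_size chunks covering the chromosome.
    n_tasks = (chromosome_length + task_size - 1) // task_size
    commands = []
    for task_index in range(1, n_tasks + 1):  # 1-based task indices
        commands.append([
            'vcf2npy',
            '--vcf', vcf,
            '--fasta', fasta,
            '--output-dir', output_dir,
            '--array-type', array_type,
            '--chromosome', chromosome,
            '--task-size', str(task_size),
            '--task-index', str(task_index),
        ])
    return commands


commands = vcf2npy_commands('/path/to/my.vcf', '/path/to/ref.fa',
                            '/path/to/npy/output', 'variants',
                            'chr20', 250000, 100000)
for cmd in commands:
    print(' '.join(cmd))
```

Each argument list could then be passed to subprocess.call, a multiprocessing pool, or a different cluster's submission command.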

A script is also provided to load data from multiple NPY files into a single HDF5 file. E.g., after having converted a VCF file to 100kb variants and calldata_2d NPY splits, run something like:

$ vcfnpy2hdf5 \
    --vcf /path/to/my.vcf \
    --input-dir /path/to/npy/output \
    --output /path/to/my.h5

If you want to group the data by chromosome, do something like the following for each chromosome separately:

$ vcfnpy2hdf5 \
    --vcf /path/to/my.vcf \
    --input-dir /path/to/npy/output \
    --input-filename-template '{array_type}.chr20*.npy' \
    --output /path/to/my.h5 \
    --group chr20
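
The filename template above looks like a Python format string combined with a glob wildcard. Assuming that is how it is interpreted (an assumption; check the script source), the sketch below shows which files such a template would select. The filenames and the function name are made up for illustration and are not necessarily the names vcf2npy writes.

```python
import fnmatch


def select_npy_files(filenames, template, array_type):
    """Fill in {array_type}, then treat the result as a glob pattern.

    This mirrors the apparent intent of --input-filename-template; the
    real script may resolve the template differently.
    """
    pattern = template.format(array_type=array_type)
    return [f for f in filenames if fnmatch.fnmatchcase(f, pattern)]


# Hypothetical filenames, for illustration only.
filenames = [
    'variants.chr20.1-100000.npy',
    'variants.chr2.1-100000.npy',
    'calldata_2d.chr20.1-100000.npy',
]
selected = select_npy_files(filenames, '{array_type}.chr20*.npy', 'variants')
print(selected)
```

Note that a pattern such as chr20* would also match a longer name beginning with chr20, so a stricter template may be needed if chromosome names share prefixes.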

Release Notes

1.10
1.9
1.8
1.7
1.6
1.5
1.0 - Note that as of version 1.0 the info() function has been removed and the variants() function now loads data from any of the VCF fixed fields including INFO. I.e., the variants() function gives access to all variant-level data in a single structured array. This is convenient for many use cases, e.g., using PyTables in-kernel queries to select variants passing some filtering criteria.
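
As an illustration of why a single structured array is convenient, the sketch below filters variants with a NumPy boolean mask, which is the in-memory analogue of the kind of condition an in-kernel query would express. The dtype, the records and the thresholds are invented for the example; only the field names QUAL, DP and is_snp come from the usage section above.

```python
import numpy as np

# Hypothetical variants array; the real dtype from vcfnp.variants()
# has more fields and may use different types.
V = np.array(
    [(b'chr20', 100, 50.0, 30, True),
     (b'chr20', 200, 10.0, 40, True),
     (b'chr20', 300, 60.0, 5, True),
     (b'chr20', 400, 70.0, 25, False)],
    dtype=[('CHROM', 'S12'), ('POS', 'i4'), ('QUAL', 'f4'),
           ('DP', 'i4'), ('is_snp', 'b1')])

# Keep SNPs with QUAL above 30 and depth of at least 10
# (arbitrary example thresholds).
mask = (V['QUAL'] > 30) & (V['DP'] >= 10) & V['is_snp']
passing = V[mask]
print(passing['POS'])
```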

Acknowledgments

Based on Erik Garrison’s vcflib.

Release history

2.3.0

Jul 20, 2016

2.2.0

Nov 25, 2015

2.1.5

Sep 1, 2015

2.1.4

Aug 26, 2015

2.1.3

Aug 26, 2015

2.1.2

Mar 1, 2015

2.1.1

Mar 1, 2015

2.1.0

Feb 28, 2015

2.0.1

Jan 23, 2015

2.0.0

Jan 23, 2015

1.12

Aug 26, 2014

1.11.5

Jul 8, 2014

1.11.4

Jul 7, 2014

1.11.3

Jun 26, 2014

1.11.2

Jun 23, 2014

This version

1.11.1

Jun 23, 2014

1.11

Jun 23, 2014

1.10.2

Jun 2, 2014

1.10.1

Mar 10, 2014

1.10

Mar 10, 2014

1.9.1

Feb 20, 2014

1.9

Feb 20, 2014

1.8

Feb 17, 2014

1.7

Jan 23, 2014

1.6

Jan 22, 2014

1.5

Jan 22, 2014

1.4

Nov 19, 2013

1.3

Nov 19, 2013

1.2

Nov 19, 2013

1.1

Nov 19, 2013

1.0.1

Nov 8, 2013

1.0

Nov 7, 2013

0.16

Oct 30, 2013

0.15

Oct 22, 2013

0.14

Sep 12, 2013

0.13

Jul 16, 2013

0.12

Jul 16, 2013

0.11.2

Jul 4, 2013

0.11.1

Jul 4, 2013

0.11

Jul 4, 2013

0.10

Jun 20, 2013

0.9

May 23, 2013

0.8

May 23, 2013

0.7

May 22, 2013

0.6

May 20, 2013

0.5

May 17, 2013

0.4

May 17, 2013

0.3

May 17, 2013

0.4-SNAPSHOT pre-release

May 17, 2013

Download files

Source Distribution

vcfnp-1.11.1.tar.gz (486.4 kB)

Uploaded Jun 23, 2014 Source

Uploaded using Trusted Publishing? No

File hashes

Hashes for vcfnp-1.11.1.tar.gz
Algorithm	Hash digest
SHA256	`e97178f90c0e0639daab2193dafb71be3971772001abcc5b90cc9a3cd9abdf0f`
MD5	`d8bfae8ce8f34347dc466418b308a9d0`
BLAKE2b-256	`dda3f3f5a9d43660c1c4abaa848bb4f415ca65ed1d537ec69ae410464999b0b5`