# mmsplice
[![pypi](https://img.shields.io/pypi/v/mmsplice.svg)](https://pypi-hypernode.com/pypi/mmsplice)
[![travis](https://img.shields.io/travis/s6juncheng/mmsplice.svg)](https://travis-ci.org/s6juncheng/mmsplice)
Predict splicing variant effect from VCF
* Free software: MIT license
## Usage example
------
### Preparation
------
#### 1. Prepare annotation (gtf) file
A standard human gene annotation file in GTF format can be downloaded from Ensembl or GENCODE.
`MMSplice` can work directly with those files; however, some filtering is highly recommended.
- Filter for protein coding genes.
- Filter out duplicated exons. The same exon can be annotated multiple times if it appears in multiple transcripts.
This will cause duplicated predictions.
We provide a filtered version [here](https://raw.githubusercontent.com/gagneurlab/MMSplice_paper/master/data/shared/Homo_sapiens.GRCh37.75.chr.uniq_exon.gtf.gz).
Note that this version has chromosome names in the format `chr*`. You may need to remove the `chr` prefix to match the chromosome names in your fasta file.
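The two filtering steps above can be sketched in plain Python. This is a minimal illustration, not part of `MMSplice`: the function name `filter_gtf_lines` is hypothetical, it assumes the biotype is stored in a `gene_biotype` (Ensembl) or `gene_type` (GENCODE) attribute, and it keeps only `exon` records, treating two exons as duplicates when chromosome, start, end and strand all agree.

```python
def filter_gtf_lines(lines):
    """Yield exon records of protein-coding genes, skipping duplicated exons.

    `lines` is any iterable of GTF lines (for example an open file handle).
    """
    seen = set()
    for line in lines:
        if line.startswith('#'):
            continue  # header / comment line
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'exon':
            continue  # assumption: only exon records are kept
        attrs = fields[8]
        # Assumption: Ensembl writes gene_biotype, GENCODE writes gene_type.
        if ('gene_biotype "protein_coding"' not in attrs
                and 'gene_type "protein_coding"' not in attrs):
            continue
        # The same exon shared by several transcripts has identical coordinates.
        key = (fields[0], fields[3], fields[4], fields[6])
        if key in seen:
            continue
        seen.add(key)
        yield line


example = [
    '#!genome-build GRCh37\n',
    '17\tprotein_coding\texon\t100\t200\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";\n',
    '17\tprotein_coding\texon\t100\t200\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; gene_biotype "protein_coding";\n',
    '17\tlincRNA\texon\t300\t400\t.\t-\t.\t'
    'gene_id "G2"; transcript_id "T3"; gene_biotype "lincRNA";\n',
]
kept = list(filter_gtf_lines(example))
```

In this toy input only the first exon record survives: the second is a duplicate of the same exon in another transcript, and the third belongs to a gene that is not protein coding.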
#### 2. Prepare variant (VCF) file
A correctly formatted VCF file will work with `MMSplice`; however, the following steps will make it less prone to false positives:
- Quality filtering. Low-quality variants lead to unreliable predictions.
- Split lines that present multiple variants into one variant per line. For example:
```bash
bcftools norm -m-both -o out.vcf in.vcf.gz
```
- Left-normalization. For instance, GGCA-->GG is not left-normalized while GCA-->G is. For details on the unified representation of genetic variants, see [Tan et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481842/)
```bash
bcftools norm -f reference.fasta -o out.vcf in.vcf
```
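To make the idea of left-normalization concrete, here is a minimal Python sketch of the trimming and left-shifting procedure described by Tan et al. It is an illustration only, not how `bcftools` or `MMSplice` implement it. The function name `left_normalize` is hypothetical, positions are 1-based as in VCF, and the chromosome sequence is assumed to be available as a plain string.

```python
def left_normalize(pos, ref, alt, chrom_seq):
    """Return a left-aligned, parsimonious (pos, ref, alt) for one variant.

    pos is 1-based; chrom_seq is the chromosome sequence as a string.
    """
    if ref == alt:
        raise ValueError("ref and alt are identical; not a variant")
    changed = True
    while changed:
        changed = False
        # Trim a shared rightmost base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            base = chrom_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leftmost bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# GGCA-->GG at position 5 carries a redundant leading G.
trimmed = left_normalize(5, "GGCA", "GG", "AAAAGGCAT")
# A CA deletion written at the right end of a CA repeat is shifted left.
shifted = left_normalize(6, "ACA", "A", "AGCACACAT")
```

Here `trimmed` is `(6, 'GCA', 'G')` and `shifted` is `(2, 'GCA', 'G')`; both end in the GCA-->G form given as the normalized example above.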
#### 3. Prepare reference genome (fasta) file
A human reference fasta file can be downloaded from Ensembl or GENCODE. Make sure the chromosome names match those in the GTF annotation file you use.
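A quick way to check that chromosome names agree is to compare the first column of the GTF with the sequence names in the fasta headers. The sketch below is a hedged illustration with hypothetical helper names (`gtf_chromosomes`, `fasta_chromosomes`, `strip_chr_prefix`); it works on any iterable of lines, so open file handles can be passed directly.

```python
def gtf_chromosomes(lines):
    """Collect chromosome names from the first column of GTF lines."""
    return {line.split('\t', 1)[0]
            for line in lines
            if line.strip() and not line.startswith('#')}


def fasta_chromosomes(lines):
    """Collect sequence names from fasta header lines ('>name description')."""
    return {line[1:].split()[0]
            for line in lines
            if line.startswith('>') and line[1:].strip()}


def strip_chr_prefix(name):
    """Turn 'chr17' into '17'; leave names without the prefix unchanged."""
    return name[3:] if name.startswith('chr') else name


gtf_names = gtf_chromosomes(['#comment\n', 'chr17\tsrc\texon\t1\t2\n'])
fasta_names = fasta_chromosomes(['>17 dna:chromosome\n', 'ACGT\n'])

# Names present in the GTF but missing from the fasta, before and after
# removing the 'chr' prefix.
missing = gtf_names - fasta_names
missing_after_strip = {strip_chr_prefix(n) for n in gtf_names} - fasta_names
```

In this toy case `missing` contains `chr17`, while `missing_after_strip` is empty, which indicates that removing the `chr` prefix from the GTF would make the two files consistent.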
### Example code
------
Check [notebooks/example.ipynb](https://github.com/gagneurlab/MMSplice/blob/master/notebooks/example.ipynb)
```python
# Import
from mmsplice.vcf_dataloader import SplicingVCFDataloader
from mmsplice import MMSplice, predict_all_table
from mmsplice.utils import max_varEff

# Example files
gtf = 'tests/data/test.gtf'
vcf = 'tests/data/test.vcf.gz'
fasta = 'tests/data/hg19.nochr.chr17.fa'
gtfIntervalTree = '../tests/data/test.pkl'  # pickled exon interval tree

# Dataloader to load variants from the VCF
dl = SplicingVCFDataloader(gtf,
                           fasta,
                           vcf,
                           out_file=gtfIntervalTree,  # same pickled gtf IntervalTree
                           split_seq=False)

# Specify model
model = MMSplice(
    exon_cut_l=0,
    exon_cut_r=0,
    acceptor_intron_cut=6,
    donor_intron_cut=6,
    acceptor_intron_len=50,
    acceptor_exon_len=3,
    donor_exon_len=5,
    donor_intron_len=13)

# Do prediction
predictions = predict_all_table(model, dl, batch_size=1024, split_seq=False, assembly=False)

# Summarize with maximum effect size
predictionsMax = max_varEff(predictions)
```
=======
History
=======
0.1.0 (2018-07-17)
------------------
* First release on PyPI.
## Download files

File | Type | Size | Tags |
---|---|---|---|
mmsplice-0.2.4.tar.gz | Source distribution | 451.0 kB | Source |
mmsplice-0.2.4-py2.py3-none-any.whl | Built distribution | 448.3 kB | Python 2, Python 3 |

Both files were uploaded via twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.1.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.0, without Trusted Publishing.

File hashes for mmsplice-0.2.4.tar.gz:

Algorithm | Hash digest |
---|---|
SHA256 | d6947d204330efec653463211cd28f78b577d202aa3403f23c70d82b4056b5df |
MD5 | 059f3dba4b23df647c55434dfd60e59f |
BLAKE2b-256 | 7bc1634b6ad37835e35a94c487b24166c5c4a95c7379e0b0a7364d1896103a93 |

File hashes for mmsplice-0.2.4-py2.py3-none-any.whl:

Algorithm | Hash digest |
---|---|
SHA256 | dba9e4acda797f70747a58bfde533d3a7b32232932b130ff4579c7807a8f8920 |
MD5 | 00fd9867b9f57747a49d8dacd4632b5f |
BLAKE2b-256 | d7668f6b318c0ec9fae2edcb188a6b6595d6177b20fab4df90e4eddd68219407 |