Predict splicing variant effect from VCF
Project description
# mmsplice
[![pypi](https://img.shields.io/pypi/v/mmsplice.svg)](https://pypi-hypernode.com/pypi/mmsplice)
[![travis](https://img.shields.io/travis/s6juncheng/mmsplice.svg)](https://travis-ci.org/s6juncheng/mmsplice)
Predict splicing variant effect from VCF
Paper: Cheng et al. https://doi.org/10.1101/438986
![MMSplice](https://raw.githubusercontent.com/kipoi/models/master/MMSplice/Model.png)
## Usage example
------
```bash
pip install mmsplice
```
### Preparation
------
#### 1. Prepare annotation (gtf) file
Standard human gene annotation file in GTF format can be downloaded from ensembl or gencode.
`MMSplice` can work directly with those files, however, some filtering is higly recommended.
- Filter for protein coding genes.
- Filter out duplicated exons. The same exon can be annotated multiple times if it appears in multiple transcripts.
This will cause duplicated predictions.
We provide a filtered version [here](https://raw.githubusercontent.com/gagneurlab/MMSplice_paper/master/data/shared/Homo_sapiens.GRCh37.75.chr.uniq_exon.gtf.gz).
Note this version has chromosome names in the format `chr*`. You may need to remove them to match the chromosome names in your fasta file.
#### 2. Prepare variant (VCF) file
A correctly formatted VCF file with work with `MMSplice`, however the following steps will make it less prone to false positives:
- Quality filtering. Low quality variants leads to unreliable predictions.
- Avoid presenting multiple variants in one line by splitting them into multiple lines. Example code to do it:
```bash
bcftools norm -m-both -o out.vcf in.vcf.gz
```
- Left-normalization. For instance, GGCA-->GG is not left-normalized while GCA-->G is. Details for unified representation of genetic variants see [Tan et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481842/)
```bash
bcftools norm -f reference.fasta -o out.vcf in.vcf
```
#### 3. Prepare reference genome (fasta) file
Human reference fasta file can be downloaded from ensembl/gencode. Make sure the chromosome name matches with GTF annotation file you use.
### Example code
------
Check [notebooks/example.ipynb](https://github.com/gagneurlab/MMSplice/blob/master/notebooks/example.ipynb)
```python
# Import
from mmsplice.vcf_dataloader import SplicingVCFDataloader
from mmsplice import MMSplice, predict_all_table
from mmsplice.utils import max_varEff
# example files
gtf = 'tests/data/test.gtf'
vcf = 'tests/data/test.vcf.gz'
fasta = 'tests/data/hg19.nochr.chr17.fa'
gtfIntervalTree = '../tests/data/test.pkl' # pickle exon interval Tree
# dataloader to load variants from vcf
dl = SplicingVCFDataloader(gtf,
fasta,
vcf,
out_file=gtfIntervalTree, # same pikled gtf IntervalTree
split_seq=False)
# Specify model
model = MMSplice(
exon_cut_l=0,
exon_cut_r=0,
acceptor_intron_cut=6,
donor_intron_cut=6,
acceptor_intron_len=50,
acceptor_exon_len=3,
donor_exon_len=5,
donor_intron_len=13)
# Do prediction
predictions = predict_all_table(model, dl, batch_size=1024, split_seq=False, assembly=False)
# Summerize with maximum effect size
predictionsMax = max_varEff(predictions)
```
## VEP Plugin
Please check documentation of vep plugin [under VEP_plugin/README.md](VEP_plugin/README.md).
=======
History
=======
0.1.0 (2018-07-17)
------------------
* First release on PyPI.
[![pypi](https://img.shields.io/pypi/v/mmsplice.svg)](https://pypi-hypernode.com/pypi/mmsplice)
[![travis](https://img.shields.io/travis/s6juncheng/mmsplice.svg)](https://travis-ci.org/s6juncheng/mmsplice)
Predict splicing variant effect from VCF
Paper: Cheng et al. https://doi.org/10.1101/438986
![MMSplice](https://raw.githubusercontent.com/kipoi/models/master/MMSplice/Model.png)
## Usage example
------
```bash
pip install mmsplice
```
### Preparation
------
#### 1. Prepare annotation (gtf) file
Standard human gene annotation file in GTF format can be downloaded from ensembl or gencode.
`MMSplice` can work directly with those files, however, some filtering is higly recommended.
- Filter for protein coding genes.
- Filter out duplicated exons. The same exon can be annotated multiple times if it appears in multiple transcripts.
This will cause duplicated predictions.
We provide a filtered version [here](https://raw.githubusercontent.com/gagneurlab/MMSplice_paper/master/data/shared/Homo_sapiens.GRCh37.75.chr.uniq_exon.gtf.gz).
Note this version has chromosome names in the format `chr*`. You may need to remove them to match the chromosome names in your fasta file.
#### 2. Prepare variant (VCF) file
A correctly formatted VCF file with work with `MMSplice`, however the following steps will make it less prone to false positives:
- Quality filtering. Low quality variants leads to unreliable predictions.
- Avoid presenting multiple variants in one line by splitting them into multiple lines. Example code to do it:
```bash
bcftools norm -m-both -o out.vcf in.vcf.gz
```
- Left-normalization. For instance, GGCA-->GG is not left-normalized while GCA-->G is. Details for unified representation of genetic variants see [Tan et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481842/)
```bash
bcftools norm -f reference.fasta -o out.vcf in.vcf
```
#### 3. Prepare reference genome (fasta) file
Human reference fasta file can be downloaded from ensembl/gencode. Make sure the chromosome name matches with GTF annotation file you use.
### Example code
------
Check [notebooks/example.ipynb](https://github.com/gagneurlab/MMSplice/blob/master/notebooks/example.ipynb)
```python
# Import
from mmsplice.vcf_dataloader import SplicingVCFDataloader
from mmsplice import MMSplice, predict_all_table
from mmsplice.utils import max_varEff
# example files
gtf = 'tests/data/test.gtf'
vcf = 'tests/data/test.vcf.gz'
fasta = 'tests/data/hg19.nochr.chr17.fa'
gtfIntervalTree = '../tests/data/test.pkl' # pickle exon interval Tree
# dataloader to load variants from vcf
dl = SplicingVCFDataloader(gtf,
fasta,
vcf,
out_file=gtfIntervalTree, # same pikled gtf IntervalTree
split_seq=False)
# Specify model
model = MMSplice(
exon_cut_l=0,
exon_cut_r=0,
acceptor_intron_cut=6,
donor_intron_cut=6,
acceptor_intron_len=50,
acceptor_exon_len=3,
donor_exon_len=5,
donor_intron_len=13)
# Do prediction
predictions = predict_all_table(model, dl, batch_size=1024, split_seq=False, assembly=False)
# Summerize with maximum effect size
predictionsMax = max_varEff(predictions)
```
## VEP Plugin
Please check documentation of vep plugin [under VEP_plugin/README.md](VEP_plugin/README.md).
=======
History
=======
0.1.0 (2018-07-17)
------------------
* First release on PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
mmsplice-0.2.5.tar.gz
(451.3 kB
view details)
Built Distribution
mmsplice-0.2.5-py2.py3-none-any.whl
(448.4 kB
view details)
File details
Details for the file mmsplice-0.2.5.tar.gz
.
File metadata
- Download URL: mmsplice-0.2.5.tar.gz
- Upload date:
- Size: 451.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.1.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | dde05536e16e2a16dcd2e2763a768ffe6dc59ddb626997974493a9301345d5c3 |
|
MD5 | 4c42822fce368cfb8a0934c2cd76a295 |
|
BLAKE2b-256 | 7ced7bf13248347626464c61cb3f2dc90246cead71c7fa1faf38d594c87d2d98 |
File details
Details for the file mmsplice-0.2.5-py2.py3-none-any.whl
.
File metadata
- Download URL: mmsplice-0.2.5-py2.py3-none-any.whl
- Upload date:
- Size: 448.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.1.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 13d9a6b185d8de063d75fc7e164f829935299b3d5caa058dba23cd9905392835 |
|
MD5 | 65a003f7c6507ec7d00b605605c295a3 |
|
BLAKE2b-256 | 751e574ef7216ab1d1f9f8d6250058b1b3d9ac708cadcc8c05c8d7da6ed341c5 |