Variation Normalization

Services and guidelines for normalizing variation terms into VRS (v1.2.0) and VRSATILE (latest) compatible representations.

Public OpenAPI endpoint: https://normalize.cancervariants.org/variation

Installing with pip:

pip install variation-normalizer

About

Variation Normalization works in four main steps: tokenization, classification, validation, and translation. During tokenization, the query string is split on whitespace and each token is parsed to determine its type. During classification, the sequence of tokens is matched against known classifications. During validation, checks are performed, such as ensuring that the reference nucleotide or amino acid matches the expected value and that the given position exists on the transcript. During translation, a VRS Allele object is returned.
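As a purely illustrative sketch (hypothetical helper logic, not the package's internal API), the four steps applied to a free-text query like BRAF V600E might look as follows; the real service consults SeqRepo and UTA during validation:

import re

def toy_normalize(query: str) -> dict:
    # 1. Tokenization: split on whitespace and tag each token.
    gene, change = query.split()
    # 2. Classification: recognize the token order as a protein substitution (e.g. V600E).
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", change)
    if not match:
        raise ValueError("only protein substitutions are handled in this toy example")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    # 3. Validation: the real service checks that `ref` is the residue at `pos` on an
    #    accession for `gene`; skipped here.
    # 4. Translation: emit a minimal VRS-Allele-like dict.
    return {"type": "Allele", "gene": gene, "position": pos, "reference": ref, "state": alt}

print(toy_normalize("BRAF V600E"))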

Variation Normalization is limited to the following types of variants:

  • HGVS expressions and text representations (ex: BRAF V600E):
    • protein (p.): substitution, deletion, insertion, deletion-insertion
    • coding DNA (c.): substitution, deletion, insertion, deletion-insertion
    • genomic (g.): substitution, deletion, ambiguous deletion, insertion, deletion-insertion, duplication
  • gnomAD-style VCF (chr-pos-ref-alt, ex: 7-140753336-A-T)
    • genomic (g.): substitution, deletion, insertion

We are working towards adding more types of variations, coordinates, and representations.

Endpoints

/toVRS

The /toVRS endpoint returns a list of validated VRS Variations.

/normalize

The /normalize endpoint returns a Variation Descriptor containing the MANE Transcript, if one is found. If a genomic query is not given a gene, normalize will return its GRCh38 representation.
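For example, assuming both endpoints accept a free-text query via a q parameter (check the OpenAPI docs at the public URL above for the exact parameter names), a quick way to exercise them from Python is:

import requests

BASE = "https://normalize.cancervariants.org/variation"

resp = requests.get(f"{BASE}/toVRS", params={"q": "BRAF V600E"})
resp.raise_for_status()
print(resp.json())  # list of validated VRS Variations

resp = requests.get(f"{BASE}/normalize", params={"q": "BRAF V600E"})
resp.raise_for_status()
print(resp.json())  # Variation Descriptor, including the MANE Transcript if one is found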

The steps for retrieving MANE Transcript data are as follows (a sketch of the selection order in step 3 follows the list):

  1. Map starting annotation layer to genomic
  2. Lift over to GRCh38
    We only support lifting over from GRCh37.
  3. Select preferred compatible annotation
    1. MANE Select
    2. MANE Plus Clinical
    3. Longest Compatible Remaining Transcript
  4. Map back to starting annotation layer
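A minimal sketch of the preference order in step 3, using hypothetical candidate records (the real implementation derives these from UTA and the MANE summary data):

def pick_transcript(candidates: list[dict]) -> dict | None:
    # Prefer MANE Select, then MANE Plus Clinical.
    for status in ("MANE Select", "MANE Plus Clinical"):
        for tx in candidates:
            if tx.get("status") == status:
                return tx
    # Otherwise fall back to the longest compatible remaining transcript.
    compatible = [tx for tx in candidates if tx.get("compatible")]
    return max(compatible, key=lambda tx: tx["length"], default=None)

print(pick_transcript([
    {"accession": "NM_000001.1", "status": None, "compatible": True, "length": 9000},
    {"accession": "NM_000002.2", "status": "MANE Select", "compatible": True, "length": 6500},
]))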

Backend Services

Variation Normalization relies on several local data caches, which you will need to set up. It uses pipenv to manage its environment, so you will also need to install pipenv.

Once pipenv is installed:

pipenv shell
pipenv lock
pipenv sync

Gene Normalizer

Variation Normalization relies on data from Gene Normalization. You must load all sources and merged concepts.

You must also have Gene Normalization's DynamoDB running for the application to work.

For more information about the gene-normalizer, visit the README.
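A quick way to confirm that a DynamoDB instance is reachable before starting the service (this assumes DynamoDB Local on port 8000; adjust the endpoint, region, and credentials to your setup):

import boto3

client = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-2",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)
print(client.list_tables()["TableNames"])  # should include the gene-normalizer tables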

SeqRepo

Variation Normalization relies on seqrepo, which you must download yourself.

Variation Normalizer uses seqrepo to retrieve sequences at given positions on a transcript.

From the root directory:

pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2021-01-29
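Once the snapshot is downloaded, sequences can be fetched from Python (the path assumes the pull command above; use the snapshot directory name that was actually created):

from biocommons.seqrepo import SeqRepo

sr = SeqRepo("/usr/local/share/seqrepo/2021-01-29")
# Fetch a single base on chromosome 7 (GRCh38); coordinates are 0-based, end-exclusive.
print(sr.fetch("NC_000007.14", start=140753335, end=140753336))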

UTA

Variation Normalizer uses UTA to retrieve MANE Transcript data.

The following commands will likely need modification appropriate for the installation environment.

  1. Install PostgreSQL

  2. Create user and database.

    $ createuser -U postgres uta_admin
    $ createuser -U postgres anonymous
    $ createdb -U postgres -O uta_admin uta
    
  3. To install locally, from the variation/data directory:

export UTA_VERSION=uta_20210129.pgd.gz
curl -O http://dl.biocommons.org/uta/$UTA_VERSION
gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433

To connect to the UTA database, you can use the default URL (postgresql://uta_admin@localhost:5433/uta/uta_20210129). If you use the default URL, you must either set the password via the environment variable UTA_PASSWORD or set the db_pwd parameter in the UTA class.

If you do not wish to use the default, set the environment variable UTA_DB_URL, which takes the format driver://user:pass@host/database/schema.
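For example, a minimal way to point the service at a specific UTA instance from Python before it starts (the credentials, port, and schema name below are placeholders):

import os

# Either override the full connection URL...
os.environ["UTA_DB_URL"] = "postgresql://uta_admin:some_password@localhost:5433/uta/uta_20210129"
# ...or keep the default URL and supply only the password.
os.environ["UTA_PASSWORD"] = "some_password"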

PyLiftover

Variation Normalizer uses PyLiftover to convert GRCh37 coordinates to GRCh38 coordinates.
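For example, a minimal liftover of a single GRCh37 coordinate (the chromosome and position are illustrative; convert_coordinate returns a list of (chromosome, position, strand, score) tuples, or an empty list if the position cannot be lifted over):

from pyliftover import LiftOver

lo = LiftOver("hg19", "hg38")  # downloads the UCSC chain file on first use
print(lo.convert_coordinate("chr7", 140453136))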

Data

RefSeq

Variation Normalizer uses RefSeq data found on the RefSeq FTP site.

This data is used for free-text variation queries to retrieve all RefSeq accessions that correspond to a given gene.

Ensembl BioMart

Variation Normalizer uses Ensembl BioMart to retrieve variation/data/transcript_mappings.tsv. We currently use the Human Genes (GRCh38.p13) dataset with the following attributes: Gene stable ID, Gene stable ID version, Transcript stable ID, Transcript stable ID version, Protein stable ID, Protein stable ID version, RefSeq match transcript (MANE Select), and Gene name.

This data is used for free-text variation queries to retrieve all Ensembl accessions that correspond to a given gene.
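A sketch of how such a lookup might work against the TSV (the column names are assumptions based on the BioMart attributes listed above):

import csv

def ensembl_accessions(gene_symbol: str, path: str = "variation/data/transcript_mappings.tsv") -> list[str]:
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return sorted({row["Transcript stable ID version"]
                       for row in reader if row.get("Gene name") == gene_symbol})

print(ensembl_accessions("BRAF"))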


MANE Data

Variation Normalizer uses MANE data from RefSeq's FTP site.

Starting the Variation Normalization Service Locally

The gene-normalizer's DynamoDB instance and the UTA database must be running.

To start the service, run the following:

uvicorn variation.main:app --reload

Next, view the OpenAPI docs on your local machine: http://127.0.0.1:8000/variation

Init coding style tests

Code style is managed by flake8 and checked prior to commit.

We use pre-commit to run conformance tests.

This runs the following checks:

  • Code style
  • Added large files
  • AWS credentials
  • Private keys

Before your first commit, run:

pre-commit install

Testing

From the root directory of the repository:

pytest tests/
