VICC normalization routine for variations

Variation Normalization

Services and guidelines for normalizing variation terms into VRS and VRSATILE compatible representations.

Public OpenAPI endpoint: https://normalize.cancervariants.org/variation

Installing with pip:

pip install variation-normalizer

About

Variation Normalization works in four main steps: tokenization, classification, validation, and translation. During tokenization, we split strings on whitespace and parse each piece to determine its token type. During classification, we specify the order of tokens a classification can have. During validation, we run checks such as ensuring that the reference nucleotide or amino acid matches the expected value and that the position exists on the given transcript. During translation, we return a VRS Allele object.
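
The four steps can be sketched with a toy example. This is not the package's implementation: every name below (`tokenize`, `classify`, `validate`, `translate`, `normalize_toy`, `TOY_REFERENCE`) is hypothetical, the reference lookup is a hard-coded dict rather than a transcript query, and the result is a plain dict rather than a real VRS Allele. It only handles a gene symbol followed by a protein substitution such as BRAF V600E.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a transcript lookup: (gene, position) -> residue.
TOY_REFERENCE = {("BRAF", 600): "V"}

SUBSTITUTION = re.compile(r"([A-Z])(\d+)([A-Z])")
GENE = re.compile(r"[A-Z][A-Z0-9]+")


@dataclass
class Token:
    text: str
    kind: str


def tokenize(query):
    """Split on whitespace and label each piece with a token type."""
    tokens = []
    for part in query.split():
        if SUBSTITUTION.fullmatch(part):
            kind = "protein_substitution"
        elif GENE.fullmatch(part):
            kind = "gene"
        else:
            kind = "unknown"
        tokens.append(Token(part, kind))
    return tokens


def classify(tokens):
    """Match the ordered token types against a known classification."""
    if [t.kind for t in tokens] == ["gene", "protein_substitution"]:
        return "protein substitution"
    raise ValueError("unsupported token order")


def validate(tokens):
    """Check that the stated reference residue matches the toy reference."""
    gene = tokens[0].text
    ref, pos, alt = SUBSTITUTION.fullmatch(tokens[1].text).groups()
    if TOY_REFERENCE.get((gene, int(pos))) != ref:
        raise ValueError(f"reference mismatch at {gene} position {pos}")
    return gene, int(pos), ref, alt


def translate(gene, pos, ref, alt):
    """Return a plain dict; a real translator would build a VRS Allele."""
    return {"gene": gene, "position": pos, "ref": ref, "alt": alt}


def normalize_toy(query):
    """Run the four steps in order on a free-text query."""
    tokens = tokenize(query)
    classify(tokens)
    return translate(*validate(tokens))
```

Calling `normalize_toy("BRAF V600E")` passes all four steps, while `normalize_toy("BRAF A600E")` fails validation because the toy reference holds V at position 600.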

Variation Normalization is limited to the following types of variants:

  • HGVS expressions and text representations (ex: BRAF V600E):
    • protein (p.): substitution, deletion, insertion, deletion-insertion
    • coding DNA (c.): substitution, deletion, insertion, deletion-insertion
    • genomic (g.): substitution, deletion, ambiguous deletion, insertion, deletion-insertion, duplication
  • gnomAD-style VCF (chr-pos-ref-alt, ex: 7-140753336-A-T)
    • genomic (g.): substitution, deletion, insertion
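
To make the gnomAD-style VCF format concrete, here is a minimal sketch of parsing a chr-pos-ref-alt string. The function name `parse_gnomad_vcf` and the regular expression are illustrative assumptions, not the package's parser. The deletion and insertion checks assume the usual VCF convention of a shared leading anchor base.

```python
import re

# Assumed pattern: optional "chr" prefix, chromosome 1-22/X/Y, position, ref, alt.
GNOMAD_VCF = re.compile(
    r"^(?:chr)?(\d{1,2}|X|Y)-(\d+)-([ACGT]+)-([ACGT]+)$", re.IGNORECASE
)


def parse_gnomad_vcf(query):
    """Split a chr-pos-ref-alt string and guess the variant type."""
    match = GNOMAD_VCF.match(query.strip())
    if match is None:
        raise ValueError(f"not a gnomAD-style VCF string: {query!r}")
    chrom, pos, ref, alt = match.groups()
    ref, alt = ref.upper(), alt.upper()

    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        kind = "substitution"
    elif len(ref) > len(alt) and ref.startswith(alt):
        kind = "deletion"      # e.g. AT -> A, anchored on the first base
    elif len(alt) > len(ref) and alt.startswith(ref):
        kind = "insertion"     # e.g. A -> AT, anchored on the first base
    else:
        kind = "unsupported"

    return {"chrom": chrom.upper(), "pos": int(pos), "ref": ref, "alt": alt, "kind": kind}
```

For the example in the list above, `parse_gnomad_vcf("7-140753336-A-T")` is classified as a substitution on chromosome 7.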

Variation Normalizer accepts input from GRCh37 or GRCh38 assemblies.

We are working towards adding more types of variations, coordinates, and representations.

Endpoints

The /to_vrs endpoint returns a list of validated VRS Variations.

The /normalize endpoint returns a Variation Descriptor containing the MANE Transcript, if one is found. If a genomic query is not given a gene, normalize will return its GRCh38 representation. Variation Normalizer relies on Common Operations On Lots-of Sequences Tool (cool-seq-tool) for retrieving MANE Transcript data. More information on the transcript selection algorithm can be found here.
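
A minimal sketch of calling these endpoints from Python, using only the standard library. It assumes that both endpoints sit under the /variation prefix of the public service and take the query text in a parameter named `q`; check the OpenAPI docs before relying on either assumption. The helper names `build_url` and `fetch` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://normalize.cancervariants.org/variation"


def build_url(endpoint, query):
    """Build a request URL; assumes the query is passed as `q`."""
    return f"{BASE_URL}/{endpoint}?{urlencode({'q': query})}"


def fetch(endpoint, query, timeout=30):
    """Send a GET request and decode the JSON response body."""
    with urlopen(build_url(endpoint, query), timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch("normalize", "BRAF V600E")` would request the Variation Descriptor, and `fetch("to_vrs", "BRAF V600E")` the list of VRS Variations, provided the assumptions above hold.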

Developer Instructions

Clone the repo:

git clone https://github.com/cancervariants/variation-normalization.git
cd variation-normalization

For a development install, we recommend using Pipenv. See the pipenv docs for directions on installing pipenv in your environment.

Once installed, from the project root dir, just run:

pipenv shell
pipenv lock && pipenv sync
pipenv install --dev

Backend Services

Variation Normalization relies on some local data caches which you will need to set up. It uses pipenv to manage its environment, which you will also need to install.

Gene Normalizer

Variation Normalization relies on data from Gene Normalization. You must load all sources and merged concepts.

You must also have Gene Normalization's DynamoDB running in a separate terminal for the application to work.

For more information about the gene-normalizer and how to load the database, visit the README.

SeqRepo

Variation Normalization relies on seqrepo, which you must download yourself.

Variation Normalizer uses seqrepo to retrieve sequences at given positions on a transcript.

From the root directory:

pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2021-01-29  # Replace with latest version using `seqrepo list-remote-instances` if outdated

If you get an error similar to the one below:

PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'

Move the temporary directory into place. The suffix may differ from ._fkuefgd, so use the path from your own error message:

sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29
exit

Use the SEQREPO_ROOT_DIR environment variable to set the path of an already existing SeqRepo directory. The default is /usr/local/share/seqrepo/latest.
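
The lookup described above amounts to reading an environment variable with a fallback. A minimal sketch, in which the function name `seqrepo_root` is hypothetical and only the variable name and default path come from this README:

```python
import os
from pathlib import Path

DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/latest"


def seqrepo_root(env=None):
    """Return SEQREPO_ROOT_DIR if set, otherwise the documented default."""
    env = os.environ if env is None else env
    return Path(env.get("SEQREPO_ROOT_DIR", DEFAULT_SEQREPO_ROOT))
```

Passing a dict instead of relying on `os.environ` makes the behavior easy to check, e.g. `seqrepo_root({"SEQREPO_ROOT_DIR": "/data/seqrepo/2021-01-29"})`.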

UTA

Variation Normalizer also uses Common Operations On Lots-of Sequences Tool (cool-seq-tool) which uses UTA as the underlying PostgreSQL database.

The following commands will likely need modification appropriate for the installation environment.

  1. Install PostgreSQL

  2. Create user and database.

    $ createuser -U postgres uta_admin
    $ createuser -U postgres anonymous
    $ createdb -U postgres -O uta_admin uta
    
  3. To install locally, from the variation/data directory:

export UTA_VERSION=uta_20210129.pgd.gz
curl -O http://dl.biocommons.org/uta/$UTA_VERSION
gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433

UTA Installation Issues

If you have trouble installing UTA, you can visit these two READMEs.

Connecting to the UTA database

To connect to the UTA database, you can use the default URL (postgresql://uta_admin@localhost:5433/uta/uta_20210129). If you use the default URL, you must set the password either with the UTA_PASSWORD environment variable or with the db_pwd parameter of the UTA class.

If you do not wish to use the default, you must set the UTA_DB_URL environment variable, which has the format driver://user:pass@host:port/database/schema.
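
To illustrate the driver://user:pass@host:port/database/schema format, here is a small sketch that splits such a URL into its parts with the standard library. The function name `parse_uta_db_url` is hypothetical and this is not how the package itself reads the variable.

```python
from urllib.parse import urlsplit


def parse_uta_db_url(url):
    """Split driver://user:pass@host:port/database/schema into a dict."""
    parts = urlsplit(url)
    path = parts.path.strip("/").split("/")
    if len(path) != 2 or not all(path):
        raise ValueError("expected a path of the form /database/schema")
    database, schema = path
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,  # None when the URL carries no password
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "schema": schema,
    }
```

Applied to the default URL above, the password comes back as `None`, which is why it has to be supplied separately in that case.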

Starting the Variation Normalization Service Locally

The gene-normalizer's DynamoDB and the UTA database must be running.

To start the service, run the following:

uvicorn variation.main:app --reload

Next, view the OpenAPI docs on your local machine: http://127.0.0.1:8000/variation

Init coding style tests

Code style is managed by flake8 and checked prior to commit.

We use pre-commit to run conformance tests.

This ensures:

  • Check code style
  • Check for added large files
  • Detect AWS Credentials
  • Detect Private Key

Before first commit run:

pre-commit install

Testing

From the root directory of the repository:

pytest tests/

