Skip to main content

Non-redundant, compressed, journalled, file-based storage for biological sequences

Project description

biocommons.seqrepo

Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions of large sequences.

A “fully-qualified” identifier includes a namespace to disambiguate accessions (e.g., “1” in GRCh37 and GRCh38). If the namespace is provided, seqrepo uses it as-is. If the namespace is not provided and the unqualified identifier refers to a unique sequence, it is returned; otherwise, ambiguous identifiers will raise an error.

SeqRepo favors identifiers from [identifiers.org](identifiers.org) whenever available. Examples include [refseq](https://registry.identifiers.org/registry/refseq) and [ensembl](https://registry.identifiers.org/registry/ensembl).

seqrepo-rest-service provides a REST interface and docker image.

Released under the Apache License, 2.0.

ci_rel | cov | pypi_rel | ChangeLog

Citation

Hart RK, Prlić A (2020)
SeqRepo: A system for managing local collections of biological sequences.

Features

  • Timestamped, read-only snapshots.

  • Space-efficient storage of sequences within a single snapshot and across snapshots.

  • Bandwidth-efficient transfer incremental updates.

  • Fast fetching of sequence slices on chromosome-scale sequences.

  • Precomputed digests that may be used as sequence aliases.

  • Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.

Deployments Scenarios

Technical Quick Peek

Within a single snapshot, sequences are stored non-redundantly and compressed in an add-only journalled filesystem structure. A truncated SHA-512 hash is used to assess uniquness and as an internal id. (The digest is truncated for space efficiency.)

Sequences are compressed using the Block GZipped Format (BGZF)), which enables pysam to provide fast random access to compressed sequences. (Variable compression typically makes random access impossible.)

Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating redundant transfers (e.g., with rsync).

Each sequence id is associated with a namespaced alias in a sqlite database. Such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <NCBI,NP_004009.1>, <gi,5032303>, <ensembl-75ENSP00000354464>, <ensembl-85,ENSP00000354464.4>. The sqlite database is mutable across releases.

For calibration, recent releases that include 3 human genome assemblies (including patches), and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consumes approximately 8GB. The minimum marginal size for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).

For more information, see docs/design.rst.

Requirements

Reading a sequence repository requires several Python packages, all of which are available from pypi. Installation should be as simple as pip install biocommons.seqrepo.

Writing sequence files also requires bgzip, which provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Mac Developers If you get “xcrun: error: invalid active developer path”, you need to install XCode. See this StackOverflow answer.

Quick Start

On Ubuntu 16.04:

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix
$ pip install seqrepo
$ sudo mkdir /usr/local/share/seqrepo
$ sudo chown $USER /usr/local/share/seqrepo
$ seqrepo pull -i 2018-11-26
$ seqrepo show-status -i 2018-11-26
seqrepo 0.2.3.post3.dev8+nb8298bd62283
root directory: /usr/local/share/seqrepo/2018-11-26, 7.9 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 773587 sequences, 93051609959 residues, 192 files
aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

# Simple Pythonic interface to sequences
>> from biocommons.seqrepo import SeqRepo
>> sr = SeqRepo("/usr/local/share/seqrepo/latest")
>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'

# Or, use the seqrepo shell for even easier access
$ seqrepo start-shell -i 2018-11-26
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

# N.B. The following output is edited for simplicity
$ seqrepo export -i 2018-11-26 | head -n100
>SHA1:9a2acba3dd7603f... SEGUID:mirLo912A/MppLuS1cUyFMduLUQ Ensembl-85:GENSCAN00000003538 ...
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>SHA1:ca996b263102b1... SEGUID:yplrJjECsVqQufeYy0HkDD16z58 NCBI:XR_001733142.1 gi:1034683989
TTTACGTCTTTCTGGGAATTTATACTGGAAGTATACTTACCTCTGTGCAAAATTGCAAATATATAAGGTAATTCATTCCAGCATTGCTTATATTAGGTTG
AACTATGTAACATTGACATTGATGTGAATCAAAAATGGTTGAAGGCTGGCAGTTTCATATGATTCAGCCTATAATAGCAAAAGATTGAAAAAATCCATTA
ATACAGTGTGGTTCAAAAAAATTTGTTGTATCAAGGTAAAATAATAGCCTGAATATAATTAAGATAGTCTGTGTATACATCGATGAAAACATTGCCAATA

See Installation and Mirroring for more information.

Environment Variables

SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite query response caching. It defaults to 1 million but can also be set to “none” to be unlimited.

Developing

Here’s how to get started developing:

python3.6 -m venv
source venv/bin/activate
pip install -U setuptools pip
make develop

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biocommons.seqrepo-0.6.4.tar.gz (84.4 kB view details)

Uploaded Source

Built Distribution

biocommons.seqrepo-0.6.4-py2.py3-none-any.whl (36.0 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file biocommons.seqrepo-0.6.4.tar.gz.

File metadata

  • Download URL: biocommons.seqrepo-0.6.4.tar.gz
  • Upload date:
  • Size: 84.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.8.7

File hashes

Hashes for biocommons.seqrepo-0.6.4.tar.gz
Algorithm Hash digest
SHA256 d726fb721218bf5efdb27ee552640729275a2617f1f847172db92e8aedb2f39b
MD5 68661b71447f39d82c5df510e777b803
BLAKE2b-256 ceb11891cd264596a803c6899f91776833cebd3894633822063d11a263db39bd

See more details on using hashes here.

File details

Details for the file biocommons.seqrepo-0.6.4-py2.py3-none-any.whl.

File metadata

  • Download URL: biocommons.seqrepo-0.6.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 36.0 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.8.7

File hashes

Hashes for biocommons.seqrepo-0.6.4-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 a3e44b581ded06246ac3e190748e6f58f8faaebadbcc0488df61a4f100803aa5
MD5 793ada28b7a7d563ac677e25720cbc14
BLAKE2b-256 2f35b82db3889e8aeb3ed941bf29e1ba05662e0ba97f1f38daaad21cc4d799e1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page