
Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer incremental snapshots.


biocommons.seqrepo

Important: seqrepo 0.4.0 was released on 2018-10-21. 0.4.0 provides a transition path for breaking changes that will occur in the 0.5 series. See 0.4.0 Changelog for details.



Released under the Apache License, 2.0.


Features

  • Timestamped, read-only snapshots.

  • Space-efficient storage of sequences within a single snapshot and across snapshots.

  • Bandwidth-efficient transfer of incremental updates.

  • Fast fetching of sequence slices on chromosome-scale sequences.

  • Precomputed digests that may be used as sequence aliases.

  • Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.

Deployment Scenarios

  • Local read-only archive, mirrored from a public site and accessed via the Python API (see Mirroring documentation).

  • Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation).

  • Docker-based data-only container that may be linked to application container.

  • Planned: Docker image that provides a REST interface for local or remote access.

Technical Quick Peek

Within a single snapshot, sequences are stored non-redundantly and compressed in an add-only journalled filesystem structure. A truncated SHA-512 hash is used to assess uniqueness and as an internal id. (The digest is truncated for space efficiency.)
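
As an illustration of the idea, here is a minimal sketch of a truncated SHA-512 digest used as a sequence id. The truncation length (24 bytes) and the URL-safe base64 encoding are assumptions made for this example; the text above only states that the digest is truncated. The function name is hypothetical.

```python
import base64
import hashlib


def truncated_sha512(seq: str, nbytes: int = 24) -> str:
    """Return the first `nbytes` bytes of SHA-512(seq), URL-safe base64 encoded.

    Assumption: 24 bytes and base64 are illustrative choices, not taken
    from the text above.
    """
    digest = hashlib.sha512(seq.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:nbytes]).decode("ascii")


# Identical sequences always map to the same id, so a repeated
# sequence can be detected and stored only once.
seen = {}
for name, seq in [("a", "ACGT"), ("b", "ACGT"), ("c", "ACGTT")]:
    seen.setdefault(truncated_sha512(seq), []).append(name)
print(len(seen))  # number of distinct sequences
```

Because the id depends only on the sequence content, uniqueness checks reduce to a dictionary or index lookup on the digest.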

Sequences are compressed using the Block GZipped Format (BGZF), which enables pysam to provide fast random access to compressed sequences. (Variable compression typically makes random access impossible.)
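
To show why block-wise compression permits random access, the following conceptual sketch compresses data in independent fixed-size blocks and decompresses only the blocks that overlap a requested slice. It uses plain zlib from the standard library and is not the BGZF file format or the pysam API; the function names and the in-memory list of blocks are hypothetical.

```python
import zlib
from typing import List

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative choice)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Compress `data` as a list of independently decompressible blocks."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def fetch(blocks: List[bytes], start: int, end: int,
          block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[start:end], decompressing only the blocks it touches."""
    if start >= end:
        return b""
    first = start // block_size
    last = (end - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    offset = start - first * block_size
    return chunk[offset:offset + (end - start)]
```

With a single compressed stream, reading position N requires decompressing everything before it; with independent blocks, the cost of a slice depends on the slice length rather than on its position in the sequence.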

Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating redundant transfers (e.g., with rsync).
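
A minimal sketch of the hardlink idea, assuming a filesystem that supports hardlinks: a new snapshot directory is populated with links to the files of an earlier one, so the file data exists once on disk. The directory names, the file name, and the helper function are hypothetical and do not reflect seqrepo's actual layout or code.

```python
import os
import tempfile


def link_snapshot(prev_dir: str, new_dir: str) -> None:
    """Populate new_dir with hardlinks to every regular file in prev_dir."""
    os.makedirs(new_dir, exist_ok=True)
    for name in os.listdir(prev_dir):
        src = os.path.join(prev_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(new_dir, name))


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, "snapshot-a")
    new = os.path.join(root, "snapshot-b")
    os.makedirs(old)
    with open(os.path.join(old, "seqs.fa.bgz"), "wb") as fh:
        fh.write(b"placeholder bytes")
    link_snapshot(old, new)
    # Both paths refer to the same underlying file.
    print(os.path.samefile(os.path.join(old, "seqs.fa.bgz"),
                           os.path.join(new, "seqs.fa.bgz")))
```

This only works safely because the files are never modified after they are written: a change through one path would otherwise be visible in every snapshot that links to it.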

Each sequence id is associated with namespaced aliases in a sqlite database, such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <NCBI,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4>. The sqlite database is mutable across releases.
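
The mapping from namespaced aliases to a sequence id can be pictured with a small in-memory sqlite table. This is a sketch only: the table name, column names, and the placeholder sequence id are hypothetical and are not seqrepo's actual schema. The alias values are the examples given above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE seqalias (
        seq_id    TEXT NOT NULL,
        namespace TEXT NOT NULL,
        alias     TEXT NOT NULL,
        UNIQUE (namespace, alias)
    )
    """
)

SEQ_ID = "example-internal-id"  # placeholder, not a real digest
conn.executemany(
    "INSERT INTO seqalias (seq_id, namespace, alias) VALUES (?, ?, ?)",
    [
        (SEQ_ID, "seguid", "rvvuhY0FxFLNwf10FXFIrSQ7AvQ"),
        (SEQ_ID, "NCBI", "NP_004009.1"),
        (SEQ_ID, "gi", "5032303"),
        (SEQ_ID, "ensembl-75", "ENSP00000354464"),
        (SEQ_ID, "ensembl-85", "ENSP00000354464.4"),
    ],
)


def find_seq_id(conn, namespace, alias):
    """Return the sequence id for a namespaced alias, or None if unknown."""
    row = conn.execute(
        "SELECT seq_id FROM seqalias WHERE namespace = ? AND alias = ?",
        (namespace, alias),
    ).fetchone()
    return row[0] if row else None


print(find_seq_id(conn, "NCBI", "NP_004009.1"))
```

Keeping aliases in a database separate from the immutable sequence files means new aliases can be added in a later release without rewriting any sequence data.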

For calibration, recent releases that include 3 human genome assemblies (including patches) and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consume approximately 8GB. The minimum marginal size for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).
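
As a rough worked example of these figures, the sketch below estimates a lower bound on disk usage for several snapshots kept side by side. It assumes the first snapshot costs about 8GB and every additional one adds at least about 2GB; real usage will be higher whenever a snapshot introduces new sequence files. The function name is hypothetical.

```python
def min_total_size_gb(n_snapshots: int,
                      first_gb: float = 8.0,
                      marginal_gb: float = 2.0) -> float:
    """Lower bound on total size: one full snapshot plus per-snapshot databases."""
    if n_snapshots < 1:
        return 0.0
    return first_gb + marginal_gb * (n_snapshots - 1)


for n in (1, 2, 4):
    print(n, min_total_size_gb(n))
```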

For more information, see doc/design.rst.

Requirements

Reading a sequence repository requires several packages, all of which are available from PyPI. Installation should be as simple as pip install biocommons.seqrepo.

Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployment are done on Ubuntu. Other systems may work but are not tested. Patches to get other systems working are welcome.

Quick Start

On Ubuntu 16.04:

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix
$ pip install biocommons.seqrepo
$ sudo mkdir /usr/local/share/seqrepo
$ sudo chown $USER /usr/local/share/seqrepo
$ seqrepo pull -i 20160906
$ seqrepo show-status -i 20160906
seqrepo 0.2.3.post3.dev8+nb8298bd62283
root directory: /usr/local/share/seqrepo/20160906, 7.9 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 773587 sequences, 93051609959 residues, 192 files
aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

$ seqrepo start-shell -i 20160906
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

# N.B. The following output is edited for simplicity
$ seqrepo export -i 20160906 | head -n100
>SHA1:9a2acba3dd7603f... SEGUID:mirLo912A/MppLuS1cUyFMduLUQ Ensembl-85:GENSCAN00000003538 ...
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>SHA1:ca996b263102b1... SEGUID:yplrJjECsVqQufeYy0HkDD16z58 NCBI:XR_001733142.1 gi:1034683989
TTTACGTCTTTCTGGGAATTTATACTGGAAGTATACTTACCTCTGTGCAAAATTGCAAATATATAAGGTAATTCATTCCAGCATTGCTTATATTAGGTTG
AACTATGTAACATTGACATTGATGTGAATCAAAAATGGTTGAAGGCTGGCAGTTTCATATGATTCAGCCTATAATAGCAAAAGATTGAAAAAATCCATTA
ATACAGTGTGGTTCAAAAAAATTTGTTGTATCAAGGTAAAATAATAGCCTGAATATAATTAAGATAGTCTGTGTATACATCGATGAAAACATTGCCAATA

See Installation and Mirroring for more information.
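
The export output in the Quick Start lists each sequence's aliases on the FASTA header line as namespace:alias tokens. A minimal sketch of splitting such a header into pairs follows; the function name is hypothetical, and the header is copied from the edited output above (including its truncated SHA1 value).

```python
def parse_header(line: str):
    """Split a FASTA header of namespace:alias tokens into (namespace, alias) pairs."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    pairs = []
    for token in line[1:].split():
        namespace, sep, alias = token.partition(":")
        if sep:  # skip tokens without a colon, such as a trailing "..."
            pairs.append((namespace, alias))
    return pairs


header = (">SHA1:ca996b263102b1... SEGUID:yplrJjECsVqQufeYy0HkDD16z58 "
          "NCBI:XR_001733142.1 gi:1034683989")
for namespace, alias in parse_header(header):
    print(namespace, alias)
```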


