biocommons.seqrepo

Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Released under the Apache License, 2.0.

Features

  • Timestamped snapshots of read-only sequence repository

  • Space-efficient storage of sequences within a single snapshot and across snapshots

  • Bandwidth-efficient transfer of incremental updates

  • Fast fetching of sequence slices on chromosome-scale sequences

  • Precomputed digests that may be used as sequence aliases

  • Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences
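
The last two features, digests used as aliases and mappings of external aliases, can be illustrated with a toy in-memory store. This is a minimal sketch, not seqrepo's API: `ToyRepo`, its methods, and the aliases in the usage lines are hypothetical. It assumes the commonly cited SEGUID definition (base64 of the SHA-1 of the uppercased sequence, with the trailing padding removed).

```python
import base64
import hashlib


def seguid(seq: str) -> str:
    """Assumed SEGUID: base64 of SHA-1 of the uppercased sequence, '=' padding stripped."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


class ToyRepo:
    """Hypothetical store: each distinct sequence is kept once, keyed by its digest."""

    def __init__(self):
        self._seqs = {}     # digest -> sequence
        self._aliases = {}  # (namespace, alias) -> digest

    def store(self, seq, aliases):
        seq_id = seguid(seq)
        # setdefault keeps the first copy, so duplicates cost no extra storage
        self._seqs.setdefault(seq_id, seq)
        self._aliases[("seguid", seq_id)] = seq_id
        for namespace, alias in aliases:
            self._aliases[(namespace, alias)] = seq_id
        return seq_id

    def fetch(self, namespace, alias):
        return self._seqs[self._aliases[(namespace, alias)]]


repo = ToyRepo()
# Placeholder aliases, for illustration only
seq_id = repo.store("MDSPLREDDSQ", [("ns-a", "ID_1.1")])
repo.store("MDSPLREDDSQ", [("ns-b", "12345")])
print(len(repo._seqs))               # one stored sequence, several aliases
print(repo.fetch("ns-b", "12345"))
print(repo.fetch("seguid", seq_id))
```

Because the digest is computed from the sequence itself, adding the same sequence under a new alias only adds an alias record.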

These features are achieved by storing sequences non-redundantly and compressed, using an add-only journalled filesystem structure within a single snapshot, and by using hard links across snapshots. Each sequence is associated with one or more namespaced aliases such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <ncbi,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4> (all of which refer to the same sequence). The block gzipped format (BGZF) enables pysam to provide fast random access to compressed sequences.
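
To show why block-wise compression allows fast slicing, here is a simplified sketch using only the standard library. It is not BGZF and does not use pysam: it compresses fixed-size chunks independently with `zlib`, so a slice can be read by decompressing only the blocks that overlap it. The function names and the block size are assumptions made for illustration.

```python
import zlib

BLOCK_SIZE = 1 << 16  # uncompressed bytes per block (illustrative choice)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compress each fixed-size chunk independently so it can be read on its own."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def fetch_slice(blocks: list, start: int, end: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[start:end] (0 <= start <= end), touching only overlapping blocks."""
    if start >= end:
        return b""
    first = start // block_size
    last = (end - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    offset = first * block_size  # position of chunk[0] in the original data
    return chunk[start - offset:end - offset]


data = b"ACGT" * 50000
blocks = compress_blocks(data, block_size=1000)
# This slice spans two blocks; the other 198 blocks are never decompressed
assert fetch_slice(blocks, 995, 1010, block_size=1000) == data[995:1010]
```

The cost of a fetch depends on the length of the slice rather than on the length of the whole sequence, which is what matters for chromosome-scale records.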

For more information, see doc/design.rst.
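
The use of hard links across snapshots can be sketched as follows. This is an illustration under stated assumptions, not seqrepo's implementation: the `snapshot` helper, directory names, and file names are hypothetical, and it assumes a filesystem that supports hard links. Sharing files this way is only safe because existing files are never modified in place (the add-only structure).

```python
import os
import tempfile


def snapshot(prev_dir: str, new_dir: str) -> None:
    """Create new_dir whose files are hard links to the regular files in prev_dir."""
    os.makedirs(new_dir)
    for name in os.listdir(prev_dir):
        src = os.path.join(prev_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(new_dir, name))


with tempfile.TemporaryDirectory() as root:
    snap1 = os.path.join(root, "snapshot-1")
    os.makedirs(snap1)
    with open(os.path.join(snap1, "seqs-a.bin"), "wb") as fh:
        fh.write(b"ACGT" * 1000)

    snap2 = os.path.join(root, "snapshot-2")
    snapshot(snap1, snap2)
    # New data is written to a new file that exists only in the new snapshot
    with open(os.path.join(snap2, "seqs-b.bin"), "wb") as fh:
        fh.write(b"TTGA" * 1000)

    old = os.stat(os.path.join(snap1, "seqs-a.bin"))
    new = os.stat(os.path.join(snap2, "seqs-a.bin"))
    shared = old.st_ino == new.st_ino  # same inode: the data is stored once
    link_count = new.st_nlink
    snap1_files = sorted(os.listdir(snap1))
    snap2_files = sorted(os.listdir(snap2))

print(shared, link_count, snap1_files, snap2_files)
```

Each snapshot looks like a complete directory, but unchanged files occupy disk space only once, and a transfer tool that understands hard links only needs to send the new files.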

Deployment Scenarios

  • Available now: Local read-only archive, mirrored from public site, accessed via Python API (see Mirroring documentation)

  • Available now: Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation).

  • Planned: Docker-based data-only container that may be linked to application container

  • Planned: Docker image that provides REST interface for local or remote access

Requirements

Reading a sequence repository requires several packages, all of which are available from PyPI. Installation should be as simple as pip install biocommons.seqrepo.

Writing sequence files also requires bgzip, which is provided in the htslib repository. Ubuntu users should install the tabix package with sudo apt install tabix.
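
Before writing sequences, it may help to confirm that bgzip is reachable. The following is a small sketch using only the standard library; `find_bgzip` is a hypothetical helper, not part of seqrepo.

```python
import shutil


def find_bgzip():
    """Return the full path to bgzip if it is on PATH, otherwise None."""
    return shutil.which("bgzip")


path = find_bgzip()
if path is None:
    print("bgzip not found; on Ubuntu, try: sudo apt install tabix")
else:
    print("bgzip found at", path)
```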

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Quick Start

On Ubuntu 16.04:

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix
$ pip install biocommons.seqrepo
$ seqrepo pull -i 20160906
$ seqrepo show-status -i 20160906
seqrepo 0.2.3.post3.dev8+nb8298bd62283
root directory: /usr/local/share/seqrepo/20160906, 7.9 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 773587 sequences, 93051609959 residues, 192 files
aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

$ seqrepo start-shell -i 20160906
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

# N.B. The following output is edited
$ seqrepo export -i 20160906 | head -n100
>sha1:9a2acba3dd7603f... seguid:mirLo912A/MppLuS1cUyFMduLUQ ensembl-85:GENSCAN00000003538 sh:---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 ...
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>sha1:ca996b263102b1... seguid:yplrJjECsVqQufeYy0HkDD16z58 ncbi:XR_001733142.1 sh:---WkVUs3IT3_ZZM-ReDjypLo6d_vJx6 gi:1034683989
TTTACGTCTTTCTGGGAATTTATACTGGAAGTATACTTACCTCTGTGCAAAATTGCAAATATATAAGGTAATTCATTCCAGCATTGCTTATATTAGGTTG
AACTATGTAACATTGACATTGATGTGAATCAAAAATGGTTGAAGGCTGGCAGTTTCATATGATTCAGCCTATAATAGCAAAAGATTGAAAAAATCCATTA
ATACAGTGTGGTTCAAAAAAATTTGTTGTATCAAGGTAAAATAATAGCCTGAATATAATTAAGATAGTCTGTGTATACATCGATGAAAACATTGCCAATA

See Installation and Mirroring for more information.
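
The header lines in the export output above consist of space-separated namespace:alias tokens. A minimal sketch of parsing one such line is shown below; `parse_header` is a hypothetical helper, and the token layout is inferred from the sample output rather than from a format specification.

```python
def parse_header(line: str) -> list:
    """Return (namespace, alias) pairs from a FASTA header of namespace:alias tokens."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    pairs = []
    for token in line[1:].split():
        namespace, sep, alias = token.partition(":")
        if sep and alias:  # skip tokens without an alias, such as a bare "..."
            pairs.append((namespace, alias))
    return pairs


header = ">seguid:yplrJjECsVqQufeYy0HkDD16z58 ncbi:XR_001733142.1 gi:1034683989"
for namespace, alias in parse_header(header):
    print(namespace, alias)
```

Returning a list of pairs rather than a dictionary keeps every alias even if a namespace appears more than once on a line.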
