biocommons.seqrepo

Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Released under the Apache License, 2.0.

Features

  • Timestamped snapshots of a read-only sequence repository

  • Space-efficient storage of sequences within a single snapshot and across snapshots

  • Bandwidth-efficient transfer of incremental updates

  • Fast fetching of sequence slices on chromosome-scale sequences

  • Precomputed digests that may be used as sequence aliases

  • Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences

These features are achieved by storing sequences non-redundantly and compressed, using an add-only journalled filesystem structure within a single snapshot, and by using hard links across snapshots. Each sequence is associated with one or more namespaced aliases such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <ncbi,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4> (all of which refer to the same sequence). Block gzipped format (BGZF) enables pysam to provide fast random access to compressed sequences.
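To make the alias model concrete, here is a minimal sketch of non-redundant storage keyed by a sequence digest. It assumes the SEGUID convention (base64-encoded SHA-1 of the uppercased sequence, with padding removed). ToySeqStore, its methods, and the EXAMPLE_* aliases are hypothetical names used only for illustration; this is not seqrepo's storage layer or API.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return the SEGUID: base64 SHA-1 of the uppercased sequence, '=' padding removed."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


class ToySeqStore:
    """Hypothetical in-memory stand-in, for illustration only."""

    def __init__(self):
        self.sequences = {}  # digest -> sequence, each distinct sequence stored once
        self.aliases = {}    # (namespace, alias) -> digest

    def store(self, sequence, namespaced_aliases):
        seq_id = seguid(sequence)
        # setdefault keeps the existing entry, so a repeated sequence adds no data
        self.sequences.setdefault(seq_id, sequence)
        self.aliases[("seguid", seq_id)] = seq_id
        for namespace, alias in namespaced_aliases:
            self.aliases[(namespace, alias)] = seq_id
        return seq_id

    def fetch(self, namespace, alias, start=None, end=None):
        return self.sequences[self.aliases[(namespace, alias)]][start:end]


store = ToySeqStore()
# The same sequence arrives twice under different (made-up) aliases.
store.store("MTEYKLVVVG", [("ncbi", "EXAMPLE_1.1")])
store.store("MTEYKLVVVG", [("ensembl-85", "EXAMPLE_2.4")])
```

Both aliases resolve to the same stored sequence, and the digest itself serves as a third alias in the seguid namespace.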

For more information, see doc/design.rst.
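The hard-link approach to sharing data across snapshots can be sketched with the standard library alone. The example below is an illustration under assumptions, not seqrepo's own snapshot code: the snapshot() helper, the file names, and the second snapshot name 20161004 are all hypothetical, and both directories must live on the same filesystem for hard links to work.

```python
import os
import tempfile


def snapshot(src_dir: str, dst_dir: str) -> None:
    """Create dst_dir as a hard-linked mirror of src_dir (file data is shared, not copied)."""
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target_root, name))


base = tempfile.mkdtemp()
old = os.path.join(base, "20160906")
new = os.path.join(base, "20161004")  # hypothetical later snapshot

# Populate the first snapshot with one (fake) sequence file.
os.makedirs(os.path.join(old, "sequences"))
with open(os.path.join(old, "sequences", "a.fa.bgz"), "wb") as fh:
    fh.write(b"existing data")

snapshot(old, new)

# Add-only update: the new snapshot gains a file; existing files are never rewritten.
with open(os.path.join(new, "sequences", "b.fa.bgz"), "wb") as fh:
    fh.write(b"new data")

old_stat = os.stat(os.path.join(old, "sequences", "a.fa.bgz"))
new_stat = os.stat(os.path.join(new, "sequences", "a.fa.bgz"))
shared = old_stat.st_ino == new_stat.st_ino  # same inode: stored once on disk
```

Because existing files are only ever linked and new data only ever added, the later snapshot costs roughly the size of the new files, and a transfer tool that preserves hard links only needs to send those additions.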

Deployment Scenarios

  • Available now: Local read-only archive, mirrored from public site, accessed via Python API (see Mirroring documentation)

  • Available now: Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation)

  • Planned: Docker-based data-only container that may be linked to application container

  • Planned: Docker image that provides REST interface for local or remote access

Requirements

Reading a sequence repository requires several packages, all of which are available from PyPI. Installation should be as simple as pip install biocommons.seqrepo.

Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Quick Start

On Ubuntu 16.04:

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix
$ pip install biocommons.seqrepo
$ seqrepo pull
$ seqrepo -i 20160906 show-status
seqrepo 0.2.3.post3.dev8+nb8298bd62283
root directory: /usr/local/share/seqrepo/20160906, 7.9 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 773587 sequences, 93051609959 residues, 192 files
aliases: 5579572 aliases, 5480085 current, 26 namespaces, 773587 sequences

$ seqrepo -i 20160906 start-shell
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

See Installation and Mirroring for more information.
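The slice in the shell session above, [780000:780020], uses ordinary Python indexing: 0-based and half-open, so it returns 20 residues. Genomic positions are often quoted as 1-based, inclusive ranges, so a small conversion helper can prevent off-by-one errors. The sketch below uses a plain string as a stand-in for a fetched sequence; one_based_to_slice is a hypothetical helper, not part of seqrepo.

```python
def one_based_to_slice(first: int, last: int) -> slice:
    """Convert a 1-based, inclusive range to a 0-based, half-open Python slice."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return slice(first - 1, last)


# Stand-in for a fetched sequence; any str slices the same way.
toy_sequence = "ACGTACGTAC"
region = one_based_to_slice(3, 6)   # residues 3..6 in 1-based numbering
subsequence = toy_sequence[region]  # equivalent to toy_sequence[2:6]
```

Under this convention, the 1-based range 780001..780020 corresponds to the slice [780000:780020] used in the session.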
