Skip to main content

Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer incremental snapshots.

Project description

Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer incremental snapshots.

ci_rel pypi_rel

Features

  • Space-efficient storage of sequences within a release and across releases

  • Bandwidth-efficient transfer incremental updates

  • Fast fetching of sequence slices on chromosome-scale sequences

  • Provenance data regarding sequence sources and accessions

  • Precomputed digests that may be used as sequence aliases

For more information, see doc/design.rst.

Expected deployment cases

  • Local access via Python package, using a repo rsync’d from a remote source or loaded locally

  • Docker image with REST interface

Installation

seqrepo has been tested only on Ubuntu 14.04 and 16.04. It requires separate installation of the tabix package. It requires sqlite3 >= 3.8.0, which likely precludes early Ubuntu distributions.

On Ubuntu 16.04:

sudo apt install tabix
pip install seqrepo

Command line usage

seqrepo includes a command line interface for loading, fetching, and exporting sequences.

Loading

$ SEQREPO_ROOT=/opt/seqrepo/data/2016/0818

$ seqrepo -d $SEQREPO_ROOT init

$ seqrepo -v -d $SEQREPO_ROOT load-fasta -n me fasta1.gz fasta2.gz fasta3.gz

$ seqrepo -v -d $SEQREPO_ROOT status
seqrepo 0.1.0
root directory: /opt/seqrepo/data/2016/0818, 0.2 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 3 files, 33080 sequences, 110419437 residues
aliases: 165481 aliases, 165481 current, 5 namespaces, 33080 sequences

Exporting all sequences

$ seqrepo -v -d $SEQREPO_ROOT export | head
>me:sequence1 seguid:EqjiLe... md5:04e8c3c75... sha512:000a70c470f6... sha1:12a8e22d...
GTACGCCCCCTCCCCCCGTCCCTATCGGCAGAACCGGAGGCCAACCTTCGCGATCCCTTGCTGCGGGCCCGGAGATCAAACGTGGCCCGCCCCCGGCAGG
GCACAGCGCGCTGGGCAACCGCGATCCGGCGCCGGACTGGAGGGGTCGATGCGCGGCGCGCTGGGGCGCACAGGGGACGGAGCCCGGGTCTTGCTCCCCA

API Usage

$ seqrepo -v -d $SEQREPO_ROOT shell

In [10]: %time sr.fetch("NC_000001.10", start=6000000, end=6000200)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 492 µs
Out[10]: 'GGACAACAGAGGATGAGGTGGGGCCAGCAGAGGGACAGAGAAGAGCTGCCTGCCCTGGAACAGGCAGAAAGCATCCCACGTGCAAGAAAAAGTAGGCCAGCTAGACTTAAAATCAGAACTACCGCTCATCAAAAGATAGTGTAACATTTGGGGTGCTATAATTTTAACATGTCCCCCAAAAGGCATGTGTTGGAAATTTA'


# iterate over unique sequences:
for srec, arec in sr:
    pprint.pprint(srec)
    pprint.pprint(arec)

# results in records like:
{'added': '2016-08-18 17:40:49',
 'alpha': 'ACGT',
 'len': 2627,
 'relpath': '2016/08/18/1740/1471542046.008535.fa.bgz',
 'seq': 'GTACGCCC...',
 'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518'}

and

[{'added': '2016-08-18 17:40:49',
  'alias': '04e8c3c753dad9c19741cdf81ec2b3d5',
  'is_current': 1,
  'namespace': 'md5',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144388},
 {'added': '2016-08-18 17:40:49',
  'alias': 'EqjiLeXFeeBT6LIMnbCFQxNqHD8',
  'is_current': 1,
  'namespace': 'seguid',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144389},
 {'added': '2016-08-18 17:40:49',
  'alias': '12a8e22de5c579e053e8b20c9db08543136a1c3f',
  'is_current': 1,
  'namespace': 'sha1',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144387},
 {'added': '2016-08-18 17:40:49',
  'alias': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'is_current': 1,
  'namespace': 'sha512',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144386},
 {'added': '2016-08-18 17:40:49',
  'alias': 'NM_013305.4',
  'is_current': 1,
  'namespace': 'ncbi',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144390}]

Fetching existing sequence repositories

TO BE WRITTEN

(General idea: Distribute repository with snapshots via rsync server from public site for manual installation, and use the same source to seed a docker container.)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biocommons.seqrepo-0.2.1.tar.gz (38.6 kB view details)

Uploaded Source

Built Distributions

biocommons.seqrepo-0.2.1-py3.5.egg (49.9 kB view details)

Uploaded Source

biocommons.seqrepo-0.2.1-py2.py3-none-any.whl (26.7 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file biocommons.seqrepo-0.2.1.tar.gz.

File metadata

File hashes

Hashes for biocommons.seqrepo-0.2.1.tar.gz
Algorithm Hash digest
SHA256 1f9eb50d66ff5e0f77a27349cbb7a6c153e1e6819139027f1c4db7caaa626f59
MD5 b167067c34716aae649dc7a730c9098c
BLAKE2b-256 8ebf15dce4f0a7815029fa4c5b5d7ab63478eba76179fd3637bce94ee06eb74a

See more details on using hashes here.

File details

Details for the file biocommons.seqrepo-0.2.1-py3.5.egg.

File metadata

File hashes

Hashes for biocommons.seqrepo-0.2.1-py3.5.egg
Algorithm Hash digest
SHA256 6bada40853797dbb5e25e6c943fc93216e2ad208b182df5054e0a1a01ee36bb9
MD5 c583c52cebe1ba870fb9220add0bdebd
BLAKE2b-256 3e4a7b76f9f7dc9e51c7a058fcd4a545065a3175ab48c5068dac6008f16431dd

See more details on using hashes here.

File details

Details for the file biocommons.seqrepo-0.2.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for biocommons.seqrepo-0.2.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 2c27c67752444e84580fc5b7418de3a9695b2788f6a70932bb88ae4e37453cab
MD5 52b41e0157bba33eda6d0f2ee27739bc
BLAKE2b-256 7e30ac288ca536185f13902baa21e58140425f13122841071a0f28043fc7383a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page