Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer incremental snapshots.

Features

  • Space-efficient storage of sequences within a release and across releases

  • Bandwidth-efficient transfer of incremental updates

  • Fast fetching of sequence slices on chromosome-scale sequences

  • Provenance data regarding sequence sources and accessions

  • Precomputed digests that may be used as sequence aliases

For more information, see doc/design.rst.
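
The digest aliases mentioned in the feature list can be reproduced with the Python standard library. The following is a minimal sketch, not seqrepo's own code: the function name `sequence_digests` is hypothetical, and upper-casing the sequence before hashing is an assumption. It relies on two relationships visible in the example records under API Usage: the sha512 alias equals the `seq_id`, and the seguid alias is the base64-encoded SHA-1 digest with padding removed.

```python
import base64
import hashlib


def sequence_digests(seq):
    """Return digest-based aliases for a sequence, keyed by namespace.

    Hypothetical helper for illustration; not part of the seqrepo API.
    """
    # Assumption: sequences are normalized to upper case before hashing.
    data = seq.upper().encode("ascii")
    sha1 = hashlib.sha1(data)
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": sha1.hexdigest(),
        "sha512": hashlib.sha512(data).hexdigest(),
        # seguid: base64 of the raw SHA-1 digest, trailing "=" padding removed.
        "seguid": base64.b64encode(sha1.digest()).decode("ascii").rstrip("="),
    }


digests = sequence_digests("GTACGCCC")
for namespace, alias in sorted(digests.items()):
    print(namespace, alias)
```

Because each alias is derived only from the sequence content, two repositories that hold the same sequence would assign it the same digests, which is what makes them usable as aliases.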

Expected deployment cases

  • Local access via Python package, using a repo rsync’d from a remote source or loaded locally

  • Docker image with REST interface

Installation

seqrepo has been tested only on Ubuntu 14.04 and 16.04. It requires sqlite3 >= 3.8.0, which likely rules out older Ubuntu releases, and the tabix package, which must be installed separately.

On Ubuntu 16.04:

sudo apt install tabix
pip install seqrepo
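
Before installing on another platform, it may help to check the two stated requirements. The sketch below is an illustration only (the helper name `check_prerequisites` is hypothetical and Python 3 is assumed): it compares the SQLite library version that Python's `sqlite3` module is linked against with 3.8.0, and looks for a `tabix` executable on `PATH`.

```python
import shutil
import sqlite3


def check_prerequisites(min_sqlite=(3, 8, 0)):
    """Report whether the stated external requirements appear to be met."""
    return {
        # Version of the SQLite library that Python's sqlite3 module uses.
        "sqlite_ok": sqlite3.sqlite_version_info >= min_sqlite,
        # True if a tabix executable is found on PATH.
        "tabix_found": shutil.which("tabix") is not None,
    }


print("sqlite3 library version:", sqlite3.sqlite_version)
print(check_prerequisites())
```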

Command line usage

seqrepo includes a command line interface for loading, fetching, and exporting sequences.

Loading

$ SEQREPO_ROOT=/opt/seqrepo/data/2016/0818

$ seqrepo -d $SEQREPO_ROOT init

$ seqrepo -v -d $SEQREPO_ROOT load-fasta -n me fasta1.gz fasta2.gz fasta3.gz

$ seqrepo -v -d $SEQREPO_ROOT status
seqrepo 0.1.0
root directory: /opt/seqrepo/data/2016/0818, 0.2 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 3 files, 33080 sequences, 110419437 residues
aliases: 165481 aliases, 165481 current, 5 namespaces, 33080 sequences
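
The same loading steps can be scripted. The sketch below only assembles the command lines shown above and, optionally, runs them with `subprocess`; the helper names `loading_commands` and `run_all` are hypothetical, and it uses no seqrepo options beyond those that appear in the commands above.

```python
import subprocess


def loading_commands(root, namespace, fasta_paths):
    """Build argument lists mirroring the init, load-fasta, and status commands."""
    base = ["seqrepo", "-v", "-d", root]
    return [
        ["seqrepo", "-d", root, "init"],
        base + ["load-fasta", "-n", namespace] + list(fasta_paths),
        base + ["status"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = loading_commands(
    "/opt/seqrepo/data/2016/0818", "me", ["fasta1.gz", "fasta2.gz", "fasta3.gz"]
)
for argv in commands:
    print(" ".join(argv))
# Call run_all(commands) to execute them; this requires seqrepo on PATH.
```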

Exporting all sequences

$ seqrepo -v -d $SEQREPO_ROOT export | head
>me:sequence1 seguid:EqjiLe... md5:04e8c3c75... sha512:000a70c470f6... sha1:12a8e22d...
GTACGCCCCCTCCCCCCGTCCCTATCGGCAGAACCGGAGGCCAACCTTCGCGATCCCTTGCTGCGGGCCCGGAGATCAAACGTGGCCCGCCCCCGGCAGG
GCACAGCGCGCTGGGCAACCGCGATCCGGCGCCGGACTGGAGGGGTCGATGCGCGGCGCGCTGGGGCGCACAGGGGACGGAGCCCGGGTCTTGCTCCCCA
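
Each exported header carries the aliases of a sequence as `namespace:alias` tokens. A minimal sketch for splitting such a header follows, assuming the tokens are whitespace-separated as in the sample above; the function name `parse_export_header` is hypothetical, and the truncated aliases are copied as shown.

```python
def parse_export_header(line):
    """Split an exported FASTA header into (namespace, alias) pairs."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % line)
    pairs = []
    for token in line[1:].split():
        # Split on the first colon only, in case an alias itself contains one.
        namespace, _, alias = token.partition(":")
        pairs.append((namespace, alias))
    return pairs


header = ">me:sequence1 seguid:EqjiLe... md5:04e8c3c75... sha512:000a70c470f6... sha1:12a8e22d..."
for namespace, alias in parse_export_header(header):
    print(namespace, alias)
```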

API Usage

$ seqrepo -v -d $SEQREPO_ROOT shell

In [10]: %time sr.fetch("NC_000001.10", start=6000000, end=6000200)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 492 µs
Out[10]: 'GGACAACAGAGGATGAGGTGGGGCCAGCAGAGGGACAGAGAAGAGCTGCCTGCCCTGGAACAGGCAGAAAGCATCCCACGTGCAAGAAAAAGTAGGCCAGCTAGACTTAAAATCAGAACTACCGCTCATCAAAAGATAGTGTAACATTTGGGGTGCTATAATTTTAACATGTCCCCCAAAAGGCATGTGTTGGAAATTTA'


# iterate over unique sequences:
import pprint

for srec, arec in sr:
    pprint.pprint(srec)
    pprint.pprint(arec)

# results in records like:
{'added': '2016-08-18 17:40:49',
 'alpha': 'ACGT',
 'len': 2627,
 'relpath': '2016/08/18/1740/1471542046.008535.fa.bgz',
 'seq': 'GTACGCCC...',
 'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518'}

and

[{'added': '2016-08-18 17:40:49',
  'alias': '04e8c3c753dad9c19741cdf81ec2b3d5',
  'is_current': 1,
  'namespace': 'md5',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144388},
 {'added': '2016-08-18 17:40:49',
  'alias': 'EqjiLeXFeeBT6LIMnbCFQxNqHD8',
  'is_current': 1,
  'namespace': 'seguid',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144389},
 {'added': '2016-08-18 17:40:49',
  'alias': '12a8e22de5c579e053e8b20c9db08543136a1c3f',
  'is_current': 1,
  'namespace': 'sha1',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144387},
 {'added': '2016-08-18 17:40:49',
  'alias': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'is_current': 1,
  'namespace': 'sha512',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144386},
 {'added': '2016-08-18 17:40:49',
  'alias': 'NM_013305.4',
  'is_current': 1,
  'namespace': 'ncbi',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144390}]
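
Alias records like those above are often easier to work with when grouped by namespace. The following sketch assumes each record behaves like a dict with the keys shown in the output, and that `is_current` is 0 for superseded aliases (an assumption); the helper name `current_aliases_by_namespace` is hypothetical.

```python
from collections import defaultdict


def current_aliases_by_namespace(alias_records):
    """Group current aliases by namespace from a list of alias record dicts."""
    grouped = defaultdict(list)
    for rec in alias_records:
        if rec["is_current"]:
            grouped[rec["namespace"]].append(rec["alias"])
    return dict(grouped)


# Abbreviated copies of two alias records from the output above.
arec = [
    {"alias": "04e8c3c753dad9c19741cdf81ec2b3d5", "is_current": 1, "namespace": "md5"},
    {"alias": "NM_013305.4", "is_current": 1, "namespace": "ncbi"},
]
print(current_aliases_by_namespace(arec))
```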

Fetching existing sequence repositories

TO BE WRITTEN

(General idea: Distribute repository with snapshots via rsync server from public site for manual installation, and use the same source to seed a docker container.)
