Python package for writing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer incremental snapshots.

Features

  • Space-efficient storage of sequences within a release and across releases

  • Bandwidth-efficient transfer of incremental updates

  • Fast fetching of sequence slices on chromosome-scale sequences

  • Provenance data regarding sequence sources and accessions

  • Precomputed digests that may be used as sequence aliases

For more information, see doc/design.rst.
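
As a rough illustration of how non-redundant storage and aliases can fit together, the sketch below keys each sequence by its SHA-512 hex digest and keeps a separate alias table. ToyRepo and its methods are hypothetical names invented for this example; it is an in-memory stand-in under those assumptions, not seqrepo's actual storage layer or API, and its fetch simply uses Python slicing (0-based start, exclusive end).

```python
import hashlib


class ToyRepo:
    """In-memory illustration only; not seqrepo's real storage or API."""

    def __init__(self):
        self._seqs = {}      # seq_id (sha512 hex digest) -> sequence string
        self._aliases = {}   # (namespace, alias) -> seq_id

    def store(self, seq, namespace, alias):
        # Identical sequences hash to the same seq_id, so the residues are
        # kept once no matter how many names point at them.
        seq_id = hashlib.sha512(seq.encode("ascii")).hexdigest()
        self._seqs.setdefault(seq_id, seq)
        self._aliases[(namespace, alias)] = seq_id
        return seq_id

    def fetch(self, namespace, alias, start=None, end=None):
        # Plain Python slicing: 0-based start, end-exclusive.
        return self._seqs[self._aliases[(namespace, alias)]][start:end]

    def __len__(self):
        # Number of unique sequences, not number of aliases.
        return len(self._seqs)


repo = ToyRepo()
repo.store("ACGTACGTAC", "me", "sequence1")
repo.store("ACGTACGTAC", "lab", "copy-of-sequence1")  # same residues, new alias
print(len(repo))                                      # 1
print(repo.fetch("lab", "copy-of-sequence1", 2, 6))   # GTAC
```

Keying on a content digest is what makes the second store call cheap here: only the alias table grows.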

Expected deployment cases

  • Local access via Python package, using a repo rsync’d from a remote source or loaded locally

  • Docker image with REST interface

Installation

seqrepo has been tested only on Ubuntu 14.04 and 16.04. It requires the tabix package, which must be installed separately, and sqlite3 >= 3.8.0, which likely rules out older Ubuntu releases.

On Ubuntu 16.04:

sudo apt install tabix
pip install seqrepo
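
The prerequisites above can be checked from Python before loading any data. This is a minimal sketch using only the standard library; check_prerequisites is a hypothetical helper, and it assumes that the tabix package places executables named tabix and bgzip on PATH.

```python
import shutil
import sqlite3

MIN_SQLITE = (3, 8, 0)  # minimum SQLite version stated in the requirements


def check_prerequisites():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    # sqlite_version_info is the version of the SQLite library that
    # Python's sqlite3 module is linked against.
    if sqlite3.sqlite_version_info < MIN_SQLITE:
        problems.append("sqlite3 %s is older than 3.8.0" % sqlite3.sqlite_version)
    # Assumption: the tabix package installs `tabix` and `bgzip` on PATH.
    for tool in ("tabix", "bgzip"):
        if shutil.which(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```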

Command line usage

seqrepo includes a command line interface for loading, fetching, and exporting sequences.

Loading

$ SEQREPO_ROOT=/opt/seqrepo/data/2016/0818

$ seqrepo -d $SEQREPO_ROOT init

$ seqrepo -v -d $SEQREPO_ROOT load-fasta -n me fasta1.gz fasta2.gz fasta3.gz

$ seqrepo -v -d $SEQREPO_ROOT status
seqrepo 0.1.0
root directory: /opt/seqrepo/data/2016/0818, 0.2 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 3 files, 33080 sequences, 110419437 residues
aliases: 165481 aliases, 165481 current, 5 namespaces, 33080 sequences

Exporting all sequences

$ seqrepo -v -d $SEQREPO_ROOT export | head
>me:sequence1 seguid:EqjiLe... md5:04e8c3c75... sha512:000a70c470f6... sha1:12a8e22d...
GTACGCCCCCTCCCCCCGTCCCTATCGGCAGAACCGGAGGCCAACCTTCGCGATCCCTTGCTGCGGGCCCGGAGATCAAACGTGGCCCGCCCCCGGCAGG
GCACAGCGCGCTGGGCAACCGCGATCCGGCGCCGGACTGGAGGGGTCGATGCGCGGCGCGCTGGGGCGCACAGGGGACGGAGCCCGGGTCTTGCTCCCCA
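
Each header line in the export above appears to be a ">" followed by space-separated namespace:alias tokens. Assuming that layout holds for every record, a small parser might look like the sketch below; parse_export_header is a hypothetical helper, not part of seqrepo, and the truncated values ("...") are copied from the example as-is.

```python
def parse_export_header(line):
    """Split an export header line into (namespace, alias) pairs."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    pairs = []
    for token in line[1:].split():
        # Split on the first colon only, in case an alias itself contains one.
        namespace, _, alias = token.partition(":")
        pairs.append((namespace, alias))
    return pairs


header = (">me:sequence1 seguid:EqjiLe... md5:04e8c3c75... "
          "sha512:000a70c470f6... sha1:12a8e22d...")
for namespace, alias in parse_export_header(header):
    print(namespace, alias)
```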

API usage

$ seqrepo -v -d $SEQREPO_ROOT shell

In [10]: %time sr.fetch("NC_000001.10", start=6000000, end=6000200)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 492 µs
Out[10]: 'GGACAACAGAGGATGAGGTGGGGCCAGCAGAGGGACAGAGAAGAGCTGCCTGCCCTGGAACAGGCAGAAAGCATCCCACGTGCAAGAAAAAGTAGGCCAGCTAGACTTAAAATCAGAACTACCGCTCATCAAAAGATAGTGTAACATTTGGGGTGCTATAATTTTAACATGTCCCCCAAAAGGCATGTGTTGGAAATTTA'


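# iterate over unique sequences (sr is the object used in the shell session above):
import pprint

for srec, arec in sr:
    pprint.pprint(srec)   # sequence record
    pprint.pprint(arec)   # alias records for that sequence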

# results in records like:
{'added': '2016-08-18 17:40:49',
 'alpha': 'ACGT',
 'len': 2627,
 'relpath': '2016/08/18/1740/1471542046.008535.fa.bgz',
 'seq': 'GTACGCCC...',
 'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518'}

and

[{'added': '2016-08-18 17:40:49',
  'alias': '04e8c3c753dad9c19741cdf81ec2b3d5',
  'is_current': 1,
  'namespace': 'md5',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144388},
 {'added': '2016-08-18 17:40:49',
  'alias': 'EqjiLeXFeeBT6LIMnbCFQxNqHD8',
  'is_current': 1,
  'namespace': 'seguid',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144389},
 {'added': '2016-08-18 17:40:49',
  'alias': '12a8e22de5c579e053e8b20c9db08543136a1c3f',
  'is_current': 1,
  'namespace': 'sha1',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144387},
 {'added': '2016-08-18 17:40:49',
  'alias': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'is_current': 1,
  'namespace': 'sha512',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144386},
 {'added': '2016-08-18 17:40:49',
  'alias': 'NM_013305.4',
  'is_current': 1,
  'namespace': 'ncbi',
  'seq_id': '000a70c470f637d6e3a76497aac3eabc4f7816be8fe03d15bdbd3504655fd3f6ddb2609aeef5e0edfbea16ae8ab181b704c4bfb3cd4328c57a895e02fe5ab518',
  'seqalias_id': 144390}]
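
The digest aliases in these records can be derived with the Python standard library. The sketch below assumes digests are taken over the ASCII-encoded sequence string exactly as stored; whether seqrepo normalizes the sequence first is not stated here. digest_aliases and seguid_from_sha1_hex are hypothetical helper names. The seguid form (base64 of the raw SHA-1 digest with the trailing "=" removed) can be checked against the sha1 and seguid aliases shown above.

```python
import base64
import hashlib


def digest_aliases(seq):
    """Compute md5, sha1, sha512, and seguid aliases for a sequence string."""
    data = seq.encode("ascii")
    sha1 = hashlib.sha1(data)
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": sha1.hexdigest(),
        "sha512": hashlib.sha512(data).hexdigest(),
        # seguid: base64 of the raw SHA-1 digest, without "=" padding
        "seguid": base64.b64encode(sha1.digest()).decode("ascii").rstrip("="),
    }


def seguid_from_sha1_hex(sha1_hex):
    """Convert a hex SHA-1 digest to its seguid form."""
    return base64.b64encode(bytes.fromhex(sha1_hex)).decode("ascii").rstrip("=")


# The sha1 alias from the example record maps onto its seguid alias:
print(seguid_from_sha1_hex("12a8e22de5c579e053e8b20c9db08543136a1c3f"))
# EqjiLeXFeeBT6LIMnbCFQxNqHD8
```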

Fetching existing sequence repositories

TO BE WRITTEN

(General idea: Distribute repository with snapshots via rsync server from public site for manual installation, and use the same source to seed a docker container.)
