
Taxoniq: Taxon Information Query - fast, offline querying of NCBI Taxonomy and related data

Taxoniq is a Python and command-line interface to the NCBI Taxonomy database and selected data sources that cross-reference it.

Taxoniq's features include:

  • Pre-computed indexes updated monthly from NCBI, WoL and cross-referenced databases
  • Offline operation: all indexes are bundled with the package, and no network calls are made when querying taxon information. (Separately, Taxoniq can fetch nucleotide or protein sequences over the network given a taxon or accession; see Retrieving sequences below.)
  • A CLI capable of JSON I/O, batch processing and streaming of inputs for ease of use and pipelining in shell scripts
  • A stable, well-documented, type-hinted Python API (Python 3.6 and higher is supported)
  • Comprehensive testing and continuous integration
  • An intuitive interface with useful defaults
  • Compactness, readability, and extensibility

The Taxoniq package bundles an indexed, compressed copy of the NCBI taxonomy database files, the NCBI RefSeq nucleotide and protein accessions associated with each taxon, the WoL kingdom-wide phylogenetic distance database, and relevant information from other databases. Accessions that appear in the NCBI RefSeq BLAST databases are indexed. Given a taxon ID, accession ID, or taxon name, you can therefore quickly retrieve the taxon's rank, lineage, description, citations, representative RefSeq IDs, LCA information, evolutionary distance, sequence (with a network call), and more, as described in the Cookbook section below.

Installation

pip3 install taxoniq

Synopsis

>>> import taxoniq
>>> t = taxoniq.Taxon(9606)
>>> t.scientific_name
'Homo sapiens'
>>> t.common_name
'human'

>>> t.ranked_lineage
[taxoniq.Taxon(9606), taxoniq.Taxon(9605), taxoniq.Taxon(9604), taxoniq.Taxon(9443),
 taxoniq.Taxon(40674), taxoniq.Taxon(7711), taxoniq.Taxon(33208), taxoniq.Taxon(2759)]
>>> len(t.lineage)
32
>>> [(t.rank.name, t.scientific_name) for t in t.ranked_lineage]
[('species', 'Homo sapiens'), ('genus', 'Homo'), ('family', 'Hominidae'), ('order', 'Primates'),
 ('class', 'Mammalia'), ('phylum', 'Chordata'), ('kingdom', 'Metazoa'), ('superkingdom', 'Eukaryota')]

>>> t.refseq_representative_genome_accessions[:10]
[taxoniq.Accession('NC_000001.11'), taxoniq.Accession('NC_000002.12'), taxoniq.Accession('NC_000003.12'),
 taxoniq.Accession('NC_000004.12'), taxoniq.Accession('NC_000005.10'), taxoniq.Accession('NC_000006.12'),
 taxoniq.Accession('NC_000007.14'), taxoniq.Accession('NC_000008.11'), taxoniq.Accession('NC_000009.12'),
 taxoniq.Accession('NC_000010.11')]

>>> t.url
'https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606'

# Wikidata provides structured links to many databases about taxa represented on Wikipedia
>>> t.wikidata_url
'https://www.wikidata.org/wiki/Q15978631'
>>> t2 = taxoniq.Taxon(scientific_name="Bacillus anthracis")
>>> t2.description
'<p class="mw-empty-elt"> </p> <p class="mw-empty-elt"> </p> <p><i><b>Bacillus anthracis</b></i>
 is the agent of anthrax—a common disease of livestock and, occasionally, of humans—and the only
 obligate pathogen within the genus <i>Bacillus</i>. This disease can be classified as a zoonosis,
 causing infected animals to transmit the disease to humans. <i>B. anthracis</i> is a Gram-positive,
 endospore-forming, rod-shaped bacterium, with a width of 1.0–1.2 µm and a length of 3–5&#160;µm.
 It can be grown in an ordinary nutrient medium under aerobic or anaerobic conditions.</p>
 <p>It is one of few bacteria known to synthesize a protein capsule (poly-D-gamma-glutamic acid).
 Like <i>Bordetella pertussis</i>, it forms a calmodulin-dependent adenylate cyclase exotoxin known
 as anthrax edema factor, along with anthrax lethal factor. It bears close genotypic and phenotypic
 resemblance to <i>Bacillus cereus</i> and <i>Bacillus thuringiensis</i>. All three species share
 cellular dimensions and morphology</p>...'
>>> t3 = taxoniq.Taxon(accession_id="NC_000913.3")
>>> t3.scientific_name
'Escherichia coli str. K-12 substr. MG1655'
>>> t3.parent.parent.common_name
'E. coli'
>>> t3.refseq_representative_genome_accessions[0].length
4641652

# get_from_s3() is the only call in this synopsis that triggers a network request.
>>> seq = t3.refseq_representative_genome_accessions[0].get_from_s3().read()
>>> len(seq)
4641652
>>> seq[:64]
b'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGAT'
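
As a rough illustration of what a lowest common ancestor (LCA) means for lineages shaped like the ranked_lineage output above, the sketch below works on plain lists of taxon IDs. It is not Taxoniq's own LCA implementation, and the function name lowest_common_ancestor is hypothetical. The human lineage is copied from the output above; the first two chimpanzee IDs are entered by hand as an assumption and should be checked against NCBI before being relied on.

```python
def lowest_common_ancestor(lineage_a, lineage_b):
    """Return the first taxon ID in lineage_a that also appears in lineage_b.

    Both lineages are ordered from the taxon itself up towards the root,
    like the ranked_lineage output shown in the synopsis.
    """
    seen = set(lineage_b)
    for tax_id in lineage_a:
        if tax_id in seen:
            return tax_id
    return None  # no shared ancestor in the supplied lists


# Human ranked lineage, copied from the synopsis output.
human = [9606, 9605, 9604, 9443, 40674, 7711, 33208, 2759]
# Chimpanzee lineage: the first two IDs are entered by hand (assumption);
# the remaining IDs are the ones shared with the human lineage.
chimp = [9598, 9596, 9604, 9443, 40674, 7711, 33208, 2759]

lca = lowest_common_ancestor(human, chimp)
print(lca)
```

With these inputs the first shared ID is the family-level entry of the human lineage. Walking one list against a set built from the other keeps the lookup linear in the lineage length.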

Retrieving sequences

Mirrors of the NCBI BLAST databases are maintained on AWS S3 (s3://ncbi-blast-databases) and Google Storage (gs://blast-db). This is a key resource: S3 and GS offer higher bandwidth and throughput than the NCBI FTP server, so HTTP range requests can retrieve individual sequences from the database files without downloading and storing a copy of the whole database.
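
To make the range-request idea concrete, here is a minimal sketch of how a byte-range request can be constructed with the Python standard library. It only builds the request and does not send it. The URL, the offsets, and the helper name build_range_request are hypothetical; how Taxoniq itself maps an accession to a byte range is not shown here.

```python
import urllib.request


def build_range_request(url, offset, length):
    """Build (but do not send) a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Hypothetical object URL and byte offsets, for illustration only.
request = build_range_request(
    "https://example.com/blast-db/some_volume.nsq", offset=1024, length=64
)
print(request.get_header("Range"))
```

A server that honours the header responds with status 206 (Partial Content) and only the requested bytes, which is what makes fetching a single record from a large database file cheap.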

The Taxoniq PyPI distribution (the package you install using pip3 install taxoniq) indexes accessions for the following NCBI BLAST databases:

  • RefSeq viruses representative genomes (ref_viruses_rep_genomes) (nucleotide)
  • RefSeq prokaryote representative genomes (contains RefSeq assembly) (ref_prok_rep_genomes) (nucleotide)
  • RefSeq Eukaryotic Representative Genome Database (ref_euk_rep_genomes) (nucleotide)
  • Betacoronavirus (nucleotide)

Given an accession ID, Taxoniq can issue a single HTTP request and return a file-like object streaming the nucleotide sequence for this accession from the S3 or GS mirror as follows:

import taxoniq

with taxoniq.Accession("NC_000913.3").get_from_s3() as fh:
    seq = fh.read()
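
For large genomes it may be preferable to consume the stream in chunks rather than hold the whole sequence in memory. The sketch below shows the pattern on an in-memory stand-in; it assumes the object returned by get_from_s3() supports read(size) like other binary file-like objects, which should be verified before use. The helper name count_bytes_in_chunks is hypothetical.

```python
import io


def count_bytes_in_chunks(fh, chunk_size=1 << 20):
    """Read a binary file-like object chunk by chunk and return the total byte count."""
    total = 0
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        total += len(chunk)
    return total


# Stand-in for the object returned by get_from_s3(); a real stream is
# assumed to support read(size) in the same way.
fake_stream = io.BytesIO(b"ACGT" * 1000)
total = count_bytes_in_chunks(fake_stream, chunk_size=256)
print(total)
```

The same loop can write each chunk to disk or feed it to an incremental parser instead of counting bytes.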

To retrieve many sequences quickly, you may want to use a thread pool to open multiple network connections at once:

from concurrent.futures import ThreadPoolExecutor

import taxoniq

def fetch_seq(accession):
    # Each call opens its own connection and downloads one sequence.
    seq = accession.get_from_s3().read()
    return (accession, seq)

taxon = taxoniq.Taxon(scientific_name="Apis mellifera")
# map() runs fetch_seq concurrently and yields results in input order.
with ThreadPoolExecutor() as executor:
    for accession, seq in executor.map(fetch_seq, taxon.refseq_representative_genome_accessions):
        print(accession, len(seq))
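
When fetching many sequences, a single failed download makes map() raise at that item and abandons the rest of the loop. One possible variation, sketched below, collects results and errors separately and caps the number of concurrent connections. The names fetch_all and fake_fetch are hypothetical, and fake_fetch is a stand-in that makes no network calls; in real use you would pass a function that calls get_from_s3().

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(fetch, items, max_workers=8):
    """Run fetch(item) concurrently; return (results, errors), both keyed by item."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[item] = exc
    return results, errors


def fake_fetch(accession_id):
    """Hypothetical stand-in for a network fetch; fails for one ID."""
    if accession_id == "BAD_ID":
        raise KeyError(accession_id)
    return b"ACGT" * len(accession_id)


results, errors = fetch_all(fake_fetch, ["NC_000913.3", "BAD_ID"])
print(sorted(results), sorted(errors))
```

Unlike map(), as_completed() yields futures in completion order, so the results are stored in a dictionary keyed by the input item rather than relying on position.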

Using the nr/nt databases

In progress

Cookbook

In progress

Links

License

Taxoniq software is licensed under the terms of the MIT License.

Distributions of this package contain data from the National Center for Biotechnology Information released into the public domain under the NCBI Public Domain Notice.

Distributions of this package contain text excerpts from Wikipedia licensed under the terms of the CC-BY-SA License.

Bugs

Please report bugs, issues, feature requests, etc. on GitHub.
