
Fungi DNA barcoder based on semantic searching


TaxoTagger


TaxoTagger is a Python library for DNA barcode identification, powered by semantic searching.

Features:

  • 🚀 Effortlessly build vector databases from DNA sequences (FASTA files)
  • ⚡ Achieve highly efficient and accurate semantic searching
  • 🔥 Easily extend support for various embedding models

Installation

TaxoTagger requires Python 3.10 or later.

# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install the `taxotagger` package
pip install --pre taxotagger

Usage

Build a vector database from a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# creating the database will take ~30s
tt.create_db('data/database.fasta')

By default, the ~/.cache/mycoai folder is used to store the vector database and the embedding model. The MycoAI-CNN.pt model is automatically downloaded to this folder if it is not there, and the vector database is created and named after the model.
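To check what the first run stored, you can list the contents of the cache folder. The following is a minimal sketch using only the standard library. The helper name `list_cache_files` is hypothetical and not part of TaxoTagger, and the exact file names you see will depend on the embedding model in use.

```python
from pathlib import Path

# Default cache location described above.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "mycoai"


def list_cache_files(cache_dir: Path = DEFAULT_CACHE_DIR) -> list:
    """Return the sorted names of entries in the cache folder.

    Hypothetical helper for illustration; not part of the taxotagger API.
    """
    if not cache_dir.is_dir():
        # Nothing has been downloaded or built yet.
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())


if __name__ == "__main__":
    for name in list_cache_files():
        print(name)
```

An empty list simply means that neither the model nor a database has been created in that folder yet.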

Conduct a semantic search with a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)

The data/query.fasta file contains two query sequences: KY106088 and KY106087.
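If you want to confirm which sequences a FASTA file contains before searching, a small standard-library sketch such as the one below may help. The function `fasta_ids` is a hypothetical helper, not part of TaxoTagger, and it assumes that the sequence ID is the first whitespace-delimited token of each `>` header line; adjust the parsing if your headers use another delimiter.

```python
def fasta_ids(path: str) -> list:
    """Return the sequence IDs found in a FASTA file, in file order.

    Hypothetical helper for illustration. Assumes the ID is the first
    whitespace-delimited token after '>' on each header line.
    """
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                # Keep an empty string for a header with no text after '>'.
                ids.append(header.split()[0] if header else "")
    return ids
```

Calling `fasta_ids('data/query.fasta')` on the example file would be expected to list the two query IDs, provided its headers follow the assumed layout.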

The search result res is a dictionary keyed by taxonomic level name. Each value holds the matched results for each of the two query sequences. For example, res['phylum'] will look like:

[
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]

The first inner list contains the top results for the first query sequence, and the second inner list contains the top results for the second query sequence.

The top result for each query sequence is the sequence itself, because the query sequences are also in the database. Try different query sequences to see other search results.
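The nested structure can be flattened into one row per query for easier inspection. The sketch below works on plain Python data shaped like the res['phylum'] example above. The helper `summarize_level` is hypothetical, not part of TaxoTagger, and it assumes that the first entry of each inner list is the best match.

```python
def summarize_level(level_results, level):
    """Return one summary row per query for a single taxonomic level.

    Hypothetical helper for illustration. Assumes each inner list is
    ordered so that its first entry is the best match.
    """
    rows = []
    for hits in level_results:
        if not hits:
            # No match was returned for this query.
            rows.append(None)
            continue
        best = hits[0]
        rows.append(
            {
                "match_id": best["id"],
                "distance": best["distance"],
                "taxon": best["entity"].get(level),
            }
        )
    return rows


# Example data copied from the res['phylum'] output shown above.
phylum_results = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]

for row in summarize_level(phylum_results, "phylum"):
    print(row)
```

The same helper can be applied to the other taxonomic levels in res by passing the corresponding key as `level`.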

Docs

Please visit the official documentation for more details.

Questions and feedback

Please submit an issue if you have any questions or feedback.

Citation

If you use TaxoTagger in your work, please cite it by clicking "Cite this repository" at the top right of the repository page.
