Fungi DNA barcoder based on semantic searching

Project description

TaxoTagger

Identifies taxonomy labels for fungal DNA sequences using semantic search.

Features:

  • Building vector databases directly from DNA sequences (FASTA file) with ease
  • Supporting various embedding models
  • Semantic searching with high efficiency
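To make the idea of semantic search over DNA sequences concrete, here is a minimal sketch. It is not TaxoTagger's implementation: the k-mer count embedding is a toy stand-in for a learned embedding model, cosine similarity is an assumed scoring function, and the names `embed`, `cosine`, `search`, `ref_A`, and `ref_B` are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length for the toy embedding (an arbitrary choice)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def embed(seq: str) -> list[float]:
    """Toy embedding: a vector of k-mer counts (stand-in for a learned model)."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    return [float(counts[kmer]) for kmer in KMERS]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; values near 1.0 mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: str, database: dict[str, str], limit: int = 1) -> list[tuple[str, float]]:
    """Return the `limit` database IDs most similar to the query, best first."""
    q = embed(query)
    scored = [(seq_id, cosine(q, embed(seq))) for seq_id, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


# Made-up reference sequences, for illustration only.
database = {
    "ref_A": "ACGTACGTACGTACGT",
    "ref_B": "TTTTGGGGCCCCAAAA",
}
hits = search("ACGTACGTACGTACGA", database, limit=1)
print(hits[0][0])
```

A real vector database precomputes and indexes the reference embeddings so that a query does not have to be compared against every entry; the sketch recomputes them on each call only to stay short.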

Installation

Install from PyPI:

# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install the `taxotagger` package
pip install --pre taxotagger

Or install from source code:

# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install from this repo
pip install git+https://github.com/MycoAI/taxotagger

Usage

Build a vector database from a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# creating the database will take ~30s
tt.create_db('data/database.fasta')

By default, the MycoAI-CNN.pt model is used as the embedding model, and the database is created and stored in the default folder (~/.cache/mycoai) unless you set config.mycoai_home to a different value. The embedding model is downloaded automatically to the same folder.
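To check what ended up in that folder after building a database, a small standard-library sketch like the following can help. It does not use the TaxoTagger API; `list_cache_files` is a hypothetical helper, and the path assumes you kept the default location rather than setting config.mycoai_home.

```python
from pathlib import Path


def list_cache_files(home: Path) -> list[str]:
    """Return the relative paths of all files under `home`, sorted.

    Returns an empty list if the folder does not exist yet.
    """
    if not home.is_dir():
        return []
    return sorted(p.relative_to(home).as_posix() for p in home.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Default location stated above; change this if you set config.mycoai_home.
    default_home = Path.home() / ".cache" / "mycoai"
    for name in list_cache_files(default_home):
        print(name)
```

An empty listing simply means nothing has been downloaded or built in that location yet.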

Conduct a semantic search with a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)

The search result res is a dictionary keyed by taxonomic level name; each value holds the matched results for every query sequence. For example, res['phylum'] will look like:

[
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]

The first inner list contains the top results for the first query sequence, and the second inner list contains those for the second query sequence.
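Because the inner lists follow the order of the query sequences, they can be paired with the IDs in the query FASTA file. The sketch below is one way to do that with plain Python. The helpers `fasta_ids` and `best_labels` and the query IDs are hypothetical, and it assumes the first hit in each inner list is the best match.

```python
def fasta_ids(fasta_text: str) -> list[str]:
    """Return the first whitespace-delimited token of each FASTA header, in file order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def best_labels(level: str, level_results: list, query_ids: list[str]) -> dict[str, tuple[str, float]]:
    """Map each query ID to the (label, distance) of its first hit at one taxonomic level."""
    labels = {}
    for query_id, hits in zip(query_ids, level_results):
        if hits:  # skip queries that returned no hits
            top = hits[0]
            labels[query_id] = (top["entity"][level], top["distance"])
    return labels


# Hypothetical query file contents; the IDs and sequences are made up.
query_fasta = ">query_1 sample A\nACGTACGT\n>query_2 sample B\nTTGACCAA\n"

# The res['phylum'] value shown above.
phylum_results = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]

labels = best_labels("phylum", phylum_results, fasta_ids(query_fasta))
print(labels)
```

With a real search, you would pass res[level] for each taxonomic level of interest and read the query IDs from the same FASTA file given to tt.search.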

Questions and feedback

Please submit an issue if you have any questions or feedback.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

taxotagger-0.0.1a1.tar.gz (15.6 kB view details)

Uploaded Source

Built Distribution

taxotagger-0.0.1a1-py3-none-any.whl (16.6 kB view details)

Uploaded Python 3

File details

Details for the file taxotagger-0.0.1a1.tar.gz.

File metadata

  • Download URL: taxotagger-0.0.1a1.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for taxotagger-0.0.1a1.tar.gz
Algorithm Hash digest
SHA256 f377a3f80d2c4c4b5a1bb9a43ae48f91edbff1f7f57113937a4700d5e0740597
MD5 a9133e21d0827e7bffafc990a5d038ce
BLAKE2b-256 0b226b5979ff169e2361c3633d37d006e314814abdbd3321d3b69b8da8655b38
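A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the archive is assumed to sit in the current working directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for taxotagger-0.0.1a1.tar.gz
EXPECTED_SHA256 = "f377a3f80d2c4c4b5a1bb9a43ae48f91edbff1f7f57113937a4700d5e0740597"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("taxotagger-0.0.1a1.tar.gz")  # assumed download location
    if archive.exists():
        print("OK" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
    else:
        print(f"{archive} not found")
```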


File details

Details for the file taxotagger-0.0.1a1-py3-none-any.whl.

File metadata

  • Download URL: taxotagger-0.0.1a1-py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for taxotagger-0.0.1a1-py3-none-any.whl
Algorithm Hash digest
SHA256 665d1d3170bc4be40caeee3d953a517ef163f960b34ff1aac27367167205a054
MD5 b393c9373552df1e5c9800d41f63232a
BLAKE2b-256 0d0cb500b961f5f6c287e52c864e322ca0be9f693b9cb92523f05d5a524c3b19

