Fungi DNA barcoder based on semantic searching
TaxoTagger
Identify taxonomy labels for fungal DNA sequences using semantic search.
Features:
- Build vector databases directly from DNA sequences in FASTA files
- Support for multiple embedding models
- Efficient semantic search
Installation
Install from PyPI:
# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10
# install the `taxotagger` package
pip install --pre taxotagger
Or install from source code:
# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10
# install from this repo
pip install git+https://github.com/MycoAI/taxotagger
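To confirm that either installation route succeeded, you can ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and is not part of taxotagger.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("taxotagger")
    print(f"taxotagger: {found or 'not installed'}")
```

Run it inside the activated conda environment, since the result depends on which interpreter executes the script.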
Usage
Build a vector database from a FASTA file
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger
config = ProjectConfig()
tt = TaxoTagger(config)
# creating the database will take ~30s
tt.create_db('data/database.fasta')
By default, the model MycoAI-CNN.pt is used as the embedding model, and the database is created and stored in the default folder (~/.cache/mycoai) unless you set a new value for config.mycoai_home. The embedding model is automatically downloaded to that folder.
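To see what has been downloaded or created, you can list the contents of the home folder. This is a minimal sketch using only the standard library. It assumes the default location ~/.cache/mycoai described above and makes no assumption about the layout inside that folder; the helper name `list_cached_files` is illustrative and is not part of taxotagger.

```python
from pathlib import Path
from typing import List


def list_cached_files(home: str = "~/.cache/mycoai") -> List[str]:
    """Return the relative paths of all files under `home`, sorted by name."""
    root = Path(home).expanduser()
    if not root.is_dir():
        # Nothing has been downloaded or created yet.
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    for name in list_cached_files():
        print(name)
```

If you changed config.mycoai_home, pass that path to the helper instead of relying on the default.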
Conduct a semantic search with a FASTA file
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger
config = ProjectConfig()
tt = TaxoTagger(config)
# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)
The search result res is a dictionary with taxonomic level names as keys; each value holds the matched results for every query sequence. For example, res['phylum'] looks like:
[
[{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
[{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]
The first inner list contains the top results for the first query sequence, and the second inner list contains the top results for the second query sequence.
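Because the inner lists follow the order of the query sequences, results can be paired with query identifiers by position. The sketch below is plain Python over the structure shown above and is not part of taxotagger. The helper names `read_fasta_ids` and `best_hits` and the query identifiers in the usage example are hypothetical, and the code assumes that the first entry of each inner list is the best match.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, str, float]  # (matched ID, taxonomic label, distance)


def read_fasta_ids(path: str) -> List[str]:
    """Return the identifier (first word of each '>' header) of every record."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def best_hits(
    query_ids: List[str], level_results: List[list], level: str
) -> Dict[str, Optional[Hit]]:
    """Pair each query ID with the first hit of its result list at one level."""
    if len(query_ids) != len(level_results):
        raise ValueError("number of query IDs and result lists differ")
    paired: Dict[str, Optional[Hit]] = {}
    for query_id, hits in zip(query_ids, level_results):
        if hits:
            # Assumption: hits are ordered best first.
            top = hits[0]
            paired[query_id] = (top["id"], top["entity"][level], top["distance"])
        else:
            paired[query_id] = None
    return paired


if __name__ == "__main__":
    # The example value of res['phylum'] from this README.
    phylum_results = [
        [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
        [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
    ]
    query_ids = ["query_1", "query_2"]  # hypothetical query identifiers
    for query_id, hit in best_hits(query_ids, phylum_results, "phylum").items():
        print(query_id, hit)
```

In practice, `query_ids` could come from `read_fasta_ids('data/query.fasta')` and `level_results` from `res['phylum']`.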
Questions and feedback
Please submit an issue if you have any questions or feedback.
File details
Details for the file taxotagger-0.0.1a1.tar.gz.
File metadata
- Download URL: taxotagger-0.0.1a1.tar.gz
- Upload date:
- Size: 15.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | f377a3f80d2c4c4b5a1bb9a43ae48f91edbff1f7f57113937a4700d5e0740597
MD5 | a9133e21d0827e7bffafc990a5d038ce
BLAKE2b-256 | 0b226b5979ff169e2361c3633d37d006e314814abdbd3321d3b69b8da8655b38
File details
Details for the file taxotagger-0.0.1a1-py3-none-any.whl.
File metadata
- Download URL: taxotagger-0.0.1a1-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 665d1d3170bc4be40caeee3d953a517ef163f960b34ff1aac27367167205a054
MD5 | b393c9373552df1e5c9800d41f63232a
BLAKE2b-256 | 0d0cb500b961f5f6c287e52c864e322ca0be9f693b9cb92523f05d5a524c3b19