Fungi DNA barcoder based on semantic searching
TaxoTagger
TaxoTagger is a Python library for DNA barcode identification, powered by semantic searching.
Features:
- 🚀 Effortlessly build vector databases from DNA sequences (FASTA files)
- ⚡ Achieve highly efficient and accurate semantic searching
- 🔥 Easily extend support for various embedding models
Installation
TaxoTagger requires Python 3.10 or later.
# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10
# install the `taxotagger` package
pip install --pre taxotagger
Usage
Build a vector database from a FASTA file
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger
config = ProjectConfig()
tt = TaxoTagger(config)
# creating the database will take ~30s
tt.create_db('data/database.fasta')
By default, the ~/.cache/mycoai folder is used to store the vector database and the embedding model. The MycoAI-CNN.pt model is automatically downloaded to this folder if it is not already there, and the vector database is created and named after the model.
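As a rough illustration of the default location described above, the following sketch uses only the standard library to resolve the cache folder and check whether the model file is already present. The helper name `default_cache_dir` is hypothetical and not part of TaxoTagger; the library's own lookup logic may differ.

```python
from pathlib import Path


def default_cache_dir() -> Path:
    """Return the default cache folder named in this README (~/.cache/mycoai)."""
    # Hypothetical helper for illustration; not a TaxoTagger function.
    return Path.home() / ".cache" / "mycoai"


cache_dir = default_cache_dir()
model_path = cache_dir / "MycoAI-CNN.pt"

# Report whether the model has already been downloaded to the cache folder.
print(f"cache folder: {cache_dir}")
print(f"model present: {model_path.exists()}")
```

Running this before and after the first `create_db` call is one way to confirm where the files ended up on your machine.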
Conduct a semantic search with a FASTA file
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger
config = ProjectConfig()
tt = TaxoTagger(config)
# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)
The data/query.fasta file contains two query sequences: KY106088 and KY106087.
The search results res will be a dictionary with taxonomic level names as keys and matched results as values for each of the two query sequences. For example, res['phylum'] will look like:
[
[{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
[{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]
The first inner list is the top results for the first query sequence, and the second inner list is the top results for the second query sequence.
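The nested layout can be easier to work with once it is flattened. The sketch below reuses the sample `res['phylum']` value shown above and turns it into one row per hit. The function `flatten_hits` is a hypothetical helper written for this example, not part of TaxoTagger.

```python
# Sample value of res['phylum'], copied from the example above.
phylum_hits = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]


def flatten_hits(hits_per_query, level):
    """Turn nested per-query lists into flat (query_index, id, taxon, distance) rows."""
    rows = []
    for query_index, hits in enumerate(hits_per_query):
        for hit in hits:
            rows.append((query_index, hit["id"], hit["entity"][level], hit["distance"]))
    return rows


rows = flatten_hits(phylum_hits, "phylum")
for row in rows:
    print(row)
```

With a larger `limit`, each inner list would hold more hits, and the same loop would emit one row for each of them.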
The id field is the sequence ID of the matched sequence. The distance field is the cosine similarity between the query sequence and the matched sequence; it takes a value between 0 and 1, and the closer it is to 1, the more similar the two sequences are. The entity field is the taxonomic information of the matched sequence.
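To make the distance value more concrete, here is a minimal pure-Python sketch of cosine similarity between two vectors. It is not TaxoTagger's implementation; the function name and the toy vectors are assumptions made for illustration, and real embeddings come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, made up for illustration only.
identical = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(identical, orthogonal)
```

Identical vectors score at (or within floating-point error of) 1, which is why a query that is already in the database matches itself with a distance of about 1.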
We can see that the top 1 results for both query sequences are exactly themselves. This is because the query sequences are also in the database. You can try with different query sequences to see the search results.
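A quick way to spot such self-matches is to compare each query ID with the ID of its top hit. The sketch below uses the two query IDs and the sample `res['phylum']` value from this README, and assumes the inner lists follow the order of the query sequences, as described above. `is_self_match` is a hypothetical helper, not a TaxoTagger function.

```python
# Query IDs in the order they appear in data/query.fasta, per this README.
query_ids = ["KY106088", "KY106087"]

# Sample value of res['phylum'], copied from the example in this README.
phylum_hits = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]


def is_self_match(query_id, hits):
    """Return True if the top hit has the same ID as the query sequence."""
    return bool(hits) and hits[0]["id"] == query_id


self_matches = {
    query_id: is_self_match(query_id, hits)
    for query_id, hits in zip(query_ids, phylum_hits)
}
print(self_matches)
```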
Docs
Please visit the official documentation for more details.
Questions and feedback
Please submit an issue if you have any questions or feedback.
Citation
If you use TaxoTagger in your work, please cite it by clicking "Cite this repository" at the top right of this page.