sourmash plugin for improved output of containment search for metagenomes
sourmash_plugin_containment_search: improved containment search for genomes in metagenomes
This plugin adds a command, sourmash scripts mgsearch, that provides new and nicer output for searching genomes against metagenomes.
Installation
pip install sourmash_plugin_containment_search
Usage
This command:
sourmash scripts mgsearch query.sig metagenome.sig [ metagenome2.sig ...] \
[ -o output.csv ]
will search for the query genome query.sig in one or more metagenome.sig files, producing decent human-readable output and (optionally) useful CSV outputs.
For example,
sourmash scripts mgsearch ../sourmash/podar-ref/0.fa.sig ../sourmash/SRR606249.trim.k31.sig.gz
produces:
Loaded query signature: CP001472.1 Acidobacterium capsulatum ATCC 51196, com...
p_genome avg_abund p_metag metagenome name
-------- --------- ------- ---------------
100.0% 55.4 3.1% SRR606249
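The CSV written with -o can be post-processed with standard tools. The sketch below is one possible way to list the metagenomes in which the query genome was detected; it relies only on the match_name and f_query columns described under "CSV output". The function name detected_metagenomes, the 0.5 threshold, and the SAMPLE_CSV rows are made up for illustration and are not part of the plugin.

```python
import csv
import io
from typing import List, TextIO

# Invented example rows (NOT real mgsearch output); only two of the
# documented columns are included to keep the sketch short.
SAMPLE_CSV = """match_name,f_query
metagenome_A,0.98
metagenome_B,0.12
"""


def detected_metagenomes(csv_file: TextIO, min_f_query: float = 0.5) -> List[str]:
    """Return match_name for rows whose f_query is at least min_f_query."""
    reader = csv.DictReader(csv_file)
    return [
        row["match_name"]
        for row in reader
        if float(row["f_query"]) >= min_f_query
    ]


# For a real run, pass open("output.csv", newline="") instead of StringIO.
hits = detected_metagenomes(io.StringIO(SAMPLE_CSV))
```

The threshold is a judgment call; a suitable value depends on how much of the genome needs to be present to count as detected.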
Backstory: Why this command?
sourmash search supports sample x sample search very broadly, perhaps too broadly, and its output formats aren't that helpful.
sourmash prefetch supports metagenome overlap search against many genomes, which is the reverse of this use case. Moreover, prefetch doesn't provide weighted results and its output isn't friendly.
sourmash gather has friendly and useful output, but calculates something different from overlap.
There is also some interest in reverse containment search.
The manysearch command of the sourmash branchwater plugin also does a nice containment search like this plugin, but it doesn't provide nice human-readable output and it also doesn't provide weighted results.
Advanced info: other implementation details
This command is streaming, in the sense that it will load each metagenome, calculate the match, and then discard the metagenome.
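The streaming pattern can be sketched in plain Python as a generator that holds one metagenome at a time. This is an illustration of the idea, not the plugin's code: stream_containment and the load_hashes callback are hypothetical names, and sketches are modeled as simple hash sets and hash-to-abundance dictionaries rather than sourmash signature objects.

```python
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple


def stream_containment(
    query_hashes: Set[int],
    metagenome_paths: Iterable[str],
    load_hashes: Callable[[str], Dict[int, int]],
) -> Iterator[Tuple[str, float]]:
    """Yield (path, fraction of query hashes found), one metagenome at a time."""
    for path in metagenome_paths:
        # Hypothetical loader: returns {hash: abundance} for one metagenome.
        metag = load_hashes(path)
        found = metag.keys() & query_hashes
        fraction = len(found) / len(query_hashes) if query_hashes else 0.0
        yield path, fraction
        # `metag` is rebound on the next iteration, so the previous
        # metagenome can be garbage-collected before the next one is used.
```

Because results are yielded as they are computed, memory use is bounded by the query plus the largest single metagenome rather than by the whole collection.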
CSV output
Each row contains:
- intersect_bp - overlap between genome and metagenome.
- match_filename - metagenome filename from sketch.
- match_name - metagenome name.
- match_md5 - metagenome md5.
- query_filename - genome filename from sketch.
- query_name - genome name.
- query_md5 - genome md5.
- ksize - ksize of comparison.
- moltype - moltype of comparison.
- scaled - scaled of comparison.
- f_query - fraction of query (genome) found. "Detection"; roughly matches the number of bases that will be covered by mapped metagenome reads.
- f_match - fraction of metagenome found, unweighted.
- f_match_weighted - fraction of metagenome found, weighted. Roughly matches the fraction of metagenome reads that will map to this genome.
- sum_weighted_found - sum of weights from intersecting hashes.
- average_abund - average abundance of weights from intersecting hashes.
- median_abund - median abundance of weights from intersecting hashes.
- std_abund - std dev of weights from intersecting hashes.
- total_weighted_hashes - total number of weighted hashes in metagenome.
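To make the numeric columns concrete, here is a minimal sketch of how they could be derived from a query hash set and a metagenome hash-to-abundance mapping. The formulas are assumptions inferred from the column descriptions above, not taken from the plugin source: in particular, intersect_bp is assumed to be the number of shared hashes times scaled, and std_abund is assumed to be a population standard deviation. The function name containment_row is hypothetical.

```python
import statistics
from typing import Dict, Set


def containment_row(query: Set[int], metag: Dict[int, int], scaled: int) -> Dict[str, float]:
    """Compute the numeric CSV columns for one genome/metagenome pair."""
    common = metag.keys() & query            # hashes shared by both sketches
    abunds = [metag[h] for h in common]      # metagenome abundances of shared hashes
    total_weighted = sum(metag.values())     # all metagenome hashes, weighted
    sum_found = sum(abunds)
    return {
        "intersect_bp": len(common) * scaled,  # assumption: hashes x scaled
        "f_query": len(common) / len(query) if query else 0.0,
        "f_match": len(common) / len(metag) if metag else 0.0,
        "f_match_weighted": sum_found / total_weighted if total_weighted else 0.0,
        "sum_weighted_found": sum_found,
        "average_abund": statistics.mean(abunds) if abunds else 0.0,
        "median_abund": statistics.median(abunds) if abunds else 0.0,
        "std_abund": statistics.pstdev(abunds) if abunds else 0.0,  # assumption: population std dev
        "total_weighted_hashes": total_weighted,
    }


# Toy data: half of the query is found, carrying 6 of 10 weighted hashes.
row = containment_row({1, 2, 3, 4}, {1: 2, 2: 4, 5: 4}, scaled=1000)
```

In this toy case f_match (2 of 3 distinct hashes) and f_match_weighted (6 of 10 weighted hashes) differ, which is the distinction the weighted column is meant to capture.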
TODO
- write tests
- evaluate whether we should add more columns by looking at prefetch and gather output
Support
We suggest filing issues in the main sourmash issue tracker as that receives more attention!
Dev docs
containment_search is developed at https://github.com/ctb/sourmash_plugin_containment_search.
Generating a release
Bump version number in pyproject.toml and push.
Make a new release on GitHub.
Then pull, and:
python -m build
followed by twine upload dist/... .
Source Distribution
Hashes for sourmash_plugin_containment_search-0.2.1.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | 64c4ae2506ae412bcf3c9d251c471fd57ac8ea894a29af88d2c1aadeb9ed52f6 |
MD5 | e30e329e7605701591402181b3e9a980 |
BLAKE2b-256 | eb755c948c125cb49b4dcc9e2dc9c737f4ea830227ddee5b9ad0feec6a4a703f |