tools to support genome and metagenome analysis
genome-grist - reference-based exploration of Illumina metagenomes
In brief
genome-grist automates a number of tasks around genome-based metagenome interpretation.
One key point of genome-grist is this: we can take advantage of sourmash gather to find the smallest set of genomes to which to map metagenome reads. genome-grist automates all the stuff AROUND doing that!
So, genome-grist is a toolkit to do the following:
- download a metagenome
- process it into trimmed reads, and make a sourmash signature
- search the sourmash signature with 'gather' against sourmash databases, e.g. all of genbank
- download the matching genomes from genbank
- map all metagenome reads to genomes using minimap
- extract matching reads iteratively based on gather, successively eliminating reads that matched to previous gather matches
- run mapping on “leftover” reads to genomes
- summarize all mapping results
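To make the role of gather in the list above more concrete, here is a toy sketch of a greedy minimum-set-cover selection, which is the general idea behind picking a small set of genomes that together explain the metagenome. It uses plain Python sets of integers as stand-ins for k-mer hashes. The function name `greedy_gather`, the `threshold` parameter, and the example data are all hypothetical, and this is an illustration of the idea only, not sourmash's implementation.

```python
def greedy_gather(query, genomes, threshold=1):
    """Greedily pick genomes that cover the most remaining query hashes.

    query: set of ints standing in for metagenome k-mer hashes (toy data).
    genomes: dict mapping genome name -> set of ints (toy data).
    Returns a list of (name, newly_covered_count) in the order picked.
    """
    remaining = set(query)
    candidates = dict(genomes)
    picked = []
    while remaining and candidates:
        # Choose the genome with the largest overlap with the part of the
        # query that is still unexplained (ties broken by name).
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & remaining))
        overlap = candidates[best] & remaining
        if len(overlap) < threshold:
            break  # nothing left that adds enough new coverage
        picked.append((best, len(overlap)))
        # Remove the explained hashes so later picks only count new coverage.
        remaining -= overlap
        del candidates[best]
    return picked


# Hypothetical toy data: integers stand in for k-mer hashes.
metagenome = set(range(0, 100))
references = {
    "genomeA": set(range(0, 60)),
    "genomeB": set(range(50, 90)),
    "genomeC": set(range(55, 65)),    # fully covered by A and B together
    "genomeD": set(range(200, 250)),  # not present in the metagenome
}

result = greedy_gather(metagenome, references)
print(result)  # [('genomeA', 60), ('genomeB', 30)]
```

In this toy run, genomeC is never selected because everything it shares with the metagenome is already explained by genomeA and genomeB. The same "subtract what earlier matches explained" logic is what the iterative read extraction step relies on.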
Installation
The command:
python -m pip install genome-grist
will install the latest version. Please use Python 3.7 or later. We suggest using an isolated conda environment; the following commands should work for conda:
conda create -n grist python=3.7 pip
conda activate grist
python -m pip install genome-grist
Quick start:
Run the following three commands.
First, download SRA sample HSMA33MX, trim reads, and build a sourmash signature:
genome-grist process HSMA33MX smash_reads
Next, search the sourmash signature against genbank with gather:
genome-grist process HSMA33MX gather_genbank
(NOTE, this depends on the latest genbank genomes and won't work for most people just yet; for now, use cached results from the repo:
cp tests/test-data/HSMA33MX.x.genbank.gather.csv outputs/genbank/
touch outputs/genbank/HSMA33MX.x.genbank.gather.out
)
Finally, download the reference genomes, map reads and produce a summary report:
genome-grist process HSMA33MX summarize -j 8
(You can run all of the above with make test in the repo.)
The summary report will be in outputs/reports/report-HSMA33MX.html.
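If you would rather drive the quick start from a script, the three commands above can be wrapped as in the sketch below. The command lines are exactly the ones shown above. The helper names `quickstart_commands` and `run_quickstart` and the `dry_run` switch are hypothetical and are not part of genome-grist.

```python
import subprocess


def quickstart_commands(sample, jobs=8):
    """Return the three quick-start invocations as argument lists."""
    base = ["genome-grist", "process", sample]
    return [
        base + ["smash_reads"],
        base + ["gather_genbank"],
        base + ["summarize", "-j", str(jobs)],
    ]


def run_quickstart(sample, jobs=8, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in quickstart_commands(sample, jobs):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


# Dry run by default: only prints the commands.
run_quickstart("HSMA33MX")
```

Call `run_quickstart("HSMA33MX", dry_run=False)` to actually execute the steps. If the gather_genbank step does not work for you yet, copy the cached results as described in the note above instead of running the second command.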
You can see some example reports for this and other data sets online:
- HSMA33MX report
- Illumina metagenome from Shakya et al., 2014 (ref)
- sample 1 from Hu et al., 2016 (oil well metagenome) (ref)
Compute requirements
You'll need enough disk space to store about 5 copies of your raw metagenome.
The peak memory requirement is in the k-mer trimming and sourmash gather steps. You'll probably want between 30 and 60 GB of RAM for those, although for smaller or less diverse metagenomes, you will use a lot less.
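As a rough pre-flight check for the disk guideline above, the sketch below compares free disk space against five times the raw metagenome size. The factor of 5 comes from the text above. The function name `has_enough_disk` and the 12 GB example size are hypothetical, and the check is only as good as the "about 5 copies" rule of thumb.

```python
import shutil

COPIES_NEEDED = 5  # "about 5 copies of your raw metagenome"


def has_enough_disk(raw_bytes, free_bytes, copies=COPIES_NEEDED):
    """Return True if free space covers `copies` times the raw data size."""
    return free_bytes >= copies * raw_bytes


# Hypothetical raw metagenome size; substitute the size of your own reads.
raw_size = 12 * 10**9
free = shutil.disk_usage(".").free  # free bytes on the current filesystem

needed_gb = COPIES_NEEDED * raw_size / 1e9
print(f"need about {needed_gb:.0f} GB, have {free / 1e9:.0f} GB free")
print("enough disk" if has_enough_disk(raw_size, free) else "not enough disk")
```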
Full set of top-level process targets
- download_reads
- trim_reads
- smash_reads
- gather_genbank
- download_matching_genomes
- map_reads
- summarize
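The targets above are listed in pipeline order, and the quick start suggests that asking for a later target also runs the earlier steps it needs (smash_reads downloads and trims the reads before building the signature). The sketch below encodes that reading. The strictly linear ordering is an assumption made for illustration, and `targets_up_to` is a hypothetical helper, not part of genome-grist.

```python
# Target names copied from the list above, in the order given there.
PROCESS_TARGETS = [
    "download_reads",
    "trim_reads",
    "smash_reads",
    "gather_genbank",
    "download_matching_genomes",
    "map_reads",
    "summarize",
]


def targets_up_to(target):
    """Return the targets implied by `target`, assuming a linear pipeline."""
    if target not in PROCESS_TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    return PROCESS_TARGETS[: PROCESS_TARGETS.index(target) + 1]


print(targets_up_to("smash_reads"))
```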
Support
genome-grist is alpha-level software. Please be patient and kind :).
Please ask questions and add comments by filing github issues.
Why the name grist?
'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.). See Grist.
(It is not the computing grist!)
CTB Nov 8, 2020