tools to support genome and metagenome analysis

Environment
- Console
- MacOS X
Intended Audience
- Science/Research
License
- OSI Approved :: BSD License
Natural Language
- English
Operating System
- MacOS :: MacOS X
- POSIX :: Linux
Programming Language
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

genome-grist: a quickstart tutorial.

This quickstart tutorial will take about 30 minutes to run, and requires 5 GB of disk space and 4 GB of RAM, as well as a fairly good Internet connection.

What is genome-grist?

genome-grist is software that automates a number of tedious tasks in reference-based analysis of Illumina metagenomes. Specifically, genome-grist will download public metagenomes from the SRA, preprocess them, and use sourmash gather to identify reference genomes for the metagenome. It will then download the reference genomes, map reads to them, and summarize the mapping.

Installing genome-grist

We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI.

conda create -y -n grist python=3.8 pip
conda activate grist
python -m pip install genome-grist

Running genome-grist

We currently recommend running genome-grist in its own directory, for several reasons that include software installation (genome-grist uses snakemake and conda to install software under this directory).

Within the current working directory, genome-grist will create an inputs subdir, a genbank_genomes subdir, and any outputs.NAME subdirectories required by the configuration; it should be straightforward to keep projects separate by configuring the output directories appropriately.

So, create a subdirectory and change into it:

mkdir grist/
cd grist/

Note that genome-grist does not rely on the directory name or location in any way; it works entirely within the current working directory.

Download a small example database

Download the GTDB release 95 set of ~32k guide genomes, in a pre-prepared sourmash database format:

curl -L https://osf.io/4n3m5/download -o gtdb-r95.nucleotide-k31-scaled1000.sbt.zip

(Any sourmash database will do as long as the sequences are named so that the full GenBank accession is the first field in the name.)
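
The naming requirement above can be spot-checked on individual sequence names before building a database. The sketch below is only an illustration: has_accession_first is a hypothetical helper (not part of genome-grist or sourmash), and the pattern assumes that a "full accession" means a versioned assembly accession of the form GCA_ or GCF_ followed by nine digits and a version number.

```shell
# Hypothetical helper: succeeds if a sequence name starts with a versioned
# assembly accession (assumed form: GCA_/GCF_ + nine digits + ".version")
# followed by whitespace or the end of the name.
has_accession_first() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+([[:space:]]|$)'
}

# Accession is the first field: accepted.
has_accession_first "GCA_000005845.2 Escherichia coli str. K-12" && echo "ok"

# Accession is present but not first: rejected.
has_accession_first "Escherichia coli GCA_000005845.2" || echo "accession is not the first field"
```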

Make a configuration file

Put the following in a config file named conf-tutorial.yml:

sample:
- HSMA33MX
outdir: outputs.tutorial/
metagenome_trim_memory: 1e9
sourmash_database_glob_pattern: gtdb-r95.nucleotide-k31-scaled1000.sbt.zip

Notes:

- You can list multiple sample IDs here in YAML array format: put each one on its own line after a dash (-).
- If you have multiple databases, you can specify them here with an appropriate wildcard pattern, e.g. db/* will work.
- If you are running this on the farm HPC at UC Davis, you can search all of GenBank by omitting the database configuration line. These files are not yet publicly available, which is why this tutorial uses GTDB instead.
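
As a sketch of the first two notes, the following writes a second config file with two samples and a wildcard database pattern. The file name conf-multi.yml, the output directory outputs.multi/, and the second sample ID are placeholders invented for this illustration; replace them with your own values.

```shell
# Write a config with two samples and a wildcard database pattern.
# ANOTHER_SAMPLE_ID, conf-multi.yml and outputs.multi/ are placeholders,
# not real identifiers; the keys are the same ones used in conf-tutorial.yml.
cat > conf-multi.yml <<'EOF'
sample:
- HSMA33MX
- ANOTHER_SAMPLE_ID
outdir: outputs.multi/
metagenome_trim_memory: 1e9
sourmash_database_glob_pattern: db/*
EOF
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding db/* before it reaches the file.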

Do your first real run!

Execute:

genome-grist run conf-tutorial.yml summarize

This will perform the following steps:

- Download the HSMA33MX metagenome from the Sequence Read Archive (target download_reads).
- Preprocess it to remove adapters and low-abundance k-mers (target trim_reads).
- Build a sourmash signature from the preprocessed reads (target smash_reads).
- Run sourmash gather against the specified database (target gather_genbank).
- Download the matching genomes from GenBank into genbank_genomes/ (target download_matching_genomes).
- Map the metagenome reads to the various genomes (target map_reads).
- Produce a summary notebook (target summarize).

The default target is gather_genbank, and you can put one or more targets on the command line as above with summarize.

Output files

The key output files under the outputs directory are:

- genbank/{sample}.x.genbank.gather.out - human-readable output from sourmash gather.
- genbank/{sample}.x.genbank.gather.csv - sourmash gather CSV output.
- genbank/{sample}.genomes.info.csv - information about the matching genomes from GenBank.
- reports/report-{sample}.html - a summary report.
- abundtrim/{sample}.abundtrim.fq.gz - trimmed and preprocessed reads.
- sigs/{sample}.abundtrim.sig - sourmash signature for the preprocessed reads.

Note that genome-grist run <config.yml> zip will create a file named transfer.zip with the above files in it.
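
After a run, it can be handy to see which of the key output files actually exist for a sample. The following is a minimal sketch; check_outputs is a hypothetical helper (not a genome-grist command), and it assumes the signature file follows the same per-sample naming as the other outputs.

```shell
# Hypothetical helper: report which key output files exist for one sample.
# Usage: check_outputs OUTDIR SAMPLE
check_outputs() {
    outdir=${1%/}      # drop a trailing slash, if any
    sample=$2
    for path in \
        "genbank/${sample}.x.genbank.gather.out" \
        "genbank/${sample}.x.genbank.gather.csv" \
        "genbank/${sample}.genomes.info.csv" \
        "reports/report-${sample}.html" \
        "abundtrim/${sample}.abundtrim.fq.gz" \
        "sigs/${sample}.abundtrim.sig"
    do
        if [ -e "$outdir/$path" ]; then
            echo "found   $outdir/$path"
        else
            echo "missing $outdir/$path"
        fi
    done
}

# Check the tutorial run.
check_outputs outputs.tutorial/ HSMA33MX
```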

Where to insert your own files

genome-grist is built on top of the snakemake workflow system, which lets you substitute your own files in many places.

For example,

- You can put your own SAMPLE_1.fastq.gz, SAMPLE_2.fastq.gz, and SAMPLE_unpaired.fastq.gz files in raw/ to have genome-grist process the reads for you.
- You can put your own interleaved reads file in abundtrim/SAMPLE.abundtrim.fq.gz to run genome-grist on a private or preprocessed set of reads.
- You can put your own sourmash signature (k=31, scaled=1000) in sigs/SAMPLE.abundtrim.sig if you want genome-grist to do the database search for you.

Please see the genome-grist Snakefile for all the gory details.
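
For the raw reads case, staging the files might look like the sketch below. stage_reads is a hypothetical helper, and it assumes that raw/ sits under the configured output directory, like the abundtrim/ and sigs/ directories listed under the outputs directory; check the Snakefile if in doubt.

```shell
# Hypothetical helper: copy your own reads into the names genome-grist expects.
# Assumption: raw/ lives under the configured output directory.
# Usage: stage_reads OUTDIR SAMPLE R1.fastq.gz R2.fastq.gz UNPAIRED.fastq.gz
stage_reads() {
    outdir=${1%/}      # drop a trailing slash, if any
    sample=$2
    mkdir -p "$outdir/raw" &&
    cp "$3" "$outdir/raw/${sample}_1.fastq.gz" &&
    cp "$4" "$outdir/raw/${sample}_2.fastq.gz" &&
    cp "$5" "$outdir/raw/${sample}_unpaired.fastq.gz"
}

# Example call (file names are placeholders):
#   stage_reads outputs.tutorial/ MYSAMPLE my_R1.fastq.gz my_R2.fastq.gz my_unpaired.fastq.gz
```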

Other information

Resource requirements

Disk space: genome-grist makes about 4-5 copies of each SRA metagenome.
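
As a rough planning aid, the 4-5 copies rule can be turned into a back-of-the-envelope estimate. estimate_disk_gb is a hypothetical helper; it takes the size of one metagenome's reads in whole GB and ignores the databases and downloaded genomes.

```shell
# Hypothetical helper: disk estimate for one metagenome, assuming 4 to 5 copies.
# Usage: estimate_disk_gb SIZE_IN_GB   (whole numbers only)
estimate_disk_gb() {
    size_gb=$1
    echo "$((size_gb * 4))-$((size_gb * 5)) GB"
}

estimate_disk_gb 10   # prints "40-50 GB"
```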

Memory: the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. All other steps need under 10 GB of RAM (unless you adjust metagenome_trim_memory upwards, which may be needed for complex metagenomes).

Time: This depends largely on the size of the metagenome; 100 million reads typically take less than a day or two. Multiple data sets can be processed in parallel with -j, although you probably want to specify resource limits. For example, here is the command that Titus uses on farm:

genome-grist run <config> -k --resources mem_mb=145000 -j 16

This runs within 150 GB of RAM and will run at most one genbank search at a time, since each full genbank search takes ~120 GB.
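
The arithmetic behind that limit can be sketched as follows. max_concurrent_searches is a hypothetical helper, and the default of 120000 MB per search is an assumption taken from the ~120 GB figure above; the memory actually requested by each job is set in the genome-grist Snakefile and may differ.

```shell
# Hypothetical helper: how many genbank searches fit under a mem_mb limit?
# Assumption: each search is charged about 120000 MB against the limit.
# Usage: max_concurrent_searches LIMIT_MB [PER_SEARCH_MB]
max_concurrent_searches() {
    limit_mb=$1
    per_search_mb=${2:-120000}
    echo $((limit_mb / per_search_mb))
}

max_concurrent_searches 145000   # prints 1
max_concurrent_searches 250000   # prints 2
```

With mem_mb=145000 only one search fits, while the -j 16 setting still lets the lighter steps for other samples run alongside it.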

Installing unreleased versions.

You can run genome-grist from a git checkout directory by using pip to install it in editable mode:

pip install -e .

Support

We like to support our software!

That having been said, genome-grist is early-stage beta-level software. Please be patient and kind :).

Please ask questions and add comments on the github issue tracker for genome-grist.

Why the name `grist`?

'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.). See Grist in Wikipedia.

(It is not the computing grist!)

CTB Jan 27, 2021

Release history

- 0.9.3 (Dec 7, 2022)
- 0.9.2 (Dec 6, 2022)
- 0.9.1 (Dec 4, 2022)
- 0.9.0 (Sep 30, 2022)
- 0.8.4 (Jul 3, 2022)
- 0.8.3 (Feb 16, 2022)
- 0.8.2 (Feb 12, 2022)
- 0.8.1 (Jan 30, 2022)
- 0.8.0 (Jan 17, 2022)
- 0.7.4 (Dec 19, 2021)
- 0.7.3 (Nov 3, 2021)
- 0.7.2 (May 24, 2021)
- 0.7.1 (May 19, 2021), this version
- 0.7 (Feb 15, 2021)
- 0.6.1 (Jan 27, 2021)
- 0.6 (Jan 27, 2021)
- 0.5 (Nov 21, 2020)
- 0.4 (Nov 16, 2020)
- 0.3.2 (Nov 8, 2020)
- 0.3.1 (Nov 7, 2020)
- 0.3 (Nov 7, 2020)
- 0.2.2 (Nov 6, 2020)
- 0.1.1 (Oct 27, 2020)
- 0.1 (Oct 27, 2020)

Download files

Source Distribution

genome-grist-0.7.1.tar.gz (8.5 MB)

Uploaded May 19, 2021 Source

File details

Details for the file genome-grist-0.7.1.tar.gz.

File metadata

Download URL: genome-grist-0.7.1.tar.gz
Upload date: May 19, 2021
Size: 8.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.24.0 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.7.6

File hashes

Hashes for genome-grist-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`1acf881747bd48cbc7ff4aa2ceb169e9f2ce75d1026ee7a00b47a7cbeae1699d`
MD5	`2f0f626d4b608c6ffd27cd8b7822f320`
BLAKE2b-256	`c8e9b85d7afc39f3704b28d2b5330209b7e01a9783b8675ed3d5c76ba7069d5b`
