Accurate host read removal
Hostile
Hostile removes host sequences from short and long reads, consuming paired or unpaired fastq[.gz] input. Batteries are included: a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an order of magnitude fewer microbial reads than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For ultimate precision, a prebuilt masked reference can be downloaded, or a new one created for chosen target organisms. Read headers can be replaced with integers (using --rename) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads, while Minimap2 is the default aligner for long reads. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. Further information and benchmarks can be found in the BioRxiv preprint and this blog post. Please open an issue, tweet, toot or email me to report problems or suggest improvements.
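To illustrate what replacing read headers with integers means in practice, here is a minimal sketch. It is not Hostile's own implementation: the function name rename_fastq_records is hypothetical, the input is assumed to be plain four-line FASTQ, and any pairing suffixes Hostile may add for paired reads are not modelled.

```python
import io
from typing import Iterator, TextIO


def rename_fastq_records(handle: TextIO) -> Iterator[str]:
    """Yield four-line FASTQ records with each header replaced by a counter.

    Hypothetical helper for illustration only; assumes strict four-line FASTQ.
    """
    counter = 0
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # discard the "+" separator line; it is rewritten below
        quality = handle.readline().rstrip("\n")
        counter += 1
        yield f"@{counter}\n{sequence}\n+\n{quality}\n"


# Two made-up records with long instrument-style headers
example = io.StringIO(
    "@instrument:run:flowcell:1:1101:1000:2000 1:N:0:ATCACG\nACGT\n+\nIIII\n"
    "@instrument:run:flowcell:1:1101:1000:2001 1:N:0:ATCACG\nGGCC\n+\nFFFF\n"
)
renamed = "".join(rename_fastq_records(example))
print(renamed)
```

Dropping the original headers removes instrument and run identifiers from the output and shortens every record, which is where the privacy and file size benefits come from.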
Reference genomes & indexes
For removing human contamination, the default human-t2t-hla reference genome is recommended. It is downloaded automatically from object storage when running Hostile for the first time. Slightly higher microbial retention may be achieved by specifying an alternate reference masked against target organisms (using the --index option). human-t2t-hla-argos985 is masked against 985 reference grade bacterial genomes, making it a good choice for decontaminating bacterial genomes. Another masked genome, human-t2t-hla-argos985-mycob140, was created for maximising the retention of mycobacterial genomes. Both human-t2t-hla and human-t2t-hla-argos985-mycob140 were compared in the paper, and are available for download. A genome (for Minimap2) and a Bowtie2 index are provided for each reference.
Name | Composition | Minimap2 genome | Bowtie2 index | Date |
---|---|---|---|---|
human-t2t-hla (default) | T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51 | human-t2t-hla.fa.gz | human-t2t-hla.tar | 2023-07 |
human-t2t-hla-argos985 | T2T-CHM13v2.0 & IPD-IMGT/HLA v3.51; masked with 985 FDA-ARGOS 150mers | human-t2t-hla-argos985.fa.gz | human-t2t-hla-argos985.tar | 2023-07 |
human-t2t-hla-argos985-mycob140 | T2T-CHM13v2.0 & IPD-IMGT/HLA v3.51; masked with 985 FDA-ARGOS & 140 mycobacterial 150mers | human-t2t-hla-argos985-mycob140.fa.gz | human-t2t-hla-argos985-mycob140.tar | 2023-07 |
Tips
- To force Hostile to download the defaults, run
hostile fetch
- To show a list of available genomes, run
hostile fetch --list-available
- To download a non-default genome, run e.g.
hostile fetch --filename human-t2t-hla-argos985-mycob140.fa.gz
- To use a downloaded non-default genome, run
hostile clean --index path/to/genome …
Install
Installation with conda/mamba or Docker is recommended due to non-Python dependencies (Bowtie2, Minimap2, Samtools and Bedtools). Hostile is tested with Ubuntu Linux 22.04, MacOS 12, and under WSL for Windows.
Conda/mamba
conda create -y -n hostile -c conda-forge -c bioconda hostile
conda activate hostile
Docker
docker run quay.io/biocontainers/hostile:0.2.0--pyhdfd78af_0
# Build your own
wget https://raw.githubusercontent.com/bede/hostile/main/Dockerfile
docker build . --platform linux/amd64
Development install
git clone https://github.com/bede/hostile.git
cd hostile
conda env create -f environment.yml
conda activate hostile
pip install --editable '.[dev]'
pytest
Command line usage
$ hostile clean --help
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX] [--rename] [--reorder] [--out-dir OUT_DIR]
                     [--threads THREADS] [--aligner-args ALIGNER_ARGS] [--force] [--debug]

Remove reads aligning to a target genome from fastq[.gz] input files

options:
  -h, --help            show this help message and exit
  --fastq1 FASTQ1       path to forward fastq[.gz] file
  --fastq2 FASTQ2       optional path to reverse fastq[.gz] file
                        (default: None)
  --aligner {bowtie2,minimap2,auto}
                        alignment algorithm. Use Bowtie2 for short reads and Minimap2 for long reads
                        (default: auto)
  --index INDEX         path to custom genome or index. For Bowtie2, exclude the .1.bt2 suffix
                        (default: None)
  --rename              replace read names with incrementing integers
                        (default: False)
  --reorder             ensure deterministic output order
                        (default: False)
  --out-dir OUT_DIR     path to output directory
                        (default: /Users/bede/Research/Git/hostile)
  --threads THREADS     number of alignment threads. A sensible default is chosen automatically
                        (default: 5)
  --aligner-args ALIGNER_ARGS
                        additional arguments for alignment
                        (default: )
  --force               overwrite existing output files
                        (default: False)
  --debug               show debug messages
                        (default: False)
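The --index help text asks for a Bowtie2 index path without the .1.bt2 suffix. As a small sketch of how a wrapper script might derive that prefix from any one of the index files, the helper below strips the standard Bowtie2 suffixes. The function name bowtie2_index_prefix is hypothetical and not part of Hostile.

```python
import re

# Bowtie2 index files end in .1.bt2 to .4.bt2, .rev.1.bt2 or .rev.2.bt2
# (.bt2l instead of .bt2 for large indexes).
_BT2_SUFFIX = re.compile(r"(\.rev)?\.[1-4]\.bt2l?$")


def bowtie2_index_prefix(path: str) -> str:
    """Return the index prefix expected by --index (hypothetical helper)."""
    return _BT2_SUFFIX.sub("", path)


prefix = bowtie2_index_prefix("/path/to/human-t2t-hla.1.bt2")
print(prefix)
```

A path that is already a bare prefix is returned unchanged, so the helper can be applied unconditionally before calling hostile clean.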
Short reads
$ hostile clean --fastq1 reads.r1.fastq.gz --fastq2 reads.r2.fastq.gz
INFO: Using Bowtie2 (paired reads)
INFO: Found cached index (/Users/bede/Library/Application Support/hostile/human-t2t-hla)
INFO: Cleaning…
INFO: Complete
[
    {
        "aligner": "bowtie2",
        "index": "/path/to/data/dir/human-t2t-hla",
        "fastq1_in_name": "reads.r1.fastq.gz",
        "fastq2_in_name": "reads.r2.fastq.gz",
        "fastq1_in_path": "/path/to/hostile/reads.r1.fastq.gz",
        "fastq2_in_path": "/path/to/hostile/reads.r2.fastq.gz",
        "fastq1_out_name": "reads.r1.clean_1.fastq.gz",
        "fastq2_out_name": "reads.r2.clean_2.fastq.gz",
        "fastq1_out_path": "/path/to/hostile/reads.r1.clean_1.fastq.gz",
        "fastq2_out_path": "/path/to/hostile/reads.r2.clean_2.fastq.gz",
        "reads_in": 20,
        "reads_out": 20,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
$ hostile clean --rename --fastq1 reads_1.fastq.gz --fastq2 reads_2.fastq.gz \
--index /path/to/human-t2t-hla-argos985-mycob140 > decontamination-log.json
INFO: Using Bowtie2
INFO: Found cached index (/Users/bede/Library/Application Support/hostile/human-t2t-hla)
INFO: Cleaning…
INFO: Complete
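Because the JSON log goes to stdout, it can be redirected to a file and inspected afterwards. The sketch below summarises such a log using only the field names visible in the example output above. The function name summarise_log is hypothetical, and the log is embedded as a string so the snippet is self-contained; in practice the text would be read from a file such as decontamination-log.json.

```python
import json

# Abbreviated copy of the example log shown above
log_text = """
[
    {
        "aligner": "bowtie2",
        "fastq1_in_name": "reads.r1.fastq.gz",
        "fastq2_in_name": "reads.r2.fastq.gz",
        "reads_in": 20,
        "reads_out": 20,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
"""


def summarise_log(entries: list) -> dict:
    """Aggregate read counts across all entries of a log (hypothetical helper)."""
    reads_in = sum(entry["reads_in"] for entry in entries)
    reads_out = sum(entry["reads_out"] for entry in entries)
    removed = reads_in - reads_out
    return {
        "reads_in": reads_in,
        "reads_out": reads_out,
        "reads_removed": removed,
        # Guard against empty input to avoid division by zero
        "reads_removed_proportion": removed / reads_in if reads_in else 0.0,
    }


summary = summarise_log(json.loads(log_text))
print(summary)
```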
Long reads
$ hostile clean --fastq1 tests/data/h37rv_10.r1.fastq.gz
INFO: Using Minimap2's long read preset (map-ont)
INFO: Found cached genome (/Users/bede/Library/Application Support/hostile/human-t2t-hla)
INFO: Cleaning…
INFO: Complete
[
    {
        "aligner": "minimap2",
        "index": "/Users/bede/Library/Application Support/hostile/human-t2t-hla.fa.gz",
        "fastq1_in_name": "reads.fastq.gz",
        "fastq1_in_path": "/path/to/hostile/reads.fastq.gz",
        "fastq1_out_name": "reads.clean.fastq.gz",
        "fastq1_out_path": "/path/to/hostile/reads.clean.fastq.gz",
        "reads_in": 10,
        "reads_out": 10,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
Python usage
from pathlib import Path

from hostile.lib import clean_fastqs, clean_paired_fastqs

# Long reads, defaults
clean_fastqs(
    fastqs=[Path("reads.fastq.gz")],
)

# Paired short reads, various options, capture log
log = clean_paired_fastqs(
    fastqs=[(Path("reads_1.fastq.gz"), Path("reads_2.fastq.gz"))],
    index=Path("reference.fasta.gz"),
    out_dir=Path("decontaminated-reads"),
    rename=True,
    force=True,
    threads=4,
)
print(log)
Masking reference genomes
The mask subcommand makes it easy to create custom-masked reference genomes and achieve maximum retention of specific target organisms:
hostile mask human.fasta lots-of-bacterial-genomes.fasta --threads 8
You may wish to use one of the existing reference genomes as a starting point. Masking uses Minimap2's asm10 preset to align the supplied target genomes with the reference genome, and bedtools to mask out all aligned regions. For Bowtie2 (the default aligner for decontaminating short reads), you will also need to build an index before you can use your masked genome with Hostile:
bowtie2-build masked.fasta masked-index
hostile clean --index masked-index --fastq1 reads_1.fastq.gz --fastq2 reads_2.fastq.gz
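Conceptually, masking replaces every reference position covered by an alignment to a target genome with N, so that reads from those organisms no longer align to the host reference. The sketch below shows that idea on a toy sequence. It is a pure-Python stand-in for the bedtools step, not Hostile's actual code: the function name mask_intervals is hypothetical, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
def mask_intervals(sequence: str, intervals: list[tuple[int, int]]) -> str:
    """Replace each 0-based, half-open interval of the sequence with N.

    Hypothetical helper illustrating hard masking; overlapping intervals are fine.
    """
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless
        for position in range(max(start, 0), min(end, len(masked))):
            masked[position] = "N"
    return "".join(masked)


reference = "ACGTACGTACGT"
aligned_regions = [(2, 5), (8, 10)]  # made-up alignment coordinates
masked = mask_intervals(reference, aligned_regions)
print(masked)  # ACNNNCGTNNGT
```

Masked positions cannot seed alignments, which is why a masked reference retains more reads from the organisms it was masked against.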
Citation
BioRxiv preprint (accepted for publication in Oxford Bioinformatics)
@article {Constantinides2023,
author = {Bede Constantinides and Martin Hunt and Derrick W Crook},
title = {Hostile: accurate host decontamination of microbial sequences},
elocation-id = {2023.07.04.547735},
year = {2023},
doi = {10.1101/2023.07.04.547735},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/07/21/2023.07.04.547735},
eprint = {https://www.biorxiv.org/content/early/2023/07/21/2023.07.04.547735.full.pdf},
journal = {bioRxiv}
}