
Accurate host read removal


Hostile

Hostile removes host sequences from short and long reads, consuming paired or unpaired fastq[.gz] input and producing fastq.gz output. Batteries are included: Hostile downloads and saves a human T2T-CHM13v2.0 + HLA reference genome to $XDG_DATA_DIR when run for the first time. Read headers can be replaced with incrementing integers (using --rename) for privacy and more compressible FASTQs. Hostile is implemented as a Python package with a CLI and Python API, but all of the heavy lifting is done by fast compiled code (Minimap2/Bowtie2 and Samtools). When used with a masked reference genome, Hostile achieves near-perfect retention of microbial reads while removing >99.5% of human reads. Please read the bioRxiv preprint for further information, and open a GitHub issue, tweet or toot me to report bugs or suggest improvements.

Reference genomes

The default human-t2t-hla reference is downloaded when running Hostile for the first time. This can be overridden by specifying a custom --index. Bowtie2 indexes need to be untarred before use. The databases human-t2t-hla and human-t2t-hla-argos985-mycob140 were compared in the paper.
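
Because the Bowtie2 indexes are distributed as tar archives, a downloaded archive has to be unpacked before it can be passed to --index. The following is a minimal sketch using only Python's standard library; the function name untar_index is hypothetical, and the assumption that the archive contains files ending in .bt2 is based on the --index help text rather than on an inspection of the archives.

```python
import tarfile
from pathlib import Path


def untar_index(archive: Path, dest_dir: Path) -> list[Path]:
    """Extract a Bowtie2 index tarball and return the extracted .bt2 files.

    Hypothetical helper: the archive layout (files ending in .bt2) is an
    assumption, not something confirmed by the Hostile documentation.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir)
    # Search recursively in case the archive contains a top-level directory
    return sorted(dest_dir.rglob("*.bt2"))
```

The value to pass to --index is then the common path prefix of the extracted files, without the .bt2 extension.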

Name | Composition | Genome (Minimap2) | Bowtie2 index
human-t2t-hla (default) | T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51 | human-t2t-hla.fa.gz | human-t2t-hla.tar
human-t2t-hla-argos985 | T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51; masked with 985 FDA-ARGOS 150mers | human-t2t-hla-argos985.fa.gz | human-t2t-hla-argos985.tar
human-t2t-hla-argos985-mycob140 | T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51; masked with 985 FDA-ARGOS & 140 mycobacterial 150mers | human-t2t-hla-argos985-mycob140.fa.gz | human-t2t-hla-argos985-mycob140.tar

Install

Installation with Conda/Miniconda or Docker is recommended due to non-Python dependencies (Minimap2, Bowtie2, Samtools, Bedtools). Hostile is tested with Ubuntu Linux 22.04, macOS 12, and WSL2.

Conda

curl -OJ https://raw.githubusercontent.com/bede/hostile/main/environment.yml
conda env create -f environment.yml  # Use Mamba if impatient
conda activate hostile
pip install hostile

Development install

git clone https://github.com/bede/hostile.git
cd hostile
conda env create -f environment.yml  # Use Mamba if impatient
conda activate hostile
pip install --editable '.[dev]'
pytest

Command line usage

$ hostile clean --help
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX] [--rename] [--out-dir OUT_DIR] [--threads THREADS] [--force] [--debug]

Remove host reads from paired fastq(.gz) files

options:
  -h, --help            show this help message and exit
  --fastq1 FASTQ1       path to forward fastq(.gz) file
  --fastq2 FASTQ2       optional path to reverse fastq(.gz) file
                        (default: None)
  --aligner {bowtie2,minimap2,auto}
                        alignment algorithm
                        (default: auto)
  --index INDEX         path to custom genome or index. For Bowtie2, provide an index path without the .bt2 extension
                        (default: None)
  --rename              replace read names with incrementing integers
                        (default: False)
  --out-dir OUT_DIR     path to output directory
                        (default: ./)
  --threads THREADS     number of CPU threads to use
                        (default: 10)
  --force               overwrite existing output files
                        (default: False)
  --debug               show debug messages
                        (default: False)
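
When calling the CLI from a pipeline script, it can help to assemble the argument list programmatically. The sketch below uses only flags shown in the help text above; the function name build_clean_command and its defaults are hypothetical, and the resulting list is meant to be passed to something like subprocess.run on a system where hostile is installed.

```python
from pathlib import Path
from typing import Optional


def build_clean_command(
    fastq1: Path,
    fastq2: Optional[Path] = None,
    index: Optional[Path] = None,
    out_dir: Path = Path("."),
    threads: int = 4,
    rename: bool = False,
) -> list[str]:
    """Assemble an argument list for `hostile clean` (hypothetical helper)."""
    cmd = ["hostile", "clean", "--fastq1", str(fastq1)]
    if fastq2 is not None:
        # Paired input: add the reverse reads
        cmd += ["--fastq2", str(fastq2)]
    if index is not None:
        # Custom genome or index instead of the default reference
        cmd += ["--index", str(index)]
    cmd += ["--out-dir", str(out_dir), "--threads", str(threads)]
    if rename:
        cmd.append("--rename")
    return cmd


# Example (requires hostile on PATH):
#   import subprocess
#   subprocess.run(build_clean_command(Path("reads.fastq.gz")), check=True)
```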

Short reads

$ hostile clean --fastq1 reads.r1.fastq.gz --fastq2 reads.r2.fastq.gz
INFO: Using Bowtie2
INFO: Found cached index (/Users/bede/Library/Application Support/hostile/human-t2t-hla)
INFO: Cleaning…
[
    {
        "aligner": "bowtie2",
        "index": "/path/to/data/dir/human-t2t-hla",
        "fastq1_in_name": "reads.r1.fastq.gz",
        "fastq2_in_name": "reads.r2.fastq.gz",
        "fastq1_in_path": "/path/to/hostile/reads.r1.fastq.gz",
        "fastq2_in_path": "/path/to/hostile/reads.r2.fastq.gz",
        "fastq1_out_name": "reads.r1.clean_1.fastq.gz",
        "fastq2_out_name": "reads.r2.clean_2.fastq.gz",
        "fastq1_out_path": "/path/to/hostile/reads.r1.clean_1.fastq.gz",
        "fastq2_out_path": "/path/to/hostile/reads.r2.clean_2.fastq.gz",
        "reads_in": 20,
        "reads_out": 20,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]

Long reads

$ hostile clean --fastq1 reads.fastq.gz
INFO: Using Minimap2's long read preset (map-ont)
INFO: Found cached reference (/Users/bede/Library/Application Support/hostile/human-t2t-hla.fa.gz)
INFO: Cleaning…
[
    {
        "aligner": "minimap2",
        "index": "/Users/bede/Library/Application Support/hostile/human-t2t-hla.fa.gz",
        "fastq1_in_name": "reads.fastq.gz",
        "fastq1_in_path": "/path/to/hostile/reads.fastq.gz",
        "fastq1_out_name": "reads.clean.fastq.gz",
        "fastq1_out_path": "/path/to/hostile/reads.clean.fastq.gz",
        "reads_in": 10,
        "reads_out": 10,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
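
The JSON array printed after cleaning contains one record per input sample, which makes it straightforward to summarise in a script. A minimal sketch follows, assuming the JSON has already been captured as a string (for example, saved to a file); the function name summarise_stats is hypothetical, and the field names are taken from the example output above.

```python
import json


def summarise_stats(stats_json: str) -> list[str]:
    """Return one summary line per record in the JSON printed by `hostile clean`.

    Hypothetical helper; relies only on fields shown in the example output.
    """
    lines = []
    for record in json.loads(stats_json):
        lines.append(
            f"{record['fastq1_in_name']}: removed {record['reads_removed']} of "
            f"{record['reads_in']} reads ({record['reads_removed_proportion']:.1%})"
        )
    return lines
```

A check such as record["reads_in"] == record["reads_out"] + record["reads_removed"] is consistent with both examples above and may be a useful sanity test in a pipeline.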

Python usage

from pathlib import Path

# clean_fastqs (unpaired input) is assumed to live alongside
# clean_paired_fastqs in hostile.lib
from hostile.lib import ALIGNER, clean_fastqs, clean_paired_fastqs

data_dir = Path(".")  # directory containing the input FASTQ files

# Short reads (paired), defaults
clean_paired_fastqs(
    fastqs=[(data_dir / "reads_1.fastq.gz", data_dir / "reads_2.fastq.gz")],
)

# Long reads (unpaired), all the options, capture statistics
statistics = clean_fastqs(
    fastqs=[data_dir / "reads.fastq.gz"],
    aligner=ALIGNER.minimap2,
    index=data_dir / "reference.fasta.gz",
    out_dir=data_dir / "decontaminated-reads",
    threads=4,
)

print(statistics)

Masking reference genomes

The mask subcommand makes it easy to create custom-masked reference genomes and achieve maximum retention of specific target organisms:

hostile mask human.fasta lots-of-bacterial-genomes.fasta --threads 8

You may wish to use one of the existing reference genomes as a starting point. Masking uses Minimap2's asm10 preset to align the supplied target genomes with the reference genome, and bedtools to mask out all aligned regions. This feature requires a development install until release in version 0.0.3. For Bowtie2 (the default aligner for decontaminating short reads), you will also need to build an index before you can use your masked genome with Hostile.

bowtie2-build masked.fasta masked-index
hostile clean --index masked-index --fastq1 reads_1.fastq.gz --fastq2 reads_2.fastq.gz
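
To make the masking step concrete, the sketch below shows what masking aligned regions means at the sequence level: every base inside an aligned interval is replaced with N so that reads from the target organisms no longer align to the reference. This is an illustration only, not Hostile's implementation (which delegates to bedtools); the function name mask_intervals is hypothetical, and the use of hard masking with 0-based, half-open intervals follows the BED convention and is an assumption about how the mask is applied.

```python
def mask_intervals(sequence: str, intervals: list[tuple[int, int]]) -> str:
    """Replace each 0-based, half-open [start, end) interval with N characters.

    Illustrative sketch of hard masking; overlapping intervals are allowed.
    """
    masked = list(sequence)
    for start, end in intervals:
        # Clamp the interval to the bounds of the sequence
        start, end = max(start, 0), min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)
```

For example, masking the interval (2, 5) of ACGTACGT replaces the third to fifth bases, leaving the flanking bases untouched.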
