Python library for Digital Pathology Image Processing

histolab

Motivation

The histopathological analysis of tissue sections is the gold standard for assessing the presence of many complex diseases, such as tumors, and understanding their nature. In daily practice, pathologists usually examine tissue slides under the microscope, considering a limited number of regions, and the clinical evaluation relies on several factors such as nuclei morphology, cell distribution, and color (staining). This process is time-consuming, can lead to information loss, and suffers from inter-observer variability.

The advent of digital pathology is changing the way pathologists work and collaborate, and has opened the way to a new era in computational pathology. In particular, histopathology is expected to be at the center of the AI revolution in medicine [1], a prediction supported by the increasing success of deep learning applications in digital pathology.

Whole Slide Images (WSIs), namely the translation of tissue slides from glass to digital format, are a great source of information from both a medical and a computational point of view. WSIs can be stained with different techniques (e.g. H&E or IHC), and are usually very large (up to several GB per slide). Because of the typical pyramidal structure of WSIs, images can be retrieved at different magnification factors, providing a further layer of information beyond color.

However, processing WSIs is far from trivial. First of all, WSIs can be stored in different proprietary formats, depending on the scanner used to digitize the slides, and a standard protocol is still missing. WSIs can also present artifacts, such as shadows, mold, or annotations (pen marks), that are not useful. Moreover, given their dimensions, it is not possible to process a WSI all at once or, for example, to feed it whole to a neural network: it is necessary to crop smaller regions of tissue (tiles), which in turn requires a tissue detection step.

The aim of this project is to provide a tool for WSI processing in a reproducible environment to support clinical and scientific research. Histolab is designed to handle WSIs, automatically detect the tissue, and retrieve informative tiles, and it can thus be integrated in a deep learning pipeline.

Getting Started

Prerequisites

Histolab has only one system-wide dependency: OpenSlide.

You can download and install it from the OpenSlide website, following the instructions for your operating system.

Documentation

Read the full documentation here https://histolab.readthedocs.io/en/latest/.

Quickstart

Here we present a step-by-step tutorial on the use of histolab to extract a tile dataset from example WSIs. The corresponding Jupyter Notebook is available at https://github.com/histolab/histolab-box: this repository contains a complete histolab environment that can be used through Vagrant or Docker on all platforms.

You can therefore use histolab either through histolab-box or by installing it in your own Python virtual environment (using conda, pipenv, pyenv, virtualenv, etc.). In the latter case, since the histolab package is published on PyPI, it can be installed with the command:

pip install histolab
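
To confirm that the installation succeeded, one option is to query the installed version through the standard library. The following is a minimal sketch assuming Python 3.8 or later; installed_version is a hypothetical helper written for this example, not part of histolab.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the histolab version if it is installed, otherwise None.
print(installed_version("histolab"))
```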

TCGA data

First things first, let’s import some data to work with, for example the prostate tissue slide and the ovarian tissue slide available in the data module:

from histolab.data import prostate_tissue, ovarian_tissue

Note: To use the data module, you need to install pooch, also available on PyPI (https://pypi-hypernode.com/project/pooch/). This step is unnecessary if you are using the Vagrant/Docker virtual environment.

Calling a data function automatically downloads the WSI from the corresponding repository and saves the slide in a cached directory:

prostate_svs, prostate_path = prostate_tissue()
ovarian_svs, ovarian_path = ovarian_tissue()

Notice that each data function outputs the corresponding slide, as an OpenSlide object, and the path where the slide has been saved.

Slide initialization

histolab maps a WSI file into a Slide object. Each usage of a WSI requires a one-to-one association with a Slide object, which is defined in the slide module:

from histolab.slide import Slide

To initialize a Slide it is necessary to specify the WSI path, and the processed_path where the thumbnail and the tiles will be saved. In our example, we want the processed_path of each slide to be a subfolder of the current working directory:

import os

BASE_PATH = os.getcwd()

PROCESS_PATH_PROSTATE = os.path.join(BASE_PATH, 'prostate', 'processed')
PROCESS_PATH_OVARIAN = os.path.join(BASE_PATH, 'ovarian', 'processed')

prostate_slide = Slide(prostate_path, processed_path=PROCESS_PATH_PROSTATE)
ovarian_slide = Slide(ovarian_path, processed_path=PROCESS_PATH_OVARIAN)

Note: If the slides were stored in the same folder, this can be done directly on the whole dataset by using the SlideSet object of the slide module.
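
Because every Slide writes its thumbnail and tiles under its own processed_path, it can help to derive these paths from a single rule so that two slides never share a directory. The sketch below only builds the paths; processed_path_for is a hypothetical helper and the 'label/processed' layout simply mirrors the example above.

```python
from pathlib import Path


def processed_path_for(base_path, slide_label):
    """Build '<base_path>/<slide_label>/processed' (hypothetical layout helper)."""
    return str(Path(base_path) / slide_label / "processed")


base = Path.cwd()
paths = {label: processed_path_for(base, label) for label in ("prostate", "ovarian")}

# Each slide gets its own directory, so thumbnails and tiles cannot collide.
assert len(set(paths.values())) == len(paths)
print(paths)
```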

With a Slide object we can easily retrieve information about the slide, such as the slide name, the number of available levels, the dimensions at native magnification or at a specified level:

print(f"Slide name: {prostate_slide.name}")
print(f"Levels: {prostate_slide.levels}")
print(f"Dimensions at level 0: {prostate_slide.dimensions}")
print(f"Dimensions at level 1: {prostate_slide.level_dimensions(level=1)}")
print(f"Dimensions at level 2: {prostate_slide.level_dimensions(level=2)}")

Slide name: 6b725022-f1d5-4672-8c6c-de8140345210
Levels: [0, 1, 2]
Dimensions at level 0: (16000, 15316)
Dimensions at level 1: (4000, 3829)
Dimensions at level 2: (2000, 1914)

print(f"Slide name: {ovarian_slide.name}")
print(f"Levels: {ovarian_slide.levels}")
print(f"Dimensions at level 0: {ovarian_slide.dimensions}")
print(f"Dimensions at level 1: {ovarian_slide.level_dimensions(level=1)}")
print(f"Dimensions at level 2: {ovarian_slide.level_dimensions(level=2)}")

Slide name: b777ec99-2811-4aa4-9568-13f68e380c86
Levels: [0, 1, 2]
Dimensions at level 0: (30001, 33987)
Dimensions at level 1: (7500, 8496)
Dimensions at level 2: (1875, 2124)
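
The dimensions printed for each level reflect the pyramidal structure of the WSI: each level is a downsampled copy of level 0. As a rough illustration, the downsample factor of a level can be estimated from the printed sizes alone. The helper downsample_factors is hypothetical and uses only the numbers shown in the output, not the histolab API.

```python
def downsample_factors(level_dimensions):
    """Return the (width, height) ratio of level 0 to each level, rounded to 2 decimals."""
    base_w, base_h = level_dimensions[0]
    return [(round(base_w / w, 2), round(base_h / h, 2)) for w, h in level_dimensions]


# Dimensions reported for the two example slides, levels 0 to 2.
prostate_dims = [(16000, 15316), (4000, 3829), (2000, 1914)]
ovarian_dims = [(30001, 33987), (7500, 8496), (1875, 2124)]

print(downsample_factors(prostate_dims))  # → [(1.0, 1.0), (4.0, 4.0), (8.0, 8.0)]
print(downsample_factors(ovarian_dims))   # → [(1.0, 1.0), (4.0, 4.0), (16.0, 16.0)]
```

Note that the two slides do not share the same factors at level 2 (8 versus 16), so a given level does not correspond to a fixed magnification across slides.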

Moreover, we can save and show the slide thumbnail in a separate window. In particular, the thumbnail image will be automatically saved in a subdirectory of the processed_path:

prostate_slide.save_thumbnail()
prostate_slide.show()

ovarian_slide.save_thumbnail()
ovarian_slide.show()

Tile extraction

Once the Slide objects are defined, we can proceed to extract the tiles. To speed up the extraction process, histolab automatically detects the tissue region with the largest connected area and crops the tiles within this field. The tiler module implements different strategies for tile extraction and provides an intuitive interface to easily retrieve a tile dataset suitable for our task. In particular, each extraction method is customizable with several common parameters:

  • tile_size: the tile size;
  • level: the extraction level (from 0 to the number of available levels minus one);
  • check_tissue: whether a minimum percentage of tissue is required to save a tile (enabled by default, with a threshold of 80%);
  • prefix: a prefix to be added at the beginning of the tiles’ filename (default is the empty string);
  • suffix: a suffix to be added to the end of the tiles’ filename (default is .png).
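
To make the check_tissue criterion concrete, the sketch below applies an 80% threshold to a toy binary tissue mask. It is only an illustration of the thresholding step: how histolab actually computes the tissue mask is not shown here, and tissue_ratio and has_enough_tissue are hypothetical names.

```python
def tissue_ratio(mask):
    """Fraction of truthy pixels in a 2D binary mask given as a list of rows."""
    total = sum(len(row) for row in mask)
    tissue = sum(1 for row in mask for pixel in row if pixel)
    return tissue / total if total else 0.0


def has_enough_tissue(mask, threshold=0.8):
    """True when at least `threshold` of the mask is tissue."""
    return tissue_ratio(mask) >= threshold


# Toy 4x5 mask: 17 tissue pixels out of 20, i.e. 85% tissue.
mask = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
]
print(tissue_ratio(mask), has_enough_tissue(mask))  # → 0.85 True
```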

Random Extraction

The simplest approach we may adopt is to randomly crop a fixed number of tiles from our slides; in this case, we need the RandomTiler extractor:

from histolab.tiler import RandomTiler

Let us suppose that we want to randomly extract 6 square tiles of size 512 at level 2 from our prostate slide, and that we want to save them only if they contain at least 80% tissue. We then initialize our RandomTiler extractor as follows:

# save tiles in the 'random' subdirectory
PROSTATE_RANDOM_TILES_PATH = os.path.join(PROCESS_PATH_PROSTATE, 'random')

random_tiles_extractor = RandomTiler(
    tile_size=(512, 512),
    n_tiles=6,
    level=2,
    seed=42,
    check_tissue=True, # default
    prefix=PROSTATE_RANDOM_TILES_PATH,
    suffix=".png" # default
)

Notice that we also specify the random seed to ensure the reproducibility of the extraction process. Starting the extraction is as simple as calling the extract method on the extractor, passing the slide as parameter:

random_tiles_extractor.extract(prostate_slide)

Random tiles extracted from the prostate slide at level 2.
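
The role of the seed can be illustrated without histolab: drawing tile origins from a seeded generator yields the same coordinates on every run. This is a simplified sketch that samples over the whole level rather than within the detected tissue region, and random_tile_origins is a hypothetical function, not the RandomTiler implementation.

```python
import random


def random_tile_origins(slide_size, tile_size, n_tiles, seed):
    """Draw n_tiles top-left corners so every tile fits inside the slide."""
    rng = random.Random(seed)  # local generator: the global state is untouched
    max_x = slide_size[0] - tile_size[0]
    max_y = slide_size[1] - tile_size[1]
    return [(rng.randint(0, max_x), rng.randint(0, max_y)) for _ in range(n_tiles)]


# Level-2 dimensions of the prostate slide and the tile size used above.
first = random_tile_origins((2000, 1914), (512, 512), n_tiles=6, seed=42)
second = random_tile_origins((2000, 1914), (512, 512), n_tiles=6, seed=42)
assert first == second  # same seed, same coordinates
```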

Grid Extraction

Instead of picking tiles at random, we may want to retrieve all the tiles available. The GridTiler extractor crops the tiles following a grid structure on the largest tissue region detected in the WSI:

from histolab.tiler import GridTiler

In our example, we want to extract square tiles of size 512 at level 0 from our ovarian slide, regardless of the amount of tissue detected. By default, tiles do not overlap: the parameter defining the number of overlapping pixels between two adjacent tiles, pixel_overlap, is set to zero:

# save tiles in the 'grid' subdirectory
OVARIAN_GRID_TILES_PATH = os.path.join(PROCESS_PATH_OVARIAN, 'grid')

grid_tiles_extractor = GridTiler(
    tile_size=(512, 512),
    level=0,
    check_tissue=False,
    pixel_overlap=0, # default
    prefix=OVARIAN_GRID_TILES_PATH,
    suffix=".png" # default
)

Again, the extraction process starts when the extract method is called on our extractor:

grid_tiles_extractor.extract(ovarian_slide)

Examples of non-overlapping grid tiles extracted from the ovarian slide at level 0.
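
The effect of tile_size and pixel_overlap on a grid layout can be sketched in plain Python. The example assumes a simple rectangular region and a step equal to the tile size minus the overlap; grid_origins is a hypothetical helper, and histolab's own handling of the tissue bounding box and of border tiles may differ.

```python
def grid_origins(region_size, tile_size, pixel_overlap=0):
    """Top-left corners of tiles laid out on a grid, keeping only tiles that fit."""
    region_w, region_h = region_size
    tile_w, tile_h = tile_size
    step_x = tile_w - pixel_overlap
    step_y = tile_h - pixel_overlap
    if step_x <= 0 or step_y <= 0:
        raise ValueError("pixel_overlap must be smaller than the tile size")
    return [
        (x, y)
        for y in range(0, region_h - tile_h + 1, step_y)
        for x in range(0, region_w - tile_w + 1, step_x)
    ]


# A 2048x1024 region holds 4 x 2 non-overlapping 512x512 tiles.
print(len(grid_origins((2048, 1024), (512, 512))))  # → 8
# With a 256-pixel overlap the step halves: 7 x 3 tiles.
print(len(grid_origins((2048, 1024), (512, 512), pixel_overlap=256)))  # → 21
```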

Score-based extraction

Depending on the task we will use our tile dataset for, the extracted tiles may not be equally informative. The ScoreTiler allows us to save only the "best" tiles, among all the ones extracted with a grid structure, based on a specific scoring function. For example, let us suppose that our goal is the detection of mitotic activity on our ovarian slide. In this case, tiles with a higher presence of nuclei are preferable over tiles with few or no nuclei. We can leverage the NucleiScorer function of the scorer module to order the extracted tiles based on the proportion of the tissue and of the hematoxylin staining. In particular, the score is computed as $s_t = N_t \cdot \tanh(T_t)$, where $N_t$ is the percentage of nuclei and $T_t$ the percentage of tissue in the tile $t$.
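
The ranking idea can be sketched in a few lines. The example assumes the score takes the form N_t · tanh(T_t), with N_t the nuclei ratio and T_t the tissue ratio of a tile, both expressed in [0, 1]; this form, the function names, and the tile values are assumptions made for illustration and do not reproduce the NucleiScorer implementation.

```python
import math


def nuclei_score(nuclei_ratio, tissue_ratio):
    """Assumed score: nuclei ratio weighted by tanh of the tissue ratio."""
    return nuclei_ratio * math.tanh(tissue_ratio)


def top_tiles(tiles, n_tiles):
    """Return the n_tiles highest-scoring (name, score) pairs, best first."""
    scored = [(name, nuclei_score(n, t)) for name, n, t in tiles]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n_tiles]


# Hypothetical tiles: (name, nuclei ratio, tissue ratio), both in [0, 1].
tiles = [("tile_a", 0.10, 0.90), ("tile_b", 0.40, 0.95), ("tile_c", 0.40, 0.30)]
for name, score in top_tiles(tiles, n_tiles=2):
    print(f"{name}: {score:.3f}")
```

Under this assumed form, two tiles with the same nuclei ratio are separated by their tissue content, so tile_b ranks above tile_c.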

First, we need the extractor and the scorer:

from histolab.tiler import ScoreTiler
from histolab.scorer import NucleiScorer

As the ScoreTiler extends the GridTiler extractor, we also set pixel_overlap as an additional parameter. Moreover, we can specify the number of top tiles we want to save with the n_tiles parameter:

# save tiles in the 'scored' subdirectory
OVARIAN_SCORED_TILES_PATH = os.path.join(PROCESS_PATH_OVARIAN, 'scored')

scored_tiles_extractor = ScoreTiler(
    scorer=NucleiScorer(),
    tile_size=(512, 512),
    n_tiles=100,
    level=0,
    check_tissue=True,
    pixel_overlap=0, # default
    prefix=OVARIAN_SCORED_TILES_PATH,
    suffix=".png" # default
)

Finally, when we extract our cropped images, we can also write a report of the saved tiles and their scores in a CSV file:

summary_filename = 'summary_ovarian_tiles.csv'
SUMMARY_PATH = os.path.join(OVARIAN_SCORED_TILES_PATH, summary_filename)

scored_tiles_extractor.extract(ovarian_slide, report_path=SUMMARY_PATH)

Representation of the score assigned to each extracted tile by the NucleiScorer, based on the amount of nuclei detected.
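
Once the report is written, it can be inspected with the standard csv module. The sketch below runs on an in-memory stand-in rather than the real file, and the column names filename and score are assumptions: check the header of the generated CSV before adapting it.

```python
import csv
import io

# Stand-in for the report file; the column names are assumptions.
report_text = "filename,score\ntile_0.png,0.12\ntile_1.png,0.31\ntile_2.png,0.07\n"


def best_rows(report_file, n_rows):
    """Read a two-column report and return the n_rows highest-scoring entries."""
    rows = [(row["filename"], float(row["score"])) for row in csv.DictReader(report_file)]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n_rows]


print(best_rows(io.StringIO(report_text), n_rows=2))
# → [('tile_1.png', 0.31), ('tile_0.png', 0.12)]
```

To read the actual report, pass an open file handle, for example open(SUMMARY_PATH, newline=""), in place of the StringIO object.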

Versioning

We use PEP 440 for versioning.

Authors

License

This project is licensed under the Apache License, Version 2.0. See the LICENSE.txt file for details.

Roadmap

Open issues

Acknowledgements

References

[1] Colling, Richard, et al. "Artificial intelligence in digital pathology: A roadmap to routine use in clinical practice." The Journal of Pathology 249.2 (2019).

Contribution guidelines

If you want to contribute to Histolab, be sure to review the contribution guidelines.
