A Generalized Deep Learning Framework for Whole-Slide Image Segmentation and Analysis

01/01/2020 ∙ by Mahendra Khened, et al. ∙ Indian Institute Of Technology, Madras 15

Histopathology tissue analysis is considered the gold standard in cancer diagnosis and prognosis. Given the large size of these images and the increase in the number of potential cancer cases, an automated solution as an aid to histopathologists is highly desirable. In the recent past, deep learning-based techniques have provided state of the art results in a wide variety of image analysis tasks, including analysis of digitized slides. However, the size of images and variability in histopathology tasks makes it a challenge to develop an integrated framework for histopathology image analysis. We propose a deep learning-based framework for histopathology tissue analysis. We demonstrate the generalizability of our framework, including training and inference, on several open-source datasets, which include CAMELYON (breast cancer metastases), DigestPath (colon cancer), and PAIP (liver cancer) datasets. We discuss multiple types of uncertainties pertaining to data and model, namely aleatoric and epistemic, respectively. Simultaneously, we demonstrate our model generalization across different data distribution by evaluating some samples on TCGA data. On CAMELYON16 test data (n=139) for the task of lesion detection, the FROC score achieved was 0.86 and in the CAMELYON17 test-data (n=500) for the task of pN-staging the Cohen's kappa score achieved was 0.9090 (third in the open leaderboard). On DigestPath test data (n=212) for the task of tumor segmentation, a Dice score of 0.782 was achieved (fourth in the challenge). On PAIP test data (n=40) for the task of viable tumor segmentation, a Jaccard Index of 0.75 (third in the challenge) was achieved, and for viable tumor burden, a score of 0.633 was achieved (second in the challenge). Our entire framework and related documentation are freely available at GitHub and PyPi.



There are no comments yet.


page 22

page 26

page 29

page 31

page 36

page 38

page 40

page 42

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Histopathology is still the only definitive method to diagnose cancer (salamat2010robbins)

. Early diagnosis of cancer significantly increases the probability of survival

(hawkes2019cancer). Unfortunately, pathological analysis is an arduous process that is difficult, time-consuming, and requires in-depth knowledge. A study conducted by (elmore2015diagnostic) investigated the concordance of pathologists investigating biopsies of the breast. This study comprised of 115 pathologists across the United States and 240 biopsy specimens. It was found that pathologists disagreed with each other on a diagnosis 24.7% of the time on average. This high rate of misdiagnosis stresses the need to develop computer-aided methods for histopathology analysis to aid the pathologists. With the increasing prevalence of whole-slide imaging(WSI) scanners that can scan the entire tissue sample, in-silico methods of conducting pathology analysis can now be explored (madabhushi2016image).

Broadly, histopathology analysis includes two types, namely (1) Classification analysis and (2) Segmentation analysis. Classification analysis helps in classifying WSI into benign or malignant

(nanthagopal2013classification), (guray2006benign). Classification can also be used to study cancer sub-types directly (malhotra2010histological). Segmentation analysis helps in detecting and separating tumor cells from the normal cells (wahlby2004combining), (xu2016deep)

. This analysis can also be used for the aforementioned classification task and other analyses like pN-staging and tumor burden estimation.

Automatic histopatholgy analysis is plagued by a myriad of challenges (tizhoosh2018artificial);

  1. Insufficient training samples: As the data extraction process for medical images is expensive and less frequent, this results in class imbalance issues, adding an inherent skewness to the dataset. This makes it hard to adapt algorithms which are developed for natural images to medical data.

  2. Large dimensionality of WSI: A WSI is generated by digitizing a glass slide at a very high resolution of order 0.25 micrometers/pixel (which corresponds to 40X magnification on a microscope). A typical glass-slide of size 20mm x 15mm results in gigapixel image of size 80,000 x 60,000 pixels.

  3. Stain variability across laboratories: As the data is acquired from multiple sources, there exists a lack of uniformity in staining protocol. Building a generalized framework that is invariant to stain pattern variability is challenging.

  4. Extraction of clinically relevant features and information. Another major challenge is trying to extract features that are meaningful from a clinical point of view. Deep learning does an excellent task of automatic feature extraction, but understanding these extracted features and extracting meaningful information from them is challenging.

The organization of the paper is as follows. Prior work on histopathology tissues using deep learning methods are discussed in the section 1.1, Followed by our contributions in section 1.2. In section 2.2 we introduce the datasets used in this work, all the preprocessing techniques are discussed in section 2.3. In section 2.4, we discuss multiple deep convolutional networks used, followed by training and inference pipelines in section 2.7 and 2.8 respectively. In section 2.9, 2.10, and 2.6 we describe various other concepts like pN-staging, tumor burden, and uncertainty analysis. Experimental analysis and comprehensive results are presented in section 3 and 4. In section 5 we describe our open-source project (DigiPath AI) and in section 6 we discuss our results and conclude with future direction of work.

1.1 Related work

1.1.1 Deep learning methods for histopathology image analysis

Deep learning methods have shown great success in various tasks involving image, graph, and text data. In the field of medical imaging and diagnosis, deep learning models have achieved human-like results on many specific problems (kermany2018identifying), (bakas2018identifying), (weng2017can), (rajpurkar2017chexnet). One possible medical application where deep learning could have a very high impact if leveraged accurately would be the automation of histopathology analysis. In recent years, grand challenges (li2019signet; CAMELYON16; CAMELYON17; paip; icair) are encouraging all the researchers to collectively work on histopathological data using deep learning based solutions, by providing labeled data.

Histopathology slides provide a more comprehensive view of diseases on the tissue and is still considered as a gold standard for cancer diagnosis (gurcan2009histopathological). In recent past many deep learning-based methods have approached problems of nuclei segmentation (sirinukunwattana2016locality), liver tumor segmentation (kaluva20182d), epithelial tumor tissue segmentation (shapcott2019deep), Gleason grading (arvaniti2018automated), signet ringcell detection (li2019signet) and many more. Deep learning-based methods are increasingly used in the context of histopathology analysis because of their ability to automatically discover the representations needed for feature detection from raw data. Deep learning has also helped in merging multi-domain information (bagari2018combined) in diagnosis, since they learn to associate important features from each domain, and has been proven to provide better results.

1.1.2 Lymph node metastases of breast cancer

Breast cancer is one of the most common cancer among women in the world. The prognosis of breast cancer patients is mainly based on the extent of metastases (sites2014seer). Metastases refers to the spreading of cancer to different parts of the body from where it originally started. This usually occurs when cancer cells break away from the main tumor and enter the bloodstream or lymphatic system. A formally accepted way to classify the extent of cancer is based on TNM staging criteria (amin2017ajcc; sobin2011tnm). The TNM staging system takes into account the size of the tumor (T-stage), cancer spreading to regional lymph nodes (N-stage) and metastasization of tumor to other parts of the body (M-stage). In case of breast cancer, the lymph nodes are usually the first location to get metastasized. With the help of sentinel lymph node procedure (giuliano2011axillary; giuliano2017effect), the most likely metastasized lymph nodes are excised and taken for further histopathological processing and examination by a pathologist. The excised node is preserved by fixing in formalin and embedded in paraffin wax block to enable cutting micrometers thin slices of the tissue. These tissue slices are placed on glass slides and are stained with hematoxylin and eosin (H&E). This staining enables the pathologist to differentiate between nuclear (blue) and cytoplasmic parts (pink) of the cell, thereby providing a general overview of the tissue’s structure.

Category Size
Isolated tumor cells
Single tumor cells or a cluster of
tumor cells not larger than 0.2
mm or less than 200 cells
Larger than 0.2 mm and/or con-
taining more than 200 cells, but
not larger than 2 mm
Macro-metastasis Larger than 2 mm
Table 1: Size criteria for assigning metastasis type

Clinically, metastases is divided into one of the three categories namely:- isolated tumor cells (ITC), micro-metastases and macro-metastases. The categorization is based on the diameter of the largest tumor cells cluster. The Table 1 provides the size criteria for metastases type. The assigning of pathologic N-stages (pN-stages) is based on metastases size and number of lymph nodes per patient (sobin2011tnm). A simplified version of the pN-staging scheme used in CAMELYON17 Challenge (bejnordi2017diagnostic) is provided in Table 2. However, this diagnostic procedure in assessing lymph nodes status is challenging for pathologists as large tissue area has to be examined in detail under microscope at several magnification levels for identifying metastases. Also, this process is tedious, prone to missing small metastases and would require extensive time by pathologist for a thorough examination.

The advent of whole-slide imaging scanners has enabled digitization of glass-slides at very high resolution. Typical whole-slide images (WSI) are in order of gigapixels and usually stored in multi-resolution pyramidal format. These WSIs are suitable for developing computer aided diagnosis systems for automating the pathologist workflow and also with the availability of large amount of data makes WSIs amenable for analysis with machine learning algorithms. Some of the earliest works

(wolberg1994machine; diamond2004use; petushi2006large) based on machine learning algorithms for digital pathology and WSIs were cancer classification.

Recently, with advent of convolutional neural networks (CNNs) considerable improvements have been shown in various computer vision tasks like detection, segmentation and classification. CNNs have also been proposed in lymph node metastasis detection in the recent years

(litjens2016deep; paeng2017unified; liu2017detecting; lee2018robust; li2018cancer; wang2018deep).

pN-Stage Slide Labels
pN0 No micro-metastases or macro-metastases or ITCs found.
pN0(i+) Only ITCs found.
pN1mi Micro-metastases found, but no macro-metastases found.
pN1 Metastases found in 1-3 lymph nodes, of which at least
one is a macro-metastasis.
pN2 Metastases found in 4-9 lymph nodes, of which at least
one is a macro-metastasis.
Table 2: Pathologic lymph node classification (pN-stage) in CAMELYON17 Challenge

1.1.3 Colorectal carcinoma

Colorectal carcinoma is third most common cancer in the United States (fleming2012colorectal). Majority of colorectal carcinoma are adenocarcinomas originating from epithelial cells (hamilton2000carcinoma). shapcott2019deep discuss the application of deep learning methods for cell identification on TCGA data. kather2019predicting discuss the deep learning methods to predict the clinical course of colorectal cancer patients based on histopathological images and bychkov2018deep discuss the use of LSTM (greff2016lstm) for estimating the patient risk score using spatial sequential memory.

1.1.4 Liver cancer

Three out of every four cases of liver cancer results in death, making it one of the deadliest forms of cancer (stewart2019world). Amongst the different forms of liver cancer, hepatocellular carcinoma is the most common form(chuang2009liver). Computer-aided methods for detecting liver cancer from various modalities such as CT (li2015automatic, yasaka2017deep), ultrasound (wu2014deep) and laparoscopy (gibson2017deep) have been explored. Methods utilizing multi-omics data such as RNA sequencing have also been successful (chaudhary2018deep). However, for histopathology, lack of annotated datasets comprising of liver tissue images in this modality has rendered this area unexplored. The new grand challenge paip on liver cancer segmentation motivates research in this area.

1.2 Contributions

In this paper, we developed an end-to-end framework for histopathology analysis from WSI images that generalizes well with various cancer sites. The framework comprises of segmentation at its core along with novel algorithms that utilize the segmentation to do pathological analyses such as metastasis classification, viable tumor burden estimation, and pN-staging. As discussed in section 1, challenges in WSI analysis are mainly due to their extremely large size, variability in staining, and the limited amount of data. In this work, we propose a few key contributions which helped in addressing a few of the aforementioned problems.

  • Extremely large size of WSI: we propose an efficient patch-based approach for training and inference, where sampling is done on a generated tissue mask, thereby reducing computations on unnecessary patches.

  • Insufficient data: To address the problems of limited data and class imbalance, we made use of overlapping and oversampling based strategies in patch extraction along with multiple data augmentations.

  • Inference Pipeline: To remove the edge artifacts characteristic of patch-based segmentation, we have proposed an overlapping strategy, which proved to reduce them significantly.

  • pN Staging: We combined and experimented on various strategies of over-sampling and under-sampling techniques to address class imbalance problem in the dataset for metastases classification.

  • Whole tumor segmentation: We propose an empirical non-deep learning based approach for estimating the whole tumor region from the viable tumor region that is computationally efficient yet performs on par with the deep learning based methods.

  • Uncertainty Estimation: We have proposed an efficient patch-based uncertainty estimation framework, to estimate both data specific and model (parameter) specific uncertainties.

  • Transfer Learning: Based on our study, we show that transfer learning using CAMELYON dataset helps in faster convergence for other histopathology datasets.

  • Generalizability: We bench-marked the performance of our methods by validating it on multiple open-source datasets including CAMELYON (litjens20181399), (bejnordi2017diagnostic), (bandi2018detection) DigestPath(li2019signet), and PAIP(paip).

  • Open-source Packaging: Finally, we packaged the framework into an open-source GUI application for the benefit of researchers.

2 Materials and methods

2.1 Overview

Figure 1: Entire framework pipeline

The Figure 1, illustrates the proposed overall framework. The proposed framework predicts segmentation mask along with uncertainty maps for a given input image (liver or breast or colon tissue). In this framework we also define a pipeline for pN staging, metastases classification, and tumor burden estimation.

2.2 Data

We made use of multiple open-source datasets including DigestPath19(li2019signet), PAIP19 (paip), CAMELYON(bejnordi2017diagnostic; bandi2018detection; litjens20181399) datasets to validate our framework.

2.2.1 CAMELYON (CAncer MEtastases in LYmph nOdes challeNge)

The CAMELYON dataset comprises of sentinel lymph node tissue sections collected from the CAMELYON16 (CM16) and CAMELYON17 (CM17) challenges. The CM16 dataset comprised of 399 WSIs taken from two medical centers in the Netherlands, out of which 159 WSIs were metastases, and the remaining 240 were negative. Pathologists exhaustively annotated all the WSIs with metastases at the pixel level. In the CM16 challenge the 399 WSIs were split into training and testing sets, comprising of 160 negative and 110 metastases WSIs for training, 80 negative and 49 metastases WSIs for testing.

The CM17 dataset comprised of 1000 WSIs taken from five medical centers in the Netherlands. In CM17 challenge, 500 WSIs were allocated for training, and the remaining 500 WSIs were allocated for testing. The training dataset of CM17 included 318 negative WSIs and 182 WSIs with metastases. In CM17 dataset, slide-level labels were provided for all the WSIs and exhaustive pixel level annotations were provided for 50 WSIs. The slide-level labels were negative, Isolated Tumor cells (ITC), micro-metastases and macro-metastases (Table 1). In CM17 challenge, the 1000 WSIs were divided into 200 artificial patients, and each artificial patient was constructed by grouping 5 WSIs from the same medical center. The pN-stage labels were provided for 100 patients from the training set based on the rules provided in Table 2. The Table 3 provides the metastases type distribution in CM17 training dataset.

Metastases (Training Set)
Negative ITC Micro Macro
318 35 64 88
Table 3: Metastases type distribution in CAMELYON17 training set

2.2.2 DigestPath

DigestPath dataset consists of a total of 750 tissue slices taken from 450 different patients. On average, each tissue image was of size 5000x5000. The dataset also included fine pixel-level annotations of lesion done by experienced pathologists. Additionally, 250 tissue slices taken from 150 patients were provided as a test set. The data was collected from multiple medical centers, especially from several small centers in developing countries. All the tissues were stained by hematoxylin and eosin (HE) and scanned at 20x resolution. Out of 750 tissue slices, 725 slices were used in training, and the rest 25 slices we considered as a held-out test set.

2.2.3 Paip

Figure 2: In the above figure green contour represents viable tumor and the blue contour represents whole tumor

The PAIP dataset contains a total of 100 whole slide images scanned from liver tissue samples. Each image had an average dimension of 50,000x50,000 pixels. All the images were stained by hematoxylin and eosin, scanned at 20x magnification, and prepared from a single center, Seoul National University Hospital. The dataset included pixel-level annotation of the viable tumor and whole tumor regions. It also provided the viable tumor burden metric for each image.

Tumor burden is defined as the ratio of the viable tumor region to the whole tumor region. The viable tumor region is defined as the cancerous region. The whole tumor area is defined as the outermost boundary enclosing all the dispersed viable tumor cell nests, tumor necrosis, and tumor capsule (Fig 2). Each tissue sample contains only one whole tumor region. This metric has applications in assessing the response rates of patients to cancer treatment.

Out of the 100 images, 50 images were the publicly available training set, 10 images were reserved for validation set that was made publicly available after the challenge was completed, and the rest 40 images were the test set whose ground truth were not publicly available.

2.3 Data pre-processing

2.3.1 Tissue mask generation

Tissue mask generation was done to prevent unnecessary computations on blank spaces in tissue slice. In this step, the entire tissue region was separated from the background glass region of the slide. Here we make use of Otsu’s adaptive thresholding (otsu1979threshold) technique on the low-resolution version of the WSI. The RGB color space of the low-resolution WSI was transformed to HSV (Hue-Saturation-Value) color space, and thresholding was applied to the saturation component. Post thresholding, morphological operations were done to ensure that during patch-based inference of the WSI, the neighboring contexts were appropriately provided in the extracted patches at small tissue regions and tissue borders.

2.3.2 Patch coordinate extraction

The WSIs are gigapixel images, and the typical image sizes are of order . We proposed to train our neural networks with small patches of fixed size extracted from the WSI. We extracted coordinates as the center of the patches from the low-resolution WSI’s tissue mask. We rescaled the extracted coordinates to correspond to level-0 WSI. The randomly sampled coordinates from the WSIs corresponded to regions from the tumor, tumor boundary regions, and non-tumor tissue. From each WSI, a fixed number of patch coordinates corresponding to various regions of WSIs were extracted. Further, these extracted patch coordinates were split into three cross-validation folds.

Dataset #WSIs Average image size #Number of patches extracted
CAMELYON 628 100,000x100,000 571,029
DigestPath 725 5,000x5,000 80,000
PAIP 50 50,000x50,000 200,000

2.3.3 Data augmentation, normalization and data loading

To increase the number of data points and generalization of models to various staining and acquisition conditions, we made use of data augmentations. Augmentations like “flipping left, and right,” “90-degree rotations”, and “Gaussian blurring” along with color augmentation were performed. Color augmentation included random changes to brightness, contrast, hue, and saturation, the ranges for which were 64.0/255, 0.75, 0.25, 0.04, respectively. Additionally, in order to introduce a diversity of patches extracted from the WSI during every training cycle, we incorporated random coordinate perturbation. The random coordinate perturbation involved offsetting the center of the patch coordinates within a specified radius prior to the extraction from WSI. We fixed our radius to be 128 pixels and extracted patches of size 256x256 from the level-0 image.

Post augmentation, the images were normalized using equation 1

to scale the range of data statistics, basically transforming the data to have zero mean and unit standard deviation. Since the image dimensions were huge, computing the image statistics during training was expensive. This was circumvented by assuming the mean and standard deviation of the image to be 128, which spreads the entire region.


The data-loader prepared batches of data comprising an equal number of tumorous and non-tumorous patches, thereby facilitating a balanced dataset for training the neural network.

2.4 Network architecture

For the task of segmentation of tumor regions from patches of WSIs, we propose to utilize encoder-decoder based fully convolutional neural (FCN) network-based architectures (long2015fully)

. An encoder comprises of a series of operations (like convolution and pooling) that reduces the input size to a low dimension feature vector. The decoder upsamples the feature vectors to a larger dimension. The entire encoder-decoder is then trained end-to-end. Fully convolutional architecture substitutes fully connected layers with convolutional layers, thereby allowing input of arbitrary size. We propose an ensemble approach comprising of some of the FCN architectures that show the state of the art results on natural images, namely Densenet, InceptionResNetV2, and DeepLabV3Plus.

2.4.1 DenseNet

Figure 3: Densenet Architecture

We used a U-Net (ronneberger2015u) based encoder-decoder architecture. The main feature of U-Net is the presence of skip connections between layers of the encoder and decoder. These connections allow for the free flow of low-level information directly from the encoder to the decoder, bypassing the bottleneck. The skip-connection was made by concatenating the layer with the layer.

For the encoder, we use the DenseNet (iandola2014densenet) architecture. The input for a particular layer in the DenseNet network was concatenated to inputs of all the previous layers. This was done to prevent the loss of information through forward pass in deep networks. The decoder was composed of the bi-linear up-sampling module and convolutional layers.

2.4.2 Inception-ResNetV2

Figure 4: Inception-ResNetV2 architecture

The Inception-v4 (szegedy2017inception) (also known as Inception-ResNetV2) integrates the features of the Inception (szegedy2015going) architecture and the ResNet (he2016deep) architecture. The ResNet architecture introduced skip-connections between layers by summing the input and output of the layer and feeding that as the input for the layer. The Inception architecture contains inception blocks that have convolutions of various sizes. This waives the need to deliberate on the filter sizes for each layer.

In the Inception-ResNet’s architecture, the inception blocks have skip connections that sum the input of the block to the output.

2.4.3 DeepLabV3Plus

Figure 5: DeepLabV3Plus architecture

DeepLabV3 (chen2017rethinking) was built to obtain multi-scale context. This was done by using atrous convolutions with different rates. DeeplabV3Plus (chen2018encoder) extends this by having low-level features transported from the encoder to decoder.

2.4.4 Ensemble

Our ensemble comprised of 3 FCN models, each of the individual FCN model was trained on one of the 3-fold cross-validation splits. During inference, the predicted posterior probabilities of the segmentation maps were averaged to generate the ensemble model prediction. The models were trained and inferred on the image patch size of 256x256 at the highest resolution of WSI.

2.5 Loss function

Tumor regions were represented by a minuscule proportion of pixels in WSIs, thereby leading to class imbalance. This issue was circumvented by training the network to minimize a hybrid loss function. The hybrid cost function comprised of cross-entropy loss and a loss function based on the Dice overlap coefficient. The dice coefficient is an overlap metric used for assessing the quality of segmentation maps. The dice loss is an differentiable function that approximates to dice-coefficient and is defined using the predicted posterior probability map and ground truth binary image as defined in the Eq.

2. The cross-entropy loss is defined in Eq. 3. In the equations, is the predicted posterior probability map, and is the ground truth image.


The total loss is defined as a linear combination of the two loss components as defined in Eq. 4. The neural networks are trained by minimizing the total loss. The are empirically assigned to the individual loss components. FG and BG represent the foreground pixels that correspond to the tumor regions and the background pixels that corresponded to non-tumor regions, respectively. In this work we set and .


2.6 Uncertainty analysis

Uncertainty estimation is essential in assessing unclear diagnostic cases predicted by deep learning models. It helps pathologists/doctors to concentrate more on the uncertain regions for their analysis. The need and challenges that we might face in the context of uncertainty are discussed in (begoli2019need). There exist two main sources of uncertainty, namely (1) Aleatoric uncertainty, and (2) Epistemic uncertainty. Aleatoric uncertainty is uncertainty due to the data generation process itself. In contrast, the uncertainty induced due to the model parameters, which is the result of not estimating ideal model architectures or weights to fit the given data, is known as epistemic uncertainty (kendall2017uncertainties). Epistemic uncertainty can be approximated by using test time Bayesian dropouts as discussed in (leibig2017leveraging), which estimates uncertainty by Montecarlo simulations with Bayesian dropout.

In the proposed pipeline, we estimate aleatoric uncertainty for each model using test time augmentations, as introduced in (gal2016dropout). For epistemic uncertainty, we make use of the diversity of model architectures and calculate uncertainty as described in equation 5.



where indicates trained model, where and denotes the set of possible test time data augmentations allowed. In our case .

2.7 Training pipeline

Figure 6: Overview of training pipeline

Figure 6 describes the training strategy utilized for training each of our models. The batches for training were generated with an equal number of tumor and non-tumor patches. This was done to prevent class imbalance or manifold shift issues and enforce proper training. All three models were trained independently, with different folds of the data.

We made use of the transfer learning technique by using ImageNet

(deng2009imagenet) pre-trained weights for encoders based on DenseNet and Inception-ResNetV2. We made use of the models pre-trained on PascalVOC(everingham2010pascal)

in the case of DeeplabV3Plus. For the first two epochs, the encoder section of the models were frozen, and only the decoder weights were made trainable. For the remaining epochs, both the encoder and decoder parts were trained, and the learning rate was reduced gradually. The learning rate was decayed after every few epochs in a deterministic manner to allow for the model to gradually converge.

2.8 Inference pipeline

Figure 7: Overview of Inference Pipeline
Figure 8: (Left to Right) A tissue mask overlayed on the WSI image at low resolution (level-4), here the white region corresponds to the tissue mask; Uniform Patch Coordinate Sampling Grid, here the points on the image act as centers from which high-resolution image patches were extracted from the WSI.

Figure 7 provides an overview of the inference pipeline to generate tumor probability maps and model uncertainty maps. In a typical WSI, the tissue region occupies a smaller fraction when compared to the background glass slide, hence for faster inference, the pre-processing step involved segmentation of tissue region. This tissue segmentation mask was generated, as discussed in section 2.3.1. In order to facilitate extraction of patches from the WSI within the tissue mask region a uniform patch-coordinate sampling grid was generated at lower resolution as shown in Figure 8

. Each point in the patch sampling grid was rescaled by a factor to map to the coordinate space corresponding to WSI at its highest resolution. From these scaled coordinate points as the center, fixed-size high-resolution image patches were extracted from the WSI. We define sampling stride as the spacing between consecutive points in the patch sampling grid. The high-resolution patch size and the sampling stride controlled the overlap between consecutive extracted patches from the WSI. The main drawback with patch-based segmentation of WSI was that the smaller patches could not capture a wider context of the neighborhood. Moreover, stitching of the patch segmentation introduced boundary artifacts (blockish appearance) in the tumor probability heat-maps. We observed that the generated heat-maps were smooth when the inference was done on overlapping patches with larger patch-size and averaging the prediction probabilities at overlapping regions. Our experiments suggested that 50% overlap between consecutive neighbouring patches was the optimal choice as it ensured that a particular pixel in the WSI was seen at most 4-times during the heat-map generation. However, this approach increased the inference time by a factor of 4. We also observed that during inference increasing the patch size by a factor of 4 when compared to the patch size used during training (256x256) improved the quality of generated heat-maps. The generated heat maps obtained by single model with multiple input images varied with different test time augmentations were used in the estimation of aleatoric uncertainty map. Epistemic uncertainty map was estimated by using multiple heat maps obtained from different models for a given input image.

2.9 pN-Staging Pipeline

Figure 9: Overview of pN-Staging pipeline

The Figure 9 illustrates the overall pipeline for pN-Staging. The pipeline comprises 4 blocks as described below:

  • Pre-processing: The tissue regions in the WSIs were detected for patch extraction.

  • Heatmap generation: The extracted patches from the WSIs were passed through the inference pipeline to generate the down-scaled version of the tumor probability heatmaps.

  • Feature extraction: The heatmaps were binarized by thresholding at 0.5 and 0.9 probabilities, and at each of these thresholds, the connected components were extracted, and region properties were measured using scikit-image

    (scikit-image) library. We computed 32 geometric and morphological features of the probable metastases regions. Table 4 provides a description of all the features computed.

    No. Feature description Threshold (p)
    1 Largest tumor region’s major axis length p=0.9 & p=0.5
    2 Largest tumor region’s area p=0.5
    3 Ratio of tumor region to tissue region p=0.9
    4 Count of non-zero pixels p=0.9

    Tumor regions area: maximum, mean, variance, skewness, and kurtosis

    6 Tumor regions perimeter: maximum, mean, variance, skewness, and kurtosis p=0.9
    7 Tumor regions eccentricity: maximum, mean, variance, skewness, and kurtosis p=0.9
    8 Tumor regions extent: maximum, mean, variance, skewness, and kurtosis p=0.9
    9 Tumor regions solidity: maximum, mean, variance, skewness, and kurtosis p=0.9
    10 Mean of all region’s mean confidence probability p=0.9
    11 Number of connected regions p=0.9
    Table 4: Features extracted from the heatmap for predicting lymph node metastases type.
  • Classification: The pN-stage was assigned to the patient based on all the available lymph-node WSIs, taking into account their individual metastases type (Table 1). We employed a simple count-based rule based on pN-staging as defined in Table 2

    . For predicting the metastases type, we built an ensemble of Random Forest classifiers

    (liaw2002classification) using the extracted features.

Figure 10: Overview of viable tumor burden pipeline

2.10 Tumor burden estimation

We obtained the viable tumor region from the segmentation pipeline discussed. For the whole tumor region, initially, we attempted to use the same segmentation pipeline and model it as a binary classification problem. However, the model provided results that had an accuracy lesser than 10%. Therefore, we adopted a heuristic method to calculate the tumor burden. First, the viable tumor region was predicted. Morphological operations were applied to remove some of the false positives and fill the holes. Next, the smallest convex hull that contained the entire viable tumor region was calculated. Simultaneously, a tissue mask was obtained using otsu thresholding. The convex hull was then multiplied with the tissue mask to obtain the whole tumor segmentation. The viable tumor burden was then calculated by taking the ratio of the area of the viable tumor and whole tumor. Figure

10 show an overview of the viable tumor burden estimation. The results are shown in Figure 20.

3 Experimental analysis

In this section, we experimentally analyze the effectiveness of our proposed methodologies for segmentation and classification models. We implemented our neural networks using TensorFlow (

abadi2016tensorflow) software. We ran our experiments on multiple desktop computers with NVIDIA Titan-V GPU with 12 GB RAM, Intel Core i7-4930K 12-core CPUs @ 3.40GHz, and 48GB RAM.

3.1 Lesion detection performance on CAMELYON

In this section, we detail some of the techniques specific to CAMELYON dataset pre-processing and discuss the performance of various FCN architectures and ensemble configurations for lesion detection on the CAMELYON16 test dataset (n=139).

3.1.1 Tissue mask generation

In some of the CM17 cases, the Otsu’s thresholding failed because of the black regions in WSI. Hence, prior to the application of image thresholding operation, the pre-processing involved replacing black pixel regions in the WSI back-ground with white pixels and median blurring with a kernel of size 7x7 on the entire image. Median blur aided in the smoothing of the tissue regions and removal of noise at the tissue bordering the glass-slide region‘ while preserving the edges of the tissue. Figure 11 illustrates the pipeline for tissue mask generation with an example.

Figure 11: An illustration of the intermediate results of tissue mask generation of WSI in CM17 dataset.

3.1.2 Dataset preparation

For training our model for lesion segmentation, we used training sets of both CAMELYON16 and CAMELYON17 dataset, which had pixel-level annotations. As noted by the challenge organizers, some of the WSIs were not exhaustively annotated in the CAMELYON16 training set; we excluded them in our training set preparation. So, in total, we had 628 WSIs for training (250 WSIs from CAMELYON16 and 378 from CAMELYON17). We made a 3-fold stratified cross-validation split of the training set to maximize the utilization of the limited number of WSIs. The stratification ensured that the ratio of negative to metastases was maintained in all the three folds. From 628 WSIs, we randomly sampled 571029 coordinates whose patches included regions from the tumor and non-tumor tissue regions. A patch extracted from a WSI was labeled as a tumor patch if it had non-zero pixels labeled as tumor pixels in the pathologist’s manual annotation. Further, these extracted patch coordinates were split into 3-cross validation folds. Table 5 shows the distribution of the split in each of the folds.

Patch label No. of patch coordinates
Fold-0 Fold-1 Fold-2
Training Non-Tumor 187034 196094 190424
Training Tumor 184467 194709 187440
Validation Non-Tumor 99742 90682 96352
Validation Tumor 99786 89544 96813
Table 5: The table gives the statistics of the tumor and non-tumor patches in each of the cross-validation folds.

3.1.3 Training and inference configuration of ensemble FCN models

We experimented with following two types of Ensemble configuration:

  • Ensemble-A: It comprised of the three different FCN architectures, as described in section 2.7. The inference pipeline made use of patch size of 256 and extracted non-overlapping patches, as illustrated in Figure 9.

  • Ensemble-B: It comprised of three replicated versions of a single FCN architecture. The inference pipeline made use of a patch size of 1024 with a 50% overlap between neighboring patches, as illustrated in Figure 12.

Figure 12: The figure illustrates the overlap-stitch inference pipeline used in Ensemble-B configuration.

In both the ensemble configurations, each model in the ensemble was trained with one of the 3-fold cross-validation splits. All the models made use of pre-trained weights with the fine-tuning procedure, as described in section 2.7. The models were trained for ten epochs with a batch size of 32.

3.1.4 Lesion detection performance of Ensemble-A

Table 6 provides a summary of the Ensemble-A and its individual constituting FCN models. Our results showed that DenseNet-121 architecture had higher sensitivity and reduced False positives when compared to other FCNs in the ensemble configuration. It was also observed that Ensemble-A showed significant difference in FROC score (A.0.3) compared to its constituents. The reason for this significant boost in the performance of Ensemble-A could be attributed to the effect of averaging the heatmaps from multiple FCN models, thereby lowering the probabilities of uncertain or less confident regions and hence eliminating the False Positives. Figure 13 illustrates the heatmaps generated by Ensemble-A and its constituent FCN models on a CAMELYON16 test case.

Figure 13: The figure shows the heat-maps overlayed on the WSI by FCN models in Ensemble-A configuration. (Left to Right): DenseNet121 FCN, InceptionResNetV2 FCN, DeepLabV3plus, Ensemble, (Patch Size: 256x256, Sampling Stride: 256 pixels).
Average FPs /WSI Sensitivity
IRF (Fold-0) DF (Fold-1) DL (Fold-2) Ensemble-A
0.25 0.5 0.59 0.03 0.77
0.5 0.59 0.65 0.06 0.8
1 0.69 0.72 0.2 0.83
2 0.77 0.79 0.48 0.84
4 0.8 0.83 0.64 0.86
8 0.82 0.85 0.77 0.86
FROC Score 0.69 0.74 0.36 0.83
Table 6: The table shows the FROC scores achieved on CAMELYON16 test set (n=139) by FCN models in Ensemble-A configuration. Note the abbreviations: IRF - Inception-ResNetV2 FCN, DF- DenseNet-121 FCN, DL- DeepLabV3Plus, FP- False Positives

3.1.5 Lesion detection performance of Ensemble-B

From Table 6 it was evident that the performance of DenseNet-121 FCN performed significantly better than the other two models in Ensemble-A. Hence, we proposed to design an Ensemble-B comprising of three DenseNet-121 FCNs as its constituents, each model was trained on one of the 3 cross-validation folds. The results are summarized in Table 8. We observed an improvement in FROC score for Ensemble-B (A.0.3) when compared to Ensemble-A. We also observed that the performance of individual models in the ensemble were similar (considering FROC score). However, this observation contrasted with Ensemble-A, where the models showed significant difference in their FROC scores. This also helped us to ascertain the fact that the performance difference in individual FCN models of Ensemble-A were not because of data variation seen between individual cross-validation folds. The Figure 14 illustrates the heatmaps generated by Ensemble-B and its constituent FCN models on a CAMELYON16 testcase. The Table 7 provides a summary of FROC scores on CAMELYON16 testset (n=139) for various configurations of patch-size and sampling stride. We observed that running inference on larger patch size enabled generation of smoother heatmaps and patch extraction with overlapping stride gave better context at the patch borders thereby minimizing the blockish artifacts seen in the heatmaps. In the Figure 14 we observed that while running inference on larger patch size, the model slightly struggled at tissue borders which was evident from the confidence probability values. This was primarily because we had trained the models on patch size of 256 and our training samples lacked regions from tissue border and glass-slide regions. Since, there was a marginal difference in FROC scores between various ensemble configurations (Table 7), in the interest of minimizing the computation time we preferred to use Ensemble-A with patch-size and sampling stride set to 256 (non-overlapping patch acquisition) for running inference on CAMELYON17 testing dataset (n=500).

Model Patch size Sampling stride FROC score
Ensemble-A 256 256 0.83
Ensemble-B 256 256 0.84
Ensemble-B 1024 1024 0.86
Ensemble-B 1024 512 0.85
Table 7: The table shows the FROC scores on CAMELYON16 test set (n=139) for various configurations of model, patch-size, and sampling-stride.
Figure 14: The figure shows the heat-map overlayed on the WSI by FCN models in Ensemble-B configuration. (Left to Right): 3 DenseNet121 FCN models trained on cross-validation folds: 1, 0 and 2 respectively and Ensemble-B, (Patch Size: 1024x1024, Sampling Stride: 512 pixels).
Average FPs /WSI Sensitivity
DF (Fold-0) DF (Fold-1) DF (Fold-2) Ensemble-B
0.25 0.56 0.56 0.61 0.77
0.5 0.65 0.63 0.69 0.84
1 0.71 0.70 0.76 0.85
2 0.77 0.75 0.81 0.88
4 0.82 0.80 0.84 0.88
8 0.86 0.86 0.88 0.89
FROC Score 0.73 0.72 0.77 0.85
Table 8: The table shows the FROC scores achieved on CAMELYON16 test set (n=139) by FCN models in Ensemble-B configuration. Note the abbreviations: DF- DenseNet-121 FCN, FP- False Positives.

3.2 Lymph-node metastases type classification

In this section we detail and discuss the dataset preparation and training methodology of Random Forest classifiers for metastases type classification.

3.2.1 Dataset preparation

The CM17 training dataset (Table 3) had 100 patients and each patient had 5 WSIs with their corresponding metastases labels (total 500 WSIs). We split 100 training patients into 43 patients as train set and the remaining 57 patients as validation set. The train set comprised of patients which had atleast one of their WSIs with pixel level annotation. The Table 9 shows the distribution of WSIs in terms of metastases type between train and validation sets, our split strategy ensured that the distribution of metastases type between the two splits were similar.

Dataset No. of WSIs per each metastasis type
Negative ITC Micro Macro Total
Train set 100 26 35 44 215
Validation set 318 35 64 88 285
Table 9: Details of our CM17 train and validation set split

3.2.2 Training methodology

We generated tumor probability heatmaps for all the 500 WSIs using Ensemble-A configuration (section 3.1.3) and extracted all the features listed in Table 4

. Post generation of features, we cleaned the training set by removing some of the outlier points. The outliers were detected based on threshold-based heuristics like the presence of significantly large tumor false-positive regions in negative cases, etc. For the purpose of classifier selection, feature elimination and hyper-parameter tuning we initially trained our classifiers on the train set (n=215) and validated on the held-out validation set (n=285). Our experimentation on various classifiers showed that the optimal performance in-terms of classification accuracy (90.18%) and Cohen’s kappa score (0.9164) on held-out validation set was achieved with Random Forest classifier with 100 trees. Figure


shows the confusion matrix and the feature importance ranking by the Random Forest classifier. From Table

9, we observed that the data distribution was highly class imbalanced, with negative cases being majority class and ITC cases being minority class. This lead to mis-classifications between ITC and negative cases, as evident in the confusion matrix.

We further experimented by training another Random Forest classifier on the complete training set (n=500) in order to maximize the utilization of training points. The 5-fold cross-validation showed an average accuracy score of 90%, and its performance was similar to the model trained on train set (n=215). We refer to the above two trained models as RF-PI and RF-CI (Random Forest classifiers trained on the partial and complete training set with imbalanced class data, respectively).

(a) Confusion Matrix of RF-PI
(b) Feature Importance
Figure 17: (a) Confusion Matrix of Random Forest classifier trained on imbalanced train set predicted on held-out validation set, (b) Feature importance ranking by Random Forest classifier for metastases classification. The green bars are the feature importance of the forest, along with their inter-trees variability (standard deviation). The feature importance shows that the most important feature for metastases classification is largest tumor region’s long axis/diameter, which corroborates to pathologist way of determining the metastases type from the WSI. Note: _t90 and _t50 indicate heatmaps thresholded at 0.9 and 0.5 probability values, pixels_gt_90 indicate number of non-zero pixels in the binary image after thresholding the heatmap 0.9 probability.

3.2.3 Data balancing

In order to handle the class imbalance problem, one of the techniques proposed in the literature is oversampling by synthetically generating minority class samples using SMOTE algorithm. (chawla2002smote)

. However, this method can introduce noisy samples when the interpolated new samples lie between marginal outliers and inliers. This problem is usually addressed by removing noisy samples by using under-sampling techniques like Tomek’s link

(tomek1976two) or nearest-neighbors. We proposed to balance our training data using a SMOTETomek (batista2004study) algorithm, a combination of SMOTE and Tomek’s link performed consecutively. We separately balanced our train set (n=215) split and the complete training set (n=500). We independently trained two Random Forest classifiers using these two balanced datasets. We refer to these two trained models as RF-PB and RF-CB (Random Forest classifiers trained on Partial and Complete training set, which are class Balanced data respectively). Table 10 provides the results of the validation study performed on all four models. We observed that post-class balancing of the training dataset, the 5-fold cross-validation accuracy scores improved by a margin of 5%.

Accuracy (%)
Classifier 5-fold CV Validation set
RF-PI 87 (0.06) 90.18
RF-PB 92 (0.03) 87.02
RF-CI 89.89 (0.03) N.A
RF-CB 94.83 (0.02) N.A
Table 10: The table provides the validation results of the four Random Forest classifiers, each trained on different subsets of the training data. Note: For the models RF-PI and RF-PB, held-out validation existed, whereas, for the other two models, it was not available as it was trained on the entire training set, and hence N.A (not applicable) is mentioned in the table. For all the models, 5-fold cross-validation accuracy was estimated on their respective training sets. These values are provided as mean (standard deviation).

3.2.4 Ensembling classifiers

We proposed an ensemble strategy for combining the predictions of all the four trained Random Forest classifiers. The ensembling was based on the majority voting principle, and in case of a tie, the higher metastases category was selected. We refer to this model as RF-Ensemble.

3.3 Segmentation performance on DigestPath

Figure 18: (Clockwise from top left) WSI Image; Ground truth of the tumor; Prediction probability of the ensemble pipeline; Prediction with threshold of 0.5
Models Dice
DeepLabV3Plus 0.81
DenseNet-121 FCN 0.84
Inception-ResNetV2 FCN 0.84
Ensemble 0.86
Table 11: Segmentation results on held-out test set of DigestPath

In this section we discuss specific details on the training and inference strategies on DigestPath dataset. We split the training data (n=725) into 3-fold cross-validation sets, each set of data was used to train the individual models in the ensemble. We extracted a total of 80,000 patches from the entire training data. We trained each model with a patch size of 256 and a batch size of 32, with Adam optimizer.

In the case of inference, we made use of patches with dimension 256 x 256, with 50% of overlap and a batch size of 32. We considered 0.5 as a threshold value to generate a segmentation binary mask on a predicted probability map.

Figure 18 illustrates an example of the segmentation map generated by our ensemble model. We tested our trained models on a subset of the training set (n=25) reserved for our internal testing set, and their corresponding results are tabulated in Table 11.

3.4 Segmentation performance on PAIP

Figure 19: (Clockwise from top left) WSI Image; Ground truth of the tumor; Prediction probability of the ensemble pipeline; Prediction with threshold of 0.5

In this section we elaborate on the training and inference details for segmentation on the PAIP dataset. For the tissue mask, we first performed Otsu thresholding on the HSV version of the slide image. We then performed the closing morphological operation with a kernel size of 20, followed by an opening operation with a kernel size of 5 and finally a dilation operation with a kernel size of 20.

We split the training data (n=50) into five cross-validation folds and used three of those folds for training the models of the ensemble. The data was split into five folds as opposed to three folds to ensure that each training set had at least 40 samples. We extracted a total of 200,000 patches from the entire training data with equal contributions from each training sample. We trained with a patch size of 256 and a batch size of 32.

For the inference, we used patches of 1024 and a batch size of 16. For the threshold, we experimented with a range of values and chose 0.5 as it gave optimal segmentation performance on the validation set(n=10). We observed that setting lower threshold(0.5) resulted in false positives and higher threshold(0.9) led to under-segmentation.

Figure 19 illustrates an example of the segmentation map generated by our ensemble model. We tested our trained models on the validation set (n=10), and their corresponding results are tabulated in Table 12.

Model Jaccard Score
DeepLabV3Plus 0.681786
Inception-ResnetV2 0.685771
Densenet 0.679107
Ensemble 0.701621
Table 12: Segmentation results on validation set of PAIP

3.5 Viable Tumor Burden

Figure 20: (Left to Right) Viable tumor prediction; Whole tumor prediction; Whole tumor Ground Truth

Figure 20 shows the whole tumor region predictions obtained by the methodology described 2.10. The first image shows that this pipeline gives a good approximation to the whole tumor region. Most of the samples fit into this category. Our pipeline failed in two scenarios. One is when the segmentation pipeline predicted viable tumors in discrete dispersed regions. This is illustrated in the second example in Figure 20 where there is a spurious prediction apart from the main tumor prediction. The other failure case is when the actual whole tumor region is larger than the convex hull boundary of the actual viable tumor region. The third sample in figure 20 demonstrates this failure.

3.6 Uncertainty analysis

In this section, we demonstrate and interpret the results of our uncertainty analysis on all three digital pathology datasets.

3.6.1 DigestPath uncertainty maps

Figure 26 provides an illustration of tumor probability and uncertainty maps for a held-out test case from the DigestPath dataset. Figure 26(1,2,3,4,5)(b) corresponds to ground truth image overlayed on the tissue image, while Figure 26(1,2,3,4,5)(c) corresponds to tumor probability heatmap overlayed on tissue region. Figures 26(1, 2, 3)(d) describes aleatoric uncertainty with DenseNet, InceptionNet, and DeeplabV3Plus respectively. Since the patches considered for inference in the case of DigestPath were of 256 256, aleatoric uncertainty can be observed in various tumors region. Figure 26(4)(d) describes epistemic uncertainty, where the uncertain region clearly corresponds to the boundary of the tumor tissue region, while Figure 26(5)(d) describes combined uncertainties (average of aleatoric uncertainties for all models along with epistemic uncertainty across the models), which points out various tumorous and boundary regions. In Figure 26(4, 5)(c), we observed ensemble prediction reduced the number of false positives, their by increasing overall dice score to 0.86, which is about 0.04-0.06 of dice score improvement when compared to individual model.

(a) (1) DenseNet model predictions with aleatoric uncertainty (dice = 0.82)
(b) (2) InceptionNet model predictions with aleatoric uncertainty (dice = 0.78)
(c) (3) DeepLabv3Plus model predictions with aleatoric uncertainty (dice = 0.80)
(d) (4) Ensemble Predictions with epistemic uncertainty (dice = 0.86)
(e) (5) Ensemble predictions with combined uncertainty (dice = 0.86)
Figure 26: In the figure (a) WSI, (b) Ground truth image overlayed on the tissue region, (c) Tumor probability heatmap overlayed on the tissue region, (d) Corresponding uncertainty map.

3.6.2 PAIP uncertainty maps

In the Figure 32 provides an illustration of tumor probability and uncertainty maps for a held-out test case from PAIP dataset. Our observations were similar to as discussed in section 3.6.1.

(a) (1) DenseNet model predictions with aleatoric uncertainty (dice = 0.88)
(b) (2) InceptionNet model predictions with aleatoric uncertainty (dice = 0.88)
(c) (3) DeepLabv3Plus model predictions with aleatoric uncertainty (dice = 0.88)
(d) (4) Ensemble Predictions with epistemic uncertainty (dice = 0.91)
(e) (5) Ensemble predictions with combined uncertainty (dice = 0.91)
Figure 32: In the figure (a) WSI, (b) Ground truth image overlayed on the tissue region, (c) Tumor probability heatmap overlayed on the tissue region, (d) Corresponding uncertainty map.

3.6.3 CAMELYON uncertainty maps

In Figure 38 provides an illustration of tumor probability and uncertainty maps for a held-out test case from CAMELYON dataset. Our observations were similar to as discussed in section 3.6.1.

(a) (1) DenseNet model predictions with aleatoric uncertainty (dice = 0.91)
(b) (2) InceptionNet model predictions with aleatoric uncertainty (dice = 0.89)
(c) (3) DeepLabv3Plus model predictions with aleatoric uncertainty (dice = 0.88)
(d) (4) Ensemble Predictions with epistemic uncertainty (dice = 0.94)
(e) (5) Ensemble predictions with combined uncertainty (dice = 0.94)
Figure 38: In the figure (a) WSI, (b) Ground truth image overlayed on the tissue region, (c) Tumor probability heatmap overlayed on the tissue region, (d) Corresponding uncertainty map.

3.7 Transfer learning between cancer sites

Figure 39: (Left to Right) DigestPath Slide; Ground truth; Tumor prediction with Camelyon model; Aleatoric uncertainty

To test transfer learning between cancer sites, we first tried inferencing on the DigestPath dataset with a model trained on the CAMELYON dataset. As it can be observed from Figure 39, the model learned inherent tissue concepts that were common between both sites. The similar staining pattern and nuclei similarity could have contributed to this observation.

Figure 40: Training loss with Camelyon pre-trained weights and ImageNet pre-trained weights for the PAIP dataset

Subsequently, we conducted a comparative analysis on the model weight initialisation. We trained two models on the PAIP dataset; One pre-trained on the CAMELYON dataset and one pre-trained on the Imagenet dataset (deng2009imagenet). We observed that the former model converged at a faster rate. Figure 40 illustrates this comparison. This observation indicates that models trained on pathology datasets could serve as a better starting point compared to natural image datasets.

3.8 Model generalizability

(a) (1) DenseNet model predictions with aleatoric uncertainty
(b) (2) InceptionNet model predictions with aleatoric uncertainty
(c) (3) DeepLabv3Plus model predictions with aleatoric uncertainty
(d) (4) Ensemble Predictions with epistemic uncertainty
(e) (5) Ensemble predictions with combined uncertainty
Figure 46: In the figure (a) WSI, (b) Tumor probability heatmap overlayed on the tissue region, (c) Corresponding uncertainty map.

To test the generalizability of the model we conducted the model inferences on TCGA colon dataset (gdc), the model was able to accurately classify between tumorous and normal cells as seen in the Figure 46. Even though the model was trained on DigestPath data, our model could generalize well on datasets coming from other sources, Figures 46(1, 2, 3, 4, 5)(b) describes the predicted high probable tumor regions, while in the Figure 46(1, 2, 3, 4, 5)(c) describes corresponding aleatoric and epistemic uncertainties in the prediction. In future, we plan to extend this generalizability study by getting pathologists annotations done and precisely assessing the seg mentation performance.

3.9 Hardmining

Our patch extraction scheme from the whole-slide images were based on random uniform sampling. However, this led to some hard examples being excluded from the training set which resulted in the model’s poor performance on such regions of WSI. Therefore, we attempted to solve this issue by hardmining the poorly performing regions in WSI and fine-tuning the trained model with this hardmined set.

For the experimental analysis with the PAIP dataset, we first extracted 80,000 random patches as the initial training set and trained a model on it. We empirically defined patches that scored less than 0.4 on the Jaccard index as regions were the model performed poorly. We than inferred on the entire dataset and extracted all patches that meet this criterion. In this manner, we obtained a total of 70,000 patches as the hardmined set. We then continued training the model after substituting the training set with the hardmined set. However, at every epoch, we observed the model’s accuracy on validation set reduced. We theorized that by training only on the failure cases, the model forgets the initial training set. So, we tried an alternative where we concatenated the patches of both the initial training set and the hard-mined set and trained the model with this combined set. The model’s accuracy reduced with this method as well, albeit at a slower rate.

To investigate this occurrence, we examined the patches obtained from hardmining and analysed them. We found groups of patches that were highly similar to each other. This was because the regions where the model failed were larger in size compared to the dimensions with which the patches were extracted. Therefore, the hardmining algorithm extracts several patches from the same region. As a result, this set was less representative of the data compared to the first set of randomly extracted patches. We observed a similar pattern in the other datasets as well.

4 Challenge Results

4.1 Performance on CAMELYON17 Challenge

Method Cohen Kappa Score Rank
leeautomatic 0.9570 1
pinchaudcamelyon17 0.9386 2
Ours (RF-Ensemble) 0.9090 3
Ours (RF-PI) 0.8971 12
Ours (RF-PB) 0.9027 9
Ours (RF-CI) 0.8889 18
Ours (RF-CB) 0.9057 6
Table 13: Comparisons with different approaches to automated pN-Staging in CAMELYON17 challenge. Our score reported in the table is from open public leader board. Our best performing method stood rank-3 on the leader-board (Accessed on: 31-Dec-2019). The table shows the performance of individual Random Forest classifiers in the ensemble and RF-Ensemble classifier.

Table 13 compares the results between our proposed approach and other published results on CAMELYON17 testing dataset (n=500). Our best performing model was RF-Ensemble, which gave a Cohen’s kappa score of 0.9090, placing 3rd in Open Leaderboard (out of 120 valid submission entries).

4.2 Performance on DigestPath Challenge

In the case of the DigestPath challenge, we submitted our ensemble model and obtained a Dice score of on test set. The scores of all the top-performing teams are tabulated in the Table. 14.

Teams Dice
kuanguang 0.807
zju_realdoctor 0.792
TIA_Lab 0.787
Ours 0.782
Table 14: The table shows the top segmentation scores on DigestPath testset

4.3 Performance on PAIP 2019 Challenge

The Table 15 summarises the scores for the top entries. Task 1 is evaluated with the Jaccard metric. For Task 2, first, the absolute accuracy is calculated, and then this score is weighted with the Task 1 score of the corresponding case. For Task 1, all the participants utilized deep learning methods, albeit with different architectures. For Task 2, all the participants used deep learning methods. However, our heuristic based algorithm performed better than most deep learning methods.

Team Task 1 Task 2
FNLCR 0.789 0.752
Sichuan University 0.777 NA
Ours 0.750 0.6337
Alibaba 0.672 0.6199
Sejong University 0.665 0.6330
Table 15: The table shows the top 5 entries of PAIP 2019. Task 1 corresponds to Viable tumor segmentation and Task 2 corresponds to Viable tumor burden estimation. Note: FNLCR: Frederick National Laboratory for Cancer Research

5 Open Source Contribution

Figure 47: User Interface of the software

We developed an open-source application (digipathai) on top of our segmentation pipeline. The application 47 can load WSI images, perform segmentation, and calculate the uncertainties. It features an API with which researchers can utilize the segmentation pipeline within their applications. Conversely, the application’s modular structure allows for researchers to test their segmentation pipeline with the application’s GUI as well. The slide viewer was built using OpenSlide (goode2013openslide) and OpenSeadragon (openseadragon).

6 Discussion and Conclusion

We developed an automated end-to-end framework for tumor tissue segmentation and whole slide image analysis that showed state-of-the-art results on three publicly available histopathology challenges. We approached the problem of segmentation of gigapixel WSI images using the divide and conquer strategy by dividing the large image into computationally feasible patch sizes and running segmentation algorithms on patches and stitching segmented patches to generate the segmentation of the entire WSI. We approached the problem of patch image segmentation using fully convolutional neural networks (FCN). FCNs are encoder-decoder based architectures employed for generating dense pixel-level classification. We used some of the state-of-the-art CNNs on natural image tasks as encoders for our FCNs, and the decoder was basically a learnable upsampling module to generate dense predictions. Our segmentation framework is an ensemble comprising of multiple FCN architectures, each independently trained on different subsets of the training data. The ensemble generated the tumor probability map by averaging the posterior probability maps of all the FCNs. The ensemble approach showed superior segmentation performance when compared to its individual constituting FCNs. The patch-based segmentation methods for large-sized images suffer from loss of neighboring context information at patch borders. We attempted to address this issue during inference by using a larger patch size and averaging overlapping patch regions posterior probability maps while stitching tumor probability maps for WSI. Depending upon tissue area covered on glass-slide, our inference pipeline takes about approximately 30-75 minutes for generating the entire tumor probability map. In addition to the generation of tumor heat-map, we also incorporated a methodology for generating uncertainty maps based on model and data variability. These uncertainty maps would assist in better interpretation by pathologists and fine-tuning the model with uncertain regions. Our proposed hard-mining approach showed decreased segmentation performance because of the adopted fine-tuning procedure on the trained model. Hence, one of the directions of our future work would involve developing effective methodologies for the online extraction of good representative patches from WSI during the training phase rather than the extraction of a fixed set of random patches(leeautomatic; pinchaudcamelyon17). Other areas to explore would be in the design of efficient and multi-resolution FCN architectures for capturing multi-resolution information from WSI images (graham2019mild). Our experimental results on transfer learning showed that pre-training models with different histopathology datasets could act as good starting points for training models were pathology datasets are limited. Post-processing techniques could be one of the directions to improve generated WSI segmentation, techniques such as patch-based conditional random fields (krahenbuhl2011efficient; li2018cancer) could be employed to refine the generated segmentation masks rather than employing hardcoded threshold values. The segmentation of WSIs is usually the primary step for many analysis tasks such as metastases classification and estimation of tumor burden. In this regard, we developed an automated pipeline for lymph node metastases classification and pN-staging. We proposed an ensemble of multiple Random Forest classifiers, each trained on different subsets of the training data. The training data was prepared by extracting meaningful features from a pathologist’s viewpoint from the tumor probability maps. We also demonstrated the efficacy of our synthetic samples in addressing class imbalance datasets for such classification tasks.

Our method for tumor burden estimation used an empirical method for estimating the whole tumor region. However, it still performed on par with other deep learning approaches that modeled it as a segmentation problem. This shows that the convex hull is a good approximation and also computationally inexpensive. This method could be developed further by incorporating learning-based methods. For example, the convex hull output could be used as an initial point for active contours-based models (kass1988snakes).

7 Author Contributions

MK, HR and AK developed the generalised pipeline together. MK, HR and AK experimented on CAMELYON, PAIP and DigestPath respectively. MK supervised HR and AK on the experimentation. HR and AK developed the open-source software together. MK, HR and AK wrote and revised the manuscript. GK and BS edited the manuscript, supervised and funded the study.

Appendix A Evaluation metrics

a.0.1 Dice Coefficient

The Dice is a metric used to measure the overlap between two given sets. In the case of segmentation, dice coefficient measures the overlap between the proposed model prediction() and the expert pathologists’ ground truth (

). This was the evaluation metric used in the DigestPath19 challenge. Mathematically, dice coefficient can be expressed as equation



a.0.2 Jaccard Coefficient

The Jaccard metric, similar to the Dice coefficient, measures the intersection over union between two given sets. For the task of segmentation, Jaccard coefficient measures overlap between prediction () and expert pathologists’ ground truth (). This was the evaluation metric used in the PAIP 2019 (paip) challenge. Mathematically, Jaccard coefficient can be expressed as equation 7.


a.0.3 Froc

One of the metrics used in CM16 challenge for lesion-based evaluation was free-response receiver operating characteristic (FROC) curve. The FROC curve was defined as the plot of sensitivity versus the average number of false-positives per image. We used CM16 challenge testing dataset for evaluating the performance of our algorithms for lesion detection/localization. The detection/localization performance was summarized using Free Response Operating Characteristic (FROC) curves. This was similar to ROC analysis, except that the false positive rate on the x-axis is replaced by the average number of false positives per WSI.

In the CM16 challenge, a true positive was considered, if the location of the detected region was within the annotated ground truth lesion.

  • If there were multiple findings for a single ground truth region, they were counted as a single true positive finding and none of them were counted as a false positive.

  • All detections that were not within a specific distance from the ground truth annotations were counted as false positives.

The final FROC score was defined as the average sensitivity at 6 predefined false positive rates: 1/4, 1/2, 1, 2, 4, and 8 FPs per whole slide image.

a.0.4 Cohen’s kappa score

Cohen’s kappa (fleiss1973equivalence)

is a statistic that measures the inter-rater reliability for categorical variables. In CM17 challenge for evaluating pN-staging of the patients the metric used was Cohen’s kappa with five classes and quadratic weights. The kappa metric ranges from -1 to +1, where 1 represented perfect agreement with the raters, and 0 represented the amount of agreement that can be expected by random chance and, a negative value represented lower than chance agreement.