Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology

02/18/2019 ∙ by David Tellez, et al. ∙ Radboudumc

Stain variation is a phenomenon observed when distinct pathology laboratories stain tissue slides that exhibit similar but not identical color appearance. Due to this color shift between laboratories, convolutional neural networks (CNNs) trained with images from one lab often underperform on unseen images from the other lab. Several techniques have been proposed to reduce the generalization error, mainly grouped into two categories: stain color augmentation and stain color normalization. The former simulates a wide variety of realistic stain variations during training, producing stain-invariant CNNs. The latter aims to match training and test color distributions in order to reduce stain variation. For the first time, we compared some of these techniques and quantified their effect on CNN classification performance using a heterogeneous dataset of hematoxylin and eosin histopathology images from 4 organs and 9 pathology laboratories. Additionally, we propose a novel unsupervised method to perform stain color normalization using a neural network. Based on our experimental results, we provide practical guidelines on how to use stain color augmentation and stain color normalization in future computational pathology applications.




1 Introduction

Computational pathology aims at developing machine learning based tools to automate and streamline the analysis of whole-slide images (WSI), i.e. high-definition images of histological tissue sections. These sections consist of thin slices of tissue that are stained with different dyes so that tissue architecture becomes visible under the microscope. In this study, we focus on hematoxylin and eosin (H&E), the most widely used staining worldwide. It highlights cell nuclei in blue color (hematoxylin), and cytoplasm, connective tissue and muscle in various shades of pink (eosin). The eventual color distribution of the WSI depends on multiple steps of the staining process, resulting in slightly different color distributions depending on the laboratory where the sections were processed; see Fig. 1 for examples of H&E stain variation. This inter-center stain variation hampers the performance of machine learning algorithms used for automatic WSI analysis. Algorithms that were trained with images originating from a single pathology laboratory often underperform when applied to images from a different center, including state-of-the-art methods based on convolutional neural networks (CNNs) (Goodfellow et al. (2016); Bejnordi et al. (2017); Bándi et al. (2019); Tellez et al. (2018); Veta et al. (2018); Ciompi et al. (2017); Sirinukunwattana et al. (2017)). Existing solutions to reduce the generalization error in this setting can be categorized into two groups: (1) stain color augmentation, and (2) stain color normalization.

Figure 1: Example images from training and test datasets. Applications are indicated by colors and keywords: tumor detection in lymph nodes (lymph), colorectal cancer tissue classification (crc), mitosis detection (mitosis) and prostate epithelium detection (prostate). Training set images are indicated by the keyword rumc and black outline. The rest belong to test sets from other centers. Stain variation can be observed between training and test images.

1.1 Stain color augmentation

Stain color augmentation, and more generally data augmentation, has been proposed as a method to reduce CNN generalization error by simulating realistic variations of the training data (Goodfellow et al. (2016)). These artificial variations are hand-engineered to mimic the appearance of future test samples that deviate from the training manifold. Previous work on data augmentation for computational pathology has defined two main groups of augmentation techniques: (1) morphological and (2) color transformations (Bándi et al. (2019); Liu et al. (2017); Tellez et al. (2018)). Morphological augmentation ranges from simple techniques such as 90-degree rotations, vertical and horizontal mirroring, or image scaling, to more advanced methods like elastic deformation (Simard et al. (2003)), additive Gaussian noise, and Gaussian blurring. The common denominator among these transformations is that only the morphology of the underlying image is modified and not the color appearance, e.g. Gaussian blurring simulates out-of-focus artifacts, a common issue encountered with WSI scanners. Conversely, color augmentation leaves morphological features intact and focuses on simulating stain color variations instead. Common color augmentation techniques borrowed from Computer Vision include brightness, contrast and hue perturbations. Recently, researchers have proposed other approaches more tailored to mimic specific H&E stain variations by perturbing the images directly in the H&E color space (Tellez et al. (2018)).

1.2 Stain color normalization

Stain color normalization reduces stain variation by matching the color distribution of the training and test images. Traditional approaches try to normalize the color space by estimating a color deconvolution matrix that allows identifying the underlying stains (Reinhard et al. (2001); Macenko et al. (2009)). More recent methods use machine learning algorithms to detect certain morphological structures, e.g. cell nuclei, that are associated with certain stains, improving the result of the normalization process (Khan et al. (2014); Bejnordi et al. (2016)). Deep generative models, i.e. variational autoencoders and generative adversarial networks (Kingma and Welling (2013); Goodfellow et al. (2014)), have been used to generate new image samples that match the template data manifold (Cho et al. (2017); Zanjani et al. (2018)). Moreover, color normalization has been formulated as a style transfer task where the style is defined as the color distribution produced by a particular lab (Bug et al. (2017)). However, despite their success and widespread adoption as a preprocessing tool in a variety of computational pathology applications (Bejnordi et al. (2017); Bándi et al. (2019); Veta et al. (2018)), these methods are not always effective and can produce images with color distributions that deviate from the desired color template. In this study, we propose a novel unsupervised approach that leverages the power of deep learning to solve the problem of stain normalization. We reformulate the problem of stain normalization as an image-to-image translation task and train a neural network to solve it. We do so by feeding the network with heavily augmented H&E images and training the model to reconstruct the original image without augmentation. By learning to remove this color variation, the network effectively learns to perform stain color normalization in unseen images whose color distribution deviates from that of the training set.

1.3 Multicenter evaluation

Despite the wide adoption of stain color augmentation and stain color normalization in the field of computational pathology, the effects on performance of these techniques have not been systematically evaluated. Existing literature focuses on particular applications, and does not quantify the relationship between these techniques and CNN performance (Ciompi et al. (2017); Bejnordi et al. (2017); Bándi et al. (2019); Veta et al. (2018)). In this study, we aim to overcome this limitation by comparing these techniques across four representative applications including multicenter data. We selected four patch-based classification tasks where a CNN was trained with data from a single center only, and evaluated in unseen data from multiple external pathology laboratories. We chose four relevant applications from the literature: (1) detecting the presence of mitotic figures in breast tissue (Tellez et al. (2018)); (2) detecting the presence of tumor metastases in breast lymph node tissue (Bándi et al. (2019)); (3) detecting the presence of epithelial cells in prostate tissue (Bulten et al. (2019)); and (4) distinguishing among 9 tissue classes in colorectal cancer (CRC) tissue (Ciompi et al. (2017)). All test datasets presented a substantial and challenging stain color deviation from the training set, as can be seen in Fig. 1. We trained a series of CNNs following an identical training protocol while varying the stain color normalization and stain color augmentation techniques used during training. This thorough evaluation allowed us to establish a ranking among the methods and measure relative performance improvements among them.

1.4 Contributions

Our contributions can be summarized as follows:

  • We systematically evaluated several well-known stain color augmentation and stain color normalization algorithms in order to quantify their effects on CNN classification performance.

  • We conducted the previous evaluation using data from a total of 9 different centers spanning 4 relevant classification tasks: mitosis detection, tumor metastasis detection in lymph nodes, prostate epithelium detection, and multiclass colorectal cancer tissue classification.

  • We formulated the problem of stain color normalization as an unsupervised image-to-image translation task and trained a neural network to solve it.

The paper is organized as follows. Sec. 2 and Sec. 3 describe the materials and methods thoroughly. Experimental results are explained in Sec. 4, followed by Sec. 5 and Sec. 6 where the discussion and final conclusion are stated.

2 Materials

We collected data from a variety of pathology laboratories for four different applications. In all cases, we used images from the Radboud University Medical Centre (Radboudumc or rumc) exclusively to train the models for each of the four classification tasks. Images from the remaining centers were used for testing purposes only. We considered RGB patches of 128x128 pixels extracted from annotated regions. Examples of these patches are shown in Fig. 1. The following sections describe each of the four classification tasks.

2.1 Mitotic figure detection

In this binary classification task, the goal was to accurately classify as positive samples those patches containing a mitotic figure in their center, i.e. a cell undergoing division. In order to train the classifier, we used 14 H&E WSIs from triple negative breast cancer patients, scanned at resolution, with annotations of mitotic figures obtained as described in (Tellez et al. (2018)). We split the slides into training (6), validation (4) and test (4), and extracted a total of 1M patches. We refer to this set as mitosis-rumc.

For the external dataset, we used publicly available data from the TUPAC Challenge (Veta et al. (2018)), i.e. 50 cases of invasive breast cancer with manual annotations of mitotic figures scanned at resolution. We extracted a total of 300K patches, and refer to this dataset as mitosis-tupac.

2.2 Tumor metastasis detection

The aim of this binary classification task was to identify patches containing metastatic tumor cells. We used publicly available WSIs from the Camelyon17 Challenge (Bándi et al. (2019)). This cohort consisted of 50 exhaustively annotated H&E slides of breast lymph node resections from breast cancer patients from 5 different centers (10 slides per center), including Radboudumc. They were scanned at resolution and the tumor metastases were manually delineated by experts.

We used the 10 WSIs from the Radboudumc to train the classifier, split into training (4), validation (3) and test (3), and extracted a total of 300K patches. We refer to this dataset as lymph-rumc. We used the remaining 40 WSIs as external test data, extracting a total of 1.2M patches, and assembling 4 different test sets (one for each center). We named them according to their center’s name acronym: lymph-umcu, lymph-cwh, lymph-rh and lymph-lpe.

Figure 2: Summary of the data augmentation techniques and datasets used in this study, organized in columns and rows respectively. Patches on the leftmost column depict the original input images and the rest of patches are augmented versions of them. Augmentations performed in the grayscale color space are depicted on the right for one sample dataset only. Basic augmentation is included in all cases.

2.3 Prostate epithelium detection

The goal of this binary classification task was to identify patches containing epithelial cells in prostate tissue. We trained the classifier with 25 H&E WSIs of prostate resections from the Radboudumc scanned at resolution, with annotations of epithelial tissue as described in (Bulten et al. (2019)). We split this cohort into training (13), validation (6) and test (6), and extracted a total of 250K patches. We refer to it as prostate-rumc.

We used two test datasets for this task. First, we selected 10 H&E slides of prostate resections from the Radboudumc with different staining and scanning conditions, resulting in substantially different stain appearance (see prostate-rumc2 in Fig. 1). This test set was manually annotated as described in (Bulten et al. (2019)) and named prostate-rumc2. We extracted 75K patches from these WSIs. Second, we used publicly available images from 20 H&E slides of prostatectomy specimens with manual annotations of epithelial tissue obtained as described in (Bulten et al. (2019); Gertych et al. (2015)). We extracted 65K patches from them and named the test set prostate-cedar.

2.4 Colorectal cancer tissue type classification

In this multiclass classification task, the goal was to distinguish among 9 different colorectal cancer (CRC) tissue classes, namely: 1) tumor, 2) stroma, 3) muscle, 4) lymphocytes, 5) healthy glands, 6) fat, 7) blood cells, 8) necrosis and debris, and 9) mucus. We used 54 H&E WSIs of colorectal carcinoma tissue from the Radboudumc scanned at resolution to train the classifier, with manual annotations of the 9 different tissue classes. We split this cohort into training (24), validation (15) and test (15), extracted a total of 450K patches, and named it crc-rumc.

We used two external datasets for this task. First, a set of 74 H&E WSIs from rectal carcinoma patients with annotations of the same 9 classes, as described in (Ciompi et al. (2017)). We extracted 35K patches and refer to this dataset as crc-labpon. Second, we used a publicly available set of H&E image patches from colorectal carcinoma patients (Kather et al. (2016)). Annotations for 6 tissue types were available: 1) tumor, 2) stroma, 3) lymph, 4) healthy glands, 5) fat, and 6) blood cells, debris and mucus. We extracted 4K patches in total, and refer to this dataset as crc-heidelberg.

2.5 Multi-organ dataset

For the purpose of training a network to solve the problem of stain color normalization, we created an auxiliary dataset by aggregating patches from mitosis-rumc, lymph-rumc, prostate-rumc and crc-rumc in a randomized and balanced manner. We discarded all labels since they were not needed for this purpose. We preserved a total of 500K patches for this set and called it the multi-organ dataset.

3 Methods

In this study, we evaluated the effect on classification performance of several methods for stain color augmentation and stain color normalization. This section describes these methods.

Figure 3: Visual comparison of the stain color normalization techniques used in this study. Rows correspond to the four tested techniques and columns to datasets, with green for lymph, blue for mitosis, yellow for prostate and red for colorectal.

3.1 Stain color augmentation

We assume a homogeneous stain color distribution $\phi_{train}$ for the training images and a more varied color distribution $\phi_{test}$ for the test images. Notice that it is challenging for a classification model trained solely with $\phi_{train}$ to generalize well to $\phi_{test}$ due to potential stain differences among sets. To solve this problem, stain color augmentation defines a preprocessing function $f$ that transforms images of the training set to present an alternative and more diverse color distribution $\phi_{augment}$:

$$f: \phi_{train} \rightarrow \phi_{augment} \qquad (1)$$

on the condition that:

$$\phi_{augment} \supseteq \phi_{train} \cup \phi_{test} \qquad (2)$$

In practice, heavy data augmentation is used to satisfy Eq. 2. In order to simplify our experimental setup, we grouped several data augmentation techniques into the following categories attending to the nature of the image transformations. Examples of the resulting augmented images are shown in Fig. 2.

Basic. This group included 90 degree rotations, and vertical and horizontal mirroring.

Morphology. We extended basic with several transformations that simulate morphological perturbations, i.e. alterations in shape, texture or size of the imaged tissue structures, including scanning artifacts. We included basic augmentation, scaling, elastic deformation (Simard et al. (2003)), additive Gaussian noise (perturbing the signal-to-noise ratio), and Gaussian blurring (simulating out-of-focus artifacts).

Brightness & contrast (BC). We extended morphology with random brightness and contrast image perturbations (Haeberli and Voorhies (1994)).

Hue-Saturation-Value (HSV). We extended the BC augmentation by randomly shifting the hue and saturation channels in the HSV color space (Van der Walt et al. (2014)). This transformation produced substantially different color distributions when applied to the training images. We tested two configurations depending on the observed color variation strength, called HSV-light and HSV-strong.

Hematoxylin-Eosin-DAB (HED). We extended the BC augmentation with a color variation routine specifically designed for H&E images (Tellez et al. (2018)). This method followed three steps. First, it disentangled the hematoxylin and eosin color channels by means of color deconvolution using a fixed matrix. Second, it perturbed the hematoxylin and eosin stains independently. Third, it transformed the resulting stains into regular RGB color space. We tested two configurations depending on the observed color variation strength, called HED-light and HED-strong.
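The three HED steps can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the stain matrix holds the Ruifrok and Johnston optical-density vectors (assumed here to play the role of the fixed matrix), each stain channel is perturbed as alpha * s + beta with alpha and beta drawn once per patch, and the names `augment_patch_hed` and `ratio` are hypothetical. Pixels are (r, g, b) tuples with values in [0, 1].

```python
import math
import random

# Optical-density stain vectors (rows: hematoxylin, eosin, DAB) from
# Ruifrok & Johnston; assumed here to play the role of the "fixed matrix".
RGB_FROM_HED = [
    [0.65, 0.70, 0.29],
    [0.07, 0.99, 0.11],
    [0.27, 0.57, 0.78],
]


def invert_3x3(m):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


HED_FROM_RGB = invert_3x3(RGB_FROM_HED)


def vec_mat(v, m):
    """Multiply a 3-vector (treated as a row) by a 3x3 matrix."""
    return [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]


def rgb_to_hed(rgb):
    """Step 1: color deconvolution of one RGB pixel into stain amounts."""
    od = [-math.log(max(c, 1e-6)) for c in rgb]  # optical density
    return vec_mat(od, HED_FROM_RGB)


def hed_to_rgb(hed):
    """Step 3: recombine stain amounts into an RGB pixel."""
    od = vec_mat(hed, RGB_FROM_HED)
    return tuple(min(1.0, max(0.0, math.exp(-o))) for o in od)


def augment_patch_hed(patch, ratio, rng):
    """Step 2 wrapped by steps 1 and 3, with one random draw per patch.

    The alpha * s + beta form and the symmetric `ratio` range are
    assumptions made for illustration.
    """
    alphas = [rng.uniform(1 - ratio, 1 + ratio) for _ in range(3)]
    betas = [rng.uniform(-ratio, ratio) for _ in range(3)]
    out = []
    for row in patch:
        new_row = []
        for pixel in row:
            hed = rgb_to_hed(pixel)
            hed = [a * s + b for a, s, b in zip(alphas, hed, betas)]
            new_row.append(hed_to_rgb(hed))
        out.append(new_row)
    return out
```

With `ratio=0.0` the routine reduces to a round trip through the stain space and returns the input colors up to floating-point error; larger values of `ratio` would correspond to stronger configurations. A practical pipeline would vectorize these per-pixel loops with an array library.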

During training, we selected the value of the augmentation hyper-parameters randomly within certain ranges to achieve stain variation. We tuned all ranges manually via visual examination. In particular, we used a scaling factor between , elastic deformation parameters and , additive Gaussian noise with , Gaussian blurring with , brightness intensity ratio between , and contrast intensity ratio between . For HSV-light and HSV-strong, we used hue and saturation intensity ratios between  and , respectively. For HED-light and HED-strong, we used intensity ratios between  and , respectively, for all HED channels.
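As an illustration of how the simpler categories compose, the sketch below implements basic augmentation (90 degree rotations and mirroring) and a brightness and contrast perturbation. It assumes the interpolation scheme usually attributed to Haeberli and Voorhies, where an image is blended with a degenerate image (black for brightness, mean gray for contrast); using the mean over all channels instead of a luminance-weighted mean is a simplification, and the function names and the sampling range in the usage note are hypothetical, since the ranges used in the study are not recoverable from the text.

```python
import random


def rotate90(patch):
    """Rotate a patch (list of rows of pixels) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def basic_augment(patch, rng):
    """Random multiple-of-90-degree rotation plus random mirroring."""
    for _ in range(rng.randrange(4)):
        patch = rotate90(patch)
    if rng.random() < 0.5:
        patch = patch[::-1]                    # vertical mirroring
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]   # horizontal mirroring
    return [list(row) for row in patch]


def blend(patch, degenerate_value, alpha):
    """Interpolate (alpha < 1) or extrapolate (alpha > 1) away from a
    constant image with value `degenerate_value`, clipping to [0, 1]."""
    return [
        [
            tuple(
                min(1.0, max(0.0, (1 - alpha) * degenerate_value + alpha * c))
                for c in pixel
            )
            for pixel in row
        ]
        for row in patch
    ]


def brightness_contrast(patch, brightness, contrast):
    """Apply a brightness ratio, then a contrast ratio (1.0 = unchanged)."""
    patch = blend(patch, 0.0, brightness)       # degenerate image: black
    values = [c for row in patch for pixel in row for c in pixel]
    mean = sum(values) / len(values)
    return blend(patch, mean, contrast)         # degenerate image: mean gray
```

During training, one would draw the ratios per patch, for example `brightness = rng.uniform(0.8, 1.2)` with a hypothetical range, and chain `basic_augment` with `brightness_contrast`.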

3.2 Stain color normalization

Figure 4: Network-based stain color normalization. From left to right: patches from the training set are transformed with heavy color augmentation and fed to a neural network. This network is trained to reconstruct the original appearance of the input images by removing color augmentation, effectively learning how to perform stain color normalization.

Stain color normalization reduces color variation by transforming the color distribution of training and test images, i.e. $\phi_{train}$ and $\phi_{test}$, to that of a template $\phi_{normal}$. It performs such transformation using a normalization function $g$ that maps any given color distribution to the template one:

$$g: \phi_{train} \cup \phi_{test} \rightarrow \phi_{normal} \qquad (3)$$

By matching $\phi_{train}$ and $\phi_{test}$, the problem of stain variance vanishes and the model is no longer required to generalize to unseen stains in order to perform well. We evaluated several methods that implement $g$ (see Fig. 3), and propose a novel technique based on neural networks.

Identity. We performed no transformation on the input patches, serving as a baseline method for the rest of techniques.

Grayscale. In this case, the normalization function transformed images from RGB to grayscale space, removing most of the color information present in the patches. We hypothesized that this color information is redundant since most of the signal in H&E images is present in morphological and structural patterns, e.g. the presence of a certain type of cell.

LUT-based. We implemented an approach that uses tissue morphology to perform stain color normalization (Bejnordi et al. (2016)). This popular method has been used by numerous researchers in recent public challenges (Bándi et al. (2019); Bejnordi et al. (2017)). It detects cell nuclei in order to precisely characterize the H&E chromatic distribution and density histogram for a given WSI. First, it does so for a given template WSI, e.g. an image from the training set, and a target WSI. Second, the color distributions of the template and target WSIs are matched, and the color correspondence is stored in a look-up table (LUT). Finally, this LUT is used to normalize the color of the target WSI.


Network-based. We developed a novel approach to perform stain color normalization based on unsupervised learning and neural networks, see Fig. 4. We parameterized the normalization function $g$ with a neural network $G$ and trained it end-to-end to remove the effects of data augmentation. Even though it is not possible to invert the many-to-many augmentation function $f$, we can learn a partial many-to-one function that maps any arbitrary color distribution $\phi_{augment}$ to a template distribution $\phi_{normal}$:

$$G: \phi_{augment} \rightarrow \phi_{normal} \qquad (4)$$

Since $G$ can normalize $\phi_{augment}$ (Eq. 4), and $\phi_{augment}$ is a superset of $\phi_{train}$ and $\phi_{test}$ (Eq. 2), we conclude that $G$ can effectively normalize $\phi_{train}$ and $\phi_{test}$ (Eq. 2).

We trained $G$ to perform image-to-image translation using the multi-organ dataset. During training, images were heavily augmented and fed to the network. The model was tasked with reconstructing the images with their original appearance, before augmentation. We used a special configuration of the HSV augmentation where we kept the color transformation only, i.e. did not include basic, morphology and BC. We used the maximum intensity for the transformation hyper-parameters, i.e. hue, saturation and value channel ratios between  . The strength of this augmentation resulted in images with drastically different color distributions, sometimes compressing all color information into grayscale. In order to invert this complex augmentation, we hypothesized that the network learned to associate certain tissue structures with their usual color appearance.
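The training signal described above needs no labels: each patch serves as its own target. The following sketch shows one way to build such (input, target) pairs with a heavy HSV perturbation and to score a reconstruction with the mean squared error. The additive offsets, the hue wrap-around, the clipping and the `strength` parameter are assumptions made for illustration; the actual ranges are not recoverable from the text, and the network itself is omitted.

```python
import colorsys
import random


def heavy_hsv_jitter(patch, rng, strength=1.0):
    """Apply one random hue/saturation/value offset to a whole patch.

    Pixels are (r, g, b) tuples in [0, 1]. A saturation clipped to zero
    turns the patch gray, which is the extreme case mentioned in the text.
    """
    dh, ds, dv = (rng.uniform(-strength, strength) for _ in range(3))
    out = []
    for row in patch:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            h = (h + dh) % 1.0                  # hue is circular
            s = min(1.0, max(0.0, s + ds))
            v = min(1.0, max(0.0, v + dv))
            new_row.append(colorsys.hsv_to_rgb(h, s, v))
        out.append(new_row)
    return out


def make_training_pair(patch, rng):
    """Return (network input, reconstruction target) for one patch."""
    return heavy_hsv_jitter(patch, rng), patch


def mse(a, b):
    """Mean squared error between two patches of identical shape."""
    diffs = [
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for px_a, px_b in zip(row_a, row_b)
        for x, y in zip(px_a, px_b)
    ]
    return sum(diffs) / len(diffs)
```

A training loop would repeatedly call `make_training_pair`, pass the first element through the network, and minimize `mse` between the network output and the second element.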

| Normalization | Augmentation | lymph-cwh | lymph-lpe | lymph-rh | lymph-umcu | mitosis-tupac | prostate-cedar | prostate-rumc2 | crc-heidelberg | crc-labpon | Ranking |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Network | HSV-light | 0.946(0.006) | 0.962(0.001) | 0.941(0.002) | 0.965(0.004) | 0.992(0.001) | 0.872(0.012) | 0.957(0.000) | 0.901(0.003) | 0.980(0.001) | 1.7(1.1) (*) |
| Network | HED-light | 0.949(0.004) | 0.968(0.001) | 0.942(0.002) | 0.963(0.003) | 0.989(0.003) | 0.862(0.010) | 0.958(0.000) | 0.907(0.002) | 0.980(0.001) | 3.0(1.9) |
| Identity | HSV-strong | 0.955(0.003) | 0.965(0.003) | 0.929(0.002) | 0.973(0.002) | 0.988(0.003) | 0.886(0.004) | 0.945(0.008) | 0.903(0.003) | 0.977(0.001) | 4.5(2.2) |
| Identity | HED-light | 0.952(0.003) | 0.976(0.001) | 0.946(0.008) | 0.968(0.003) | 0.996(0.001) | 0.879(0.010) | 0.957(0.001) | 0.895(0.002) | 0.973(0.001) | 4.8(1.9) |
| Identity | HED-strong | 0.950(0.004) | 0.959(0.004) | 0.936(0.006) | 0.957(0.006) | 0.992(0.002) | 0.872(0.005) | 0.945(0.003) | 0.921(0.004) | 0.967(0.002) | 5.5(2.2) |
| LUT | HED-strong | 0.934(0.005) | 0.941(0.005) | 0.925(0.005) | 0.963(0.004) | 0.989(0.002) | 0.871(0.005) | 0.945(0.002) | 0.946(0.001) | 0.956(0.002) | 5.6(2.8) |
| Network | HSV-strong | 0.953(0.002) | 0.964(0.003) | 0.946(0.002) | 0.964(0.005) | 0.991(0.002) | 0.852(0.006) | 0.951(0.002) | 0.894(0.004) | 0.975(0.002) | 5.7(3.1) |
| Network | HED-strong | 0.956(0.002) | 0.959(0.002) | 0.940(0.003) | 0.965(0.004) | 0.985(0.005) | 0.861(0.008) | 0.943(0.002) | 0.917(0.003) | 0.974(0.002) | 7.6(2.3) |
| LUT | HSV-strong | 0.923(0.008) | 0.939(0.003) | 0.928(0.005) | 0.947(0.007) | 0.987(0.002) | 0.862(0.006) | 0.949(0.003) | 0.940(0.002) | 0.962(0.002) | 10.0(2.4) |
| LUT | HED-light | 0.914(0.010) | 0.926(0.010) | 0.923(0.005) | 0.932(0.017) | 0.993(0.001) | 0.852(0.018) | 0.948(0.002) | 0.940(0.002) | 0.966(0.003) | 10.0(3.6) |
| Network | BC | 0.944(0.002) | 0.950(0.003) | 0.903(0.002) | 0.934(0.005) | 0.983(0.004) | 0.869(0.008) | 0.953(0.003) | 0.879(0.005) | 0.981(0.001) | 10.1(2.7) |
| Grayscale | BC | 0.956(0.003) | 0.962(0.003) | 0.935(0.005) | 0.961(0.002) | 0.989(0.002) | 0.851(0.002) | 0.939(0.003) | 0.885(0.003) | 0.972(0.000) | 12.2(2.5) |
| LUT | HSV-light | 0.894(0.005) | 0.936(0.005) | 0.921(0.003) | 0.942(0.006) | 0.987(0.001) | 0.860(0.009) | 0.951(0.002) | 0.946(0.002) | 0.971(0.002) | 12.5(1.5) |
| LUT | BC | 0.925(0.023) | 0.948(0.024) | 0.853(0.014) | 0.790(0.055) | 0.985(0.003) | 0.848(0.016) | 0.951(0.003) | 0.922(0.005) | 0.973(0.002) | 12.8(2.5) |
| Network | Morphology | 0.939(0.009) | 0.949(0.006) | 0.890(0.010) | 0.950(0.008) | 0.980(0.005) | 0.823(0.020) | 0.913(0.010) | 0.867(0.002) | 0.977(0.001) | 15.4(1.1) |
| Identity | HSV-light | 0.888(0.012) | 0.951(0.008) | 0.942(0.004) | 0.930(0.021) | 0.962(0.014) | 0.905(0.004) | 0.949(0.001) | 0.894(0.003) | 0.976(0.000) | 16.3(3.0) |
| Grayscale | Morphology | 0.943(0.009) | 0.820(0.019) | 0.922(0.004) | 0.941(0.010) | 0.991(0.005) | 0.816(0.004) | 0.910(0.008) | 0.812(0.008) | 0.929(0.005) | 16.4(0.7) |
| Network | Basic | 0.944(0.003) | 0.954(0.006) | 0.887(0.009) | 0.959(0.003) | 0.969(0.004) | 0.815(0.017) | 0.905(0.005) | 0.854(0.005) | 0.977(0.002) | 17.7(1.6) |
| Grayscale | Basic | 0.940(0.006) | 0.692(0.057) | 0.926(0.009) | 0.938(0.017) | 0.992(0.001) | 0.661(0.035) | 0.882(0.007) | 0.798(0.005) | 0.934(0.002) | 19.0(0.7) |
| LUT | Morphology | 0.898(0.006) | 0.920(0.006) | 0.801(0.019) | 0.874(0.022) | 0.969(0.007) | 0.803(0.007) | 0.895(0.012) | 0.903(0.005) | 0.939(0.006) | 19.2(0.8) |
| LUT | Basic | 0.908(0.009) | 0.894(0.027) | 0.809(0.020) | 0.772(0.064) | 0.951(0.008) | 0.741(0.016) | 0.906(0.010) | 0.888(0.011) | 0.930(0.012) | 21.0(0.0) |
| Identity | BC | 0.899(0.005) | 0.634(0.089) | 0.741(0.015) | 0.177(0.042) | 0.906(0.030) | 0.704(0.054) | 0.936(0.006) | 0.761(0.010) | 0.684(0.008) | 22.3(0.4) |
| Identity | Morphology | 0.811(0.024) | 0.671(0.089) | 0.673(0.024) | 0.214(0.156) | 0.986(0.006) | 0.602(0.020) | 0.374(0.171) | 0.724(0.009) | 0.569(0.025) | 23.1(0.9) |
| Identity | Basic | 0.811(0.008) | 0.563(0.277) | 0.790(0.042) | 0.406(0.336) | 0.965(0.008) | 0.624(0.047) | 0.631(0.159) | 0.705(0.026) | 0.556(0.051) | 23.6(0.5) |

(*) Significantly better than all the other methods (rows) except for the second entry (pairwise T-test).

Table 1: Experimental results ranking stain color augmentation and stain color normalization methods. Values correspond to AUC scores, except for the last column, averaged across 5 repetitions with standard deviation shown between parentheses. Each column represents a different external test dataset, with the last column Ranking indicating the position of each method within the global benchmark, computed as described in Sec. 4.1. Top-3 best results per column are highlighted in bold for viewing purposes.

We used an architecture inspired by U-Net (Ronneberger et al. (2015)), with a downward path of 5 layers of strided convolutions (Springenberg et al. (2014)) with 32, 64, 128, 256 and 512 3x3 filters, stride of 2, batch normalization (BN) (Ioffe and Szegedy (2015)) and leaky-ReLU activation (LRA) (Maas et al. (2013)). The upward path consisted of 5 upsampling layers, each one composed of a pair of nearest-neighbor upsampling and a convolutional operation (Odena et al. (2016)), with 256, 128, 64, 32 and 3 3x3 filters, BN and LRA; except for the final convolutional layer that did not have BN and used the hyperbolic tangent (tanh) as activation function. We used long skip connections to ease the synthesis upward path (Ronneberger et al. (2015)), and applied L2 regularization with a factor of .
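To make the encoder-decoder geometry concrete, the short sketch below traces the feature-map shape through the two paths for a 128x128 RGB patch. It assumes 'same' padding (so each stride-2 convolution halves the spatial size, rounding up) and upsampling by a factor of 2; neither is stated in the text. Skip connections are not modeled, so the channel counts are those of each layer's output only.

```python
DOWN_FILTERS = [32, 64, 128, 256, 512]   # strided convolutions, stride 2
UP_FILTERS = [256, 128, 64, 32, 3]       # upsampling followed by convolution


def trace_shapes(size=128, channels=3):
    """Return the (height, width, channels) shape after every layer."""
    shapes = [(size, size, channels)]
    for filters in DOWN_FILTERS:
        size = (size + 1) // 2           # stride 2 with 'same' padding
        shapes.append((size, size, filters))
    for filters in UP_FILTERS:
        size *= 2                        # nearest-neighbor upsampling x2
        shapes.append((size, size, filters))
    return shapes


for shape in trace_shapes():
    print(shape)
```

Under these assumptions the bottleneck is a 4x4 map with 512 channels, and the output has the same 128x128x3 shape as the input, as required for image-to-image translation.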

We minimized the mean squared error (MSE) loss using stochastic gradient descent with Adam optimization (Kingma and Ba (2014)) and 64-sample mini-batch, decreasing the learning rate by a factor of 10 starting from , every time the validation loss stopped improving for 4 consecutive epochs, until . Finally, we selected the weights corresponding to the model with the lowest validation loss during training.
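The learning rate schedule can be sketched as a small plateau detector. The initial and final learning rates are missing from the text, so `initial_lr` and `min_lr` are left as parameters; reading "until" as "stop training once the rate falls below the final value" is an assumption, as is the class name.

```python
class ReduceOnPlateau:
    """Divide the learning rate by `factor` after `patience` epochs
    without improvement of the validation loss."""

    def __init__(self, initial_lr, min_lr, factor=10.0, patience=4):
        self.lr = initial_lr
        self.min_lr = min_lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return False once training should stop."""
        if val_loss < self.best:
            self.best = val_loss          # new best model: keep its weights
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr >= self.min_lr
```

A training loop would call `step` once per epoch with the validation loss, save the weights whenever the loss improves, and stop when `step` returns `False`.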

3.3 CNN Classifiers

In order to measure the effect of stain color augmentation and stain color normalization, we trained a series of identical CNNs to perform patch classification using different combinations of these techniques. For training and validation purposes, we used the rumc datasets described in Sec. 2.

The architecture of such CNN classifiers consisted of 9 layers of strided convolutions with 32, 64, 64, 128, 128, 256, 256, 512 and 512 3x3 filters, stride of 2 in the even layers, BN and LRA; followed by global average pooling; 50% dropout; a dense layer with 512 units, BN and LRA; and a linear dense layer with either 2 or 9 units depending on the classification task, followed by a softmax. We applied L2 regularization with a factor of .

We minimized the cross-entropy loss using stochastic gradient descent with Adam optimization and 64-sample class-balanced mini-batch, decreasing the learning rate by a factor of 10 starting from , every time the validation loss stopped improving for 4 consecutive epochs, until . Finally, we selected the weights corresponding to the model with the lowest validation loss during training.
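The text does not specify how class balance was enforced within a mini-batch. One simple possibility, shown here purely as an assumption, is to cycle through the classes and draw a random sample index for each slot, so that class counts in a batch differ by at most one. The function name and the `indices_by_class` structure are hypothetical.

```python
import itertools
import random


def class_balanced_batch(indices_by_class, batch_size, rng):
    """Draw (sample_index, label) pairs with near-equal class counts.

    indices_by_class maps each label to the list of sample indices that
    carry it; sampling within a class is with replacement.
    """
    classes = list(indices_by_class)
    rng.shuffle(classes)   # randomize which classes receive the extra slots
    batch = []
    for label in itertools.islice(itertools.cycle(classes), batch_size):
        batch.append((rng.choice(indices_by_class[label]), label))
    return batch
```

For a binary task with a 64-sample batch this yields exactly 32 samples per class regardless of how imbalanced the underlying dataset is; for the 9-class task, each class receives 7 or 8 slots.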

4 Experimental results

We conducted a series of experiments in order to quantify the impact on performance of the different stain color augmentation and stain color normalization methods introduced in the previous section across four different classification tasks. We trained a CNN classifier for each combination of organ, color normalization and data augmentation method under consideration. In the case of grayscale normalization, we only tested basic, morphology and BC augmentation techniques. We conducted 96 different experiments, repeating each 5 times using different random initialization for the network parameters, accounting for a total of 480 trained CNNs.
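The experiment count follows from the design: 4 normalization methods times 7 augmentation methods, minus the 4 color augmentations that are not applicable to grayscale input, gives 24 combinations per task. The enumeration below reproduces this arithmetic; the identifier strings are shorthand chosen for this sketch.

```python
import itertools

ORGANS = ["lymph", "mitosis", "prostate", "crc"]
NORMALIZATIONS = ["identity", "grayscale", "lut", "network"]
AUGMENTATIONS = [
    "basic", "morphology", "bc",
    "hsv-light", "hsv-strong", "hed-light", "hed-strong",
]
# Color augmentation is skipped for grayscale input.
GRAYSCALE_AUGMENTATIONS = {"basic", "morphology", "bc"}
REPETITIONS = 5


def enumerate_experiments():
    """List every (organ, normalization, augmentation) combination tested."""
    return [
        (organ, norm, aug)
        for organ, norm, aug in itertools.product(
            ORGANS, NORMALIZATIONS, AUGMENTATIONS
        )
        if norm != "grayscale" or aug in GRAYSCALE_AUGMENTATIONS
    ]


experiments = enumerate_experiments()
print(len(experiments))                # 96 experiments
print(len(experiments) * REPETITIONS)  # 480 trained CNNs
```

The 24 combinations per task also match the 24 rows of Tab. 1.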

4.1 Evaluation

We evaluated the area under the receiver-operating characteristic curve (AUC) of each CNN in each external test set. In the case of multiclass classification, we considered the weighted average, i.e. we calculated the individual AUC per label and averaged the resulting values weighted by the number of true instances for each label. We reported the mean and standard deviation of the resulting AUC for each experiment across repetitions in Tab. 1.
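The weighted multiclass AUC can be written down directly from this description. The sketch below computes each one-vs-rest AUC as the probability that a positive sample is scored above a negative one (ties count one half), then averages the per-label values weighted by label support. Skipping labels that are absent from the test set, or that cover all of it, is an assumption; the quadratic pairwise loop is chosen for clarity, not speed.

```python
def binary_auc(scores, positives):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    if not pos or not neg:
        return None                      # AUC is undefined for one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_auc(probabilities, labels):
    """Support-weighted average of one-vs-rest AUCs.

    probabilities[i][k] is the predicted probability of class k for
    sample i; labels[i] is the true class index of sample i.
    """
    total, weight_sum = 0.0, 0
    for k in range(len(probabilities[0])):
        is_k = [y == k for y in labels]
        auc = binary_auc([p[k] for p in probabilities], is_k)
        if auc is None:
            continue
        support = sum(is_k)
        total += support * auc
        weight_sum += support
    return total / weight_sum
```

For the binary tasks the same `binary_auc` function applies directly to the positive-class probability.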


In order to establish a ranking among methods, shown in the rightmost column in Tab. 1, we performed the following computation. We aggregated the results into a single score per organ by ranking each dataset column individually, and performed an average across test sets within the same organ. Note that we averaged ranking scores, rather than AUC scores, following (Demšar (2006)). Once we had computed a score per organ for each method, we repeated the procedure by ranking each organ column individually, which we denoted as organ ranking. At this point, we could observe method performance in each organ. In order to calculate a global ranking among the methodologies under consideration, we computed the average of two metrics for each experiment. The first metric was the ranking of the simple average of organ ranking across columns (organs). This score assessed the overall performance across all organs. For the second metric, we took the worst performance across organ ranking instead of the average. We did so to penalize methods that performed poorly in a particular organ. Finally, the average of these two metrics constituted the score used for the global ranking, reported in the Ranking column of Tab. 1. We measured statistical significance of the ranking results by performing a pairwise T-test (Jones et al. (2014)) between the methods’ ranks in 5 repetitions.
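The ranking computation for a single repetition can be sketched as follows. Two details are assumptions, since the text does not settle them: tied values share their average rank, and the second metric is the rank of each method's worst organ ranking. The data structures and function names are hypothetical.

```python
def rank(values, higher_is_better=False):
    """Return 1-based ranks; tied values share the average rank."""
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=higher_is_better
    )
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        average = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = average
        pos = end + 1
    return ranks


def global_scores(auc_by_dataset, datasets_by_organ):
    """Global ranking score per method (lower is better).

    auc_by_dataset maps a test set name to a list of AUCs, one per method,
    in a fixed method order; datasets_by_organ groups test sets by organ.
    """
    n = len(next(iter(auc_by_dataset.values())))
    organ_rankings = []
    for datasets in datasets_by_organ.values():
        per_dataset = [
            rank(auc_by_dataset[d], higher_is_better=True) for d in datasets
        ]
        organ_score = [
            sum(r[m] for r in per_dataset) / len(per_dataset) for m in range(n)
        ]
        organ_rankings.append(rank(organ_score))
    mean_metric = rank(
        [sum(r[m] for r in organ_rankings) / len(organ_rankings)
         for m in range(n)]
    )
    worst_metric = rank([max(r[m] for r in organ_rankings) for m in range(n)])
    return [(a + b) / 2 for a, b in zip(mean_metric, worst_metric)]
```

Running this once per repetition and averaging the resulting scores would give values of the kind reported in the Ranking column, together with their standard deviation.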

4.2 Effects of stain color augmentation

Results in Tab. 1 show that stain color augmentation was crucial to obtain top classification performance, regardless of the stain color normalization technique used (see top-10 methods). Moreover, notice that including color augmentation, either HSV or HED, was key to obtaining top performance since using BC augmentation alone produced poor results (best method with BC achieved a rank of 10.1 only). We did not find, however, any substantial performance difference between using HED or HSV color augmentation. Similarly, we found that strong and light color augmentations achieved similar performance. Regarding non-color augmentation, i.e. basic, morphology and BC, BC obtained the best results across all stain color normalization setups, followed by morphology and basic augmentation, as was expected.

4.3 Effects of stain color normalization

According to the results in Tab. 1, top performance was achieved when using our proposed network-based color normalization. Remarkably, we found that this difference was significant when compared to the runner-up identity method (pairwise T-test , rank 1.7 and 4.5 respectively). When compared to other normalization methods, network-based normalization (best rank 1.7) proved to be superior to the more traditional LUT-based normalization (best rank 5.6) and the simpler grayscale transformation (best rank 12.2). Moreover, the LUT-based normalization performed similarly to the identity transformation regardless of the data augmentation used. Finally, our results showed that stain color normalization alone, i.e. without stain color augmentation, was insufficient to achieve top classification performance (see Tab. 1).

5 Discussion

Our experimental results indicate that stain color augmentation improved classification performance drastically by increasing the CNN’s ability to generalize to unseen stain variations. This was true for most of the experiments regardless of the type of stain color normalization technique used. Moreover, we found HSV and HED color transformations to be the key ingredients to improve performance, since removing them, i.e. using BC augmentation, yielded a lower AUC under all circumstances. Remarkably, we observed hardly any performance difference between HSV and HED, or between strong and light variation intensity. We found an exception to this trend in experiments using HSV-light augmentation with identity and LUT-based normalization, which obtained mediocre performance. Based on these observations, we concluded that CNNs are mostly insensitive to the type and intensity of the color augmentation used in this setup, as long as one of the methods is used. However, CNNs trained with simpler stain color normalization techniques exhibited more sensitivity to the intensity of color augmentation, i.e. they required a stronger augmentation in order to perform well. Finally, the fact that experiments with grayscale images achieved poor performance (best rank 12.2) was an indication that color provided useful information to the model. Without proper data augmentation, however, this signal could act as distracting noise, increasing overfitting and the generalization error due to stain variation.

Regarding stain color normalization, we found that our proposed network-based normalization method outperformed all the other methods analyzed in this study. Furthermore, it achieved the best overall performance when combined with HSV-light augmentation (rank 1.7). These results corroborate our idea of reformulating the problem of stain color normalization as an unsupervised image-to-image translation task and solving it using a neural network trained end-to-end. This network was specifically optimized to reduce stain variation from heavily color-augmented images. Therefore, it was better suited to deal with difficult cases, e.g. patches presenting imaging artifacts or extreme stain variations, than other normalization methods not trained end-to-end with such a strong color variation. Despite this remarkable result, we observed that all stain color normalization techniques obtained poor performance when no color augmentation was used. It seemed that even in the case of excellent stain normalization, color information could serve as a source of overfitting (network-based combined with BC obtained a rank), worsening with suboptimal normalization. Furthermore, the small relative improvement in terms of AUC between the best method (network-based and HSV-light) and the third method (identity and HSV-strong) cast doubt on whether normalizing training and test images was worth the extra computational burden. We concluded that using the stain color normalization methods evaluated in this paper without proper stain color augmentation is insufficient to reduce the generalization error caused by stain variation, and results in poor model performance.

Due to computational constraints, we limited the type and number of experiments performed in this study to patch-based classification tasks, ignoring other modalities such as segmentation, instance detection or WSI classification. However, we believe this limitation has little impact on the conclusions of this study, since the problem of generalization error has identical causes and effects in other modalities. In order to reduce the number of experiments, we avoided quantifying the impact of individual augmentation techniques, e.g. scaling augmentation alone, and grouped them into categories instead. Similarly, we limited the hyper-parameters’ ranges to a certain set of values, e.g. light or strong stain augmentation intensity. Nevertheless, according to the experimental results, we believe that testing a wider range of hyper-parameter values would not alter the main conclusions of this study.

6 Conclusion

For the first time, we quantified the effect of stain color augmentation and stain color normalization on classification performance across four relevant computational pathology applications using data from 9 different centers. We found that some type of stain color augmentation, i.e. HSV or HED transformation, should always be used. In addition, color augmentation can be combined with network-based stain color normalization to achieve the highest classification performance observed in our experiments. In setups with reduced computational resources, color normalization could be omitted, resulting in a slight performance reduction. Finally, we recommend setting the intensity of the color augmentation to light when color normalization is enabled and to strong when it is disabled.


This study was supported by a Junior Researcher grant from the Radboud Institute of Health Sciences (RIHS), Nijmegen, The Netherlands; a grant from the Dutch Cancer Society (KUN 2015-7970); and another grant from the Dutch Cancer Society and the Alpe d’HuZes fund (KUN 2014-7032). The authors would like to thank Dr. Babak Ehteshami Bejnordi for providing the code for the LUT-based stain color normalization algorithm; the developers of Keras (Chollet et al. (2015)), the open source tool that we used to run our deep learning experiments; and Nvidia Corporation for donating GPUs to support our experiments.


  • Bándi et al. (2019) Bándi, P., Geessink, O., Manson, Q., van Dijk, M., Balkenhol, M., Hermsen, M., Bejnordi, B.E., Lee, B., Paeng, K., Zhong, A., et al., 2019. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE Transactions on Medical Imaging 38, 550–560.
  • Bejnordi et al. (2016) Bejnordi, B.E., Litjens, G., Timofeeva, N., Otte-Höller, I., Homeyer, A., Karssemeijer, N., van der Laak, J.A., 2016. Stain specific standardization of whole-slide histopathological images. IEEE Transactions on Medical Imaging 35, 404–415.
  • Bejnordi et al. (2017) Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al., 2017. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210.
  • Bug et al. (2017) Bug, D., Schneider, S., Grote, A., Oswald, E., Feuerhake, F., Schüler, J., Merhof, D., 2017. Context-based normalization of histological stains using deep convolutional features, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 135–142.
  • Bulten et al. (2019) Bulten, W., Bándi, P., Hoven, J., van de Loo, R., Lotz, J., Weiss, N., van der Laak, J., van Ginneken, B., Hulsbergen-van de Kaa, C., Litjens, G., 2019. Epithelium segmentation using deep learning in H&E-stained prostate specimens with immunohistochemistry as reference standard. Scientific Reports 9, 864.
  • Cho et al. (2017) Cho, H., Lim, S., Choi, G., Min, H., 2017. Neural stain-style transfer learning using GAN for histopathological images, in: Asian Conference on Machine Learning.
  • Chollet et al. (2015) Chollet, F., et al., 2015. Keras. https://keras.io.
  • Ciompi et al. (2017) Ciompi, F., Geessink, O., Bejnordi, B.E., de Souza, G.S., Baidoshvili, A., Litjens, G., van Ginneken, B., Nagtegaal, I., van der Laak, J., 2017. The importance of stain normalization in colorectal tissue classification with convolutional networks, in: Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on, IEEE. pp. 160–163.
  • Demšar (2006) Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30.
  • Gertych et al. (2015) Gertych, A., Ing, N., Ma, Z., Fuchs, T.J., Salman, S., Mohanty, S., Bhele, S., Velásquez-Vacca, A., Amin, M.B., Knudsen, B.S., 2015. Machine learning approaches to analyze histological images of tissues from radical prostatectomies. Computerized Medical Imaging and Graphics 46, 197–208.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
  • Goodfellow et al. (2016) Goodfellow, I., et al., 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
  • Haeberli and Voorhies (1994) Haeberli, P., Voorhies, D., 1994. Image processing by linear interpolation and extrapolation. IRIS Universe Magazine 28, 8–9.
  • Ioffe and Szegedy (2015) Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, pp. 448–456.
  • Jones et al. (2014) Jones, E., Oliphant, T., Peterson, P., 2014. SciPy: open source scientific tools for Python.
  • Kather et al. (2016) Kather, J.N., Weis, C.A., Bianconi, F., Melchers, S.M., Schad, L.R., Gaiser, T., Marx, A., Zöllner, F.G., 2016. Multi-class texture analysis in colorectal cancer histology. Scientific Reports 6, 27988.
  • Khan et al. (2014) Khan, A.M., et al., 2014. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Transactions on Biomedical Engineering 61, 1729–1738.
  • Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization, in: International Conference on Learning Representations.
  • Kingma and Welling (2013) Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes, in: International Conference on Learning Representations.
  • Liu et al. (2017) Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G.E., Kohlberger, T., Boyko, A., Venugopalan, S., Timofeev, A., Nelson, P.Q., Corrado, G.S., et al., 2017. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442.
  • Maas et al. (2013) Maas, A.L., Hannun, A.Y., Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models, in: International Conference on Machine Learning.
  • Macenko et al. (2009) Macenko, M., et al., 2009. A method for normalizing histology slides for quantitative analysis, in: Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE International Symposium on, IEEE. pp. 1107–1110.
  • Odena et al. (2016) Odena, A., Dumoulin, V., Olah, C., 2016. Deconvolution and checkerboard artifacts. Distill 1.
  • Reinhard et al. (2001) Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P., 2001. Color transfer between images. IEEE Computer Graphics and Applications 21, 34–41.
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234–241.
  • Simard et al. (2003) Simard, P.Y., Steinkraus, D., Platt, J.C., et al., 2003. Best practices for convolutional neural networks applied to visual document analysis., in: International Conference on Document Analysis and Recognition, IEEE. pp. 958–962.
  • Sirinukunwattana et al. (2017) Sirinukunwattana, K., et al., 2017. Gland segmentation in colon histology images: The glas challenge contest. Medical image analysis 35, 489–502.
  • Springenberg et al. (2014) Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M., 2014. Striving for simplicity: The all convolutional net, in: International Conference on Learning Representations.
  • Tellez et al. (2018) Tellez, D., Balkenhol, M., Otte-Höller, I., van de Loo, R., Vogels, R., Bult, P., Wauters, C., Vreuls, W., Mol, S., Karssemeijer, N., et al., 2018. Whole-Slide Mitosis Detection in H&E Breast Histology Using PHH3 as a Reference to Train Distilled Stain-Invariant Convolutional Networks. IEEE Transactions on Medical Imaging 37, 2126–2136.
  • Veta et al. (2018) Veta, M., Heng, Y.J., Stathonikos, N., Bejnordi, B.E., Beca, F., Wollmann, T., Rohr, K., Shah, M.A., Wang, D., Rousson, M., et al., 2018. Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge. arXiv preprint arXiv:1807.08284.
  • Van der Walt et al. (2014) Van der Walt, S., Schönberger, J.L., Nunez-Iglesias, J., Boulogne, F., Warner, J.D., Yager, N., Gouillart, E., Yu, T., 2014. Scikit-image: image processing in python. PeerJ 2.
  • Zanjani et al. (2018) Zanjani, F.G., Zinger, S., Bejnordi, B.E., van der Laak, J.A., de With, P.H., 2018. Stain normalization of histopathology images using generative adversarial networks, in: International Symposium on Biomedical Imaging, IEEE. pp. 573–577.