Weakly supervised training of pixel resolution segmentation models on whole slide images

05/30/2019 ∙ by Nicolas Pinchaud, et al. ∙ ContextVision AB

We present a novel approach to train pixel resolution segmentation models on whole slide images in a weakly supervised setup. The model is trained to classify patches extracted from slides, which means that training is performed on noisily labeled data. We address this problem with two complementary strategies. First, the patches are sampled online using the model's knowledge, by focusing on regions where the model's confidence is higher. Second, we propose an extension of the KL divergence that is robust to noisy labels. Our preliminary experiments on the CAMELYON 16 data set show promising results. The model can successfully segment tumor areas with strong morphological consistency.




1 Introduction

Pathologists generally base their diagnosis on the presence of localized morphological bio-markers in histopathological images. Training deep neural networks on Whole Slide Images (WSIs) for automatic disease detection and localization is the main approach for the development of decision support tools. However, it requires large annotated datasets that are difficult to construct because annotations at the pixel resolution are often received through a costly, noisy, and low bit-rate channel. Weakly supervised approaches aim to tackle the problem of data acquisition by relying on supervision available at the WSI level. For the cancer detection task, the WSIs are labeled as benign or malignant depending on the cancer presence. This information is cheaper to acquire, more robust, and allows the construction of larger datasets. A model is trained using this high level labeling to retrieve localized, pixel level, disease information.

We propose to use a segmentation model that produces pixel level unnormalized log evidences (or logits) that represent localized cancer presence probabilities in WSIs. We form a WSI level cancer presence probability by aggregating the pixel level logits. We can train the model to maximize the mutual information between this probability and the WSI's label. Our assumption is that this serves as a proxy to maximize the pixel level mutual information between the pixel wise logits and the (unknown) cancer localization.

However, this training scheme requires the segmentation model to work on all the pixels of a WSI at a time. Since a WSI is typically at the gigapixel scale, this approach is intractable. To tackle this problem, we propose to train the model on patches randomly sampled from the WSIs. To form training samples, a patch is given a label that is inherited from the slide it is sampled from. The drawback is that patches can be incorrectly labeled. For instance, the cancerous region of a malignant WSI can represent only a small fraction of the tissue. Therefore, there is a high chance of sampling patches from benign regions, which will be incorrectly labeled as malignant. This produces noisy supervision that would prevent the model from converging. We propose to address this in two ways. First, we propose an extension of the Kullback–Leibler (KL) divergence that is robust to noisy labels. Second, we sample patches dynamically during training. The sampling is made using a distribution defined by the pixel wise cancer probability map given by the model. Patches are sampled more frequently from regions where the model gives a higher cancer probability. At the beginning of training, the sampling distribution is uniform over the whole slide. As the model converges, the patches are more often sampled from regions containing cancer, thus reducing labeling noise and further improving convergence.

We demonstrate our approach on the CAMELYON 16 dataset [1] and show that the model is able to segment tumors at pixel resolution. Moreover, the segmentations are strongly correlated with the morphological structures of the tissues. We evaluated the ability of the model to classify slides as malign or benign. The resulting ROC AUC score is promising given that only one set of hyper-parameters has been tested.

2 Related work

Image segmentation at pixel level under weak supervision has been studied in [2, 3, 4]. These approaches work under the Multiple Instance Learning (MIL) framework. The images are seen as bags of pixels or patches. The elements of a bag have an underlying label that is not available. However, a label at the bag level is available and relates to the labels of its elements. A negative bag contains only negative elements, while a positive bag contains at least one positive element. A convolutional neural network (CNN) is used to produce pixel (patch) level features that are aggregated through a pooling function to form the bag level class prediction. However, these approaches need the image to fit into memory and do not scale to WSIs, which contain gigapixels.

Weakly supervised approaches on WSIs divide the slide into a grid of tiles [5, 6, 7]. Each tile is processed through a CNN to produce a score that allows the tiles to be ranked. The top/bottom tiles are then used to train a WSI classifier. The gradients are propagated into the scoring CNN up to some depth, which allows the scores to be learned. The scores of the tiles are then interpreted as disease localization confidences. These approaches have several constraints. First, they produce disease location at tile resolution and not at pixel resolution. Second, they are limited in their ability to train complex models. In [6], the tiles are, prior to the training, mapped into a fixed feature space using a deep neural network trained on ImageNet [8]. A simpler parameterized function is trained over this feature space to produce the scores, which limits the representation power of the learned model. In [5], the scoring CNN is applied to all the tiles at the beginning of each epoch, and the training is then made only on the top tile of each slide. This leads to slow convergence that requires a larger dataset.

In [9], the authors propose to scale the training of a CNN on WSIs using the streaming stochastic gradient descent method [10]. This approach does not allow training segmentation model architectures because it requires the upper feature map layers to fit in memory, which prevents the usage of up-convolution layers. The authors propose to extract a saliency map using the gradients at the input pixels.

In [11], the authors propose the use of a deep recurrent attention model that classifies a WSI using information provided by a limited number of patches (or glimpses) selected with a visual attention method. Disease localization is inferred from the locations of the glimpses.

Our approach makes it possible to train pixel resolution disease segmentation models on WSIs without constraining the complexity of the learned CNN. Moreover, we do not train on pre-tiled slides. Instead, we learn on patches dynamically extracted during training according to the current knowledge state of the model.

3 Model

Our model is built on a segmentation model represented by a parameterized function, which can be an instance of any state-of-the-art segmentation model such as U-net [12] or DeepLab [13]. For any given multi-channel image, the model outputs the un-normalized log probabilities (logits) of the benign and malign classes for each pixel. We get the class probabilities of a pixel by applying the softmax function to its logits.
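As an illustration of the per-pixel softmax, here is a minimal sketch in plain Python. The function name `pixel_softmax` and the representation of the logits as a nested list of `(benign, malign)` pairs are assumptions made for this example; a real implementation would operate on the output tensor of the segmentation network.

```python
import math


def pixel_softmax(logits):
    """Convert an H x W grid of (benign, malign) logit pairs into probabilities.

    `logits` is a list of rows; each row is a list of (benign, malign) tuples.
    The layout is an assumption made for this sketch.
    """
    probs = []
    for row in logits:
        out_row = []
        for benign_logit, malign_logit in row:
            # Subtract the maximum for numerical stability.
            m = max(benign_logit, malign_logit)
            e_benign = math.exp(benign_logit - m)
            e_malign = math.exp(malign_logit - m)
            total = e_benign + e_malign
            out_row.append((e_benign / total, e_malign / total))
        probs.append(out_row)
    return probs
```

Each output pair sums to one, and equal logits give equal probabilities for the two classes.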

In a weakly supervised setup the pixel level annotation is missing. Instead, the supervision is available at the WSI level. A WSI can either be benign or malign. A benign WSI has all its pixels belonging to the benign class, while a malign WSI has at least one malignant pixel. Hence we can translate the segmentation problem into a constrained WSI binary classification problem. The model outputs the WSI's class by computing slide level benign and malign logits from the pixel level logits.

A WSI is malignant if at least one of its pixels is malignant. We can translate that relationship by taking the maximum of the pixel level malign logits as the slide level malign logit.

However, such a strategy results in slow and noisy learning because the gradients would flow through only one pixel at a time. Therefore, we generalize the definition to a top-K expression, in which the slide level malign logit is computed from the K highest pixel level malign logits, K being set by a percentile of the pixels. For a low percentile we retrieve the max definition, and at the other extreme we get an average over all the pixels. As for the pixel probabilities, we compute the WSI's class probabilities using the softmax function, which gives the probabilities of the WSI being benign and malignant.
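The top-K aggregation described above can be sketched as follows. This is a minimal illustration, assuming that the slide level malign logit is the plain average of the K largest pixel level malign logits; the function name `slide_malign_logit` and the parameter `k_fraction` (the fraction of pixels kept) are hypothetical names chosen for this example.

```python
import math


def slide_malign_logit(pixel_malign_logits, k_fraction):
    """Average the top-K pixel malign logits.

    K = ceil(k_fraction * number_of_pixels), with at least one pixel kept.
    A small k_fraction approaches the max, and k_fraction = 1.0 gives the
    average over all pixels.
    """
    if not 0.0 < k_fraction <= 1.0:
        raise ValueError("k_fraction must be in (0, 1]")
    ordered = sorted(pixel_malign_logits, reverse=True)
    k = max(1, math.ceil(k_fraction * len(ordered)))
    return sum(ordered[:k]) / k


logits = [0.0, 4.0, 2.0, -2.0]
max_like = slide_malign_logit(logits, 0.01)   # only the largest logit is kept
mean_like = slide_malign_logit(logits, 1.0)   # all logits are averaged
```

With K greater than one, the gradient of the slide level logit is spread over K pixels instead of a single one, which is the motivation given in the text.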

If the dimensions of the WSI are not too large, the model can be trained using the cross-entropy loss function on the WSI level probabilities. However, in general a WSI is large and this training is intractable.

3.1 Training on patches using dynamic sampling

Instead of training on the full WSI, which is intractable, we propose to train on dynamically extracted patches. The model is trained to classify the patches instead of the WSIs. The patches inherit the class of the slide they are extracted from. In that setup, a patch extracted from a benign slide has all its pixels benign. However, a patch extracted from a malign slide has a high chance of containing no malign pixels. This breaks the assumption that at least one pixel within a malign-labeled input has to be malign, and makes the patch level supervision noisy. We propose to tackle this problem by using an extended version of the KL divergence loss that is robust to noise.

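The loss compares the true distribution with the model's distribution and is parameterized by a vector with one component per class. It extends by continuity to the limiting values of its parameters, and in particular it recovers the KL divergence as a special case. Under a condition on the parameter vector, the loss is zero if and only if the two distributions are equal. The vector controls the shape of the loss for the different classes.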
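To make the idea of a KL extension with one shape parameter per class concrete, here is a minimal sketch. It assumes a Box-Cox-style power transform in place of the logarithm, similar in spirit to the generalized cross-entropy used for noisy labels. This specific form, the function name `robust_divergence`, and the restriction of each parameter to the range [0, 1) are assumptions made for illustration and may differ from the exact expression used by the authors.

```python
import math


def robust_divergence(p, q, alpha):
    """Assumed form: sum_c p_c * (1 - (q_c / p_c) ** alpha_c) / alpha_c.

    p     : true distribution (list of probabilities)
    q     : model distribution (strictly positive, e.g. a softmax output)
    alpha : one shape parameter per class, each in [0, 1)

    A term with alpha_c == 0 uses its limit, the KL term p_c * log(p_c / q_c).
    """
    total = 0.0
    for p_c, q_c, a_c in zip(p, q, alpha):
        if p_c == 0.0:
            # Terms with zero true probability vanish for alpha_c in [0, 1).
            continue
        ratio = q_c / p_c
        if a_c == 0.0:
            total += -p_c * math.log(ratio)
        else:
            total += p_c * (1.0 - ratio ** a_c) / a_c
    return total


label = [0.0, 1.0]        # patch labeled malign
prediction = [0.9, 0.1]   # model says mostly benign
kl_loss = robust_divergence(label, prediction, [0.0, 0.0])
softer_loss = robust_divergence(label, prediction, [0.0, 0.5])
```

Under this assumed form, a one-hot label with a positive parameter gives a loss bounded by the inverse of that parameter, so a confidently contradicted label is penalized less than with the KL divergence. The parameter of each class can be chosen independently, which makes the penalty asymmetric between classes.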

3.1.1 Tuning under noisy supervision

We call noise the probability that a patch from a malign slide does not contain any malign pixel. These patches are indistinguishable from patches extracted from benign slides and will confuse the model. Since they contain no malign pixel, they should be classified as benign. We can mitigate the noise induced by this label inconsistency by adjusting the parameter vector of the loss.

Let us suppose we have a dataset made of benign and malign slides, and that we can extract one-pixel patches from a slide. We consider a benign pixel, which can be extracted from a benign slide and, because of the noise, from a malign slide as well. Let us define a trivial model that outputs a benign and a malign probability for this pixel. Ideally, we would like the model to classify the benign pixel as benign with probability one.

Optimizing this model with the proposed loss gives a gradient over the model's parameter that depends on the noise and on the ratio of benign slides.

In the case of the KL divergence, under a balanced dataset, the minimum is reached at a point that depends on the noise rather than at the ideal solution.

We can mitigate the noise and bring the optimum closer to the ideal solution by tuning the parameter of the loss. The minimum is obtained where the gradient vanishes, which determines the parameter value to use.
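For the KL case, the effect of the noise on the trivial model can be worked out by hand: the cross-entropy optimum for the benign pixel is the fraction of its occurrences that carry a malign label. The sketch below assumes the setup described in this section (every patch of a benign slide is the benign pixel, and a patch of a malign slide is the benign pixel with a probability equal to the noise). The names `noise` and `benign_ratio` are illustrative and not the authors' notation.

```python
def kl_optimal_malign_prob(noise, benign_ratio=0.5):
    """Malign probability that minimizes the expected cross-entropy
    on the benign pixel, under the assumed one-pixel-patch setup.

    noise        : probability that a patch from a malign slide is benign
    benign_ratio : ratio of benign slides in the dataset
    """
    labeled_malign = (1.0 - benign_ratio) * noise  # benign pixel, malign label
    labeled_benign = benign_ratio                  # benign pixel, benign label
    return labeled_malign / (labeled_benign + labeled_malign)


# With a balanced dataset and half of the malign patches containing no tumor,
# the KL optimum assigns a malign probability of one third to a benign pixel.
example = kl_optimal_malign_prob(noise=0.5)
```

The optimum is zero only when the noise is zero, which illustrates why the plain KL divergence cannot reach the ideal solution under noisy patch labels.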

Given the noise level and the class balance, the parameter can therefore be set to reach a desired optimum. Figure 1 shows how the parameter shapes the loss for two Bernoulli distributions. The false positives are less penalized than the false negatives. This affects the gradient by reducing the slope of the loss for the misclassified examples. Note that our approach differs from scaling the gradients differently for each class. With that approach, called balanced cross entropy in [14], the gradient ratio between a well classified and a misclassified example remains constant, whereas in our approach this ratio depends on the parameter. We also differ from the focal loss [14], which is designed to down-weight well classified examples in order to focus on the misclassified ones, and therefore relies on the assumption that the label information is not noisy.

Figure 1: left: The divergence between two Bernoulli distributions. The figure shows the asymmetry induced by the loss parameter. The loss penalizes the false positives and the false negatives differently. right: Loss of the trivial model under noise, as a function of the loss parameter and of the model's output. The red line shows the optimal output for a given parameter value. Higher parameter values allow better mitigation of the noise, with lower optimal values.

3.1.2 Training with dynamic sampling of patches

We train by sampling patches from slides according to a distribution defined by the probability map of the model on the slides. Given a WSI, the model gives the cancer probability of each pixel. The probability of sampling a patch centered on a given pixel is derived from this map, with a parameter that controls the entropy of the distribution. At one extreme, the distribution is uniform and all pixels are sampled with equal probability, while higher values narrow the distribution on pixels having higher classification confidence.
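The sampling of patch centers can be sketched as follows. This minimal example assumes that the sampling weight of a pixel is its probability raised to a power `gamma`, normalized over the slide, so that `gamma = 0` gives a uniform distribution and larger values concentrate the sampling on high-probability pixels. The power form, the name `gamma`, and the function `sample_patch_centers` are assumptions made for illustration.

```python
import random


def sample_patch_centers(prob_map, gamma, n_samples, rng=None):
    """Sample (row, col) patch centers with weights prob_map[row][col] ** gamma.

    prob_map  : list of rows of per-pixel probabilities in [0, 1]
    gamma     : assumed entropy-control parameter (0 gives uniform sampling)
    n_samples : number of centers to draw, with replacement
    Raises ValueError if gamma > 0 and every probability is zero.
    """
    if rng is None:
        rng = random.Random()
    coords = [(i, j) for i, row in enumerate(prob_map) for j in range(len(row))]
    weights = [prob_map[i][j] ** gamma for i, j in coords]
    return rng.choices(coords, weights=weights, k=n_samples)


prob_map = [[0.0, 1.0],
            [0.0, 0.0]]
focused = sample_patch_centers(prob_map, gamma=2.0, n_samples=5, rng=random.Random(0))
uniform = sample_patch_centers(prob_map, gamma=0.0, n_samples=5, rng=random.Random(0))
```

In practice the probability map would be stored at a reduced resolution and the sampled center mapped back to slide coordinates before extracting the patch; those details are left out of this sketch.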

The training is performed using two processes. The mapping process computes the probability maps on slides using up-to-date model weights. Using a fully convolutional segmentation model speeds up the computation of the probability maps because inference can be run on larger patches. The training process samples patches from the latest probability map to feed a shuffle buffer from which training batches are built and used to update the model's parameters. The two processes are synchronized so that patches are sampled from probability maps that are as up to date as possible, and they run in parallel so that training proceeds without bottleneck. See figure 2 for an illustration of the training pipeline.

Slides sampling strategy

The mapping process samples malign and benign slides with the same rate. Benign slides are sampled with an emphasis on those containing high segmentation error according to the following probability:

The probability maps are taken from the latest map computed by the mapping process. A default value is used for slides that have not been computed yet. We ensure that every slide is sampled regularly, at least once within a fixed number of epochs.

Figure 2: Representation of the training pipeline: two processes run synchronously and in parallel. A mapping process computes cancer probability maps on slides that are used by a training process to extract patches used to train the model.

4 Experiments


We trained a model on the CAMELYON 16 dataset [1], which consists of 400 WSIs. The dataset was developed for the task of detecting metastases in H&E stained WSIs of lymph node sections. The tumors have been annotated at the pixel level by pathologists. We used only the slide level cancer presence information to train the model in a weakly supervised setup, and used the pixel level annotations for evaluation.

Segmentation model

We instantiated the segmentation model with a modified U-net [12] network. We replaced the convolutions with separable convolutions and added residual skip connections and batch normalization. We set the number of filters to 32 in the first layer. The model is trained at a resolution of 1 micrometer per pixel.


We pre-trained the segmentation model by initializing it with the weights of an auto-encoder that has the same network architecture but no lateral skip connections. The auto-encoder was trained to reconstruct its input using a reconstruction loss.

Data augmentation

We augmented the patches with random rotation, mirroring, elastic deformation and color jittering.


We have set , , and .


We measured the ability of the model to discriminate malignant and benign slides. We scored the slides with their maximum cancer pixel probability value. Given such scoring, the slide classification performance is measured by the ROC AUC (see the ROC curve in figure 4).
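The slide scoring and the ROC AUC computation can be sketched as follows. This is a minimal illustration in plain Python: the function names `slide_score` and `roc_auc` are hypothetical, the probability maps are toy values, and the AUC is computed from its rank interpretation (the probability that a random malignant slide scores higher than a random benign one, with ties counted as one half).

```python
def slide_score(prob_map):
    """Score a slide with its maximum pixel cancer probability."""
    return max(max(row) for row in prob_map)


def roc_auc(scores, labels):
    """ROC AUC from pairwise comparisons (labels: 1 = malign, 0 = benign)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy probability maps, one per slide (illustrative values only).
maps = [
    [[0.10, 0.20], [0.05, 0.30]],  # benign slide
    [[0.10, 0.90]],                # malign slide
    [[0.40]],                      # malign slide
    [[0.20, 0.35]],                # benign slide
]
labels = [0, 1, 1, 0]
scores = [slide_score(m) for m in maps]
auc = roc_auc(scores, labels)
```

The pairwise computation is quadratic in the number of slides, which is acceptable for a few hundred slides; a sort-based version would be preferable for larger sets.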

The quality of the segmentation has been measured using the same metric as the one proposed for the CAMELYON 16 challenge. It is the average sensitivity at 6 predefined false positive rates: 1/4, 1/2, 1, 2, 4, and 8 FPs per whole slide image. The resulting FROC curve is shown in figure 4. Figure 3 shows examples of successfully detected metastases. We can see that the segmentations are consistent with the morphology and have strong agreement with the ground truth.

Figure 3: Examples of successfully detected metastases from the test set (image 27 and 1). The cancer probabilities are overlaid on the tissue with a coloring going from blue to red. Red indicates higher probability value. The red outline represents the ground truth annotation.
Figure 4: left: Model’s ROC curve for slide classification task. right: Model’s FROC curve for slide segmentation task.

5 Conclusion

We have proposed a novel method to train a segmentation model at pixel resolution in a weakly supervised setup. We have shown that the model trained with scarce WSI level supervision was able to retrieve cancer localization at pixel resolution. Preliminary experiments have shown promising results in terms of classification and segmentation performance, inviting further investigation.


References

  • [1] Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, and the CAMELYON16 Consortium. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA, 318(22):2199–2210, 12 2017.
  • [2] Deepak Pathak, Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional multi-class multiple instance learning, 2014.
  • [3] Pedro O. Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks, 2014.
  • [4] Li Yao, Jordan Prosky, Eric Poblenz, Ben Covington, and Kevin Lyman. Weakly supervised medical diagnosis and localization from multiple resolutions, 2018.
  • [5] Gabriele Campanella, Vitor Werneck Krauss Silva, and Thomas J. Fuchs. Terabyte-scale deep multiple instance learning for classification and localization in pathology, 2018.
  • [6] Pierre Courtiol, Eric W. Tramel, Marc Sanselme, and Gilles Wainrib. Classification and disease localization in histopathology using only global labels: A weakly-supervised approach, 2018.
  • [7] Zhipeng Jia, Xingyi Huang, Eric I-Chao Chang, and Yan Xu. Constrained deep weak supervision for histopathology image segmentation, 2017.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [9] Hans Pinckaers, Wouter Bulten, and Geert Litjens. High resolution whole prostate biopsy classification using streaming stochastic gradient descent, 2019.
  • [10] Hans Pinckaers and Geert Litjens. Training convolutional neural networks with megapixel images, 2018.
  • [11] Alexandre Momeni, Marc Thibault, and Olivier Gevaert. Deep recurrent attention models for histopathological image analysis. bioRxiv, 2018.
  • [12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR, abs/1505.04597, 2015.
  • [13] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection, 2017.