Weakly supervised multiple instance learning histopathological tumor segmentation

by Marvin Lerousseau, et al.

Histopathological image segmentation is a challenging and important topic in medical imaging with tremendous potential impact in clinical practice. State-of-the-art methods rely on hand-crafted annotations, which reduces the scope of the solutions since digital histology suffers from a lack of standardization and samples differ significantly between cancer phenotypes. To this end, we propose a weakly supervised framework relying on weak standard clinical practice annotations, available in most medical centers. In particular, we exploit a multiple instance learning scheme providing a label for each instance, establishing a detailed segmentation of whole slide images. The potential of the framework is assessed with multi-centric data experiments using The Cancer Genome Atlas repository and the publicly available PatchCamelyon dataset. Promising results when compared with experts' annotations demonstrate the potential of our approach.




Code Repositories


Whole Slide Image segmentation with weakly supervised multiple instance learning on TCGA | MICCAI2020 https://arxiv.org/abs/2004.05024


1 Introduction

In digital pathology, whole slide images (WSI) are considered the gold standard for primary diagnosis [17, 14]. The use of computer-assisted image analysis tools is becoming mainstream for automatic quantitative or semi-quantitative analysis for pathologists, including the discovery of novel predictive biomarkers [13]. However, the quality and the lack of standards of digital scanners, on top of the tumor phenotype variability, remain the main challenges for machine learning methods in this domain. A central objective in digital pathology is the accurate identification of cells or tissue of interest. Tumor tissue computational staining could be used for slide screening towards increasing the efficiency of pathologists. Automatically computed tumor maps could be used to derive attention mechanisms for microscopic regions of interest [5], or be combined with automatic detection of lymphocytes [16] to further characterize the tumor and immune micro-environment for treatment response prediction [3].

Traditionally, segmentation is tackled by leveraging pixel-wise or patch-wise ground-truth annotations [9]. This is highly problematic in digital pathology, notably due to the colossal size of WSIs with respect to the biological components, and the overall lack of digitization of specimens. The high variance of clinical samples contributes to poor generalization, as illustrated in [4], where the front-runner solution of the CAMELYON16 challenge [2] reports higher classification errors on data from organs that were not included in training (out-of-distribution).

A standard multiple instance learning (MIL) scheme [8] deals with classifying an entity (bag) from its constituents (instances). The MIL paradigm is particularly suited to histopathological image analysis due to its ability to reason on subsets of data (patches), which is often a computational necessity in histopathology. The general approach of MIL consists in learning a model that embeds an instance into a latent space. Then, the collection (usually of fixed size) of instance latent vectors is forwarded into a gathering function which outputs the predicted bag probability, using different principles such as max-pooling, support vector machines [1], or even attention-based neural networks [12]. Recent large-scale histopathological studies provide promising solutions based on the MIL scheme [6, 5, 4]. Such diagnostics generally indicate whether a selected slide is non-neoplastic (normal), or the predicted subtype of apparent tumor tissue, without really indicating the tumoral regions in the slide.
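As a minimal sketch of the gathering step described above, the snippet below pools instance-level scores into a bag-level probability, first with max-pooling and then with a simplified attention-style weighted average. The function names and the use of scalar attention scores are illustrative assumptions; the cited works operate on latent vectors and learn the attention scores.

```python
import math


def max_pool_bag_probability(instance_probs):
    """Max-pooling MIL: the most suspicious instance decides the bag score."""
    return max(instance_probs)


def attention_pool_bag_probability(instance_probs, attention_scores):
    """Simplified attention pooling: softmax-weighted average of instance scores.

    attention_scores are assumed to be given here; in an attention-based
    MIL network they would be produced by a small learned sub-network.
    """
    top = max(attention_scores)
    exps = [math.exp(s - top) for s in attention_scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * p for w, p in zip(weights, instance_probs))


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.2]
    print(max_pool_bag_probability(probs))
    print(attention_pool_bag_probability(probs, [0.0, 0.0, 0.0]))
```

With max-pooling, a single confident patch is enough to flag a slide, which is one way to see why slide-level accuracy does not guarantee that all tumor patches are recalled.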

In such a context, there are two different approaches to MIL: classifying bags (or slides), and training an instance classifier (or segmentation) model. For instance, studies such as [6, 4, 5] use max-pooling MIL and its relaxed formulation [19] to first train a segmentation-like model, and investigate various ways to combine these heatmap predictions into a slide prediction. These works have demonstrated that MIL is viable for classifying WSI, with high reported AUC for tumor vs. normal slide classification, but lack extensive evaluation of the MIL-driven segmentation performance. Notably, accurate slide-based classification measures could lead to erroneous assessment regarding instance-level (i.e. segmentation) performance, as demonstrated in [11], where max-pooling based MIL is shown to recall only very discriminative tumor patches, with very high slide-level performance but very poor pixel-level qualitative performance.

In this paper, we demonstrate a weakly supervised segmentation scheme that is able to generate tumor segmentation models using annotations from the conventional clinical practice of physicians' assessment. The contributions of this paper are: (i) a generic meta-algorithm, generating labels from WSI binary values intended to train detailed WSI segmentation models, (ii) a training scheme providing instance-level predictions, trained only with binary WSI annotations, (iii) the release of 6481 automatically generated tumor maps for publicly available slides of TCGA, an order of magnitude above previously released WSI tumor maps, (iv) an open-source solution of the formulated framework, including WSI pre-processing tools.

2 Weakly supervised learning for tissue-type segmentation in histopathological images

We consider a set of training WSIs $\mathcal{S} = \{S_i\}_{i=1}^{N}$, where each slide $S_i$ comes with a label $s_i \in \{0, 1\}$, with 0 denoting normal tissue and 1 tumor tissue. Labels are binary and indicate whether tumor tissue is present within $S_i$. The goal is to learn a tumor segmentation model, or a patch-based classifier, relying only on those binary annotations. In a fully supervised setup, a batch of patches is randomly extracted along with their annotations (binary values or microscopic segmentation maps) from each $S_i$ in order to train a segmentation model. However, such annotations are difficult to obtain at the scale that is common in histopathology, so the aim of the framework is to generate a set of proxy patch-level ground-truth labels by exploiting properties of the available global labels.

By construction, a WSI with label 0 indicates that all extracted patches are of class 0, which is information equivalent to a fully supervised learning scheme. However, in slides with label 1, tumor tissue can be present to any extent on the slide. Conversely, in a WSI of label 1, normal tissue can theoretically cover anything from no pixel up to the entire region of the slide except one pixel. We integrate this property by proposing a training scheme in which two parameters, $\alpha$ and $\beta$, are used in a deterministic process for each training slide of label 1:

  • assign a positive label to the patches with the highest predicted probabilities, ensuring that at least a fraction $\alpha$ of the patches are of class 1

  • assign a negative label to the patches with the lowest predicted probabilities, ensuring that at least a fraction $\beta$ of the patches are of class 0

  • discard other patches from the loss computation

In such a setup, $\alpha$ represents the minimum assumed relative area of tumor tissue, and $\beta$ similarly the minimum assumed extent of normal tissue. Because the explicit process is deterministic, the framework is identified by the values of $\alpha$ and $\beta$. It is also noteworthy that values such that $\alpha + \beta > 1$ would produce contradictory proxy labels for a fraction $\alpha + \beta - 1$ of the patches, which could only impede training by diminishing the signal-to-noise ratio. Therefore, the possible space of these two parameters is defined as

$$\{(\alpha, \beta) \in [0, 1]^2 : \alpha + \beta \le 1\}.$$


Formally, given a loss function $\mathcal{L}$ (e.g. average binary cross-entropy or any gradient-based loss), the formulated framework aims at minimizing an empirical risk on a training set $\mathcal{S}$. For each slide $S_i$, this risk sums the loss over the set of patches extracted from the slide for which the predicted probability lies between two given percentiles of the slide's predictions, weighted by constants for batch averaging and class imbalance for both classes, and it uses the binary ground truth $s_i$ of the WSI. Minimizing this empirical risk guides models into recalling enough positive (tumoral) patches per slide (at least a fraction $\alpha$) but not too many (at most a fraction $1 - \beta$), while maintaining a low level of false positives in negative slides. The formulated approach is generic, in the sense that it can be used to train a large scope of machine learning models, including neural networks, with patch-based or pixel-wise predictions, and it can be coupled with any suitable loss function. It produces trained segmentation models, readily available to produce heatmaps without requiring the formulated pipeline or the $\alpha$ and $\beta$ parameters at inference.
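One form of the empirical risk that is consistent with the description above can be written as follows. This is a hedged reconstruction, not the authors' exact equation: the symbols $f_\theta$ (the model), $w_0$ and $w_1$ (the class constants), and $\mathcal{P}_i^{a,b}$ (the patches of slide $S_i$ whose predicted probability lies between the $a$ and $b$ quantiles of that slide's predictions) are notational assumptions.

```latex
\mathcal{R}(\theta) = \sum_{i=1}^{N} \Big[
    s_i \Big( w_1 \sum_{p \in \mathcal{P}_i^{1-\alpha,\,1}} \mathcal{L}\big(f_\theta(p), 1\big)
            + w_0 \sum_{p \in \mathcal{P}_i^{0,\,\beta}} \mathcal{L}\big(f_\theta(p), 0\big) \Big)
    + (1 - s_i)\, w_0 \sum_{p \in \mathcal{P}_i^{0,\,1}} \mathcal{L}\big(f_\theta(p), 0\big)
\Big]
```

For a positive slide ($s_i = 1$), only the top fraction $\alpha$ and the bottom fraction $\beta$ of patches contribute; for a negative slide ($s_i = 0$), every patch contributes with target 0.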

Figure 1: Illustration of the processing of a batch of patches from a positive WSI. A unique ResNet50 model with parameters $\theta$ is inferred on all images of the batch. For a given configuration of $(\alpha, \beta)$, these predictions are first used to create a proxy image-wise label vector, which is then combined with the predictions to compute the batch error for further parameter updates with backpropagation.
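The batch processing illustrated in Figure 1 can be sketched as below. This is a minimal illustration assuming that the two framework parameters are passed as fractions `alpha` and `beta`, and that patch counts are obtained by rounding; the function name `proxy_labels` and the rounding rule are hypothetical choices, not taken from the released code.

```python
def proxy_labels(probs, alpha, beta, slide_label):
    """Build a proxy label vector and a mask for one batch of patch predictions.

    probs: predicted tumor probabilities, one per patch of the same slide.
    Returns (proxy, mask): proxy[i] is the 0/1 target of patch i, and
    mask[i] is 1 if patch i takes part in the loss, 0 if it is discarded.
    """
    n = len(probs)
    if slide_label == 0:
        # Negative slide: every patch is known to be normal.
        return [0] * n, [1] * n
    if alpha + beta > 1:
        raise ValueError("alpha + beta must not exceed 1")

    n_pos = int(round(alpha * n))               # assumed rounding rule
    n_neg = min(int(round(beta * n)), n - n_pos)
    order = sorted(range(n), key=lambda i: probs[i])  # ascending probability

    proxy, mask = [0] * n, [0] * n
    for i in order[:n_neg]:                     # lowest probabilities -> class 0
        mask[i] = 1
    for i in order[n - n_pos:]:                 # highest probabilities -> class 1
        proxy[i], mask[i] = 1, 1
    return proxy, mask


if __name__ == "__main__":
    print(proxy_labels([0.9, 0.1, 0.5, 0.7, 0.3], alpha=0.4, beta=0.2, slide_label=1))
```

Patches that fall between the two groups keep a mask value of 0, so they contribute neither gradient nor error to the batch.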

3 Implementation details and Dataset

3.1 Framework setup and Architecture details

We benchmark a representative sample of the framework parameter space. Specifically, the space is sampled starting from 0 with an increment of 0.2 (or 20%) for both $\alpha$ and $\beta$. Of the resulting configurations, those with $\alpha = 0$ are discarded, as the framework would then provide only negative proxy labels, contradicting the assumption of our empirical risk formulation. In the end, 15 sampled configurations are used.
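The sampling of the parameter space can be reproduced with a short enumeration. The sketch below assumes that the grid is restricted to pairs whose sum does not exceed 1 and that the discarded configurations are those whose first parameter is 0; under these assumptions it yields the 15 configurations listed in Table 1. The function name is hypothetical.

```python
def framework_configurations(step=0.2):
    """Enumerate (alpha, beta) pairs on a regular grid with alpha + beta <= 1.

    Returns (sampled, kept): all grid pairs in the valid space, and the
    pairs that remain after discarding those with alpha == 0.
    """
    n = int(round(1 / step))
    # Work on integer grid indices to avoid floating-point comparison issues.
    sampled = [
        (round(i * step, 10), round(j * step, 10))
        for i in range(n + 1)
        for j in range(n + 1 - i)
    ]
    kept = [(a, b) for (a, b) in sampled if a > 0]
    return sampled, kept


if __name__ == "__main__":
    sampled, kept = framework_configurations()
    print(len(sampled), len(kept))
    for alpha, beta in kept:
        print(alpha, beta)
```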

Each configuration is used to train a ResNet50 architecture [10], which has been shown to be suited for histopathology image analysis in a multitude of tasks [15], and can be used without the global average pooling layer to yield a spatial map of outputs per input image. Pre-training is used, with initialization from a well-performing snapshot on ImageNet [7]. At each epoch, each training slide is sampled once. Upon sampling, a batch of 150 fixed-size patches is randomly sampled at a fixed magnification in the tissue region of the WSI. Data augmentation is applied independently on each image, with random rotations and flips, RGB normalization from channel-wise training averages and variances, and color jitter. The model is then concurrently inferred on the 150 patches, and a proxy vector is constructed with the formulated pipeline, as illustrated in Figure 1. Specifically, the fraction $\alpha$ of patches with the highest probabilities is associated with a proxy value of 1, and the fraction $\beta$ of patches with the lowest probabilities with a proxy value of 0. A masking vector of size 150 is concurrently filled such that patches with no attributed label are discarded. The proxy vector is then coupled with the model's predictions, minus the discarded ones, to compute an image-wise binary cross-entropy loss, which is averaged and backpropagated across all non-masked predictions. The error signal is used to tune the model parameters using the Adam optimizer with a fixed learning rate and otherwise default parameters. Each configuration is trained for 20 epochs on NVIDIA V100 GPU hardware. Finally, for our experiments, the two class-weighting constants of the empirical risk are set to a common value after grid search.
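The masked loss described in this paragraph can be illustrated in plain Python. The sketch below assumes that predictions, proxy labels, and mask are given as equal-length lists; the function name `masked_bce` and the clipping constant `eps` are illustrative assumptions, and a real training loop would use the tensor operations of a deep learning library instead.

```python
import math


def masked_bce(probs, proxy, mask, eps=1e-7):
    """Average binary cross-entropy over the patches that are not masked out."""
    total, count = 0.0, 0
    for p, y, m in zip(probs, proxy, mask):
        if not m:
            continue  # discarded patch: no contribution to the loss
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # The third patch is masked, so only the first two are averaged.
    print(masked_bce([0.9, 0.1, 0.5], proxy=[1, 0, 0], mask=[1, 1, 0]))
```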

3.2 Dataset

The dataset consists of 6481 whole slide images from The Cancer Genome Atlas, issued from kidney (2334), bronchus and lung (2168), and breast (1979) locations. These locations were selected on TCGA as the first indexed, and we perform no slide filtering or slide selection, to be coherent with standard clinical practices. This dataset was divided into training, validation and testing sets on a case basis, with 65%, 15%, and 20% of cases respectively. For the rest of the paper, this test set is denoted as "In-distribution". Each selected configuration is trained using the training set, with hyper-parameters optimized on the validation set. Then, its performance is assessed on the testing set. For extensive quantitative performance assessment, expert pathologists annotated 130 slides from this testing set (45 breast, 40 kidney, 45 bronchus and lung), thus measuring in-distribution generalization. Annotations were made at 20x magnification by a junior pathologist on an in-house annotation tool by contouring tumor tissue regions, and were modified until validation by a senior pathologist.
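Splitting on a case basis means that all slides of one patient land in the same subset. A minimal sketch of such a split is given below; the mapping `slide_to_case`, the function name, and the fixed seed are assumptions made for illustration and do not describe the authors' actual tooling.

```python
import random


def split_by_case(slide_to_case, fractions=(0.65, 0.15, 0.20), seed=0):
    """Split slides into train/val/test so that no case spans two subsets."""
    cases = sorted(set(slide_to_case.values()))
    random.Random(seed).shuffle(cases)

    n_train = int(round(fractions[0] * len(cases)))
    n_val = int(round(fractions[1] * len(cases)))
    groups = {
        "train": set(cases[:n_train]),
        "val": set(cases[n_train:n_train + n_val]),
        "test": set(cases[n_train + n_val:]),  # remaining cases
    }
    return {
        name: sorted(s for s, c in slide_to_case.items() if c in group)
        for name, group in groups.items()
    }


if __name__ == "__main__":
    # Toy data: 20 cases with 2 slides each.
    toy = {f"slide_{c}_{k}": f"case_{c}" for c in range(20) for k in range(2)}
    splits = split_by_case(toy)
    print({name: len(slides) for name, slides in splits.items()})
```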

The same protocol was applied on additional slides extracted from locations which are not used in the previous cohort. Specifically, 35 WSIs from colon, 35 from ovary and 30 from corpus uteri are pixel-wise annotated and used to measure the generalization performance of models to unseen tissue environments, which we denote as "out-of-location". We emphasize that these annotations were not used during training or validation, but only to assess the testing segmentation performance of the produced models. For training, we use diagnostic labels extracted directly from TCGA, for which each slide name contains a code indicating the type of specimen (e.g. "Solid Tissue Normal" or "Primary Solid Tumor"; see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes). Notably, normal slides are explicitly discerned from slides with apparent pathological tissue. In such a context, each slide is associated with a binary value indicating whether tumor tissue is apparent in the slide, or whether the slide is of normal tissue only. To further compare with results from the community, we infer all models on the PatchCamelyon dataset [18]. The dataset consists of patches extracted from formalin-fixed, paraffin-embedded (FFPE) tissues from sentinel lymph nodes. This testing set is particularly challenging for the benchmarked models, since they are not trained on FFPE slides, which are visually highly different from flash-frozen ones.
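The extraction of binary slide labels from TCGA slide names can be sketched as follows. The snippet assumes that slide names start with a standard TCGA barcode, in which the fourth dash-separated field begins with the two-digit sample type code (01 to 09 for tumor types, 10 to 19 for normal types, per the code table linked above). The function name and the example file names are illustrative.

```python
def tumor_label_from_slide_name(slide_name):
    """Return 1 for a tumor slide and 0 for a normal slide, from a TCGA barcode.

    Example of the assumed layout: TCGA-A6-2675-11A-01-TS1.svs, where
    "11" is the sample type code (Solid Tissue Normal).
    """
    barcode = slide_name.split(".")[0]
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {slide_name!r}")

    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return 1  # tumor sample types, e.g. 01 = Primary Solid Tumor
    if 10 <= code <= 19:
        return 0  # normal sample types, e.g. 11 = Solid Tissue Normal
    raise ValueError(f"unsupported sample type code {code:02d}")


if __name__ == "__main__":
    print(tumor_label_from_slide_name("TCGA-A6-2675-01A-01-TS1.svs"))
    print(tumor_label_from_slide_name("TCGA-A6-2675-11A-01-TS1.svs"))
```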

4 Results and discussion

For performance assessment, all trained ResNet50 models are inferred on the testing slides. The resulting heatmaps are compared to segmentation maps provided by the pathologists. All configurations are found to converge to better-than-random in-distribution performance, except for the two extreme configurations, as displayed in Table 1. Taking the lower-valued of the two in-distribution series in Table 1, the average AUC over the 15 configurations is 0.675, with an optimal AUC of 0.804 for $(\alpha, \beta) = (0.2, 0.2)$. Precision and recall are extracted after threshold selection on the validation set and displayed in Figure 2. The choice of configuration seems to influence the recall at the expense of precision. Upon performance introspection by location, all configurations report the worst performance for bronchus and lung locations, with two times lower AUC performance compared to kidney and breast locations. Concerning the out-of-location cohort, the corresponding average AUC is 0.680, which is close to in-distribution performance, although lower when omitting bronchus and lung locations. There is no evident pattern for configurations that yield improved out-of-location results.
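Threshold selection on a validation set followed by precision and recall computation can be sketched as below. The selection criterion is not stated in the text; maximizing the F1 score is used here purely as an assumption, and the function names are hypothetical.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall of thresholded predictions against 0/1 labels."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_threshold(val_probs, val_labels, candidates):
    """Pick the candidate threshold with the highest validation F1 (assumed criterion)."""
    def f1(threshold):
        p, r = precision_recall(val_probs, val_labels, threshold)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(candidates, key=f1)


if __name__ == "__main__":
    val_probs, val_labels = [0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]
    t = select_threshold(val_probs, val_labels, [0.3, 0.5, 0.7])
    print(t, precision_recall(val_probs, val_labels, t))
```

The threshold chosen on validation data is then applied unchanged to the test predictions, so that the reported precision and recall are not tuned on the test set.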

$\alpha$   $\beta$    In-distribution    Out-of-location
0.2    0      .786   .964        .866   .984
0.2    0.2    .804   .952        .732   .974
0.2    0.4    .749   .960        .709   .972
0.2    0.6    .681   .930        .583   .958
0.2    0.8    .566   .874        .257   .917
0.4    0      .720   .967        .783   .980
0.4    0.2    .726   .953        .710   .972
0.4    0.4    .767   .959        .762   .978
0.4    0.6    .685   .935        .658   .966
0.6    0      .766   .957        .790   .981
0.6    0.2    .758   .960        .787   .980
0.6    0.4    .650   .940        .673   .968
0.8    0      .589   .946        .785   .980
0.8    0.2    .619   .946        .695   .970
1.0    0      .256   .926        .404   .933
Table 1: Pixelwise AUC for the 15 framework configurations ($\alpha$: minimum assumed tumor fraction, $\beta$: minimum assumed normal fraction) on hold-out testing set (In-distribution) and unseen testing set from locations not used in training (Out-of-location). Two values are reported per testing set. Grey results take background into account, black ones are computed by completely discarding background.
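A pixelwise AUC with and without background can be computed with a rank-based estimator, as sketched below. How background pixels are scored in the benchmark is not specified in the text; the sketch assumes they are treated as non-tumor pixels that either take part in the computation or are removed beforehand. Function names are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties get average ranks)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos, n_pos = 0.0, 0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative pixel")
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def pixelwise_auc(scores, labels, is_background, keep_background=True):
    """AUC over pixels, optionally discarding background pixels first."""
    kept = [
        (s, y) for s, y, b in zip(scores, labels, is_background)
        if keep_background or not b
    ]
    return roc_auc([s for s, _ in kept], [y for _, y in kept])


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8, 0.0, 0.05]
    labels = [0, 0, 1, 1, 0, 0]
    background = [False, False, False, False, True, True]
    print(pixelwise_auc(scores, labels, background, keep_background=True))
    print(pixelwise_auc(scores, labels, background, keep_background=False))
```

In this toy example, low-scoring background pixels act as easy negatives and raise the AUC, which is why the two ways of counting are worth reporting separately.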

Visual results for two different testing samples are presented in Figure 3. In particular, the figure shows each WSI together with the pixel-wise annotations of the pathologist, as well as the segmentation maps obtained with the different configurations. One can observe that, visually, there are configurations that are close to the expert's annotation. These configurations are in line with our quantitative results. Additional post-processing strategies would potentially boost the performance of our framework.

Figure 2: Quantitative results for the 15 benchmarked configurations on the hold-out testing set from bronchus and lung, kidney, and breast locations. Each subplot (4 in total) displays a pixelwise measure, as indicated in its title, for each configuration in a matrix format. AUC: area under the ROC curve.
Figure 3: Unfiltered predicted tumor maps on hold-out testing samples for the 15 benchmarked framework configurations. Two WSIs and their corresponding results are displayed in a matrix format.

To test the generalization of our method, we also performed experiments on the PatchCamelyon dataset [18]. In particular, we found that most of the 15 configurations report quite low AUC. However, some configurations are found to generalize to some extent. Although these results are far from the AUC obtained with fully supervised models specifically trained on this dataset [18], they suggest that the presented framework could train models which can grasp generic discriminative cancer features from multiple types of slides in a broad biological context.

5 Conclusion and future works

In this paper, we propose a weakly supervised model which provides segmentation maps of WSIs, trained only with binary annotations over the entire WSIs. From our experiments, we saw that different configurations are expected to yield respectively high precision, high recall, and high overall performance for WSIs of different organs and tumor coverage. The findings in this paper highlight the potential of weakly supervised learning in histopathological image analysis, which is known to be heavily impeded by the annotation bottleneck. With the complete open-source release of the WSI pre-processing pipeline, the presented training framework, and the inference pipeline, the presented approach can be used off-the-shelf for pan-cancer tumor segmentation using the entire set of flash-frozen WSIs, or for other types of tissue segmentation, such as necrosis or stromal tissue. The public release of 6481 automatically generated tumor maps, with an expected AUC above 0.932, should lower the barrier of entry to pathomics by bypassing tumor annotation efforts. There are many ways to fine-tune a segmentation model using the formulated framework, such as with more appropriate deep learning architectures or with more extensive hyper-parameter optimization. We believe the most impactful future works will revolve around generating proxy labels from more sophisticated slide labels, which would yield more information while remaining fast to obtain, essentially trading annotation time for performance.


  • [1] S. Andrews, I. Tsochantaridis, and T. Hofmann (2003) Support vector machines for multiple-instance learning. In Advances in neural information processing systems, pp. 577–584. Cited by: §1.
  • [2] B. E. Bejnordi, M. Veta, P. J. Van Diest, B. Van Ginneken, N. Karssemeijer, G. Litjens, J. A. Van Der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. (2017) Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama 318 (22), pp. 2199–2210. Cited by: §1.
  • [3] M. Binnewies, E. W. Roberts, K. Kersten, V. Chan, D. F. Fearon, M. Merad, L. M. Coussens, D. I. Gabrilovich, S. Ostrand-Rosenberg, C. C. Hedrick, et al. (2018) Understanding the tumor immune microenvironment (time) for effective therapy. Nature medicine 24 (5), pp. 541–550. Cited by: §1.
  • [4] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs (2019) Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine 25 (8), pp. 1301–1309. Cited by: §1, §1, §1.
  • [5] G. Campanella, V. W. K. Silva, and T. J. Fuchs (2018) Terabyte-scale deep multiple instance learning for classification and localization in pathology. arXiv preprint arXiv:1805.06983. Cited by: §1, §1, §1.
  • [6] N. Coudray, P. S. Ocampo, T. Sakellaropoulos, N. Narula, M. Snuderl, D. Fenyö, A. L. Moreira, N. Razavian, and A. Tsirigos (2018) Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine 24 (10), pp. 1559. Cited by: §1, §1.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.1.
  • [8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez (1997) Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89 (1-2), pp. 31–71. Cited by: §1.
  • [9] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew (2018) A review of semantic segmentation using deep neural networks. International journal of multimedia information retrieval 7 (2), pp. 87–93. Cited by: §1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • [11] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz (2016) Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2424–2433. Cited by: §1.
  • [12] M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §1.
  • [13] A. Janowczyk and A. Madabhushi (2016) Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. Journal of pathology informatics 7. Cited by: §1.
  • [14] A. R. Jara-Lazaro, T. P. Thamboo, M. Teh, and P. H. Tan (2010) Digital pathology: exploring its applications in diagnostic surgical pathology practice. Pathology 42 (6), pp. 512–518. Cited by: §1.
  • [15] R. Mormont, P. Geurts, and R. Marée (2018) Comparison of deep transfer learning strategies for digital pathology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2262–2271. Cited by: §3.1.
  • [16] J. Saltz, R. Gupta, L. Hou, T. Kurc, P. Singh, V. Nguyen, D. Samaras, K. R. Shroyer, T. Zhao, R. Batiste, et al. (2018) Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell reports 23 (1), pp. 181–193. Cited by: §1.
  • [17] N. Stathonikos, M. Veta, A. Huisman, and P. J. van Diest (2013) Going fully digital: perspective of a dutch academic pathology lab. Journal of pathology informatics 4. Cited by: §1.
  • [18] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling (2018) Rotation equivariant cnns for digital pathology. In International Conference on Medical image computing and computer-assisted intervention, pp. 210–218. Cited by: §3.2, §4.
  • [19] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie (2017) Deep multi-instance networks with sparse label assignment for whole mammogram classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 603–611. Cited by: §1.