Accounting for Dependencies in Deep Learning Based Multiple Instance Learning for Whole Slide Imaging

by Andriy Myronenko, et al.

Multiple instance learning (MIL) is a key algorithm for the classification of whole slide images (WSI). Histology WSIs can have billions of pixels, which creates enormous computational and annotation challenges. Typically, such images are divided into a set of patches (a bag of instances), and only bag-level class labels are provided. Deep learning based MIL methods calculate instance features using a convolutional neural network (CNN). Our proposed approach is also deep learning based, with two contributions. First, we propose to explicitly account for dependencies between instances during training, by embedding self-attention Transformer blocks that capture these dependencies. For example, a tumor grade may depend on the presence of several particular patterns at different locations in a WSI, which requires accounting for dependencies between patches. Second, we propose an instance-wise loss function based on instance pseudo-labels. We compare the proposed algorithm to multiple baseline methods, evaluate it on the PANDA challenge dataset, the largest publicly available WSI dataset with over 11K images, and demonstrate state-of-the-art results.





1 Introduction

Whole slide images (WSI) are digitized histology slides, often analysed for the diagnosis of cancer [3]. A WSI can contain several billion pixels, and is commonly tiled into smaller patches for processing to reduce the computational burden (Figure 1). Another reason to use patches is that the area of interest (tumor cells) occupies only a tiny fraction of the image, which impedes the performance of conventional classifiers, most of which assume that the class object occupies a large central part of the image. Unfortunately, patch-wise labels are usually not available, since detailed annotations are too costly and time-consuming. An alternative to supervised learning is weakly-supervised learning, where only a single label per WSI is available.

Multiple Instance Learning (MIL) is a weakly supervised learning algorithm, which aims to train a model using a set of weakly labeled data [5, 13]. Usually a single class label is provided for a bag of many unlabeled instances, indicating that at least one instance has the provided class label. MIL has many applications in computer vision and language processing [4]; however, learning from bags raises important challenges that are unique to MIL. In the context of histopathology, a WSI represents a bag, and the extracted patches (or their features) represent instances (we often use these notions interchangeably).

With the advent of convolutional neural networks (CNN), deep learning based MIL has become the mainstream methodological choice for WSI [9]. Campanella et al. [3] conducted one of the first large studies, on over 44K WSI, laying the foundation for MIL applications in clinical practice. Since the instance labels are not known, the classical MIL algorithm usually selects only one (or a few) instances based on the maximum of the prediction probability at the current iteration. Such an approach is very time consuming, as inference must be run on all patches, while only a single patch contributes to the training of the CNN at each iteration. Ilse et al. [14] proposed to use an attention mechanism (a learnable weight per instance) to utilize all image patches, which we also adopt.

More recent MIL methods include the work of Zhao et al. [18], who proposed to pre-train a feature extractor based on a variational auto-encoder and use a graph convolutional network for the final classification. Hashimoto et al. [7] proposed to combine MIL with domain-adversarial normalization and multi-scale learning. Lu et al. [11] precomputed patch-level features offline (using a pretrained CNN) to speed up training, and proposed an additional clustering-based loss to improve generalization during MIL training. Maksoud et al. [12] proposed a hierarchical approach that processes the down-scaled WSI first, followed by high-resolution processing only when necessary; this approach demonstrated a significant reduction in processing time while maintaining the baseline accuracy.

Figure 1: An example of patch extraction from WSI from the PANDA challenge dataset [2]. We tile the image and retain only the foreground patches, out of which we take a random subset to form a bag.

We observed that most MIL methods assume no dependencies among instances, which is seldom true, especially in histopathology [9]. Furthermore, the lack of instance-level loss supervision creates more opportunities for CNNs to overfit.

In this work, we propose a deep learning based MIL algorithm for WSI classification with the following contributions:

  • we propose to explicitly account for dependencies between instances during training. We embed transformer encoder [15] blocks into the classification CNN to capture the dependencies between instances.

  • we propose an instance-wise loss supervision based on instance pseudo-labels. The pseudo-labels are computed based on the ensemble of several models, by aggregating the attention weights and instance-level predictions.

We evaluate the proposed method on the PANDA challenge dataset [2], currently the largest publicly available WSI dataset with over 11,000 images, against several baseline methods as well as against the Kaggle challenge leaderboard with over 1000 competing teams, and demonstrate state-of-the-art (SOTA) classification results.

2 Method

MIL aims to classify a bag of instances as positive if at least one of the instances is positive. The number of instances can vary between bags. Individual instance labels $y_k$ are unknown, and only the bag-level label $Y$ is provided:

$$Y = \begin{cases} 0, & \text{if } \sum_k y_k = 0 \\ 1, & \text{otherwise,} \end{cases}$$

which is equivalent to a definition using a Max operator: $Y = \max_k \{y_k\}$.
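As an illustrative sketch (not part of the proposed pipeline), the bag-label rule above can be written in a few lines of Python:

```python
import numpy as np

def bag_label(instance_labels):
    """Bag is positive iff at least one instance is positive."""
    y = np.asarray(instance_labels)
    return int(y.max())  # equivalent to: 1 if y.sum() > 0 else 0

# A bag with a single positive instance is positive:
assert bag_label([0, 0, 1, 0]) == 1
assert bag_label([0, 0, 0]) == 0
```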

Training a model whose loss is based on the maximum over instance labels is problematic due to vanishing gradients [14], and the training process becomes slow since only a single patch contributes to the optimization. Ilse et al. [14] proposed to use all image patches as a linear combination weighted by attention weights. Consider $\mathbf{h}_k$ to be instance embeddings, e.g. features of the CNN final layer after average pooling. Then a linear combination of patch embeddings is

$$\mathbf{z} = \sum_{k=1}^{K} a_k \mathbf{h}_k,$$

where the attention weights of the patch embeddings are

$$a_k = \frac{\exp\left(\mathbf{w}^\top \tanh\left(\mathbf{V} \mathbf{h}_k^\top\right)\right)}{\sum_{j=1}^{K} \exp\left(\mathbf{w}^\top \tanh\left(\mathbf{V} \mathbf{h}_j^\top\right)\right)},$$

where $\mathbf{w}$ and $\mathbf{V}$ are learnable parameters. The attention weights are computed using a multilayer perceptron (MLP) network with a single hidden layer.
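This attention pooling can be sketched in numpy (a minimal illustration with random parameters w and V standing in for the trained single-hidden-layer MLP):

```python
import numpy as np

def attention_pool(H, V, w):
    """Attention-based MIL pooling (after Ilse et al. [14]).
    H: (K, d) instance embeddings; V: (m, d) and w: (m,) MLP parameters."""
    scores = w @ np.tanh(V @ H.T)      # (K,) unnormalized attention scores
    a = np.exp(scores - scores.max())
    a = a / a.sum()                    # softmax over the K instances
    z = a @ H                          # (d,) bag embedding: sum_k a_k h_k
    return z, a

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))        # 5 instances, 8-dim embeddings
V = rng.standard_normal((4, 8))        # hidden layer of size 4
w = rng.standard_normal(4)
z, a = attention_pool(H, V, w)
assert np.isclose(a.sum(), 1.0)        # weights form a distribution over instances
```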

2.1 Dependency between instances

The assumption of no dependency between the bag instances often does not hold. For example, to grade the severity of prostate cancer, pathologists need to find two distinct tumor growth patterns in the image and assign Gleason scores to each [1]. Then the International Society of Urological Pathology (ISUP) grade is calculated based on the combination of the major and minor Gleason patterns. The ISUP grade indicates the severity of the tumor and plays a crucial role in treatment planning. Here, we propose to use self-attention to account for dependencies between instances. In particular, we adopt the transformer, which was initially introduced to capture long-range dependencies between words in sentences [15] and later applied to vision [6]. Whereas traditional convolutions are local operations, the self-attention block of transformers directly computes attention between all combinations of tokens over a larger range.
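For concreteness, the standard mapping from (major, minor) Gleason patterns to the ISUP grade can be written as a small lookup table; this is background for illustration only, not part of the proposed model:

```python
# Standard (major, minor) Gleason pattern -> ISUP grade mapping [1].
ISUP = {
    (3, 3): 1,
    (3, 4): 2,
    (4, 3): 3,
    (4, 4): 4, (3, 5): 4, (5, 3): 4,
    (4, 5): 5, (5, 4): 5, (5, 5): 5,
}

def isup_grade(major, minor):
    """ISUP grade from the major and minor Gleason growth patterns."""
    return ISUP[(major, minor)]

assert isup_grade(3, 4) == 2   # same patterns as (4, 3) ...
assert isup_grade(4, 3) == 3   # ... but different majority, so a different grade
```

Note that (3, 4) and (4, 3) contain the same two patterns yet map to different grades, which is exactly why the classifier must relate patches to each other rather than score them independently.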

A key component of transformer blocks is the scaled dot-product self-attention, defined as

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V},$$

where the query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$ matrices are all derived as linear transformations of the input (in our case, the instance feature space). The self-attention is performed several times in parallel with different learned linear projections (multi-head attention). In addition to self-attention, each of the transformer encoder layers also contains a fully connected feed-forward network and layer normalization (see Figure 2) [15, 6].
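A minimal single-head numpy sketch of scaled dot-product self-attention over K instance features (random projection matrices stand in for the learned ones):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention.
    X: (K, d) instance features; Wq, Wk, Wv: (d, d_k) linear projections."""
    Q, K_, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K_.T / np.sqrt(d_k)                    # (K, K) pairwise affinities
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)               # row-wise softmax
    return A @ V, A                                     # attended values, attention matrix

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))                         # 3 instances, 4-dim features
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

Entry A[i, j] measures how much instance j informs instance i; off-diagonal structure in A is what lets the model relate tumor patterns at different slide locations.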

Figure 2: Model architecture overview. The backbone CNN (blue) extracts features at different scales, which are spatially average-pooled before being fed into the transformer encoder layers (green) to account for dependencies between instances. The input to the network is of size $B \times K \times 3 \times W \times H$, where $B$ is the batch size, $K$ is the number of instances (patches extracted from a single whole slide image), and $W \times H$ is the spatial patch size.

We propose two variants of utilizing transformers. In the simplest case we attach a transformer encoder block only at the end of the backbone classification CNN, after average pooling. The idea is similar to the approach proposed in visual transformers [6], except that there the transformer operates before average pooling. The difference is that in visual transformers the goal was to account for dependencies between the spatial regions (16 × 16 px) of the same patch, whereas we want to account for the dependencies among the patches. Another relevant work was proposed by Wang et al. [16], who utilize self-attention within MIL, but for text-based disease symptom classification. We maintain the dimensionality of the encoded data, so that the input, output, and hidden dimensionality of the transformer encoder are the same. We call this variant Transformer MIL.

We also consider a variant with a deeper integration of the transformer into the backbone CNN. We attach separate transformer encoder blocks after each of the main ResNet blocks [8] to capture the patch encodings at different levels of its feature pyramid. The output of the first transformer encoder is concatenated with the next feature scale of ResNet (after average pooling), and fed into the next-level transformer encoder, up until the final encoder layer, followed by the attention layer. We want to capture dependencies between patches at multiple scales, since different levels of CNN output features encode different semantic information. Such a Pyramid Transformer MIL network is shown in Figure 2.

2.2 Instance level semi-supervision and pseudo-labeling

Figure 3: An example ISUP grade 5 prostate cancer WSI. (a) The green mask overlay shows the ground truth location of cancer regions (provided in the PANDA dataset [2]). (b) An additional heat map overlay visualizes our pseudo-labels of ISUP 5 (weighted by attention), obtained by training on weak (bag-level) labels only. Notice the close agreement between the dense pseudo-labels and the ground truth. In practice, pseudo-labels are computed per patch; here we used a sliding-window approach for dense visualization.

One of the challenges of MIL training is the lack of instance labels to guide the optimization process. A somewhat similar issue is encountered in semi-supervised learning [17], where pseudo-labels are used either offline or on the fly, based on some intermediate estimates or another network's predictions. Here, we propose to generate pseudo-labels for each image patch and use an additional patch-wise loss to assist the optimization process:

$$\mathcal{L} = \mathcal{L}_{bag} + \mathcal{L}_{patch},$$

where the total loss includes a bag-level loss $\mathcal{L}_{bag}$ (based on the ground truth labels) and a patch-level loss $\mathcal{L}_{patch}$ (based on the pseudo-labels). We use the cross-entropy loss function for both the bag-level and patch-level losses.
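A sketch of this combined loss in numpy; the UNKNOWN sentinel for patches without pseudo-labels is our illustrative convention, not notation from the paper:

```python
import numpy as np

UNKNOWN = -1  # sentinel: patch excluded from the patch-wise loss

def cross_entropy(probs, labels):
    """Mean cross-entropy given per-row class probabilities and integer labels."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def total_loss(bag_probs, bag_label, patch_probs, pseudo_labels):
    """L = L_bag (ground-truth bag label) + L_patch (pseudo-labels),
    skipping patches flagged as UNKNOWN."""
    l_bag = cross_entropy(bag_probs, np.array([bag_label]))
    known = pseudo_labels != UNKNOWN
    l_patch = cross_entropy(patch_probs[known], pseudo_labels[known]) if known.any() else 0.0
    return l_bag + l_patch
```

When every patch is flagged UNKNOWN the objective degrades gracefully to the bag-level loss alone, i.e. to standard weakly supervised MIL training.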

We opt for a simple approach to generate pseudo-labels based on ensembling several identical models trained from random initializations. The final ensembled labels are hard labels (rounded to the nearest class). Consider a trained network; its bag-level prediction is based on the final output vector $\mathbf{z}$ (see Eq. 2), followed by a linear projection onto the number of output classes:

$$\hat{Y} = \sigma(\mathbf{W}\mathbf{z}) = \sigma\Big(\mathbf{W}\sum_k a_k \mathbf{h}_k\Big);$$

here we assumed a final sigmoid function (but the same holds with softmax). We approximate the individual instance-level predictions as $\hat{y}_k = \sigma(a_k \mathbf{W} \mathbf{h}_k)$.
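Under this notation, a numpy sketch with random stand-in parameters makes the approximation concrete: the attention-weighted instance logits sum exactly to the bag logits, which justifies reading them as per-instance contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, C = 6, 8, 6                     # instances, embedding dim, classes

H = rng.standard_normal((K, d))       # instance embeddings h_k
a = rng.random(K); a /= a.sum()       # attention weights, sum to 1
W = rng.standard_normal((C, d))       # final linear projection

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

z = a @ H                             # bag embedding: sum_k a_k h_k
bag_pred = sigmoid(W @ z)             # bag-level prediction
inst_logits = (a[:, None] * H) @ W.T  # per-instance logits a_k * W h_k
inst_pred = sigmoid(inst_logits)      # approximate instance-level predictions

# Consistency: the instance logits sum to the bag logits, by linearity.
assert np.allclose(inst_logits.sum(axis=0), W @ z)
```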

Train N MIL models (N = 5)
for each image do
      Run inference on all patches in the bag
      Ensemble the predictions of attention weights and instance classes
      if bag label is not zero then
            for patches within the top 10% of attention weights: assign the ensembled labels as pseudo-labels
            for patches within the bottom 10% of attention weights: assign zero labels as pseudo-labels
            otherwise: flag the patch as having an unknown pseudo-label
      else
            assign zero pseudo-labels to all patches, since here we know that all patches must have zero labels
      end if
end for
Pseudocode 1: Pseudo-label assignment
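A numpy sketch of the per-image assignment step of Pseudocode 1; the 10% fraction follows the paper, while the UNKNOWN sentinel is an illustrative convention:

```python
import numpy as np

UNKNOWN = -1

def assign_pseudo_labels(attention, ensembled_labels, bag_label, frac=0.10):
    """Per-patch pseudo-labels from ensembled attention weights.
    Top `frac` of attention -> ensembled class; bottom `frac` -> class 0;
    the rest -> UNKNOWN (excluded from the patch-wise loss)."""
    K = len(attention)
    if bag_label == 0:
        return np.zeros(K, dtype=int)   # benign slide: every patch must be negative
    pseudo = np.full(K, UNKNOWN)
    order = np.argsort(attention)       # ascending attention
    n = max(1, int(frac * K))
    pseudo[order[-n:]] = ensembled_labels[order[-n:]]  # most attended patches
    pseudo[order[:n]] = 0                              # least attended patches
    return pseudo
```

Usage: for a 10-patch bag with bag label 4 and ensembled class 3 everywhere, only the single most-attended patch gets label 3, the least-attended gets 0, and the remaining eight stay UNKNOWN.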

Pseudocode 1 shows the algorithm to compute the pseudo-labels. Patches whose ensembled attention weights are neither small nor large (as defined by the 10% threshold) are not assigned any pseudo-labels; we mark them as unknown and exclude them from the patch-level loss. Given the pseudo-labels, we re-optimize the model using the additional patch-wise loss. The 10% heuristic was chosen to retain only the most confident patches, which contribute the most to the final bag-level classification. A relevant approach was recently proposed by Lerousseau et al. [10]; however, the goal of their work is a dense segmentation map rather than improving the global classification accuracy, and their pseudo-labels are calculated differently, by thresholding the current prediction probability estimates on the fly.

3 Experiments

We implemented our method in PyTorch and trained it on 4 NVIDIA Tesla V100 16GB GPUs with a batch size of 16. For the classification backbone, we use a ResNet50 pretrained on ImageNet [8]. For the transformer layers, we keep a configuration similar to [15], with 4 stacked transformer encoder blocks. The lower pyramid-level transformer has a dimensionality of 256 for both the input and hidden layers. The final transformer encoder has an input dimension of 2308 (a concatenation of the ResNet50 output features and the previous transformer outputs). We use the Adam optimizer with separate initial learning rates for the CNN and transformer parameters, gradually decreased using a cosine learning rate scheduler over 50 epochs. We use 5-fold cross-validation to tune the parameters. For the transformer layers only, we use weight decay and no dropout.

3.0.1 PANDA dataset

The Prostate cANcer graDe Assessment (PANDA) challenge dataset consists of ~11K whole-slide images from two centers [2]. Currently, this is the largest public WSI dataset available. The grading process consists of finding and classifying cancer tissue into Gleason patterns based on the architectural growth patterns of the tumor [1]. These are then converted into an ISUP grade on a 1-5 scale, based on the presence of two distinct Gleason patterns. The dataset was provided as part of the PANDA Kaggle challenge, which attracted more than 1000 teams, with the goal of predicting the most accurate ISUP grades. Each individual image is on average about 25,000px × 25,000px RGB. The challenge also includes a hidden dataset, whose images were graded by multiple pathologists. The private dataset labels are not publicly available, but can be used to assess a model blindly via the Kaggle website (invisible to the public since the challenge is now closed). In our experiments, we use medium-resolution input images (4x smaller than the highest resolution).

3.0.2 Patch selection

To extract patches from a WSI, we tile the image into a grid of 224px × 224px patches. At each iteration, the grid has a random offset from the top-left corner to ensure randomness of the patches. We then retain only the foreground patches. From the remaining patches, we keep only a random subset (K=56), which is a trade-off between covering the tissue content and GPU memory limits (see Figure 1). We use a batch size of 16, which determines the data input size at each iteration. During testing, inference is done using all foreground patches.
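The tiling-with-random-offset step can be sketched as follows. For simplicity the WSI is represented by a foreground mask, and the foreground threshold `fg_thresh` is an assumed heuristic not specified in the paper:

```python
import numpy as np

def sample_patches(fg_mask, patch=224, k=56, fg_thresh=0.1, rng=None):
    """Tile a slide into patch x patch tiles with a random grid offset,
    keep foreground tiles, and return a random subset of up to k of them.
    `fg_mask` is an (H, W) array with foreground pixels > 0."""
    rng = rng or np.random.default_rng()
    oy, ox = rng.integers(0, patch, size=2)   # random offset from the top-left corner
    H, W = fg_mask.shape
    tiles = []
    for y in range(oy, H - patch + 1, patch):
        for x in range(ox, W - patch + 1, patch):
            tile = fg_mask[y:y + patch, x:x + patch]
            if tile.mean() > fg_thresh:       # retain foreground tiles only
                tiles.append((y, x))
    if len(tiles) > k:                        # random subset, GPU-memory trade-off
        idx = rng.choice(len(tiles), size=k, replace=False)
        tiles = [tiles[i] for i in idx]
    return tiles
```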

3.1 Results

3.1.1 Transformer MIL

We evaluate and compare our method to Attention MIL and Gated Attention MIL [14], as well as to a classical MIL with the Max operator [3]. For evaluation metrics we use Accuracy, Area Under the Curve (AUC), and Quadratic Weighted Kappa (QWK) of the ISUP grade prediction (see Table 1). The QWK metric measures the similarity between the predictions and targets, with a maximum value of 1. QWK was chosen as the main metric during the PANDA challenge [2], since it is more appropriate for tasks where the predicted classes are severity grades/levels. All metrics are computed using our 5-fold (80%/20% training/validation) splits, except for the Leaderboard column results, which come from the evaluation on the Kaggle challenge's hidden private test set. Even though the challenge is closed now, it allows for a blind submission of a code snippet, which runs on the PANDA hidden set and outputs the final QWK number. These results are not added to the Kaggle leaderboard, and are allowed only for post-challenge evaluations. Table 1 shows that the two proposed transformer-based approaches outperform the other methods both on our validation sets and on the challenge hidden set. We also inspected the self-attention matrices and found that, in many cases, they have distinct off-diagonal high-value elements. In particular, instances with WSI tumor cells of different Gleason scores have higher off-diagonal values, indicating that such a combination is valuable for the final classification, which was captured by the transformer self-attention.
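Since QWK is the main metric, a self-contained numpy implementation may be useful; it should match scikit-learn's `cohen_kappa_score` with `weights='quadratic'`:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """Quadratic Weighted Kappa for ordinal grades: disagreements are
    penalized by the squared distance between the grades."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1                                 # observed confusion matrix
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # chance agreement
    return 1.0 - (W * O).sum() / (W * E).sum()

assert quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3]) == 1.0  # perfect agreement
```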

3.1.2 Patch-wise pseudo-labels

We train 5 models and ensemble their patch-level predictions. We show the performance of adding the pseudo-label supervision in Table 2. In all cases the performance improved compared to the corresponding baselines shown in Table 1. Table 2 also shows the QWK results of the winners (top 3 places) of the PANDA Kaggle challenge. Notice that our single-model results are on par with the winners of the challenge (who all use ensembling of several models). We also experimented with ensembling: an ensemble of our 10 models achieves a leaderboard QWK of 0.94136, which would have taken first place in the leaderboard.

We also tried repeating the pseudo-labeling for several rounds, but found no benefit, since the pseudo-label values barely change after the first round.

Accuracy AUC QWK Leaderboard
Attention MIL [14]
Gated attention MIL [14]
Max MIL [3]
Transformer MIL
Pyramid Transformer MIL
Table 1: Evaluation results on the PANDA dataset. The Leaderboard column shows the QWK results on the private leaderboard of the Kaggle challenge, which allows a direct comparison to more than 1000 participants.
QWK (val) QWK (Leaderboard)
Attention MIL [14] + Pseudo-labels
Transformer MIL + Pseudo-labels
Pyramid Transformer MIL + Pseudo-labels
First place - Panda kaggle challenge [2] -
Second place - Panda kaggle challenge [2] -
Third place - Panda kaggle challenge [2] -
Pyramid Transformer MIL (ours, ensemble of 10) -
Table 2: Evaluation results of adding pseudo-labels to our baseline transformer MIL approaches. We also include the results of the top three places of the challenge (all use ensembling of several models). Our results indicate that pseudo-labeling further improves the performance, with our single model providing results on par with the top winning teams.

4 Discussion and Conclusion

We proposed a new deep learning based MIL approach for WSI classification with two main contributions: the addition of transformer modules to account for dependencies among instances, and an instance-level supervision loss using pseudo-labels. We evaluated the method on the PANDA challenge prostate WSI dataset, which includes over 11,000 images. To put this in perspective, most recently published SOTA methods evaluated their performance on datasets of the order of only several hundred images [18, 7, 11, 12]. Furthermore, we compared our results directly to the leaderboard of the PANDA Kaggle challenge, with over 1000 participating teams, and demonstrated that our single-model performance is on par with the top three winning teams, as evaluated blindly on the same hidden private test set. Finally, recently proposed visual transformers [6] have shown the capability to replace the classification CNN completely, opening the possibility of a deep learning based MIL model built solely from transformer blocks; we leave these investigations for future research.


  • [1] W. Bulten, M. Balkenhol, J. A. Belinga, A. Brilhante, A. Cakır, L. Egevad, M. Eklund, X. Farre, K. Geronatsiou, V. Molinie, G. Pereira, P. Roy, G. Saile, P. Salles, E. Schaafsma, J. Tschui, A. Vos, I. P. I. E. Panel, H. van Boven, R. Vink, J. van der Laak, C. H. der Kaa, and G. Litjens (2021) Artificial intelligence assistance significantly improves gleason grading of prostate biopsies by pathologists. Modern Pathology 34, pp. 660–671. Cited by: §2.1, §3.0.1.
  • [2] W. Bulten, G. Litjens, H. Pinckaers, P. Ström, M. Eklund, L. Egevad, H. Grönberg, P. R. Kimmo Kartasalo, T. Häkkinen, S. Dane, and M. Demkin (2020) The panda challenge: prostate cancer grade assessment using the gleason grading system. In MICCAI challenge, External Links: Link Cited by: Figure 1, §1, Figure 3, §3.0.1, §3.1.1, Table 2.
  • [3] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs (2019) Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25, pp. 1301–1309. Cited by: §1, §1, §3.1.1, Table 1.
  • [4] M. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon (2018) Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition 77, pp. 329–353. Cited by: §1.
  • [5] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez (1997) Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89, pp. 31–71. Cited by: §1.
  • [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2.1, §2.1, §2.1, §4.
  • [7] N. Hashimoto, D. Fukushima, R. Koga, Y. Takagi, K. Ko, K. Kohno, M. Nakaguro, S. Nakamura, H. Hontani, and I. Takeuchi (2020) Multi-scale domain-adversarial multiple-instance cnn for cancer subtype classification with unannotated histopathological images. In CVPR, Cited by: §1, §4.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), Cited by: §2.1, §3.
  • [9] C. L. Srinidhi, O. Ciga, and A. L. Martel (2021) Deep neural network models for computational histopathology: a survey. Medical Image Analysis 67. Cited by: §1, §1.
  • [10] M. Lerousseau, M. Vakalopoulou, M. Classe, J. Adam, E. Battistella, A. Carre, T. Estienne, T. Henry, E. Deutsch, and N. Paragios (2020) Weakly supervised multiple instance learning histopathological tumor segmentation. In MICCAI, Vol. 12265, pp. 470–479. Cited by: §2.2.
  • [11] M. Y. Lu, D. F. K. Williamson, T. Y. Chen, R. J. Chen, M. Barbieri, and F. Mahmood (2021) Data efficient and weakly supervised computational pathology on whole slide images. Nature Biomedical Engineering. Cited by: §1, §4.
  • [12] S. Maksoud, K. Zhao, P. Hobson, A. Jennings, and B. C. Lovell (2020) SOS: selective objective switch for rapid immunofluorescence whole slide image classification. In CVPR, Cited by: §1, §4.
  • [13] O. Maron and T. Lozano-Perez (1998) A framework for multiple-instance learning. In NIPS, pp. 570–576. Cited by: §1.
  • [14] M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. In ICML, pp. 2127–2136. Cited by: §1, §2, §3.1.1, Table 1, Table 2.
  • [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Vol. 30. Cited by: 1st item, §2.1, §2.1, §3.
  • [16] Z. Wang, J. Poon, and S. K. Poon (2019) AMI-net+: A novel multi-instance neural network for medical diagnosis from incomplete and imbalanced data. Aust. J. Intell. Inf. Process. Syst. 15 (3), pp. 8–15. Cited by: §2.1.
  • [17] Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves ImageNet classification. In CVPR, pp. 10687–10698. Cited by: §2.2.
  • [18] Y. Zhao, F. Yang, Y. Fang, H. Liu, N. Zhou, J. Zhang, J. Sun, S. Yang, B. Menze, X. Fan, and J. Yao (2020) Predicting lymph node metastasis using histopathological images based on multiple instance learning with deep graph convolution. In CVPR, Cited by: §1, §4.