Self-similarity Student for Partial Label Histopathology Image Segmentation

07/19/2020 ∙ by Hsien-Tzu Cheng, et al. ∙ 10

Delineation of cancerous regions in gigapixel whole slide images (WSIs) is a crucial diagnostic procedure in digital pathology. This process is time-consuming because of the large search space in gigapixel WSIs, which raises the chance of omission and misinterpretation at indistinct tumor lesions. To tackle this, an automated cancerous region segmentation method is needed. We frame this issue as a modeling problem with partial label WSIs, where some cancerous regions may be misclassified as benign and vice versa, producing patches with noisy labels. To learn from these patches, we propose Self-similarity Student, which combines the teacher-student model paradigm with similarity learning. Specifically, for each patch, we first sample its similar and dissimilar patches according to spatial distance. A teacher-student model is then introduced, featuring the exponential moving average on both the student model weights and the teacher predictions ensemble. While our student model takes patches, the teacher model takes all their corresponding similar and dissimilar patches to learn representations that are robust against noisy label patches. Following this similarity learning, our similarity ensemble merges similar patches' ensembled predictions as the pseudo-label of a given patch to counteract its noisy label. On the CAMELYON16 dataset, our method outperforms state-of-the-art noise-aware learning methods by 5% and the supervised-trained baseline by 10% in various degrees of noise. Moreover, our method is superior to the baseline on our TVGH TURP dataset with a 2% improvement, demonstrating its generalizability to more clinical histopathology segmentation tasks.




1 Introduction

Figure 1: Overview of our application scenario. The cancerous regions and patches are masked in cyan. a. Pathologists manually annotate only part of the cancer regions in WSIs because of time constraints or misinterpretation. b. A partial label WSI with some lesions omitted. c. Our model runs inference on patches. d. The patch-level outputs are combined into a WSI-level lesion prediction. This prediction can be used as a pseudo ground truth label for model training or for pathologists to review.

Digital pathology and deep learning (DL) techniques possess the potential to transform the clinical practice of pathology diagnosis. Conventionally, the gold standard for pathology diagnosis is the process of pathologists inspecting hematoxylin and eosin (H&E) stained tissue specimens on glass slides using optical microscopes. This is time-consuming and error-prone. With the rising adoption of digital pathology, the digitized slide, namely the whole slide image (WSI), mitigates the aforementioned issue. However, making a diagnosis via manual inspection of gigapixel WSIs is still labor intensive. DL algorithms, particularly those tailored to analyze WSIs [10], would empower digital pathology and subsequently automate pathology diagnosis.

Identification of cancer regions in gigapixel WSIs is considered the bottleneck of pathology diagnosis. Breaking this bottleneck, the CAMELYON16 challenge [2] serves as a milestone toward the automated segmentation of lymph node metastases in WSIs. With the detailed pixel-level annotation provided in the CAMELYON16 dataset, several fully-supervised DL algorithms have demonstrated segmentation performance on par with pathologists [2, 10]. Still, amassing large scale WSI datasets for other diseases with annotation comparable to CAMELYON16 requires a team of skilled pathologists and creates another bottleneck for the automation of digital pathology in clinical practice. To be scalable in clinics, DL algorithms that leverage semi-supervised and weakly-supervised learning frameworks may be effective.

To relieve the need for fine-grained annotations in WSIs, semi-supervised and weakly-supervised learning frameworks deserve careful consideration. Recent work has successfully applied the multiple instance learning (MIL) framework to detect cancer regions in WSIs with weak labels (patient-level diagnosis) [3, 26]. Nevertheless, research of this kind depends heavily on large datasets (more than 40,000 WSIs) to learn useful feature representations, which limits its applicability to the automation of digital pathology. The semi-supervised learning framework, on the other hand, demonstrates its potential to learn from datasets in which part of the data is completely or partially labeled [20, 3]. Such an application scenario resembles the common clinical context, where pathologists miss small cancer regions and fail to identify the distinct boundary between tumor cells and benign ones in WSIs. Therefore, our paper aims to solve this issue with a technique originating from semi-supervised learning.

Inherent noise in partial label WSIs may impede the learning ability of DL models. To alleviate the negative influence of noisy labels, the teacher-student learning paradigm [22], common in semi-supervised learning, tends to be helpful. Most semi-supervised works apply this paradigm to generate pseudo ground truths for unlabeled samples, but those pseudo ground truths can also be used to eliminate or counteract noisy samples [17]. Motivated by this paradigm, we propose an approach to tackle the inherent noise that arises when modeling with partial label WSIs. Fig. 1 illustrates the application scenario of our proposed method and the inherent noise our method attempts to mitigate.

To accelerate the automation of digital pathology and deal with the subsequent modeling issue from partial label WSI, our proposed teacher-student model features the following strategies and contributions:

  1. We propose Self-similarity Student, a teacher-student based model embedded with self-similarity learning and similarity predictions ensemble, to recognize cancer lesions from noisy label patches in partial label WSIs.

  2. Our self-similarity learning approach is motivated by the nature of tissue morphology, learning representation with a similarity loss that enforces nearby patches in a WSI to be closer in feature space.

  3. Our similarity ensemble approach generates pseudo label of a given patch in the partial label WSI. The similarity ensembled pseudo label is updated based on the consensus between predictions ensemble of the patch and its nearby patches, making the pseudo label more robust.

  4. The result on the CAMELYON16 dataset shows that our Self-similarity Student method achieves more than 10% performance boost compared to the supervised-trained baseline and more than 5% to the best previous art.

  5. The result of our method shows 2% improvement over the baseline on our TVGH TURP cancer dataset, demonstrating the generalizability to more clinical histopathology segmentation tasks.

2 Related Work

We discuss the relevant literature in two aspects, namely, recent research efforts on designing automatic analysis techniques for digital histopathology, and deep learning methods dealing with the semi-supervised scenario or noisy data, each of which is closely related to the problem setting of our method.

2.0.1 Digital histopathology

Given the extremely large size of gigapixel WSIs, automatic and effective machine learning techniques for histopathological image analysis are much needed in clinical practice [10]. Hou et al. [8] propose a CNN-based model for patch-wise WSI classification, followed by a count-based aggregation for WSI-level classification. Lee and Paeng [13] further adopt CNNs, comprising a patch-level detector and a slide-level classifier, for WSI metastasis detection and pN-stage classification of breast cancer. Takahama et al. [21] propose to explore the global information from semantic segmentation to enhance the local classification performance on WSIs. To achieve this goal, they establish a DNN model that combines a patch-based classification module and a whole slide segmentation module. Campanella et al. [3] develop a clinical-grade computational pathology framework that utilizes multiple instance learning (MIL) based deep learning techniques to carry out a thorough study over a cancer dataset of WSIs from patients. The MIL setting removes the need to annotate WSIs at the pixel level and also yields good classification performance. For automated segmentation of cancer regions in gigapixel WSIs, several works have addressed the problem with deep learning in either a fully supervised [23] or a weakly supervised setting [26]. By assuming a correlation between image features of cancer subtypes and image magnifications, Tokunaga et al. [23] propose the adaptive weighting multi-field-of-view CNN to carry out semantic segmentation for pathological WSIs. More recently, Xu et al. [26] introduce the CAMEL framework to address histopathology image segmentation in a weakly-supervised manner. Driven by MIL-based label enrichment, their method requires only image-level labels of training data, and progressively predicts instance-level labels and then pixel-level labels. For post-processing, [14, 15] combine patches at multiple levels of overlap for smoother and less noisy WSI lesion segmentation. To clearly demonstrate the robustness of our method in dealing with noisy labeled data, we conduct only basic post-processing, combining non-overlapping patches into WSIs, in all of our experiments.

2.0.2 Noisy label and semi-supervised learning.

The issue of noisy labeling in histopathological imaging is a major concern. In practice, unreliable labeling is unavoidable because manually annotating huge WSIs beyond the image level is inherently a daunting task. In addition, noisy labeling can also result from partially annotating WSIs, which is the case we aim to address in this work. Le et al. [12] develop a noisy label classification (NLC) method to predict regions of pancreatic cancer in WSIs. Their method leverages a small set of clean samples to yield a weighting scheme for alleviating the effect of noisy training data. The self-ensemble label filtering (SELF) introduced in [17] first uses the training dataset-wise running averages of the network predictions to filter noisy labels and then applies semi-supervised learning for model training. SELF is shown to outperform other noise-aware techniques across different datasets and architectures. On semi-supervised learning, the temporal ensembling framework introduced in [11] is a pioneering effort that proposes a self-ensembling mechanism, using ensemble predictions to improve the quality of predictions on unknown labels. Their method is demonstrated to achieve significant improvements on standard benchmark datasets such as CIFAR-10 and CIFAR-100. Xie et al. [25] develop a self-training method in which a teacher model is learned from the labeled ImageNet images and is used to annotate pseudo labels for extra unlabeled data. The augmented training data of labeled and pseudo labeled images are then used to learn the student model, where noise is injected to achieve generalization. The roles of teacher and student are then switched and the learning process is repeated to obtain the final model. Our work differs from these previous arts in the following: (a) pseudo-labels from the teacher-student model are used to counteract noisy labels instead of being assigned to unlabeled samples; (b) our pseudo-label of each patch is generated from the consensus of the predictions ensemble of its similar patches.

3 Methodology

In this section, we elaborate on our Self-similarity Student method for cancerous region segmentation in partial label WSIs. The preliminaries and notations are first given in Sec. 3.1. Next, we provide an overview of our proposed algorithm in Sec. 3.2. After that, the method to construct the similarity embedding is introduced in Sec. 3.3. Finally, we describe the details of Self-similarity Student for noisy label learning in Sec. 3.4.

3.1 Preliminaries and Notations

Given a complete label WSI dataset, we generate our partial label dataset for modeling and hold out a clean test set for final evaluation. In the partial label dataset, k cancer lesions per WSI are kept and the remaining ones are relabeled as non-cancerous regions; the kept lesions are either the top k largest or k random cancer lesions. Patches, represented as patch-label pairs, are then sampled from the partial label dataset. Since this dataset is partially annotated, the label of each patch may be noisy. For each patch of a given WSI, we further sample its similar and dissimilar patches according to a distance threshold, producing its similar and dissimilar bags of patches. How these bags are generated is illustrated in Fig. 3 and described in Sec. 3.3.

As to the teacher-student model, we consider a model defined by its weights and by an augmentation applied to its input. The teacher model and the student model are both written in this form. Additionally, we denote the feature embedding of a given patch from the teacher model and from the student model separately. Similarly, the feature embeddings of the similar and dissimilar patches of a given patch are taken from the teacher model.

Figure 2: Overview of our model training pipeline (see Algorithm 1). For each epoch, we first train our student-teacher model (colored in blue) with the training losses, using the noisy labels and the pseudo labels. The losses and the EMA of the student weights are calculated in every batch for the update. After the student and teacher are updated, we loop over all patches again (colored in green), conducting the EMA predictions ensemble and then the similarity ensemble to update our ensembled predictions and pseudo labels, which will be used in the next epoch.

3.2 Overview of Self-similarity Student

Fig. 2 illustrates an overview of our proposed approach. To deal with the noisy label patches, our Self-similarity Student method builds upon two core concepts: teacher-student learning and similarity embedding. We build our teacher and student models similarly to previous works [22, 11]. That is, the teacher model is the exponential moving average (EMA) of the student model, which makes the intermediate representations more stable and helps the student learn robust representations against noisy samples. Besides the EMA of model weights for the teacher model, we further leverage the ensemble of teacher model predictions for each patch to make the pseudo-label of each patch more consistent [1, 17]. Unlike [17], which uses the predictions ensemble to filter noisy samples, the predictions ensemble of each patch in our approach serves as the basis of its pseudo-label for complementary supervision, both counteracting noisy patches and making the most of the original labels from the partial label dataset.
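The EMA relation between teacher and student weights can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `ema_update`, the flat list of weights, and the momentum value 0.9 are all assumptions made for the example.

```python
def ema_update(teacher, student, momentum):
    """Return new teacher weights as momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy run: the student weights stay fixed and the teacher drifts toward them.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
for _ in range(3):  # three batches
    teacher = ema_update(teacher, student, momentum=0.9)  # momentum is a hypothetical value

print(teacher)
```

Because the teacher only moves a small step per batch, a single batch dominated by noisy labels changes it far less than it changes the student.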

In addition to the predictions ensemble, we introduce a similarity learning method to make the pseudo-label of each patch more robust. This method is motivated by the intrinsic property of tissue morphology in WSIs and by work on unsupervised visual representation learning [7]. Because nearby patches, which share similar morphological characteristics, tend to have the same labels, the generation of the pseudo-label of a given patch can refer to its neighboring ones, namely its similar patches. Here, the pseudo-label of a given patch is defined as a function of two inputs: the teacher predictions ensemble of the patch and that of its similar patches. In this paper, we implement this function as the average of the two. Following this intuition, we further apply a similarity loss, a variant of contrastive loss, to encourage the feature embeddings of nearby patches (a patch and its similar patches) to become closer and those of dissimilar ones (a patch and its dissimilar patches) to be distinct in feature space. Details of the construction of the similarity embedding with this loss are described in Sec. 3.3 and in the pseudo code provided in Algorithm 1.

Figure 3: Similarity sampling. To simulate the scenario of training our model with only partial label WSIs, each partial label WSI, with k lesions remaining, is sampled from the complete label dataset. From the partial label WSIs we sample noisy label patches; two example patches are illustrated. Using the distance threshold denoted by the orange grid, we can further sample the similar patches (orange boxes) and dissimilar patches (gray boxes) of each patch. The noisy label case occurs here because of the false benign (Benign*) label of one patch, which should be revised to Cancer by our model.

Algorithm 1 Overview of our Self-similarity Student algorithm
Require: noisy set of patches sampled from the partial label dataset
Require: EMA momentum
Require: distance threshold and max epoch
Require: model weights gradient optimizer, e.g. Adam
Initialize teacher model
Initialize student model
Initialize all pseudo labels
Initialize all ensembled predictions
for all patches do
     Similarity sampling (Eq. 1)
end for
for each epoch up to the max epoch do (main training loop)
     for all patches with their similar and dissimilar patches do
          a. Student forward
          b. Teacher forward
          c. Cross entropy loss (Eq. 3)
          d. Similarity embedding (Eq. 2)
          Update student's weights
          e. Update teacher's weights
     end for
     for all patches do (f. predictions ensemble)
          Update ensembled predictions per patch
     end for
     for all patches with their similar patches do (g. similarity ensemble as pseudo label)
          Consensus of nearby predictions
     end for
end for
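Steps f and g of Algorithm 1 can be illustrated with scalar cancer probabilities per patch. The sketch below rests on stated assumptions: the function names, the dictionary layout, and the momentum value are hypothetical, and "average" is read here as the mean of a patch's own ensembled prediction and the mean ensembled prediction of its similar patches, which is one plausible interpretation of the text.

```python
def update_prediction_ensemble(ensembled, teacher_pred, momentum):
    """Step f: EMA of the teacher's prediction for one patch across epochs."""
    return momentum * ensembled + (1.0 - momentum) * teacher_pred


def similarity_ensemble(ensembled, similar):
    """Step g: pseudo label = average of own and neighbouring ensembled predictions."""
    pseudo = {}
    for patch_id, own in ensembled.items():
        neighbours = [ensembled[n] for n in similar.get(patch_id, [])]
        if neighbours:
            neighbour_mean = sum(neighbours) / len(neighbours)
            pseudo[patch_id] = 0.5 * (own + neighbour_mean)
        else:
            pseudo[patch_id] = own  # no similar patches: keep own prediction
    return pseudo


# Patch "a" carries a false benign label, but its neighbours look cancerous.
ensembled = {"a": 0.2, "b": 0.9, "c": 0.8}
similar = {"a": ["b", "c"], "b": ["c"], "c": ["b"]}
pseudo = similarity_ensemble(ensembled, similar)
refreshed = update_prediction_ensemble(0.5, 1.0, momentum=0.9)
print(pseudo, refreshed)
```

In this toy case the pseudo label of patch "a" is pulled from 0.2 toward its neighbours, which is the counteracting effect the method relies on.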

3.3 Construction of Similarity Embedding

Inspired by pathologists' empirical knowledge and the work in [6], we formulate a similarity sampling strategy based on the distance between patches on a WSI, detailed in Fig. 3. For each patch, we sample multiple similar and dissimilar patches within the same WSI using this distance-based strategy (Eq. 1). In this rule, each patch is located by its coordinates on its original WSI. The patches of a WSI are thereby split into similar patches, whose Euclidean distance to the given patch is within the distance threshold, and dissimilar patches, which lie outside that threshold.
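The distance-based split can be sketched as follows. The function name `similarity_sampling`, the coordinate tuples, and the threshold value are illustrative assumptions; whether the boundary case counts as similar (`<=` rather than `<`) is also an assumption.

```python
import math


def similarity_sampling(coords, index, threshold):
    """Split all other patches of a WSI into similar / dissimilar by Euclidean distance."""
    anchor = coords[index]
    similar, dissimilar = [], []
    for j, other in enumerate(coords):
        if j == index:
            continue  # a patch is not its own neighbour
        if math.dist(anchor, other) <= threshold:
            similar.append(j)
        else:
            dissimilar.append(j)
    return similar, dissimilar


# Patch coordinates on one WSI, in arbitrary grid units.
coords = [(0, 0), (1, 0), (0, 2), (5, 5), (10, 0)]
similar, dissimilar = similarity_sampling(coords, index=0, threshold=2.0)
print(similar, dissimilar)
```

Since the split depends only on coordinates, it can be computed once before training, as Algorithm 1 does.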

To learn similarity embeddings in teacher-student models, a variant of InfoNCE [18] is used in this paper as the similarity loss (Eq. 2). It contrasts the student feature embedding of a given patch with the teacher feature embeddings of its similar patches and of its dissimilar patches, scaled by a temperature hyper-parameter [24].
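For orientation, one common InfoNCE-style form that matches this description is written out below. It is a sketch under assumed notation, not necessarily the exact equation of the paper: $z_p$ stands for the student embedding of patch $p$, $S(p)$ and $D(p)$ for the teacher embeddings of its similar and dissimilar patches, and $\tau$ for the temperature.

```latex
\mathcal{L}_{sim}(p) = -\log
  \frac{\sum_{k^{+} \in S(p)} \exp\left(z_p \cdot k^{+} / \tau\right)}
       {\sum_{k^{+} \in S(p)} \exp\left(z_p \cdot k^{+} / \tau\right)
        + \sum_{k^{-} \in D(p)} \exp\left(z_p \cdot k^{-} / \tau\right)}
```

Minimizing this quantity raises the dot products with similar patches relative to those with dissimilar patches.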

For time and memory efficiency when calculating the similarity loss in each iteration, we simplify equation (2) by randomly sampling one patch from the similar bag and one from the dissimilar bag of a given patch. The simplified loss can be regarded as the log loss of a two-class softmax-based classifier that attempts to classify the student embedding of the patch as matching its similar patch rather than its dissimilar one. The dot products between the student feature embedding and the two teacher feature embeddings can be viewed as local similarity measurements.
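The two-class reading of the simplified loss can be made concrete with plain Python. This is a sketch under assumptions: the function name, the toy 2-D embeddings, and the temperature of 0.5 are hypothetical, and the loss is taken to be the negative log softmax probability of the similar patch.

```python
import math


def simplified_similarity_loss(student, teacher_similar, teacher_dissimilar, temperature):
    """Log loss of a two-way softmax over (similar, dissimilar) dot-product logits."""
    pos = sum(a * b for a, b in zip(student, teacher_similar)) / temperature
    neg = sum(a * b for a, b in zip(student, teacher_dissimilar)) / temperature
    # Numerically stable log(exp(pos) + exp(neg)).
    top = max(pos, neg)
    log_denominator = top + math.log(math.exp(pos - top) + math.exp(neg - top))
    return log_denominator - pos  # equals -log softmax probability of the similar patch


aligned = simplified_similarity_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], temperature=0.5)
confused = simplified_similarity_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], temperature=0.5)
print(aligned, confused)
```

The loss is small when the student embedding points toward the similar teacher embedding and large when it points toward the dissimilar one.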

3.4 Self-similarity Student for Noisy Label Learning

In our proposed approach, the Self-similarity Student learns both to cluster locally similar patches and to classify each patch with supervision from the original labels and the pseudo-labels. Unlike semi-supervised learning methods that treat pseudo-labels from the teacher-student paradigm as ground truths for unlabeled samples, the pseudo-labels in our method are used to counteract noisy labels in the partial label dataset. With the similarity constraint described in Sec. 3.3, the pseudo-label of a given patch, which is derived from the consensus between its own predictions ensemble and that of its similar patches, is expected to be more stable and more robust to noise.

Overall, we apply two categories of losses to encourage our Self-similarity Student to learn from noisy patch-label pairs: a classification loss and the similarity loss. Specifically, the classification loss includes the cross entropy loss between student predictions and original labels, shown in equation (3), and the cross entropy loss between student predictions and pseudo-labels.


The overall loss function of our method combines the classification loss with the similarity loss.
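Given the terms named above, one plausible way to write the combined objective is shown below. The symbols are assumptions introduced for illustration: $\hat{y}$ is the student prediction, $y$ the original (possibly noisy) label, $\tilde{y}$ the pseudo-label, and the weights $\lambda_1, \lambda_2, \lambda_3$ are placeholders whose values are not stated here.

```latex
\mathcal{L} = \lambda_{1}\,\mathcal{L}_{ce}(\hat{y}, y)
            + \lambda_{2}\,\mathcal{L}_{ce}(\hat{y}, \tilde{y})
            + \lambda_{3}\,\mathcal{L}_{sim}
```

Setting all three weights to 1 gives the plain sum of the two cross entropy terms and the similarity loss.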


4 Experimental Result

In this section, we first describe the implementation and dataset settings in Sec. 4.1 and Sec. 4.2. Next, the performance comparisons between our method and other variants are reported in Sec. 4.3, Sec. 4.4, and Sec. 4.5. Lastly, Sec. 4.6 shows the performance on the TVGH TURP dataset, suggesting the generalization potential of our method to other clinical WSI data. For performance evaluation, we use the dice similarity coefficient (DSC) as our patch-level metric and the free-response receiver operating characteristic (FROC) curve as our lesion-level metric. The FROC curve is defined as the plot of sensitivity versus the average number of false positives per slide, and the final FROC score is the average sensitivity at 6 predefined false positive rates: 1/4, 1/2, 1, 2, 4, and 8 false positives per slide [2].
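Both metrics reduce to short formulas. The sketch below is illustrative only: the function names and toy inputs are assumptions, the DSC is computed on binary patch labels, and the FROC score simply averages sensitivities that are assumed to have been read off the curve already at the six CAMELYON16 operating points.

```python
# Operating points of the CAMELYON16 FROC score (false positives per slide).
FROC_FP_RATES = (0.25, 0.5, 1, 2, 4, 8)


def dice_coefficient(pred, target):
    """DSC = 2 * |pred AND target| / (|pred| + |target|) for binary labels."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def froc_score(sensitivity_at):
    """Average sensitivity over the six predefined false positive rates."""
    return sum(sensitivity_at[rate] for rate in FROC_FP_RATES) / len(FROC_FP_RATES)


dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
score = froc_score({0.25: 0.2, 0.5: 0.3, 1: 0.4, 2: 0.5, 4: 0.6, 8: 0.7})
print(dsc, score)
```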

4.1 Implementation Details

We implement all baseline methods and our variants on a DenseNet121 [9] backbone with its official ImageNet [5] pretrained weights from the PyTorch [19] model zoo. For optimizer settings, we use the Adam optimizer with learning rate 1e-4 and weight decay 4e-5, dividing the learning rate by 2 every 50 epochs. The default random seed is set to 2020, the dropout rate to 0.2, and the batch size to 48 in all our experiments for fair comparison. Following common stain augmentations for WSIs [13] and those used in robust mean-teacher based methods [4, 25], we choose several augmentations to train all our models, as augmentations have been shown to be crucial to the performance of mean-teacher based models. For similarity sampling, we choose the distance threshold according to the empirical cancer lesion diameter from the pathologists' view. The remaining hyper-parameters are held fixed in all our experiments. All our models are trained and tested using one Nvidia RTX 2080 Ti GPU.
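The stated learning rate schedule amounts to a step decay. A minimal sketch, assuming zero-indexed epochs and a hypothetical helper name `learning_rate`:

```python
def learning_rate(epoch, base_lr=1e-4, factor=0.5, step=50):
    """Step decay: start at base_lr and halve it every `step` epochs."""
    return base_lr * factor ** (epoch // step)


for epoch in (0, 49, 50, 120):
    print(epoch, learning_rate(epoch))
```

Under this reading, epochs 0 to 49 train at 1e-4, epochs 50 to 99 at 5e-5, and so on.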

Here, we describe implementation details of the state-of-the-art baselines compared in Sec. 4.3. We conduct similarity sampling for our method and the ablation study only. For the student network in Noisy Student [25], we scale up the default dropout 2.5 times and all default augmentation functions 1.5 times. The teacher network in Mean Teacher [22] is updated every batch by the EMA of the student's weights, while those in Predictions Ensemble [17] and Noisy Student are updated every epoch. We also replace the label filtering of [17] with our relabel mechanism to balance positive and negative samples. To the best of our knowledge, Mean Teacher, Predictions Ensemble, Noisy Student, and other relevant teacher-student methods have not yet been designed for or evaluated on noisy labeled histopathology WSIs.

4.2 Dataset

The CAMELYON16 dataset consists of 270 WSIs for training and 130 WSIs for testing. We randomly split the 270 training WSIs into 243 WSIs for training and 27 WSIs for validation. The 130 test WSIs are used for final evaluation. To simulate the clinical context of having unidentified cancer regions in WSIs, we create partial label datasets out of the training and validation splits by retaining either the top k largest or k random cancer regions in each WSI. As shown in Table 1, we choose k to be 1, 2, and 3. For example, the top-1 task means that only the largest cancer lesion in each WSI is kept and the rest is regarded as benign tissue. The random-k tasks are more challenging than the top-k tasks because more noise is injected.

To further sample patches from WSIs in the partial label datasets, we run Otsu thresholding [27] and set a foreground-background ratio to extract foreground tissues. Following [12], cancerous patches are defined as those whose area overlaps the retained cancerous regions by more than a set fraction, and benign patches are those lying entirely outside those cancerous regions. The resulting patch-label pairs are sampled at a fixed patch resolution from WSIs at a fixed magnification (0.972 pixel spacing), which determines the receptive field covered by each patch.
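The patch labeling rule can be sketched as a small function. The name `label_patch` and the threshold of 0.5 are hypothetical, chosen only for illustration; the handling of patches that overlap a lesion slightly (returned as `None`, i.e. not sampled) is likewise an assumption implied by, but not stated in, the rule.

```python
def label_patch(cancer_fraction, min_cancer_fraction):
    """Label a patch from the fraction of its area covered by retained cancer regions."""
    if cancer_fraction > min_cancer_fraction:
        return "cancer"
    if cancer_fraction == 0.0:
        return "benign"  # fully outside every retained cancerous region
    return None  # partial overlap below the threshold: assumed to be skipped


for fraction in (0.9, 0.0, 0.1):
    print(fraction, label_patch(fraction, min_cancer_fraction=0.5))  # 0.5 is a placeholder
```

Note that in a partial label WSI a patch from an unannotated lesion has zero overlap with the retained regions, so this rule labels it benign; that is exactly how the noisy labels arise.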

Label | Complete | Top-k largest: k=1 | k=2 | k=3 | Random k: k=1 | k=2 | k=3
Benign | 588286 | 597500 | 592454 | 590482 | 606342 | 604331 | 600122
Cancer | 21730 | 12516 | 17562 | 19534 | 2819 | 5188 | 8358
Ratio of Correct Cancer Patches | - | 57.60% | 80.82% | 89.89% | 12.97% | 23.87% | 38.46%
Ratio of Noisiness | - | 42.40% | 19.18% | 10.11% | 87.03% | 76.13% | 61.54%
Number of Cancer Lesions | - | 91 | 150 | 193 | 91 | 150 | 193
Average Size of Lesions () | - | 6.7814 | 5.7527 | 4.9959 | 2.0587 | 1.8747 | 2.5277
Table 1: Number of patches in the training set and statistics for the top-k and random-k tasks. “Complete” denotes the task using the original training set with clean ground truth. We sample from the partial label datasets for the top-k and random-k tasks. The ratio of noisiness declines as more lesions per WSI are correctly labeled.
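The two ratio rows of Table 1 follow directly from the cancer patch counts: the ratio of correct cancer patches is the kept count divided by the complete count, and the ratio of noisiness is its complement. The short check below reproduces them; the dictionary keys are shorthand introduced here.

```python
complete_cancer = 21730  # cancer patches under the complete labels
kept_cancer = {
    "top-1": 12516, "top-2": 17562, "top-3": 19534,
    "random-1": 2819, "random-2": 5188, "random-3": 8358,
}

correct_ratio = {}
noisiness = {}
for task, count in kept_cancer.items():
    correct = 100.0 * count / complete_cancer
    correct_ratio[task] = round(correct, 2)
    noisiness[task] = round(100.0 - correct, 2)

print(correct_ratio)
print(noisiness)
```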

4.3 Comparison with Previous Arts

To benchmark our Self-similarity Student, we implement several state-of-the-art methods, including Mean Teacher [22], Noisy Student [25], and Predictions Ensemble [17]. As shown in Table 2, our method outperforms previous arts on both the top-1 and random-1 tasks. Our method achieves 93.76 DSC and 36.90 FROC on the top-1 task, and 85.56 DSC and 31.88 FROC on the more challenging random-1 task. In DSC, this is a gain of more than 10% over the supervised-trained baseline and more than 5% over the best previous art.

Moreover, Fig. 5 shows the qualitative comparison between our method and the baselines. Inferred from these results, our Self-similarity Student could both correctly identify more cancer regions and cancer cells (patches) with only a few false positives. For the implementation details of baseline methods, see Supp. Sec. 2. For the illustration of effectiveness of our self-similarity embedding method, see Supp. Sec. 3. For more qualitative results, see Supp. Sec. 4.

Method | Top-1 DSC | Top-1 FROC | Random-1 DSC | Random-1 FROC
Baseline DenseNet121 [9] | 83.08 | 29.99 | 63.08 | 28.09
Mean Teacher [22] | 86.83 | 34.13 | 74.45 | 28.45
Noisy Student [25] | 84.90 | 34.21 | 77.46 | 30.06
Prediction Ensemble [17] | 88.60 | 33.41 | 75.59 | 30.20
Self-sim Student (Ours) | 93.76 | 36.90 | 85.56 | 31.88
Table 2: Comparison with Previous Arts. All models are trained with a single cancer lesion per WSI and evaluated on the clean testing set. Our method outperforms other techniques in the top-1 task (only the largest cancer region is annotated per WSI) and in the more challenging random-1 task (only one random cancer region is annotated per WSI).
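The headline margins can be recomputed from the DSC columns of Table 2. The snippet below is plain arithmetic on the reported numbers; the margins are absolute differences in DSC points, and the variable names are chosen here for illustration.

```python
# (top-1 DSC, random-1 DSC) as reported in Table 2
dsc = {
    "Baseline": (83.08, 63.08),
    "Mean Teacher": (86.83, 74.45),
    "Noisy Student": (84.90, 77.46),
    "Prediction Ensemble": (88.60, 75.59),
    "Ours": (93.76, 85.56),
}

prior = [name for name in dsc if name not in ("Baseline", "Ours")]
gain_over_baseline = []
gain_over_best_prior = []
for task in (0, 1):
    best_prior = max(dsc[name][task] for name in prior)
    gain_over_baseline.append(round(dsc["Ours"][task] - dsc["Baseline"][task], 2))
    gain_over_best_prior.append(round(dsc["Ours"][task] - best_prior, 2))

print(gain_over_baseline, gain_over_best_prior)
```

The best prior method differs by task: Prediction Ensemble on the top-1 task and Noisy Student on the random-1 task.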
Figure 4: FROC result. Our method, colored in red, yields the highest FROC score compared to baselines. Moreover, at the relatively low false positive rate, our method outperforms all the baselines with higher sensitivity.
Figure 5: Qualitative comparison with previous arts. Red boxes indicate zoomed-in region-of-interest. The regions in green color indicate true positives and yellow indicate false positives. The cyan color denotes ground truth (false negatives if no prediction overlapped). The top row demonstrates our method with lower false positives, especially compared to Mean Teacher. The lower two rows show that our method achieves higher sensitivity compared to all other baselines.

4.4 Comparison with Various Label Ratio

In this section, we compare the performance of models trained on the partially labeled training sets described in Sec. 4.2. The results in Table 3 indicate that our method achieves comparable or better performance than the DenseNet121 baseline in all experimental settings, with higher DSC in every partial label setting. These results further suggest that our Self-similarity Student can still learn to discriminate most of the cancer regions from benign parts even when some cancer regions are left unidentified in the training set.

Method | Metric | Complete | Top-1 | Top-2 | Top-3 | Random-1 | Random-2 | Random-3
Baseline DenseNet121 [9] | DSC | 92.68 | 83.08 | 87.40 | 90.41 | 63.08 | 70.02 | 78.39
Baseline DenseNet121 [9] | FROC | 41.12 | 29.99 | 38.37 | 40.18 | 28.09 | 35.90 | 33.22
Self-sim Student (Ours) | DSC | 90.49 | 93.76 | 93.98 | 91.29 | 85.56 | 87.57 | 90.20
Self-sim Student (Ours) | FROC | 39.52 | 36.90 | 38.94 | 39.16 | 31.88 | 36.54 | 35.08
Table 3: Performance of models trained with various noisy label ratios. Complete denotes models trained on the original ground truths. All results are from the clean testing set, demonstrating that our model is capable of learning from limited noisy ground truth while still performing competently.

4.5 Ablation Study

To demonstrate the contribution of each part of our Self-similarity Student method, we conduct a set of experiments on the random-1 task. Baseline indicates the DenseNet121 network trained with the cross entropy loss, whereas Sim-embedding shows the performance of Baseline plus the similarity loss. Furthermore, Pred-ensemble uses the teacher predictions ensemble of each patch as its pseudo label, while Sim-ensemble derives pseudo-labels by averaging the predictions ensemble from patches and their corresponding similar pairs.

As shown in Table 4, each component of Self-similarity Student contributes to its overall performance. Most importantly, this supports the effectiveness of our method, which makes pseudo-labels more robust against noisy labels by using the predictions ensemble from similar pairs.

Ablation Study | DSC | FROC
Baseline | 63.08 | 28.09
Sim-embedding | 68.42 | 28.24
Pred-ensemble | 75.59 | 30.20
Sim-ensemble | 85.56 | 31.88
Table 4: Ablation Study. Abbreviations: Sim-ensemble stands for similarity ensemble; Sim-embedding stands for similarity embedding learned with the similarity loss; Pred-ensemble stands for predictions ensemble.
Figure 6: Qualitative result on the testing set of TVGH TURP dataset. The color code is the same as Fig. 5. Our Self-similarity Student is able to predict cancer regions more precisely than the baseline with better grouped patterns.

4.6 Generalizability of our method

To evaluate the potential generalizability of our method, we conduct experiments on transurethral resection of prostate (TURP) data from the Department of Pathology, Taipei Veterans General Hospital (TVGH). The TVGH TURP dataset consists of 71 WSIs with annotated cancerous lesions, defined by a Gleason score threshold. The training set consists of 58 WSIs, with the remaining 13 WSIs held out. The WSIs are scanned at 0.25 pixel spacing. We follow the same patch extraction strategy as in the CAMELYON16 dataset, producing a total of 459273 patches from zoomed TURP WSIs. Our method achieves 77.24 DSC on the 13-WSI testing set, which is better than the supervised-trained baseline with 75.36 DSC. The qualitative result is shown in Fig. 6 and Supp. Sec. 4.

5 Conclusion

Computer vision techniques are key to accelerating the automation of digital pathology in clinical practice, particularly the identification of cancer regions in WSIs. In this research, we propose a teacher-student framework, Self-similarity Student, to address partial label WSIs, which relieves the burden on pathologists. The results show that our method outperforms previous arts, at least in terms of DSC, suggesting that Self-similarity Student learns representations that are more robust against noisy labels. These experiments support the advantage of the similarity ensemble for modeling with partial label WSIs. More importantly, our approach generalizes to the TURP dataset, where it identifies cancer regions out of benign ones with fewer false positives. In sum, Self-similarity Student can be a potent method to tackle the problems that originate from partial label WSI datasets. In future work, we aim to pursue the potential of Self-similarity Student for semi-supervised learning on small-sized WSI datasets, contributing to the automation of cancer region delineation.

5.0.1 Acknowledgment.

We thank Yi-Chin Tu, the chairman of Taiwan AI Labs, for the generous support of this project. We also thank the intensive assistance made by the Department of Pathology and Laboratory Medicine, Taipei Veterans General Hospital. Lastly, we appreciate Tsun-Hsiao Wang at National Yang-Ming University for his contribution on delineating cancerous regions in WSIs of TVGH TURP dataset.


  • [1] Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in neural information processing systems. pp. 3365–3373 (2014)
  • [2] Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318(22), 2199–2210 (2017)
  • [3] Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Silva, V.W.K., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine 25(8), 1301–1309 (2019)
  • [4] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719 (2019)
  • [5] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [6] Gildenblat, J., Klaiman, E.: Self-supervised similarity learning for digital pathology. arXiv preprint arXiv:1905.08139 (2019)
  • [7] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
  • [8] Hou, L., Samaras, D., Kurc, T.M., Gao, Y., Davis, J.E., Saltz, J.H.: Patch-based convolutional neural network for whole slide tissue image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2424–2433 (2016)
  • [9] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
  • [10] Komura, D., Ishikawa, S.: Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal 16, 34–42 (2018)
  • [11] Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
  • [12] Le, H., Samaras, D., Kurc, T., Gupta, R., Shroyer, K., Saltz, J.: Pancreatic cancer detection in whole slide images using noisy label annotations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 541–549. Springer (2019)
  • [13] Lee, B., Paeng, K.: A robust and effective approach towards accurate metastasis detection and pn-stage classification in breast cancer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 841–850. Springer (2018)
  • [14] Lin, H., Chen, H., Dou, Q., Wang, L., Qin, J., Heng, P.A.: Scannet: A fast and dense scanning framework for metastastic breast cancer detection from whole-slide image. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 539–546. IEEE (2018)
  • [15] Lin, H., Chen, H., Graham, S., Dou, Q., Rajpoot, N., Heng, P.A.: Fast scannet: Fast and dense analysis of multi-gigapixel whole-slide images for cancer metastasis detection. IEEE transactions on medical imaging 38(8), 1948–1958 (2019)
  • [16] McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints (Feb 2018)
  • [17] Nguyen, D.T., Mummadi, C.K., Ngo, T.P.N., Nguyen, T.H.P., Beggel, L., Brox, T.: Self: Learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842 (2019)
  • [18] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  • [19] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8024–8035 (2019)
  • [20] Peikari, M., Salama, S., Nofech-Mozes, S., Martel, A.L.: A cluster-then-label semi-supervised learning approach for pathology image classification. Scientific reports 8(1), 1–13 (2018)
  • [21] Takahama, S., Kurose, Y., Mukuta, Y., Abe, H., Fukayama, M., Yoshizawa, A., Kitagawa, M., Harada, T.: Multi-stage pathological image classification using semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10702–10711 (2019)
  • [22] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in neural information processing systems. pp. 1195–1204 (2017)
  • [23] Tokunaga, H., Teramoto, Y., Yoshizawa, A., Bise, R.: Adaptive weighting multi-field-of-view cnn for semantic segmentation in pathology. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12597–12606 (2019)
  • [24] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3733–3742 (2018)
  • [25] Xie, Q., Hovy, E., Luong, M.T., Le, Q.V.: Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252 (2019)
  • [26] Xu, G., Song, Z., Sun, Z., Ku, C., Yang, Z., Liu, C., Wang, S., Ma, J., Xu, W.: Camel: A weakly supervised learning framework for histopathology image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10682–10691 (2019)
  • [27] Zhang, J., Hu, J.: Image segmentation based on 2d otsu method with histogram analysis. In: 2008 International Conference on Computer Science and Software Engineering. vol. 6, pp. 105–108. IEEE (2008)

6 Augmentations for Model Training

We apply augmentations when training the teacher-student models [4, 13, 25] on partial label WSIs. The chosen augmentations are listed in Table 5. For network dropout, we set the dropout of all teacher models to 0.0, whereas the dropout of the student models follows Table 6. We use the Noisy augmentation hyperparameters to train the student model in Noisy Student, and the Normal augmentation hyperparameters to train the other teacher-student models.

Group         Augmentation              Normal            Noisy
Stain/Color   Contrast (delta)          (0.75, 1.25)      (0.5, 1.875)
              Brightness (delta)        (-0.2, 0.2)       (-0.3, 0.3)
              Hue (delta)               (-0.05, 0.05)     (-0.075, 0.075)
              Saturation (delta)        (0.8, 1.2)        (0.533, 1.8)
Deformation   Flip (probability)
              Resize (scale)            (0.9, 1.1)        (0.6, 1.35)
              Crop (resolution)         (224, 224)
              Rotation (degree)         (-180, 180)
              Translation (delta)       (-0.05, 0.05)     (-0.075, 0.075)
Network       Dropout (ratio)           0.2               0.5
Table 5: Augmentations.
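As an illustration only, the ranges in Table 5 can be encoded as a configuration from which one set of augmentation parameters is drawn per patch. The sketch below is not the authors' code: the dictionary keys and function name are hypothetical, uniform sampling within each range is an assumption, the single values listed for crop and rotation are assumed to apply to both presets, and the dropout row is assumed to give the student dropout. Flip is left out because its probability is not given in the table.

```python
import random

# Ranges copied from Table 5; key names are hypothetical.
# Flip is omitted because its probability is not listed in the table.
AUGMENTATION_RANGES = {
    "normal": {
        "contrast": (0.75, 1.25),
        "brightness": (-0.2, 0.2),
        "hue": (-0.05, 0.05),
        "saturation": (0.8, 1.2),
        "resize_scale": (0.9, 1.1),
        "translation": (-0.05, 0.05),
    },
    "noisy": {
        "contrast": (0.5, 1.875),
        "brightness": (-0.3, 0.3),
        "hue": (-0.075, 0.075),
        "saturation": (0.533, 1.8),
        "resize_scale": (0.6, 1.35),
        "translation": (-0.075, 0.075),
    },
}

# Assumption: the single crop and rotation values apply to both presets.
CROP_RESOLUTION = (224, 224)
ROTATION_DEGREES = (-180, 180)

# Assumption: the dropout row gives the student dropout; teachers use 0.0.
STUDENT_DROPOUT = {"normal": 0.2, "noisy": 0.5}
TEACHER_DROPOUT = 0.0


def sample_augmentation(preset, rng):
    """Draw one value per augmentation, uniformly within its range (assumed)."""
    ranges = AUGMENTATION_RANGES[preset]
    params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
    params["rotation_degrees"] = rng.uniform(*ROTATION_DEGREES)
    params["crop_resolution"] = CROP_RESOLUTION
    params["student_dropout"] = STUDENT_DROPOUT[preset]
    return params
```

For example, `sample_augmentation("noisy", random.Random(0))` returns a dictionary of parameters for one patch under the Noisy preset; an image library would then be needed to apply them.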
Algorithm 2 Teacher-student Model
Require: noisy set of patches sampled from the WSIs
Require: EMA momentum and weight
Require: distance threshold and max epoch
Require: model weights gradient optimizer, e.g. Adam
Initialize teacher model
Initialize student model
Initialize all pseudo labels
for each epoch up to the max epoch do          ▷ Main training loop
     for all batches of patches do
          Student forward
          Teacher forward
          Cross entropy loss
          Consistency loss (Eq. 5)
          Update student’s weights
          Update teacher’s weights per batch
     end for
     Update teacher’s weights per epoch
     for all patches do                        ▷ Predictions ensemble
          Update pseudo label
     end for
end for
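Method          Per-batch EMA momentum   Per-epoch EMA momentum   Predictions ensemble momentum   Consistency loss weight
Mean Teacher    0.999                    1                        0                               1
Noisy Student   1                        0                        0                               0
Pred-ensemble   0.999                    1                        0.9                             0
Table 6: Hyperparameters of the prior-art baselines.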

7 Implementation Details of Baseline Methods

In this section, we describe the implementation of the prior-art baselines in more detail. We implement Mean Teacher [22], Noisy Student [25], and Pred-ensemble [17] for comparison with our Self-similarity Student method. The teacher-student model training pipeline is illustrated in Algorithm 2. In each epoch, both the teacher model and the student model run inference on the noisy label patches. The supervised loss between the (pseudo) label and the student model prediction is then calculated as a cross entropy loss. Following [22], the consistency loss (Eq. 5) constrains the teacher and student models to make similar predictions; it is computed on the feature maps from the fully-connected layers of the two models.

After the loss is backpropagated and the optimizer updates the student, the model weights ensemble and the predictions ensemble are applied according to the settings of each method. Table 6 lists the hyperparameter settings of the baselines. For the per-batch and per-epoch EMA momenta, 1 denotes no weight update and 0 denotes complete replacement of the teacher model's weights by the student model's. For the predictions ensemble momentum, 1 denotes no label update and 0 denotes complete replacement of the pseudo label by the teacher model prediction. As indicated in Table 6, following their original settings, we update Mean Teacher and Pred-ensemble every batch and Noisy Student every epoch. We set the predictions ensemble momentum to 0.9 for Pred-ensemble and apply the consistency loss only to Mean Teacher. Note that we use a pseudo label mechanism, rather than eliminating patches through noisy label filtering, to stabilize training without over-filtering.
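The interplay of the momenta described above can be made concrete with a small, self-contained sketch. It is not the authors' implementation: a one-feature logistic model stands in for the network, each patch is treated as its own batch, the consistency term is a squared error on predictions rather than on feature maps, and the mapping of the Table 6 columns onto the hypothetical `Preset` fields is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Preset:
    batch_momentum: float      # per-batch EMA momentum for teacher weights
    epoch_momentum: float      # per-epoch EMA momentum for teacher weights
    pred_momentum: float       # predictions ensemble momentum for pseudo labels
    consistency_weight: float  # weight on the consistency loss


# Values from Table 6; the column-to-field mapping is an assumption.
PRESETS = {
    "mean_teacher": Preset(0.999, 1.0, 0.0, 1.0),
    "noisy_student": Preset(1.0, 0.0, 0.0, 0.0),
    "pred_ensemble": Preset(0.999, 1.0, 0.9, 0.0),
}


def ema(old, new, momentum):
    """Momentum 1 keeps the old value; momentum 0 replaces it with the new one."""
    return momentum * old + (1.0 - momentum) * new


def predict(weights, x):
    """Toy one-feature logistic model standing in for the network."""
    w, b = weights
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train(preset, xs, noisy_labels, epochs=5, lr=0.5):
    student = [0.0, 0.0]
    teacher = list(student)
    pseudo = [float(y) for y in noisy_labels]  # initialize pseudo labels
    for _ in range(epochs):
        for i, x in enumerate(xs):  # one patch per "batch" for brevity
            p_s = predict(student, x)  # student forward
            p_t = predict(teacher, x)  # teacher forward (treated as constant)
            # Gradient w.r.t. the student logit of
            # cross entropy + weight * squared-error consistency.
            grad = (p_s - pseudo[i]) + (
                preset.consistency_weight * 2.0 * (p_s - p_t) * p_s * (1.0 - p_s)
            )
            student[0] -= lr * grad * x
            student[1] -= lr * grad
            # Per-batch weights ensemble.
            teacher = [ema(t, s, preset.batch_momentum) for t, s in zip(teacher, student)]
        # Per-epoch weights ensemble.
        teacher = [ema(t, s, preset.epoch_momentum) for t, s in zip(teacher, student)]
        # Predictions ensemble: blend old pseudo labels with teacher predictions.
        pseudo = [ema(y, predict(teacher, x), preset.pred_momentum) for x, y in zip(xs, pseudo)]
    return student, teacher, pseudo
```

Under these assumptions, the Noisy Student preset leaves the teacher untouched within an epoch and copies the student into it at the end of each epoch, whereas the other two presets move the teacher slightly after every batch.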

Figure 7: Qualitative result of the distribution of feature embeddings with UMAP. The cyan colored points denote benign patches and light-green colored points denote cancerous patches. Our method is able to learn a more compact feature representation for both benign and cancerous patches.

8 Feature Embedding of Self-similarity Student

To illustrate the effectiveness of similarity learning, we use the UMAP [16] dimension reduction algorithm to visualize the feature embeddings derived from our method and the baselines. Specifically, we sample 30,000 patches from our testing set and apply UMAP to their 1024-dimensional feature embeddings. As shown in Fig. 7, with similarity learning, Self-similarity Student learns a more compact distribution of feature embeddings, which supports its advantage over the baseline in identifying cancerous patches.
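A visualization of this kind could be produced roughly as follows. This is a minimal sketch under stated assumptions rather than the procedure actually used: the function names are hypothetical, the UMAP settings other than the 2-D output are library defaults, and the projection step relies on the third-party umap-learn package, which is imported lazily so that the sampling helper runs without it.

```python
import random


def sample_indices(n_total, n_sample, seed=0):
    """Pick up to n_sample distinct patch indices, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_sample, n_total)))


def project_embeddings(features, seed=0):
    """Reduce an (n_patches, 1024) feature array to 2-D with UMAP.

    Requires the umap-learn package; hyperparameters are library defaults,
    not values reported in the paper.
    """
    import umap  # provided by umap-learn

    reducer = umap.UMAP(n_components=2, random_state=seed)
    return reducer.fit_transform(features)
```

With a hypothetical NumPy array `test_features` holding one embedding per test patch, `project_embeddings(test_features[sample_indices(len(test_features), 30000)])` would yield the 2-D coordinates to scatter-plot and color by label.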

9 Additional Qualitative Result

More qualitative results on the TVGH TURP dataset are illustrated in Fig. 8, and results on the CAMELYON16 dataset are shown in Fig. 9 and Fig. 10. Our Self-similarity Student consistently outperforms the prior methods across multiple morphology patterns on both the TVGH TURP dataset and the CAMELYON16 dataset.

Figure 8: Additional qualitative result on TVGH TURP dataset. The regions in green color indicate true positives and yellow indicate false positives. The cyan color denotes ground truth (false negatives if no prediction overlapped). Our Self-similarity Student is able to predict cancer regions more precisely than the baseline with better grouped patterns.
Figure 9: Additional qualitative result on CAMELYON16 dataset. The regions in green color indicate true positives and yellow indicate false positives. The cyan color denotes ground truth (false negatives if no prediction overlapped).
Figure 10: Additional qualitative result on CAMELYON16 dataset. The regions in green color indicate true positives and yellow indicate false positives. The cyan color denotes ground truth (false negatives if no prediction overlapped).