Digital pathology and deep learning (DL) techniques possess the potential to transform the clinical practice of pathology diagnosis. Conventionally, the gold standard for pathology diagnosis is for pathologists to inspect hematoxylin and eosin (H&E) stained tissue specimens on glass slides under optical microscopes, a process that is time-consuming and error-prone. With the rising adoption of digital pathology, the digitized slide, namely the whole slide image (WSI), mitigates this issue. However, making a diagnosis via manual inspection of a gigapixel WSI is still labor intensive. DL algorithms tailored to analyze WSIs would empower digital pathology and subsequently automate pathology diagnosis.
Identification of cancer regions in gigapixel WSIs is considered the bottleneck of pathology diagnosis. Addressing this bottleneck, the CAMELYON16 challenge serves as a milestone toward the automated segmentation of lymph node metastases in WSIs. With the detailed pixel-level annotation provided in the CAMELYON16 dataset, several fully supervised DL algorithms have demonstrated segmentation performance on par with pathologists [2, 10]. Still, amassing large-scale WSI datasets for other diseases with annotation comparable to CAMELYON16 requires a team of skilled pathologists and creates another bottleneck for the automation of digital pathology in clinical practice. To be scalable in clinics, DL algorithms that leverage semi-supervised and weakly supervised learning frameworks may be effective.
To relieve the need for fine-grained annotations in WSIs, semi-supervised and weakly supervised learning frameworks deserve careful consideration. Recent work has successfully applied the multiple instance learning (MIL) framework to detect cancer regions in WSIs with weak labels (patient-level diagnoses) [3, 26]. Nevertheless, research of this kind heavily depends on large datasets (more than 40,000 WSIs) to learn useful feature representations, which limits its applicability to the automation of digital pathology. The semi-supervised learning framework, on the other hand, demonstrates the potential to learn from datasets of which only a part is completely or partially labeled [20, 3]. Such an application scenario resembles the common clinical context, where pathologists miss small cancer regions and fail to identify the distinct boundary between tumor cells and benign ones in a WSI. Therefore, our paper aims to solve this issue with a technique originating from semi-supervised learning.
Inherent noise in partial label WSIs may impede the learning ability of DL models. To alleviate the negative influence of noisy labels, the teacher-student learning paradigm, common in semi-supervised learning, tends to be helpful. While most semi-supervised works apply this paradigm to generate pseudo ground truths for unlabeled samples, those pseudo ground truths can also be used to eliminate or counteract noisy samples. Motivated by this paradigm, we further propose an approach to tackle the inherent noise originating from the modeling process with partial label WSIs. Fig. 1 illustrates the application scenario of our proposed method and the inherent noise our method attempts to mitigate.
To accelerate the automation of digital pathology and deal with the subsequent modeling issue from partial label WSI, our proposed teacher-student model features the following strategies and contributions:
- We propose Self-similarity Student, a teacher-student based model embedded with self-similarity learning and similarity predictions ensemble, to recognize cancer lesions from noisy label patches in partial label WSIs.
- Our self-similarity learning approach is motivated by the nature of tissue morphology, learning representations with a similarity loss that enforces nearby patches in a WSI to be closer in feature space.
- Our similarity ensemble approach generates the pseudo label of a given patch in the partial label WSI. The similarity-ensembled pseudo label is updated based on the consensus between the predictions ensemble of the patch and those of its nearby patches, making the pseudo label more robust.
- The result on the CAMELYON16 dataset shows that our Self-similarity Student method achieves more than a 10% performance boost over the supervised-trained baseline and more than 5% over the best previous method.
- The result of our method shows a 2% improvement over the baseline on our TVGH TURP cancer dataset, demonstrating the generalizability to more clinical histopathology segmentation tasks.
2 Related Work
We discuss the relevant literature in two aspects, namely, recent research efforts on designing automatic analysis techniques for digital histopathology, and deep learning methods dealing with the semi-supervised scenario or noisy data, each of which is closely related to the problem setting of our method.
2.0.1 Digital histopathology
Concerning the extremely large size of WSIs (on the order of gigapixels), designing automatic and effective machine learning techniques for histopathological image analysis is much needed in clinical practice. Hou et al. propose a CNN-based model for patch-wise WSI classification, followed by a count-based aggregation for WSI-level classification. Lee and Paeng
further adopt CNNs, comprising a patch-level detector and a slide-level classifier, for WSI metastasis detection and pN-stage classification of breast cancer. Takahama et al. propose to explore the global information from semantic segmentation to enhance local classification performance on WSIs. To achieve this goal, they establish a DNN model that combines a patch-based classification module and a whole slide segmentation module. Campanella et al. develop a clinical-grade computational pathology framework that utilizes multiple instance learning (MIL) based deep learning techniques to carry out a thorough study over a large cancer dataset of WSIs. The MIL setting allows the convenience of skipping pixel-level annotation of WSIs and still yields good classification performance. For automated segmentation of cancer regions in gigapixel WSIs, several works have addressed this challenging problem with deep learning in either a fully supervised or weakly supervised setting. By assuming a correlation between image features of cancer subtypes and image magnifications, Tokunaga et al. propose the adaptive weighting multi-field-of-view CNN to carry out semantic segmentation of pathological WSIs. More recently, Xu et al. introduce the CAMEL framework to address histopathology image segmentation in a weakly supervised manner. Driven by MIL-based label enrichment, their method requires only image-level labels of training data, and progressively predicts instance-level labels and then pixel-level labels. For the post-processing that combines patches, [14, 15] propose combining patches at multiple levels of overlap for smoother and less noisy WSI lesion segmentation. To clearly demonstrate the robustness of our method in dealing with noisily labeled data, we only conduct basic post-processing, stitching non-overlapping patches into WSIs, in all of our experiments.
2.0.2 Noisy label and semi-supervised learning.
The issue of noisy labeling in histopathological imaging is a major concern. In practice, unreliable labeling is unavoidable in that manually annotating huge WSIs beyond the image level is inherently a daunting task. In addition, noisy labeling can also result from partially annotating WSIs, as we aim to address in this work. Le et al. develop a noisy label classification (NLC) method to predict regions of pancreatic cancer in WSIs. Their method leverages a small set of clean samples to yield a weighting scheme for alleviating the effect of noisy training data. The self-ensemble label filtering (SELF) introduced in first uses dataset-wise running averages of the network predictions to filter noisy labels and then applies semi-supervised learning to achieve model training. SELF is shown to outperform other noise-aware techniques across different datasets and architectures. On semi-supervised learning, the temporal ensembling framework introduced in is a pioneering effort proposing a self-ensembling mechanism that uses ensemble predictions to improve the quality of predictions on unknown labels. Their method achieves significant improvements on standard benchmark datasets such as CIFAR-10 and CIFAR-100. Xie et al. develop a self-training method in which a teacher model is learned from labeled ImageNet images and used to assign pseudo labels to extra unlabeled data. The augmented training set of labeled and pseudo-labeled images is then used to train the student model, into which noise is injected to improve generalization. The roles of teacher and student are then switched, and the learning process is carried out repeatedly to obtain the final model. Our work differs from these previous works in the following: (a) pseudo-labels from the teacher-student model are used to counteract noisy labels instead of being assigned to unlabeled samples; (b) the pseudo-label of each patch is generated from the consensus of the predictions ensembles of its similar patches.
3 Method
In this section, we elaborate our Self-similarity Student method for cancerous region segmentation in partial label WSIs. The preliminaries and notations are first presented in Sec. 3.1. Next, we provide an overview of our proposed algorithm in Sec. 3.2. After that, the method to construct the similarity embedding is introduced in Sec. 3.3. At last, we describe the details of Self-similarity Student for noisy label learning in Sec. 3.4.
3.1 Preliminaries and Notations
Given a complete label WSI dataset , we generate our partial label dataset for modeling and hold out a clean test set for final evaluation. In the partial label dataset , k cancer lesions ( or ) per WSI are kept and the remaining ones are relabeled as non-cancerous regions. The and stand for the top-k largest and k random cancer lesions, respectively. Moreover, patches , representing patch-label pairs , are sampled from . Since is partially annotated, the label of each patch may be noisy. For each of a given WSI, we further sample its similar and dissimilar patches according to distance , producing its similar and dissimilar bags of patches. The details of how we generate and are illustrated in Fig. 3 and described in Sec. 3.3.
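As a concrete illustration of the partial-label setup (not the paper's actual data pipeline), the top-k and random-k relabeling can be sketched in NumPy, assuming each lesion is given as a distinct integer id in a mask; the function names are ours:

```python
import numpy as np

def keep_top_k_lesions(lesion_ids: np.ndarray, k: int) -> np.ndarray:
    """Given a mask whose pixels hold a lesion id (0 = benign), keep only
    the k largest lesions and relabel all others as benign, producing a
    partially labeled binary ground truth (the top-k setting)."""
    ids, counts = np.unique(lesion_ids[lesion_ids > 0], return_counts=True)
    keep = ids[np.argsort(counts)[::-1][:k]]          # ids of the k largest lesions
    return np.where(np.isin(lesion_ids, keep), 1, 0)  # binary partial label

def keep_random_k_lesions(lesion_ids: np.ndarray, k: int, rng=None) -> np.ndarray:
    """Random-k variant: keep k lesions chosen uniformly at random."""
    rng = np.random.default_rng(rng)
    ids = np.unique(lesion_ids[lesion_ids > 0])
    keep = rng.choice(ids, size=min(k, len(ids)), replace=False)
    return np.where(np.isin(lesion_ids, keep), 1, 0)
```

Patches sampled from the relabeled mask then inherit noisy labels wherever a dropped lesion was relabeled as benign.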
As for the teacher-student model, we consider to be a model with corresponding weights and augmentation . Thus, we write the teacher model as and the student model as . Additionally, we denote the feature embeddings of a given patch from the teacher and student models by and . Similarly, the feature embeddings of the similar and dissimilar pairs ( and ) for a given patch () from the teacher model are expressed as and , respectively.
3.2 Overview of Self-similarity Student
Fig. 2 illustrates an overview of our proposed approach. To deal with noisy label patches , our Self-similarity Student method builds upon two core concepts: teacher-student learning and similarity embedding. We build our teacher and student models similarly to previous works [22, 11]. That is, the teacher model is the exponential moving average (EMA) of the student model , which makes the intermediate representations more stable and facilitates the student learning robust representations against noisy samples. Besides the EMA of model weights for the teacher model, we further leverage the ensemble of teacher model predictions for each patch to make its pseudo-label more consistent [1, 17]. Different from using the predictions ensemble to filter noisy samples, the predictions ensemble of each patch in our approach serves as the basis of its pseudo-label for complementary supervision, both counteracting noisy patches and making the most of the original labels from .
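The two ensembling mechanisms above (EMA of weights for the teacher, and a running average of teacher predictions per patch) can be sketched as follows; the momentum values are illustrative, not the paper's:

```python
import numpy as np

def ema_update(teacher_weights, student_weights, momentum=0.999):
    """Teacher weights become an exponential moving average (EMA)
    of the student weights, updated in place."""
    for t, s in zip(teacher_weights, student_weights):
        t *= momentum
        t += (1.0 - momentum) * s

def ensemble_prediction(running_pred, new_pred, momentum=0.9):
    """Running average of a patch's teacher predictions; the ensembled
    prediction later serves as the basis of that patch's pseudo-label."""
    return momentum * running_pred + (1.0 - momentum) * new_pred
```

In a full training loop, `ema_update` would run over the models' parameter tensors after each optimizer step, and `ensemble_prediction` would be applied per patch at the chosen update frequency.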
In addition to the predictions ensemble, we introduce a similarity learning method to make the pseudo-label of each patch more robust. This method is motivated by the intrinsic properties of tissue morphology in WSIs and work on unsupervised visual representation learning . Because nearby patches, which share similar morphological characteristics, tend to have the same labels, the generation of the pseudo-label of a given patch can refer to its neighboring ones, namely . Here, the pseudo-label of a given patch is defined as a function . The and are the teacher predictions ensemble of a given patch and those of its similar pairs, respectively. In this paper, we implement the function as the average of and . Following this intuition, we further apply a similarity loss, a variant of the contrastive loss, to encourage the feature embeddings of nearby patches ( and its ) to become closer and those of dissimilar ones ( and its ) to become distinct in feature space. Details of the construction of the similarity embedding by applying the similarity loss are described in Sec. 3.3 and in the pseudo code provided in Algorithm 1.
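Since the paper implements the pseudo-label function as an average, the similarity-ensembled pseudo-label can be sketched directly (the function name is ours):

```python
import numpy as np

def similarity_pseudo_label(patch_ensemble, similar_ensembles):
    """Pseudo-label of a patch: the average of its own teacher predictions
    ensemble and the ensembles of its similar (nearby) patches."""
    stacked = np.vstack([np.atleast_1d(patch_ensemble)] +
                        [np.atleast_1d(e) for e in similar_ensembles])
    return stacked.mean(axis=0)
```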
3.3 Construction of Similarity Embedding
Inspired by pathologists’ empirical knowledge and prior work , we formulate a similarity sampling strategy with respect to the distance between patches on a WSI, as detailed in Fig. 3. For each patch , we sample multiple patches and using a distance-based similarity sampling strategy within a WSI:
where denotes the coordinates of a patch on its original WSI and stands for the total number of patches in a WSI. Hence, there are similar patches whose Euclidean distance to lies within the threshold and dissimilar patches outside of that threshold.
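A minimal sketch of this distance-based sampling, assuming patch coordinates on the WSI are available as an array; the threshold and sample counts are illustrative hyper-parameters, not the paper's values:

```python
import numpy as np

def sample_similarity_pairs(coords, idx, tau=2048, n_sim=4, n_dis=4, rng=None):
    """For the patch at coords[idx], sample similar patches within Euclidean
    distance tau on the WSI and dissimilar patches outside it.

    coords : (N, 2) array of patch coordinates on the slide.
    Returns index arrays into coords for the similar and dissimilar bags.
    """
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(coords - coords[idx], axis=1)
    near = np.flatnonzero((d <= tau) & (d > 0))   # exclude the patch itself
    far = np.flatnonzero(d > tau)
    sim = rng.choice(near, size=min(n_sim, len(near)), replace=False)
    dis = rng.choice(far, size=min(n_dis, len(far)), replace=False)
    return sim, dis
```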
To learn similarity embeddings in teacher-student models, a unique form of InfoNCE is considered in this paper. The similarity loss is formulated as follows:
where is a temperature hyper-parameter, is the student feature embedding of , and and are the teacher feature embeddings of the similar and dissimilar patches corresponding to , respectively.
For time and memory efficiency in calculating at each iteration, we simplify equation (2) by randomly sampling one patch from and one from for a given patch , deriving . This can be regarded as the log loss of a two-class softmax-based classifier attempting to classify as . The dot products between the student feature embedding and the teacher feature embeddings, and , can be viewed as local similarity measurements.
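The simplified two-class form described above can be sketched as a log loss over one positive and one negative dot product; the temperature value is illustrative:

```python
import numpy as np

def similarity_loss(f_s, f_t_pos, f_t_neg, temperature=0.07):
    """Simplified two-class InfoNCE: log loss of a softmax classifier that
    tries to pick the similar (positive) teacher embedding over one randomly
    drawn dissimilar (negative) teacher embedding."""
    pos = np.dot(f_s, f_t_pos) / temperature
    neg = np.dot(f_s, f_t_neg) / temperature
    m = max(pos, neg)  # subtract the max for numerical stability
    return -(pos - (m + np.log(np.exp(pos - m) + np.exp(neg - m))))
```

The loss shrinks when the student embedding aligns with the positive (nearby) teacher embedding and grows when it aligns with the negative one.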
3.4 Self-similarity Student for Noisy Label Learning
In our proposed approach, the Self-similarity Student learns to both cluster locally similar patches and classify each patch with supervision from the original labels and the pseudo-labels . Unlike semi-supervised learning methods that treat pseudo-labels from the teacher-student paradigm as ground truths for unlabeled samples, the pseudo-labels in our method are used to counteract noisy labels in . With the similarity constraint mentioned in Sec. 3.3, the pseudo-label of a given patch , derived from the consensus between its predictions ensemble and those of its similar patches , is guaranteed to be more stable and robust to noise.
Overall, we apply two categories of losses to encourage our Self-similarity Student to learn from noisy patch-label pairs : and . Specifically, includes the cross entropy loss between the student predictions and the original labels , shown in equation (3), and the cross entropy loss between the student predictions and the pseudo-labels .
The overall loss function of our method can be written as
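The combination of the losses described above can be sketched as follows; the weighting coefficients `alpha` and `beta` are our illustrative placeholders, since the paper's exact coefficients are not reproduced here:

```python
import numpy as np

def binary_ce(p, y, eps=1e-7):
    """Binary cross entropy between a predicted probability and a label."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(p_student, y_orig, y_pseudo, l_sim, alpha=1.0, beta=1.0):
    """Overall objective: cross entropy against the (possibly noisy) original
    label, cross entropy against the similarity-ensembled pseudo-label, and
    the similarity loss from Sec. 3.3."""
    return (binary_ce(p_student, y_orig)
            + alpha * binary_ce(p_student, y_pseudo)
            + beta * l_sim)
```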
4 Experimental Result
In this section, we first elaborate our experimental details about the implementation and dataset settings in Sec. 4.1 and Sec. 4.2. Next, the performance comparisons between our method and other variants are reported in Sec. 4.3, Sec. 4.4, and Sec. 4.5. Lastly, Sec. 4.6 shows the performance on the TVGH TURP dataset, suggesting the generalization potential of our method to other clinical WSI data. For performance evaluation, we use the dice similarity coefficient (DSC) as our patch-level metric and the free-response receiver operating characteristic (FROC) curve as our lesion-level metric. The FROC curve is defined as the plot of sensitivity versus the average number of false positives per slide, and the final FROC score is the average sensitivity at 6 predefined false positive rates: 1/4, 1/2, 1, 2, 4, and 8 false positives per slide.
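Given a computed FROC curve, the final score described above amounts to reading the sensitivity off the curve at the six operating points and averaging; a minimal sketch (function name ours):

```python
import numpy as np

def froc_score(avg_fps_per_slide, sensitivities, fp_targets=(0.25, 0.5, 1, 2, 4, 8)):
    """Final FROC score: average lesion-level sensitivity interpolated from
    the FROC curve at the six predefined false-positive rates per slide."""
    # np.interp expects increasing x, so sort the curve by false-positive rate.
    order = np.argsort(avg_fps_per_slide)
    xs = np.asarray(avg_fps_per_slide, float)[order]
    ys = np.asarray(sensitivities, float)[order]
    return float(np.mean(np.interp(fp_targets, xs, ys)))
```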
4.1 Implementation Details
All our models are based on DenseNet121 with ImageNet-pretrained weights from the PyTorch model zoo. For optimizer settings, we use the Adam optimizer with learning rate 1e-4 and weight decay 4e-5, dividing the learning rate by 2 every 50 epochs. The default random seed is set to 2020, the dropout rate to 0.2, and the batch size to 48 in all our experiments for fair comparison. Referring to common stain augmentations for WSIs and those used in robust mean-teacher based methods [4, 25], we choose several augmentations to train all our models, as augmentations are proven crucial to the performance of mean-teacher based models. For similarity sampling, we choose considering the empirical cancer lesion diameter from pathologists’ view. We also set , , , , and in all our experiments. All our models are trained and tested for inference on one Nvidia RTX 2080 Ti GPU.
Here, we describe implementation details of the state-of-the-art baselines compared in Sec. 4.3. We conduct similarity sampling for our method and the ablation study only. For the student network in Noisy Student , we scale up the default dropout 2.5 times and all default augmentation magnitudes 1.5 times. The teacher network in Mean Teacher is updated every batch by the EMA of the student’s weights, while those in Predictions Ensemble and Noisy Student are updated every epoch. We also change label filtering to our relabel mechanism for balancing positive and negative samples. To the best of our knowledge, Mean Teacher, Predictions Ensemble, Noisy Student, and other relevant teacher-student methods have not yet been designed for, or evaluated on, noisily labeled histopathology WSIs.
4.2 Dataset Settings
The CAMELYON16 dataset consists of 270 WSIs for training and 130 WSIs for testing. We randomly sample 243 WSIs as and 27 WSIs as from the 270 WSIs. The 130 WSIs () are used for final evaluation. To simulate the clinical context of having unidentified cancer regions in WSIs, we create the partial label datasets and out of and by retaining and cancer regions in each WSI. As shown in Table 1, we choose the number to be 1, 2, and 3. For example, means that only the largest cancer lesion in each WSI is kept and the rest is regarded as benign tissue. It can be recognized that the tasks are more challenging than the tasks due to the injected noise.
To further sample patches from WSIs in and , we run OTSU thresholding and set a foreground-background ratio to extract foreground tissues. Following , cancerous patches are defined to have more than intersection of their areas with either or cancerous regions, and benign patches are the ones entirely outside of those cancerous regions. The resulting patch-label pairs with resolution are sampled from magnification WSIs (0.972 ), i.e., the receptive field of a patch covers .
| | Top-1 | Top-2 | Top-3 | Random-1 | Random-2 | Random-3 |
|---|---|---|---|---|---|---|
| Ratio of Correct Cancer Patches | 57.60% | 80.82% | 89.89% | 12.97% | 23.87% | 38.46% |
| Ratio of Noisiness | 42.40% | 19.18% | 10.11% | 87.03% | 76.13% | 61.54% |
| Number of Cancer Lesions | 91 | 150 | 193 | 91 | 150 | 193 |
| Average Size of Lesions () | 6.7814 | 5.7527 | 4.9959 | 2.0587 | 1.8747 | 2.5277 |
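The foreground extraction described in Sec. 4.2 can be sketched with a NumPy implementation of Otsu thresholding on a grayscale thumbnail; the minimum foreground ratio here is an illustrative placeholder, since the paper's exact ratio is not reproduced:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Otsu's method on an 8-bit grayscale image: pick the threshold that
    maximizes between-class variance, separating tissue from background."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    w = hist / hist.sum()
    omega = np.cumsum(w)                  # class-0 probability up to each level
    mu = np.cumsum(w * np.arange(256))    # class-0 cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    return int(np.nanargmax(sigma_b))

def tissue_mask(gray: np.ndarray, min_fg_ratio=0.05):
    """Foreground mask plus a flag telling whether the region passes the
    foreground-background ratio filter (tissue is darker than glass)."""
    mask = gray <= otsu_threshold(gray)
    return mask, mask.mean() >= min_fg_ratio
```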
4.3 Comparison with Previous Arts
To benchmark our Self-similarity Student, we further implement several state-of-the-art methods, including Mean Teacher , Noisy Student , and Predictions Ensemble . As demonstrated in Table 3, our method outperforms previous works on both the and tasks. Our method achieves 93.76 DSC and 36.90 FROC on the task, and 85.56 DSC and 31.88 FROC on the more challenging task, a performance boost of more than 10% over the supervised-trained baseline and more than 5% over the best previous method.
Moreover, Fig. 5 shows the qualitative comparison between our method and the baselines. Inferred from these results, our Self-similarity Student can correctly identify more cancer regions and cancer cells (patches) while producing only a few false positives. For the implementation details of the baseline methods, see Supp. Sec. 2. For an illustration of the effectiveness of our self-similarity embedding method, see Supp. Sec. 3. For more qualitative results, see Supp. Sec. 4.
| Method | DSC | FROC | DSC | FROC |
|---|---|---|---|---|
| Baseline DenseNet121 | 83.08 | 29.99 | 63.08 | 28.09 |
| Mean Teacher | 86.83 | 34.13 | 74.45 | 28.45 |
| Noisy Student | 84.90 | 34.21 | 77.46 | 30.06 |
| Prediction Ensemble | 88.60 | 33.41 | 75.59 | 30.20 |
| Self-sim Student (Ours) | 93.76 | 36.90 | 85.56 | 31.88 |
4.4 Comparison with Various Label Ratio
In this section, we compare the performance of models trained on the partially labeled training sets described in Sec. 4.2. The results in Table 3 indicate that our method achieves comparable or better performance than the DenseNet121 baseline in all experimental settings. This further suggests that our Self-similarity Student can still learn to discriminate most cancer regions from benign parts even when some cancer regions are unidentified in the training set.
4.5 Ablation Study
To demonstrate the contribution of each part of our Self-similarity Student method, we conduct a set of experiments on the task. Baseline indicates the DenseNet121 network trained with , whereas Sim-embedding shows the performance of Baseline plus . Furthermore, Pred-ensemble uses the teacher predictions ensemble of each patch as its pseudo label, while Sim-ensemble derives pseudo-labels by averaging the predictions ensembles of patches and their corresponding similar pairs.
As shown in Table 4, each component of Self-similarity Student contributes to its overall performance. Most importantly, this justifies the effectiveness of our method, which makes pseudo-labels more robust against noisy labels by using the predictions ensemble from similar pairs.
4.6 Generalizability of our method
To evaluate the potential generalizability of our method, we conduct experiments on the transurethral resection of the prostate (TURP) data from the Department of Pathology, Taipei Veterans General Hospital (TVGH). The TVGH TURP dataset consists of 71 WSIs with annotated cancerous lesions, defined as regions with a Gleason score greater than . The training set consists of 58 WSIs, with 13 WSIs held out for validation. The actual size of each WSI is about with 0.25 pixel spacing. We follow the same patch extraction strategy as for the CAMELYON16 dataset, producing 459,273 patches in total from zoomed TURP WSIs with receptive field . Our method achieves 77.24 DSC on the 13-WSI testing set, better than the supervised-trained baseline with 75.36 DSC. Qualitative results are shown in Fig. 6 and Supp. Sec. 4.
5 Conclusion
Computer vision techniques are the key to accelerating the automation of digital pathology in clinical practice, particularly the identification of cancer regions in WSIs. In this research, we propose a teacher-student framework, Self-similarity Student, to address partial label WSIs, which relieves the burden on pathologists. The results show that our method outperforms previous works at least in terms of DSC, suggesting that Self-similarity Student possesses more robust representations against noisy labels. Following these meticulous experiments, the advantage of the similarity ensemble for modeling with partial label WSIs is verified. More importantly, our approach generalizes to the TURP dataset, identifying cancer regions among benign ones with fewer false positives. In summary, Self-similarity Student can be a potent method to tackle the problems originating from partial label WSI datasets. In future work, we aim to pursue the potential of Self-similarity Student for semi-supervised learning on small-sized WSI datasets, truly contributing to the automation of cancer region delineation.
We thank Yi-Chin Tu, the chairman of Taiwan AI Labs, for the generous support of this project. We also thank the Department of Pathology and Laboratory Medicine, Taipei Veterans General Hospital, for their intensive assistance. Lastly, we appreciate Tsun-Hsiao Wang at National Yang-Ming University for his contribution to delineating cancerous regions in the WSIs of the TVGH TURP dataset.
-  Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in neural information processing systems. pp. 3365–3373 (2014)
-  Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama 318(22), 2199–2210 (2017)
-  Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Silva, V.W.K., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine 25(8), 1301–1309 (2019)
-  Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719 (2019)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
-  Gildenblat, J., Klaiman, E.: Self-supervised similarity learning for digital pathology. arXiv preprint arXiv:1905.08139 (2019)
-  He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
-  Hou, L., Samaras, D., Kurc, T.M., Gao, Y., Davis, J.E., Saltz, J.H.: Patch-based convolutional neural network for whole slide tissue image classification. In: Proceedings of the ieee conference on computer vision and pattern recognition. pp. 2424–2433 (2016)
-  Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
-  Komura, D., Ishikawa, S.: Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal 16, 34–42 (2018)
-  Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
-  Le, H., Samaras, D., Kurc, T., Gupta, R., Shroyer, K., Saltz, J.: Pancreatic cancer detection in whole slide images using noisy label annotations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 541–549. Springer (2019)
-  Lee, B., Paeng, K.: A robust and effective approach towards accurate metastasis detection and pn-stage classification in breast cancer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 841–850. Springer (2018)
-  Lin, H., Chen, H., Dou, Q., Wang, L., Qin, J., Heng, P.A.: Scannet: A fast and dense scanning framework for metastastic breast cancer detection from whole-slide image. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 539–546. IEEE (2018)
-  Lin, H., Chen, H., Graham, S., Dou, Q., Rajpoot, N., Heng, P.A.: Fast scannet: Fast and dense analysis of multi-gigapixel whole-slide images for cancer metastasis detection. IEEE transactions on medical imaging 38(8), 1948–1958 (2019)
-  McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints (Feb 2018)
-  Nguyen, D.T., Mummadi, C.K., Ngo, T.P.N., Nguyen, T.H.P., Beggel, L., Brox, T.: Self: Learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842 (2019)
-  Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
-  Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8024–8035 (2019)
-  Peikari, M., Salama, S., Nofech-Mozes, S., Martel, A.L.: A cluster-then-label semi-supervised learning approach for pathology image classification. Scientific reports 8(1), 1–13 (2018)
-  Takahama, S., Kurose, Y., Mukuta, Y., Abe, H., Fukayama, M., Yoshizawa, A., Kitagawa, M., Harada, T.: Multi-stage pathological image classification using semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10702–10711 (2019)
-  Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in neural information processing systems. pp. 1195–1204 (2017)
-  Tokunaga, H., Teramoto, Y., Yoshizawa, A., Bise, R.: Adaptive weighting multi-field-of-view cnn for semantic segmentation in pathology. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12597–12606 (2019)
-  Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3733–3742 (2018)
-  Xie, Q., Hovy, E., Luong, M.T., Le, Q.V.: Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252 (2019)
-  Xu, G., Song, Z., Sun, Z., Ku, C., Yang, Z., Liu, C., Wang, S., Ma, J., Xu, W.: Camel: A weakly supervised learning framework for histopathology image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10682–10691 (2019)
-  Zhang, J., Hu, J.: Image segmentation based on 2d otsu method with histogram analysis. In: 2008 International Conference on Computer Science and Software Engineering. vol. 6, pp. 105–108. IEEE (2008)
6 Augmentations for Model Training
We apply augmentations and for training teacher-student models [4, 13, 25] on partial label WSIs. The chosen augmentations are shown in Table 5. For network dropout, we set all teacher models’ dropout to 0.0, whereas the student models’ dropout follows Table 6. We use the Noisy augmentation hyperparameters to train the student model in Noisy Student, and the Normal augmentation hyperparameters to train the other teacher-student models.
| Category | Augmentation | Normal | Noisy |
|---|---|---|---|
| Stain/Color | Contrast (delta) | (0.75, 1.25) | (0.5, 1.875) |
| Stain/Color | Brightness (delta) | (-0.2, 0.2) | (-0.3, 0.3) |
| Stain/Color | Hue (delta) | (-0.05, 0.05) | (-0.075, 0.075) |
| Stain/Color | Saturation (delta) | (0.8, 1.2) | (0.533, 1.8) |
| Deformation | Resize (scale) | (0.9, 1.1) | (0.6, 1.35) |
| Deformation | Crop (resolution) | (224, 224) | (224, 224) |
| Deformation | Rotation (degree) | (-180, 180) | (-180, 180) |
| Deformation | Translation (delta) | (-0.05, 0.05) | (-0.075, 0.075) |
7 Implementation Details of Baseline Methods
In this section, we provide more implementation details of our baseline methods. We implement Mean Teacher , Noisy Student , and Pred-ensemble  for comparison with our Self-similarity Student method. The teacher-student model training pipeline is illustrated in Algorithm 1. In each epoch, both the teacher model and the student model run inference on the noisy label patches . The supervised loss between the (pseudo) label and the student model prediction is then calculated by the cross entropy loss . Following , the consistency loss (Eq. 5) constrains the feature map outputs of the teacher and student models to yield similar predictions:
where and are feature maps from the fully-connected layers of the teacher-student models. After the optimizer backpropagates the loss, the model weights ensemble and predictions ensemble are applied according to the respective settings. Table 6 shows all the hyperparameter settings of the different baselines. For the per-batch EMA momentum and per-epoch EMA momentum , 1 denotes no weight update and 0 denotes entirely replacing the teacher model weights with the student’s. For the predictions ensemble momentum , 1 denotes no label update and 0 denotes entirely updating the pseudo label with the teacher model prediction. As indicated in Table 6, following their original settings, we update Mean Teacher and Pred-ensemble every batch and Noisy Student every epoch. We set for Pred-ensemble and only apply on Mean Teacher. Note that we use the pseudo label mechanism, instead of eliminating patches by noisy label filtering, to stabilize the training process without over-filtering.
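Assuming Eq. 5 is the standard mean-squared-error consistency term used in Mean Teacher, the loss can be sketched as:

```python
import numpy as np

def consistency_loss(student_feat, teacher_feat):
    """Mean-squared error between the student and teacher feature map
    outputs, pushing the two models toward consistent predictions."""
    diff = np.asarray(student_feat, float) - np.asarray(teacher_feat, float)
    return float(np.mean(diff ** 2))
```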
8 Feature Embedding of Self-similarity Student
To illustrate the effectiveness of similarity learning, we use the UMAP dimension reduction algorithm to visualize the feature embeddings derived from our method and the baselines. Specifically, we sample 30,000 patches from our testing set and apply UMAP to the feature embeddings (dimension = 1024). As shown in Fig. 7, with similarity learning, Self-similarity Student learns a more compact distribution of feature embeddings, which supports its advantage over the baseline in identifying cancerous patches.
9 Additional Qualitative Result
More qualitative results on the TVGH TURP dataset are illustrated in Fig. 8, and results on the CAMELYON16 dataset are shown in Fig. 9 and Fig. 10. Our Self-sim Student consistently outperforms the other previous methods across multiple morphology patterns on both the TVGH TURP cancer dataset and the CAMELYON16 dataset.