Supervision and Source Domain Impact on Representation Learning: A Histopathology Case Study

05/10/2020 ∙ by Milad Sikaroudi, et al. ∙ University of Waterloo

As many algorithms depend on a suitable representation of data, learning informative features is considered a crucial task. Although supervised techniques using deep neural networks have boosted the performance of representation learning, the need for a large set of labeled data limits the application of such methods. As an example, high-quality delineation of regions of interest in pathology is a tedious and time-consuming task due to the large image dimensions. In this work, we explored the performance of a deep neural network trained with the triplet loss for representation learning. We investigated the notion of similarity and dissimilarity in pathology whole-slide images and, in our experiments, compared different setups ranging from unsupervised and semi-supervised to supervised learning. Additionally, different approaches to few-shot learning were tested on two publicly available pathology image datasets. We achieved high accuracy and generalization when the learned representations were applied to two different pathology datasets.







I Introduction

With the advent of digital whole-slide image (WSI) scanners and digital pathology, a vast range of computer-vision algorithms using machine learning have been developed to process histopathology images. These technologies offer opportunities for better quantitative modeling of disease appearance and hence possibly improved prediction of disease severity, aggressiveness, and patient outcome. More specifically, machine-learning applications in digital pathology range from computer-aided diagnosis based on classification, detection, and segmentation to content-based image retrieval. However, domain-specific limitations such as large image dimensions and an insufficient amount of annotated data restrict the applicability of such approaches [10].


Current deep learning algorithms may reach or surpass human-level accuracy when a large amount of data is available. However, deep networks suffer from poor sample efficiency, in stark contrast to human perception, which can learn object categories after seeing just a few pictures, in some cases even a single instance. Few-shot learning, the ability to learn from a few labeled samples, aims to address this issue. More specifically, knowledge is extracted from other similar problems since there is not enough data. As a result, many methods characterize few-shot learning as a meta-learning problem.

Approaches to meta-learning fall into three major sub-categories according to whether they rely on prior knowledge about similarity, learning, or data [17]. We aimed to utilize prior knowledge about similarity to learn robust embeddings by investigating the notion of pairwise similarity between samples. To this end, we propose a novel framework that imposes abstract domain knowledge in histopathology as prior knowledge to train a triplet deep neural network.

Fig. 1: 2-D representations of the embeddings produced by models trained (a) on CRC triplets with the triplet loss, (b) on TCGA triplets from all three anatomical sites, (c) on TCGA gastrointestinal triplets, and (d) on CRC labels with the cross-entropy loss. The 2-D representations are produced by UMAP [11].

II Related Work

Koch et al. [9] suggested the Siamese network as a choice for representation learning based on sample-wise comparisons. Siamese networks generally consist of two similar neural networks that accept two inputs from either the same or different classes. The pair is first embedded by the identical networks; then the component-wise difference of the representations is passed into a comparison neural network. By doing so, the Siamese network learns to identify discrepancies between classes. Later, Hoffer and Ailon [6] proposed triplet networks for representation learning, which learn from anchor, positive, and negative cases (triplets) simultaneously. In detail, the triplet network learns to pull the anchor and positive samples closer together while pushing the negative samples farther away in the latent space. These works fall within a broader field of study called distance metric learning [15]. Medela et al. [12] employed triplet networks on colorectal cancer slides as a source domain and utilized their model to extract features from healthy and tumorous colon, breast, and lung slides as the target domain. For this purpose, they utilized a VGG16 model [14] as the backbone of their triplet network, replacing the last fully connected layer with a more compact version. They used labels provided by pathologists in the source domain to create triplets. Furthermore, due to the mismatch between the properties of the source and target domains, images were adjusted through rescaling.

Gildenblat and Klaiman [4] trained a Siamese network based on a ResNet-50 architecture [5] by treating adjacent patches as similar and remote tiles as dissimilar cases. They both trained and tested their model on the Camelyon16 dataset [2], evaluating it in a tumor image retrieval task on the Camelyon16 test set and reporting the ratio of correctly retrieved tumorous patches in comparison with a ResNet-50 [5] with ImageNet weights. In another work, Teh and Taylor [16] investigated the performance of a weakly-supervised framework for representation learning in digital pathology using ResNet-34 [5]. They examined the following setups for representation learning with a varying target-domain size: training from scratch on the target-domain dataset, transfer learning with the cross-entropy loss based on a network pre-trained on a weakly labeled source dataset, and employing a metric-learning approach on the same model pre-trained with weakly labeled data. Due to the property mismatch between the different datasets utilized in their study, the authors had to resize the images. They reported their best accuracy on the CRC dataset [8] when using only a few samples per class.

In this study, we aim to address open questions such as the definition of similarity, the impact of the source and target domain datasets, and the level of supervision in representation learning for digital pathology. The source and target domain datasets, the triplet generation algorithm, and the representation learning approach are introduced in Section III. Section IV contains experiments on the effects of the source domain, supervision, and target domain size on feature learning. Finally, the discussion and a summary of our findings are given in Section V.

III Methodology

In this work, we used two popular histopathology datasets, namely The Cancer Genome Atlas (TCGA) and colorectal cancer (CRC) [8] datasets. The TCGA dataset contains WSIs from 25 different anatomical sites, covering 32 different cancer subtypes. We utilized randomly selected WSIs from three organ sites, namely prostate, gastrointestinal, and lung; these organs were chosen as they are among the most commonly diagnosed cancers. These sites comprise a total of nine cancer subtypes: prostate adenocarcinoma (PRAD), testicular germ cell tumors (TGCT), oesophageal carcinoma (ESCA), stomach adenocarcinoma (STAD), colonic adenocarcinoma (COAD), rectal adenocarcinoma (READ), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and mesothelioma (MESO) [3]. The CRC dataset [8] contains histological images extracted from the different tissue types present in colorectal cancer slides: background, adipose tissue, normal mucosal glands, debris, immune cells, complex stroma, simple stroma, and tumor epithelium. Some CRC patches of different tissue types can be found in Fig. 1.

III-A Triplet Generation

Since we used a triplet network, we defined the concept of similarity in a way capable of generating triplets. Each triplet comprised anchor, neighbor, and distant tiles (patches), in which the anchor and neighbor were defined as a similar pair and the anchor and distant as a dissimilar pair. The notion of similarity had to be abstract enough to avoid limiting the learning performance, and it should require minimal supervision to reduce the cost of triplet generation. In this study, similar to [7], we utilized spatial correlation as one of the approaches to define similarity among patches extracted from WSIs. In other words, we assumed that similar patterns usually emerge in an adjacent neighborhood, while dissimilar layouts often appear in spatially remote neighborhoods. More specifically, a neighbor patch was selected within a certain range of the anchor tile's center in the same WSI. For the distant patch, on the other hand, we used several alternatives: it was chosen from (1) the same WSI as long as it was spatially remote, (2) another WSI of the same cancer subtype, (3) another WSI of other subtypes of the same organ, or (4) another WSI of another organ. An example of this triplet generation is depicted in Fig. 2. For the experiments performed on the CRC dataset [8], we selected the neighbor and distant patches from the same and different tissue types as the anchor patch, respectively. Fig. 3 shows sample triplets extracted from the TCGA and CRC datasets [8].
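The spatial sampling described above can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the patch-grid coordinates, the Chebyshev distance on that grid, and the radius values are all assumptions made for the example.

```python
import random

def sample_triplet(patch_grid, neighbor_radius=1, min_distant_radius=8, rng=None):
    """Sample (anchor, neighbor, distant) patch ids from one WSI, using
    spatial proximity as a proxy for similarity (distant type 1).

    patch_grid: dict mapping (row, col) grid coordinates to patch ids.
    A neighbor lies within `neighbor_radius` grid steps of the anchor;
    a distant patch lies at least `min_distant_radius` steps away.
    """
    rng = rng or random.Random()
    coords = list(patch_grid)
    anchor = rng.choice(coords)

    def dist(a, b):
        # Chebyshev distance in grid steps (illustrative choice of metric)
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

    neighbors = [c for c in coords if 0 < dist(anchor, c) <= neighbor_radius]
    distants = [c for c in coords if dist(anchor, c) >= min_distant_radius]
    if not neighbors or not distants:
        return None  # anchor too isolated; caller should resample
    return (patch_grid[anchor],
            patch_grid[rng.choice(neighbors)],
            patch_grid[rng.choice(distants)])
```

The same interface could draw the distant patch from a different WSI instead, which would correspond to distant types 2–4 above.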

Fig. 2: An example of triplet generation from a sample WSI (COAD subtype) from the TCGA dataset.

III-B Representation Learning in the Source Domain

In this paper, we aimed to analyze the effect of the source domain on a few-shot learning framework.

Fig. 3: The first two rows show sample triplets extracted from the TCGA dataset: (a) anchor (STAD), (b) neighbor (STAD), (c) distant type 1 (from the same WSI), (d) distant type 2 (from a different STAD WSI), (e) distant type 3 (COAD), and (f) distant type 4 (LUAD). The last row shows a sample triplet generated from the CRC dataset [8]: (g) anchor, (h) neighbor, and (i) distant.

Therefore, we embedded the data in the source domain using different settings to evaluate its impact on performance. For embedding, we used the triplet network with a ResNet-18 backbone.

The triplet loss [13] was chosen because it evaluates similarity and dissimilarity among samples simultaneously, whereas the contrastive loss assesses them one pair at a time. We implemented a triplet network [13] in which three ResNet-18 networks [5] share their weights. The triplet loss can be defined as

\[
\mathcal{L} = \frac{1}{b} \sum_{i=1}^{b} \Big[ \big\| f(x_i^{a}) - f(x_i^{n}) \big\|_2^2 - \big\| f(x_i^{a}) - f(x_i^{d}) \big\|_2^2 + \alpha \Big]_{+},
\]

where $[\cdot]_{+} = \max(\cdot, 0)$, $b$ is the batch size, and $x_i^{a}$, $x_i^{n}$, and $x_i^{d}$ are the anchor, neighbor, and distant tiles in the batch, respectively. Also, $f(\cdot)$ denotes the embedded output of the network and $\alpha$ is the margin. In the embedding space, the triplet loss pulls the neighbor toward the anchor by minimizing the first term and pushes the distant away from the anchor by maximizing the second term. The $[\cdot]_{+}$ term prevents the network from pushing the distant sample further than the margin value. As a baseline for comparison, we also used a ResNet-18 [5] trained with a standard cross-entropy loss.
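The loss above is straightforward to sketch in NumPy. This is a minimal illustration of the batched hinge form, not the paper's implementation; the default margin of 1.0 is an arbitrary placeholder, since the paper's margin value is not preserved here.

```python
import numpy as np

def triplet_loss(f_a, f_n, f_d, margin=1.0):
    """Mean over the batch of max(||f_a - f_n||^2 - ||f_a - f_d||^2 + margin, 0).

    f_a, f_n, f_d: (batch, dim) arrays of anchor, neighbor, and distant
    embeddings produced by the shared-weight network.
    """
    pos = np.sum((f_a - f_n) ** 2, axis=1)  # squared anchor-neighbor distance
    neg = np.sum((f_a - f_d) ** 2, axis=1)  # squared anchor-distant distance
    return float(np.mean(np.maximum(pos - neg + margin, 0.0)))
```

Note that the hinge makes the loss exactly zero once the distant sample is farther than the neighbor by at least the margin, which is the behavior described in the text.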

III-C Knowledge Transfer to the Target Domain

After training the triplet network, the target domain data was embedded using the model. Next, a classifier was trained on a portion of the target domain data and was tested against the remaining samples. We used a Support Vector Machine (SVM) as the classifier.
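As a rough illustration of this transfer step, the sketch below trains scikit-learn's SVC on stand-ins for target-domain embeddings. The Gaussian blobs, the 128-dimensional size, the 50/50 split, and the RBF kernel are all invented for the example; only the overall pattern (frozen embeddings in, SVM classifier on a labeled portion) follows the text.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings of two target-domain tissue types,
# as if produced by the frozen triplet network.
X = np.vstack([rng.normal(0.0, 1.0, (100, 128)),
               rng.normal(3.0, 1.0, (100, 128))])
y = np.array([0] * 100 + [1] * 100)

# Train the classifier on a portion of the target data, test on the rest.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Because only the lightweight SVM is trained on the target domain, this step needs far fewer labeled samples than training the deep network itself.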

IV Experiments

For all experiments in this study, we utilized a ResNet-18 [5] backbone trained with the Adam optimizer using a fixed learning rate, beta values, and batch size. All images were fixed-size tiles extracted from WSIs at the same magnification (CRC tiles were cropped to adjust the size), and the last layer of the model provided the embedding. The margin value was kept constant for all experiments. During the supervised setup, our model was trained using labeled image tiles from the CRC dataset, while the triplet network was trained on triplets generated from both the TCGA and CRC datasets. All neural network approaches in this study were implemented using TensorFlow [1].


Triplet Loss vs. Cross-entropy Loss – First, we split the CRC dataset [8] into training and test portions. Accordingly, we trained a ResNet-18 with two different approaches: the cross-entropy loss and the triplet loss. To train our model with the cross-entropy loss, we attached a softmax layer on top of the neural network. After training both models, the embeddings were extracted from the last layer. The embeddings obtained with the triplet and the cross-entropy loss are shown in Figs. 1-a and 1-d, respectively. We applied Uniform Manifold Approximation and Projection (UMAP) [11] to visualize the 128-dimensional representations in a 2-D plane, using the same number-of-neighbors parameter for all 2-D representations.

Source Domain Effect – The triplets extracted from the CRC training set were sampled in a supervised manner, as the tissue labels were used. However, as described in Section III-A, the triplets of the TCGA data were sampled using spatial and tissue-type information in an unsupervised manner. Accordingly, we trained two extra models on triplets extracted from TCGA: the first on all three anatomical sites, and the second only on the gastrointestinal data from TCGA, since the CRC target data comes from the same anatomical site. The embeddings encoded by these two models are shown in Figs. 1-b and 1-c, respectively.

Portions of     Triplet loss (three models)                  Cross-entropy
Data (%)
  5        83.00±2.18    77.00±2.52    75.00±2.98    91.00±1.65
 10        88.50±0.99    88.50±0.77    88.50±0.77    87.00±1.09
 25        92.80±0.41    93.20±0.24    94.20±0.29    87.00±0.35
 50        94.90±0.11    93.80±0.19    94.60±0.14    89.50±0.14
100        95.90±0.03    95.75±0.07    95.95±0.09    91.90±0.09

Table I: Average accuracy and confidence interval (±) of the target-domain classification over the different folds of the test dataset.

Target Domain Size Effect – Finally, we split the CRC training data into chunks of 5%, 10%, 25%, and 50% using stratified sampling, besides the full set. Then we trained an SVM classifier utilizing 10-fold cross-validation on each of these subsets. For all experiments, the SVM classifiers were fine-tuned by changing the kernel (linear, RBF, sigmoid, and polynomial) and the C and gamma parameters (including the scale and auto settings for gamma) to achieve the best performance. All results are reported in Table I.
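A minimal sketch of such a tuning loop with scikit-learn's GridSearchCV is shown below. The synthetic data and the specific C values in the grid are placeholders, since the paper's exact search ranges are not preserved here; only the kernel list and the scale/auto gamma settings come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a labeled subset of target-domain embeddings.
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

param_grid = {
    "kernel": ["linear", "rbf", "sigmoid", "poly"],  # kernels from the text
    "C": [0.1, 1.0, 10.0],                           # placeholder range
    "gamma": ["scale", "auto"],                      # settings from the text
}
# 10-fold cross-validation, as used for the target-domain subsets.
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)
best = search.best_params_
```

Stratified sampling of the subsets keeps the class proportions intact, which matters at the smallest (5%) portion where some tissue types would otherwise be nearly absent.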

V Summary and Conclusions

As depicted in Fig. 1, the representations of empty and adipose tiles were located close to each other and far from other tissue textures in all models. Also, the mucosa was well-clustered in all embeddings because of its unique pattern. Moreover, the corresponding representations of classes with visual similarities blended in the 2-D space when the triplet loss was employed. According to Table I, however, all models trained with the triplet loss scored higher mean accuracy than the model trained with label information using the cross-entropy loss, excluding the smallest (5%) subset. The confidence intervals shrunk as the size of the target data increased. Furthermore, the confidence interval was relatively larger when the source domain was different from the target domain. More importantly, the highest accuracy, 95.95%, was achieved by the model trained on the gastrointestinal subset of TCGA. This outcome may suggest that extensive training on a similar dataset using weakly supervised and unsupervised methods can improve the performance of the model and the generalization of the solution.

In this work, we investigated the effect of supervision and the source domain on embedding and few-shot learning for histopathology images. Accordingly, we have shown that a network can learn a meaningful representation of histopathology images using the massive amount of weakly labeled data currently available online. This was achieved by exploiting spatial correlation, anatomical information, and slide-level diagnoses.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems.
  • [2] B. E. Bejnordi, M. Veta, P. J. Van Diest, B. Van Ginneken, N. Karssemeijer, G. Litjens, J. A. Van Der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. (2017) Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318 (22), pp. 2199–2210.
  • [3] L. A. Cooper, E. G. Demicco, J. H. Saltz, R. T. Powell, A. Rao, and A. J. Lazar (2018) PanCancer insights from The Cancer Genome Atlas: the pathologist's perspective. The Journal of Pathology 244 (5), pp. 512–524.
  • [4] J. Gildenblat and E. Klaiman (2019) Self-supervised similarity learning for digital pathology. In MICCAI 2019 COMPAY Workshops.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [6] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92.
  • [7] N. Jean, S. Wang, A. Samar, G. Azzari, D. Lobell, and S. Ermon (2019) Tile2Vec: unsupervised representation learning for spatially distributed data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3967–3974.
  • [8] J. N. Kather, C. Weis, F. Bianconi, S. M. Melchers, L. R. Schad, T. Gaiser, A. Marx, and F. G. Zöllner (2016) Multi-class texture analysis in colorectal cancer histology. Scientific Reports 6, pp. 27988.
  • [9] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
  • [10] D. Komura and S. Ishikawa (2018) Machine learning methods for histopathological image analysis. Computational and Structural Biotechnology Journal 16, pp. 34–42.
  • [11] L. McInnes, J. Healy, and J. Melville (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  • [12] A. Medela, A. Picon, C. L. Saratxaga, O. Belar, V. Cabezón, R. Cicchi, R. Bilbao, and B. Glover (2019) Few shot learning in histopathological images: reducing the need of labeled data on biological datasets. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1860–1864.
  • [13] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • [14] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [15] J. L. Suárez, S. García, and F. Herrera (2018) A tutorial on distance metric learning: mathematical foundations, algorithms and software. arXiv preprint arXiv:1812.05944.
  • [16] E. W. Teh and G. W. Taylor (2019) Learning with less data via weakly labeled patch classification in digital pathology. arXiv preprint arXiv:1911.12425.
  • [17] J. Vanschoren (2018) Meta-learning: a survey. arXiv preprint arXiv:1810.03548.