Dealing with Label Scarcity in Computational Pathology: A Use Case in Prostate Cancer Classification

by Koen Dercksen, et al.

Large amounts of unlabelled data are commonplace for many applications in computational pathology, whereas labelled data is often expensive, both in time and cost, to acquire. We investigate the performance of unsupervised and supervised deep learning methods when few labelled data are available. Three methods are compared: clustering autoencoder latent vectors (unsupervised), a single layer classifier combined with a pre-trained autoencoder (semi-supervised), and a supervised CNN. We apply these methods on hematoxylin and eosin (H&E) stained prostatectomy images to classify tumour versus non-tumour tissue. Results show that semi-/unsupervised methods have an advantage over supervised learning when few labels are available. Additionally, we show that incorporating immunohistochemistry (IHC) stained data provides an increase in performance over only using H&E.




1 Introduction

Prostate cancer is manually graded by pathologists on H&E stained specimens, based on the morphological features of epithelial tissue. Since this is a labour-intensive process, an automated system to perform cancer grading would be of great value. However, developing such systems typically requires large sets of labelled data, which in turn require annotations from human experts (in this case uropathologists). Such expertise is rare, so creating the required labelled datasets is challenging, which inherently limits the potential for algorithm development (Litjens et al., 2017).


Figure 1: The flow of data for each of the three methods. Note that the data flow for the supervised method is identical to that of the semi-supervised method, but does not utilise unsupervised pre-training.

We hypothesise that a semi- or unsupervised approach, leveraging unlabelled data, can learn a good latent tissue representation which can be used to classify unseen tissue, without using labelled data during training (Arevalo et al., 2015; Hou et al., 2019; Kallenberg et al., 2016). While a supervised approach undoubtedly outperforms unsupervised methods given enough data, we show the advantage of using a semi- or unsupervised approach when little labelled data is available through a three-way comparison (Figure 1). In addition to H&E, we test incorporating IHC stained images that highlight epithelial tissue in order to force learning a more descriptive latent space.

2 Methodology


We used the PESO dataset (Bulten et al., 2019), which consists of 102 registered whole-slide image (WSI) pairs from patients that underwent a radical prostatectomy. Each pair is made up of an H&E slide and a slide that was processed using IHC with an epithelial (CK8/18) and basal cell (P63) marker. The set was divided into 62 training and 40 test slides. All H&E slides are publicly available.

For training, two separate datasets were created. The first is a completely randomly sampled set of patches without labels, which is used to pre-train the semi- and unsupervised methods. The second is a labelled set of equal size in which stroma, benign epithelium and tumour tissue (class label determined by the centre pixel) are sampled in a fixed ratio. Subsets of varying sizes were used for the experiments in this paper. All patches have the same size (pixel resolution 0.96 µm) and are sampled pair-wise from both stains.

[Figure 2 panels: H&E, IHC, Supervised, Semi-supervised, Unsupervised]

Figure 2: Classification maps of models trained with 1000 labelled patches applied to a benign (top row) and tumour (bottom row) region (transparent = stroma, green = benign epithelium, red = tumour).

Semi- & unsupervised training.

An autoencoder is trained on the unlabelled set to reconstruct either H&E or IHC patches given an H&E input patch by optimising the mean-squared error (MSE). The encoder part of the network consists of strided convolution layers to compress the input into the 128-dimensional latent space. The decoder contains convolution and upsampling layers to decompress the latent vector back to the original input size. After training, the latent vectors are clustered using k-means with 50 clusters. Finally, the clusters are assigned labels through majority voting, using subsets of varying sizes from the labelled set. Empty clusters are assigned the stroma label. For the semi-supervised experiments, the same autoencoder is trained, but instead of using k-means, labels are assigned by training a single-layer classifier on subsets of the labelled set.
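The majority-voting step of the unsupervised method can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the k-means cluster index of every labelled patch is already available, and the integer class encoding and the function names `label_clusters` and `classify` are hypothetical.

```python
from collections import Counter

# Hypothetical integer encoding of the three tissue classes.
STROMA, BENIGN, TUMOUR = 0, 1, 2


def label_clusters(cluster_ids, labels, n_clusters, default_label=STROMA):
    """Give each cluster the most common label among its labelled patches."""
    votes = [Counter() for _ in range(n_clusters)]
    for cluster_id, label in zip(cluster_ids, labels):
        votes[cluster_id][label] += 1
    # Clusters that received no labelled patch fall back to the default label
    # (stroma, as described in the text).
    return [v.most_common(1)[0][0] if v else default_label for v in votes]


def classify(cluster_ids, cluster_labels):
    """Map the cluster index of each unseen patch to its voted class label."""
    return [cluster_labels[c] for c in cluster_ids]


# Toy example with 4 clusters; cluster 1 receives no labelled patch.
mapping = label_clusters(
    cluster_ids=[0, 0, 0, 2, 2, 3],
    labels=[TUMOUR, TUMOUR, BENIGN, STROMA, STROMA, BENIGN],
    n_clusters=4,
)
predictions = classify([1, 3, 0], mapping)
```

With very few labelled patches, many of the 50 clusters may receive no vote at all, so the choice of fallback label matters. In this sketch, ties within a cluster go to the label that was seen first, which is an arbitrary choice.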

Supervised training.

As a baseline, a CNN is trained end-to-end in a supervised fashion on subsets of the labelled set, without unsupervised pre-training on the unlabelled set. This acts as an upper bound on the classification performance on this dataset.

Every experiment uses data augmentation (flipping/hue/saturation/brightness/contrast) and is repeated five times in order to report confidence intervals.
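The text does not state how the confidence intervals are computed. One common option for a small number of repeated runs is a Student-t interval around the mean, sketched below. The function name `mean_with_ci` and the example scores are hypothetical, and the critical value is only valid for five runs (four degrees of freedom).

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_95_DF4 = 2.776


def mean_with_ci(scores, t_crit=T_CRIT_95_DF4):
    """Return (mean, half-width) of a t-based confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    # Standard error of the mean, using the sample standard deviation.
    half_width = t_crit * stdev(scores) / sqrt(len(scores))
    return mean(scores), half_width


# Hypothetical F1 scores from five repetitions of one experiment.
runs = [0.70, 0.72, 0.71, 0.69, 0.73]
centre, half_width = mean_with_ci(runs)
print(f"F1 = {centre:.2f} ± {half_width:.2f}")
```

If a different number of repetitions is used, `t_crit` has to be replaced by the critical value for the matching degrees of freedom.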


We sample patches from the PESO test regions with the same tissue ratio as the labelled training set to measure the final performance of each approach. Every method is trained to predict all three classes (aiming to learn the difference between benign epithelium and tumour), and the F1 score is reported for tumour versus non-tumour classification. At test time, all models except for the supervised IHC network are validated on H&E.
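The evaluation described above, collapsing three-class predictions into tumour versus non-tumour before computing F1, can be sketched as follows. The class encoding and the function name `tumour_f1` are assumptions made for illustration.

```python
# Hypothetical integer encoding of the three tissue classes.
STROMA, BENIGN, TUMOUR = 0, 1, 2


def tumour_f1(y_true, y_pred, tumour=TUMOUR):
    """F1 score for tumour versus non-tumour from three-class labels."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        truth_pos, pred_pos = truth == tumour, pred == tumour
        if truth_pos and pred_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif truth_pos:
            fn += 1
    denominator = 2 * tp + fp + fn
    # By convention in this sketch, return 0.0 when there are no tumour
    # patches and no tumour predictions at all.
    return 2 * tp / denominator if denominator else 0.0


# Confusing stroma with benign epithelium does not affect the binary score.
y_true = [STROMA, BENIGN, TUMOUR, TUMOUR, TUMOUR, STROMA]
y_pred = [BENIGN, STROMA, TUMOUR, TUMOUR, STROMA, TUMOUR]
score = tumour_f1(y_true, y_pred)
```

Under this definition, a model that never predicts the tumour class obtains an F1 score of 0.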

         H&E → H&E                   H&E → IHC                   H&E           IHC
NLP      Semi-SV       Un-SV         Semi-SV       Un-SV         SV            SV
100      0.56 ± 0.07   0.71 ± 0.02   0.69 ± 0.04   0.71 ± 0.02   0.00 ± 0.00   0.00 ± 0.00
500      0.70 ± 0.03   0.72 ± 0.02   0.75 ± 0.01   0.74 ± 0.01   0.58 ± 0.12   0.67 ± 0.09
1000     0.73 ± 0.01   0.72 ± 0.01   0.77 ± 0.01   0.74 ± 0.01   0.76 ± 0.01   0.77 ± 0.02
2000     0.74 ± 0.01   0.74 ± 0.02   0.77 ± 0.01   0.75 ± 0.00   0.73 ± 0.03   0.56 ± 0.27
10,000   0.76 ± 0.01   0.70 ± 0.01   0.78 ± 0.01   0.74 ± 0.01   0.74 ± 0.04   0.71 ± 0.29
100,000  0.75 ± 0.02   0.73 ± 0.01   0.74 ± 0.02   0.75 ± 0.01   0.88 ± 0.02   0.91 ± 0.02

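Table 1: F1 scores (tumour versus non-tumour) for all methods trained using various subsets of the labelled set. For the autoencoder-based methods, the column groups denote input stain → reconstruction target. NLP = number of labelled patches, SV = supervised.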

3 Results & Discussion

Semi- and unsupervised methods have an advantage over supervised training when few labels are available (Table 1). The semi-supervised method reaches an F1 score of 0.70 (H&E reconstruction target) to 0.75 (IHC reconstruction target) with as few as 500 labelled patches, compared to 0.58 for the supervised H&E classifier.

Additionally, the semi-/unsupervised performance is more robust than the supervised approach, which became unstable at small dataset sizes. At large dataset sizes the supervised method performed substantially better than the other approaches, as expected.

Using IHC data as a reconstruction target (or input for the supervised approach) improves performance in most settings (Table 1), suggesting that the extra information present in the IHC data leads to better latent representations.

At larger labelled dataset sizes the performance of the semi- and unsupervised approaches seems to saturate. This may be caused by the low complexity of the classification models (k-means and a single-layer neural network) or by limitations in the representative power of the latent space. In future work it might be interesting to investigate the latent representations across the different methods to better understand this phenomenon.


References

  • Arevalo et al. (2015) John Arevalo, Angel Cruz-Roa, Viviana Arias, Eduardo Romero, and Fabio A González. An unsupervised feature learning framework for basal cell carcinoma image analysis. Artificial Intelligence in Medicine, 64(2):131–145, 2015.
  • Bulten et al. (2019) Wouter Bulten, Péter Bándi, Jeffrey Hoven, Rob van de Loo, Johannes Lotz, Nick Weiss, Jeroen van der Laak, Bram van Ginneken, Christina Hulsbergen-van de Kaa, and Geert Litjens. Epithelium segmentation using deep learning in H&E-stained prostate specimens with immunohistochemistry as reference standard. Scientific Reports, 9(1):864, 2019.
  • Hou et al. (2019) Le Hou, Vu Nguyen, Ariel B Kanevsky, Dimitris Samaras, Tahsin M Kurc, Tianhao Zhao, Rajarsi R Gupta, Yi Gao, Wenjin Chen, David Foran, et al. Sparse autoencoder for unsupervised nucleus detection and representation in histopathology images. Pattern Recognition, 86:188–200, 2019.
  • Kallenberg et al. (2016) Michiel Kallenberg, Kersten Petersen, Mads Nielsen, Andrew Y Ng, Pengfei Diao, Christian Igel, Celine M Vachon, Katharina Holland, Rikke Rass Winkel, Nico Karssemeijer, et al. Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Transactions on Medical Imaging, 35(5):1322–1331, 2016.
  • Litjens et al. (2017) Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.