Learning with less data via Weakly Labeled Patch Classification in Digital Pathology

11/27/2019, by Eu Wern Teh et al.

In Digital Pathology (DP), labeled data is generally very scarce due to the requirement that medical experts provide annotations. We address this issue by learning transferable features from weakly labeled data, which are collected from various parts of the body and are organized by non-medical experts. In this paper, we show that features learned from such weakly labeled datasets are indeed transferable and allow us to achieve highly competitive patch classification results on the colorectal cancer (CRC) dataset [5] and the PatchCamelyon (PCam) dataset [14] while using an order of magnitude less labeled data.




1 Introduction

In recent years, we have witnessed significant success in visual recognition for Digital Pathology [3, 1]. The main driving force of this success is the availability of data labeled by medical experts. However, creating such datasets is costly in terms of time and money.

Conversely, it is relatively easy to obtain weakly labeled images as it does not require annotation from medical experts. One example of such a dataset is KimiaPath24 [2] (see Figure 1). These images are collected from different parts of the body by non-medical experts based on visual similarity and are organized into 24 tissue groups.

In this work, our goal is to transfer knowledge from weakly labeled data to a dataset where labeled data is scarce. In order to evaluate our method, we simulate data scarcity by varying the amount of data available from the richly annotated CRC [5] and PCam [14] datasets. Our contributions are as follows. First, we show that features learned from weakly labeled data are indeed useful for training models on other histology datasets. Second, we show that with weakly labeled data, one can use an order of magnitude less data to achieve competitive patch classification results rivaling models trained from scratch with 100% of the data. Third, we further explore a proxy-based metric learning approach to learn features on the weakly labeled dataset. Finally, we also achieve state-of-the-art results for both CRC and PCam when our models are trained with both weakly labeled data and 100% of the original annotated data.

Figure 1: This figure shows all 24 whole-slide images used to generate the KimiaPath24 dataset [2].

2 Related Work

Recently in the field of DP, a few works have attempted transfer learning from one dataset to another. Khan et al. [6] improve prostate cancer detection by fine-tuning on a breast cancer dataset. Medela et al. [10] perform few-shot learning for lung, breast, and colon tissues by fine-tuning on the CRC dataset. These works demonstrate that there are some transferable features among various organs. However, the source datasets on which they pre-train still require annotation from medical experts.

Apart from using classification to learn features from weakly labeled data, we explore metric learning as an alternative approach. Metric learning is a machine learning task in which we learn a function that captures similarity or dissimilarity between inputs. It has many applications, such as person re-identification [9], product retrieval [12], clothing retrieval [8], and face recognition [16]. In Digital Pathology, Medela et al. [10] use metric learning in the form of Siamese networks for few-shot learning tasks. Teh et al. [13] also use Siamese networks, in conjunction with cross-entropy loss, to boost patch classification performance. In Siamese networks, pairs of images are fed to the network, and the network attracts or repels these images based on some established notion of similarity (e.g., class label). A shortcoming of Siamese networks is their sampling process, which grows quadratically with the number of examples in the dataset [11, 17, 15]. Due to this shortcoming, we choose to explore ProxyNCA [11], which offers faster convergence and better performance.
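The difference in sampling cost can be made concrete with a quick count. The function names below are ours, and the numbers use the KimiaPath24 dataset size (23,916 images, 24 groups) purely for illustration:

```python
def siamese_pairs(n):
    # A Siamese network draws training pairs from all unordered pairs of
    # examples, so the candidate pool grows as O(n^2).
    return n * (n - 1) // 2

def proxynca_comparisons(n, c):
    # ProxyNCA compares each example only against the c class proxies,
    # so the number of comparisons grows as O(n * c).
    return n * c

print(siamese_pairs(23_916))             # 285975570 (~2.9e8 candidate pairs)
print(proxynca_comparisons(23_916, 24))  # 573984 comparisons
```

For a dataset of this size, the proxy formulation reduces the comparison space by roughly three orders of magnitude.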

3 Description of datasets

To validate our hypothesis, we use one weakly labeled dataset and two target datasets where patch classification is the targeted task. First, we pre-train two different models (Classification and ProxyNCA) on the weakly labeled dataset. After pre-training, we train and evaluate our models on the target datasets.

3.1 Weakly labeled dataset

We use the KimiaPath24 dataset [2] as our weakly labeled dataset, from which we learn transferable features. A non-medical expert collected these images from various parts of the body and organized them into 24 groups based on visual distinction. All images share the same per-pixel resolution and patch size. There are a total of 23,916 images in this dataset.

3.2 Target Dataset A: Colorectal Cancer (CRC) Dataset

The CRC dataset [5] consists of eight classes of tissue texture: tumour epithelium, simple stroma, complex stroma, immune cells, debris and mucus, mucosal glands, adipose tissue, and background. There are 625 images per class, and all images share the same per-pixel resolution and dimensions.

Figure 2: Examples of tissue samples from the CRC dataset (tumour epithelium, simple stroma, complex stroma, immune cells, debris and mucus, mucosal glands, adipose tissue, and background).

Figure 3: Examples of tissue samples (normal and tumor) from the PCam dataset.

3.3 Target Dataset B: Patch Camelyon (PCam) Dataset

PCam [14] is a subset of the CAMELYON16 dataset [3], a breast cancer dataset. There are a total of 327,680 images, all sharing the same per-pixel resolution and dimensions. The dataset has two categories: tumor and normal.

4 Proposed method

We compare three approaches. As a baseline, we train on the target dataset from scratch, i.e., from a randomly initialized model. We contrast this with models pre-trained on weakly labeled data by using two strategies: First, we learn features via classification of groups with a standard cross-entropy loss. Second, we explore metric learning as an alternative method for feature learning.

4.1 Metric Learning

We propose to use ProxyNCA (Proxy-Neighborhood Component Analysis) [11], which is a proxy-based metric learning technique. One benefit of ProxyNCA over standard Siamese Networks is a reduction in the number of compared examples, which is achieved by comparing examples against the class proxies.

Figure 4 visualizes the computational savings gained by comparing examples to class proxies rather than to other examples. In ProxyNCA, class proxies are stored as parameters with the same dimension as the embedding space and are updated by minimizing the proxy loss.

Figure 4: A visualization of how ProxyNCA works. [Left panel] Standard NCA compares one example with all other examples (8 different pairings). [Right panel] In ProxyNCA, we compare only with the class proxies (2 different pairings).

ProxyNCA requires an input example x, a backbone model f, an embedding layer g, the corresponding embedding e = g(f(x)), a proxy function p+(x), which returns the proxy of the same class as x, and a proxy function P−(x), which returns all proxies of classes different from that of x. Its goal is to minimize the distance between e and p+(x) while maximizing the distances between e and the proxies in P−(x). Let d(a, b) denote the Euclidean distance between a and b, ‖v‖ the 2-norm of vector v, and λ the learning rate. We describe how to train ProxyNCA in Algorithm 1.

1: Randomly initialize all proxies
2: for each mini-batch do
3:     for each example x in the mini-batch do
4:         compute the embedding e = g(f(x))
5:         compute the loss L = −log( exp(−d(e, p+(x))) / Σ_{p ∈ P−(x)} exp(−d(e, p)) )
6:         update f, g, and the proxies by a gradient step on L with learning rate λ
7:     end for
8: end for
Algorithm 1: ProxyNCA Training

After training on weakly labeled data, we discard everything except for the backbone model. We then use this backbone model along with a new embedding layer to train on the target dataset.
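The loss minimized above can be sketched in NumPy as follows. The function names are ours, and following [11] we L2-normalize embeddings and proxies before computing distances:

```python
import numpy as np

def l2_normalize(v):
    # Scale vectors along the last axis to unit length.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def proxynca_loss(embedding, proxies, label):
    """NCA loss of one embedding against the class proxies.

    embedding: (d,) embedding e = g(f(x)) of one example.
    proxies:   (c, d) one learnable proxy per class.
    label:     class index of the example (selects p+(x)).
    """
    e = l2_normalize(embedding)
    p = l2_normalize(proxies)
    d = np.linalg.norm(e - p, axis=-1)          # distance to every proxy
    pos = np.exp(-d[label])                     # attraction to p+(x)
    neg = np.sum(np.exp(-np.delete(d, label)))  # repulsion from P-(x)
    return -np.log(pos / neg)
```

In training, this quantity is minimized with respect to the backbone, the embedding layer, and the proxies; moving an embedding toward its class proxy lowers the loss.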

5 Experiments

In this section, we describe our experimental setup and results. We evaluate our models using prediction accuracy. For the CRC dataset, we follow the same experimental setup as [5], training and evaluating our models using 10-fold cross-validation. For the PCam dataset [14], we train our model on the training set and evaluate performance on the test set, following the setup of [14] and [13].

N    %    Initialization/Method  Accuracy (%)
12   2    Random                 61.62 ± 3.79
          Classification         83.30 ± 2.88
          ProxyNCA               82.82 ± 2.70
25   4    Random                 64.12 ± 3.73
          Classification         87.30 ± 1.40
          ProxyNCA               87.10 ± 1.23
50   9    Random                 77.34 ± 2.11
          Classification         89.38 ± 1.29
          ProxyNCA               89.80 ± 1.88
100  18   Random                 84.58 ± 1.19
          Classification         91.58 ± 1.08
          ProxyNCA               91.96 ± 0.99
625  100  Random                 89.84 ± 1.29
          Classification         92.10 ± 0.80
          ProxyNCA               92.46 ± 1.22
-    -    RBF-SVM [5]            87.40

Table 1: Accuracy of our model trained with varying percentages of the data under three pre-training settings on the CRC dataset. N denotes the number of examples per class. We compare our approach with the best model in [5], an RBF-SVM that uses five concatenated feature sets (lower-order histogram, higher-order histogram, Local Binary Patterns, Gray-Level Co-occurrence Matrix, and an ensemble of decision trees).

5.1 Experimental Setup

In all of our experiments, we use the Adam [7] optimizer to train our model. Following [11], we use an exponential learning rate decay schedule with a factor of 0.94. We also perform channel-wise data normalization with the mean and standard deviation of the respective dataset.
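As a sketch, the schedule and normalization can be written as follows. The function names are ours, and applying the 0.94 decay once per epoch is our assumption:

```python
import numpy as np

def exponential_lr(initial_lr, epoch, decay=0.94):
    # Exponential decay schedule: the learning rate is multiplied by the
    # decay factor (0.94 in our experiments) once per epoch (assumed).
    return initial_lr * decay ** epoch

def channel_stats(images):
    # Per-channel mean and standard deviation over a dataset of shape
    # (n, h, w, 3), computed once before training.
    return images.mean(axis=(0, 1, 2)), images.std(axis=(0, 1, 2))

def channelwise_normalize(images, mean, std):
    # Subtract the per-channel mean and divide by the per-channel std.
    return (images - mean) / std
```

After normalization, each channel of the dataset has zero mean and unit variance.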

Weakly labeled dataset: We pre-train on the weakly labeled dataset similarly for both target datasets, with one subtle difference. For the CRC dataset, we perform random cropping directly on the KimiaPath24 images, since the resolutions of the two datasets are about the same. For the PCam dataset, however, we first resize the KimiaPath24 images to match the resolution of the target dataset, and then perform random cropping on the resized images. We set an initial learning rate and train our models to convergence (100 epochs).

Target dataset: We pad the images by 12.5% via image reflection prior to random cropping. We set the learning rate and train all of our models to convergence (200 epochs for the CRC dataset and 100 epochs for the PCam dataset). We repeat each experiment ten times with different random seeds and report the mean over trials in Tables 1 and 2.

Data augmentation: In addition to random cropping, we also perform the following data augmentations at each training stage: a) random horizontal flip; b) random rotation; c) color jittering, where we set the hue, saturation, brightness, and contrast thresholds to 0.4. All data augmentations are performed using the TorchVision package.
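For illustration, the reflection padding and the geometric augmentations can be sketched in NumPy. The paper uses TorchVision; the helper names below are ours, color jittering is omitted, and the 90-degree rotation is a simple stand-in for TorchVision's random rotation:

```python
import numpy as np

def reflect_pad_random_crop(image, rng, pad_frac=0.125):
    # Pad each spatial border by 12.5% via reflection, then take a random
    # crop of the original size, as done for the target datasets.
    h, w, _ = image.shape
    ph, pw = int(h * pad_frac), int(w * pad_frac)
    padded = np.pad(image, ((ph, ph), (pw, pw), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * ph + 1)
    left = rng.integers(0, 2 * pw + 1)
    return padded[top:top + h, left:left + w]

def random_horizontal_flip(image, rng, p=0.5):
    # Mirror the image left-right with probability p.
    return image[:, ::-1] if rng.random() < p else image

def random_rotation(image, rng):
    # Rotate by a random multiple of 90 degrees (square patches assumed).
    return np.rot90(image, k=int(rng.integers(0, 4)))
```

Each helper returns an augmented image with the same shape as its input, so the transforms can be chained in any order.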

Backbone model: In all of our experiments, we use a modified version of ResNet34 [4]. Due to the low resolution of the target datasets, we remove the max-pooling layer from ResNet34.

Additional info: For the experiment with 100% of the training data on the PCam dataset, we follow the same experimental setup as [13], where the number of training epochs is ten and the learning rate is reduced by a factor of ten after epoch five.

5.2 Results

Tables 1 and 2 show our experimental results. We train our model with varying amounts of target-domain data under three different pre-training settings (Random, Classification, and ProxyNCA).

By pre-training our models on weakly labeled data, we achieve test accuracies of 89.80% and 89.77% on the CRC and PCam datasets respectively with an order of magnitude less training data. Both results rival the test accuracies of randomly initialized models (89.84% and 88.98%) trained with 100% of the data. With 100% of the training data, our models attain test accuracies of 92.46% and 90.47%, outperforming the previous state-of-the-art results of 87.40% and 90.36% on CRC and PCam respectively. We note that for PCam, the previous state of the art falls within the error bars of our result.

We further observe that ProxyNCA outperforms classification pre-training on the PCam dataset. On the CRC dataset, however, this trend holds only when the number of samples per class is 50 or larger. When the number of samples per class is very small, the results vary considerably, which makes the comparison more difficult.

In Figure 5, we qualitatively show the retrieval performance on four different cells with features trained on weakly labeled data. We use the activation of the pre-trained embedding layer as image features. Retrieval is performed by computing the Euclidean distance between features of query images and features of all other images in the CRC dataset.
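The retrieval step can be sketched as follows (the function name is ours): features for the query and gallery images come from the pre-trained embedding layer, and ranking is by Euclidean distance.

```python
import numpy as np

def top_k_retrieval(query, gallery, k=3):
    # Rank all gallery features by Euclidean distance to the query feature
    # and return the indices of the k nearest ones, as in Figure 5.
    dists = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(dists)[:k]
```

A retrieval is counted as correct when the returned image shares the class of the query, which is what the green boxes in Figure 5 indicate.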

6 Conclusion

We show that useful features can be learned from weakly labeled data, for which a non-medical expert can visually identify from where (which organ) an image came, but no expert annotations are available. We show that such features are transferable to the CRC and PCam datasets, enabling competitive results with an order of magnitude less training data. Although evaluation was conducted in a simulated “low data” regime, our approach holds promise for transfer to digital pathology datasets for which the number of actual annotations by medical experts is very small.

N        %     Initialization/Method  Accuracy (%)
1,000    0.76  Random                 79.37 ± 1.33
               Classification         85.29 ± 1.08
               ProxyNCA               86.69 ± 0.46
2,000    1.53  Random                 81.26 ± 1.47
               Classification         86.55 ± 1.29
               ProxyNCA               87.38 ± 0.78
3,000    2.29  Random                 84.09 ± 1.24
               Classification         87.03 ± 0.62
               ProxyNCA               87.69 ± 0.72
13,107   10    Random                 88.47 ± 0.59
               Classification         89.64 ± 0.39
               ProxyNCA               89.77 ± 0.50
131,072  100   Random                 88.98 ± 1.06
               Classification         89.85 ± 0.64
               ProxyNCA               90.47 ± 0.59
-        -     P4M [14]               89.97
-        -     Pi+ [13]               90.36 ± 0.41

Table 2: Accuracy of our model trained with varying percentages of the data under three pre-training settings on the PCam dataset. N denotes the number of examples per class. We compare our approach to [14], which uses a rotation-equivariant CNN, and to [13], which uses contrastive and self-perturbation losses together with cross-entropy loss.

Figure 5: A visualization of top-3 retrieval performance on the CRC dataset (query classes: tumour epithelium, simple stroma, complex stroma, and debris and mucus) with features trained on the weakly labeled dataset. Correctly retrieved images (same class as the query image) are highlighted with a green box. Query images are selected randomly from the dataset.


  • [1] G. Aresta, T. Araújo, S. Kwok, S. S. Chennamsetty, M. Safwan, V. Alex, B. Marami, M. Prastawa, M. Chan, M. Donovan, et al. (2019) Bach: grand challenge on breast cancer histology images. Medical image analysis. Cited by: §1.
  • [2] M. Babaie, S. Kalra, A. Sriram, C. Mitcheltree, S. Zhu, A. Khatami, S. Rahnamayan, and H. R. Tizhoosh (2017) Classification and retrieval of digital pathology scans: a new dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 8–16. Cited by: Figure 1, §1, §3.1.
  • [3] B. E. Bejnordi, M. Veta, P. J. Van Diest, B. Van Ginneken, N. Karssemeijer, G. Litjens, J. A. Van Der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. (2017) Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318 (22), pp. 2199–2210. Cited by: §1, §3.3.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
  • [5] J. N. Kather, C. Weis, F. Bianconi, S. M. Melchers, L. R. Schad, T. Gaiser, A. Marx, and F. G. Zöllner (2016) Multi-class texture analysis in colorectal cancer histology. Scientific reports 6, pp. 27988. Cited by: Learning with Less data via weakly labeled patch classification in Digital Pathology, §1, §3.2, Table 1, §5.
  • [6] U. A. H. Khan, C. Stürenberg, O. Gencoglu, K. Sandeman, T. Heikkinen, A. Rannikko, and T. Mirtti (2019) Improving prostate cancer detection with breast histopathology images. arXiv preprint arXiv:1903.05769. Cited by: §2.
  • [7] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • [8] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104. Cited by: §2.
  • [9] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §2.
  • [10] A. Medela, A. Picon, C. L. Saratxaga, O. Belar, V. Cabezón, R. Cicchi, R. Bilbao, and B. Glover (2019) Few shot learning in histopathological images: reducing the need of labeled data on biological datasets. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1860–1864. Cited by: §2, §2.
  • [11] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §2, §4.1, §5.1.
  • [12] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §2.
  • [13] E. W. Teh and G. W. Taylor (2019-08–10 Jul) Metric learning for patch classification in digital pathology. In International Conference on Medical Imaging with Deep Learning – Extended Abstract Track, London, United Kingdom. External Links: Link Cited by: §2, §5.1, §5, Table 2.
  • [14] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling (2018) Rotation equivariant cnns for digital pathology. In International Conference on Medical image computing and computer-assisted intervention, pp. 210–218. Cited by: Learning with Less data via weakly labeled patch classification in Digital Pathology, §1, §3.3, §5, Table 2.
  • [15] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: §2.
  • [16] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §2.
  • [17] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §2.