In recent years, we have witnessed significant success in visual recognition for Digital Pathology [3, 1]. The main driving force behind this success is the availability of data labeled by medical experts. However, creating such datasets is costly in terms of both time and money.
Weakly labeled data, such as the KimiaPath24 dataset , offers an alternative: its images are collected from different parts of the body by non-medical experts based on visual similarity and are organized into 24 tissue groups.
In this work, our goal is to transfer knowledge from weakly labeled data to a dataset where labeled data is scarce. To evaluate our method, we simulate data scarcity by varying the amount of data available from the richly annotated CRC and PCam datasets. Our contributions are as follows. First, we show that features learned from weakly labeled data are indeed useful for training models on other histology datasets. Second, we show that with weakly labeled data, one can use an order of magnitude less data to achieve competitive patch classification results, rivaling models trained from scratch with 100% of the data. Third, we explore a proxy-based metric learning approach to learn features on the weakly labeled dataset. Finally, we achieve state-of-the-art results for both CRC and PCam when our models are trained with both weakly labeled data and 100% of the original annotated data.
2 Related Work
Recently, in the field of Digital Pathology, a few works have attempted transfer learning from one dataset to another. Khan et al.  attempt to improve prostate cancer detection by fine-tuning on a breast cancer dataset. Medela et al.  perform few-shot learning on lung, breast, and colon tissue by fine-tuning on the CRC dataset. These works demonstrate that there are transferable features among various organs. However, the source datasets on which they pre-train still require annotation from medical experts.
Apart from using classification to learn features from weakly labeled data, we explore metric learning as an alternative approach. Metric learning is a machine learning task in which we learn a function that captures the similarity or dissimilarity between inputs. It has many applications, such as person re-identification , product retrieval , clothing retrieval , and face recognition . In Digital Pathology, Medela et al.  use metric learning in the form of Siamese networks for few-shot learning tasks. Teh et al.  also use Siamese networks, in conjunction with a cross-entropy loss, to boost patch classification performance. In Siamese networks, pairs of images are fed to the network, and the network attracts or repels these images based on some established notion of similarity (e.g., class label). A shortcoming of Siamese networks is the sampling process, whose cost grows quadratically with the number of examples in the dataset [11, 17, 15]. Due to this shortcoming, we choose to explore ProxyNCA , which offers faster convergence and better performance.
3 Description of datasets
To validate our hypothesis, we use one weakly labeled dataset and two target datasets, where patch classification is the target task. First, we pre-train two different models (Classification and ProxyNCA) on the weakly labeled dataset. After pre-training, we train and evaluate our models on the target datasets.
3.1 Weakly labeled dataset
We use the KimiaPath24 dataset  as our weakly labeled dataset, from which we learn transferable features. A non-medical expert collected these images from various parts of the body and organized them into 24 groups based on visual distinction. These images have a resolution of  per pixel with a size of  pixels. There are a total of 23,916 images in this dataset.
3.2 Target Dataset A: Colorectal Cancer (CRC) Dataset
The CRC dataset consists of eight types of cell textures: tumor epithelium, simple stroma, complex stroma, immune cells, debris and mucus, mucosal glands, adipose tissue, and background. There are 625 images per class, and each image has a resolution of  per pixel with dimensions of  pixels.
3.3 Target Dataset B: Patch Camelyon (PCam) Dataset
4 Proposed method
We compare three approaches. As a baseline, we train on the target dataset from scratch, i.e., from a randomly initialized model. We contrast this with models pre-trained on weakly labeled data by using two strategies: First, we learn features via classification of groups with a standard cross-entropy loss. Second, we explore metric learning as an alternative method for feature learning.
4.1 Metric Learning
We propose to use ProxyNCA (Proxy-Neighborhood Component Analysis) , which is a proxy-based metric learning technique. One benefit of ProxyNCA over standard Siamese Networks is a reduction in the number of compared examples, which is achieved by comparing examples against the class proxies.
Figure 4 visualizes the computational savings that can be gained by comparing examples to class proxies rather than to other examples. In ProxyNCA, class proxies are stored as parameters with the same dimension as the embedding space and are updated by minimizing the proxy loss.
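To make the savings concrete: with N training images and C classes, a Siamese network samples from on the order of N² example pairs, while ProxyNCA compares each example against only C proxies. A small illustrative calculation, using the KimiaPath24 sizes from Section 3.1 (the function names are ours):

```python
def pair_comparisons(n):
    # number of distinct example pairs a Siamese network can sample
    return n * (n - 1) // 2

def proxy_comparisons(n, c):
    # each example is compared against one proxy per class
    return n * c

n, c = 23916, 24  # KimiaPath24: 23,916 images, 24 tissue groups
print(pair_comparisons(n))     # 285,975,570 candidate pairs
print(proxy_comparisons(n, c)) # 573,984 example-proxy comparisons
```

On this dataset, the proxy formulation reduces the comparison space by roughly three orders of magnitude.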
ProxyNCA requires an input example x, a backbone model f, an embedding layer g, the corresponding embedding e = g(f(x)), a proxy function p(x), which returns the proxy of the same class as x, and a proxy function Z(x), which returns all proxies of classes different from that of x. Its goal is to minimize the distance between e and p(x) and at the same time maximize the distance between e and each proxy in Z(x). Let d(a, b) denote the Euclidean distance between a and b, ||v|| the L2-norm of a vector v, and λ the learning rate. We describe how to train ProxyNCA in Algorithm 1.
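As a concrete illustration, here is a minimal NumPy sketch of the ProxyNCA objective for a single example, assuming L2-normalized embeddings and proxies and squared Euclidean distance (the function signature is ours, not from the original implementation):

```python
import numpy as np

def proxy_nca_loss(embedding, proxies, label):
    """ProxyNCA loss for one example.

    embedding: (D,) raw embedding e produced by the backbone and embedding layer
    proxies:   (C, D) one learnable proxy per class
    label:     class index of the example
    """
    e = embedding / np.linalg.norm(embedding)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    d = np.sum((e - p) ** 2, axis=1)             # squared distance to every proxy
    pos = np.exp(-d[label])                      # attraction to the same-class proxy p(x)
    neg = np.sum(np.delete(np.exp(-d), label))   # repulsion from all other proxies Z(x)
    return -np.log(pos / neg)
```

Minimizing this loss pulls an embedding toward its class proxy and pushes it away from all other proxies; in training, the proxies are updated by gradient descent alongside the network weights.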
After training on weakly labeled data, we discard everything except for the backbone model. We then use this backbone model along with a new embedding layer to train on the target dataset.
5 Experiments

In this section, we describe our experimental setup and results. We evaluate our models using prediction accuracy. For the CRC dataset, we follow the same experimental setup as , training and evaluating our models with 10-fold cross-validation. For the PCam dataset , we train our model on the training set and evaluate performance on the test set, following the setup of  and .
We also compare against an RBF-SVM that uses five different concatenated features (lower-order histogram, higher-order histogram, Local Binary Patterns, Gray-Level Co-occurrence Matrix, and an ensemble of decision trees).
5.1 Experimental Setup
We use the Adam optimizer  with an exponential learning rate decay schedule (decay factor 0.94). We also perform channel-wise data normalization with the mean and standard deviation of the respective dataset.
Weakly labeled dataset: We pre-train on the weakly labeled dataset in the same way for both target datasets, with one subtle difference. For the CRC dataset, we perform random cropping of size  pixels directly on the KimiaPath24 images, since the resolutions of the two datasets are about the same. For the PCam dataset, however, we resize the KimiaPath24 images from  pixels to  pixels to match the resolution of the target dataset, and then perform random cropping of size  pixels on the resized images. The initial learning rate is set to , and we train our models to convergence (100 epochs).
Target datasets: We pad the images by 12.5% via image reflection prior to random cropping. We set the learning rate to  and train all of our models to convergence (200 epochs for the CRC dataset and 100 epochs for the PCam dataset). We repeat each experiment ten times with different random seeds and report the mean over trials in Tables 1 and 2.
Data augmentation: In addition to random cropping, we also perform the following data augmentation at each training stage: a) Random Horizontal Flip b) Random Rotation c) Color Jittering - where we set the hue, saturation, brightness and contrast threshold to 0.4. All data augmentations are performed by using the TorchVision package.
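The augmentation pipeline described above could be expressed with TorchVision roughly as follows. This is a sketch: the helper name and the rotation range are our assumptions (the text does not specify the rotation degrees), while the 0.4 jitter thresholds come from the text.

```python
from torchvision import transforms

def build_train_transform(crop_size):
    """Assemble the training-time augmentations described in Section 5.1."""
    return transforms.Compose([
        transforms.RandomCrop(crop_size),        # crop size depends on the dataset
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(degrees=180),  # assumed range; not specified in the text
        transforms.ColorJitter(brightness=0.4, contrast=0.4,
                               saturation=0.4, hue=0.4),
        transforms.ToTensor(),
    ])
```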
Backbone Model: In all of our experiments, we use a modified version of ResNet34 . Due to the low resolution of the target datasets ( px and  px), we remove the max-pooling layer from ResNet34.
Additional Info: For the experiment with 100% of the training data on the PCam dataset, we follow the same experimental setup as , where the number of training epochs is ten and the learning rate is reduced by a factor of ten after epoch five.
By pre-training our models on weakly labeled data, we achieve test accuracies of  and  on the CRC and PCam datasets, respectively, with an order of magnitude less training data. Both results rival the test accuracies of randomly initialized models (89.84% and 88.98%) trained with 100% of the data. With 100% of the training data, our models attain test accuracies of  and , outperforming the previous state-of-the-art results of  and  on CRC and PCam, respectively. We note that for PCam, the previous state of the art falls within the error bars of our result.
We further observe that ProxyNCA outperforms classification on the PCam dataset. On the CRC dataset, however, this trend holds only when the number of samples per class is 50 or larger. When the number of samples per class is too small, the results are highly varied, which makes the comparison more difficult.
In Figure 5, we qualitatively show the retrieval performance for four different cell types using features trained on weakly labeled data. We use the activations of the pre-trained embedding layer as image features. Retrieval is performed by computing the Euclidean distance between the features of a query image and the features of all other images in the CRC dataset.
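This retrieval procedure is a plain nearest-neighbor search in the embedding space; a minimal sketch (the function name is ours):

```python
import numpy as np

def retrieve(query_feature, gallery_features, k=5):
    """Return indices of the k gallery images closest to the query.

    query_feature:    (D,) embedding of the query image
    gallery_features: (N, D) embeddings of all other images in the dataset
    """
    distances = np.linalg.norm(gallery_features - query_feature, axis=1)
    return np.argsort(distances)[:k]  # indices sorted by increasing distance
```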
6 Conclusion

We show that useful features can be learned from weakly labeled data, for which a non-expert can visually identify where (from which organ) an image came, but for which no expert annotations exist. We show that such features are transferable to both the CRC and PCam datasets and achieve competitive results with an order of magnitude less training data. Although evaluation is performed in a simulated "low data" regime, our approach holds promise for transfer to digital pathology datasets for which the number of actual annotations by medical experts is very small.
References

-  (2019) BACH: grand challenge on breast cancer histology images. Medical Image Analysis.
-  (2017) Classification and retrieval of digital pathology scans: a new dataset. In , pp. 8–16.
-  Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318 (22), pp. 2199–2210.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
-  (2016) Multi-class texture analysis in colorectal cancer histology. Scientific Reports 6, pp. 27988.
-  (2019) Improving prostate cancer detection with breast histopathology images. arXiv preprint arXiv:1903.05769.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-  (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1096–1104.
-  (2019) Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
-  (2019) Few shot learning in histopathological images: reducing the need of labeled data on biological datasets. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1860–1864.
-  (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368.
-  (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012.
-  (2019) Metric learning for patch classification in digital pathology. In International Conference on Medical Imaging with Deep Learning – Extended Abstract Track, London, United Kingdom.
-  (2018) Rotation equivariant CNNs for digital pathology. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 210–218.
-  (2019) Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5022–5030.
-  (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515.
-  (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848.