Basal cell carcinomas (BCCs) are among the most common cutaneous malignancies in humans (Chinem and Miot, 2011). Because of their frequency, BCCs are diagnosed by pathologists and dermato-pathologists on a regular basis. Digital pathology improves and simplifies histological diagnoses with regard to safety, quality and efficiency (Griffin and Treanor, 2017), and it supports pathologists by providing diagnostic assistance, namely computer-assisted diagnoses (Komura and Ishikawa, 2018). Recently, this has mainly been achieved with machine learning methods. Such methods could assist physicians, particularly pathologists, in finding new histological patterns for the diagnosis of diseases.
From a machine learning perspective, convolutional neural networks (CNNs; LeCun et al., 1998; Krizhevsky et al., 2012) are the established method for such image classification tasks, not least because CNNs have been applied successfully to biological and medical images. Examples include the detection of melanoma on par with dermatologists (Tschandl et al., 2019; Haenssle et al., 2018; Esteva et al., 2017) and the prediction of cardiovascular risk factors from retinal fundus images (Poplin et al., 2018). However, typical histological images acquired at a resolution sufficient to resolve cellular structures are too large for current state-of-the-art CNN architectures: they often measure 50,000 × 100,000 pixels, while CNNs are typically applied to images of at most 4,096 × 4,096 pixels (Momeni et al., 2018). Furthermore, the classification of histopathology slides represents a weak label problem: the whole image slide is labelled with a single class (i.e. the diagnosis), although a large fraction of the image is identical across classes and only a small region is indicative of the respective class. Recent attempts at training CNNs on histopathologic images frequently avoid the ultra-high resolution problem by sampling random image patches from the full image (Cruz-Roa et al., 2013; Albarqouni et al., 2016; Janowczyk and Madabhushi, 2016), which leaves the weak labelling problem in place. An overview of machine learning and deep learning methods that tackle the high-resolution problem of histopathology slides is given in Komura and Ishikawa (2018). Recently, multiple instance learning (MIL) and attention-based models have been proposed for analyzing whole slide images (Tomczak et al., 2018; Ilse et al., 2018).
The aim of this study is to assess whether it is possible to detect basal cell carcinomas in whole slide images (WSI), overcoming the ultra-high resolution and weak label problems by adapting standard machine learning methods. Moreover, we aim to identify the key regions in the image that are important for the decision of the predictive model, and finally to compare them to the diagnostic key regions used by board-certified pathologists.
2 Massive Multiple Instance Learning with Attention
We treat the problem of classifying extremely high resolution images as a multiple instance learning problem by dividing whole slide images into patches of relatively small resolution that can be processed by standard CNN architectures. This introduces the problem of credit assignment, i.e. how to combine signals from a massive number of patches to classify a full WSI as 'contains BCC' or 'does not contain BCC'. Simple solutions such as averaging patch predictions or adopting the prediction of the maximally activated patch have obvious problems with credit assignment. Therefore, we employ attention-based MIL pooling as introduced by Ilse et al. (2018) and compare it against a baseline of down-scaled WSI as well as patch-based methods using mean and max pooling. An overview of the patch-based approach is shown in Figure 1. We use the well-known VGG11 architecture of Simonyan and Zisserman (2014) for all patch-based experiments.
Baseline method. As a baseline method, we use a standard CNN trained on whole slide images down-scaled to 1024 × 1024 pixels. This down-sampling strategy means that only a small fraction of the available information was used for the classification of the histological sections. The CNN consists of five blocks of convolution-convolution-maxpooling and utilizes SELU activation functions (Klambauer et al., 2017). The architecture and the hyperparameters of this CNN were optimized on a validation set using manual hyperparameter tuning. The network was trained with standard SGD. Mean and standard deviation of accuracy, F1 score and the area under the ROC curve (AUC) were calculated by re-training the network 100 times.
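The baseline architecture can be sketched as follows. This is a minimal, illustrative PyTorch reconstruction: the channel widths and the binary output head are assumptions for the sketch, not the exact hyperparameters found by the manual tuning described above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One convolution-convolution-maxpooling block with SELU activations.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.SELU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.SELU(),
        nn.MaxPool2d(2),
    )

class BaselineCNN(nn.Module):
    """Five conv-conv-maxpool blocks applied to 1024 x 1024 down-scaled WSI."""
    def __init__(self):
        super().__init__()
        widths = [3, 8, 16, 32, 64, 64]  # assumed channel widths
        self.features = nn.Sequential(
            *[conv_block(widths[i], widths[i + 1]) for i in range(5)]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(widths[-1], 1)

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        # Sigmoid output for the binary decision 'contains BCC' vs. not.
        return torch.sigmoid(self.classifier(h))

model = BaselineCNN()
out = model(torch.randn(2, 3, 1024, 1024))  # a mini-batch of two down-scaled WSI
```

In practice such a model would be trained with SGD on a binary cross-entropy loss; the sketch only shows the forward pass.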
WSI method. All WSI were divided into patches of 224 × 224 pixels covering the whole image, where padding was used if necessary. To exclude empty background patches, the average color intensity of all pixels was calculated for each patch. Then, for each WSI, the maximum of these values was determined, and all patches with an average intensity above a fixed fraction of this maximum were considered empty and removed. Each patch was normalized to zero mean and unit variance; no stain-normalization was applied.
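The tiling and background-filtering step can be sketched in NumPy as follows. This is a simplified, single-channel version; the 90% threshold is a hypothetical illustration value, not the fraction used in the study.

```python
import numpy as np

PATCH = 224    # patch side length in pixels
THRESH = 0.9   # hypothetical fraction of the per-slide maximum (illustrative)

def tile_and_filter(wsi, patch=PATCH, thresh=THRESH):
    """Split a WSI (H x W intensity array) into patches, drop bright
    background patches, and normalize the remaining ones."""
    h, w = wsi.shape
    # Pad so the image divides evenly into patch-sized tiles; replicating
    # border pixels keeps the padding looking like the surrounding background.
    padded = np.pad(wsi, ((0, (-h) % patch), (0, (-w) % patch)), mode="edge")
    H, W = padded.shape
    tiles = (padded.reshape(H // patch, patch, W // patch, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, patch, patch))
    # Remove patches whose mean intensity exceeds thresh * per-slide maximum.
    means = tiles.mean(axis=(1, 2))
    kept = tiles[means <= thresh * means.max()]
    # Normalize each kept patch to zero mean and unit variance.
    mu = kept.mean(axis=(1, 2), keepdims=True)
    sd = kept.std(axis=(1, 2), keepdims=True) + 1e-8
    return (kept - mu) / sd

# Toy slide: dark tissue in the top-left corner, bright background elsewhere.
wsi = np.full((500, 700), 240.0)
wsi[:224, :224] = 40.0
patches = tile_and_filter(wsi)  # only the tissue patch survives the filter
```

On real H/E slides the intensity would be computed from the RGB channels, but the filtering logic is the same.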
Each non-empty patch was processed by the CNN and the activations of the last layer were stored as the representation of the respective patch. These activations were then used as input for the final classification network, where all patch activations of a single WSI were passed to the network as a mini-batch. For the mean and max pooling MIL methods, we calculated the mean or maximum of the resulting network predictions. The attention classifier was designed according to Ilse et al. (2018), where the representation of each instance corresponds to the network activations of one image patch. We use a weighted average of instances, where the weights are the output of a neural network that directly predicts these attention weights. A set of instances in a single whole slide image is represented by $H = \{\mathbf{h}_1, \ldots, \mathbf{h}_K\}$, the MIL pooling function is $\mathbf{z} = \sum_{k=1}^{K} a_k \mathbf{h}_k$, and the attention weight is

$$a_k = \frac{\exp\{\mathbf{w}^\top \tanh(\mathbf{V}\mathbf{h}_k^\top)\}}{\sum_{j=1}^{K} \exp\{\mathbf{w}^\top \tanh(\mathbf{V}\mathbf{h}_j^\top)\}},$$

where $\mathbf{w}$ and $\mathbf{V}$ are trainable parameters. Finally, a classification layer with sigmoid activation is applied to provide the prediction $\hat{y}$. Again, we used standard SGD for training. Mean and standard deviation of accuracy, F1 score and AUC were calculated by re-training the network 100 times.
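The attention pooling can be sketched in a few lines of NumPy. The bag size, embedding size and attention dimension below are arbitrary illustration values; in the study, the instance representations come from the VGG11 feature extractor and the parameters are learned by SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(H, V, w):
    """Attention-based MIL pooling (Ilse et al., 2018).
    H: (K, M) instance representations, one row per patch.
    V: (L, M) and w: (L,) are the trainable attention parameters.
    Returns the bag representation z = sum_k a_k h_k and the weights a."""
    scores = np.tanh(H @ V.T) @ w              # (K,) unnormalized scores
    scores -= scores.max()                     # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()  # softmax over instances
    z = a @ H                                  # weighted average of instances
    return z, a

K, M, L = 6, 16, 8  # instances per bag, embedding size, attention size
H = rng.normal(size=(K, M))
V = rng.normal(size=(L, M))
w = rng.normal(size=L)
z, a = attention_pool(H, V, w)
```

Because the weights form a softmax over the instances, they sum to one and can be read directly as the relative contribution of each patch to the bag representation.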
Interpretation method. To interpret the predictions of the neural network, we used (1) Integrated Gradients (Sundararajan et al., 2017) for the baseline method and (2) the learned attention weights per image patch of the attention mechanism described above. We visualized the attention weights in order to determine the key regions for the classification of the CNN model (Figure 2).
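To illustrate how Integrated Gradients produces attributions, here is a minimal NumPy sketch for a model whose input gradient can be written down analytically (a logistic-regression stand-in, not the baseline CNN). The attribution of each input dimension is the path integral of the gradient along the straight line from a baseline input to the actual input.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(w, x):
    # Analytic input gradient of f(x) = sigmoid(w . x).
    s = sigmoid(w @ x)
    return s * (1.0 - s) * w

def integrated_gradients(w, x, baseline, steps=200):
    """Approximate IG attributions (Sundararajan et al., 2017) with a
    midpoint Riemann sum over `steps` points along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad(w, baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 0.5, -2.0])
baseline = np.zeros(3)
attr = integrated_gradients(w, x, baseline)
# Completeness axiom: the attributions sum to f(x) - f(baseline).
gap = sigmoid(w @ x) - sigmoid(w @ baseline)
```

For a CNN, the analytic gradient is simply replaced by backpropagated input gradients; the averaging over the interpolation path is unchanged.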
3 Experiments and Results
Images of histological slides of normal skin and skin sections containing BCCs were stained with hematoxylin and eosin (H/E; n=838 slides); 647 of them represent BCC cases and 191 show normal skin. This set of 838 images was randomly split into a test set of 129 images (15%) and a training set of 709 images (85%); 20% of the training set was used as validation set. The median size of the WSI is 56,896 × 26,198 pixels, with the height ranging from 6,884 to 47,939 pixels and the width ranging from 7,360 to 99,568 pixels. The images were retrospectively collected at the Medical University of Vienna and the Kepler University Hospital, according to ethics votes number 1119/2018 (Ethics Committee Upper Austria) and 2085/2018 (Ethics Committee Medical University of Vienna).
We trained the baseline method and three methods based on mean pooling, max pooling, and the attention mechanism described above using the PyTorch framework (Paszke et al., 2017). The results of the attention-based method against the baseline method and mean/max pooling MIL on our data set are shown in Table 1. The attention-based pooling method significantly outperformed MIL with max pooling, MIL with mean pooling, and an end-to-end CNN trained on down-sampled whole-slide images with respect to the AUC. Additionally, we report the accuracy and the F1 score of all predictive methods in Table 1, which lead to the same ranking of the compared methods. The sensitivity and the specificity of the MIL-attention method are .97 ± .01 and .91 ± .03, respectively. In addition to the classification, this attention-based architecture makes it easy to identify the patches that are important for the classification of the network by inspecting the corresponding attention weights (see Figure 2).
Table 1: Results of the compared methods.

| data type | method | accuracy | F1 score | AUC | p-value |
|---|---|---|---|---|---|
| patches | CNN + MIL attention | .96 (.95-.96) | .97 (.97-.98) | .94 (.93-.95) | |
| patches | CNN + MIL max pooling | .92 (.90-.94) | .95 (.94-.96) | .88 (.84-.93) | <0.001 |
| patches | CNN + MIL mean pooling | .88 (.87-.90) | .93 (.92-.94) | .83 (.80-.85) | <0.001 |
| re-scaled WSI | end-to-end CNN | .76 (.70-.82) | .83 (.77-.88) | .90 (.88-.92) | <0.001 |
4 Discussion

Histopathology slides represent a gigantic source of information, since they have been collected and stored for decades. However, their computational analysis poses a huge challenge to machine learning techniques due to (1) their ultra-high resolution and (2) their weak labels. We have demonstrated that CNNs with attention-based pooling can be used to detect basal cell carcinomas in histopathology slides, and how these models can be interpreted and visualized.
Preliminary work. This extended abstract presents ongoing and preliminary work. We are investigating other types of attention mechanisms and experiments on additional data sets. Currently we use the VGG11 architecture for extracting features due to its simplicity; we plan to compare this with other CNN architectures as well as with training the whole pipeline in an end-to-end fashion. Furthermore, we want to scale down the WSI stepwise to find the optimal size-to-performance trade-off. Another avenue we plan to follow is to compare the key regions identified via the attention weights to saliency maps recorded via eye tracking of pathologists during diagnosis.
Acknowledgments. We would like to thank Rene Silye, Gudrun Lang, Giuliana Petronio and Christoph Sinz for their excellent scientific input and their great assistance in data collection. We thank the NVIDIA Corporation, Audi.JKU Deep Learning Center, Audi Electronic Venture GmbH, Janssen Pharmaceutica (MadeSMART), UCB S.A., FFG grant 871302, LIT grant DeepToxGen and AI-SNN, and FWF grant P 28660-N31.
References

- Albarqouni et al. (2016). AggNet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging 35(5), pp. 1313–1321.
- Chinem and Miot (2011). Epidemiology of basal cell carcinoma. Anais Brasileiros de Dermatologia 86(2), pp. 292–305.
- Cruz-Roa et al. (2013). A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, pp. 403–410.
- Esteva et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), pp. 115–118.
- Griffin and Treanor (2017). Digital pathology in clinical use: where are we now and what is holding us back? Histopathology 70(1), pp. 134–145.
- Haenssle et al. (2018). Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology, pp. 1–7.
- Ilse et al. (2018). Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712.
- Janowczyk and Madabhushi (2016). Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. Journal of Pathology Informatics 7.
- Klambauer et al. (2017). Self-normalizing neural networks. Neural Information Processing Systems (NeurIPS).
- Komura and Ishikawa (2018). Machine learning methods for histopathological image analysis. Computational and Structural Biotechnology Journal 16, pp. 34–42.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
- Momeni et al. (2018). Deep recurrent attention models for histopathological image analysis. bioRxiv, pp. 438341.
- Paszke et al. (2017). Automatic differentiation in PyTorch.
- Poplin et al. (2018). Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2(3), pp. 158.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Sundararajan et al. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 3319–3328.
- Tomczak et al. (2018). Histopathological classification of precursor lesions of esophageal adenocarcinoma: a deep multiple instance learning approach. International Conference on Medical Imaging with Deep Learning.
- Tschandl et al. (2019). Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. The Lancet Oncology.