Detecting cutaneous basal cell carcinomas in ultra-high resolution and weakly labelled histopathological images

Diagnosing basal cell carcinomas (BCC), one of the most common cutaneous malignancies in humans, is a task regularly performed by pathologists and dermato-pathologists. Supporting histological diagnosis with diagnostic suggestions, i.e. computer-assisted diagnosis, is an active research area that aims to improve safety, quality and efficiency. Machine learning methods are increasingly applied for this purpose because of their superior performance. However, typical images obtained by scanning histological sections often have a resolution that is prohibitive for processing with current state-of-the-art neural networks. Furthermore, the data pose a problem of weak labels, since only a tiny fraction of the image is indicative of the disease class, whereas a large fraction of the image is highly similar to the non-disease class. The aim of this study is to evaluate whether it is possible to detect basal cell carcinomas in histological sections using attention-based deep learning models and to overcome the ultra-high resolution and the weak labels of whole slide images. We demonstrate that attention-based models can indeed yield almost perfect classification performance with an AUC of 0.95.


1 Introduction

Basal cell carcinomas (BCCs) represent one of the most common cutaneous malignancies in humans (Chinem and Miot, 2011). Because of their frequency, BCCs are diagnosed by pathologists and dermato-pathologists on a regular basis. Digital pathology improves and simplifies histological diagnoses with regard to safety, quality and efficiency (Griffin and Treanor, 2017). It also supports pathologists with computer-assisted diagnoses (Komura and Ishikawa, 2018), which in recent years have mainly been based on machine learning methods. Such methods could also assist physicians, particularly pathologists, in finding new histological patterns for the diagnosis of diseases.

From a machine learning perspective, convolutional neural networks (CNNs; LeCun et al., 1998; Krizhevsky et al., 2012) would be the established method to tackle this image classification task, not least because CNNs have been successfully applied to analyze biological and medical images. Examples include the detection of melanoma, performing on par with dermatologists (Tschandl et al., 2019; Haenssle et al., 2018; Esteva et al., 2017), and the prediction of cardiovascular risk factors based on retinal fundus images (Poplin et al., 2018). However, typical histological images, obtained at a resolution appropriate to see cellular structures, are too large for processing with current state-of-the-art CNN architectures. Such images often have a resolution of 50,000 × 100,000 pixels, while CNNs are typically applied to images with a maximal resolution of up to 4,096 × 4,096 pixels (Momeni et al., 2018). Recent attempts at training CNNs on histopathologic images frequently avoid the ultra-high resolution problem by sampling random image patches from the full image (Cruz-Roa et al., 2013; Albarqouni et al., 2016; Janowczyk and Madabhushi, 2016), which leaves the weak labelling problem unresolved. An overview of machine learning and deep learning methods that tackle the high-resolution problem of histopathology slides is given in Komura and Ishikawa (2018). Furthermore, the classification of histopathology slides represents a weak label problem: the whole slide image is labelled with a single class (i.e. diagnosis), but a large fraction of the image is identical in all classes, and only a small region is indicative of the respective class. Recently, multiple instance learning (MIL) and attention-based models have been proposed for analyzing whole slide images (Tomczak et al., 2018; Ilse et al., 2018).

The aim of the study is to assess whether it is possible to detect basal cell carcinomas in whole slide images (WSI) overcoming the ultra-high resolution and weak label problems by adapting standard machine learning methods. Moreover, we aim to identify the key regions in the image that are important for the decision of the predictive model and finally to compare them to the diagnostic key regions for board-certified pathologists.

2 Massive Multiple Instance Learning with Attention

We treat the problem of classifying extremely high resolution images as a multiple instance learning problem by dividing whole slide images into patches of relatively small resolution that can be processed by standard CNN architectures. This introduces the problem of credit assignment, i.e. how to combine signals from a massive number of patches to classify a full WSI as 'contains BCC' or 'does not contain BCC'. Simple solutions such as averaging patch predictions or adopting the prediction of the maximally activated patch have obvious problems with credit assignment. Therefore, we employ attention-based MIL pooling as introduced by Ilse and colleagues (Ilse et al., 2018) and compare it against a baseline of downscaled WSI as well as patch-based methods using mean and max pooling. An overview of the patch-based approach is shown in Figure 1. We use the well-known VGG11 architecture by Simonyan and Zisserman (2014) for all patch-based experiments.

Figure 1: Left: Histopathology slides with ultra-high resolution (usually >20,000 × >20,000 pixels) represent the input for the detection of BCC. Center: The full image is separated into patches with a resolution of 224 × 224 pixels. Right: Patches represent small regions of the histopathological slides and are used as instances in a multiple instance learning setting.

Baseline method. As a baseline method, we use a standard CNN trained on whole slide images down-scaled to 1024 × 1024 pixels. The down-sampling strategy means that only a small fraction of the available information was used for the classification of the histological sections. The CNN consists of five blocks of convolution-convolution-maxpooling and utilizes SELU activation functions (Klambauer et al., 2017). The architecture and the hyperparameters of this CNN were optimized on a validation set using manual hyperparameter tuning. The network was trained with standard SGD. Mean and standard deviation of accuracy, F1 score and the area under the ROC curve (AUC) were calculated by re-training the network 100 times.
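As a rough illustration of how much is discarded by down-scaling, the following sketch compares pixel counts for a slide of the median size reported in Section 3 with the 1024 × 1024 baseline input. It is a back-of-the-envelope calculation rather than a figure from the study, and pixel count is used here only as a proxy for information content.

```python
# Median WSI size reported in Section 3 (in pixels).
median_width, median_height = 56_896, 26_198
# Side length of the down-scaled baseline input.
baseline_side = 1_024

original_pixels = median_width * median_height
baseline_pixels = baseline_side * baseline_side

# Fraction of the original pixel count that remains after down-scaling.
retained_fraction = baseline_pixels / original_pixels

print(f"original pixels:   {original_pixels:,}")
print(f"baseline pixels:   {baseline_pixels:,}")
print(f"retained fraction: {retained_fraction:.4%}")
```

For a median-sized slide, fewer than one in a thousand pixels remain in the down-scaled input.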

WSI method. All WSI were divided into patches of 224 × 224 pixels covering the whole image, where padding was used if necessary. To exclude empty background patches, the average color intensity of all pixels was calculated for each patch. Then, for each WSI, the maximum of these patch averages was calculated, and all patches whose average intensity exceeded a fixed fraction of this maximum were considered empty and removed. Each patch was normalized to zero mean and unit variance; no stain normalization was applied.
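The patching and background-filtering steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the white padding value, the `rel_threshold` of 0.9 and the per-patch (rather than per-channel) standardization are all hypothetical choices, since the text does not specify them.

```python
import numpy as np

PATCH = 224  # patch side length stated in the text


def tile_image(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Pad an (H, W, C) image to a multiple of `patch` and cut it into tiles."""
    h, w, c = image.shape
    pad_h, pad_w = (-h) % patch, (-w) % patch
    # Assumption: pad with white (255), i.e. the usual slide background.
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=255)
    rows, cols = padded.shape[0] // patch, padded.shape[1] // patch
    tiles = padded.reshape(rows, patch, cols, patch, c).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch, patch, c)


def drop_empty(tiles: np.ndarray, rel_threshold: float = 0.9) -> np.ndarray:
    """Remove tiles whose mean intensity exceeds a fraction of the slide maximum."""
    mean_intensity = tiles.reshape(len(tiles), -1).mean(axis=1)
    keep = mean_intensity <= rel_threshold * mean_intensity.max()
    return tiles[keep]


def standardize(tiles: np.ndarray) -> np.ndarray:
    """Scale every tile to zero mean and unit variance."""
    tiles = tiles.astype(np.float32)
    mean = tiles.mean(axis=(1, 2, 3), keepdims=True)
    std = tiles.std(axis=(1, 2, 3), keepdims=True)
    return (tiles - mean) / np.maximum(std, 1e-8)


# Synthetic stand-in for a slide: white background with one "tissue" region.
rng = np.random.default_rng(0)
slide = np.full((500, 700, 3), 255, dtype=np.uint8)
slide[:224, :224] = rng.integers(50, 150, size=(224, 224, 3), dtype=np.uint8)

tiles = tile_image(slide)            # 3 x 4 grid of tiles after padding
tissue_tiles = drop_empty(tiles)     # only the tile containing tissue remains
instances = standardize(tissue_tiles)
print(tiles.shape, tissue_tiles.shape)
```

Because the threshold is relative to the brightest patch of each slide, the filter adapts to differences in scanner brightness between slides.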

Each non-empty patch was processed by the CNN and the activations of the last layer were stored as the representation of the respective patch. These activations were then used as input for the final classification network, where all patch activations of a single WSI were passed to the network as a mini-batch. For the mean and max pooling MIL methods we calculated the mean or maximum of the resulting network predictions. The attention classifier was designed according to Ilse et al. (2018), where the representation of each instance corresponds to the network activations of one image patch. We use a weighted average of instances, where the weights are the output of a neural network that directly predicts these attention weights. A set of $K$ instances in a single whole slide image is represented by $H = \{h_1, \dots, h_K\}$, the MIL pooling function is

$$z = \sum_{k=1}^{K} a_k h_k$$

and the attention weight is

$$a_k = \frac{\exp\{w^\top \tanh(V h_k^\top)\}}{\sum_{j=1}^{K} \exp\{w^\top \tanh(V h_j^\top)\}},$$

where $w \in \mathbb{R}^{L \times 1}$ and $V \in \mathbb{R}^{L \times M}$ are trainable parameters. Finally, a classification layer with sigmoid activation is applied to $z$ to provide the prediction $\hat{y}$. Again, we used standard SGD for training. Mean and standard deviation of accuracy, F1 score and AUC were calculated by re-training the network 100 times.
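The three pooling strategies can be contrasted in a short NumPy sketch. It follows the attention formulation of Ilse et al. (2018), with a softmax over patches of a small tanh network's output; the symbol names (`H`, `V`, `w`), the layer sizes and the random stand-in features are assumptions for illustration, and the parameters are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_weights(H, V, w):
    """Softmax over patches of w^T tanh(V h_k); H is (K, M), V is (L, M), w is (L,)."""
    scores = np.tanh(H @ V.T) @ w      # one scalar score per patch, shape (K,)
    scores = scores - scores.max()     # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


def attention_pool(H, V, w):
    """Return the attention-weighted average of patch features and the weights."""
    a = attention_weights(H, V, w)
    return a @ H, a


rng = np.random.default_rng(0)
K, M, L = 500, 16, 8                   # patches per slide, feature size, attention size
H = rng.normal(size=(K, M))            # stand-in for stored CNN activations
V = rng.normal(size=(L, M)) * 0.1      # untrained attention parameters
w = rng.normal(size=L)
w_out, b_out = rng.normal(size=M), 0.0 # untrained classification layer

# Attention MIL: pool features first, then classify the slide once.
z, a = attention_pool(H, V, w)
slide_probability = sigmoid(z @ w_out + b_out)

# Mean / max MIL: classify every patch, then pool the predictions.
patch_probabilities = sigmoid(H @ w_out + b_out)
mean_pooled = patch_probabilities.mean()
max_pooled = patch_probabilities.max()

# Patches with the largest attention weights are candidates for key regions.
top_patches = np.argsort(a)[::-1][:5]
print(slide_probability, mean_pooled, max_pooled, top_patches)
```

With `w` set to zero all attention weights become equal and the pooled representation reduces to the plain feature mean, which makes explicit that attention pooling is a learned generalization of averaging.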

Interpretation method. To interpret the predictions of the neural network, we used (1) Integrated Gradients (Sundararajan et al., 2017) for the baseline method and (2) the learned attention weights per image patch of the attention mechanism described above. We visualized the attention weights in order to determine the key regions for the classification of the CNN model (Figure 2).

Figure 2: Visualization of the interpretation methods. Left: Regions of the input image contributing to the prediction according to the Integrated Gradients method are colored in green. Right: Regions of the input image contributing to the prediction according to the weights provided by the attention mechanism. Both Integrated Gradients (left) and the attention mechanism (right) identify similar regions as indicative of the cancer class.
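To make the Integrated Gradients attribution more concrete, the sketch below applies it to a toy logistic model instead of the CNN used in the study. The model, the all-zero baseline input and the number of integration steps are assumptions chosen for illustration; the path integral is approximated with a midpoint Riemann sum.

```python
import numpy as np


def model(x, weights, bias):
    """Toy differentiable classifier: logistic regression."""
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))


def model_grad(x, weights, bias):
    """Analytic gradient of the model output with respect to the input."""
    p = model(x, weights, bias)
    return np.expand_dims(p * (1.0 - p), -1) * weights


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Average the gradient along the straight path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # (steps, D)
    grads = model_grad(path, weights, bias)              # (steps, D)
    return (x - baseline) * grads.mean(axis=0)


rng = np.random.default_rng(0)
weights, bias = rng.normal(size=6), -0.5
x = rng.normal(size=6)
baseline = np.zeros(6)   # assumed reference input, e.g. an empty image

attributions = integrated_gradients(x, baseline, weights, bias)
output_change = model(x, weights, bias) - model(baseline, weights, bias)
print(attributions, attributions.sum(), output_change)
```

The attributions sum approximately to the difference between the model output at the input and at the baseline, which is the completeness property described by Sundararajan et al. (2017).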

3 Experiments and Results

Histological slides of normal skin and of skin sections containing BCCs were stained with hematoxylin and eosin (H/E) and scanned (n=838 slides). Of these, 647 represent BCC cases and 191 show normal skin. This set of 838 images was randomly split into 129 (15%) test images and 709 (85%) training images. 20% of the training set was used as a validation set. The median size of the WSI is 56,896 × 26,198 pixels, with the height ranging from 6,884 to 47,939 pixels and the width ranging from 7,360 to 99,568 pixels. The images were retrospectively collected at the Medical University of Vienna and the Kepler University Hospital, according to ethics votes number 1119/2018 (Ethics Committee Upper Austria) and 2085/2018 (Ethics Committee Medical University of Vienna).
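A random split with the stated set sizes could look like the sketch below. The seed, the rounding of the 20% validation share and the absence of stratification by class are assumptions, since the text only states that the split was random.

```python
import random

n_slides, n_test = 838, 129

indices = list(range(n_slides))
rng = random.Random(0)          # assumed seed, for reproducibility only
rng.shuffle(indices)

test_idx = indices[:n_test]
train_idx = indices[n_test:]    # 709 training slides

# 20% of the training slides are held out for validation (rounded).
n_val = round(0.2 * len(train_idx))
val_idx = train_idx[:n_val]
fit_idx = train_idx[n_val:]

print(len(test_idx), len(train_idx), len(val_idx), len(fit_idx))
```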

We trained the baseline method and three methods based on mean pooling, max pooling, and the attention mechanism described above using the PyTorch framework (Paszke et al., 2017). The results of the attention-based method compared with the baseline method and mean/max pooling MIL on our data set are shown in Table 1. The attention-based pooling method significantly outperformed MIL with max pooling, MIL with mean pooling and an end-to-end CNN trained on down-sampled whole slide images with respect to the AUC. Additionally, we report the accuracy and the F1 score of all predictive methods in Table 1, which lead to the same ranking of the compared methods. The sensitivity and the specificity of the MIL-attention method are .97 ± .01 and .91 ± .03, respectively. In addition to the classification, this attention-based architecture makes it easy to identify the patches that are important for the classification of the network by inspecting the corresponding attention weights (see Figure 2).

| data type | method | accuracy | F1 score | AUC | p-value |
|---|---|---|---|---|---|
| patches | CNN + MIL attention | .96 (.95-.96) | .97 (.97-.98) | .94 (.93-.95) | |
| patches | CNN + MIL max pooling | .92 (.90-.94) | .95 (.94-.96) | .88 (.84-.93) | <0.001 |
| patches | CNN + MIL mean pooling | .88 (.87-.90) | .93 (.92-.94) | .83 (.80-.85) | <0.001 |
| re-scaled WSI | end-to-end CNN | .76 (.70-.82) | .83 (.77-.88) | .90 (.88-.92) | <0.001 |

Table 1: Performance metrics for the compared methods. Displayed values are the mean and, in parentheses, the mean ± standard deviation across 100 re-runs of the training procedure. p-values correspond to the Wilcoxon signed-rank test of AUC between CNN + MIL attention and the other methods.
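The reported metrics can be computed from slide-level labels and predicted scores as in the following sketch. The labels and scores are made-up toy values, not data from the study, and the 0.5 decision threshold is an assumption. The AUC is computed as the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
import numpy as np


def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, F1, sensitivity and specificity at a fixed threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def auc(y_true, scores):
    """Rank-based AUC; assumes both classes are present."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


# Toy example: 1 = BCC, 0 = normal skin (synthetic values).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1])

metrics = classification_metrics(y_true, scores)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

Unlike accuracy and F1, the AUC does not depend on the decision threshold, which is one reason the two kinds of metric can rank methods differently.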

4 Discussion

Histopathology slides represent a gigantic source of information, since they have been collected and stored for decades. However, their computational analysis poses a huge challenge to machine learning techniques due to (1) the ultra-high resolution and (2) the weak labels. We have demonstrated that attention-based pooling and CNNs can be used to detect basal cell carcinomas in histopathology slides, and we have shown how those models can be interpreted and visualized.

Preliminary work. This extended abstract presents ongoing and preliminary work. We are investigating other types of attention mechanisms and running experiments on additional data sets. Currently we use the VGG11 architecture for extracting features due to its simplicity. We plan to compare it with other CNN architectures and to train the whole pipeline in an end-to-end fashion. Furthermore, we want to scale down the WSI stepwise to find the best trade-off between image size and performance. Another avenue we plan to follow is to compare the key regions identified via the attention weights to saliency maps recorded via eye tracking of pathologists during diagnosis.

Acknowledgments. We would like to thank Rene Silye, Gudrun Lang, Giuliana Petronio and Christoph Sinz for their excellent scientific input and their great assistance in data collection. We thank the NVIDIA Corporation, Audi.JKU Deep Learning Center, Audi Electronic Venture GmbH, Janssen Pharmaceutica (MadeSMART), UCB S.A., FFG grant 871302, LIT grant DeepToxGen and AI-SNN, and FWF grant P 28660-N31.


References

  • S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab (2016) Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE transactions on medical imaging 35 (5), pp. 1313–1321.
  • V. P. Chinem and H. A. Miot (2011) Epidemiology of basal cell carcinoma. Anais Brasileiros de Dermatologia 86 (2), pp. 292–305.
  • A. A. Cruz-Roa, J. E. Arevalo Ovalle, A. Madabhushi, and F. A. González Osorio (2013) A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab (Eds.), Berlin, Heidelberg, pp. 403–410.
  • A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115–118.
  • J. Griffin and D. Treanor (2017) Digital pathology in clinical use: Where are we now and what is holding us back?. Histopathology 70 (1), pp. 134–145.
  • H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, A. Kalloo, A. Ben Hadj Hassen, L. Thomas, A. Enk, and L. Uhlmann (2018) Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology (June), pp. 1–7.
  • M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712.
  • A. Janowczyk and A. Madabhushi (2016) Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. Journal of pathology informatics 7.
  • G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-Normalizing Neural Networks. Neural Information Processing Systems (NeurIPS). arXiv:1706.02515.
  • D. Komura and S. Ishikawa (2018) Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal 16, pp. 34–42.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • A. Momeni, M. Thibault, and O. Gevaert (2018) Deep recurrent attention models for histopathological image analysis. BioRxiv, pp. 438341.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch.
  • R. Poplin, A. V. Varadarajan, K. Blumer, Y. Liu, M. V. McConnell, G. S. Corrado, L. Peng, and D. R. Webster (2018) Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2 (3), pp. 158.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic Attribution for Deep Networks. In ICML'17 Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3319–3328. arXiv:1703.01365.
  • J. M. Tomczak, M. Ilse, M. Welling, M. Jansen, H. G. Coleman, M. Lucas, K. de Laat, M. de Bruin, H. Marquering, M. J. van der Wel, et al. (2018) Histopathological classification of precursor lesions of esophageal adenocarcinoma: a deep multiple instance learning approach. International Conference on Medical Imaging with Deep Learning.
  • P. Tschandl, N. Codella, B. N. Akay, G. Argenziano, R. P. Braun, H. Cabo, D. Gutman, A. Halpern, B. Helba, R. Hofmann-Wellenhof, A. Lallas, J. Lapins, C. Longo, J. Malvehy, M. A. Marchetti, A. Marghoob, S. Menzies, A. Oakley, J. Paoli, S. Puig, C. Rinner, C. Rosendahl, A. Scope, C. Sinz, H. P. Soyer, L. Thomas, I. Zalaudek, and H. Kittler (2019) Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. The Lancet Oncology 2045 (19), pp. 1–10.