Skin cancer is the most common form of cancer in the United States [rogers2010incidence, lomas2012systematic], and melanoma is the leading cause of skin cancer-related death [schadendorf2018melanoma]. Automated diagnosis of melanoma from digitized dermoscopy images thus represents an important potential use case for deep learning methods. Inspired by a breakthrough result by Esteva et al. [esteva2017dermatologist], many recent publications claim “better than dermatologist” performance of convolutional neural networks (CNNs) on a variety of skin cancer classification tasks [esteva2017dermatologist, review2018, haenssle2018man, brinker2019deep, Mahbod2019, fujisawa2019deep]. If such models indeed have diagnostic performance comparable to board-certified dermatologists, this heralds a new era in skin cancer care, with standardization of diagnosis and democratization of access [janda2019can, mar2018]. Early diagnosis of melanoma is associated with improved outcomes, but poor availability of well-trained clinicians in many parts of the world means diagnosis is too often made too late. CNNs thus represent an important new technology to address this problem for social good.
How can we evaluate the veracity of these exciting new claims? Unfortunately, privacy constraints typically make it difficult to access training, validation, and test data, as well as final model weights. This makes it impossible to verify the accuracy of these published models and to reproduce their claims [review2018]. As is common in medical settings, there are known biases inherent in the data: lesion classes are unevenly distributed, healthy images are over-represented, racial bias is present (few lesions are from dark-skinned individuals) [brinker2019deep], and there is significant variability in ground-truth labelling [Elmorej2813]. Can we be confident the model has not inherited any of these known biases? A further challenge is the presence of unknown biases in the data. If an artifact is present in images from two diagnostic classes but is more prevalent in one, how do we know whether the model's classifications are weighted by the presence or absence of this artifact?
In light of these problems, it is an open question as to the best strategy to determine if a given model will generalize to future data where the distribution of these biases may be different. Currently there are two approaches: (i) ad hoc techniques that penalize model complexity (batch normalization and dropout, for example), and (ii) training and testing models on larger and more complex data sets. Importantly, neither of these techniques can identify, nor correct, specific biases prior to model deployment.
In this paper, we investigate the possibility that current interpretability methods may assist in this task. Interpretability methods seek to indicate which features of the input data the model regards as important in weighting the final diagnostic decision. While they do not capture the entirety of the predictive process, they can nonetheless provide some guidance as to how a given model makes decisions.
For this study we use publicly available data from HAM10000, a well-curated data set of dermoscopy images collected specifically for use in the machine learning context [ham10000]. The full data set includes seven classes of skin lesions; in this study we concentrate on differentiating between benign naevi (moles) and melanoma, a particularly challenging clinical task. Our data set contains a total of 6017 images, with significant class imbalance: 5403 naevi and 614 malignant melanomas. We retain a balanced set of 200 images of each class as a hold-out test set. It is worth noting that in the clinical context false negatives (predicting naevus when the ground truth is melanoma) have far more serious consequences than false positives (predicting melanoma when the ground truth is naevus). This means we need to ensure the class imbalance is addressed during training: a model trained and tested on the current distribution can achieve high accuracy (88%) simply by always guessing naevus.
The majority of publications in this area use transfer learning from Inception, pre-trained on ImageNet, with an added pooling layer, dense layer and dropout; we follow suit for ease of comparison. We address the class imbalance by first augmenting the melanoma images, obtaining a final set of 1656 melanoma images. We then sample 15 random subsets of 818 images from both classes and train a total of 30 models via a Bayesian hyper-parameter search, searching over learning rate, dropout, momentum, $\beta_1$, $\beta_2$, number of dense nodes, number of epochs, and choice of optimizer (SGD or Adam). (Details of the augmentation, random data sampling and Bayesian hyper-parameter search, together with all code for training and experiments, including the final 30 trained models, can be found at https://github.com/KyleYoung1997/DNNorDermatologist.) The aim is to survey the landscape of possible models, giving us a selection of multiple networks to compare and explore rather than a single, cherry-picked one. The mean AUC over the 30 models is 85% with a variance of 1.8%, and the mean recall is 87%: performance comparable to other published models in this context (e.g. the model of [haenssle2018man] achieved an AUC of 86%). (Note that differences in test set size and distribution mean that direct comparison of model performance via AUC is of limited merit. However, as AUC is the standard metric reported in the literature, we include it here; further comment can be found in the conclusions.) Reported AUC for melanoma identification from dermoscopy images is around 79% for dermatologists [haenssle2018man], and even lower for primary care physicians [raasch1999suspicious]. These results signal that this is indeed a difficult task, for which CNN decision support may prove useful.
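Since AUC is the metric reported throughout, it is worth recalling that it can be computed directly from the rank statistics of the scores, with no thresholding. The sketch below is purely illustrative (the function name and toy scores are ours, not from the paper's code base) and uses the Mann-Whitney formulation: the probability that a randomly chosen melanoma receives a higher score than a randomly chosen naevus.

```python
import numpy as np

def auc_score(y_true, y_score):
    """Rank-based (Mann-Whitney U) estimate of the ROC AUC.

    Equals the probability that a randomly chosen positive (melanoma)
    is scored higher than a randomly chosen negative (naevus),
    with ties counted as 0.5.
    """
    y_true = np.asarray(y_true, dtype=bool)
    pos, neg = y_score[y_true], y_score[~y_true]
    # Explicit pairwise comparison: O(n_pos * n_neg), fine for illustration.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: four predicted melanoma probabilities.
labels = np.array([0, 0, 1, 1])          # 1 = melanoma, 0 = naevus
scores = np.array([0.1, 0.6, 0.4, 0.9])  # model outputs
print(auc_score(labels, scores))         # -> 0.75 (one of four pairs mis-ranked)
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score`) would be used; the point is only that AUC summarizes ranking quality, not the decision threshold that matters clinically.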
It is interesting to note that the variance in AUC across the 30 models is relatively small at 1.8%. While these models share the same basic architecture, they have been trained on different sub-samples of the data using different hyper-parameters, and thus are likely converging on different local optima. This is evidenced by the differences in mis-classified test images across the different models. Interestingly, seventeen images were consistently mis-classified: at least 25 of the 30 models got the class label wrong. For example, the naevus in Fig. 1 was mis-classified by all 30 models as melanoma. Notably, this lesion does arguably satisfy one of the clinical criteria for melanoma. A small human evaluation trial by 3 primary care physicians suggests these images are challenging: scores were 4/17, 5/17 and 6/17.
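The consistency analysis above amounts to a vote count over a per-model prediction matrix. A minimal sketch, with variable names and simulated predictions of our own (in practice the matrix would hold the actual outputs of the 30 trained models):

```python
import numpy as np

rng = np.random.default_rng(0)

n_models, n_images = 30, 400           # 30 trained models, 400 test images
y_true = rng.integers(0, 2, n_images)  # ground truth: 0 = naevus, 1 = melanoma

# preds[m, i] = class predicted by model m for test image i.
# Simulated here: each model agrees with ground truth ~85% of the time.
preds = np.where(rng.random((n_models, n_images)) < 0.85,
                 y_true, 1 - y_true)

errors = (preds != y_true)             # boolean error matrix, shape (30, 400)
errors_per_image = errors.sum(axis=0)  # how many models got each image wrong

# "Consistently mis-classified": at least 25 of the 30 models were wrong.
hard_images = np.flatnonzero(errors_per_image >= 25)
print(len(hard_images))
```

Images that nearly every model gets wrong are natural candidates for closer inspection, as with the naevus of Fig. 1.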
2.3 GradCAM and Kernel SHAP
GradCAM [gradcam2017] and Kernel SHAP [lundberg2017unified]
are both model-agnostic, local interpretability methods. While both highlight pixels that the trained network deems relevant for the final classification, they work in very different ways. GradCAM computes the gradient of the class score (logit) with respect to the feature maps of the final convolutional layer. Formally, consider each input image as a vector $x \in \mathbb{R}^d$, where our model is a function $S : \mathbb{R}^d \to \mathbb{R}^C$, with $C$ the total number of classes. GradCAM provides an “interpretability” map $E : \mathbb{R}^d \to \mathbb{R}^d$ that maps inputs to objects of the same dimension. If $A^k$, $k = 1, \dots, K$, are the feature maps obtained from the last convolutional layer, global average pooling of the gradients gives us a set of neuron importance weights
$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial S^c}{\partial A^k_{ij}},$$
and the final mask corresponds to a ReLU applied to a weighted linear combination of the feature maps and the importance weights:
$$E = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k\Big).$$
More details and examples can be found in [gradcam2017].
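Once the feature maps and their gradients are in hand, the GradCAM mask reduces to a few array operations. A framework-agnostic NumPy sketch (array names are ours; in practice the gradients come from the network's automatic differentiation, and the low-resolution mask is upsampled to the input size):

```python
import numpy as np

def grad_cam_mask(feature_maps, grads):
    """Compute the (unnormalized, low-resolution) GradCAM mask.

    feature_maps: array of shape (K, u, v), the K last-conv-layer maps A^k.
    grads:        array of shape (K, u, v), d(class score)/dA^k.
    """
    # Global average pooling of the gradients: one importance weight per map.
    alphas = grads.mean(axis=(1, 2))                    # shape (K,)
    # Weighted linear combination of the feature maps, then ReLU.
    combo = np.tensordot(alphas, feature_maps, axes=1)  # shape (u, v)
    return np.maximum(combo, 0.0)

# Toy check: 2 feature maps on a 4x4 grid.
A = np.stack([np.ones((4, 4)), np.full((4, 4), 2.0)])
dA = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -1.0)])
mask = grad_cam_mask(A, dA)  # 0.5*1 + (-1)*2 = -1.5, so ReLU zeroes the map
```

The ReLU is essential: only regions that *increase* the class score survive into the final saliency map.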
While there is a large variety of saliency-map methods, recent work has shown that many are in fact independent of the model weights and/or the class labels [adebayo2018sanity]. In these cases it is likely that the model architecture itself constrains the saliency maps to look falsely meaningful: frequently the maps simply act as a variant of edge detector [adebayo2018sanity]. This is particularly dangerous in the context of skin cancer detection, as features at the borders of lesions are often considered diagnostic for melanoma: saliency maps that highlight the edges of a lesion may be misconstrued as clinically meaningful. We use GradCAM in our analysis because it was one of the few methods that passed the recommended sanity checks (we also perform our own checks to confirm this in the present context).
We also investigate the use of Kernel SHAP [lundberg2017unified], an interpretability method that was not among those investigated in [adebayo2018sanity], but which has strong theoretical justification [Molnar2019]. Stronger agreement has been found between Shapley explanations and human explanations than for two other popular saliency methods, LIME and DeepLIFT [lundberg2017unified], further confirming that this is an appropriate method to explore. Based on Shapley values from co-operative game theory [shapley1953value], the method assigns a fair attribution value to each feature based on the contribution that feature makes to the total prediction. The method is proven to be the unique mapping that satisfies a number of reasonable criteria, and is calculated by considering interactions between all possible subsets of features. For $d$ features, calculating the Shapley value of a given feature $i$ requires accounting for the marginal contribution of $i$ to every subset of the remaining features, of which there are $2^{d-1}$. Thus a downside of the original approach is that it scales exponentially in the number of features. Consequently, we use an approximate, computationally feasible method: Kernel SHAP [lundberg2017unified].
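The exponential cost is easy to see in a direct implementation of the Shapley formula. A toy sketch over a hypothetical value function (everything below is illustrative, not the paper's code; in the image setting, "features" are superpixels and the value function is the model's output on masked inputs):

```python
from itertools import combinations
from math import factorial

def shapley_value(d, i, value):
    """Exact Shapley value of feature i among features 0..d-1.

    `value(S)` maps a frozenset of feature indices to the model's payoff.
    Iterates over all 2^(d-1) subsets not containing i, hence exponential in d.
    """
    others = [j for j in range(d) if j != i]
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            S = frozenset(subset)
            # Shapley weight for a coalition of size |S|.
            w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
            total += w * (value(S | {i}) - value(S))
    return total

# Additive toy game: the payoff is a sum of per-feature contributions,
# so each feature's Shapley value equals its own contribution.
contrib = {0: 1.0, 1: 2.0, 2: 4.0}
value = lambda S: sum(contrib[j] for j in S)
print([shapley_value(3, i, value) for i in range(3)])  # approx [1.0, 2.0, 4.0]
```

Kernel SHAP avoids this enumeration by sampling coalitions and solving a weighted linear regression, which is what makes Shapley-style attribution tractable for images.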
2.4 Sanity checks
We perform three simple sanity checks on GradCAM and Kernel SHAP to explore their performance in this context. (i) Reproducibility:
we run the algorithms twice using the same randomly selected model and the same image, then compare the resulting saliency maps visually and via SSIM (the Structural Similarity Index; details can be found in [wang2004image]). GradCAM saliency maps were, unsurprisingly, visually identical, with a perfect SSIM of 1, reflecting the deterministic nature of the algorithm. Kernel SHAP images were visually close to identical, but with SSIM less than perfect (mean 0.92, standard deviation 0.028). This small deviation is unsurprising given that the method requires approximation via random sampling of subsets of features. (ii) Model dependence: using techniques inspired by [adebayo2018sanity], we randomize the weights of selected, progressively shallower layers in a randomly chosen model and recompute the GradCAM and Kernel SHAP images. The idea is to ensure that the saliency maps are not in fact independent of the model weights. Visual comparison and SSIM scores verify that the maps are indeed model dependent; an example can be seen in Fig. 2. (iii) Sensitivity: we compare saliency maps from three models with the same AUC. This test serves to determine the sensitivity of the maps to model weights and also provides insight into differences across models of similar performance. Visual inspection shows variation across three models with identical AUCs of 85% for both methods, with average variation in SSIM of 20% for both GradCAM and Kernel SHAP. An example can be seen in Fig. 3.
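For the reproducibility check, the SSIM comparison would normally use a library implementation (e.g. scikit-image's `structural_similarity`). To stay self-contained, the sketch below uses the simplified single-window form of the SSIM statistic from [wang2004image] with the usual default constants; the arrays are stand-ins for real saliency maps:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified single-window SSIM between two equal-shape arrays.

    The full SSIM of [wang2004image] averages this statistic over local
    windows; one global window is enough for a reproducibility sanity check.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
m1 = rng.random((224, 224))              # saliency map from run 1
m2 = m1 + 0.01 * rng.random((224, 224))  # slightly perturbed re-run

print(global_ssim(m1, m1))  # identical maps -> 1.0
print(global_ssim(m1, m2))  # close to 1 for a small perturbation
```

An SSIM of 1 for repeated GradCAM runs, and slightly below 1 for repeated Kernel SHAP runs, is exactly the pattern reported above.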
2.5 Spurious correlation
The saliency maps show that, at this resolution, the majority of images do not unambiguously capture clinically meaningful information. However, several images suggest that the model is indeed weighting the classification decision using spurious correlations. Notable examples include images where the map highlights the dark corners of the image (e.g. Fig. 3).
There are a number of limitations to this study. The small data size makes over-fitting more likely, thus increasing the chances that we would uncover spurious correlations. Additionally, while our small data size made many tasks computationally and practically feasible, for large test data sets this will not be the case: visual inspection to screen for spurious correlations will likely become impractical. An alternative approach would be to use these methods to provide feedback at the time of prediction: while a saliency map located on the lesion cannot yet be viewed as justification that clinically meaningful correlations have been learned, a map that is clearly located on a clinically irrelevant region could be used to flag a prediction that should be ignored. Our study was also limited to models of a particular architecture; while we justify this as providing a point of comparison with existing published research, future work could include model architecture as a search hyper-parameter.
While the accuracy of our models is good and comparable to human accuracy, it is likely that ensemble methods would improve accuracy further; it is, however, difficult to envision how these interpretability methods could be applied meaningfully to an ensemble. One alternative could be to use the maps themselves to regularize each of the models during training: methods such as GradMask suggest this may be possible [simpson2019gradmask]. Finally, there exist recent alternative methods for implementing Shapley analysis that may well produce better results and permit the use of higher-resolution images [chen2018shapley, aas2019explaining]. These experiments we leave for future work.
There is a significant literature comparing the performance of DNNs and dermatologists on test sets of dermoscopy images. These studies provide credibility for pursuing research in this area, and the next task is to develop techniques that enable DNNs to become valued clinical decision support tools. We have shown that GradCAM and Kernel SHAP maps pass some basic sanity checks and can provide insight into potential sources of bias. However, it is clear that more work is needed before these maps can provide clinically meaningful information. We have also shown that evaluating models according to AUC alone provides limited insight into the true nature of a model's performance: saliency maps show that models with the same AUC can make predictions using completely different rationales.
This work was supported by an Australian Research Council Centre of Excellence for Quantum Engineered Systems grant (CE 110001013).