Diagnostic Accuracy of Content Based Dermatoscopic Image Retrieval with Deep Classification Features

by   Philipp Tschandl, et al.

Background: Automated classification of medical images through neural networks can reach high accuracy rates but lacks interpretability. Objectives: To compare the diagnostic accuracy obtained by using content based image retrieval (CBIR) to retrieve visually similar dermatoscopic images with corresponding disease labels against predictions made by a neural network. Methods: A neural network was trained to predict disease classes on dermatoscopic images from three retrospectively collected image datasets containing 888, 2750 and 16691 images respectively. Diagnosis predictions were made based on the most commonly occurring diagnosis in visually similar images, or based on the top-1 class prediction of the softmax output from the network. Outcome measures were area under the ROC curve for predicting a malignant lesion (AUC), multiclass-accuracy and mean average precision (mAP), measured on unseen test images of the corresponding dataset. Results: In all three datasets the skin cancer predictions from CBIR (evaluating the 16 most similar images) showed AUC values similar to softmax predictions (0.842, 0.806 and 0.852 versus 0.830, 0.810 and 0.847 respectively; p-value>0.99 for all). Similarly, the multiclass-accuracy of CBIR was comparable to softmax predictions. Networks trained for detecting only 3 classes performed better on a dataset with 8 classes when using CBIR than when using softmax predictions (mAP 0.368 with CBIR vs. 0.184 with softmax, and 0.403 vs. 0.198 respectively). Conclusions: Presenting visually similar images based on features from a neural network shows comparable accuracy to the softmax probability-based diagnoses of convolutional neural networks. CBIR may be more helpful than a softmax classifier in improving diagnostic accuracy of clinicians in a routine clinical setting.



I Introduction

Automated analysis of medical images using neural networks has been used in dermatoscopy for more than a decade 1; 2, but has recently gained attention since groups have reported high accuracy rates with convolutional neural networks (CNN) for skin images 3; 4 and dermatoscopy 5, as well as for other medical domains such as fundoscopy 6 or chest X-rays 7. CNNs, in brief, are a group of modern and powerful machine learning models that don’t require explicit handcrafted feature engineering. Rather, they learn to detect visual elements such as colors, shapes and edges by themselves, and combine detections of those internally into a prediction. The only thing needed for them, apart from computing power, is a large number of images and labels to train them, where the labels correspond to the diagnosis in the medical field.

Implementing automated classification models, like a CNN, that output probabilities of diagnoses, or the most probable diagnosis, is deemed desirable for a number of reasons within a health care system. Using patient-based methods could ultimately reduce the need for physicians in areas of scarcity and reduce the burden on the health care system, but is highly problematic in regard to regulations and safety. A more realistic approach is to make decision support systems available to non-specialised physicians; such systems may be easier to implement and have the potential to increase diagnostic accuracy and decrease referral rates. Integrating classification systems into a specialist’s clinical workflow may increase efficiency and free them from spending a large amount of time on easy-to-diagnose cases.

While these effects are undeniably positive, real-world settings can be problematic for classifiers that output the probability of a diagnosis. Accuracy rates for specified cutoffs are commonly reported in experimental settings on digital images with a verification bias, as mainly pathologically verified diagnoses are deemed the gold standard for ground-truth labels 8. Even in sets using expert evaluations as “labels”, the included cases may not contain all, or enough, representations of common banal skin diseases 9. Specialised centers may not bother photo-documenting such common cases due to the additional time required, and given their obvious diagnosis to an expert.
Apart from imperfect accuracy rates of neural networks, unforeseen problems can arise in practical use. This is exemplified by an earlier clinical study using an automated skin lesion classifier, where melanomas were missed simply because they were not photographed by the user 2.
Lastly, classifications of CNNs can be prone to adversarial examples 10, raising questions of liability in misdiagnoses of such systems, or of falsely justifying skin lesion removal on insurance funds for cosmetic or financial incentives.

A solution for these problems is to keep physicians “in the loop” 11 for automated diagnoses. Classification systems could run in the background analyzing images to bring the ones of most concern to a doctor’s attention more quickly. These systems could also be used to continuously audit previously diagnosed cases, where disagreements between the automated classifier and the physician can be flagged and recommended for review. For a successful human-machine collaboration it is key to know why a system makes a specific diagnosis, options being visual question answering or automated captioning 12. For all these systems it is left to the discretion of the user to interpret the results and decide whether they are correct. Herein we explore a different, intuitive and transferable approach for ’explainable’ artificial intelligence (AI), called content based image retrieval (CBIR). With CBIR, the user presents an unknown query image to a system, and cases with similar visual features are automatically retrieved and displayed from a database. Example queries and results of automatically retrieved similar images are shown in Figure 1.

With the increased performance of convolutional neural networks in regard to classification, previous work has found that those networks also learn filters that correspond to visual elements of an image in later layers of a CNN 13. In other words, one set of filters in a CNN could for example respond to whether a brown network is visible, and another one could respond to a group of blue clods. With many filters present in a CNN, and many ways to combine them as an image moves through the network, it is an active research area to try and understand which set of filters corresponds to a given visual structure. However, even without knowing which exact filter detects which structure, taken altogether they can be expressed as a row of simple numbers (called a “feature vector” or “deep features”), representing all visual elements in an image. By comparing how similar these collected numbers of two images are, one can match faces 14, or retrieve visually similar medical data such as histopathologic images 15. Recently, Kawahara et al. 16 used such extracted features of a multi-modality network to query a database for similar images and found it had high sensitivity (94%) but low specificity (36%) for detecting melanoma (73% and 79% respectively for a different diagnostic cutoff).

The goals of this study are:

  • To evaluate whether CBIR based on deep features of a neural network, trained for classification, can provide a diagnostic accuracy comparable to that of its softmax probabilities.

  • To determine how many similar images may be practically needed.

  • To determine whether a CBIR system is transferable to different datasets.

II Methods

Dataset   Total         angioma     bcc           bkl           df          inflammatory  mel           nevus         scc
EDRA      888 (100%)    -           -             69 (7.8%)     -           -             247 (27.8%)   572 (64.4%)   -
ISIC2017  2750 (100%)   -           -             386 (14.0%)   -           -             521 (18.9%)   1843 (67.0%)  -
PRIV      16691 (100%)  203 (1.2%)  3842 (23.0%)  1368 (8.2%)   206 (1.2%)  566 (3.4%)    2276 (13.6%)  5941 (35.6%)  2289 (13.7%)

TABLE I: Presentation of used study datasets with numbers of included diagnoses. EDRA and ISIC2017 contain the same disease classes, whereas the private (PRIV) dataset contains 8 different diagnoses.

II-A Datasets

We compare diagnostic performance of a CBIR system to neural networks using the following 3 datasets:

  • EDRA: A large collection of dermatoscopic images was published alongside the Interactive Atlas of Dermoscopy 17. We filtered the dataset to contain only diagnoses that have more than 50 examples and are consistent with the ISIC2017 dataset. 20% of the images, randomised and stratified by diagnosis, were split off as a test-set to evaluate our method on. Of the remaining cases, 20% were used as a validation set during training to fit network training parameters.

  • ISIC2017: The International Skin Imaging Collaboration (ISIC) 2017 challenge for melanoma recognition published a convenience dataset of dermatoscopic images with fixed training, validation and test splits 8. The diagnoses included in the dataset are melanoma (mel), nevi (nevus) and seborrheic keratoses (bkl).

  • PRIV: We gathered dermatoscopic images that were consecutively collected at a single skin cancer center between 2001 and 2016 for clinical documentation, including pathologic and clinical diagnoses (ethics review board waiver from Ospedaliera di Reggio Emilia, Protocol No. 2011/0027989). We excluded diagnoses with fewer than 150 examples, which resulted in inclusion of the following diagnoses: angioma (incl. angiokeratoma), bcc (basal cell carcinoma), bkl (seborrheic keratoses, solar lentigines and lichen planus-like keratoses), df (dermatofibromas), inflammatory lesions (including dermatitis, lichen sclerosus, porokeratosis, rosacea, psoriasis, lupus erythematosus, bullous pemphigoid, lichen planus, granulomatous processes and artifacts), mel (all types of melanomas), nevus (all types of melanocytic nevi) and scc (squamous cell carcinomas, actinic keratoses and Bowen’s disease). We performed splitting in the same manner as for the EDRA dataset for cases with a pathologic diagnosis. Cases that had no pathologic diagnosis but an expert rating were included only in the training set.

For all datasets, the training-set also represents the pool of images that can be retrieved by the tested CBIR systems. We ensured that images of the same lesion were not spread across the training, validation and test sets. Complete dataset numbers are shown in Table I.

II-B Network architecture and training

In all experiments we use a ResNet-50 architecture 18 with network parameters obtained through training on the ImageNet 19 dataset, which contains 1 million images of 1000 different objects of daily life. This pretraining enables the ResNet-50 architecture to recognize general shape, edge and colour combinations, and reduces the training time needed to adapt it to our specialized task of dermatoscopic image classification. Depending on the dataset used for a given experiment, we modify the size of the last fully-connected layer in the CNN to match the number of classes present, and fine-tune the network. This fully connected layer provides the probability output for every diagnosis, and because this layer processes its numerical input with the softmax function, we refer to its output as ’softmax prediction’. In contrast to Han et al. 4, we don’t define diagnosis-specific thresholds, but rather take the diagnosis with the highest probability value as the final diagnosis prediction. Further training implementation details are given in the Supplementary File.
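The top-1 softmax prediction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the training or inference code of the study: the helper names `softmax` and `top1_prediction` and the example output values are hypothetical, and the class list is the one shared by the two 3-class datasets.

```python
import math

def softmax(logits):
    """Turn the raw outputs of the last fully connected layer into probabilities."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_prediction(logits, class_names):
    """Return the class with the highest softmax probability, and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

classes = ["bkl", "mel", "nevus"]  # classes present in EDRA and ISIC2017
logits = [0.2, 1.9, 1.1]           # hypothetical network outputs for one image
label, prob = top1_prediction(logits, classes)
print(label, round(prob, 3))
```

No diagnosis-specific threshold appears anywhere in this sketch: the class with the largest probability is returned regardless of how close the runner-up is.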

II-B1 CBIR

We pass all images in the retrieval image set through the CNN and collect the output of the deepest feature layer (“pool5”) as our feature vector. This vector consists of 2048 numbers that represent visual features of an image. By calculating the cosine similarity of two such vectors, we get a single number ranging between 0 and 1 that describes how similar the visual elements of two images are. So, to obtain the most visually similar images to a query in this study, we calculate its cosine similarity to every other image in a dataset and sort them by the resulting value. In order to be able to compare CBIR with softmax predictions, we collect the k most similar lesions for every query and regard the frequency of their corresponding disease labels as their probability. For example, if 4 of 5 similar images are a melanoma and one is a nevus, we regard melanoma probability as 0.8 and nevus probability as 0.2.
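The retrieval step can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the study: the function names are hypothetical, and toy 3-number vectors stand in for the 2048-number feature vectors that a real system would extract from the network.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, pool, k):
    """Return (similarity, label) for the k pool images most similar to the query."""
    scored = [(cosine_similarity(query, vector), label) for vector, label in pool]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

def label_frequencies(retrieved):
    """Regard the frequency of each retrieved label as its probability."""
    counts = Counter(label for _, label in retrieved)
    return {label: n / len(retrieved) for label, n in counts.items()}

# Toy retrieval pool: (feature vector, diagnosis). All values are made up.
pool = [
    ([0.9, 0.1, 0.0], "mel"),
    ([0.8, 0.2, 0.1], "mel"),
    ([0.7, 0.3, 0.0], "mel"),
    ([0.9, 0.0, 0.2], "mel"),
    ([0.6, 0.5, 0.1], "nevus"),
    ([0.0, 0.1, 0.9], "bkl"),
    ([0.1, 0.9, 0.1], "nevus"),
]
query = [1.0, 0.2, 0.1]
probs = label_frequencies(retrieve(query, pool, k=5))
print(probs)
```

With this toy pool the five nearest images are four melanomas and one nevus, which reproduces the 0.8 and 0.2 probabilities of the example in the text.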

II-C Metrics and Statistics

The following metrics are calculated for evaluating diagnostic accuracy, where all retrieved images had the same weight during retrieval except for solving ties of specific diagnoses:

  • Area under the ROC curve for detecting skin cancer (AUC): The percentage of malignant retrieval cases (CBIR) or the sum of probabilities of malignant classes (Softmax) is used to calculate ROC curves. Sensitivity and specificity values are likewise calculated for detecting skin cancer with fixed cutoffs of needed malignant examples / probabilities returned (25% and 50% of retrievals). Due to the lack of other malignant classes, this value is equal to the AUC to detect melanoma when testing on the EDRA and ISIC2017 datasets.

  • Multi-class Accuracy: Percentage of all correct specific predictions, where the prediction is made for the class with the highest probability (Softmax) or the most commonly retrieved (CBIR) examples. To avoid tied predictions with CBIR, a minimal linear weighting based on retrieval order (1.00-0.99 distributed evenly along retrieved images) is applied during counting.

  • Multi-class Mean Average Precision (mAP): Briefly, average precision scores for every test-set class are macro-averaged as implemented in sklearn 20, where prediction scores were obtained by either the frequency of the query class in CBIR retrievals or softmax prediction scores. A more detailed description is given in Supplementary File 1.
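The tie-breaking weighting and the two malignancy scores in the list above can be sketched as follows. This is a hedged illustration, not the code of the study: the function names are hypothetical, and the set `MALIGNANT` is an assumption about which class labels count as skin cancer.

```python
from collections import defaultdict

# Assumption for illustration: class labels treated as malignant.
MALIGNANT = {"mel", "bcc", "scc"}

def rank_weights(k):
    """Weights falling linearly from 1.00 (most similar) to 0.99 (least similar)."""
    if k == 1:
        return [1.0]
    return [1.0 - 0.01 * i / (k - 1) for i in range(k)]

def weighted_vote(labels_by_rank):
    """Predict the most commonly retrieved label; the weights mainly break ties."""
    totals = defaultdict(float)
    weights = rank_weights(len(labels_by_rank))
    for label, weight in zip(labels_by_rank, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def malignancy_score_cbir(labels_by_rank):
    """Fraction of retrieved images that carry a malignant label."""
    hits = sum(label in MALIGNANT for label in labels_by_rank)
    return hits / len(labels_by_rank)

def malignancy_score_softmax(probabilities):
    """Sum of the softmax probabilities of all malignant classes."""
    return sum(p for label, p in probabilities.items() if label in MALIGNANT)

# Two melanomas and two nevi: the earlier-ranked class wins the tie.
print(weighted_vote(["mel", "mel", "nevus", "nevus"]))
print(malignancy_score_cbir(["mel", "nevus", "bcc", "nevus"]))
```

Because the weights differ by at most 0.01 between the first and the last retrieved image, a class with one more retrieved example still wins for the retrieval sizes used here (up to 32 images); the weighting only matters when the counts are equal.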

Fig. 2: Measured visual similarity (cosine similarity) of images with the same diagnoses (blue) compared to others (red) in a dataset. Images of the same diagnoses are rated significantly higher in almost every subgroup, showing that automated measurements of visual similarity can differentiate between diagnoses within a retrieval dataset. Lines are drawn between values for a single query image, and rows denote the dataset used for training, queries and image retrieval. Differences were compared with a paired t-test or a paired Wilcoxon signed rank test (W). NS p-value ≥ 0.05, * p-value < 0.05, ** p-value < 0.01, *** p-value < 0.001.

Experiments as well as raw data computation and visualisation are performed with Python (PyTorch 21, sklearn 20 and matplotlib 22) and R Statistics 23; 24. As testing all combinations of CBIR cutoffs (restricted to up to 32 images), datasets and metrics would result in too many comparisons, we restricted formal statistical tests comparing diagnostic metrics to the AUC of ROC curves for detecting skin cancer when retrieving 2, 4, 8, 16, and 32 images, which we believe is a clinically meaningful evaluation. ROC curves are computed using pROC 25 and compared using the DeLong method 26. Paired t-tests are used to compare cosine similarity values after checking for approximate normality. In case of a violation, a paired Wilcoxon signed rank test is used instead. A two-sided p-value below 0.05 is regarded as statistically significant. 95%CI values of ROC curves as well as sensitivity and specificity at specified cutoffs are calculated with 2000 bootstrapped replicates. All p-values are reported adjusted for multiple testing with the Holm method 27 unless otherwise specified. Correction for multiple testing was stopped after the first non-rejection of the null-hypothesis, and therefore no adjusted p-values are reported for the remaining comparisons.
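The Holm step-down procedure with the early stop described above can be sketched in plain Python. This is an illustration under assumptions rather than the R code used for the study: the function name and its `stop_after_first_non_rejection` flag are hypothetical, the example p-values are made up, and an adjusted p-value at or above the significance level is assumed to count as a non-rejection.

```python
def holm_adjust(p_values, alpha=0.05, stop_after_first_non_rejection=False):
    """Holm step-down adjustment; returns adjusted p-values in the input order.

    With the early stop enabled, comparisons that come after the first
    non-rejected hypothesis are left as None (not evaluated).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending raw p-values
    adjusted = [None] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply by m, m-1, ..., 1 and keep the adjusted values non-decreasing.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
        if stop_after_first_non_rejection and running_max >= alpha:
            break  # first non-rejection: remaining comparisons are skipped
    return adjusted

raw = [0.001, 0.04, 0.03, 0.5]  # hypothetical raw p-values
print(holm_adjust(raw))
print(holm_adjust(raw, stop_after_first_non_rejection=True))
```

In the second call only the smallest p-value and the first non-rejected one receive an adjusted value, which mirrors how some comparisons end up without an adjusted p-value.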

III Results

III-A Same-source CBIR and classification

The mean cosine similarities of all retrieval images for all queries of the same data-source were 0.631 (95%CI: 0.628-0.634; EDRA), 0.623 (95%CI: 0.621-0.625; ISIC2017) and 0.638 (95%CI: 0.635-0.640; PRIV). Retrieval images with the same diagnosis had a significantly higher similarity value to a query image compared to those of different classes (0.667 (95%-CI: 0.665-0.669) vs. 0.601 (95%-CI: 0.600-0.603); p < 0.001). Subgroup analyses likewise revealed significant differences for every diagnosis within every dataset (see Figure 2). For accuracy calculations below, the k most similar retrieval images were collected for every query, and the most frequently occurring disease label was counted as the prediction.

Fig. 3: Frequency of correct specific diagnoses (Accuracy) made within each dataset by either softmax based predictions (red), or CBIR with a different number of retrieved similar images (black). Retrieving only a few images already performs better than softmax in the 3-class datasets (EDRA, ISIC2017), whereas in the 8-class (PRIV) dataset it takes over 20 images to approximate softmax based accuracy.

Using these ranked images for diagnostic predictions approximated a classic softmax-based classifier with only a few retrieval cases in regard to multi-class accuracy (Figure 3 and Table II). For the two datasets containing only 3 classes, CBIR outperformed the softmax-based classification and had the highest accuracy when retrieving 8 (EDRA, accuracy 0.762) and 16 similar cases (ISIC2017, accuracy 0.759), whereas in the PRIV dataset the best result with 32 retrievals (accuracy 0.629) was still below the corresponding softmax accuracy of 0.645. As can be seen in Figure 3, using more than 16 retrieved images did not consistently improve the accuracy of CBIR. In all three datasets, showing only two retrieved images resulted in decreased performance in detecting skin cancer as measured by the AUC, where the difference was significant for the 8-class dataset (EDRA 0.782 vs. 0.830, p = 1.0; ISIC2017 0.760 vs. 0.810, p = 0.073; PRIV 0.791 vs. 0.847, p < 0.001).

Dataset   CBIR (k)  Accuracy  Sens 25%          Spec 25%          Sens 50%          Spec 50%          AUC                  p-value  p-value (adj.)
EDRA      2         0.750     72.7 (59.1-84.1)  78.4 (70.7-85.3)  72.7 (59.1-86.4)  78.4 (70.7-85.3)  0.782 (0.703-0.861)  0.151
          4         0.756     88.6 (79.5-97.7)  64.7 (56-73.3)    63.6 (50.0-77.3)  86.2 (79.3-92.2)  0.830 (0.760-0.900)  0.99
          8         0.762     86.4 (75.0-95.5)  70.7 (62.1-79.3)  56.8 (40.9-70.5)  89.7 (84.5-94.8)  0.850 (0.784-0.916)  0.342
          16        0.744     84.1 (72.7-93.2)  68.1 (59.5-76.7)  52.3 (38.6-65.9)  89.7 (83.6-94.8)  0.842 (0.776-0.908)  0.491
          32        0.744     86.4 (75.0-95.5)  69.8 (61.2-77.6)  47.7 (31.8-61.4)  92.2 (87.1-96.6)  0.844 (0.776-0.912)  0.499
          Softmax   0.731     77.3 (65.9-88.6)  75.9 (68.1-83.6)  61.4 (47.7-75.0)  84.5 (77.6-91.4)  0.830 (0.759-0.901)  -        -
ISIC2017  2         0.717     66.7 (58.1-75.2)  83.2 (80-86.7)    66.7 (58.1-75.2)  83.2 (79.7-86.7)  0.760 (0.713-0.807)  0.006    0.073
          4         0.727     76.1 (68.4-83.8)  71.9 (67.5-76.0)  53.8 (44.4-63.2)  89.5 (86.5-92.4)  0.785 (0.737-0.833)  0.118
          8         0.736     70.1 (61.5-78.6)  78.4 (74.7-82.1)  50.4 (41.9-59.8)  90.4 (87.8-93.0)  0.798 (0.751-0.845)  0.431
          16        0.759     70.9 (62.4-78.6)  77.6 (73.9-81.3)  50.4 (41-59.8)    92.8 (90.4-95.2)  0.806 (0.759-0.853)  0.785
          32        0.753     68.4 (59.8-76.9)  77.3 (73.6-81.0)  41.0 (32.5-49.6)  94.3 (92.2-96.3)  0.799 (0.751-0.846)  0.354
          Softmax   0.708     70.9 (62.4-79.5)  74.9 (70.8-78.6)  60.7 (52.1-69.2)  86.7 (83.4-89.5)  0.810 (0.765-0.854)  -        -
PRIV      2         0.594     82.7 (80.2-85.1)  66.8 (63.1-70.6)  82.7 (80.2-85.3)  67.0 (63.1-70.6)  0.791 (0.770-0.813)  <0.001   <0.001
          4         0.614     89.3 (87.2-91.3)  54.4 (50.5-58.3)  79.4 (76.8-81.9)  73.9 (70.3-77.3)  0.822 (0.802-0.843)  0.002    0.032
          8         0.615     87.9 (85.9-89.9)  60.6 (56.9-64.4)  74.7 (72-77.4)    77.9 (74.8-81.2)  0.843 (0.823-0.862)  0.597
          16        0.624     87.4 (85.2-89.5)  63.9 (60-67.8)    74.2 (71.2-76.9)  81.2 (78.1-84.3)  0.852 (0.833-0.871)  0.456
          32        0.629     87.2 (84.9-89.3)  66.2 (62.4-69.8)  73.2 (70.5-75.9)  82.2 (78.9-85.1)  0.859 (0.840-0.878)  0.072
          Softmax   0.645     87.7 (85.5-89.7)  62.6 (58.7-66.3)  75.3 (72.5-78.0)  79.7 (76.6-82.7)  0.847 (0.827-0.867)  -        -

TABLE II: Intra-dataset performance metrics. AUC is calculated for detecting any malignant skin tumor in the corresponding dataset. Sensitivity (Sens) and Specificity (Spec) are calculated at 25% and 50% retrieved malignant cases (CBIR) or predicted probability of malignancy (Softmax). p-values, provided as original values and as p-value (adj.) with correction for multiple testing by the method of Holm 27, denote the difference of CBIR based AUC values to Softmax based ones. An empty p-value (adj.) cell signifies a non-evaluated comparison after correction for multiple testing. Numbers in brackets represent 95% confidence intervals.

Figure 4 shows the ROC curve of the EDRA intra-dataset evaluation when fixing the CBIR output to 16 images, where disregarding a small frequency of malignant cases in the images doesn’t change sensitivity substantially. Fixing the outputs to 16 cases and labeling a query case “malignant” if at least 25% of retrievals show a malignant lesion results in a sensitivity of 84.1% at a specificity of 68.1% in the EDRA dataset, 70.9% and 77.6% in the ISIC2017 dataset, and 87.4% and 63.9% in the PRIV dataset respectively (Table II).
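The fixed-cutoff rule (call a query malignant when at least a given share of its 16 retrievals is malignant) can be sketched as follows. The function name and all counts below are hypothetical toy values chosen for illustration; they are not data from the study.

```python
def sensitivity_specificity(malignant_fractions, is_malignant, cutoff):
    """Sensitivity and specificity when a query is called malignant if the
    fraction of malignant retrieved images reaches the cutoff."""
    tp = fn = tn = fp = 0
    for fraction, truth in zip(malignant_fractions, is_malignant):
        predicted = fraction >= cutoff
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: number of malignant images among 16 retrievals for 8 query lesions.
malignant_counts = [12, 5, 4, 3, 0, 9, 2, 6]
truth = [True, True, True, False, False, True, False, False]
fractions = [count / 16 for count in malignant_counts]

sens25, spec25 = sensitivity_specificity(fractions, truth, cutoff=0.25)
sens50, spec50 = sensitivity_specificity(fractions, truth, cutoff=0.50)
print(sens25, spec25)
print(sens50, spec50)
```

In this toy example, raising the cutoff from 25% to 50% trades sensitivity for specificity, which is the same direction of change that Table II shows between the two cutoffs.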

Fig. 4: ROC for detecting melanoma when retrieving 16 similar images with CBIR (grey), showing different thresholds of needed malignant retrieval images (“predict melanoma when x of 16 retrieved images are melanomas”), as well as with softmax based probabilities (red). Network training, query and retrieval images are from EDRA.

III-B New-source classification

Figure 5 and Table III show mean average precision values of networks trained and tested on different datasets, with different CBIR resource databases used. In other words, the images to be diagnosed, the images from which similar cases are retrieved, and the images the CNN was trained on can all originate from different sources. Softmax based predictions from 3-class-trained networks (EDRA & ISIC2017) perform poorly on the 8-class dataset (PRIV), with mAP values of 0.184 and 0.198 respectively. Using the target source as the CBIR resource improved mAP to up to 0.368 and 0.403 respectively. This is because previously ’unknown’ classes can still be retrieved, as those networks transfer the ability to distinguish diagnoses through visual similarity (see Figure 6). The best CBIR performance is obtained with combinations where training, testing and resource are from the same source.
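Why label frequencies from a target-source retrieval pool can score classes that a 3-class softmax output cannot is illustrated by the sketch below. It uses a common threshold-based definition of average precision (the sum of precision weighted by the increase in recall); the exact definition used in the study is given in its supplementary file and may differ. All function names and score values are invented for illustration.

```python
def average_precision(scores, positives):
    """Average precision as the sum of precision times the increase in recall,
    evaluated at every distinct score threshold (tied scores share a threshold)."""
    n_pos = sum(positives)
    if n_pos == 0:
        return 0.0
    ap, prev_recall = 0.0, 0.0
    for threshold in sorted(set(scores), reverse=True):
        selected = [pos for s, pos in zip(scores, positives) if s >= threshold]
        tp = sum(selected)
        precision, recall = tp / len(selected), tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_average_precision(score_rows, true_labels, classes):
    """Macro-average of per-class average precision over the test-set classes.
    A class missing from a score row gets score 0.0 for that query."""
    aps = []
    for cls in classes:
        scores = [row.get(cls, 0.0) for row in score_rows]
        positives = [label == cls for label in true_labels]
        aps.append(average_precision(scores, positives))
    return sum(aps) / len(aps)

true_labels = ["mel", "bcc", "nevus", "bcc"]  # test set contains an unseen class
test_classes = ["mel", "nevus", "bcc"]

# Softmax of a network that only knows mel, nevus and bkl: never scores "bcc".
softmax_rows = [
    {"mel": 0.7, "nevus": 0.2, "bkl": 0.1},
    {"mel": 0.5, "nevus": 0.1, "bkl": 0.4},
    {"mel": 0.1, "nevus": 0.8, "bkl": 0.1},
    {"mel": 0.3, "nevus": 0.3, "bkl": 0.4},
]
# Label frequencies of images retrieved from a pool that does contain "bcc".
cbir_rows = [
    {"mel": 0.75, "bcc": 0.25},
    {"bcc": 0.75, "mel": 0.25},
    {"nevus": 1.0},
    {"bcc": 0.5, "nevus": 0.5},
]
map_softmax = mean_average_precision(softmax_rows, true_labels, test_classes)
map_cbir = mean_average_precision(cbir_rows, true_labels, test_classes)
print(round(map_softmax, 3), round(map_cbir, 3))
```

In this toy setting the softmax scores cannot rank the unseen class at all, so its average precision falls to the class prevalence, while the retrieval-based scores can rank it because the label comes from the retrieval pool rather than from the output layer.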

Fig. 5: Mean Average Precision (mAP) of a ResNet-50 network trained on EDRA dataset images. Predictions are made either through softmax probabilities (red line) or class-frequencies of CBIR (black). Softmax predictions perform poorly on PRIV dataset images, as the network is not able to predict 5 of the 8 classes in any case (first two columns, bottom row). CBIR retrieving from EDRA and ISIC2017 suffers from the same shortcomings, but is able to predict better when using PRIV-source retrieval images (bottom right). In general, CBIR performs best when using retrieval images from the same source as the test images (descending diagonal), and here performs better on new data than softmax predictions. Re-training the network on those new source images (blue) in turn outperforms CBIR again.

TRAIN     TEST      CBIR      k=2    k=4    k=8    k=16   k=32   Softmax
EDRA      EDRA      EDRA      0.632  0.681  0.702  0.748  0.775  0.761
EDRA      EDRA      ISIC2017  0.466  0.507  0.579  0.638  0.662
EDRA      EDRA      PRIV      0.405  0.490  0.520  0.563  0.573
EDRA      ISIC2017  EDRA      0.385  0.417  0.429  0.444  0.444  0.456
EDRA      ISIC2017  ISIC2017  0.465  0.513  0.576  0.585  0.582
EDRA      ISIC2017  PRIV      0.388  0.398  0.425  0.438  0.445
EDRA      PRIV      EDRA      0.154  0.161  0.165  0.172  0.177  0.184
EDRA      PRIV      ISIC2017  0.150  0.163  0.170  0.179  0.188
EDRA      PRIV      PRIV      0.249  0.284  0.310  0.338  0.368
ISIC2017  EDRA      EDRA      0.524  0.591  0.583  0.624  0.604  0.524
ISIC2017  EDRA      ISIC2017  0.410  0.448  0.487  0.488  0.512
ISIC2017  EDRA      PRIV      0.374  0.416  0.441  0.453  0.459
ISIC2017  ISIC2017  EDRA      0.376  0.403  0.459  0.504  0.537  0.745
ISIC2017  ISIC2017  ISIC2017  0.583  0.654  0.697  0.725  0.734
ISIC2017  ISIC2017  PRIV      0.405  0.423  0.439  0.468  0.483
ISIC2017  PRIV      EDRA      0.149  0.158  0.167  0.175  0.182  0.198
ISIC2017  PRIV      ISIC2017  0.159  0.172  0.183  0.191  0.200
ISIC2017  PRIV      PRIV      0.269  0.316  0.377  0.389  0.403
PRIV      EDRA      EDRA      0.514  0.597  0.637  0.647  0.640  0.641
PRIV      EDRA      ISIC2017  0.434  0.465  0.498  0.540  0.566
PRIV      EDRA      PRIV      0.543  0.552  0.582  0.597  0.629
PRIV      ISIC2017  EDRA      0.371  0.403  0.434  0.458  0.475  0.551
PRIV      ISIC2017  ISIC2017  0.543  0.596  0.649  0.667  0.688
PRIV      ISIC2017  PRIV      0.419  0.446  0.468  0.498  0.528
PRIV      PRIV      EDRA      0.152  0.161  0.167  0.171  0.177  0.598
PRIV      PRIV      ISIC2017  0.158  0.169  0.181  0.188  0.197
PRIV      PRIV      PRIV      0.405  0.472  0.517  0.545  0.568

TABLE III: Mean average precision between datasets. TRAIN denotes the dataset the ResNet-50 architecture was trained on, TEST the origin of test images, and CBIR the origin of retrieval images. The Softmax value is listed once per TRAIN/TEST combination. While CBIR is able to approximate Softmax-based predictions between the 3-class datasets (EDRA and ISIC2017) when using same-source TEST and CBIR sets, it outperforms 3-class trained networks on the 8-class PRIV dataset as it is able to recognise unseen classes through the larger resource dataset.
Fig. 6: Mean cosine similarities of PRIV retrieval images with the same (blue) or different (red) diagnosis for the corresponding PRIV query images. Cosine similarity is calculated by feature extraction of ResNet-50 networks trained for classification on different training datasets (rows). Compared to the PRIV-trained network, those trained on different sources (rows EDRA and ISIC2017) transfer their ability to distinguish specific diagnoses through visual similarity, except for bkl cases. Lines are drawn between values for the same query image. W denotes use of a paired Wilcoxon signed rank test instead of a paired t-test. NS p-value ≥ 0.05, * p-value < 0.05, ** p-value < 0.01, *** p-value < 0.001; grey indicators denote non-adjusted p-values, as these comparisons were omitted during correction for multiple testing (see statistics section).

IV Discussion

Current convolutional neural network (CNN) classifiers perform well but commonly behave as black-boxes during inference and preclude meaningful integration of their findings into a clinical decision process. Having an intuitive, ’explainable’ output of an automated classifier which complements - rather than overrides - a clinical decision process may be more desirable and can enhance efficient use of health care workers. Compared to other techniques for explainable AI 28 such as image captioning and visual question answering 12, we hypothesize that showing similar cases with their ground truth may be even more intuitive. Similar images found by CBIR further comprehensibly reveal the knowledge base of a network decision and may indicate when not to trust the automated system. More specifically, if users notice that retrieved cases look nothing like the query image, they could intuitively decide that the CNN cannot help in that case. Herein we show that CBIR can perform on par with softmax-based predictions of a ResNet-50 network on accuracy of skin cancer detection, as well as multi-class accuracy and mean average precision (Table II).
We describe reasonably good metrics for formal evaluation of a CBIR system, but more current architectures may be able to reach even higher accuracy. We hypothesise that with increasing accuracy of a network, the accuracy of CBIR will rise accordingly. The true advantage of CBIR may lie in the fact that a human reader can pick the most fitting and relevant examples from the provided image-subset and is not restricted to the strict counting and weighting used for calculations in this manuscript. We suspect that having such a ’human-in-the-loop’ would give a much higher diagnostic precision in practice, which should be the subject of future studies.

Deep learning literature dealing with image classification commonly presents accuracy metrics measured on the same dataset-source incorporating the same diagnostic classes. Relying on those experimental results when implementing an automated classifier in clinical practice may be precarious, as an end-user may take images with a different camera, on patients with different skin types, with different class distributions - and even disease classes the network has not encountered before. For these reasons a classifier with a fixed set of diagnoses may fail in unexpected ways, which would go unnoticed if the output is merely a probability of specific diagnoses. Neural networks trained for classification are by design limited to predicting classes they have seen during the training period. Currently, to our knowledge, no available dataset comes close to encompassing all clinically possible classes. Further, class definitions of medical entities may change over time with new biologic insights. The CBIR method described herein shows that classifiers knowing only 3 classes are able to generalise better to a new dataset with 8 classes than their softmax based predictions (Table III). The highest accuracy can still be obtained through finetuning a network on the target data source (blue lines in Figure 5), but such a re-training period may not be feasible when retrieval data-sources are not accessible for training due to data protection regulations or lack of machine learning resources.

In contrast to decision support systems with a fixed performance and cutoff that need to undergo clinical testing 29, CBIR as a dynamic, and potentially vendor-independent, decision support system may be easier to expand and update in practice with growing search datasets and improved models.

IV-A Limitations

As the results from a previous study by Kawahara et al. 16 were not public until the end of our experiments, we did not perform a sample size calculation, so this work needs to be regarded as an exploratory pilot study. We trained the ResNet-50 architecture on the datasets with reasonable effort on fine-tuning, data augmentation and hyperparameter tuning, but did not pursue maximum classification accuracy. Therefore, achievable values may be higher, as shown by Han et al. 4, but we expect a better classifier using a larger image dataset to improve CBIR in a similar way. All data herein suffer from selection bias (the lesions were found worthwhile to be photographed by a physician) and verification bias. A user-focused and prospective analysis of such a decision support system will be able to give more insight into clinical applicability.

Document retrieval studies usually use a different set of metrics where mean average precision is defined differently. We chose the used metrics and definitions to reflect clinically meaningful outcomes rather than retrieval performance.

V Conclusion

In this work we show that automated retrieval of a few visually similar dermatoscopic images approximates the accuracy of softmax-based prediction probabilities. Further, CBIR may improve the performance of trained networks on new datasets and unseen classes when there is no possibility of fine-tuning a network on new data.