Automated analysis of medical images with neural networks has been used in dermatoscopy for more than a decade [1, 2], but has recently gained attention as groups reported high accuracy with convolutional neural networks (CNNs) for clinical skin images [3, 4] and dermatoscopy [5], as well as for other medical domains such as fundoscopy [6] or chest X-rays [7]. CNNs, in brief, are a family of modern and powerful machine learning models that do not require explicit handcrafted feature engineering. Instead, they learn to detect visual elements such as colors, shapes and edges on their own, and internally combine these detections into a prediction. Apart from computing power, all they need is a large number of images and labels to train on, where in the medical field the labels correspond to diagnoses.
Implementing automated classification models such as CNNs, which output probabilities of diagnoses or the most probable diagnosis, is deemed desirable within a health care system for a number of reasons. Patient-facing applications could ultimately reduce the need for physicians in areas of scarcity and lower the burden on the health care system, but they are highly problematic with regard to regulation and safety. A more realistic approach is to make decision support systems available to non-specialised physicians; these may be easier to implement and have the potential to increase diagnostic accuracy and decrease referral rates. Integrating classification systems into a specialist's clinical workflow may increase efficiency and free specialists from spending a large amount of time on easy-to-diagnose cases.
While these effects are undeniably positive, real-world settings can be problematic for classifiers that output the probability of a diagnosis. Accuracy rates for specified cutoffs are commonly reported in experimental settings on digital images with a verification bias, as mainly pathologically verified diagnoses are deemed the gold standard for ground-truth labels [8]. Even sets using expert evaluations as labels may not include enough representations of common, banal skin diseases [9]. Specialised centers may not bother photo-documenting such common cases, given the additional time required and their obvious diagnosis to an expert.
Apart from imperfect accuracy rates of neural networks, unforeseen problems can arise in practical use. This is exemplified by an earlier clinical study of an automated skin lesion classifier, in which melanomas were missed simply because they were not photographed by the user [2].
Lastly, classifications of CNNs can be prone to adversarial examples [10], raising questions of liability for misdiagnoses by such systems, or of fraudulently justifying skin lesion removal to insurance funds for cosmetic or financial incentives.
A solution to these problems is to keep physicians "in the loop" [11] for automated diagnoses. Classification systems could run in the background, analyzing images to bring those of most concern to a doctor's attention more quickly. They could also continuously audit previously diagnosed cases, flagging disagreements between the automated classifier and the physician for review. For a successful human-machine collaboration it is key to know why a system makes a specific diagnosis; options include visual question answering and automated captioning [12]. For all these systems, it is left to the discretion of the user to interpret the results and decide whether they are correct. Herein we explore a different, intuitive and transferable approach to 'explainable' artificial intelligence (AI), called content-based image retrieval (CBIR). With CBIR, the user presents an unknown query image to a system, and cases with similar visual features are automatically retrieved from a database and displayed. Example queries and results of automatically retrieved similar images are shown in Figure 1.
Alongside the increased classification performance of convolutional neural networks, previous work has found that the later layers of a CNN also learn filters that correspond to visual elements of an image [13]. In other words, one set of filters in a CNN could, for example, respond to whether a brown network is visible, and another to a group of blue clods. With many filters present in a CNN, and many ways to combine them as an image moves through the network, understanding which set of filters corresponds to a given visual structure is an active research area. However, even without knowing which exact filter detects which structure, taken altogether they can be expressed as a row of numbers (called a "feature vector" or "deep features") representing all visual elements in an image. By comparing how similar these numbers are for two images, one can match faces [14] or retrieve visually similar medical data such as histopathologic images [15]. Recently, Kawahara et al. [16] used such extracted features of a multi-modality network to query a database for similar images and found high sensitivity (94%) but low specificity (36%) for detecting melanoma (73% and 79%, respectively, for a different diagnostic cutoff).
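The comparison of feature vectors described above can be sketched in a few lines: two images are considered similar when the cosine similarity of their deep features is high. The short vectors below are synthetic stand-ins for real deep features, which would have far more entries.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot product of the two vectors, normalised by their lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# synthetic 'deep features' of three images
img_a = np.array([0.9, 0.1, 0.4])
img_b = np.array([0.8, 0.2, 0.5])   # visually similar to img_a
img_c = np.array([0.0, 1.0, 0.1])   # visually different from img_a

sim_ab = cosine_similarity(img_a, img_b)  # close to 1
sim_ac = cosine_similarity(img_a, img_c)  # close to 0
```

Because deep features after a ReLU layer are non-negative, the similarity of such vectors falls between 0 and 1.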
The goals of this study are:
To evaluate whether CBIR based on deep features of a neural network trained for classification can provide diagnostic accuracy comparable to its softmax probabilities.
To determine how many similar images may be practically needed.
To determine whether a CBIR system is transferable to different datasets.
Table I: Number of images (percent) per diagnosis and dataset.

| Dataset | Total | angioma | bcc | bkl | df | inflammatory | mel | nevus | scc |
|---|---|---|---|---|---|---|---|---|---|
| EDRA | 888 (100%) | - | - | 69 (7.8%) | - | - | 247 (27.8%) | 572 (64.4%) | - |
| ISIC2017 | 2750 (100%) | - | - | 386 (14.0%) | - | - | 521 (18.9%) | 1843 (67.0%) | - |
| PRIV | 16691 (100%) | 203 (1.2%) | 3842 (23.0%) | 1368 (8.2%) | 206 (1.2%) | 566 (3.4%) | 2276 (13.6%) | 5941 (35.6%) | 2289 (13.7%) |
We compare the diagnostic performance of a CBIR system to that of neural networks on the following three datasets:
EDRA: A large collection of dermatoscopic images published alongside the Interactive Atlas of Dermoscopy [17]. We filtered the dataset to contain only diagnoses that had more than 50 examples and were consistent with the ISIC2017 dataset. 20% of the images, randomised and stratified by diagnosis, were split off as a test set for evaluating our method. Of the remaining cases, 20% were used for validation during training to fit network training parameters.
ISIC2017: The International Skin Imaging Collaboration (ISIC) 2017 challenge for melanoma recognition published a convenience dataset of dermatoscopic images with fixed training, validation and test splits [8]. The diagnoses included in the dataset are melanoma (mel), nevi (nevus) and seborrheic keratoses (bkl).
PRIV: We gathered dermatoscopic images that were consecutively collected at a single skin cancer center between 2001 and 2016 for clinical documentation, including pathologic and clinical diagnoses (ethics review board waiver from Ospedaliera di Reggio Emilia, Protocol No. 2011/0027989). We excluded diagnoses with fewer than 150 examples, which resulted in inclusion of the following diagnoses: angioma (incl. angiokeratoma), bcc (basal cell carcinoma), bkl (seborrheic keratoses, solar lentigines and lichen planus-like keratoses), df (dermatofibromas), inflammatory lesions (including dermatitis, lichen sclerosus, porokeratosis, rosacea, psoriasis, lupus erythematosus, bullous pemphigoid, lichen planus, granulomatous processes and artifacts), mel (all types of melanomas), nevus (all types of melanocytic nevi) and scc (squamous cell carcinomas, actinic keratoses and Bowen's disease). For cases with a pathologic diagnosis, we performed splitting in the same manner as for the EDRA dataset. Cases without a pathologic diagnosis but with an expert rating were included only in the training set.
For all datasets, the training set also represents the pool of images that can be retrieved by the tested CBIR systems. We ensured that images of the same lesion were not spread across training, validation and test sets. Complete dataset numbers are shown in Table I.
II-B Network architecture and training
In all experiments we use a ResNet-50 architecture [18] with network parameters obtained through training on the ImageNet [19] dataset, which contains over 1 million images of 1000 different everyday objects. This pretraining enables the ResNet-50 architecture to recognize general shape, edge and colour combinations, and reduces the training time needed to adapt it to our specialized task of dermatoscopic image classification. Depending on the dataset used for a given experiment, we modify the size of the last fully-connected layer in the CNN to match the number of classes present, and fine-tune the network. This fully-connected layer provides the probability output for every diagnosis, and because it processes its numerical input with the softmax function, we refer to its output as the 'softmax prediction'. In contrast to Han et al. [4], we do not define diagnosis-specific thresholds, but rather take the diagnosis with the highest probability value as the final prediction. Further training implementation details are given in the Supplementary File.
We pass every image in the retrieval set through the CNN and collect the output of the deepest feature layer ("pool5") as our feature vector. This vector consists of 2048 numbers that represent the visual features of an image. By calculating the cosine similarity of two such vectors, we obtain a single number between 0 and 1 describing how similar the visual elements of the two images are. To obtain the images most visually similar to a query, we calculate its cosine similarity to every other image in a dataset and sort them by the resulting value. To be able to compare CBIR with softmax predictions, we collect the k most similar lesions for every query and regard the frequency of their disease labels as probabilities. For example, if 4 of 5 similar images are melanomas and one is a nevus, we regard melanoma probability as 0.8 and nevus probability as 0.2.
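The retrieval step above can be sketched as follows: a database of feature vectors is ranked by cosine similarity to the query, and the labels of the k most similar images are converted into class 'probabilities'. The 2-dimensional features and labels below are synthetic stand-ins for real 2048-dimensional pool5 activations.

```python
import numpy as np
from collections import Counter

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cbir_probabilities(query, db_features, db_labels, k):
    # rank the database by similarity to the query, descending
    sims = [cosine_sim(query, f) for f in db_features]
    top_k = np.argsort(sims)[::-1][:k]
    # label frequencies among the k most similar images act as probabilities
    counts = Counter(db_labels[i] for i in top_k)
    return {label: n / k for label, n in counts.items()}

# toy 2-dimensional 'feature vectors' (real ones have 2048 entries)
db_features = np.array([[1.0, 0.0], [0.95, 0.05], [0.9, 0.1],
                        [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.0, 1.0]])
db_labels = ['mel', 'mel', 'mel', 'mel', 'nevus', 'nevus', 'nevus']
probs = cbir_probabilities(np.array([1.0, 0.0]), db_features, db_labels, k=5)
# 4 of the 5 nearest neighbours are melanomas -> {'mel': 0.8, 'nevus': 0.2}
```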
II-C Metrics and Statistics
The following metrics are calculated to evaluate diagnostic accuracy; all retrieved images carry the same weight, except for resolving ties between specific diagnoses:
Area under the ROC curve for detecting skin cancer (AUC): The percentage of malignant cases among retrievals (CBIR), or the sum of probabilities of malignant classes (softmax), is used to calculate ROC curves. Sensitivity and specificity for detecting skin cancer are likewise calculated at fixed cutoffs of 25% and 50% of required malignant retrievals or probability. Due to the lack of other malignant classes, this value is equal to the AUC for detecting melanoma when testing on the EDRA and ISIC2017 datasets.
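A small sketch of this metric, assuming scikit-learn is available: the fraction of malignant images among a query's retrievals serves as a malignancy score, from which the ROC AUC and sensitivity/specificity at a fixed 25% cutoff are derived. The scores and ground-truth labels below are synthetic illustrations.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# ground truth (1 = malignant) and malignancy scores, i.e. the fraction of
# malignant images among each query's retrievals
y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.8, 0.6, 0.2, 0.4, 0.1, 0.0])

auc = roc_auc_score(y_true, scores)

# sensitivity/specificity at a fixed 25% cutoff
y_pred = scores >= 0.25
sensitivity = (y_pred & (y_true == 1)).sum() / (y_true == 1).sum()
specificity = (~y_pred & (y_true == 0)).sum() / (y_true == 0).sum()
```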
Multi-class accuracy (Accuracy): Percentage of all correct specific predictions, where the prediction is made for the class with the highest probability (softmax) or the most commonly retrieved class (CBIR). To avoid tied predictions with CBIR, a minimal linear weighting based on retrieval order (1.00 down to 0.99, distributed evenly along the retrieved images) is applied during counting.
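The tie-breaking rule can be sketched as follows: each of the k retrieved images receives a weight evenly spaced from 1.00 (most similar) down to 0.99 (least similar), so that when two classes are retrieved equally often, the class ranked earlier wins.

```python
from collections import defaultdict

def weighted_vote(labels_in_rank_order):
    # labels are ordered by similarity, most similar first
    k = len(labels_in_rank_order)
    votes = defaultdict(float)
    for rank, label in enumerate(labels_in_rank_order):
        # weights evenly spaced from 1.00 down to 0.99
        weight = 1.0 if k == 1 else 1.0 - 0.01 * rank / (k - 1)
        votes[label] += weight
    return max(votes, key=votes.get)
```

With a 2-vs-2 tie such as ['mel', 'nevus', 'mel', 'nevus'], 'mel' wins because its examples are ranked earlier, while a clear 2-vs-1 majority is unaffected by the near-unit weights.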
Multi-class mean average precision (mAP): Briefly, average precision scores for every test-set class are macro-averaged, as implemented in scikit-learn [20], where prediction scores are obtained either from the frequency of the query class among CBIR retrievals or from softmax prediction scores. A more detailed description is given in Supplementary File 1.
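A sketch of the macro-averaging, assuming scikit-learn: a one-vs-rest average precision is computed per class and the class-wise scores are averaged. The labels and per-class scores below are synthetic, with one case deliberately mis-scored.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1, 2, 2])      # true class per test image
scores = np.array([[0.9, 0.1, 0.0],        # per-class prediction scores
                   [0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.1, 0.8, 0.55],       # deliberately mis-scored case
                   [0.0, 0.4, 0.6]])

# one-vs-rest average precision per class, then macro-averaged
aps = [average_precision_score((y_true == c).astype(int), scores[:, c])
       for c in range(scores.shape[1])]
mAP = float(np.mean(aps))
```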
Experiments as well as raw data computation and visualisation are performed with Python (PyTorch [21], scikit-learn [20] and matplotlib [22]) and R [23, 24]. As testing all combinations of CBIR cutoffs (restricted to up to 32 images), datasets and metrics would result in too many comparisons, we restricted formal statistical tests comparing diagnostic metrics to the AUC of the ROC for detecting skin cancer when retrieving 2, 4, 8, 16 and 32 images, which we believe is a clinically meaningful evaluation. ROC curves are computed using pROC [25] and compared using the DeLong method [26]. Paired t-tests are used to compare cosine similarity values after checking for approximate normality; in case of a violation, the paired Wilcoxon signed-rank test is used instead. A two-sided p-value below 0.05 is regarded as statistically significant. 95% CI values of ROC curves, as well as sensitivity and specificity at specified cutoffs, are calculated with 2000 bootstrapped replicates. All p-values are reported adjusted for multiple testing with the Holm method [27] unless otherwise specified. Correction for multiple testing was stopped after the first non-rejection of the null hypothesis, and therefore no adjusted p-values are reported for the remaining comparisons.
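For illustration, the Holm step-down adjustment [27] can be sketched in pure Python (the study itself uses R for this): p-values are sorted ascending, the i-th smallest is multiplied by the number of remaining hypotheses, and monotonicity is enforced.

```python
def holm_adjust(p_values):
    # step-down Holm adjustment of a list of raw p-values
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [None] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)  # adjusted p-values never decrease
        adjusted[i] = running_max
    return adjusted

adjusted = holm_adjust([0.01, 0.04, 0.03])
# smallest p is multiplied by 3, the next by 2, the largest by 1
```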
III-A Same-source CBIR and classification
The mean cosine similarities of all retrieval images for all queries of the same data source were 0.631 (95% CI: 0.628-0.634; EDRA), 0.623 (95% CI: 0.621-0.625; ISIC2017) and 0.638 (95% CI: 0.635-0.640; PRIV). Retrieval images with the same diagnosis had a significantly higher similarity to a query image than those of different classes (0.667 (95% CI: 0.665-0.669) vs. 0.601 (95% CI: 0.600-0.603); p<0.001). Subgroup analyses likewise revealed significant differences for every diagnosis within every dataset (see Figure 2). For the accuracy calculations below, the most similar retrieval images were collected for every query, and the most frequently occurring disease label counted as the prediction.
Using these ranked images for diagnostic predictions approximated a classic softmax-based classifier with only a few retrieval cases in regard to multi-class accuracy (Figure 3 and Table II). For the two datasets containing only 3 classes, CBIR outperformed softmax-based classification, with the highest accuracy when retrieving 8 (EDRA, accuracy 0.762) and 16 similar cases (ISIC2017, accuracy 0.759), whereas in the PRIV dataset the best result with 32 retrievals (accuracy 0.629) was still below the corresponding softmax accuracy of 0.645. As can be seen in Figure 3, using more than 16 retrieved images did not consistently improve the accuracy of CBIR. In all three datasets, showing only two retrieved images resulted in decreased performance in detecting skin cancer as measured by the AUC, where the difference was significant for the 8-class dataset (EDRA 0.782 vs. 0.830, p=1.0; ISIC2017 0.760 vs. 0.810, p=0.073; PRIV 0.791 vs. 0.847, p<0.001).
Table II: Diagnostic accuracy of CBIR with k retrieved images versus softmax predictions, per dataset. p-values compare the respective AUC to the softmax AUC (unadjusted and Holm-adjusted).

| Dataset | k | Accuracy | Sens. @25% (95% CI) | Spec. @25% (95% CI) | Sens. @50% (95% CI) | Spec. @50% (95% CI) | AUC (95% CI) | p | p (adj.) |
|---|---|---|---|---|---|---|---|---|---|
| EDRA | 2 | 0.750 | 72.7 (59.1-84.1) | 78.4 (70.7-85.3) | 72.7 (59.1-86.4) | 78.4 (70.7-85.3) | 0.782 (0.703-0.861) | 0.151 | – |
| | 4 | 0.756 | 88.6 (79.5-97.7) | 64.7 (56-73.3) | 63.6 (50.0-77.3) | 86.2 (79.3-92.2) | 0.830 (0.760-0.900) | 0.99 | – |
| | 8 | 0.762 | 86.4 (75.0-95.5) | 70.7 (62.1-79.3) | 56.8 (40.9-70.5) | 89.7 (84.5-94.8) | 0.850 (0.784-0.916) | 0.342 | – |
| | 16 | 0.744 | 84.1 (72.7-93.2) | 68.1 (59.5-76.7) | 52.3 (38.6-65.9) | 89.7 (83.6-94.8) | 0.842 (0.776-0.908) | 0.491 | – |
| | 32 | 0.744 | 86.4 (75.0-95.5) | 69.8 (61.2-77.6) | 47.7 (31.8-61.4) | 92.2 (87.1-96.6) | 0.844 (0.776-0.912) | 0.499 | – |
| | Softmax | 0.731 | 77.3 (65.9-88.6) | 75.9 (68.1-83.6) | 61.4 (47.7-75.0) | 84.5 (77.6-91.4) | 0.830 (0.759-0.901) | - | - |
| ISIC2017 | 2 | 0.717 | 66.7 (58.1-75.2) | 83.2 (80-86.7) | 66.7 (58.1-75.2) | 83.2 (79.7-86.7) | 0.760 (0.713-0.807) | 0.006 | 0.073 |
| | 4 | 0.727 | 76.1 (68.4-83.8) | 71.9 (67.5-76.0) | 53.8 (44.4-63.2) | 89.5 (86.5-92.4) | 0.785 (0.737-0.833) | 0.118 | – |
| | 8 | 0.736 | 70.1 (61.5-78.6) | 78.4 (74.7-82.1) | 50.4 (41.9-59.8) | 90.4 (87.8-93.0) | 0.798 (0.751-0.845) | 0.431 | – |
| | 16 | 0.759 | 70.9 (62.4-78.6) | 77.6 (73.9-81.3) | 50.4 (41-59.8) | 92.8 (90.4-95.2) | 0.806 (0.759-0.853) | 0.785 | – |
| | 32 | 0.753 | 68.4 (59.8-76.9) | 77.3 (73.6-81.0) | 41.0 (32.5-49.6) | 94.3 (92.2-96.3) | 0.799 (0.751-0.846) | 0.354 | – |
| | Softmax | 0.708 | 70.9 (62.4-79.5) | 74.9 (70.8-78.6) | 60.7 (52.1-69.2) | 86.7 (83.4-89.5) | 0.810 (0.765-0.854) | - | - |
| PRIV | 2 | 0.594 | 82.7 (80.2-85.1) | 66.8 (63.1-70.6) | 82.7 (80.2-85.3) | 67.0 (63.1-70.6) | 0.791 (0.770-0.813) | 0.001 | 0.001 |
| | 4 | 0.614 | 89.3 (87.2-91.3) | 54.4 (50.5-58.3) | 79.4 (76.8-81.9) | 73.9 (70.3-77.3) | 0.822 (0.802-0.843) | 0.002 | 0.032 |
| | 8 | 0.615 | 87.9 (85.9-89.9) | 60.6 (56.9-64.4) | 74.7 (72-77.4) | 77.9 (74.8-81.2) | 0.843 (0.823-0.862) | 0.597 | – |
| | 16 | 0.624 | 87.4 (85.2-89.5) | 63.9 (60-67.8) | 74.2 (71.2-76.9) | 81.2 (78.1-84.3) | 0.852 (0.833-0.871) | 0.456 | – |
| | 32 | 0.629 | 87.2 (84.9-89.3) | 66.2 (62.4-69.8) | 73.2 (70.5-75.9) | 82.2 (78.9-85.1) | 0.859 (0.840-0.878) | 0.072 | – |
| | Softmax | 0.645 | 87.7 (85.5-89.7) | 62.6 (58.7-66.3) | 75.3 (72.5-78.0) | 79.7 (76.6-82.7) | 0.847 (0.827-0.867) | - | - |
"–" signifies non-evaluated comparisons after correction for multiple testing. Numbers in brackets represent 95% confidence intervals.
Figure 4 shows the ROC curve of the EDRA intra-dataset evaluation when fixing the CBIR output to 16 images, where disregarding a small frequency of malignant cases among the retrieved images does not change sensitivity substantially. Fixing the output to 16 cases and labeling a query case "malignant" if at least 25% of retrievals show a malignant lesion results in a sensitivity of 84.1% at a specificity of 68.1% in the EDRA dataset, 70.9% and 77.6% in ISIC2017, and 87.4% and 63.9% in the PRIV dataset, respectively (Table II).
III-B New-source classification
Figure 5 and Table III show mean average precision values of networks trained and tested on different datasets, with different CBIR resource databases. In other words, the images to be diagnosed, the images a CNN retrieves similar cases from, and the images the CNN was trained on can all originate from different sources. Softmax-based predictions of the 3-class-trained networks (EDRA and ISIC2017) perform worse on the 8-class dataset (PRIV), with values of 0.184 and 0.198, respectively. Using the target source as a CBIR resource improved these values to up to 0.368 and 0.403, respectively, because previously 'unknown' classes can still be retrieved, as those networks transfer the ability to distinguish diagnoses through visual similarity (see Figure 6). The best CBIR performance is obtained when training, testing and resource data all come from the same source.
Current convolutional neural network (CNN) classifiers perform well but commonly behave as black boxes during inference, precluding meaningful integration of their findings into a clinical decision process. Having an intuitive, 'explainable' output of an automated classifier that complements, rather than overrides, a clinical decision process may be more desirable and can enhance the efficient use of health care workers. Compared to other techniques for explainable AI [28] such as image captioning and visual question answering [12], we hypothesize that showing similar cases with their ground truth may be even more intuitive. Similar images found by CBIR further reveal the knowledge base behind a network decision in a comprehensible way and may convey when not to trust the automated system. More specifically, if users notice that retrieved cases look nothing like the query image, they can intuitively decide that the CNN cannot help in that case.
Herein we show that CBIR can perform on par with softmax-based predictions of a ResNet-50 network on accuracy of skin cancer detection, as well as multi-class accuracy and mean average precision (Table II).
We describe reasonably good metrics for the formal evaluation of a CBIR system, but more recent architectures may reach even higher accuracy. We hypothesise that with increasing accuracy of a network, the accuracy of CBIR will rise accordingly. The true advantage of CBIR may lie in the fact that a human reader can pick the most fitting and relevant examples out of the provided image subset and is not restricted to the strict counting and weighting used for the calculations in this manuscript. We suspect that having such a 'human in the loop' would give much higher diagnostic precision in practice, which should be the subject of future studies.
Deep learning literature dealing with image classification commonly presents accuracy metrics measured on the same dataset source, incorporating the same diagnostic classes. Relying on those experimental results when implementing an automated classifier in clinical practice may be precarious, as an end user may take images with a different camera, of patients with different skin types, with different class distributions, and even of disease classes the network has not encountered before. For these reasons, a classifier with a fixed set of diagnoses may fail in unexpected ways, which would go unnoticed if the output is merely a probability of specific diagnoses. Neural networks trained for classification are by design limited to predicting classes they have seen during training. Currently, to our knowledge, no available dataset comes close to encompassing all clinically possible classes. Further, class definitions of medical entities may change over time with new biologic insights. The CBIR method described herein shows that classifiers knowing only 3 classes are able to generalise to a new dataset with 8 classes better than their softmax-based predictions (Table III). The highest accuracy can still be obtained by fine-tuning a network on the target data source (blue lines in Figure 5), but such a retraining period may not be feasible when retrieval data sources are not accessible for training due to data protection regulations or a lack of machine learning resources.
In contrast to decision support systems with a fixed performance and cutoff that needs to undergo clinical testing 29, CBIR as a dynamic, and potentially vendor-independent, decision support system may be easier to expand and update in practice with growing search datasets and improved models.
As the results of a previous study by Kawahara et al. [16] were not public until the end of our experiments, we did not perform a sample size calculation, so this work needs to be regarded as an exploratory pilot study. We trained the ResNet-50 architecture on the datasets with reasonable effort on fine-tuning, data augmentation and hyperparameter tuning, but did not pursue maximum classification accuracy. Therefore, achievable values may be higher, as shown by Han et al. [4], but we expect a better classifier using a larger image dataset to improve CBIR in a similar way. All data herein suffer from selection bias (the images were deemed worthwhile to photograph by a physician) and verification bias. A user-focused and prospective analysis of such decision support will be able to give more insight into clinical applicability.
Document retrieval studies usually use a different set of metrics, in which mean average precision is defined differently. We chose our metrics and definitions to reflect clinically meaningful outcomes rather than retrieval performance.
In this work we show that automated retrieval of a few visually similar dermatoscopic images approximates the accuracy of softmax-based prediction probabilities. Further, CBIR may improve the performance of trained networks on new datasets and unseen classes when fine-tuning a network on new data is not possible.
- 1 Menzies S, Bischof L, Talbot H, et al. The performance of solarscan: An automated dermoscopy image analysis instrument for the diagnosis of primary melanoma. Archives of Dermatology. 2005;141(11):1388–1396.
- 2 Dreiseitl S, Binder M, Vinterbo S, Kittler H. Applying a decision support system in clinical practice: results from melanoma diagnosis. AMIA Annu Symp Proc. 2007;p. 191–195.
- 3 Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–118.
- 4 Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. Journal of Investigative Dermatology. 2018;p. preprint.
- 5 Haenssle HA, Fink C, Schneiderbauer R, Toberer F, Buhl T, Blum A, et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology. 2018;p. mdy166. Available from: http://dx.doi.org/10.1093/annonc/mdy166.
- 6 Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402–2410.
- 7 Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. CoRR. 2017;abs/1711.05225. Available from: http://arxiv.org/abs/1711.05225.
- 8 Codella NCF, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, et al. Skin Lesion Analysis Toward Melanoma Detection: A Challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), Hosted by the International Skin Imaging Collaboration (ISIC). CoRR. 2017;abs/1710.05006. Available from: http://arxiv.org/abs/1710.05006.
- 9 Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset: a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161.
- 10 Finlayson SG, Chung HW, Kohane IS, Beam AL. Adversarial Attacks Against Medical Deep Learning Systems. arXiv Preprints. 2018;abs/1804.05296. Available from: http://arxiv.org/abs/1804.05296.
- 11 Girardi D, Küng J, Kleiser R, Sonnberger M, Csillag D, Trenkler J, et al. Interactive knowledge discovery with the doctor-in-the-loop: a practical example of cerebral aneurysms research. Brain Informatics. 2016;3(3):133–143. Available from: https://doi.org/10.1007/s40708-016-0038-2.
- 12 Park DH, Hendricks LA, Akata Z, Schiele B, Darrell T, Rohrbach M. Attentive Explanations: Justifying Decisions and Pointing to the Evidence. CoRR. 2016;abs/1612.04757. Available from: http://arxiv.org/abs/1612.04757.
- 13 Piplani T, Bamman D. DeepSeek: Content Based Image Search & Retrieval. CoRR. 2018;abs/1801.03406. Available from: http://arxiv.org/abs/1801.03406.
- 14 Parkhi OM, Vedaldi A, Zisserman A. Deep Face Recognition. In: Xie X, Jones MW, Tam GKL, editors. Proceedings of the British Machine Vision Conference (BMVC). BMVA Press; 2015. p. 41.1–41.12. Available from: https://dx.doi.org/10.5244/C.29.41.
- 15 Shi X, Sapkota M, Xing F, Liu F, Cui L, Yang L. Pairwise based Deep Ranking Hashing For Histopathology Image Classification and Retrieval. Pattern Recognition. 2018;81:14–22.
- 16 Kawahara J, Daneshvar S, Argenziano G, Hamarneh G. 7-Point Checklist and Skin Lesion Classification using Multi-Task Multi-Modal Neural Nets. IEEE Journal of Biomedical and Health Informatics. 2018;p. preprint.
- 17 Argenziano G, Soyer P, De Giorgi V, Piccolo D, Carli P, Delfino M, et al. Interactive atlas of dermoscopy. Dermoscopy: a tutorial (Book) and CD-ROM. Milan, Italy: Edra Medical Publishing and New Media; 2000.
- 18 He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 770–778.
- 19 Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV). 2015;115(3):211–252.
- 20 Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
- 21 Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, et al. Automatic differentiation in PyTorch. In: NIPS-W; 2017.
- 22 Hunter JD. Matplotlib: A 2D graphics environment. Computing In Science & Engineering. 2007;9(3):90–95.
- 23 R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2017. Available from: https://www.R-project.org/.
- 24 Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York; 2016. Available from: http://ggplot2.org.
- 25 Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77.
- 26 DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–845.
- 27 Holm S. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics. 1979;6(2):65–70.
- 28 Holzinger A, Biemann C, Pattichis CS, Kell DB. What do we need to build explainable AI systems for the medical domain? CoRR. 2017;abs/1712.09923. Available from: http://arxiv.org/abs/1712.09923.
- 29 Monheit G, Cognetta AB, Ferris L, Rabinovitz H, Gross K, Martini M, et al. The performance of MelaFind: a prospective multicenter study. Arch Dermatol. 2011;147(2):188–194.