Skin cancer is the most common form of cancer in the United States, costing over $8 billion annually. With early detection, the 5-year survival rate of the most deadly form, melanoma, can be as high as 99%; however, delayed diagnosis causes the survival rate to drop dramatically to 23%.
Due to the importance of early detection, much work has been dedicated to increasing the accuracy and scale of diagnostic methods. In 2016 and 2017, the International Skin Imaging Collaboration (ISIC), a global partnership that has organized the world’s largest repository of publicly available dermoscopic images, hosted the first public benchmarks for melanoma detection in dermoscopic images, titled “Skin Lesion Analysis Towards Melanoma Detection”, at the IEEE International Symposium on Biomedical Imaging (ISBI) [3, 4, 5]. The two consecutive challenges attracted global participation, with over 900 registrations and over 350 submissions, making them the largest standardized and comparative studies in the field at the time. They yielded novel findings and numerous publications, and have been tacitly accepted as a de facto reference standard by other groups [6, 7, 8, 9].
This article describes the methods and the results from the most recent instance of the ISIC Challenge on Skin Lesion Analysis Towards Melanoma Detection, hosted in 2018 at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Granada, Spain. In addition to considerable increases in the size of the dataset and number of labels, key changes to evaluation criteria and study design were implemented to better reflect the complexity of clinical scenarios encountered in practice. These changes included 1) a new segmentation metric to better account for extreme deviations from inter-observer variability, 2) implementation of balanced accuracy for classification decisions to minimize the influence of prevalence and prior distributions that may not be consistent in practice, and 3) inclusion of external test data from institutions excluded from representation in the training dataset, to better assess how algorithms generalize beyond the environments for which they were trained. The impacts of each change are examined in this work, followed by a set of concluding recommendations for future challenges.
The challenge was separated into three image analysis tasks: lesion segmentation, attribute detection, and disease classification (Fig. 1). There was no requirement that any task be independent of the data or analytics developed for the other tasks. Participants were required to provide four-page manuscripts along with submissions describing the implemented methods. Participants were allowed to use proprietary sources of in-domain (dermoscopic) data, but were required to disclose such use in a specific meta-data field. Use of out-of-domain (non-dermoscopic) data, such as ImageNet, was expected to be mentioned in manuscripts, but was not required to be disclosed in a separate meta-data field.
2.1 Part 1: Lesion Segmentation
For Part 1, 2,594 dermoscopic images with ground truth segmentation masks were provided for training. For validation and test sets, 100 and 1,000 images were provided, respectively, without ground truth masks.
Evaluation criteria historically had been the Jaccard index[3, 4, 5], averaged over all images in the dataset. In practice, ground truth segmentation masks are influenced by inter-observer and intra-observer variability, due to variations in human annotators and variations in annotation software used to generate the masks. An ideal evaluation would generate several ground truth segmentation masks for every image using multiple annotators and software systems. Then, for each image, predicted masks would be compared to the multiple ground truth masks to determine whether the predicted mask falls outside or within observer variability. However, this would multiply the manual labor required to generate ground truth masks, rendering such an evaluation impractical and infeasible.
As an approximation of this ideal evaluation criterion, we introduced “Thresholded Jaccard”, which works like the standard Jaccard with one important exception: if the Jaccard value of a particular mask falls below a threshold T, the score is set to zero. The threshold T defines the point at which a segmentation is considered “incorrect”.
Prior work measured the average Jaccard between 3 expert annotators on 100 images from the 2016 challenge. The resulting values were 0.743, 0.754, and 0.861, yielding an average of 0.786 and a range of 0.118. For the 2018 challenge, T was defined as 0.65, by rounding the lowest agreement (0.743) to 0.75 and subtracting one rounded range (0.10), to increase the certainty (specificity) of segmentation failure.
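The metric can be sketched as follows (a minimal illustration with our own function names, not the challenge's actual scoring code):

```python
import numpy as np

def thresholded_jaccard(pred_mask, true_mask, threshold=0.65):
    """Jaccard index that is zeroed out when it falls below the failure
    threshold T. Both masks are boolean arrays of the same shape."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    jaccard = intersection / union if union > 0 else 1.0
    return jaccard if jaccard >= threshold else 0.0

# A prediction covering 3 of 5 ground-truth pixels has Jaccard 0.6,
# below T = 0.65, so it counts as a complete failure.
pred = np.array([1, 1, 1, 0, 0], dtype=bool)
true = np.array([1, 1, 1, 1, 1], dtype=bool)
print(thresholded_jaccard(pred, true))  # 0.0
```

Averaging this quantity over all test images yields the challenge ranking score, so near-miss segmentations are rewarded while segmentations outside the estimated inter-observer range contribute nothing.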
2.2 Part 2: Lesion Attribute Detection
For Part 2, 2,594 dermoscopic images with 12,970 ground truth segmentation masks for 5 dermoscopic attributes were provided for training. For validation and held-out test sets, 100 and 1,000 images were provided, respectively, without ground truth masks.
Jaccard was used as the evaluation metric this year in order to facilitate re-use of methods developed for segmentation and to encourage greater participation. As some dermoscopic attributes may be entirely absent from certain images, the per-image Jaccard for such attributes is ill-defined (division by zero). To overcome this difficulty, Jaccard was measured by pooling true positives (TP), false positives (FP), and false negatives (FN) over the entire dataset, rather than one image at a time.
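The pooled computation can be sketched as follows (an illustrative implementation under the definitions above; names are ours):

```python
import numpy as np

def dataset_jaccard(pred_masks, true_masks):
    """Jaccard pooled over a whole dataset: TP, FP, and FN pixels are
    summed across all images before dividing, so images where an
    attribute is absent (empty ground truth and empty prediction)
    no longer cause a 0/0 division."""
    tp = fp = fn = 0
    for pred, true in zip(pred_masks, true_masks):
        tp += np.logical_and(pred, true).sum()
        fp += np.logical_and(pred, ~true).sum()
        fn += np.logical_and(~pred, true).sum()
    total = tp + fp + fn
    return tp / total if total > 0 else 1.0

# Second image has the attribute entirely absent; it simply adds
# nothing to the pooled counts instead of producing an undefined score.
preds = [np.array([1, 1, 0], dtype=bool), np.zeros(3, dtype=bool)]
trues = [np.array([1, 0, 0], dtype=bool), np.zeros(3, dtype=bool)]
print(dataset_jaccard(preds, trues))  # 0.5
```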
2.3 Part 3: Lesion Disease Classification
For Part 3, 10,015 dermoscopic images with 7 ground truth classification labels were provided for training. For validation and held-out test sets, 193 and 1,512 images were provided, respectively, without ground truth.
Held-out test data was further split into two partitions: 1) an “internal” partition, consisting of 1,196 images selected from data sources that were consistent with the training dataset (two institutions in Austria and Australia), and 2) an “external” partition, consisting of 316 images additionally selected from data sources not reflected in the training dataset (institutions from Turkey, New Zealand, Sweden, and Argentina). The purpose of this split was to better assess algorithm capability to generalize.
Evaluation was carried out using balanced accuracy, which inversely weights samples according to the prevalence of their disease label; dataset prevalence may not reflect real-world disease prevalence, especially given the over-representation of melanomas. Previous years had used melanoma average precision and melanoma AUC (area under the receiver operating characteristic curve), which are robust only to the prevalence imbalance of melanomas and may be influenced by clinically irrelevant low-sensitivity regions of the curve.
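Balanced accuracy is the mean of per-class recalls, so each disease class contributes equally regardless of how many test images it has. A minimal sketch (our own illustrative names, not the challenge's evaluation code):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally,
    regardless of its prevalence in the test set."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Imbalanced toy set: always predicting the majority class gives 90%
# plain accuracy, but balanced accuracy exposes the missed melanoma.
y_true = ["nevus"] * 9 + ["melanoma"]
y_pred = ["nevus"] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5 (recall 1.0 for nevi, 0.0 for melanoma)
```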
3.1 Part 1: Lesion Segmentation
In total, 112 submissions were received for Part 1. The top performing submission achieved a Thresholded Jaccard of 0.802, with many of the other top algorithms also achieving approximately 0.8. A histogram summary of the submissions is shown in Fig. 2, showing the proportion of failures (segmentations below 0.65 Jaccard), performance according to Thresholded Jaccard, and performance according to Jaccard. This analysis makes clear that even though submissions may achieve very high average Jaccard values (over 0.8, exceeding even the previously reported inter-observer average of 0.786), most methods still fail to properly segment over 10% of images. This important observation is often obscured by aggregate statistics.
Fig. 3a shows the correlation of both Thresholded Jaccard and Jaccard with the proportion of segmentation failures, demonstrating that Thresholded Jaccard has improved correlation, with a slope closer to 1. This suggests that the new metric may be a better assessment of clinical utility, provided that instances where the Jaccard falls below 0.65 accurately reflect incorrect segmentations. Fig. 3b shows a scatter plot of participant challenge rank according to Thresholded Jaccard (X-axis) and Jaccard (Y-axis), demonstrating that changing the evaluation criterion to Thresholded Jaccard has a significant impact on the ranking of participant algorithms.
3.2 Part 2: Lesion Attribute Detection
In total, 26 submissions were received for Part 2. Histograms of performance for each attribute are plotted in Fig. 4 according to average Jaccard. The distribution of values was exceptionally low, with the best submission achieving an average of 0.473. Poor performance may be the result of several factors, including the poor inter-observer correlation of dermoscopic attributes among expert clinicians. The implication for future challenges may be that either the field of clinical dermoscopic attributes must mature further before additional research is performed to apply machine learning, or that machine learning methods should be applied to the reverse problem: to help find and annotate specific patterns that may be strongly correlated with disease.
3.3 Part 3: Lesion Disease Classification
In total, 141 submissions were received for Part 3: Disease Classification. Histograms of submission performance, according to balanced accuracy (BACC), are shown in Fig. 5. The highest performance achieved was 0.885. Correlation between internal and external test dataset performance for each submission is shown in Fig. 6a, and between whole test set performance and the difference between internal and external sets in Fig. 6b. Differences in ranking according to balanced accuracy, accuracy, and mean AUC are plotted in Fig. 7.
These analyses provide the following important observations: 1) Most submissions overfit, performing better on internal data than on external data (Fig. 5), but some approaches, including the top performers, do not (Figs. 5 & 6b). 2) The use of proprietary data is not required to prevent overfitting to internal data (Fig. 6b). 3) Various algorithms may achieve similar whole-test-set performance yet vary widely in their ability to generalize (Fig. 6b). 4) Simple linear correlation between internal and external test dataset performance does not elucidate the spread of overfitting as well as plotting the difference between datasets (Fig. 6a vs. Fig. 6b). 5) The choice of evaluation metric has a significant impact on participant ranking (Fig. 7). Use of balanced accuracy is critical to select the best unbiased classifier, rather than one that overfits to arbitrary dataset prevalence, as is the case with accuracy (Fig. 7a). Even other unbiased estimators, such as mean AUC (Fig. 7b), show significant differences in rank compared to balanced accuracy, and consider regions of the operating curve, such as low-recall regions, that may not be clinically relevant.
4 Discussion & Conclusion
This work summarizes the results of the 2018 MICCAI Challenge on Skin Lesion Analysis Toward Melanoma Detection, hosted by the International Skin Imaging Collaboration (ISIC), which represents the de facto best current benchmark for machine learning in this domain. In total, over 12,500 images were made available to participants for training across 3 tasks, and over 2,000 images for testing. 900 teams registered, and 299 submissions were received, making this challenge the largest in the field to date, in terms of both size and complexity of the dataset, and degree of participation.
Several changes were implemented in comparison to previous challenges, in order to be more applicable to clinical practice. These include 1) Thresholded Jaccard, which severely penalizes segmentations that fall outside an estimate of inter-observer variability, 2) balanced accuracy, which discourages classification systems from over-fitting to potential dataset imbalances, and 3) a dual-partition held-out test set, including data sourced from institutions not reflected in the training dataset, to better measure algorithm ability to generalize.
Results show that 1) Thresholded Jaccard better correlates with the proportion of segmentation failures, defined as masks that fall outside an approximation of inter-observer variability, 2) balanced accuracy leads to significant changes in participant ranking versus other metrics that may be more prone to imbalance or to influence from clinically irrelevant ROC regions, 3) multi-partition test sets containing data not reflected in the training dataset are an effective way to differentiate the generalization ability of algorithms, and 4) the poor performance observed in Part 2 may imply that dermoscopic attributes must mature further before additional machine learning research is applied, or that machine learning should instead be applied to finding specific patterns that may be strongly correlated with disease.
Future challenges in medical imaging and dermoscopic image analysis should take these observations into consideration in order to best quantify algorithm performance, robustness, and generalizability in clinical scenarios.
Datasets presented remain available: http://challenge2018.isic-archive.com/.
-  Guy GP, Machlin S, Ekwueme DU, Yabroff KR. Prevalence and costs of skin cancer treatment in the US, 2002–2006 and 2007–2011. Am J Prev Med. 2015;48:183–7.
-  Cancer Facts and Figures 2018. American Cancer Society. https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2018/cancer-facts-and-figures-2018.pdf. Accessed May 3, 2018.
-  Gutman D, Codella NCF, Celebi E, Helba B, Marchetti M, Mishra N, Halpern A. “Skin Lesion Analysis toward Melanoma Detection: A Challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC)”. eprint arXiv:1605.01397. 2016.
-  Marchetti M, et al. “Results of the 2016 International Skin Imaging Collaboration International Symposium on Biomedical Imaging challenge: Comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images”. J Am Acad Dermatol. 2018 Feb;78(2):270–277.
-  Codella N, et al. “Skin Lesion Analysis toward Melanoma Detection: A Challenge at the International Symposium on Biomedical Imaging (ISBI) 2017, hosted by the International Skin Imaging Collaboration (ISIC)”. IEEE International Symposium on Biomedical Imaging (ISBI) 2018.
-  Codella NCF, Nguyen B, Pankanti S, Gutman D, Helba B, Halpern A, Smith JR. “Deep learning ensembles for melanoma recognition in dermoscopy images”. IBM Journal of Research and Development, vol. 61, no. 4/5, 2017.
-  Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. “Dermatologist-level classification of skin cancer with deep neural networks”. Nature, vol. 542, pp. 115–118. 2017.
-  Menegola A, Tavares J, Fornaciali M, Li LT, Avila S, Valle E. “RECOD Titans at ISIC Challenge 2017”. 2017 International Symposium on Biomedical Imaging (ISBI) Challenge on Skin Lesion Analysis Towards Melanoma Detection. Available: https://arxiv.org/pdf/1703.04819.pdf
-  Diaz IG. “Incorporating the Knowledge of Dermatologists to Convolutional Neural Networks for the Diagnosis of Skin Lesions”. 2017 International Symposium on Biomedical Imaging (ISBI) Challenge on Skin Lesion Analysis Towards Melanoma Detection. Available: https://arxiv.org/abs/1703.01976
-  Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018 Aug 14;5:180161.
-  Carrera C, et al. Validity and Reliability of Dermoscopic Criteria Used to Differentiate Nevi From Melanoma: A Web-Based International Dermoscopy Society Study. JAMA Dermatol. 2016 July 01;152(7):798–806.