Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

by Noel Codella, et al.

The International Skin Imaging Collaboration (ISIC) is a global partnership that has organized the world's largest public repository of dermoscopic images of skin lesions. This archive has been used for 3 consecutive years to host challenges on skin lesion analysis toward melanoma detection, covering 3 analysis tasks of lesion segmentation, lesion attribute detection, and disease classification. The most recent instance in 2018 was hosted at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Granada, Spain. The dataset included over 10,000 images. Approximately 900 users registered for data download, 115 submitted to the lesion segmentation task, 25 submitted to the lesion attribute detection task, and 159 submitted to the disease classification task, making this the largest study in the field to date. Important new analyses were introduced to better reflect the difficulties of translating research systems to clinical practice. This article summarizes the results of these analyses, and makes recommendations for future challenges in medical imaging.





1 Introduction

Skin cancer is the most common form of cancer in the United States, and costs over $8 billion annually [1]. With early detection, the 5-year survival rate of the most deadly form, melanoma, can be up to 99%; however, delayed diagnosis causes the survival rate to decrease dramatically to 23% [2].

Due to the importance of early detection, much work has been dedicated to increasing the accuracy and scale of diagnostic methods. In 2016 and 2017, the International Skin Imaging Collaboration (ISIC), a global partnership that has organized the world's largest repository of publicly available dermoscopic images, hosted the first public benchmarks for melanoma detection in dermoscopic images, titled "Skin Lesion Analysis Towards Melanoma Detection", at the IEEE International Symposium on Biomedical Imaging (ISBI) [3, 4, 5]. The two consecutive challenges attracted global participation, with over 900 registrations and over 350 submissions, making them the largest standardized and comparative studies at the time. They yielded novel findings and numerous publications, and have been tacitly accepted as a de facto reference standard by other groups [6, 7, 8, 9].

This article describes the methods and the results from the most recent instance of the ISIC Challenge on Skin Lesion Analysis Towards Melanoma Detection, hosted in 2018 at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Granada, Spain. In addition to considerable increases in the size of the dataset and number of labels, key changes to evaluation criteria and study design were implemented to better reflect the complexity of clinical scenarios encountered in practice. These changes included 1) a new segmentation metric to better account for extreme deviations from interobserver variability, 2) use of balanced accuracy for classification decisions to minimize the influence of prevalence and prior distributions that may not be consistent in practice, and 3) inclusion of external test data from institutions not represented in the training dataset, to better assess how algorithms generalize beyond the environments for which they were trained. The impact of each change is examined in this work, followed by a set of concluding recommendations for future challenges.

2 Methods

The challenge was separated into 3 image analysis tasks of lesion segmentation, attribute detection, and disease classification (Fig. 1). There was no requirement that any task be independent of the data or analytics developed for the other tasks. Participants were required to provide 4-page manuscripts describing the implemented methods along with their submissions. Participants were allowed to use proprietary sources of in-domain (dermoscopic) data, but were required to disclose such use in a specific meta-data field. Use of out-of-domain data (non-dermoscopic), such as ImageNet, was expected to be mentioned in manuscripts, but was not required to be disclosed in a separate meta-data field.

Figure 1: Example ground truth segmentation masks from Part 1: Lesion Segmentation, Part 2: Attribute Detection, and Part 3: Disease Classification.

2.1 Part 1: Lesion Segmentation

For Part 1, 2,594 dermoscopic images with ground truth segmentation masks were provided for training. For validation and test sets, 100 and 1,000 images were provided, respectively, without ground truth masks.

The evaluation criterion had historically been the Jaccard index [3, 4, 5], averaged over all images in the dataset. In practice, ground truth segmentation masks are influenced by inter-observer and intra-observer variability, due to variations among human annotators and in the annotation software used to generate the masks. An ideal evaluation would generate several ground truth segmentation masks for every image using multiple annotators and software systems. Then, for each image, predicted masks would be compared to the multiple ground truth masks to determine whether the predicted mask falls within or outside observer variability. However, this would multiply the manual labor required to generate ground truth masks, rendering such an evaluation impractical.

As an approximation to this ideal evaluation criteria, we introduced “Thresholded Jaccard”, which works similarly to standard Jaccard, with one important exception: if the Jaccard value of a particular mask falls below a threshold T, the Jaccard is set to zero. The value of the threshold T defines the point in which a segmentation is considered “incorrect”.

Prior work measured the average Jaccard between 3 expert annotators on 100 images from the 2016 challenge [6]. The resulting values were 0.743, 0.754, and 0.861, yielding an average of 0.786 and a range of 0.118. For the 2018 challenge, T was defined as 0.65, by rounding the lowest agreement to 0.75 and subtracting one rounded range (0.10) to increase the certainty (specificity) of segmentation failure.
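The metric can be sketched in a few lines. This is a minimal illustration rather than the official challenge scoring code, and it assumes predicted and ground truth masks arrive as same-shaped boolean NumPy arrays:

```python
import numpy as np

def thresholded_jaccard(pred_masks, gt_masks, t=0.65):
    """Mean Jaccard over images, with any per-image score below t set to 0.

    pred_masks, gt_masks: sequences of same-shaped boolean arrays.
    """
    scores = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        j = inter / union if union > 0 else 1.0  # both empty: perfect match
        scores.append(j if j >= t else 0.0)      # the thresholding step
    return float(np.mean(scores))
```

Under this scheme, a submission is rewarded only for segmentations deemed "correct" (Jaccard at or above 0.65); a near-miss at 0.64 scores the same as a complete failure.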

2.2 Part 2: Lesion Attribute Detection

For Part 2, 2,594 dermoscopic images with 12,970 ground truth segmentation masks for 5 dermoscopic attributes were provided for training. For validation and held-out test sets, 100 and 1,000 images were provided, respectively, without ground truth masks.

Jaccard was used as the evaluation metric this year in order to facilitate possible re-use of methods developed for segmentation, and encourage greater participation. As some dermoscopic attributes may be entirely absent from certain images, the Jaccard value for such attributes is ill-defined (division by 0). To overcome this difficulty, the Jaccard was measured by computing the TP, FP, and FN for the entire dataset, rather than a single image at a time.
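The pooled computation can be sketched as follows (an illustrative sketch, not the official evaluation code; masks are assumed to be boolean NumPy arrays):

```python
import numpy as np

def dataset_jaccard(pred_masks, gt_masks):
    """Jaccard from true positives, false positives, and false negatives
    pooled over the entire dataset, so images where an attribute is absent
    (empty ground truth and empty prediction) do not cause division by zero.
    """
    tp = fp = fn = 0
    for pred, gt in zip(pred_masks, gt_masks):
        tp += np.logical_and(pred, gt).sum()
        fp += np.logical_and(pred, ~gt).sum()
        fn += np.logical_and(~pred, gt).sum()
    denom = tp + fp + fn
    return float(tp / denom) if denom > 0 else 1.0
```

An image where the attribute is absent and correctly predicted as absent contributes nothing to any of the three counts, rather than producing an undefined 0/0 per-image score.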

2.3 Part 3: Lesion Disease Classification

For Part 3, 10,015 dermoscopic images with 7 ground truth classification labels were provided for training [10]. For validation and held-out test sets, 193 and 1,512 images were provided, respectively, without ground truth.

Held-out test data was further split into two partitions: 1) an “internal” partition, consisting of 1196 images selected from data sources that were consistent with the training dataset (two institutions in Austria and Australia), and 2) an “external” partition, consisting of 316 images additionally selected from data sources not reflected in the training dataset (institutions from Turkey, New Zealand, Sweden, and Argentina). The purpose of this split was to better assess algorithm capability to generalize.

Evaluation was carried out using balanced accuracy, inversely weighting samples according to the prevalence of their disease label, which may not be reflective of real-world disease prevalence, especially with regard to the over-representation of melanomas. Previous years had used melanoma average precision [3] and melanoma AUC (area under the receiver operating characteristic curve), which are robust only to the prevalence imbalance of melanomas, and may be influenced by clinically irrelevant low-sensitivity performance.
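Balanced accuracy is the mean of per-class recalls, so each disease class contributes equally regardless of how many test images it has. A minimal sketch (illustrative, not the official evaluation code) makes the contrast with plain accuracy explicit:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        recalls.append((y_pred[mask] == c).mean())  # recall for class c
    return float(np.mean(recalls))

# A degenerate classifier that always predicts the majority class:
y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])
plain = float((y_true == y_pred).mean())      # 0.75: rewarded by imbalance
balanced = balanced_accuracy(y_true, y_pred)  # 0.50: penalized per class
```

The example shows why balanced accuracy is the stricter choice for an imbalanced test set: the trivial majority-class classifier scores 0.75 on plain accuracy but only 0.50 on balanced accuracy.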

3 Results

3.1 Part 1: Lesion Segmentation

Figure 2: Histograms of submissions for Part 1: Lesion Segmentation. Performance is on the X-axis, and number of submissions on the Y-axis. "Proportion of Failures" refers to the number of images below 0.65 Jaccard. Average values are shown as solid vertical lines, and +/- one standard deviation as dotted vertical lines.

In total, 112 submissions were received for Part 1. The top performing submission achieved a Thresholded Jaccard of 0.802, with many of the other top algorithms also achieving approximately 0.8. A histogram summary of the submissions is shown in Fig. 2, covering the proportion of failures (segmentations below 0.65 Jaccard), performance according to Thresholded Jaccard, and performance according to Jaccard. This analysis makes clear that even though submissions may achieve very high average Jaccard values (over 0.8, exceeding even previous reports of average inter-observer Jaccard of 0.786 [6]), most methods still fail to properly segment over 10% of images. This is an important observation that is often diluted by aggregated statistics.

Fig. 3a shows the correlation of both Thresholded Jaccard and Jaccard against the proportion of segmentation failures, demonstrating that Thresholded Jaccard has improved correlation, with a slope closer to 1. This suggests that the new metric may be a better assessment of clinical utility, if instances where the segmentation Jaccard falls below 0.65 accurately reflect incorrect segmentations. Fig. 3b shows a scatter plot of participant challenge rank according to Thresholded Jaccard (X-axis) and Jaccard (Y-axis), demonstrating that changing the evaluation criterion to Thresholded Jaccard has a significant impact on the ranking of participant algorithms.

Figure 3: Assessment of new Threshold Jaccard metric, which thresholds Jaccard score at 0.65 (all values below set to 0.0). A) Proportion of segmentation failures (X-axis) vs. various metric values (Y-axis). B) Participant ranking by Thresholded Jaccard (X-axis) vs. Jaccard (Y-axis), broken down by usage of proprietary training data.

3.2 Part 2: Lesion Attribute Detection

Figure 4: Histogram of submissions to Part 2: Lesion Attribute Detection.

In total, 26 submissions were received for Part 2. Histograms of performance for each attribute are plotted in Fig. 4 according to average Jaccard. The distribution of values was exceptionally low, with the best submission achieving an average of 0.473. Poor performance may be the result of several factors, including that dermoscopic attributes tend to have poor inter-observer correlation among expert clinicians [11]. The implication for future challenges may be that either the field of clinical dermoscopic attributes must mature further before additional research is performed to apply machine learning, or that machine learning methods should be applied to the reverse problem: to help find and annotate specific patterns that may be strongly correlated with disease.

3.3 Part 3: Lesion Disease Classification

Figure 5: Histograms of submissions for Part 3: Lesion Classification. Average values are shown as solid vertical lines, and +/- one standard deviation as dotted vertical lines. Left: entire test dataset. Center: internal (blue) and external (green) test partitions. Right: histogram of external test performance subtracted from internal.

In total, 141 submissions were received for Part 3: Disease Classification. Histograms of submission performance, according to balanced accuracy (BACC), are shown in Fig. 5. The highest performance achieved was 0.885. The correlation between internal and external test dataset performance for each submission is shown in Fig. 6a, and between whole test set performance and the difference between internal and external sets in Fig. 6b. Differences in ranking according to balanced accuracy, accuracy, and mean AUC are plotted in Fig. 7.

These analyses provide the following important observations: 1) Most submissions overfit and perform better on internal data vs. external data (Fig. 5), but some approaches, including the top performers, do not (Figs. 5 & 6b). 2) The use of proprietary data is not required in order to prevent overfitting to internal data (Fig. 6b). 3) Various algorithms may achieve similar whole test set performance but vary widely in their ability to generalize (Fig. 6b). 4) Simple linear correlation between internal and external test dataset performance does not elucidate the spread of overfitting as well as plotting by the difference between datasets (Fig. 6a vs. Fig. 6b). 5) The choice of evaluation metric has a significant impact on participant ranking (Fig. 7). Use of balanced accuracy is critical to select the best unbiased classifier, rather than one that overfits to arbitrary dataset prevalence, as is the case with accuracy (Fig. 7a). Even other unbiased estimators, such as mean AUC (Fig. 7b), show significant differences in rank as compared to balanced accuracy, and consider areas of operating curves, such as low-recall regions, that may not be clinically relevant.

Figure 6: Comparisons of participant performance on the internal vs. external test datasets. A: Internal vs. external test set performance. B: Whole test set performance vs. the difference between internal and external test set performances.
Figure 7: Impact of the new balanced accuracy (BACC) metric on participant ranking. A) BACC vs. ACC. B) BACC vs. Mean AUC.

4 Discussion & Conclusion

This work summarizes the results of the 2018 MICCAI Challenge on Skin Lesion Analysis Toward Melanoma Detection, hosted by the International Skin Imaging Collaboration (ISIC), which represents the de facto best current benchmark for machine learning in this domain. In total, over 12,500 images were made available to participants for training across 3 tasks, and over 2,000 images for testing. 900 teams registered, and 299 submissions were received, making this challenge the largest in the field to date, in terms of both size and complexity of the dataset, and degree of participation.

Several changes were implemented in comparison to previous challenges, in order to be more applicable to clinical practice. These include 1) Thresholded Jaccard, which severely penalizes segmentations that fall outside an estimate of interobserver variability, 2) balanced accuracy, which avoids encouraging classification systems from over-fitting potential dataset imbalances, and 3) a dual-partition held-out test set, including data sourced from institutions not reflected in the training dataset, to better measure algorithm ability to generalize.

Results show that 1) Thresholded Jaccard better correlates with the proportion of segmentation failures, as defined by masks that fall outside an approximation to interobserver variability, 2) balanced accuracy leads to significant changes in participant ranking versus other metrics that may be more prone to imbalance or influence from clinically irrelevant ROC regions, 3) multi-partition test sets containing data not reflected in the training dataset are an effective way to differentiate the generalization ability of algorithms, and 4) the poor performance observed in Part 2 may imply that dermoscopic attributes must mature further before research is continued to apply machine learning, or that machine learning should be applied to find specific patterns that may be correlated with disease.

Future challenges in medical imaging and dermoscopic image analysis should take these observations into consideration in order to best quantify algorithm performance, robustness, and generalizability in clinical scenarios.

The datasets presented remain available: