Data, Depth, and Design: Learning Reliable Models for Melanoma Screening

11/01/2017 ∙ by Eduardo Valle, et al. ∙ University of Campinas

The state of the art in melanoma screening has evolved rapidly in the last two years, with the adoption of deep learning. Those models, however, pose challenges of their own, as they are expensive to train and difficult to parameterize. Objective: We investigate the methodological issues for designing and evaluating deep learning models for melanoma screening, by exploring nine choices often faced when designing deep networks: model architecture, training dataset, image resolution, type of data augmentation, input normalization, use of segmentation, duration of training, additional use of SVM, and test data augmentation. Methods: We perform a two-level full factorial experiment, for five different test datasets, resulting in 2560 exhaustive trials, which we analyze using a multi-way ANOVA. Results: The main finding is that the size of training data has a disproportionate influence, explaining almost half the variation in performance. Of the other factors, test data augmentation and input resolution are the most helpful. Deeper models, when combined with extra data, also help. We show that the costly full factorial design and the unreliable sequential optimization are not the only options: ensembles of models provide reliable results with limited resources. Conclusions and Significance: To move research forward on automated melanoma screening, we need to curate larger shared datasets. Optimizing hyperparameters and measuring performance on the same dataset is common, but leads to overoptimistic results. Ensembles of models are a cost-effective alternative to the expensive full-factorial and to the unstable sequential designs.




I Introduction

Deep learning has become the gold standard for image classification tasks. Automated melanoma screening is no exception: its state of the art has improved rapidly since the adoption of those models.

Deep learning, however, poses its own challenges, requiring high computational power and copious amounts of annotated data. Deep learning is also challenging to parameterize, often involving dozens of hyperparameters [1]. Although high-level frameworks have simplified its use, much craftsmanship is involved in designing and optimizing its models.

We examine those difficulties with an exhaustive evaluation of ten factors faced when designing deep networks for melanoma detection. We explore the variations of those factors in two full factorial designs. Our main experiment evaluates nine two-level factors, for five different test datasets, resulting in 2^9 × 5 = 2560 treatments. Our assessment of transfer learning evaluates eight two-level factors for five datasets, with 2^8 × 5 = 1280 treatments. We analyze those results using a multi-way ANOVA, to obtain decisions that generalize across datasets.

Full factorial designs are too expensive for most practical contexts, but the complete set of data we obtained allowed us to simulate two alternative designs: the traditional sequential optimization (a single factor at a time) and ensembles of models. We show that the latter provides the best performance, at reasonable costs.

Our participation at the ISIC (International Skin Imaging Collaboration) 2017 Challenge [2] inspired this paper. At the time, with a strategy of aggressively optimizing our models, we were ranked first in melanoma classification [3], with an AUC of 0.874.

Here we have very different goals, focusing on methodological issues for designing and evaluating deep learning models for melanoma detection. The main contributions are:

  • A main evaluation of nine factors affecting the design of deep-learning for melanoma detection, together with their cross-interactions up to the third level;

  • A quantitative appraisal of the importance of transfer learning, in perspective with seven of the nine factors analyzed in the main experiment;

  • A critique of hyperparameter optimization, demonstrating that the customary procedure of optimizing hyperparameters and evaluating techniques on the same dataset leads to overoptimistic results;

  • A demonstration of how simple ensemble solutions can provide good results, without sacrificing methodological rigor;

  • State-of-the-art AUCs in all three classification tasks of the ISIC Challenge 2017/Part 3. The focus of this paper is not on increasing metrics, but those results showcase the previous point;

  • Finally, our entire source code is available, and our procedures are explained step-by-step, from the acquisition of the data to the generation of the tables and graphs of this paper. We are committed to making this work as reproducible as technically feasible.

We organized the remaining text as follows: we discuss the recent state of the art in Section II, and very briefly overview our participation in the ISIC Challenge 2017. We detail our goals, materials, and methods in Section III. Experimental results and analyses appear in Section IV. Finally, we review and discuss the main findings in Section V.

II Survey of recent techniques

Since 2016, the state of the art on automated melanoma screening has taken a sharp turn towards deep learning. We will focus on those results, without emphasis on relevant previous literature [4, 5], and refer the interested reader to more comprehensive surveys [6, 7]. We assume that the reader has passing familiarity with deep learning and its issues; if needed, we refer the reader to a complete overview [1].

Existing works either train deep networks from scratch [8, 9, 10], or reuse the weights from pre-trained networks [11, 12, 13, 14, 15, 16], in a scheme called transfer learning.

Transfer learning is usually preferred, as it alleviates the main issue of deep learning for melanoma detection: datasets that are far too small, most often comprising a few thousand samples. (Contrast that with the ImageNet dataset, employed to evaluate deep networks, with more than a million samples.) Training from scratch is preferable only when attempting new architectures, or when avoiding external data due to legal/scientific issues. Menegola et al. [13] explain and evaluate transfer learning for melanoma detection in more detail. The results in this paper reinforce that transfer learning is critical for best performance. In the main experiments, all treatments employ transfer learning.

Whether using transfer or not, works vary widely in their choice of deep-learning architecture, from the relatively shallow (by today’s standards) VGG [11, 13, 17, 14], through the mid-range GoogLeNet [11, 12, 15, 18, 19], to the deeper ResNet [15, 20, 3, 21, 16] or Inception [12, 3, 22]. On the one hand, more recent architectures tend to be deeper, and to yield better accuracies; on the other hand, they require more data and are more difficult to parameterize and train. Although high-level frameworks for deep learning have simplified training those networks, a good deal of craftsmanship is still involved. In this work we contrast two architectures: ResNet-101-v2 [23], and Inception-v4 [24]. Both are very deep and show outstanding performance on ImageNet, but Inception-v4 is considerably deeper and bigger than ResNet-101-v2.

Data augmentation is another technique used to bypass the need for data, while also enhancing networks’ invariance properties. Augmentation creates a myriad of new samples by applying random distortions (e.g., rotations, crops, resizes, color changes) to the existing samples. Augmentation provides best performance when applied to both train and test samples, but only the most recent melanoma detection works follow that scheme [9, 13, 3, 21, 22]. Train-only augmentation is still very common [11, 12, 17, 14, 25, 18, 19].

In this work, we evaluate two augmentation distortion setups (TensorFlow/Slim’s default for each network, and an attempt to customize it for skin lesions). We always apply augmentation to the train samples, but evaluate the impact of applying or not applying augmentation to the test samples.

Works based on global features or bags of visual words often preprocess the images (some recent works using deep learning do it as well) to reduce noise, remove artifacts (e.g., hair), enhance brightness and color, or highlight structures [26, 27, 8, 20, 10, 28]. The deep-learning ethos usually forgoes that kind of “hand-made” preprocessing, relying instead on networks’ abilities to learn those invariances — with the help of data augmentation if needed.

On the other hand, segmentation as preprocessing is common in deep learning for melanoma detection [9, 18], sometimes employing a dedicated network to segment the lesion before forwarding it to the classification network [11, 25, 16]. Those works usually report improved accuracies. In this work, we also evaluate the impact of using segmentation to help classification.

If ad-hoc preprocessing (e.g., hair removal) is atypical in deep learning, statistical preprocessing is very common. Many networks fail to converge if the expected value of input data is too far from zero. Learning an average input vector during training and subtracting it from each input is standard, and performing a comparable procedure for standard deviations is usual. The procedure is so routine that, with rare exceptions [29, 3], researchers do not even mention it. In this work, we evaluate the impact of two image normalization schemes: TensorFlow/Slim’s default for each network, and an attempt to customize it for skin lesions.

Deep network architectures can directly provide the classification decisions, or can provide features for the final classifier, often a support vector machine (SVM). Both the former [12, 14, 15] and the latter [8, 17, 13, 16] procedures are readily found for melanoma detection. We evaluate the choice of adding or not adding an SVM layer. Ensemble techniques, which fuse the results from several classifiers into a final decision [20, 3, 21, 22, 15], are also common.

As shown above, the literature on automated melanoma screening is vast: even the limited scope chosen for this survey comprises more than a dozen works. Making comparisons across those techniques was, until very recently, next to impossible due to poorly documented choices of datasets, splits, implementation details, or even evaluation metrics [7]. The ISIC Challenge [2] has sharply improved that scenario, by providing standards for data and metrics, and by requiring participants to publish working notes.

A subtler way ISIC Challenge served the community was by clearly separating validation and test datasets, and by keeping test datasets secret until all evaluations were over. When test sets are open, researchers have a strong tendency to optimize hyperparameters on test (hyperoptimize on test, for short).

Hyperoptimizing on test is not the same as overfitting the training: the latter is a technical shortcoming in which one overadjusts one’s parameters (or hyperparameters) to train (or validation) data, while the former is a methodological error in which one adjusts one’s hyperparameters using test data. Hyperoptimizing on test also differs from direct train–test contamination (mixing train and test samples, or using statistics from the test set in the model), a crude methodological blunder that most practitioners avoid. Hyperoptimizing on test is a subtle methodological error in which one indirectly uses information from the test set, by, for example, designing, refining and evaluating different models over a test set, picking the best one and reporting its results on that same test set [30]. Since the whole process may stretch over weeks or months, researchers fail to realize they are using privileged information to improve the model.

As we will discuss, hyperoptimizing on test leads to overoptimistic estimations of performance.

II-A ISIC Challenge 2017

We documented our participation in Parts 1 and 3 of ISIC Challenge 2017 in our working notes [3]. In this section, we briefly summarize that participation, our findings, and contrast our aims then and now.

Our team has been working on melanoma classification since early 2014 [31], and has been employing deep learning with transfer learning for that task since 2015 [32]. On the other hand, we were tackling skin-lesion segmentation for the very first time. Our team reached 1st place in melanoma classification (AUC = 0.874), 3rd place at keratosis classification (AUC = 0.943), and 3rd place in overall melanoma/keratosis classification (mean AUC = 0.908). We also reached 5th place in skin-lesion segmentation (Jaccard score = 0.754). The organizers published a comprehensive summary of the entire challenge [2].

The experiments presented here are inspired by the findings we made during the challenge, but this paper takes different directions, especially regarding methodology. If during the challenge we tempered scientific rigor with the sportive desire to win, here we stress only the former: this work does not focus on maximizing AUCs, but on broader issues of experimental design and validation for the proposed methods.

III Methods, Data, and Models

Our first — and more practical — goal is an exhaustive evaluation of ten factors involved in designing deep learning models (Table I).

| Symbol | Factor | Levels |
|---|---|---|
| a | Model | ResNet-101 v2 versus Inception v4 |
| b | Train dataset | Train split of ISIC Challenge 2017 versus Full: Level 1 + ISIC Archive + U. of Porto PH2 + U. of Edinburgh Dermofit |
| c | Input resolution | Pre-augmentation resolution: 299×299 pixels (305×305 if using segmentation) versus 598×598 pixels |
| d | Train augmentation | TensorFlow/Slim’s default versus Level 1 + rotations = on, fast_mode = off, minimum_area = 0.20 |
| e | Input normalization | TensorFlow/Slim’s default versus Subtract mean of samples’ pixels |
| f | Segmentation | No segmentation information versus Segmentation pre-encoded at input |
| g | Training length | Short (about half the length of Full) versus Full (30k batches for ResNet / 40k batches for Inception; batch_size = 32) |
| h | SVM decision layer | Absent versus Present |
| i | Test augmentation post-deep | No (decision on single non-augmented sample) versus Yes (decision on average of 50 random-augmented samples) |
| j | Test dataset | Split of Train/Full vs. Validation of ISIC Chall. ’17 vs. Test of ISIC Chall. ’17 vs. EDRA/Dermoscopic vs. EDRA/Clinical |
| t | Transfer learning | Training from scratch (weights initialized at random) vs. Transfer from ImageNet (checkpoint published by TensorFlow/Slim) |

TABLE I: Factors in our experimental designs, with corresponding levels

Most of those factors are not particular to melanoma detection, but are relevant for all image classifiers using deep learning. However, a preoccupation with resolution (c), augmentation customization (d), and segmentation (f) makes more sense for melanoma detection — or at least for medical images in general — than for general-purpose tasks, like ImageNet.

Our second, more philosophical, goal is discussing methodological issues in the design and evaluation of classification models, especially those which, like melanoma detection, aim at medical applications. We are far from the first to point out [30, 33] that the common practice of meta-optimizing a technique on the same test set used to evaluate the technique leads to over-optimistic results. In this paper, we showcase that effect as we cross-analyze the results in five different test sets.

III-A Software and Hardware

We ran the experiments on a cluster of Ubuntu Linux machines, on a variety of NVIDIA GPUs, including Titan X Maxwell, Titan X Pascal, and Tesla K40. We built the classification models on Python/TensorFlow v1.3, using the Slim framework. Slim provides ready-to-use models for ResNet-101-v2 and Inception-v4, which we used, with slight adaptations. The statistical analysis ran in R.

In this text we aimed to provide the details needed to understand all results; for complete reproducibility, we provide the end-to-end pipeline, from the raw data to the tables and figures, at our code repository.

III-B Experimental Design

The main experimental design was a two-level full factorial design for nine of the ten factors mentioned above (a–i), for each one of the five test datasets (j), resulting in 2^9 × 5 = 2560 treatments evaluated. We ran a second experiment to evaluate the impact of transfer learning, evaluating eight factors (a–e, g, i, t), and fixing (f) as no segmentation and (h) as SVM layer absent, resulting in 2^8 × 5 = 1280 treatments evaluated.
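As an illustration of the size of the main design, the sketch below enumerates every combination of nine two-level factors and five test datasets. The factor symbols follow Table I; the short level names and the helper `full_factorial` are hypothetical shorthand chosen for this example, not identifiers from our pipeline.

```python
from itertools import product

# Two-level design factors, keyed by the symbols of Table I.
# Level names are illustrative shorthand only.
FACTORS = {
    "a": ("resnet", "inception"),
    "b": ("challenge", "full"),
    "c": ("299", "598"),
    "d": ("default", "custom"),
    "e": ("default", "erase_mean"),
    "f": ("no_seg", "seg"),
    "g": ("half", "full"),
    "h": ("no_svm", "svm"),
    "i": ("no_test_aug", "test_aug"),
}
TEST_SETS = ("full.split", "isic.validation", "isic.test",
             "edra.dermoscopic", "edra.clinical")


def full_factorial(factors, test_sets):
    """Yield (configuration, test set) for every cell of the design."""
    symbols = sorted(factors)
    for levels in product(*(factors[s] for s in symbols)):
        config = dict(zip(symbols, levels))
        for test_set in test_sets:
            yield config, test_set


treatments = list(full_factorial(FACTORS, TEST_SETS))
print(len(treatments))  # 2**9 * 5 = 2560
```

Each of the 512 model configurations is trained once and then measured on the five test datasets, which is why the test dataset can enter the analysis as one more factor.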

In all experiments, we used the area under the Receiver Operating Characteristic curve (AUC) as the main metric. Following the ISIC Challenge 2017, we use the mean of the melanoma-vs-all AUC and the keratosis-vs-all AUC as the measured outcome in all experiments.
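The outcome measure can be sketched in a few lines. The code below is a minimal pure-Python illustration, assuming each sample carries one melanoma score and one keratosis score; the function names `auc` and `mean_auc` and the toy scores are hypothetical, and the AUC is computed by direct pairwise comparison rather than by any particular library routine.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(melanoma_scores, keratosis_scores, classes):
    """Mean of the melanoma-vs-all and keratosis-vs-all AUCs."""
    mel = auc(melanoma_scores, [int(c == "melanoma") for c in classes])
    ker = auc(keratosis_scores, [int(c == "keratosis") for c in classes])
    return (mel + ker) / 2


# Toy data, invented for illustration.
classes = ["melanoma", "nevus", "keratosis", "nevus"]
melanoma_scores = [0.9, 0.2, 0.4, 0.1]
keratosis_scores = [0.1, 0.9, 0.8, 0.5]
outcome = mean_auc(melanoma_scores, keratosis_scores, classes)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch but not for large test sets.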

The analysis for both experiments was a classical multi-way ANOVA, in which the test datasets entered as one of the factors. That choice highlights our aim to make decisions that generalize across datasets, in contrast to maximizing the performance for a particular dataset.

ANOVA provides both a significance analysis (p-value), and a partition of the variance, which allows us to roughly estimate the relative importance/influence of each factor, or combinations of factors. For effect size/explanatory power, we use the η² measure, which is the ratio of the variances (sums-of-squares), extensively used due to its simplicity. The ANOVA table of the main experiment is summarized in Table III and explored in the next section. We also employed correlograms to highlight issues relating to the choice of test dataset, and performance metric (Fig. 1). We discuss the results for the transfer experiment in Section IV.

Most of the time, our full factorial designs are too costly to use — thus our next set of experiments, exploring ensemble techniques, helps in more practical situations. We evaluated a straightforward ensemble, which just pools the decision of several classifiers, and showed that it provides very good performances, without the costs of a full design.
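A straightforward pooling can be sketched as follows. Averaging the per-sample scores is an assumption made for this example (other pooling rules are possible), and `pool_ensemble` is a hypothetical helper name.

```python
def pool_ensemble(per_model_scores):
    """Average the scores of several classifiers, sample by sample.

    per_model_scores: one list of scores per classifier, all of the
    same length and in the same sample order.
    """
    n_models = len(per_model_scores)
    n_samples = len(per_model_scores[0])
    if any(len(scores) != n_samples for scores in per_model_scores):
        raise ValueError("all classifiers must score the same samples")
    return [
        sum(scores[i] for scores in per_model_scores) / n_models
        for i in range(n_samples)
    ]


# Two toy classifiers scoring two samples.
pooled = pool_ensemble([[0.2, 0.8], [0.4, 0.6]])
```

The pooled scores can then be evaluated exactly like the scores of a single model.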

We also simulated the most common procedure employed by researchers and practitioners: sequential optimization of hyperparameters, in which one starts from a given configuration of hyperparameters, selects one of them to evaluate, commits to the best result, and proceeds to evaluate the next. Although such a procedure is very fast (it allows optimizing the nine factors of our main design in just 18 experiments), it is sub-optimal in comparison to ensembles.

Finally, we showed that the customary procedure of optimizing the hyperparameters in the same test set used to evaluate the technique leads to overoptimistic results in both the ensemble and the sequential design.

Details and results of all procedures appear in Section IV.

III-C Data

Because deep learning is greedy for data, we sought all high-quality publicly available (for free, or for a fee) sources to compose our dataset:

  1. ISIC 2017 Challenge [2], the official challenge dataset, with 2,000 dermoscopic images (374 melanomas, 254 seborrheic keratoses, and 1,372 benign nevi).

  2. ISIC Archive, with over 13,000 dermoscopic images.

  3. Dermofit Image Library [34], with 1,300 clinical images (76 melanomas, 257 seborrheic keratoses).

  4. PH2 Dataset [35], with 200 dermoscopic images (40 melanomas).

  5. EDRA Interactive Atlas of Dermoscopy [36], with 1,000+ clinical cases (270 melanomas, 49 seborrheic keratoses), each with at least two images (dermoscopic, and close-up clinical).

We used essentially the same data sources we employed during the ISIC 2017 challenge, except for the no longer available dataset created by the Department of Medical Informatics, RWTH Aachen University (cited in our report [3] as “IRMA Dataset”). Even with that exclusion, the new dataset grew, due to a more careful matching of diagnoses among the sources (instead of dropping the doubtful cases).

Besides that matching, which we performed with a thesaurus of the terms used in the different datasets, we also annotated images by case, by aliases (same image with different names), and by near-duplicates (two almost-copies of the same lesion). When creating the train and test sets, we barred any cases, aliases, or near-duplicates from splitting across sets.
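The constraint that linked images never straddle the train/test boundary can be sketched with a union-find grouping followed by a split at the group level. This is a minimal illustration, not our actual tooling: the helper names, the `links` representation (pairs of image ids sharing a case, an alias, or a near-duplicate relation), and the greedy fill of the test set are all assumptions.

```python
import random


def group_linked(image_ids, links):
    """Merge images connected by case/alias/near-duplicate links."""
    parent = {i: i for i in image_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for i in image_ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def split_by_group(image_ids, links, test_fraction=0.2, seed=0):
    """Split so that every group lands entirely in train or in test."""
    groups = group_linked(image_ids, links)
    random.Random(seed).shuffle(groups)
    train, test = [], []
    target = test_fraction * len(image_ids)
    for group in groups:
        (test if len(test) < target else train).extend(group)
    return train, test


# Toy example: images 0-1-2 form one case, 5-6 are near-duplicates.
ids = list(range(10))
links = [(0, 1), (1, 2), (5, 6)]
train, test = split_by_group(ids, links)
```

Because whole groups are assigned at once, the realized test fraction only approximates the requested one.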

Data sources affect the train and test datasets. For the train dataset (factor b), we contrasted (1) using only the official train split of the ISIC Challenge 2017 dataset, to (2) joining the train split of the ISIC Challenge, the ISIC Archive, the Dermofit Library, and the PH2 Dataset and extracting from that full dataset a train split. For the test dataset (factor j), we contrasted (1) an internal test split extracted from our full dataset; (2) the official validation split and (3) the official test split of ISIC Challenge 2017; (4) the dermoscopic images and (5) the clinical images of the EDRA Interactive Atlas of Dermoscopy. Table II summarizes the final assembled sets.

| Type | Melanoma | Nevus | Keratosis |
|---|---|---|---|
| ISIC Challenge 2017 train split | 374 | 1372 | 254 |
| Full train (composition of datasets) | 1227 | 10124 | 710 |
| Internal test split from full | 135 | 3129 | 89 |
| ISIC Challenge 2017 validation split | 30 | 78 | 42 |
| ISIC Challenge 2017 testing split | 117 | 393 | 90 |
| EDRA Atlas of Dermoscopy (each version) | 518 | 1154 | 95 |

TABLE II: Summary of the train and test sets.

III-D Models

For the main experiment, we always employed pre-trained models that proved successful for the ImageNet challenge, loading the weights from checkpoints published in the TensorFlow repository. For the transfer experiment, we contrasted those with the same models initialized with random weights.

To evaluate the choice of the model (factor a) we contrasted two architectures: ResNet-101-v2 [23], and Inception-v4 [24], using the reference implementation available in TensorFlow/Slim v1.3.

In this paper, segmentation was used only as an ancillary input for classification (factor f). For the ISIC Challenge 2017, we had used a segmentation network based on the work of Ronneberger et al. [37] and Codella et al. [38]. For this work, we streamlined that model, reducing the number of parameters, removing the fully-connected and Gaussian-noise layers, and adding batch-normalization and dropout layers. The new model is faster to train and occupies much less disk space. We trained the segmentation models on the same images as their corresponding classification models.

Because of the lack of literature consensus on how to use segmentation for melanoma detection, we opted for schemes with minimal changes to both data and networks. We pre-evaluated two candidates: pixel-wise multiplying the input RGB images by the segmentation masks versus pre-encoding the four planes (R, G, B, and mask) into three planes, keeping the rest of the networks unchanged. For the full design, we only considered the latter, which appeared more promising on those preliminary tests.

Pre-encoding the masks required slightly adapting ResNet and Inception, by adding the pre-encoding adapter layers. For both ResNet and Inception we added three convolutional layers before the input: two layers with 32 filters, and a third with 3 filters. All convolutional layers used 3×3 kernels and a stride of 1. Since the ResNet-101-v2 and Inception-v4 models require input images of 299×299 pixels, the adapter layer took 305×305-pixel images, to account for the 2 border pixels lost at each convolutional layer.
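The size bookkeeping for the adapter can be checked with a short calculation. The sketch assumes 3×3 kernels applied without padding, which is what a loss of 2 border pixels per layer at stride 1 implies, and the 299- and 305-pixel sizes listed in Table I; the function names are hypothetical.

```python
def valid_conv_output(size, kernel=3, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def adapter_input_size(target, n_layers=3, kernel=3):
    """Input width needed so that n stride-1 valid layers output `target`."""
    # With stride 1, each unpadded layer trims (kernel - 1) pixels.
    return target + n_layers * (kernel - 1)


size = 305
for _ in range(3):  # the three adapter layers
    size = valid_conv_output(size)
print(size)  # 299
```

The same arithmetic applies independently to height and width, so a 305×305 input leaves the adapter as a 299×299 image.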

IV Experiments

As explained, the main experiment was a full factorial design with nine two-level factors (a–i), and five test datasets (factor j). We used a classical multi-way ANOVA with the mean AUC for melanoma and keratosis as the measured outcome (with the small technicality of taking the logit of that measure, since, when working with rates, the logit helps to fulfill ANOVA’s assumption of Gaussian residuals). We considered all main effects, and up to 3-way interactions. We considered higher-order interactions unlikely and assigned them to the residuals.
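The variance partition behind the effect sizes can be sketched for main effects only. The code below is a simplified illustration under stated assumptions: a balanced design, the logit of the AUC as outcome, and the ratio of a factor's sum of squares to the total sum of squares as its explanation. It does not compute interactions, F-statistics, or p-values, and it is not the R analysis we ran; all names and the toy data are hypothetical.

```python
import math
from itertools import product


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def main_effect_shares(observations):
    """Return {factor: SS_factor / SS_total} for a balanced design.

    observations: list of (levels, auc), where `levels` maps each
    factor symbol to its level and `auc` lies strictly in (0, 1).
    """
    y = [logit(auc) for _, auc in observations]
    grand = sum(y) / len(y)
    ss_total = sum((v - grand) ** 2 for v in y)
    shares = {}
    for factor in observations[0][0]:
        by_level = {}
        for (levels, _), v in zip(observations, y):
            by_level.setdefault(levels[factor], []).append(v)
        ss_factor = sum(
            len(vs) * (sum(vs) / len(vs) - grand) ** 2
            for vs in by_level.values()
        )
        shares[factor] = ss_factor / ss_total
    return shares


# Toy 2x2 design whose logit-AUC is 1 + a + 2*b (no interaction, no noise).
toy = [({"a": a, "b": b}, sigmoid(1 + a + 2 * b))
       for a, b in product((0, 1), repeat=2)]
shares = main_effect_shares(toy)
```

In this toy design factor b has twice the effect of factor a on the logit scale, so it accounts for four times the sum of squares.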

Table III shows a summary of the main experiment’s ANOVA, with the symbols for the factors and interactions in the first column, and the names of the main factors in the second. The remaining columns show the outcomes of the test. The most important columns are p-value, which measures statistical significance, and explanation (%), which measures effect-size/explanatory power. The p-value is inferred, as usual, from the F-statistic of ANOVA, while the explanatory power uses the η² measure.

We present the absolute explanation (considering the entire table) for reference, but our analysis is focused on the relative explanation, which ignores the choice of the test set (j) and the residuals. The reason for ignoring those is that they are not actual choices for designing a new model; therefore, relative explanations indicate better the relative importance of choices to practitioners.
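The difference between the two columns can be made concrete with a small sketch. It assumes a table of sums of squares keyed by term name, with interactions written as `a:b`, and it leaves the relative explanation undefined for the residuals and for every term involving the test dataset (j). The function name and the numbers are invented for illustration and are not the values of our ANOVA.

```python
def explanations(sums_of_squares, excluded_factor="j", residual_key="Residuals"):
    """Return {term: (absolute %, relative % or None)}."""
    total = sum(sums_of_squares.values())

    def is_design_term(term):
        return term != residual_key and excluded_factor not in term.split(":")

    design_total = sum(
        ss for term, ss in sums_of_squares.items() if is_design_term(term)
    )
    table = {}
    for term, ss in sums_of_squares.items():
        absolute = 100.0 * ss / total
        relative = 100.0 * ss / design_total if is_design_term(term) else None
        table[term] = (absolute, relative)
    return table


# Invented sums of squares, for illustration only.
toy_ss = {"b": 5.0, "i": 1.0, "j": 75.0, "b:j": 2.0, "Residuals": 17.0}
toy_table = explanations(toy_ss)
```

A term that explains little of the total variation can still dominate the relative column when most of the total is tied to the test dataset.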

The original full table contained all main effects, and up to 3-way interactions. However, not surprisingly, 126 of the resulting 176 lines were non-significant interactions, which were omitted here. We also left out those interactions with relative explanations lower than 1%, even if significant. With the notable exception of the customized data augmentation (d), all main effects were significant, but most of their relative explanations were small.

| Symbol | Factor | p-value | Absolute explanation (%) | Relative explanation (%) | Best treatment | Best mean AUC (%) | Worst treatment | Worst mean AUC (%) |
|---|---|---|---|---|---|---|---|---|
| a | Model architecture | 0.001 | 0 | 1 | resnet | 84 | inception | 83 |
| b | Train dataset | 0.001 | 5 | 46 | full | 85 | challenge | 81 |
| c | Input resolution | 0.001 | 1 | 5 | 598 | 84 | 299–305 | 82 |
| d | Data augmentation | 0.17 | 0 | 0 | default | 83 | custom | 83 |
| e | Input normalization | 0.001 | 0 | 0 | default | 83 | erase mean | 83 |
| f | Use of segmentation | 0.001 | 0 | 2 | no | 84 | yes | 83 |
| g | Duration of training | 0.003 | 0 | 0 | full | 83 | half | 83 |
| h | SVM layer | 0.001 | 0 | 4 | no | 84 | yes | 83 |
| i | Augmentation on test | 0.001 | 1 | 12 | yes | 84 | no | 82 |
| j | Test dataset | 0.001 | 75 | | full.split | 96 | edra.clinical | 66 |
| a:b | | 0.001 | 1 | 8 | inception/full | 86 | inception/challenge | 80 |
| a:f | | 0.001 | 0 | 2 | resnet/no | 84 | inception/yes | 82 |
| b:e | | 0.001 | 0 | 2 | full/default | 86 | challenge/default | 80 |
| b:j | | 0.001 | 2 | | full/full.split | 98 | chall/edra.clinical | 63 |
| h:j | | 0.001 | 0 | | no/full.split | 97 | yes/edra.clinical | 65 |
| i:j | | 0.001 | 0 | | yes/full.split | 97 | no/edra.clinical | 65 |
| a:b:d | | 0.001 | 0 | 2 | inception/full/custom | 86 | inception/challenge/custom | 78 |
| a:d:e | | 0.001 | 0 | 2 | resnet/custom/default | 85 | inception/custom/default | 81 |
| a:f:j | | 0.001 | 0 | | resnet/yes/full.split | 97 | inception/yes/edra.clinical | 65 |
| b:d:e | | 0.001 | 0 | 1 | full/custom/default | 86 | challenge/custom/default | 79 |
| c:e:f | | 0.001 | 0 | 1 | 598/default/no | 86 | 299–305/default/yes | 82 |
| Residuals | | | 12 | | | | | |

TABLE III: Selected lines from the 176-line ANOVA table; most of the omitted lines (126) had p-values above 0.05. Absolute explanation based on the η² measure; relative explanation ignores residuals and choice of test dataset (j).

The analysis of the relative explanation shows an unsurprising, but still disappointing result: the performance gains are almost wholly due to the usage of more data (b). Other than data, the most important factor was the use of data augmentation on test (i). We performed it, as usual, by taking the test image, generating a number (in our case, 50) of augmented samples exactly like in training, collecting the prediction for each of the samples, and pooling the decisions (in our case, by taking the average prediction). Although not surprising for the literature of deep learning, that finding is relevant for the literature of melanoma detection, where many works still forgo augmentation in the test.
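The test-time procedure can be sketched generically. In the code below, `predict` and `augment` are placeholders for the trained network and for the training-time augmentation pipeline; both are supplied by the caller, and the toy versions shown are invented only to make the example runnable.

```python
import random


def predict_with_test_augmentation(image, predict, augment, n_samples=50, seed=None):
    """Average the predictions over randomly augmented copies of one image."""
    rng = random.Random(seed)
    scores = [predict(augment(image, rng)) for _ in range(n_samples)]
    return sum(scores) / len(scores)


# Toy stand-ins: an "image" is a list of pixel intensities in [0, 1].
def toy_augment(image, rng):
    shift = rng.uniform(-0.1, 0.1)  # random brightness shift
    return [min(1.0, max(0.0, px + shift)) for px in image]


def toy_predict(image):
    return sum(image) / len(image)  # mean intensity as a fake score


score = predict_with_test_augmentation(
    [0.2, 0.4, 0.6], toy_predict, toy_augment, n_samples=50, seed=0
)
```

Setting `n_samples=1` with an identity `augment` recovers the non-augmented decision on a single sample.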

Most of the findings tended to confirm the (limited) observations we made during the ISIC Challenge 2017, with two notable exceptions. Input resolution (c), which we deemed unimportant during the challenge, turned out to have a non-negligible effect. That result is particularly interesting, because we used a very rough form of augmented resolution, by inputting high-resolution images to the augmentation engine, but still feeding normal-resolution crops to the network. On the other hand, the use of an SVM decision layer (h), which we considered advantageous during the Challenge, turned out to have a sizable effect, only a negative one. Globally, ANOVA shows it is better not to use the SVM.

Normalization (e) and training duration (g) showed tiny (below 1%), but still significant effects. The choice for those factors must consider their very different costs: adding normalization costs next to nothing, both in implementation complexity and in training time, while the full training duration doubled the already many-hours-long training times.

As usual, most of the interactions were not significant, and even the ones that were had effect sizes too small to be worth noting. A notable exception was the interaction between model architecture and train dataset (a:b), whose 8% of relative explanation was bigger than most main effects. Model choice alone favors the simpler ResNet over Inception, but the combination of Inception with the full dataset is so advantageous that it offsets that effect. We had already observed, informally, this synergy between more data and deeper models during the Challenge.

The most disappointing result was the use of segmentation, which was not merely unhelpful but harmful. This result, however, is contingent on the scheme we chose for adding segmentation to classification.

We performed an additional correlation analysis with the full factorial experiment (Fig. 1), to highlight the correlations (a) among results on different test datasets; and (b) among different metrics. To keep the scatter plots directly interpretable, instead of taking the logit of the rates, we dealt with the non-linearity by using Spearman’s ρ instead of Pearson’s r as correlation measure.

The correlogram in Fig. 1(a) considers, like the ANOVA, the mean melanoma/keratosis AUC. The test dataset names appear in the diagonal, along with the maximum and minimum AUCs obtained for the 512 variations of the full design on that dataset. The scatter plots in the upper-triangular matrix follow the usual construction for correlograms. The lower-triangular matrix displays the Spearman’s ρ values: the mean estimate appears as the printed numeral and as the area of the solid circle; the bounds of the 95%-confidence interval appear as the area of the internal and external dashed circles. Negative correlations appear in red.

The correlation between different test datasets is far from perfect. That is, perhaps, obvious, but must be stressed, since it reveals that naively hyperoptimizing a model on one test set will not necessarily generalize to other data. The relationship between splits of different datasets is more subtle. Note how the two highest correlations are between the validation and the test splits of the ISIC 2017 Challenge, and between the dermoscopic and clinical splits of EDRA. This suggests that results measured on splits of the same dataset may not wholly generalize to data of the same type obtained under different conditions. Both phenomena show how hyperoptimizing on test gives unwarranted advantages, leading to overoptimistic assessments.

The correlogram in Fig. 1(b) considers only the results for the test split of the ISIC Challenge. Different metrics appear in the diagonal: average precision, area under the ROC curve, sensitivity (true positive rate), and specificity (true negative rate), for both melanoma and keratosis. The interpretation of the plots, numerals, circles, and colors is the same as above.

This correlogram is interesting for showing that many metrics are only modestly correlated. Particularly noteworthy is the specificity, which has not only a negative correlation with sensitivity (as expected), but also a negative or very small correlation with most of the other metrics.

(a) Correlogram of mean melanoma/keratosis AUC across test datasets.
(b) Correlogram of metrics on ISIC 2017 Test dataset across metrics.
Fig. 1: Correlograms with pair-wise correlation analyses. Sets appear on the diagonal; upper matrices show the scatter plots, and lower matrices show the Spearman correlation of each pair of sets. On lower matrices, numbers and solid circles’ areas show the mean estimates, and dashed circles’ areas show the 95%-confidence bounds. Non-significant estimates appear without the circles. All numbers in %, negative correlations are in red.
(a) Simulated sequential design on ISIC test split.
(b) Simulated sequential design on EDRA Atlas clinical images.
Fig. 2: Simulation of the sequential optimization of hyperparameters, considering all nine factors (a–i) as the main full factorial. Factors optimized on the dataset shown on the horizontal axis, and performance (mean melanoma/keratosis AUC) measured on the dataset indicated on the captions. For each case, we run 100 simulations, with random optimization sequences and starting points.

The violin plots show the kernel density estimation of the actual data (black dots) and the large red dot shows their mean.

We run a second full factorial design, with seven of the ten factors of the main experiment (a–e, g, i, j), fixing factors (f) and (h), and adding a factor to evaluate the presence versus absence of transfer learning (factor t). The new factorial design, with treatments, shows transfer learning as critical for performance: it explains (favorably) 14.7% of the absolute variation, and a whopping 62.8% of the relative variation of performance (computing those metrics the same way as in the main experiment, i.e., excluding the residuals, and the choice of test dataset and its interactions from the relative variation), with high significance (p-value below 0.001). We omit the ANOVA table for concision. Those results reinforce previous findings on the importance of transfer learning for melanoma detection [13].

As mentioned, full factorial designs are far too expensive for the majority of situations. The most common procedure is the exact opposite: taking a single factor to optimize, and performing a couple of experiments on that factor alone, keeping all others fixed (starting from a combination considered reasonable). Once a factor is decided, one commits to it and takes the next to optimize, until the procedure is complete.

We evaluate the impact of such a sequential procedure by simulating it using the measurements on the full design. We take, at random, both the starting treatment and the sequence of factors to test. For factors not yet optimized, the level is given by the starting treatment. Each factor is optimized in turn, by comparing the performance of the alternative treatments on the full-factorial data of a chosen hyperoptimization dataset. The outcome of a single simulation is the performance of the optimized treatment on a chosen measurement dataset. We use the mean keratosis/melanoma AUC as the performance metric.
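As an illustration, the simulated sequential procedure described above can be sketched as follows. This is a minimal sketch, not the code used in the experiments: the function names (`simulate_sequential`, `make_toy_scores`) are hypothetical, the factors are assumed to have two levels encoded as 0/1, and the score tables are filled with random values standing in for the full-factorial measurements.

```python
import itertools
import random


def simulate_sequential(opt_scores, meas_scores, n_factors, rng):
    """One simulated run of sequential optimization over two-level factors.

    opt_scores / meas_scores map a treatment (tuple of 0/1 levels) to the
    performance measured on the hyperoptimization / measurement dataset.
    """
    # Random starting treatment and random order of factors to optimize.
    current = [rng.randint(0, 1) for _ in range(n_factors)]
    order = list(range(n_factors))
    rng.shuffle(order)
    for factor in order:
        # Compare both levels of this factor, keeping the others fixed.
        candidates = []
        for level in (0, 1):
            trial = list(current)
            trial[factor] = level
            candidates.append(tuple(trial))
        # Commit to the level that scores best on the hyperoptimization data.
        current = list(max(candidates, key=lambda t: opt_scores[t]))
    # The outcome is the committed treatment's score on the measurement data.
    return meas_scores[tuple(current)]


def make_toy_scores(n_factors, rng):
    """Hypothetical stand-in for full-factorial measurements (random values)."""
    return {t: rng.random() for t in itertools.product((0, 1), repeat=n_factors)}


rng = random.Random(0)
n_factors = 4  # toy size, chosen only to keep the example small
opt = make_toy_scores(n_factors, rng)
meas = make_toy_scores(n_factors, rng)

# 100 simulations per pair of hyperoptimization and measurement datasets.
same = [simulate_sequential(opt, opt, n_factors, rng) for _ in range(100)]
cross = [simulate_sequential(opt, meas, n_factors, rng) for _ in range(100)]
print(sum(same) / len(same), sum(cross) / len(cross))
```

Passing the same table as both arguments corresponds to hyperoptimizing and measuring on the same dataset; passing two different tables corresponds to the cross-dataset case.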

Fig. 2 shows the results for pairs of hyperoptimization and measurement datasets, where we perform 100 simulations for each pair. The actual measurements appear as black dots, and the violin plots show their estimated density, while the big red dot shows their mean. The most notable observation is the (unrealistic) advantage of hyperoptimizing and measuring on the same dataset: not only do we get higher averages, but also a smaller variability. The advantage of hyperoptimizing and measuring on splits of the same dataset is more subtle, but present.

The expense of the full factorial design, the instability of the sequential procedure, and the limited correlation of performances across datasets seem to leave few options to practitioners. Fortunately, single-model schemes are seldom used today, and ensembles of several models help to alleviate those issues.

We simulated different ensemble strategies by pooling the predictions of models present in our full design. We evaluate three pooling strategies: average, max, and extremal. Average- and max-pooling work as usual. Extremal pooling takes, from the list of values being pooled, the value most distant from 0.5; it may be seen as a “hypothesis-invariant” max-pooling. In all cases, after pooling, we re-normalize the probability vector to ensure it sums up to one. Half the models in the full design entered as candidates, and we discarded in this experiment the models with the SVM layer, due to issues in making their probabilities commensurable with the deep-only models.
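The three pooling strategies can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `pool_predictions` and the array layout are hypothetical, and pooling is assumed to be applied independently to each class before re-normalization.

```python
import numpy as np


def pool_predictions(probs, strategy="average"):
    """Pool class probabilities from several models.

    probs: array of shape (n_models, n_samples, n_classes).
    Returns an array of shape (n_samples, n_classes) whose rows sum to one.
    """
    probs = np.asarray(probs, dtype=float)
    if strategy == "average":
        pooled = probs.mean(axis=0)
    elif strategy == "max":
        pooled = probs.max(axis=0)
    elif strategy == "extremal":
        # Per sample and class, pick the value farthest from 0.5
        # (on ties, the first model in the list wins).
        idx = np.abs(probs - 0.5).argmax(axis=0)
        pooled = np.take_along_axis(probs, idx[np.newaxis, ...], axis=0)[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Re-normalize so each pooled probability vector sums to one.
    return pooled / pooled.sum(axis=1, keepdims=True)


# Two hypothetical models, one sample, three classes.
probs = [
    [[0.90, 0.05, 0.05]],
    [[0.40, 0.35, 0.25]],
]
for strategy in ("average", "max", "extremal"):
    print(strategy, pool_predictions(probs, strategy))
```

In this toy case, extremal pooling keeps the confident model's low values (0.05) as well as its high value (0.90), whereas max-pooling keeps only the highest value per class.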

Fig. 3 shows the main results. Average-pooling was, by far, the best choice for pooling the decision. Such a clear-cut advantage came as a surprise to us, as max-pooling often outperforms average-pooling in related tasks. If no other information is available, simply average-pooling randomly selected models is a reasonable strategy.

The use of dozens, or even hundreds, of models may sometimes be justified in critical tasks (like medical decisions), but training and evaluating so many deep networks is cumbersome. Fortunately, as Fig. 4 shows, a handful of models seems to work just as well. The results shown here are the “good news” part of this paper: we can escape the expense of the full factorial design, and the instability of the sequential designs, by averaging a dozen or so models with parameters chosen entirely at random. Although the random ensembles start very unstable, they soon converge to a reasonable model, in average and variability. If we decide to perform a full factorial, there is good news too: the best models learned in one dataset seem to be informative to compose the ensembles in other datasets, allowing us to reach top performance with very small ensembles.

Here, again, the unfair advantage of optimizing (selecting the models for the ensemble) and measuring performance on the same dataset appears. The advantage is small but systematic for the test split of ISIC (Fig. 4(a)); it is much more apparent for the challenging collection of clinical images of EDRA Atlas (Fig. 4(b)).

Fig. 3: Evaluation of ensemble strategies, by pooling the prediction of a given number of partial models using average-, max-, or extremal-pooling. Left: cumulative effect of adding partial models, starting with the best (as evaluated by the internal test split). Right: same plot, with models randomly shuffled. (Best viewed in color.)
(a) Ensembles on ISIC test split.
(b) Ensembles on EDRA Atlas clinical images.
Fig. 4: Detailed analysis of ensembles, contrasting the dataset used to choose the models (shown as different curves) in order to optimize the results for a measurement dataset (indicated in captions). We sampled 10 different random ensembles. (Best viewed in color.)

As a final experiment, we made two simulations of “new” submissions to the ISIC 2017 Challenge. The first simulates a blind procedure, mimicking the conditions of the challenge (limited information about the validation split, no information whatsoever about the test split). In that simulation, we assume a full factorial experiment performed only on our internal validation split, and keep the 32 best models (as measured in that split). We then test 32 incremental ensembles on the ISIC validation set, finding that the ensembles with 15 or 16 models have the best performance. We commit to the ensemble with 15 models and do one evaluation of that ensemble on the ISIC test split, finding AUCs of 0.895 (melanoma), 0.967 (keratosis), and 0.931 (combined). To put those numbers in perspective, in the actual competition the best AUCs were, respectively, 0.874, 0.965, and 0.911 (obtained by different participants).
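The blind procedure can be illustrated with a small sketch: rank the models on an internal split, choose the ensemble size on a validation split, and evaluate the chosen ensemble once on the test split. Everything below is an assumption made for illustration: the data are synthetic, the task is reduced to a single binary score per sample, the names (`auc`, `incremental_ensembles`, `toy_split`) are hypothetical, and average-pooling is used for the ensembles.

```python
import numpy as np


def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count half)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def incremental_ensembles(ranked_scores, y_true):
    """AUC of the average-pooled top-k ensemble, for k = 1..n_models.

    ranked_scores: (n_models, n_samples), already ordered best-first.
    """
    ranked_scores = np.asarray(ranked_scores, dtype=float)
    sizes = np.arange(1, len(ranked_scores) + 1)[:, None]
    running_mean = np.cumsum(ranked_scores, axis=0) / sizes
    return [auc(y_true, row) for row in running_mean]


rng = np.random.default_rng(0)
n_models = 8
noise = rng.uniform(0.5, 2.0, size=n_models)  # hypothetical per-model noise


def toy_split(n_samples):
    """Synthetic labels and per-model scores (label plus model noise)."""
    y = rng.integers(0, 2, size=n_samples)
    s = y + noise[:, None] * rng.normal(size=(n_models, n_samples))
    return y, s


y_int, s_int = toy_split(200)    # internal split: ranks the models
y_val, s_val = toy_split(200)    # validation split: picks the ensemble size
y_test, s_test = toy_split(200)  # test split: evaluated only once

order = np.argsort([-auc(y_int, s) for s in s_int])
val_aucs = incremental_ensembles(s_val[order], y_val)
best_k = int(np.argmax(val_aucs)) + 1
test_auc = incremental_ensembles(s_test[order][:best_k], y_test)[-1]
print(best_k, round(float(test_auc), 3))
```

The point of the sketch is the separation of roles: the test labels are touched a single time, after both the ranking and the ensemble size have been fixed on other data.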

We also simulated a privileged procedure, hyperoptimizing without restraint on the ISIC Challenge test split itself. We use the full design on the ISIC test split to select the 32 best models, and then test 32 incremental ensembles, handpicking the best result for melanoma and for keratosis. We find AUCs of 0.916 (melanoma), 0.970 (keratosis), and 0.943 (combined). Again, to get a better grasp of the difference, consider that the increase of about 2 percentage points in the melanoma AUC is larger than the difference between the 1st and the 4th teams ranked in the 2017 Challenge.

V Discussion

When one contrasts the blind and the privileged procedures side-by-side, the problems with the latter become apparent, but the privileged protocol is the most common in machine-learning literature (we are often guilty ourselves). Hyperoptimizing on test does not result from researchers’ desire to cheat, but from their natural tendency to exploit scarce existing data as much as possible. Salzberg has warned researchers about the dangers of “repeated tuning” since 1997; his work [30] is often cited, but the issue is still far from solved.

Avoiding that pitfall requires a very strong commitment, which researchers seem unable to keep. That only reinforces the importance of regular curated challenges — like the ISIC Challenge — in which the test set is withheld at least until the evaluations are over. The ImageNet competition is perhaps the best example of the extraordinary impact of having such a curated competition every year.

Our findings explain, in part, why performances observed in practice fall far short of the numbers we get in our labs. In a single round of experiments, analyzing only two levels for nine factors, the unwarranted advantage of hyperoptimizing on test is already notable. In actual research, with many rounds of experiments over dozens of factors and hundreds of levels, the gap may be much wider.

Our main evaluation of 2560 different models shows that almost half the variability in performance is explained by the amount of data alone. That reinforces the deep-learning creed on the “unreasonable power of data” and has important consequences for the melanoma detection community: in order to move research forward, we need to curate larger shared datasets. The ISIC Archive is an essential step in that direction — but it would have to grow almost tenfold to match the largest (private) dataset reported in the literature [12].

Despite the predominance of data, other factors appeared relevant. The most noteworthy is, perhaps, the use of data augmentation on test samples. The use of deeper models in combination with extra data also appeared as an important advantage. Increased resolution, even in a very limited scheme, was also advantageous.

Notable negative results were the use of segmentation to help classification, and the use of an extra SVM layer. The negative result on segmentation is the most surprising, and needs further exploration, since works in the literature report many different ways to incorporate segmentation information into classification, often with improved performance. We would like to explore that theme in future work, exhaustive in that particular scope.

The limited correlation between different possible metrics (in particular with specificity) has consequences for the melanoma detection literature: stressing one metric over another may lead the community in different directions.

The ensemble experiments bring the most encouraging news, by providing reliable performance without either the expense of the full factorial design or the instability of the traditional sequential optimization. Even a simple accumulation of enough randomly sampled models was sufficient to provide adequate performance. Learning the best models on one dataset, however, helps to select the best ensembles for other datasets. Ensembles appear as a promising avenue for future explorations. In future work, we would like to design and evaluate techniques more sophisticated than pooling the decisions of the models, like model stacking or boosting.


E. Valle and M. Fornaciali are partially funded by Google Research Awards for Latin America 2016 & 2017; E. Valle is also partially funded by a CNPq PQ-2 grant (311486/2014-2). A. Menegola is funded by CNPq. This project is partially funded by CNPq Universal grant (424958/2016-3). RECOD Lab. is partially supported by diverse projects and grants from FAPESP, CNPq, and CAPES. We gratefully acknowledge the donation of a Tesla K40 and a TITAN X GPU by NVIDIA Corporation, used in this work. We thank Fabio Perez, Micael Carvalho, and Fillipe D. M. de Souza for the final revision. We thank Prof. M. Emre Celebi for kindly providing the machine-readable metadata of the EDRA Interactive Atlas of Dermoscopy and for the help with the final revision of the first draft.