I Introduction
Deep learning has become the gold standard for image classification tasks. Automated melanoma screening is no exception, with a state of the art that has improved rapidly since the adoption of those models.
Deep learning, however, poses its own challenges, requiring high computational power and copious amounts of annotated data. Deep learning is also challenging to parameterize, often involving dozens of hyperparameters [1]. Although high-level frameworks have simplified its use, much craftsmanship is involved in designing and optimizing its models.
We examine those difficulties, with an exhaustive evaluation of ten factors faced when designing deep networks for melanoma detection. We explore the variations of those factors in two full factorial designs. Our main experiment evaluates nine factors, for five different test datasets, resulting in $2^9 \times 5 = 2560$ treatments. Our assessment of transfer learning evaluates eight factors for five datasets, with $2^8 \times 5 = 1280$ treatments. We analyze those results using a multiway ANOVA, to obtain decisions that generalize across datasets.
Full factorial designs are too expensive for most practical contexts, but the complete set of data we obtained allowed us to simulate two alternative designs: the traditional sequential optimization (a single factor at a time) and ensembles of models. We show that the latter provides the best performance, at reasonable costs.
Our participation at the ISIC (International Skin Imaging Collaboration) 2017 Challenge [2] inspired this paper. At the time, with a strategy of aggressively optimizing our models, we were ranked first in melanoma classification [3], with an AUC of 0.874.
Here we have very different goals, focusing on methodological issues for designing and evaluating deep learning models for melanoma detection. The main contributions are:

A main evaluation of nine factors affecting the design of deep learning for melanoma detection, together with their cross-interactions up to the third level;

A quantitative appraisal of the importance of transfer learning, in perspective with seven of the nine factors analyzed in the main experiment;

A critique of hyperparameter optimization, demonstrating that the customary procedure of optimizing hyperparameters and evaluating techniques on the same dataset leads to overoptimistic results;

A demonstration of how simple ensemble solutions can provide good results, without sacrificing methodological rigor;

State-of-the-art AUCs in all three classification tasks of the ISIC Challenge 2017/Part 3. The focus of this paper is not on increasing metrics, but those results showcase the previous point;

Finally, our entire source code is available, and our procedures are explained step by step, from the acquisition of the data to the generation of the tables and graphs of this paper (footnote 1: https://github.com/learningtitans/datadepthdesign). We are committed to making this work as reproducible as technically feasible.
We organized the remaining text as follows: we discuss the recent state of the art in Section II, and very briefly overview our participation in the ISIC Challenge 2017. We detail our goals, materials, and methods in Section III. Experimental results and analyses appear in Section IV. Finally, we review and discuss the main findings in Section V.
II Survey of recent techniques
Since 2016, the state of the art in automated melanoma screening has taken a sharp turn towards deep learning. We will focus on those results, without emphasis on relevant previous literature [4, 5], and refer the interested reader to more comprehensive surveys [6, 7]. We assume that the reader has passing familiarity with deep learning and its issues; if needed, we refer the reader to a complete overview [1].
Existing works either train deep networks from scratch [8, 9, 10]; or reuse the weights from pretrained networks [11, 12, 13, 14, 15, 16], in a scheme called transfer learning.
Transfer learning is usually preferred, as it alleviates the main issue of deep learning for melanoma detection: datasets that are far too small, most often comprising a few thousand samples. (Contrast that with the ImageNet dataset, employed to evaluate deep networks, with more than a million samples.) Training from scratch is preferable only when attempting new architectures, or when avoiding external data due to legal/scientific issues. Menegola et al. [13] explain and evaluate transfer learning for melanoma detection in more detail. The results in this paper reinforce that transfer learning is critical for best performance. In the main experiments, all treatments employ transfer learning.
Whether using transfer or not, works vary widely in their choice of deep-learning architecture, from the relatively shallow (by today’s standards) VGG [11, 13, 17, 14], through the mid-range GoogLeNet [11, 12, 15, 18, 19], to the deeper ResNet [15, 20, 3, 21, 16] or Inception [12, 3, 22]. On the one hand, more recent architectures tend to be deeper, and to yield better accuracies; on the other hand, they require more data and are more difficult to parameterize and train. Although high-level frameworks for deep learning have simplified training those networks, a good deal of craftsmanship is still involved. In this work we contrast two architectures: ResNet101v2 [23], and Inceptionv4 [24]. Both are very deep and show outstanding performance on ImageNet, but Inceptionv4 is considerably deeper and bigger than ResNet101v2.
Data augmentation is another technique used to bypass the need for data, while also enhancing networks’ invariance properties. Augmentation creates a myriad of new samples by applying random distortions (e.g., rotations, crops, resizes, color changes) to the existing samples. Augmentation provides the best performance when applied to both train and test samples, but only the most recent melanoma detection works follow that scheme [9, 13, 3, 21, 22]. Train-only augmentation is still very common [11, 12, 17, 14, 25, 18, 19].
In this work, we evaluate two augmentation distortion setups (TensorFlow/Slim’s default for each network, and an attempt to customize it for skin lesions). We always apply augmentation to the train samples, but evaluate the impact of applying or not applying augmentation to the test samples.
Works based on global features or bags of visual words often preprocess the images (some recent works using deep learning do it as well) to reduce noise, remove artifacts (e.g., hair), enhance brightness and color, or highlight structures [26, 27, 8, 20, 10, 28]. The deep-learning ethos usually forgoes that kind of “handmade” preprocessing, relying instead on networks’ abilities to learn those invariances, with the help of data augmentation if needed.
On the other hand, segmentation as preprocessing is common in deep learning for melanoma detection [9, 18], sometimes employing a dedicated network to segment the lesion before forwarding it to the classification network [11, 25, 16]. Those works usually report improved accuracies. In this work, we also evaluate the impact of using segmentation to help classification.
If ad hoc preprocessing (e.g., hair removal) is atypical in deep learning, statistical preprocessing is very common. Many networks fail to converge if the expected value of input data is too far from zero. Learning an average input vector during training and subtracting it from each input is standard, and performing a comparable procedure for standard deviations is usual. The procedure is so routine that, with rare exceptions [29, 3], researchers do not even mention it. In this work, we evaluate the impact of two image normalization schemes: TensorFlow/Slim’s default for each network, and an attempt to customize it for skin lesions.
Deep network architectures can directly provide the classification decisions, or can provide features for the final classifier, often support vector machines (SVM). Both the former [12, 14, 15] and the latter [8, 17, 13, 16] procedures are readily found for melanoma detection. We evaluate those choices of adding or not adding an SVM layer. Also common are ensemble techniques, which fuse the results from several classifiers into a final decision [20, 3, 21, 22, 15].
As shown above, the literature on automated melanoma screening is vast: even the limited scope chosen for this survey comprises more than a dozen works. Making comparisons across those techniques was, until very recently, next to impossible due to poorly documented choices of datasets, splits, implementation details, or even evaluation metrics [7]. The ISIC Challenge [2] has sharply improved that scenario, by providing standards for data and metrics, and by requiring participants to publish working notes.
A subtler way the ISIC Challenge served the community was by clearly separating validation and test datasets, and by keeping test datasets secret until all evaluations were over. When test sets are open, researchers have a strong tendency to optimize hyperparameters on test (hyperoptimize on test, for short).
Hyperoptimizing on test is not the same as overfitting the training set: the latter is a technical shortcoming in which one overadjusts one’s parameters (or hyperparameters) to train (or validation) data, while the former is a methodological error in which one adjusts one’s hyperparameters using test data. Hyperoptimizing on test is also different from direct train–test contamination (mixing train and test samples, or using statistics from the test set in the model), a crude methodological blunder that most practitioners avoid. Hyperoptimizing on test is a subtle methodological error in which one indirectly uses information from the test set by, for example, designing, refining, and evaluating different models over a test set, picking the best one, and reporting its results on that same test set [30]. Since the whole process may stretch over weeks or months, researchers fail to realize they are using privileged information to improve the model.
As we will discuss, hyperoptimizing on test leads to overoptimistic estimations of performance.
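The selection effect behind hyperoptimizing on test can be illustrated with a toy simulation. The sketch below is not one of the experiments of this paper: it assumes a set of hypothetical candidate models that all share the same true accuracy (`true_acc`), so any difference measured among them is pure sampling noise. Picking the best candidate on a test set and reporting its score on that same set still yields an inflated number, while the same candidate measured on fresh data falls back to the true accuracy. All names and parameter values are illustrative assumptions.

```python
import random

def simulate_selection_bias(n_models=50, n_test=200, true_acc=0.70,
                            n_trials=200, seed=0):
    """Pick the best of several equally good models on one test set,
    then re-measure the picked model on fresh data."""
    rng = random.Random(seed)

    def measured_accuracy():
        # Accuracy observed on n_test samples for a model whose
        # true accuracy is true_acc.
        hits = sum(rng.random() < true_acc for _ in range(n_test))
        return hits / n_test

    reported, fresh = 0.0, 0.0
    for _ in range(n_trials):
        # Every candidate has the same true accuracy; differences are noise.
        scores = [measured_accuracy() for _ in range(n_models)]
        reported += max(scores)       # score of the model picked on the test set
        fresh += measured_accuracy()  # the picked model, measured on unseen data
    return reported / n_trials, fresh / n_trials

reported, fresh = simulate_selection_bias()
print(f"reported on the selection set: {reported:.3f}")
print(f"measured on fresh data:        {fresh:.3f}")
```

The gap between the two printed values comes only from reusing the test set for selection, which is the mechanism discussed above.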
II-A ISIC Challenge 2017
We documented our participation in Parts 1 and 3 of ISIC Challenge 2017 in our working notes [3]. In this section, we briefly summarize that participation, our findings, and contrast our aims then and now.
Our team has been working on melanoma classification since early 2014 [31], and has been employing deep learning with transfer learning for that task since 2015 [32]. On the other hand, we were tackling skin-lesion segmentation for the very first time. Our team reached 1st place in melanoma classification (AUC = 0.874), 3rd place in keratosis classification (AUC = 0.943), and 3rd place in overall melanoma/keratosis classification (mean AUC = 0.908). We also reached 5th place in skin-lesion segmentation (Jaccard score = 0.754). The organizers published a comprehensive summary of the entire challenge [2].
The experiments presented here are inspired by the findings we made during the challenge, but this paper takes different directions, especially regarding methodology. If during the challenge we tempered scientific rigor with the sportive desire to win, here we stress only the former: this work does not focus on maximizing AUCs, but on broader issues of experimental design and validation for the proposed methods.
III Methods, Data, and Models
Our first — and more practical — goal is an exhaustive evaluation of ten factors involved in designing deep learning models (Table I).
Table I. Factors and levels evaluated.
| Symbol | Factor | Levels |
|---|---|---|
| a | Model | ResNet101 v2 versus Inception v4 |
| b | Train dataset | Train split of ISIC Challenge 2017 versus Full: Level 1 + ISIC Archive + U. of Porto PH2 + U. of Edinburgh Dermofit |
| c | Input resolution | Pre-augmentation resolution: $299 \times 299$ pixels ($305 \times 305$ if using segmentation) versus $598 \times 598$ pixels |
| d | Train augmentation | TensorFlow/Slim’s default versus Level 1 + rotations = on, fast_mode = off, minimum_area = 0.20 |
| e | Input normalization | TensorFlow/Slim’s default versus Subtract mean of samples’ pixels |
| f | Segmentation | No segmentation information versus Segmentation pre-encoded at input |
| g | Training length | Short (about half the length of Full) versus Full (30k batches for ResNet / 40k batches for Inception; batch_size = 32) |
| h | SVM decision layer | Absent versus Present |
| i | Test augmentation post-deep | No (decision on single non-augmented sample) versus Yes (decision on average of 50 random-augmented samples) |
| j | Test dataset | Split of Train/Full vs. Validation of ISIC Chall. ’17 vs. Test of ISIC Chall. ’17 vs. EDRA/Dermoscopic vs. EDRA/Clinical |
| t | Transfer learning | Training from scratch (weights initialized at random) vs. Transfer from ImageNet (checkpoint published by TensorFlow/Slim) |
Most of those factors are not particular to melanoma detection, but are relevant for all image classifiers using deep learning. However, a preoccupation with resolution (c), augmentation customization (d), and segmentation (f) makes more sense for melanoma detection (or at least for medical images in general) than for general-purpose tasks, like ImageNet.
Our second, more philosophical, goal is discussing methodological issues in the design and evaluation of classification models, especially those which, like melanoma detection, aim at medical applications. We are far from the first to point out [30, 33] that the common practice of meta-optimizing a technique on the same test set used to evaluate the technique leads to overoptimistic results. In this paper, we showcase that effect as we cross-analyze the results in five different test sets.
III-A Software and Hardware
We ran the experiments on a cluster of Ubuntu Linux machines, on a variety of NVIDIA GPUs, including Titan X Maxwell, Titan X Pascal, and Tesla K40. We built the classification models on Python/TensorFlow v1.3, using the Slim framework. Slim provides ready-to-use models for ResNet101v2 and Inceptionv4, which we used, with slight adaptations. The statistical analysis ran in R.
In this text we aimed to provide the details needed to understand all results; for complete reproducibility, we provide the end-to-end pipeline, from the raw data to the tables and figures, at our code repository.
III-B Experimental Design
The main experimental design was a two-level full factorial design for nine of the ten factors mentioned above (a–i), for each one of the five test datasets (j), resulting in $2^9 \times 5 = 2560$ treatments evaluated. We ran a second experiment to evaluate the impact of transfer learning, evaluating eight factors (a–e, g, i, t), and fixing (f) as no segmentation and (h) as SVM layer absent, resulting in $2^8 \times 5 = 1280$ treatments evaluated.
In all experiments, we used the area under the Receiver Operating Characteristic curve (AUC) as the main metric. Following the ISIC Challenge 2017, we use the mean AUC between the melanoma-vs-all and the keratosis-vs-all tasks as the measured outcome in all experiments.
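As a minimal sketch of that outcome measure, the code below computes each one-vs-all AUC through its rank interpretation (the probability that a positive sample scores higher than a negative one, counting ties as one half) and averages the two. The function names, the class labels as strings, and the toy scores are assumptions made for illustration; they are not taken from the paper's pipeline.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(classes, melanoma_scores, keratosis_scores):
    """Mean of the melanoma-vs-all and keratosis-vs-all AUCs."""
    mel = roc_auc([int(c == "melanoma") for c in classes], melanoma_scores)
    ker = roc_auc([int(c == "keratosis") for c in classes], keratosis_scores)
    return (mel + ker) / 2

# Toy example with six lesions (illustrative values only).
classes = ["melanoma", "nevus", "keratosis", "nevus", "melanoma", "keratosis"]
melanoma_scores = [0.9, 0.2, 0.1, 0.4, 0.3, 0.2]
keratosis_scores = [0.05, 0.1, 0.8, 0.6, 0.1, 0.7]
print(mean_auc(classes, melanoma_scores, keratosis_scores))  # 0.9375
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for test sets of the sizes used here; a rank-based implementation would be preferable for much larger sets.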
The analysis for both experiments was a classical multiway ANOVA, in which the test datasets entered as one of the factors. That choice highlights our aim to make decisions that generalize across datasets, in contrast to maximizing the performance for a particular dataset.
ANOVA provides both a significance analysis (p-value) and a partition of the variance, which allows us to roughly estimate the relative importance/influence of each factor, or combinations of factors. For effect size/explanatory power, we use the $\eta^2$ measure, which is the ratio of the variances (sums of squares), extensively used due to its simplicity. The ANOVA table of the main experiment is summarized in Table III and explored in the next section. We also employed correlograms to highlight issues relating to the choice of test dataset, and performance metric (Fig. 1). We discuss the results for the transfer experiment in Section IV.
Most of the time, our full factorial designs are too costly to use; thus our next set of experiments, exploring ensemble techniques, helps in more practical situations. We evaluated a straightforward ensemble, which just pools the decisions of several classifiers, and showed that it provides very good performance, without the costs of a full design.
We also simulated the most common procedure employed by researchers and practitioners: sequential optimization of hyperparameters, in which one starts from a given configuration of hyperparameters, selects one of them to evaluate, commits to the best results, and proceeds to evaluate the next. Although such a procedure is very fast (it allows optimizing the nine factors of our main design in just 18 experiments), it is suboptimal in comparison to ensembles.
Finally, we showed that the customary procedure of optimizing the hyperparameters in the same test set used to evaluate the technique leads to overoptimistic results in both the ensemble and the sequential design.
Details and results of all procedures appear in Section IV.
III-C Data
Due to deep learning’s greediness for data, we sought all high-quality publicly available (for free, or for a fee) sources to compose our dataset:

ISIC 2017 Challenge [2], the official challenge dataset, with 2,000 dermoscopic images (374 melanomas, 254 seborrheic keratoses, and 1,372 benign nevi).

ISIC Archive, with over 13,000 dermoscopic images.

Dermofit Image Library [34], with 1,300 clinical images (76 melanomas, 257 seborrheic keratoses).

PH2 Dataset [35], with 200 dermoscopic images (40 melanomas).

EDRA Interactive Atlas of Dermoscopy [36], with 1,000+ clinical cases (270 melanomas, 49 seborrheic keratoses), each with at least two images (dermoscopic, and close-up clinical).
We used essentially the same data sources we employed during the ISIC 2017 challenge, except for the no longer available dataset created by the Department of Medical Informatics, RWTH Aachen University (cited in our report [3] as “IRMA Dataset”). Even with that exclusion, the new dataset grew, due to a more careful matching of diagnostics among the sources (instead of dropping the doubtful cases).
Besides that matching, which we performed with a thesaurus of the terms used in the different datasets, we also annotated images by case, by aliases (same image with different names), and by near-duplicates (two almost-copies of the same lesion). When creating the train and test sets, we barred any cases, aliases, or near-duplicates from splitting across sets.
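One way to enforce such a constraint is to merge all images linked by a case, alias, or near-duplicate annotation into groups, and then assign whole groups to one side of the split. The sketch below uses a union-find structure for the grouping; it illustrates the idea under that assumption and is not the code used to build our splits. The function names, the `links` format (pairs of image identifiers), and the simple fill-the-test-set-first assignment are hypothetical.

```python
import random

def group_images(image_ids, links):
    """Union-find: images connected by any link (same case, alias,
    near-duplicate) end up in the same group."""
    parent = {i: i for i in image_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for i in image_ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def split_by_group(image_ids, links, test_fraction=0.2, seed=0):
    """Assign whole groups to test until it holds about test_fraction
    of the images; the remaining groups go to train."""
    groups = group_images(image_ids, links)
    random.Random(seed).shuffle(groups)
    target = test_fraction * len(image_ids)
    train, test = [], []
    for group in groups:
        (test if len(test) < target else train).extend(group)
    return train, test

ids = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
links = [("a", "b"), ("b", "c"), ("d", "e")]  # e.g. one case and one alias pair
train, test = split_by_group(ids, links)
print(sorted(train), sorted(test))
```

Because groups are never divided, the test set may slightly overshoot the requested fraction when a large group is assigned last.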
Data sources affect the train and test datasets. For the train dataset (factor b), we contrasted (1) using only the official train split of the ISIC Challenge 2017 dataset, to (2) joining the train split of the ISIC Challenge, the ISIC Archive, the Dermofit Library, and the PH2 Dataset and extracting from that full dataset a train split. For the test dataset (factor j), we contrasted (1) an internal test split extracted from our full dataset; (2) the official validation split and (3) the official test split of ISIC Challenge 2017; (4) the dermoscopic images and (5) the clinical images of the EDRA Interactive Atlas of Dermoscopy. Table II summarizes the final assembled sets.
Table II. Final assembled sets (number of images per class).
| Dataset | Melanoma | Nevus | Keratosis |
|---|---|---|---|
| ISIC Challenge 2017 train split | 374 | 1372 | 254 |
| Full train (composition of datasets) | 1227 | 10124 | 710 |
| Internal test split from full | 135 | 3129 | 89 |
| ISIC Challenge 2017 validation split | 30 | 78 | 42 |
| ISIC Challenge 2017 testing split | 117 | 393 | 90 |
| EDRA Atlas of Dermoscopy (each version) | 518 | 1154 | 95 |
III-D Models
For the main experiment, we always employed pretrained models that proved successful for the ImageNet challenge, loading the weights from checkpoints published in the TensorFlow repository. For the transfer experiment, we contrasted those with the same models initialized with random weights.
To evaluate the choice of the model (factor a) we contrasted two architectures: ResNet101v2 [23], and Inceptionv4 [24], using the reference implementation available in TensorFlow/Slim v1.3.
In this paper, segmentation was used only as an ancillary input for classification (factor f). For the ISIC Challenge 2017, we had used a segmentation network based on the work of Ronneberger et al. [37] and Codella et al. [38]. For this work, we streamlined that model, reducing the number of parameters, removing the fully-connected and Gaussian-noise layers, and adding batch-normalization and dropout layers. The new model (footnote 2: https://github.com/learningtitans/isbi2017part1) is faster to train and occupies much less disk space. We trained the segmentation models on the same images as their corresponding classification models.
Because of the lack of literature consensus on how to use segmentation for melanoma detection, we opted for schemes with minimal changes to both data and networks. We pre-evaluated two candidates: pixel-wise multiplying the input RGB images by the segmentation masks versus pre-encoding the four planes (R, G, B, and mask) into three planes, keeping the rest of the networks unchanged. For the full design, we only considered the latter, which appeared more promising in those preliminary tests.
Pre-encoding the masks required slightly adapting ResNet and Inception, by adding the pre-encoding adapter layers. For both ResNet and Inception we added three convolutional layers before the input: two layers with 32 filters, and a third with 3 filters. All convolutional layers used $3 \times 3$ kernels and stride of 1. Since the ResNet101v2 and Inceptionv4 models require input images of $299 \times 299$ pixels, the adapter layer took $305 \times 305$ pixel images, to account for the 2 border pixels lost at each convolutional layer.
IV Experiments
As explained, the main experiment was a full factorial design with nine two-level factors (a–i), and five test datasets (factor j). We used a classical multiway ANOVA with the mean AUC for melanoma and keratosis as the measured outcome (with the small technicality of taking the logit of that measure, since, when working with rates, the logit helps to fulfill ANOVA’s assumption of Gaussian residuals). We considered all main effects, and up to 3-way interactions. We considered higher-order interactions unlikely and assigned them to the residuals.
Table III shows a summary of the main experiment’s ANOVA, with the symbols for the factors and interactions in the first column, and the names of the main factors in the second. The remaining columns show the outcomes of the test. The most important columns are p-value, which measures statistical significance, and explanation (%), which measures effect size/explanatory power. The p-value is inferred, as usual, from the F-statistic of ANOVA, while the explanatory power uses the $\eta^2$ measure.
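For reference, the logit transform mentioned above maps a rate $p \in (0, 1)$ to $\log\frac{p}{1-p}$. The short sketch below (function names are ours, and the sample values are only illustrative) shows the transform and its inverse, and how it stretches differences near the upper end of the scale, where AUCs would otherwise be compressed against the ceiling of 1.

```python
import math

def logit(p):
    """Map a rate in (0, 1) to the real line: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a real number back to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for auc in (0.66, 0.84, 0.96):
    print(f"AUC={auc:.2f} -> logit={logit(auc):+.3f}")
```

A one-point AUC gain from 0.95 to 0.96 is larger on the logit scale than a gain from 0.65 to 0.66, which matches the intuition that improvements are harder to obtain near the ceiling.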
We present the absolute explanation (considering the entire table) for reference, but our analysis is focused on the relative explanation, which ignores the choice of the test set (j) and the residuals. The reason for ignoring those is that they are not actual choices for designing a new model; therefore, relative explanations indicate better the relative importance of choices to practitioners.
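The two explanation measures can be sketched as follows. The absolute explanation of a term is its sum of squares divided by the total; the relative explanation uses as denominator only the terms that do not involve the test dataset (j) and are not the residuals. The function name, the dictionary format (terms written as in the table, e.g. `"a:b"`), and the sums of squares in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values of our ANOVA.

```python
def explanations(sums_of_squares, excluded=("j", "Residuals")):
    """Return (absolute, relative) explanation in percent for each ANOVA term.

    A term is left out of the relative measure if it is the residuals or
    if it involves an excluded factor (e.g. "b:j" involves "j")."""
    def is_excluded(term):
        return any(part in excluded for part in term.split(":"))

    total = sum(sums_of_squares.values())
    design_total = sum(ss for term, ss in sums_of_squares.items()
                       if not is_excluded(term))
    absolute = {term: 100 * ss / total for term, ss in sums_of_squares.items()}
    relative = {term: 100 * ss / design_total
                for term, ss in sums_of_squares.items() if not is_excluded(term)}
    return absolute, relative

# Hypothetical sums of squares (they add up to 100 for readability).
ss = {"b": 5.0, "i": 1.0, "a:b": 1.0, "j": 75.0, "b:j": 2.0, "Residuals": 16.0}
absolute, relative = explanations(ss)
print(round(absolute["b"], 1), round(relative["b"], 1))  # 5.0 71.4
```

In this toy example a factor that explains only 5% of the total variation still accounts for most of the variation that a designer can actually control, which is why the analysis concentrates on the relative measure.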
The original full table contained all main effects, and up to 3-way interactions. However, not surprisingly, 126 of the resulting 176 rows were non-significant interactions, which were omitted here. We also left out those interactions with relative explanations lower than 1%, even if significant. With the notable exception of the customized data augmentation (d), all main effects were significant, but most of their relative explanations were small.
Table III. Summary of the ANOVA of the main experiment.
| Symbol | Factor | p-value | Absolute explanation (%) | Relative explanation (%) | Best treatment | Best mean AUC (%) | Worst treatment | Worst mean AUC (%) |
|---|---|---|---|---|---|---|---|---|
| a | Model architecture | 0.001 | 0 | 1 | resnet | 84 | inception | 83 |
| b | Train dataset | 0.001 | 5 | 46 | full | 85 | challenge | 81 |
| c | Input resolution | 0.001 | 1 | 5 | 598 | 84 | 299–305 | 82 |
| d | Data augmentation | 0.17 | 0 | 0 | default | 83 | custom | 83 |
| e | Input normalization | 0.001 | 0 | 0 | default | 83 | erase mean | 83 |
| f | Use of segmentation | 0.001 | 0 | 2 | no | 84 | yes | 83 |
| g | Duration of training | 0.003 | 0 | 0 | full | 83 | half | 83 |
| h | SVM layer | 0.001 | 0 | 4 | no | 84 | yes | 83 |
| i | Augmentation on test | 0.001 | 1 | 12 | yes | 84 | no | 82 |
| j | Test dataset | 0.001 | 75 | | full.split | 96 | edra.clinical | 66 |
| a:b | | 0.001 | 1 | 8 | inception/full | 86 | inception/challenge | 80 |
| a:f | | 0.001 | 0 | 2 | resnet/no | 84 | inception/yes | 82 |
| b:e | | 0.001 | 0 | 2 | full/default | 86 | challenge/default | 80 |
| b:j | | 0.001 | 2 | | full/full.split | 98 | chall/edra.clinical | 63 |
| h:j | | 0.001 | 0 | | no/full.split | 97 | yes/edra.clinical | 65 |
| i:j | | 0.001 | 0 | | yes/full.split | 97 | no/edra.clinical | 65 |
| a:b:d | | 0.001 | 0 | 2 | inception/full/custom | 86 | inception/challenge/custom | 78 |
| a:d:e | | 0.001 | 0 | 2 | resnet/custom/default | 85 | inception/custom/default | 81 |
| a:f:j | | 0.001 | 0 | | resnet/yes/full.split | 97 | inception/yes/edra.clinical | 65 |
| b:d:e | | 0.001 | 0 | 1 | full/custom/default | 86 | challenge/custom/default | 79 |
| c:e:f | | 0.001 | 0 | 1 | 598/default/no | 86 | 299–305/default/yes | 82 |
| Residuals | | | 12 | | | | | |
The analysis of the relative explanation shows an unsurprising, but still disappointing result: the performance gains are almost wholly due to the usage of more data (b). Other than data, the most important factor was the use of data augmentation on test (i). We performed it, as usual, by taking the test image, generating a number (in our case, 50) of augmented samples exactly like in training, collecting the prediction for each of the samples, and pooling the decisions (in our case, by taking the average prediction). Although not surprising for the literature of deep learning, that finding is relevant for the literature of melanoma detection, where many works still forgo augmentation in the test.
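The test-time procedure just described can be sketched as below. The sketch assumes two callables supplied by the user: `augment(image, rng)`, which returns one randomly distorted copy of the image, and `predict(image)`, which returns a vector of class probabilities. Both are placeholders for the real augmentation engine and network, and the toy versions in the example exist only to make the code runnable.

```python
import random

def predict_with_test_augmentation(image, predict, augment, n_samples=50, seed=0):
    """Average the class-probability vectors predicted for n_samples
    random augmentations of the same test image."""
    rng = random.Random(seed)
    totals = None
    for _ in range(n_samples):
        probs = predict(augment(image, rng))
        if totals is None:
            totals = list(probs)
        else:
            totals = [t + p for t, p in zip(totals, probs)]
    return [t / n_samples for t in totals]

# Toy stand-ins (assumptions): an "image" is a list of pixel values,
# augmentation adds jitter, and the "network" thresholds the mean intensity.
def toy_augment(image, rng):
    return [pixel + rng.uniform(-0.1, 0.1) for pixel in image]

def toy_predict(image):
    p = min(1.0, max(0.0, sum(image) / len(image)))
    return [p, 1.0 - p]

averaged = predict_with_test_augmentation([0.6, 0.7, 0.8], toy_predict, toy_augment)
print(averaged)
```

Averaging 50 predictions multiplies the inference cost by 50 for each test image, which is the price paid for the gain reported in Table III.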
Most of the findings tended to confirm the (limited) observations we made during the ISIC Challenge 2017, with two notable exceptions. Input resolution (c), which we deemed unimportant during the challenge, turned out to have a non-negligible effect. That result is particularly interesting, because we used a very rough form of augmented resolution, by inputting high-resolution images to the augmentation engine, but still feeding normal-resolution crops to the network. On the other hand, the use of an SVM decision layer (h), which we considered advantageous during the Challenge, turned out to have a large effect, only a negative one! Globally, ANOVA shows it is better not to use the SVM.
Normalization (e) and training duration (g) showed tiny (below 1%), but still significant positive effects. The choice for those factors must consider their very different costs: adding normalization costs next to nothing, both in implementation complexity and in training time, while the full training duration doubled the already many-hours-long training times.
As usual, most of the interactions were not significant, and even the ones that were had effect sizes too small to be worth noting. A notable exception was the interaction between model architecture and train dataset (a:b), whose 8% of relative explanation was bigger than most main effects. Model choice alone favors the simpler ResNet over Inception, but the combination of Inception with the full dataset is so advantageous that it offsets that effect. We had already observed, informally, this synergy between more data and deeper models during the Challenge.
The most disappointing result was the use of segmentation, which was not merely unhelpful but harmful. This result, however, is contingent on our choice of how to add segmentation to classification.
We performed an additional correlation analysis with the full factorial experiment (Fig. 1), to highlight the correlations (a) among results on different test datasets; and (b) among different metrics. To keep the scatter plots directly interpretable, instead of taking the logit of the rates, we dealt with the nonlinearity by using Spearman’s $\rho$ instead of Pearson’s $r$ as the correlation measure.
The correlogram in Fig. 1(a) considers, as the ANOVA, the mean melanoma/keratosis AUC. The test dataset names appear in the diagonal, along with the maximum and minimum AUCs obtained for the 512 variations of the full design on that dataset. The scatter plots in the upper-triangular matrix follow the usual construction for correlograms. The lower-triangular matrix displays the Spearman’s $\rho$ values: the mean estimate appears as the printed numeral and as the area of the solid circle; the bounds of the 95% confidence interval appear as the area of the internal and external dashed circles. Negative correlations appear in red.
The correlation between different test datasets is far from perfect. That is, perhaps, obvious, but must be stressed, since it reveals that naively hyperoptimizing a model on one test set will not necessarily generalize to other data. The relationship between splits of different datasets is more subtle. Note how the validation and test splits of the ISIC 2017 Challenge, and the dermoscopic and clinical splits of EDRA, have the highest correlations. This suggests that results measured on splits of the same dataset may not wholly generalize over data of the same type obtained under different conditions. Both phenomena show how hyperoptimizing on test gives unwarranted advantages, leading to overoptimistic assessments.
The correlogram in Fig. 1(b) considers only the results for the test split of the ISIC Challenge. Different metrics appear in the diagonal: average precision, area under the ROC curve, sensitivity (true positive rate), and specificity (true negative rate), for both melanoma and keratosis. The interpretation of the plots, numerals, circles, and colors is the same as above.
This correlogram is interesting for showing that many metrics are only modestly correlated. Particularly noteworthy is the specificity, which has not only a negative correlation with sensitivity (as expected), but also a negative or very small correlation with most of the other metrics.
(Figure caption) The violin plots show the kernel density estimate of the actual data (black dots); the large red dot shows their mean.
We ran a second full factorial design, with seven of the nine design factors of the main experiment (a–e, g, i) and the five test datasets (j), fixing factors (f) and (h), and adding a factor to evaluate the presence versus absence of transfer learning (factor t). The new factorial design, with $2^8 \times 5 = 1280$ treatments, shows transfer learning as critical for performance: it explains (favorably) 14.7% of the absolute variation, and a whopping 62.8% of the relative variation of performance (computing those metrics the same way as in the main experiment, i.e., excluding the residuals, and the choice of test dataset and its interactions from the relative variation), with high significance (p-value below 0.001). We omit the ANOVA table for concision. Those results reinforce previous findings on the importance of transfer learning for melanoma detection [13].
As mentioned, full factorial designs are way too expensive for the majority of situations. The most common procedure is the exact opposite: taking a single factor to optimize, and performing a couple of experiments on that factor alone, keeping all others fixed (starting from a combination considered reasonable). Once a factor is decided, one commits to it and takes the next to optimize, until the procedure is complete.
We evaluate the impact of such a sequential procedure, simulating it using the measurements on the full design. We take, at random, both the starting treatment and the sequence of factors to test. For factors not yet optimized, the level is given by the starting treatment. Each factor is optimized in turn, by comparing the performance of the alternative treatments on the full-factorial data of a chosen hyperoptimization dataset. The outcome of a single simulation is the performance of the optimized treatment on a chosen measurement dataset. We use the mean keratosis/melanoma AUC as the performance metric.
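A minimal sketch of one such simulated run is given below. It assumes the full-factorial measurements are available as dictionaries mapping each treatment (a tuple of 0/1 levels, one per factor) to its AUC on a given dataset. The function name and the synthetic tables in the example are hypothetical; the toy AUCs are random numbers, not our measurements.

```python
import itertools
import random

def sequential_optimization(auc_hyperopt, auc_measure, n_factors, rng):
    """One simulated run: greedy one-factor-at-a-time search on auc_hyperopt,
    scored on auc_measure. Both map a tuple of 0/1 levels to an AUC."""
    current = tuple(rng.randint(0, 1) for _ in range(n_factors))  # random start
    order = list(range(n_factors))
    rng.shuffle(order)                                            # random factor order
    for f in order:
        flipped = current[:f] + (1 - current[f],) + current[f + 1:]
        if auc_hyperopt[flipped] > auc_hyperopt[current]:
            current = flipped  # commit to the better level and move on
    return auc_measure[current]

# Synthetic full-factorial tables for 4 two-level factors (illustrative only).
rng = random.Random(0)
treatments = list(itertools.product((0, 1), repeat=4))
table_a = {t: 0.80 + 0.02 * sum(t) + rng.gauss(0, 0.01) for t in treatments}
table_b = {t: 0.75 + 0.02 * sum(t) + rng.gauss(0, 0.01) for t in treatments}

same = [sequential_optimization(table_a, table_a, 4, rng) for _ in range(100)]
cross = [sequential_optimization(table_a, table_b, 4, rng) for _ in range(100)]
print(sum(same) / len(same), sum(cross) / len(cross))
```

Each factor requires comparing two treatments, so a run over nine two-level factors consults 18 entries of the table, in line with the 18 experiments mentioned in Section III.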
Fig. 2 shows the results for pairs of hyperoptimization measurement datasets, where we perform 100 simulations for each pair. The actual measurements appear as black dots, and the violin plots show their estimated density, while the big red dot shows their mean. The most notable observation is the (unrealistic) advantage of hyperoptimizing and measuring on the same dataset: not only do we get higher averages, but also a smaller variability. The advantage of hyperoptimizing and measuring on splits of the same dataset is more subtle, but present.
The expense of the full factorial design, the instability of the sequential procedure, and the limited correlation of performances across datasets seem to leave few options to practitioners. Fortunately, single-model schemes are seldom used today, and ensembles of several models help to alleviate those issues.
We simulated different ensemble strategies, by pooling the predictions of models present in our full design. We evaluate three pooling strategies: average, max, and extremal. Average- and max-pooling work as usual. Extremal pooling takes, from the list of values being pooled, the value most distant from 0.5; it may be seen as a “hypothesis-invariant” max-pooling. In all cases, after pooling, we renormalize the probability vector to ensure it sums up to one. Half the models in the full design entered as candidates, and we discarded in this experiment the models with the SVM layer, due to issues in making their probabilities commensurable with the deep-only models.
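The three pooling strategies can be sketched as below. This is an illustrative implementation under the assumption that each model outputs one probability per class and sample; the function name `pool_predictions` and the array layout are hypothetical.

```python
import numpy as np

def pool_predictions(probs, strategy="average"):
    """Pool class probabilities from several models.

    probs: array of shape (n_models, n_samples, n_classes) (assumed layout).
    Returns an array of shape (n_samples, n_classes) whose rows sum to one.
    """
    probs = np.asarray(probs, dtype=float)
    if strategy == "average":
        pooled = probs.mean(axis=0)
    elif strategy == "max":
        pooled = probs.max(axis=0)
    elif strategy == "extremal":
        # For each sample and class, keep the value most distant from 0.5.
        idx = np.abs(probs - 0.5).argmax(axis=0)
        pooled = np.take_along_axis(probs, idx[None, ...], axis=0)[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Renormalize so that each probability vector sums up to one.
    return pooled / pooled.sum(axis=1, keepdims=True)
```

Unlike max-pooling, extremal pooling can select a value close to zero, so a confident rejection by one model weighs as much as a confident detection.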
Fig. 3 shows the main results. Average-pooling was, by far, the best choice for pooling the decision. Such a clear-cut advantage came as a surprise for us, as max-pooling often outperforms average-pooling in related tasks. If no other information is available, simply average-pooling randomly selected models is a reasonable strategy.
The use of dozens, even hundreds, of models may sometimes be justified in critical tasks (like medical decisions), but training and evaluating so many deep networks is cumbersome. Fortunately, as Fig. 4 shows, a handful of models seem to work just as well. The results shown here are the “good news” part of this paper: we can escape the expense of the full factorial design, and the instability of the sequential designs, by averaging a dozen or so models with parameters chosen entirely at random. Although the random ensembles start very unstable, they soon converge to a reasonable model, on average and in variability. If we decide to perform a full factorial, there is good news too: the best models learned in one dataset seem to be informative to compose the ensembles in other datasets, allowing top performances with very small ensembles.
Here, again, the unfair advantage of optimizing (selecting the models for the ensemble) and measuring performance on the same dataset appears. The advantage is small but systematic for the test split of ISIC (panel (a)); it is much more apparent for the challenging collection of clinical images of EDRA Atlas (panel (b)).
As a final experiment, we made two simulations of “new” submissions to the ISIC 2017 Challenge. The first simulates a blind procedure, mimicking the conditions of the challenge (limited information about the validation split, no information whatsoever about the test split). In that simulation, we assume a full factorial experiment performed only on our internal validation split, and keep the 32 best models (as measured in that split). We then test 32 incremental ensembles on the ISIC validation set, finding that the ensembles with 15 or 16 models have the best performance. We commit to the ensemble with 15 models and do one evaluation of that ensemble on the ISIC test split, finding AUCs of 0.895 (melanoma), 0.967 (keratosis), and 0.931 (combined). To put those numbers in perspective, in the actual competition the best AUCs were, respectively, 0.874, 0.965, and 0.911 (obtained by different participants).
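The selection step of the blind procedure can be sketched as follows. This is a simplified illustration, not our exact pipeline: it assumes binary scores for a single class, average-pooling for the ensembles, and a pairwise AUC computed directly with NumPy; the names `auc` and `choose_ensemble_size` are hypothetical.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve as the fraction of correctly ranked
    (positive, negative) pairs; ties count as half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def choose_ensemble_size(internal_auc, val_scores, val_labels, keep=32):
    """Rank models on the internal validation split, then pick the best
    incremental ensemble on a separate validation set.

    internal_auc: (n_models,) AUC of each model on the internal split.
    val_scores:   (n_models, n_val) scores of each model on the validation set.
    """
    val_scores = np.asarray(val_scores, dtype=float)
    # Keep the `keep` best models, best first.
    ranked = np.argsort(internal_auc)[::-1][:keep]
    # Running mean over the ranked models: row k-1 is the top-k ensemble.
    sizes = np.arange(1, len(ranked) + 1)
    ensembles = np.cumsum(val_scores[ranked], axis=0) / sizes[:, None]
    aucs = [auc(val_labels, e) for e in ensembles]
    # On ties, argmax keeps the smallest ensemble.
    best = int(np.argmax(aucs)) + 1
    return ranked[:best], aucs
```

The chosen models are then evaluated exactly once on the test split, for instance with `auc(test_labels, test_scores[chosen].mean(axis=0))`, where `test_labels` and `test_scores` are placeholders for the withheld data.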
We also simulated a privileged procedure, hyperoptimizing without restraint on the ISIC Challenge test split itself. We use the full design on the ISIC test split to select the 32 best models, and then test 32 incremental ensembles, hand-picking the best result for melanoma and for keratosis. We find AUCs of 0.916 (melanoma), 0.970 (keratosis), and 0.943 (combined). Again, to get a better grasp of the difference, consider that the 2 p.p. increase in the melanoma AUC is larger than the difference between the 1st and the 4th teams ranked in the 2017 Challenge.
V Discussion
When one contrasts the blind and the privileged procedures side by side, the problems with the latter become apparent, but the privileged protocol is the most common in machine-learning literature (we are often guilty ourselves). Hyperoptimizing on test does not result from researchers’ desire to cheat, but from their natural tendency to exploit scarce existing data as much as possible. Salzberg has warned researchers about the dangers of “repeated tuning” since 1997; his work [30] is often cited, but the issue is still far from solved. Avoiding that pitfall requires a very strong commitment, which researchers seem unable to keep. That only reinforces the importance of regular curated challenges, like the ISIC Challenge, in which the test set is withheld at least until the evaluations are over. The ImageNet competition is perhaps the best example of the extraordinary impact of having such a curated competition every year.
Our findings explain, in part, why performances observed in practice fall far short of the numbers we get in our labs. In a single round of experiments, analyzing only two levels for nine factors, the unwarranted advantage of hyperoptimizing on test is already notable. In actual research, with many rounds of experiments over dozens of factors and hundreds of levels, the gap may be much wider.
Our main evaluation of 2560 different models shows that almost half the variability in performance is explained by the amount of data alone. That reinforces the deep-learning creed on the “unreasonable power of data” and has important consequences for the melanoma detection community: in order to move research forward, we need to curate larger shared datasets. The ISIC Archive is an essential step in that direction, but it would have to grow almost tenfold to match the largest (private) dataset reported in the literature [12].
Despite the predominance of data, other factors appeared relevant. The most noteworthy is, perhaps, the use of data augmentation on test samples. The use of deeper models in combination with extra data also appeared as an important advantage. Increased resolution, even in a very limited scheme, was also advantageous.
Notable negative results were the use of segmentation to help classification, and the use of an extra SVM layer. The negative result on segmentation is the most surprising, and needs further exploration, since works in the literature report many different ways to incorporate segmentation information into classification, often with improved performances. We would like to explore that theme in a future work, exhaustive in that particular scope.
The limited correlation between different possible metrics (in particular with specificity) has consequences for the melanoma detection literature: stressing one metric over another may lead the community in different directions.
The ensemble experiments bring the most encouraging news, by providing reliable performance, without either the expense of the full factorial design, or the instability of the traditional sequential optimization. Even a simple accumulation of enough randomly sampled models was sufficient to provide adequate performance. Learning the best models on one dataset, however, helps to select the best ensembles for other datasets. Ensembles appear as a promising avenue for future explorations. In future works, we would like to design and evaluate techniques more sophisticated than pooling the decisions of the models, like model stacking or boosting.
Acknowledgment
E. Valle and M. Fornaciali are partially funded by Google Research Awards for Latin America 2016 & 2017; E. Valle is also partially funded by a CNPq PQ-2 grant (311486/2014-2). A. Menegola is funded by CNPq. This project is partially funded by CNPq Universal grant (424958/2016-3). RECOD Lab. is partially supported by diverse projects and grants from FAPESP, CNPq, and CAPES. We gratefully acknowledge the donation of a Tesla K40 and a TITAN X GPUs by NVIDIA Corporation, used in this work. We thank Fabio Perez, Micael Carvalho, and Fillipe D. M. de Souza, for the final revision. We thank Prof. M. Emre Celebi for kindly providing the machine-readable metadata of the EDRA Interactive Atlas of Dermoscopy and for the help with the final revision of the first draft.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [2] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al., “Skin Lesion Analysis Toward Melanoma Detection: A Challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), Hosted by the International Skin Imaging Collaboration (ISIC),” arXiv preprint arXiv:1710.05006, 2017.
 [3] A. Menegola, J. Tavares, M. Fornaciali, L. T. Li, S. Avila, and E. Valle, “RECOD Titans at ISIC Challenge 2017,” arXiv preprint arXiv:1703.04819, 2017.
 [4] M. E. Celebi, H. A. Kingravi, B. Uddin, H. Iyatomi, Y. A. Aslandogan, W. V. Stoecker, and R. H. Moss, “A methodological approach to the classification of dermoscopy images,” Computerized Medical Imaging and Graphics, vol. 31, no. 6, pp. 362–373, 2007.
 [5] H. Iyatomi, H. Oka, M. E. Celebi, M. Hashimoto, M. Hagiwara, M. Tanaka, and K. Ogawa, “An improved internet-based melanoma screening system with dermatologist-like tumor area extraction algorithm,” Computerized Medical Imaging and Graphics, vol. 32, no. 7, pp. 566–579, 2008.
 [6] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
 [7] M. Fornaciali, M. Carvalho, F. V. Bittencourt, S. Avila, and E. Valle, “Towards automated melanoma screening: Proper computer vision & reliable results,” arXiv preprint arXiv:1604.04024, 2016.
 [8] S. Sabbaghi, M. Aldeen, and R. Garnavi, “A deep bag-of-features model for the classification of melanomas in dermoscopy images,” in IEEE Engineering in Medicine and Biology Society, 2016, pp. 1369–1372.
 [9] E. Nasr-Esfahani, S. Samavi, N. Karimi, S. M. R. Soroushmehr, M. H. Jafari, K. Ward, and K. Najarian, “Melanoma detection by analysis of clinical images using convolutional neural network,” in IEEE Engineering in Medicine and Biology Society, 2016, pp. 1373–1376.
 [10] X. Jia and L. Shen, “Skin lesion classification using class activation map,” arXiv preprint arXiv:1703.01053, 2017.
 [11] L. Yu, H. Chen, Q. Dou, J. Qin, and P.-A. Heng, “Automated melanoma recognition in dermoscopy images via very deep residual networks,” IEEE Transactions on Medical Imaging, vol. 36, no. 4, pp. 994–1004, 2017.
 [12] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
 [13] A. Menegola, M. Fornaciali, R. Pires, F. V. Bittencourt, S. Avila, and E. Valle, “Knowledge transfer for melanoma screening with deep learning,” in IEEE International Symposium on Biomedical Imaging, 2017, pp. 297–300.
 [14] A. R. Lopez, X. Giró-i-Nieto, J. Burdick, and O. Marques, “Skin lesion classification from dermoscopic images using deep learning techniques,” in IEEE International Conference on Biomedical Engineering (BioMed), 2017, pp. 49–54.
 [15] B. Harangi, “Skin lesion detection based on an ensemble of deep convolutional neural network,” arXiv preprint arXiv:1705.03360, 2017.
 [16] N. C. Codella, Q.-B. Nguyen, S. Pankanti, D. Gutman, B. Helba, A. Halpern, and J. R. Smith, “Deep learning ensembles for melanoma recognition in dermoscopy images,” IBM Journal of Research and Development, vol. 61, no. 4, pp. 5:1–5:15, 2017.
 [17] Z. Ge, S. Demyanov, B. Bozorgtabar, M. Abedini, R. Chakravorty, A. Bowling, and R. Garnavi, “Exploiting local and generic features for accurate skin lesions classification using clinical and dermoscopy imaging,” in IEEE International Symposium on Biomedical Imaging (ISBI), 2017, pp. 986–990.
 [18] X. Yang, Z. Zeng, S. Y. Yeo, C. Tan, H. L. Tey, and Y. Su, “A novel multi-task deep learning model for skin lesion segmentation and classification,” arXiv preprint arXiv:1703.01025, 2017.
 [19] C. N. Vasconcelos and B. N. Vasconcelos, “Increasing deep learning melanoma classification by classical and expert knowledge based image transforms,” arXiv preprint arXiv:1702.07025, 2017.
 [20] K. Matsunaga, A. Hamada, A. Minagawa, and H. Koga, “Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble,” arXiv preprint arXiv:1703.03108, 2017.
 [21] L. Bi, J. Kim, E. Ahn, and D. Feng, “Automatic skin lesion analysis using large-scale dermoscopy images and deep residual networks,” arXiv preprint arXiv:1703.04197, 2017.
 [22] T. DeVries and D. Ramachandram, “Skin lesion classification using deep multi-scale convolutional neural networks,” arXiv preprint arXiv:1703.01402, 2017.
 [23] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision, 2016, pp. 630–645.
 [24] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” AAAI Conference on Artificial Intelligence, pp. 4278–4284, 2017.
 [25] I. G. Díaz, “Incorporating the knowledge of dermatologists to convolutional neural networks for the diagnosis of skin lesions,” arXiv preprint arXiv:1703.01976, 2017.
 [26] Q. Abbas, M. Emre Celebi, I. F. Garcia, and W. Ahmad, “Melanoma recognition framework based on expert definition of abcd for dermoscopic images,” Skin Research and Technology, vol. 19, no. 1, 2013.
 [27] P. Wighton, T. K. Lee, H. Lui, D. I. McLean, and M. S. Atkins, “Generalizing common tasks in automated skin lesion diagnosis,” IEEE Transactions on Information Technology in Biomedicine, vol. 15, no. 4, pp. 622–629, 2011.
 [28] T. Yoshida, M. E. Celebi, G. Schaefer, and H. Iyatomi, “Simple and effective preprocessing for automated melanoma discrimination based on cytological findings,” in IEEE International Conference on Big Data, 2016, pp. 3439–3442.
 [29] J. Kawahara, A. BenTaieb, and G. Hamarneh, “Deep features to classify skin lesions,” in IEEE International Symposium on Biomedical Imaging, 2016, pp. 1397–1400.
 [30] S. L. Salzberg, “On comparing classifiers: Pitfalls to avoid and a recommended approach,” Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 317–328, 1997.
 [31] M. Fornaciali, S. Avila, M. Carvalho, and E. Valle, “Statistical learning approach for robust melanoma screening,” in Conference on Graphics, Patterns and Images (SIBGRAPI), 2014, pp. 319–326.
 [32] M. Carvalho, “Transfer schemes for deep learning in image classification,” Master’s thesis, University of Campinas, 2015.
 [33] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1521–1528.
 [34] L. Ballerini, R. B. Fisher, B. Aldridge, and J. Rees, “A color and texture based hierarchical KNN approach to the classification of non-melanoma skin lesions,” in Color Medical Image Analysis, 2013, pp. 63–86.
 [35] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. Marcal, and J. Rozeira, “PH2 - A dermoscopic image database for research and benchmarking,” in IEEE Engineering in Medicine and Biology Society, 2013, pp. 5437–5440.
 [36] G. Argenziano, H. P. Soyer, V. De Giorgi, D. Piccolo, P. Carli, M. Delfino et al., “Dermoscopy: a tutorial,” EDRA, Medical Publishing & New Media, 2002.
 [37] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer Assisted Intervention, 2015, pp. 234–241.
 [38] N. Codella, J. Cai, M. Abedini, R. Garnavi, A. Halpern, and J. R. Smith, “Deep learning, sparse coding, and svm for melanoma recognition in dermoscopy images,” in International Workshop on Machine Learning in Medical Imaging, 2015, pp. 118–126.