Code for the ISIC2018 Lesion Diagnosis Challenge
In this paper we present the methods of our submission to the ISIC 2018 challenge for skin lesion diagnosis (Task 3). The dataset consists of 10000 images with seven image-level classes to be distinguished by an automated algorithm. We employ an ensemble of convolutional neural networks for this task. In particular, we fine-tune pretrained state-of-the-art deep learning models such as Densenet, SENet and ResNeXt. We identify heavy class imbalance as a key problem for this challenge and consider multiple balancing approaches such as loss weighting and balanced batch sampling. Another important feature of our pipeline is the use of a vast amount of unscaled crops for evaluation. Last, we consider meta learning approaches for the final predictions. Our team placed second at the challenge while being the best approach using only publicly available data.
Deep learning and in particular convolutional neural networks (CNNs) have become the standard approach for automated diagnosis based on medical images. For the problem of skin lesion diagnosis, a new dataset has recently been made available to the public. The dataset consists of 10000 dermoscopic images showing skin lesions which have been diagnosed based on expert consensus, serial imaging or histopathology. Using this dataset, the challenge "ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection" has been proposed. We participate in this challenge with an automated method that relies on multiple state-of-the-art CNNs, heavy data augmentation, loss weighting, extensive unscaled cropping and meta learning. In the following, we describe the details of our approach. Our code is publicly available at https://github.com/ngessert/isic2018.
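The key metric of this challenge is a weighted accuracy (WACC) across the seven classes, which is equivalent to the average per-class recall (sensitivity). Hence, the metric is defined as:

$$\mathrm{WACC} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FN_i}$$

where $C$ is the number of classes, $TP_i$ is the number of true positives and $FN_i$ is the number of false negatives for class $i$.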
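As a minimal sketch of this metric, the following computes the mean of the per-class recalls from two label lists. The function name and the toy labels are illustrative assumptions and are not taken from the challenge code.

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy example with three classes (labels are illustrative only).
y_true = ["NV", "NV", "NV", "NV", "MEL", "MEL", "DF"]
y_pred = ["NV", "NV", "NV", "MEL", "MEL", "NV", "DF"]

wacc = weighted_accuracy(y_true, y_pred)  # recalls: NV 3/4, MEL 1/2, DF 1/1
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the single DF sample counts as much as all four NV samples, which is why the metric reacts strongly to errors on rare classes.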
The baseline dataset is the HAM10000 dataset introduced by Tschandl et al. In the following, we refer to this dataset as HAM. In addition, we used the public ISIC dataset, which comprises roughly 13500 images. In the following, we refer to this dataset as ISIC. We checked all images for potential overlap between HAM and ISIC.
The most obvious problem of these datasets is heavy class imbalance. Table I shows the class distribution of HAM and ISIC. Therefore, we consider countering class imbalance a key challenge to be addressed. We also expect ISIC to be of limited use, as it is even more imbalanced than HAM. For this reason, we only consider ISIC for a few high-performing models in our final ensemble.
For internal validation, we split HAM into five sets with equal (imbalanced) class distribution for 5-fold cross-validation. All images were separated based on lesion affiliation, i.e., we made sure that images from the same lesion cannot occur both in a training and validation split. The information for this separation was provided by the organizers. We add the entire ISIC dataset to each training subset, if it is used.
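The lesion-wise split can be sketched as follows. This is a simplified illustration under assumptions, not the procedure from our code base: it assumes that each lesion has exactly one label, and it distributes the lesions of each class round-robin over the folds so that the class distribution stays roughly equal. The function name and the toy samples are hypothetical.

```python
import random
from collections import defaultdict

def lesion_folds(samples, n_folds=5, seed=0):
    """samples: list of (image_id, lesion_id, label) tuples.
    Returns a list of folds, each a list of image ids. All images of one
    lesion end up in the same fold."""
    images_of = defaultdict(list)   # lesion_id -> image ids
    label_of = {}                   # lesion_id -> class label
    for image_id, lesion_id, label in samples:
        images_of[lesion_id].append(image_id)
        label_of[lesion_id] = label

    lesions_by_class = defaultdict(list)
    for lesion_id, label in label_of.items():
        lesions_by_class[label].append(lesion_id)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(lesions_by_class):
        lesion_ids = sorted(lesions_by_class[label])
        rng.shuffle(lesion_ids)
        # Round-robin assignment keeps the per-class share similar per fold.
        for k, lesion_id in enumerate(lesion_ids):
            folds[k % n_folds].extend(images_of[lesion_id])
    return folds

# Toy data: 20 lesions with two images each, every fourth lesion is "MEL".
samples = [(f"img{i}_{j}", f"les{i}", "NV" if i % 4 else "MEL")
           for i in range(20) for j in range(2)]
folds = lesion_folds(samples)
```

Fold k then serves as the validation set while the remaining folds form the training set.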
For HAM, we kept the original image size. Note that histogram equalization was applied to selected images by the dataset publishers. For ISIC, we resized all images to the size of HAM using bicubic interpolation.
During training, we applied online data augmentation to each image. First, we applied random cropping with a fixed crop size. Note that we did not use any scaling or aspect-ratio changes, which are typically used. Then, we randomly flipped images along both dimensions. Furthermore, we distorted the images with random changes in brightness and saturation. Last, we subtracted the per-channel training set mean from the images.
We tested random rotations, scaling, contrast, hue and aspect-ratio changes without an improvement in performance.
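The geometric part of this augmentation, i.e., an unscaled random crop followed by random flips along both dimensions, can be sketched with plain Python lists. The crop size and flip probability below are placeholder assumptions, and the brightness and saturation changes as well as the mean subtraction are omitted.

```python
import random

def random_crop_and_flip(image, crop_h, crop_w, flip_prob, rng):
    """image: list of rows, each row a list of pixels.
    Returns an unscaled crop of size crop_h x crop_w, randomly flipped."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive
    left = rng.randint(0, width - crop_w)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < flip_prob:             # flip along the vertical axis
        crop = crop[::-1]
    if rng.random() < flip_prob:             # flip along the horizontal axis
        crop = [row[::-1] for row in crop]
    return crop

# Toy 6x8 "image" whose pixel values encode their position.
rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(6)]
crop = random_crop_and_flip(image, crop_h=4, crop_w=5, flip_prob=0.5, rng=rng)
```

Because no resizing takes place, each crop shows the lesion at its original resolution.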
Overall, we rely on an ensemble for our final submission, an approach which has been successful in most challenges. In terms of model choice, we first assessed the value of using pretrained architectures. We found that fine-tuning a model trained on ImageNet performed significantly better than training from scratch. Therefore, we chose to build our ensemble from pretrained models available to the public. We use the popular frameworks Tensorflow and PyTorch. In particular, we use the Tensorflow Slim model library and the PyTorch pretrained models library.
The models to be included are selected based on 5-fold cross-validation performance. We tested the Inception [4, 10] and ResNet [11, 12] variants as a baseline first. These included InceptionV3, InceptionResNetV2, ResNet50-V1, ResNet50-V2 (post norm structure) and ResNet101-V2 (post norm structure). Next, we considered more recent architectures, which included PolyNet, ResNeXt, Densenet, SENets and DualPathNet. Compared to the baseline models, the more recent architectures performed better, which is why we quickly excluded the baseline models.
Thus, the models that are included in the final ensemble are variants of Densenet, ResNeXt, PolyNet and SENets. For most of our hyperparameter searches we used Densenet121 and assumed that the choices translate well to the other architectures.
We found the training strategy to be crucial as well. We identified the most relevant hyperparameters to be the starting learning rate, the learning rate schedule and the choice of potential early stopping.
The latter is especially important for training the models for the final submission. Usually, the final submission model should be trained on all available data with a fixed learning rate schedule. During cross-validation, we performed early stopping, i.e., we saved a checkpoint of the model whenever the best WACC so far was achieved during training. We compared this checkpoint to the last checkpoint and noticed a significant difference in the WACC metric. In general, this implies that our chosen learning rate schedule is not optimal. However, we found it difficult to obtain an optimal learning rate schedule, and the difference between the best and the last model remained large. We did not observe a difference of this magnitude for normal accuracy or the area-under-curve (AUC) metric. We suspect that the WACC is very sensitive to changes in the classes with a small number of examples. Overall, this implies that our final model for submission could be trained using a validation set with early stopping instead of training a model on the entire available dataset. As this leads to suboptimal exploitation of the available data, we included both models from cross-validation with early stopping (5 models) and models trained on the entire dataset (1 model). The fully trained models are weighted higher than individual CV models in order to achieve a reasonable balance.
We compared Adam, Nadam (Adam + Nesterov momentum) and RMSprop as optimizers and noticed only slight differences in performance, with Adam generally performing best. We chose Adam for all models. In terms of learning rate schedule, we follow a stepwise approach: starting from a fixed initial learning rate, we first reduced it by a constant factor after an initial number of epochs and then continued reducing it at regular epoch intervals. We stopped optimization after a fixed number of epochs. In between, we saved the model with the best WACC on the validation set, based on evaluations at regular epoch intervals. We used a fixed standard batch size.
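A stepwise schedule of this kind, together with the selection of the best validation checkpoint, might look like the sketch below. All numeric values (initial learning rate, epochs until the first reduction, reduction interval, reduction factor and the validation scores) are hypothetical placeholders and not the values used for our submission.

```python
def step_learning_rate(epoch, base_lr, first_drop, drop_every, factor):
    """Learning rate at a given epoch: constant until `first_drop`,
    then multiplied by `factor` once and again every `drop_every` epochs."""
    if epoch < first_drop:
        return base_lr
    n_drops = 1 + (epoch - first_drop) // drop_every
    return base_lr * factor ** n_drops

def best_checkpoint(val_wacc_by_epoch):
    """Return (epoch, wacc) of the evaluation with the highest validation WACC."""
    return max(val_wacc_by_epoch.items(), key=lambda item: item[1])

# Hypothetical schedule parameters, for illustration only.
schedule = dict(base_lr=1e-3, first_drop=50, drop_every=25, factor=0.5)
lrs = [step_learning_rate(e, **schedule) for e in (0, 49, 50, 74, 75)]

# Hypothetical validation WACC values recorded at three evaluations.
val_wacc = {10: 0.70, 20: 0.78, 30: 0.75}
best_epoch, best_wacc = best_checkpoint(val_wacc)
```

The gap between `best_wacc` and the score of the last evaluation corresponds to the difference between the early-stopping checkpoint and the final checkpoint discussed above.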
For the loss function we used a standard cross-entropy loss as a basis:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

where $y$ is the ground-truth label, $\hat{y}$ is the softmax-normalized model output and $C$ the number of classes. As the seven classes in the dataset are highly imbalanced, we considered different balancing approaches. In particular, we explored different ways of loss balancing and also sampling balanced batches. For loss balancing, we considered multiplying each class's loss by its inverse normalized frequency, i.e., the weighting term is defined as

$$w_i = \frac{N}{n_i}$$

where $N$ is the total number of samples and $n_i$ is the number of samples for class $i$. This method puts a very strong weight on the highly underrepresented classes DF and VASC. Moreover, we considered a less extreme weighting that additionally takes the total number of classes $C$ into account.

Last, we considered sampling balanced batches by oversampling the underrepresented classes during training. While all approaches improved results over no balancing at all, the differences were minor. We used the first loss balancing approach for all cases as it performed best for most models. It should be noted that we adjusted this method for use with HAM and ISIC combined. As ISIC is even more imbalanced than HAM, this balancing strategy would lead to unreasonably high loss amplification for underrepresented classes. Therefore, we derived the weighting from HAM only, both for training with HAM only and with HAM and ISIC.
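The first loss balancing approach, i.e., scaling each sample's cross-entropy by the total number of samples divided by the number of samples in its class, can be sketched as follows. The class counts are illustrative toy numbers and not the HAM class distribution. The function names are assumptions.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight per class: total number of samples / samples in that class."""
    total = sum(class_counts.values())
    return {label: total / count for label, count in class_counts.items()}

def weighted_cross_entropy(probs, label, weights):
    """probs: dict mapping class -> softmax probability for one sample.
    With a one-hot ground truth, only the true class contributes."""
    return -weights[label] * math.log(probs[label])

# Illustrative class counts (not taken from the dataset).
counts = {"NV": 800, "MEL": 150, "DF": 50}
weights = inverse_frequency_weights(counts)

probs = {"NV": 0.3, "MEL": 0.2, "DF": 0.5}
loss_df = weighted_cross_entropy(probs, "DF", weights)
loss_nv = weighted_cross_entropy(probs, "NV", weights)
```

With these toy counts, a misclassified DF sample is penalized 16 times more strongly than an NV sample with the same predicted probability, which illustrates why the weights should be derived from the less imbalanced dataset.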
Besides class balancing, we also tried to incorporate the meta information on the method of diagnosis. The organizers provided the information whether each lesion was diagnosed by "single image expert consensus", "serial imaging showed no change", "confocal microscopy with consensus dermoscopy" or "histopathology". Assuming that, within a given class, the means of diagnosis relates to the diagnostic difficulty, we tried to incorporate this meta information into the loss by increasing the loss for more difficult cases. Although this approach slightly increased performance for some models, it was inconsistent across models and we did not incorporate it into all models of our final ensemble. For example, for our reference model Densenet121, performance even decreased.
All training is performed on NVIDIA GeForce GTX 1080 Ti graphics cards. As some of the larger models have large memory requirements due to their feature map sizes, the graphics cards’ memory was insufficient for our standard batch size. For these cases, we scaled down the learning rate and batch size by the same factor.
Our evaluation strategy for the generation of the final predictions is shown in Figure 1. We made use of extensive multi-crop evaluation. The crops are unscaled and have the same size as the crops used for training. We evaluate multiple crops per model, which results in multiple predictions per image that need to be combined. For the models that do not have a validation set, we averaged the crop predictions. For the CV models, we incorporated a meta learning step. We constructed a flattened feature vector out of the crop predictions and used the results from the validation set for training an SVM. Then, we predicted the final label of the test set based on the CNN predictions. For this last model, we considered both random forests and SVMs with different kernels. We found SVMs with an RBF kernel to work best. Note that a similar meta learning strategy was also used by one of last year's challenge winners.
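The two ways of combining crop predictions can be illustrated with the sketch below. It assumes that the crops lie on an evenly spaced grid over the image, which is an assumption of this example, and the image and crop sizes are hypothetical. Training the SVM on the flattened vectors is not shown.

```python
def grid_crop_origins(height, width, crop_h, crop_w, n_rows, n_cols):
    """Top-left corners of an evenly spaced grid of unscaled crops."""
    def positions(size, crop, n):
        if n == 1:
            return [(size - crop) // 2]
        return [round(i * (size - crop) / (n - 1)) for i in range(n)]
    return [(top, left)
            for top in positions(height, crop_h, n_rows)
            for left in positions(width, crop_w, n_cols)]

def average_predictions(crop_probs):
    """Mean class probabilities over all crops of one image."""
    n = len(crop_probs)
    return [sum(column) / n for column in zip(*crop_probs)]

def flatten_predictions(crop_probs):
    """Concatenate all crop predictions into one meta-learning feature vector."""
    return [p for probs in crop_probs for p in probs]

# Hypothetical sizes: a 450x600 image covered by a 6x6 grid of 224x224 crops.
origins = grid_crop_origins(450, 600, 224, 224, n_rows=6, n_cols=6)

# Toy predictions for two crops and two classes.
crop_probs = [[0.2, 0.8], [0.6, 0.4]]
mean_probs = average_predictions(crop_probs)
features = flatten_predictions(crop_probs)
```

Averaging discards which crop produced which prediction, whereas the flattened vector keeps the crop order, so a meta learner can in principle weight crop positions differently.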
As a last step, we combined the predictions from the CV models and the fully trained models by averaging over all models. Instead of averaging we also considered voting, i.e., we counted how many models predicted a certain class. We found that averaging generally performs slightly better.
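The difference between averaging and voting can be made concrete with a small sketch. The optional `weights` argument is one possible way to weight fully trained models higher than CV models; the function names and the toy probabilities are illustrative assumptions.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

def ensemble_average(model_probs, weights=None):
    """Weighted mean of the per-model class probabilities, then argmax."""
    weights = weights or [1.0] * len(model_probs)
    total = sum(weights)
    mean = [sum(w * p for w, p in zip(weights, column)) / total
            for column in zip(*model_probs)]
    return argmax(mean)

def ensemble_vote(model_probs):
    """Each model votes for its most likely class; the most frequent wins."""
    votes = Counter(argmax(probs) for probs in model_probs)
    return votes.most_common(1)[0][0]

# Toy predictions of three models for one image and three classes.
model_probs = [
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
    [0.90, 0.10, 0.00],
]
avg_class = ensemble_average(model_probs)
vote_class = ensemble_vote(model_probs)
```

In this toy case the two rules disagree: two models narrowly prefer class 1, but averaging follows the one model that is very confident about class 0. Averaging therefore uses the confidence information that voting discards.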
During training, we evaluated on the validation set using 16 crops in order to keep the computational effort at a reasonable level. For the final evaluation, we also tested larger numbers of crops. However, beyond 36 crops the improvement was negligible.
We performed model selection for our final ensemble based on the 5-fold CV performance of each architecture with multi-crop evaluation. For this selection, we simply averaged the predictions from all crops. Then, we searched for an optimal combination of our architectures by averaging the predictions of a subset. In theory, an exhaustive search over all possible architecture combinations could be performed. However, this search would involve millions of possible combinations, which takes a significant amount of time. Instead, we first ranked all architectures by their 5-fold CV performance. Then, we only considered combinations that include the best-ranked architectures. We noticed that this approach usually leads to an optimal combination that includes only a subset of the available architectures.
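The ranked search can be sketched as follows. Instead of testing every subset, only the top-k architectures are combined for increasing k, so the number of ensemble evaluations grows linearly with the number of architectures. The architecture names, the scores and the `evaluate_subset` callback are hypothetical stand-ins for the CV evaluation of an averaged ensemble.

```python
def select_top_k_ensemble(arch_scores, evaluate_subset):
    """arch_scores: dict mapping architecture -> single-model CV score.
    evaluate_subset: callable that returns the CV score of an ensemble
    built from a list of architectures.
    Returns the best top-k subset and its score."""
    ranked = sorted(arch_scores, key=arch_scores.get, reverse=True)
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate_subset(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Illustrative single-model scores for four placeholder architectures.
arch_scores = {"A": 0.80, "B": 0.79, "C": 0.77, "D": 0.70}

# Stand-in for the real ensemble evaluation: scores indexed by subset size.
toy_ensemble_score = {1: 0.80, 2: 0.83, 3: 0.85, 4: 0.84}

def evaluate_subset(subset):
    return toy_ensemble_score[len(subset)]

best_subset, best_score = select_top_k_ensemble(arch_scores, evaluate_subset)
```

In the toy example, adding the weakest architecture lowers the ensemble score, so the search stops being useful before all architectures are included.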
Table II. Evaluated configurations:
|Densenet121 with SVM|
|Densenet121 with ISIC|
|Densenet121 no pretraining|
|Densenet121 16-crop eval.|
|Densenet121 4-crop eval.|
|Densenet121 no weighting|
|Densenet121 batch balancing|
|Densenet121 diagnosis weighting|
We report some preliminary results for the key parts of our approach. The results are derived from 5-Fold CV as the official validation set is not supposed to provide an indication of the performance on the test set. Since we did not use a held-out test set, we do not follow the procedure shown in Figure 1 for the results reported in this section. Instead, we simply average the predictions from the 36 crops and use them as the final prediction on each fold. We summarize the mean accuracy, mean AUC and WACC in Table II for important architecture variations and an ensemble. In terms of models, we found that SENet performed best as a single model. Moreover, a large ensemble performs better than any single model approach. Our final ensemble contains 54 models with the following architectures: SENet154, ResNeXt101 32x4d, Densenet201, Densenet161, Densenet169, SE-Resnet101, PolyNet.
In this paper we propose an approach for automatic skin lesion diagnosis for the "ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection" challenge. We use a large ensemble of state-of-the-art CNN models. One of our key choices was to use full-sized images with unscaled, smaller crops for training in combination with extensive unscaled multi-crop evaluation. We combine the crops both by simple averaging and a meta learning strategy. This allows us to capture detailed, high-resolution features while also taking global context into account. Moreover, the HAM dataset is very challenging as it is highly unbalanced in terms of classes and the evaluation metric treats all classes equally. As this imbalance represents the real-world case where most examined lesions are benign, this is an important issue to be addressed. Therefore, we considered several balancing approaches, of which simple loss weighting with the inverse normalized class frequency performed best. We also considered incorporating the meta information on how difficult it was to diagnose the lesion. We used this information by additionally weighting the loss by a factor for each type of diagnosis. However, we observed inconsistent results across models. This indicates that our way of using the knowledge is not optimal. Also, the assumption that a more extensive diagnostic procedure indicates cases that are harder to learn is likely oversimplified. Finally, we constructed a large ensemble whose models were selected based on 5-fold CV performance for our final predictions. Regarding single model performance, it is notable that more recent architectures outperformed older standard architectures. Considering that many researchers still use plain ResNets or even VGG as a baseline, we suggest that it is reasonable to move to more recent architectures proposed for the natural image domain (ImageNet, etc.). With the overall goal of providing the best diagnosis, we see this as an important step.
For future work, our method could be refined with a more extensive, less intuition-driven hyperparameter search. Moreover, the combination of local features and global context could be incorporated into a single end-to-end trainable architecture.
This work was partially funded by the Forschungszentrum Medizintechnik Hamburg (02fmthh2017). Also, we gratefully acknowledge the support of this research by the NVIDIA Corporation under the GPU Grant Program.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” in AAAI, 2017, pp. 4278–4284.