Automated skin lesion classification is a challenging problem that is typically addressed using convolutional neural networks (CNNs). Recently, the ISIC 2018 Skin Lesion Analysis Towards Melanoma Detection challenge resulted in numerous high-performing methods that performed similarly to human experts in the evaluation of dermoscopic images. To improve diagnostic performance further, the ISIC 2019 challenge comes with several old and new problems to consider. In particular, the test set of the ISIC 2019 challenge contains an unknown class that is not present in the training dataset. Also, the severe class imbalance of real-world datasets is still a major point that needs to be addressed. Furthermore, the training dataset, previously HAM10000, was extended by additional data from the BCN_20000 and MSK datasets. The images have different resolutions and were created using different preprocessing and preparation protocols that need to be taken into account.
In this paper we describe our procedure for our participation in the two tasks of the ISIC 2019 Challenge. For task 1, skin lesions have to be classified based on dermoscopic images only. For task 2, dermoscopic images and additional patient meta data have to be used. We largely rely on established methods for skin lesion classification including loss balancing, heavy data augmentation, pretrained, state-of-the-art CNNs and extensive ensembling [9, 10]. We address data variability by applying a color constancy algorithm and a cropping algorithm to deal with raw, uncropped dermoscopy images. We deal with the unknown class in the test set with a data-driven approach by using external data. For task 2, we incorporate additional meta information into the model using a dense neural network which is fused with the CNN's features.
2 Materials and Methods
The main training dataset contains dermoscopic images, acquired at multiple sites and with different preprocessing methods applied beforehand. It contains images of the classes melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), vascular lesion (VASC) and squamous cell carcinoma (SCC). A part of the training dataset is the HAM10000 dataset, which contains images that were centered and cropped around the lesion. The dataset curators applied histogram corrections to some images. Another dataset, BCN_20000, contains images of uniform size. This dataset is particularly challenging as many images are uncropped and lesions in difficult and uncommon locations are present. Last, the MSK dataset contains images of various sizes.
The dataset also contains meta information about the patient’s age group (in steps of five years), the anatomical site (eight possible sites) and the sex (male/female). The meta data is partially incomplete, i.e., there are missing values for some images.
In addition, we make use of external data. We use the dermoscopic images from the 7-point dataset. Moreover, we use an in-house dataset of dermoscopic images. The in-house dataset also contains images that we use for the unknown class (UNK). We include images of healthy skin, angiomas, warts, cysts and other benign alterations. The key idea is to build a broad class of skin variations which should encourage the model to assign any image that is not part of the eight main classes to this ninth, broad pool of skin alterations. We also consider the three types of meta data for our external data, where available.
For internal evaluation we split the main training dataset into five folds. The dataset contains multiple images of the same lesions. Thus, we ensure that all images of the same lesion are in the same fold. We add all our external data to each of the training sets. Note that we do not include any of our images from the unknown class in our evaluation as we do not know whether they accurately represent the actual unknown class. Thus, all our models are trained to predict nine classes but we only evaluate on the eight known classes.
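The grouping constraint above (all images of one lesion must land in the same fold) can be implemented with a group-aware splitter. A minimal sketch using scikit-learn's GroupKFold with hypothetical lesion IDs, and two folds for brevity instead of the five described above:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical example: 10 images belonging to 4 distinct lesions.
image_ids = np.arange(10)
lesion_ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])

gkf = GroupKFold(n_splits=2)
for train_idx, val_idx in gkf.split(image_ids, groups=lesion_ids):
    # All images of one lesion end up on the same side of the split.
    assert set(lesion_ids[train_idx]).isdisjoint(lesion_ids[val_idx])
```

Splitting on lesion IDs rather than image IDs prevents near-duplicate images of the same lesion from leaking between training and validation folds.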
We use the mean sensitivity for our internal evaluation, which is defined as

S = \frac{1}{|C|} \sum_{c \in C} \frac{TP_c}{TP_c + FN_c},

where TP_c are the true positives and FN_c are the false negatives of class c, and |C| is the number of classes. The metric is also used for the final challenge ranking.
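As a sketch, the metric can be computed directly from predicted and true labels:

```python
import numpy as np

def mean_sensitivity(y_true, y_pred, num_classes):
    """Mean per-class sensitivity: the average of TP_c / (TP_c + FN_c)."""
    sens = []
    for c in range(num_classes):
        mask = y_true == c
        tp = np.sum(y_pred[mask] == c)  # true positives for class c
        fn = np.sum(mask) - tp          # false negatives for class c
        sens.append(tp / (tp + fn))
    return float(np.mean(sens))

# Toy example with three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
score = mean_sensitivity(y_true, y_pred, 3)  # (0.5 + 1.0 + 0.5) / 3
```

Because every class contributes equally regardless of its frequency, this metric (also known as balanced accuracy) is well suited to the imbalanced class distribution described above.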
Figure 1: Our cropping strategy for uncropped dermoscopy images. Left, the original image is shown. In the center, the binarized image with the circular fit is shown. Right, the cropped image is shown.
Image Preprocessing. As a first step, we use a cropping strategy to deal with the uncropped images, which often show large, black areas. We binarize the images with a very low threshold such that the entire dermoscopy field of view is set to one. Then, we find the center of mass and the major and minor axes of an ellipse that has the same second central moments as the inner area. Based on these values we derive a rectangular bounding box for cropping that covers the relevant field of view. The process is illustrated in Figure 1. We automatically determine the necessity for cropping based on a heuristic that tests whether the mean intensity inside the bounding box is substantially different from the mean intensity outside of it. Manual inspection showed that the method was robust. In both the training and the test set, a subset of images was automatically cropped. Next, we apply the Shades of Gray color constancy method with a Minkowski norm, following last year's winner. This is particularly important as the datasets used for training differ considerably. Furthermore, we resize the larger images in the datasets. We take the HAM10000 resolution as a reference and resize all images' longer side to this reference length while preserving the aspect ratio.
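The Shades of Gray step can be sketched as follows. The Minkowski norm value is not stated above, so p = 6, a common choice for this method, is assumed here:

```python
import numpy as np

def shades_of_gray(img, p=6):
    """Shades of Gray color constancy with Minkowski norm p (p=6 assumed).

    img: float array of shape (H, W, 3) with values in [0, 1].
    """
    # Estimate the illuminant per channel via the Minkowski p-norm mean.
    illum = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illum /= np.linalg.norm(illum)      # normalize illuminant direction
    # Scale each channel so the estimated illuminant becomes gray.
    img_cc = img / (illum * np.sqrt(3.0))
    return np.clip(img_cc, 0.0, 1.0)

# A uniformly reddish image becomes gray after correction.
img = np.ones((4, 4, 3)) * np.array([0.8, 0.4, 0.4])
out = shades_of_gray(img)
```

Normalizing the color cast per image in this way reduces the systematic color differences between the acquisition sites mentioned above.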
Meta Data Preprocessing.
For task 2, our approach is to use the meta data with a dense (fully-connected) neural network. Thus, we need to encode the data as a feature vector. For the anatomical site and sex, we chose a one-hot encoding. Thus, the anatomical site is represented by eight features where, for each lesion, one of those features is one and the others are zero. The same applies for sex. In case the value is missing, all features for that property are zero. For age, we use a normal, numerical encoding, i.e., age is represented by a single feature. This makes encoding missing values difficult, as the missingness should not carry any meaning (we assume that all values are missing at random). We encode a missing value with a constant that is also part of the training set's value range. To overcome the issue of missing value encoding, we also considered a one-hot encoding for the age groups. However, initial validation experiments showed slightly worse performance, which is why we continued with the numerical encoding.
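A minimal sketch of this encoding, with an illustrative list of anatomical sites and a placeholder for missing age (the exact site vocabulary and missing-age constant are assumptions, not taken from the text above):

```python
import numpy as np

# Illustrative site vocabulary (8 sites) and sex values.
SITES = ["anterior torso", "head/neck", "lateral torso", "lower extremity",
         "oral/genital", "palms/soles", "posterior torso", "upper extremity"]
SEXES = ["male", "female"]
MISSING_AGE = 0.0  # placeholder for missing age (assumption)

def one_hot(value, categories):
    vec = np.zeros(len(categories))
    if value in categories:
        vec[categories.index(value)] = 1.0
    return vec  # stays all-zero if the value is missing/unknown

def encode_meta(site=None, sex=None, age=None):
    age_feat = np.array([MISSING_AGE if age is None else float(age)])
    return np.concatenate([one_hot(site, SITES), one_hot(sex, SEXES), age_feat])

# 8 site features + 2 sex features + 1 age feature = 11-dimensional vector.
x = encode_meta(site="anterior torso", sex="female", age=55)
```

The all-zero one-hot for missing categoricals means "missing" needs no extra feature, while the numeric age channel requires the placeholder constant discussed above.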
2.3 Deep Learning Models
General Approach. For task 1, we employ various CNNs for classifying dermoscopic images. For task 2, our deep learning models consist of two parts, a CNN for dermoscopy images and a dense neural network for meta data. The approach is illustrated in Figure 2. As a first step, we train our CNNs on image data only (task 1). Then, we freeze the CNN's weights and attach the meta data neural network. In the second step, we only train the meta data network's weights and the classification layer. We describe CNN training first, followed by the meta data training.
CNN Architectures. We largely rely on EfficientNets (EN) 
that have been pretrained on the ImageNet dataset with the AutoAugment v0 policy. This model family contains eight structurally similar models that follow certain scaling rules for adjustment to larger image sizes. The smallest version, B0, uses the standard ImageNet input size. Larger versions, up to B7, use an increased input size while also scaling up network width (the number of feature maps per layer) and network depth (the number of layers). We employ EN B0 up to B6. For more variability in our final ensemble, we also include a SENet154 and two ResNeXt variants pretrained with weakly supervised learning (WSL) on millions of images.
CNN Data Augmentation.
Before feeding the images to the networks, we perform extensive data augmentation. We use random brightness and contrast changes, random flipping, random rotation, random scaling (with appropriate padding/cropping), and random shear. Furthermore, we use CutOut with a single hole. We also tried to apply the AutoAugment v0 policy; however, we did not observe better performance.
CNN Input Strategy. We follow different input strategies for training that transform the images from their original size after preprocessing to a suitable input size. First, we follow a same-sized cropping strategy which we already employed in last year's challenge. Here, we take a random crop of the network's input size from the preprocessed image. Second, we follow a random-resize strategy which is popular for ImageNet training. Here, the image is randomly resized and scaled when taking a crop from the preprocessed image.
CNN Training. We train all models with Adam. We use a weighted cross-entropy loss function where underrepresented classes receive a higher weight based on their frequency in the training set. The loss for each class is multiplied by a factor that grows with N/N_c, where N is the total number of training images, N_c is the number of images in class c, and an additional exponent controls the balancing severity; we tuned this parameter on our validation folds. We also tried the focal loss with the same balancing weights, without performance improvements. Batch size and learning rate are adapted based on the GPU memory requirements of each architecture. We halve the learning rate at regular intervals during training. We evaluate regularly and save the model achieving the best mean sensitivity (best). Also, we save the last model at the end of training (last). Training is performed on NVIDIA GTX 1080 Ti (B0-B4) and Titan RTX (B5, B6) graphics cards.
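The frequency-based weighting can be sketched as follows. The exact formula is not given above, so the common inverse-frequency form w_c = (N / N_c)^k is assumed here, with the exponent k controlling the balancing severity:

```python
import numpy as np

def class_weights(counts, k=1.0):
    """Frequency-based loss weights: rare classes receive larger weights.

    Assumed form: w_c = (N / N_c) ** k, where N is the total number of
    training images and N_c the per-class count. k=0 disables balancing,
    k=1 gives full inverse-frequency weighting.
    """
    counts = np.asarray(counts, dtype=float)
    n_total = counts.sum()
    return (n_total / counts) ** k

# Toy class distribution: one dominant class and two rare ones.
counts = [800, 150, 50]
w = class_weights(counts, k=1.0)  # rare classes weighted up
```

The resulting vector would typically be passed as the per-class weight of a cross-entropy loss, so that each class contributes roughly equally to the gradient despite the imbalance.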
Meta Data Architecture. For task 2, the meta data is fed into a two-layer neural network. Each layer contains batch normalization, a ReLU activation and dropout. The network's output is concatenated with the CNN's feature vector after global average pooling. Then, we apply another layer with batch normalization, ReLU and dropout. As a baseline we use a fixed number of neurons in this layer, which is scaled up for larger models using EfficientNet's scaling rules for network width. Then, the classification layer follows.
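A shape-level sketch of this two-path fusion in NumPy, with illustrative layer sizes (batch normalization and dropout are omitted; 1280 matches EfficientNet-B0's pooled feature dimension, the remaining sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

META_DIM, META_HIDDEN = 11, 256  # meta feature vector -> two-layer MLP
CNN_FEAT, FUSED = 1280, 256      # e.g. EfficientNet-B0 pooled features

W1 = rng.normal(size=(META_DIM, META_HIDDEN))
W2 = rng.normal(size=(META_HIDDEN, META_HIDDEN))
W3 = rng.normal(size=(CNN_FEAT + META_HIDDEN, FUSED))
W_cls = rng.normal(size=(FUSED, 9))  # nine output classes

def forward(cnn_features, meta):
    h = relu(relu(meta @ W1) @ W2)                        # two-layer meta net
    fused = relu(np.concatenate([cnn_features, h], axis=-1) @ W3)
    return fused @ W_cls                                  # class logits

logits = forward(rng.normal(size=(4, CNN_FEAT)), rng.normal(size=(4, META_DIM)))
```

The key design point visible here is that the meta path produces a fixed-size vector that is concatenated with the pooled CNN features before one shared fusion layer and the classifier.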
Meta Data Augmentation.
We use a simple data augmentation strategy to address the problem of missing values. During training, we randomly encode each property as missing with a fixed probability. We found this to be necessary as our images for the unknown class do not have any meta data. Thus, we need to ensure that our models do not associate missingness with this class.
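A minimal sketch of this meta data dropout, with an assumed probability since the exact value is not given above:

```python
import numpy as np

rng = np.random.default_rng(42)

def drop_meta(site_oh, sex_oh, age, p=0.1):
    """Randomly mark each meta data property as missing with probability p
    (p=0.1 is an assumption). Dropped categoricals become all-zero one-hots,
    matching the encoding used for genuinely missing values."""
    if rng.random() < p:
        site_oh = np.zeros_like(site_oh)
    if rng.random() < p:
        sex_oh = np.zeros_like(sex_oh)
    if rng.random() < p:
        age = np.zeros_like(age)  # missing-age placeholder
    return site_oh, sex_oh, age
```

Because each property is dropped independently, the model regularly sees missing values for all classes during training, not only for the unknown class.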
Meta Data Training. During meta data training, the CNN's weights remain fixed. We still employ our CNN data augmentation strategies described above, i.e., the CNN still performs forward passes during training and the CNN's features are not fixed for each image. The meta data layers, i.e., the two-layer network, the layer after concatenation and the classification layer, are trained for a small number of epochs.
Prediction. After training, we create multiple predictions per image, depending on the CNN's input strategy. For same-sized cropping, we take ordered crops from the preprocessed image and average the softmaxed predictions of all crops. For random-resize cropping, we perform predictions for each image with four differently scaled center crops and their flipped versions. For the meta data, we pass the same data through the network for each crop. Again, the softmaxed predictions are averaged.
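The crop-wise softmax averaging can be sketched as:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def average_crop_predictions(crop_logits):
    """Average the softmaxed predictions over all crops of one image.

    crop_logits: array of shape (num_crops, num_classes).
    """
    return softmax(crop_logits, axis=-1).mean(axis=0)

# Hypothetical logits for 4 crops of one image and 9 classes.
rng = np.random.default_rng(1)
probs = average_crop_predictions(rng.normal(size=(4, 9)))
```

Averaging probabilities rather than logits keeps every crop's contribution bounded, so a single over-confident crop cannot dominate the prediction.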
Ensembling. Finally, we create a large ensemble out of all our trained models. We use a strategy where we select the optimal subset of models based on cross-validation performance. Consider a set of configurations where each configuration uses different hyperparameters (e.g., same-sized cropping) and baseline architectures (e.g., EN B0). Each configuration contains five trained models (best), one for each cross-validation split. We obtain validation predictions for each configuration and each split. Then, we perform an exhaustive search over all subsets of configurations to find the subset whose averaged predictions maximize the mean sensitivity. We consider our top-performing configurations from the ISIC 2019 Challenge Task 1 in terms of CV performance in this search. We perform the search using only the best models found during training, but we also include the last models in the final ensemble for larger variability. Finally, we obtain predictions for the final test set using all models of all selected configurations.
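The exhaustive subset search can be sketched as follows, using plain accuracy as a stand-in metric and hypothetical configuration predictions:

```python
import numpy as np
from itertools import combinations

def best_subset(config_preds, y_true, score_fn):
    """Exhaustive search over all non-empty subsets of configurations for
    the subset whose averaged predictions maximize score_fn.

    config_preds: dict name -> (N, C) class-probability array.
    """
    names = list(config_preds)
    best = (None, -np.inf)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            avg = np.mean([config_preds[n] for n in subset], axis=0)
            score = score_fn(y_true, avg.argmax(axis=-1))
            if score > best[1]:
                best = (subset, score)
    return best

def accuracy(y_true, y_pred):  # stand-in metric for this sketch
    return float(np.mean(y_true == y_pred))

# Two toy configurations over 4 samples and 2 classes; each is 75% accurate
# alone, but their average corrects both individual mistakes.
y = np.array([0, 1, 0, 1])
preds = {
    "A": np.array([[0.9, 0.1], [0.2, 0.8], [0.9, 0.1], [0.7, 0.3]]),
    "B": np.array([[0.6, 0.4], [0.4, 0.6], [0.3, 0.7], [0.1, 0.9]]),
}
subset, score = best_subset(preds, y, accuracy)
```

The search is exponential in the number of configurations, which stays feasible here because only the top-performing configurations are considered.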
| Configuration | Sensitivity T1 | Sensitivity T2 |
|------------------|----------------|----------------|
| ResNext WSL 1 SS | | |
| ResNext WSL 2 SS | | |
| EN B0 SS | | |
| EN B0 SS | | |
| EN B0 RR | | |
| EN B1 SS | | |
| EN B1 RR | | |
| EN B2 SS | | |
| EN B2 RR | | |
| EN B3 SS | | |
| EN B3 RR | | |
| EN B4 SS | | |
| EN B4 RR | | |
| EN B5 SS | | |
| EN B5 RR | | |
| EN B6 SS | | |
Table 1: All cross-validation results for different configurations. We consider same-sized cropping (SS) and random-resize cropping (RR) and different model input resolutions. Values are given in percent as mean and standard deviation over all five CV folds. Ensemble average refers to averaging over all predictions from all models. Ensemble optimal refers to our search strategy for the optimal subset of configurations. One marked configuration refers to training with eight classes, i.e., without the unknown class. T1 refers to Task 1 without meta data and T2 refers to Task 2 with meta data. ResNext WSL 1 and 2 refer to ResNeXt-101 WSL 32x8d and 32x16d, respectively.
For evaluation we consider the mean sensitivity for training with images only and for training with additional meta data. The results for cross-validation with individual models and our ensemble are shown in Table 1. Overall, large ENs tend to perform better. Comparing our input strategies, both appear to perform similarly in most cases. Including the ninth class with different skin alterations slightly reduces performance for the first eight classes. Ensembling leads to substantially improved performance. Our optimal ensembling strategy improves performance slightly further. The optimal ensemble contains nine out of the sixteen configurations.
Regarding meta data, performance tends to improve by up to a few percent points through the incorporation of meta data. This increase is mostly observed for smaller models, as larger models show only minor performance changes. The final ensemble shows improved performance.
For our final submission to the ISIC 2019 Challenge task 1 we created an ensemble with both the best and last model checkpoints. For task 2, we submitted an ensemble with the best model checkpoints only and an ensemble with both best and last model checkpoints. The submission with only the best model checkpoints performed better. Overall, the performance on the official test set is substantially lower than the cross-validation performance. The performance for task 2 is lower than the performance for task 1.
| Class | Task 1 | Task 2 |
|-------|--------|--------|
Table 2 shows several metrics for the performance on the official test set. For task 1, the performance for the unknown class is substantially lower than for all other classes across several metrics. For task 2, the performance for the unknown class is also substantially reduced, compared to task 1.
We explore multi-resolution EfficientNets for skin lesion classification, combined with extensive data augmentation, loss balancing and ensembling for our participation in the ISIC 2019 Challenge. In previous challenges, data augmentation and ensembling were key factors for high-performing methods. Also, class balancing has been studied, where loss weighting with a cross-entropy loss function performed very well. We incorporate this prior knowledge in our approach and also consider the input resolution as an important parameter. Our results indicate that models with a large input size perform better, see Table 1. For a long time, small input sizes have been popular, and the effectiveness of an increased input size is likely tied to EfficientNet's new scaling rules. EfficientNet scales the models' width and depth according to the associated input size, which leads to high-performing models with substantially lower computational effort and fewer parameters compared to other methods. We find that these concepts appear to transfer well to the problem of skin lesion classification.
When adding meta data to the model, performance tends to improve slightly for our cross-validation experiments. The improvement is particularly large for smaller, lower-performing models. This might indicate that meta data helps models that do not leverage the full information that is available in the images alone.
The ISIC 2019 Challenge also includes the problem of predicting an additional, unknown class. At the point of submission, there was no labeled data available for this class; thus, cross-validation results do not reflect our model's performance with respect to it. The performance on the official test set provides some insights into the unknown class, see Table 2. First, it is clear that the performance on the unknown class is substantially lower than the performance on the other classes. This could explain why there is a substantial difference between our cross-validation results and the results on the official test set. Second, we can observe a substantial performance reduction for the unknown class between task 1 and task 2. This might explain the lack of improvement for task 2, although our cross-validation performance improved with additional meta data. This is likely linked to the fact that we do not have meta data for our unknown class training images. Although we tried to overcome the problem with meta data dropout, our models appear to overfit to the characteristic of missing data for the unknown class.
Overall, we find that EfficientNets perform well for skin lesion classification. In our final ensembling strategy, various EfficientNets were present, although the largest ones performed best. This indicates that a mixture of input resolutions is a good choice to cover multi-scale context for skin lesion classification. Also, SENet154 and the ResNext models were automatically selected for the final ensemble which indicates that some variability in terms of architectures is helpful.
In this paper we describe our method that ranked first in both tasks of the ISIC 2019 Skin Lesion Classification Challenge. We overcome the typical problem of severe class imbalance for skin lesion classification with a loss balancing approach. To deal with multiple image resolutions, we employ various EfficientNets with different input cropping strategies and input resolutions. For the unknown class in the challenge's test set, we take a data-driven approach with images of healthy skin and benign alterations. We incorporate meta data into our model with a two-path architecture that fuses both dermoscopic and meta data information. While setting a new state-of-the-art for skin lesion classification, we find that predicting an unknown class and the optimal use of meta data are still challenging problems.
-  Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016) Rethinking the inception architecture for computer vision. In: CVPR, pp. 2818–2826
-  Kawahara, J., Daneshvar, S., Argenziano, G., Hamarneh, G. (2018) 7-point checklist and skin lesion classification using multi-task multi-modal neural nets. IEEE Journal of Biomedical and Health Informatics
-  Codella, N., Rotemberg, V., Tschandl, P., Celebi, M.E., Dusza, S., Gutman, D., Helba, B., Kalloo, A., Liopyris, K., Marchetti, M., et al. (2019) Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368
-  Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al. (2018) Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 168–172. IEEE
-  Combalia, M., Codella, N.C., Rotemberg, V., Helba, B., Vilaplana, V., Reiter, O., Halpern, A.C., Puig, S., Malvehy, J. (2019) Bcn20000: Dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288
-  Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V. (2018) Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501
-  DeVries, T., Taylor, G.W. (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552
-  Gessert, N., Schlaefer, A. (2019) Left ventricle quantification using direct regression with segmentation regularization and ensembles of pretrained 2d and 3d cnns. In: International Workshop on Statistical Atlases and Computational Models of the Heart
-  Gessert, N., Sentker, T., Madesta, F., Schmitz, R., Kniep, H., Baltruschat, I., Werner, R., Schlaefer, A. (2018) Skin lesion diagnosis using ensembles, unscaled multi-crop evaluation and loss weighting. arXiv preprint arXiv:1808.01694
-  Gessert, N., Sentker, T., Madesta, F., Schmitz, R., Kniep, H., Baltruschat, I., Werner, R., Schlaefer, A. (2019) Skin Lesion Classification Using CNNs with Patch-Based Attention and Diagnosis-Guided Loss Weighting. IEEE Transactions on Biomedical Engineering
-  Hu, J., Shen, L., Sun, G. (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141
-  Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P. (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988
-  Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., van der Maaten, L. (2018) Exploring the limits of weakly supervised pretraining. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196
-  Tan, M., Le, Q.V. (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
-  Tschandl, P., Codella, N., Akay, B.N., Argenziano, G., Braun, R.P., Cabo, H., Gutman, D., Halpern, A., Helba, B., Hofmann-Wellenhof, R., et al. (2019) Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. The Lancet Oncology
-  Tschandl, P., Rosendahl, C., Kittler, H. (2018) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5, 180161