Malignant melanoma (MM) is the deadliest type of skin cancer, and its incidence rate has increased over recent years. However, early detection of MM significantly increases the overall survival rate. Because the morphological patterns of MM can resemble those of other pigmented skin lesions, only 65-80% of melanomas are correctly diagnosed using unaided visual inspection by experienced medical experts. Supportive imaging modalities such as dermoscopy can improve diagnostic accuracy by up to 50%. In general, diagnostic accuracy correlates highly with the expert’s experience. Therefore, a computer-based technique that supports inexperienced physicians or serves as a second opinion is of great interest.
The most promising solutions for computer-based skin lesion classification make use of deep learning and, in particular, of convolutional neural networks (CNNs). As the number of publicly available skin lesion images is rather small, transfer learning is the conventional approach to using CNNs for skin lesion classification. Here, pre-trained CNNs are usually fine-tuned to perform the classification task [17, 6, 2, 18]. Various well-known pre-trained CNNs, such as ResNet, GoogLeNet, and DenseNet, have been introduced and can be used for skin lesion classification.
Resizing skin lesion images is generally necessary to allow for fine-tuning of pre-trained CNNs since pre-trained CNNs are typically trained with natural images of a fixed image size that is significantly smaller than dermoscopic images. Furthermore, due to computational limitations, it might be impossible to fine-tune networks with original high-resolution skin lesion images. On the other hand, downsampling may lead to a loss of useful medical information and the ideal resizing factor to fine-tune pre-trained CNNs remains an open question. In some previous studies [12, 29, 4], resized skin lesion images larger than the original input size of the utilised CNN were used. However, these studies were limited to a fixed re-scale factor or to a certain CNN. Therefore, the effect of input image size on the skin lesion classification performance still needs to be further explored.
In this paper, we investigate the effect of image re-scaling on the skin lesion classification performance of several fine-tuned CNNs, namely ResNet-18, ResNet-50, and DenseNet-121. We examine the classification performance with re-scaled input images of five different resolutions: , , , , and pixels. To our knowledge, this is the first work investigating the effect of using both very small images and very large images on skin lesion classification performance. Moreover, we propose a three-level fusion approach that ensembles the results of different fine-tuned networks trained with images of different sizes. Experimental results show that this approach yields state-of-the-art skin lesion classification performance when applied to the ISIC 2017 challenge test dataset, with an average AUC of 92.9%.
II Materials and Methods
We employ two datasets from the ISIC archive (https://www.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery). For fine-tuning pre-trained CNNs, we use the training, validation and test sets of the ISIC 2016 challenge dataset as well as the training and validation sets of the ISIC 2017 challenge dataset, which includes three types of skin lesions. From these two datasets, 2187 dermoscopic images are extracted for training, comprising 411 MM, 254 seborrheic keratosis (SK), and 1372 benign nevi (BN) images. For testing our algorithm, we use the 600 test images from the ISIC 2017 challenge dataset. Both training and test images contain various artefacts and are of image resolutions ranging from to pixels.
For pre-processing, we first apply a grayworld colour constancy algorithm to normalise the colours of the images, as suggested in prior work. Then, we subtract the mean RGB intensity of the ImageNet dataset from the corresponding channel of all training and test images. Finally, we resize the images to the five different resolutions of , , , , and pixels using bi-cubic interpolation. For non-square images, the aspect ratio is changed during downsampling.
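The first two pre-processing steps can be illustrated with a minimal sketch. It assumes one common formulation of gray-world colour constancy (each channel is rescaled so that its mean matches the overall gray level), which may differ in detail from the variant used in this work; the function names and the clipping to the 8-bit range are illustrative assumptions, and the dataset mean is passed in as a parameter rather than fixed.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]   # (R, G, B) intensities
Image = List[List[Pixel]]            # rows of pixels


def channel_means(img: Image) -> Tuple[float, float, float]:
    """Return the mean intensity of the R, G and B channels."""
    n = sum(len(row) for row in img)
    sums = [0.0, 0.0, 0.0]
    for row in img:
        for px in row:
            for c in range(3):
                sums[c] += px[c]
    return (sums[0] / n, sums[1] / n, sums[2] / n)


def gray_world(img: Image) -> Image:
    """Rescale each channel so that all three channel means become equal.

    Assumption: plain gray-world with per-channel gains; values are clipped
    to 255, which is an illustrative choice for 8-bit images.
    """
    means = channel_means(img)
    gray = sum(means) / 3.0
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [
        [tuple(min(255.0, px[c] * gains[c]) for c in range(3)) for px in row]
        for row in img
    ]


def subtract_mean(img: Image, mean_rgb: Tuple[float, float, float]) -> Image:
    """Subtract a per-channel mean (e.g. a dataset mean) from every pixel."""
    return [
        [tuple(px[c] - mean_rgb[c] for c in range(3)) for px in row]
        for row in img
    ]
```

In this sketch, an image would first be passed through `gray_world` and then through `subtract_mean` with the per-channel mean of the pre-training dataset; resizing is left out because it depends on the interpolation routine of the chosen image library.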
II-C Pre-trained CNNs
We use three well-established pre-trained CNNs with different depths and architectures, namely ResNet-18, ResNet-50 and DenseNet-121. These networks have been shown to give excellent classification performance for various medical image classification tasks, including skin lesion classification [16, 29, 6]. ResNet has a special building block, called a residual block, with skip connections between the input and the output layer in each block. DenseNet’s architecture consists of dense blocks that connect each layer to all other layers in a feed-forward fashion. Both ResNet and DenseNet architectures can alleviate the vanishing gradient problem and strengthen feature propagation through the network. Although both architectures exist at varying depths, we choose the shallower ResNet-18 and ResNet-50 for the ResNet model and DenseNet-121 for the DenseNet model to prevent overfitting to our limited training data.
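The difference between the two connection patterns can be made concrete with a toy sketch that operates on plain feature vectors. The functions `residual_block` and `dense_block` are hypothetical illustrations of the connectivity only; real blocks use convolutions, normalisation and non-linearities, none of which is modelled here.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def residual_block(x: Vector, transform: Layer) -> Vector:
    """Skip connection: the block output is transform(x) + x."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the feature dimension")
    return [a + b for a, b in zip(fx, x)]


def dense_block(x: Vector, layers: List[Layer]) -> Vector:
    """Dense connectivity: every layer sees the block input concatenated
    with the outputs of all preceding layers in the block."""
    features = list(x)
    for layer in layers:
        new_features = layer(features)
        features = features + new_features  # concatenation, not addition
    return features
```

For example, `residual_block([1.0, 2.0], lambda v: [2 * a for a in v])` adds the doubled features to the input, whereas `dense_block` grows the feature vector with each layer. In both cases the block input reaches later layers directly, which is the property that eases gradient flow.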
For fine-tuning, the original fully connected (FC) layers of the pre-trained networks are replaced by two new FC layers with 64 and 3 nodes to adapt to the ternary (MM, SK, BN) classification task, similar to earlier work. We randomly initialise the weights of these layers from a Gaussian distribution with zero mean and a standard deviation of 1. We freeze the initial weight layers of the networks to speed up training and to prevent overfitting. For DenseNet-121, the dense blocks up to the third block are frozen, while for ResNet-18 and ResNet-50 the residual blocks up to the fourth and the 17th block, respectively, are frozen. All three networks are initially pre-trained on natural images of size pixels. For the other resolutions, we adapt the average pooling layer just before the FC layers to avoid a dimensionality mismatch.
We examine the effect of training the networks with three different optimisers, namely stochastic gradient descent with momentum (SGDM), root mean square propagation (RMSProp), and adaptive moment estimation (Adam) [13]. We set the learning rate and momentum to 0.001 and 0.9 for SGDM and the learning rate to 0.0001 for RMSProp and Adam. However, we keep the learning rate of the new FC layers 10 times larger than that of all other weight layers for all networks. We choose batch sizes ranging from 16 to 64, depending on the network, the image resolution and the available GPU memory. Finally, we train the networks for 15 epochs, and the learning rate is dropped by a factor of 10 at the fifth and tenth epoch.
To artificially increase the training set size, we augment the training data by image rotations (90, 180 and 270 degrees) and horizontal image flipping, leading to an 8-fold increase of training data. The same augmentation scheme is applied in the inference phase: for a single test image, rotated and horizontally flipped versions are fed to the fine-tuned networks, and the average result over all 8 augmented images is used as the prediction for that test image.
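The 8-fold augmentation and its use at inference time can be sketched as follows. This is a minimal illustration on single-channel images stored as nested lists; the `model` argument is a hypothetical stand-in for any fine-tuned network that maps an image to a class-probability vector.

```python
from typing import Callable, List

Image = List[List[float]]
Model = Callable[[Image], List[float]]


def rotate90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def eight_views(img: Image) -> List[Image]:
    """Return the 0/90/180/270 degree rotations, each with and without a
    horizontal flip (8 views in total)."""
    views = []
    current = [list(row) for row in img]
    for _ in range(4):
        views.append(current)
        views.append(flip_horizontal(current))
        current = rotate90(current)
    return views


def predict_with_augmentation(img: Image, model: Model) -> List[float]:
    """Average the model's probability vectors over the 8 augmented views."""
    outputs = [model(view) for view in eight_views(img)]
    n = len(outputs)
    return [sum(out[k] for out in outputs) / n for k in range(len(outputs[0]))]
```

During training, the same `eight_views` function would simply enlarge the training set, while at test time `predict_with_augmentation` yields one averaged prediction per image.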
II-E Three-level network fusion
At level 1, we train each network three times with the same hyper-parameters and one optimiser and repeat the procedure for each of the three optimisers. We then take the average over all derived classification probability vectors (i.e., the average over 9 classification outputs for a single network). At level 2, we further fuse the results from the individual networks trained using four different image resolutions (i.e.,, , , and pixels). At the third and final fusion level, we fuse the predicted probability vectors of the various architectures to yield the final classification result. The final classification is thus derived from 108 sub-models (9 models × 4 resolutions × 3 architectures), as shown in Fig. 1.
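The hierarchy described above can be sketched as three nested averaging steps. The data layout (`preds[architecture][resolution]` holding the 9 probability vectors of level 1) and the resolution keys are assumptions made for illustration only.

```python
from typing import Dict, List

Probs = List[float]
# preds[architecture][resolution] -> 9 vectors (3 optimisers x 3 runs)
Predictions = Dict[str, Dict[str, List[Probs]]]


def mean_vectors(vectors: List[Probs]) -> Probs:
    """Element-wise average of equally long probability vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def three_level_fusion(preds: Predictions) -> Probs:
    """Average over models (level 1), resolutions (level 2) and
    architectures (level 3) for a single test image."""
    per_architecture = []
    for by_resolution in preds.values():
        level1 = [mean_vectors(models) for models in by_resolution.values()]
        per_architecture.append(mean_vectors(level1))  # level 2
    return mean_vectors(per_architecture)              # level 3


def count_sub_models(preds: Predictions) -> int:
    """Total number of individual models contributing to the fused result."""
    return sum(
        len(models)
        for by_resolution in preds.values()
        for models in by_resolution.values()
    )
```

With three architectures, four resolutions and nine models per combination, `count_sub_models` returns 108. Because every group has the same size, this nested averaging is equivalent to a plain average over all sub-models, but the intermediate results allow each fusion level to be evaluated separately.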
|Fusion||Network||MM AUC (%)||SK AUC (%)||Average AUC (%)|
|average over optimisers||ResNet-18||85.46||93.39||89.42|
|average over optimisers||ResNet-50||85.64||92.48||89.06|
|average over optimisers||DenseNet-121||85.53||93.36||89.45|
As suggested for the ISIC 2017 skin lesion classification challenge, we use the area under the receiver operating characteristic (ROC) curve (AUC) as the main evaluation index. We train all models to solve a ternary classification problem, as there are three skin lesion types in the dataset. However, as the ISIC 2017 challenge evaluation is based on two binary classification tasks, namely MM vs. all and SK vs. all, we convert the three-element prediction vectors to two-element prediction vectors using a one-versus-all approach. This allows us to compare our results with other algorithms previously applied to the same dataset.
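The one-versus-all conversion and the AUC computation can be sketched as follows. The class order (MM, SK, BN) and the helper names are illustrative assumptions, and the AUC is computed with the pairwise ranking formulation (ties count one half), which equals the area under the ROC curve.

```python
from typing import List, Sequence

CLASSES = ("MM", "SK", "BN")  # assumed order of the prediction vector


def one_vs_all_scores(probs: Sequence[Sequence[float]], positive: str) -> List[float]:
    """Keep the probability of the positive class; the implied two-element
    vector for each image is (p, 1 - p)."""
    k = CLASSES.index(positive)
    return [p[k] for p in probs]


def binary_labels(truth: Sequence[str], positive: str) -> List[int]:
    """1 for the positive class, 0 for all other classes."""
    return [1 if t == positive else 0 for t in truth]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_auc(probs: Sequence[Sequence[float]], truth: Sequence[str]) -> float:
    """Mean of the MM-vs-all and SK-vs-all AUC values."""
    values = [
        auc(one_vs_all_scores(probs, c), binary_labels(truth, c))
        for c in ("MM", "SK")
    ]
    return sum(values) / len(values)
```

The pairwise formulation is quadratic in the number of images, which is acceptable for a test set of a few hundred images; a rank-based implementation would be preferable for much larger sets.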
The trained models are evaluated on the (unseen) 600 test images of the ISIC 2017 challenge for skin lesion classification. In particular, there are 117 MMs, 90 SKs, and 393 BN dermoscopic images. We use identical pre-processing and augmentation techniques for all test images as described in Section II-B and Section II-D.
As described in Section II-E, to obtain a more robust and improved classification performance for each individual network and for each image resolution, we fuse the results of 9 models (level 1 fusion in Fig. 1). The results obtained by this fusion scheme for (as an example) images of pixels are given in Table I.
Next, we investigate the effect of input image sizes on the classification performances of the various fine-tuned deep models. The results of this are given in Table II for the five image sizes with the results of each network obtained by level 1 fusion from 9 models as explained above.
Table III shows the results obtained by the higher-level fusion schemes (i.e., level 2 and level 3 fusion in Fig. 1) as described in Section II-E. We exclude the smallest image size ( pixels) since, as is apparent from Table II, it led to significantly degraded classification performance.
|Method||Image size||MM AUC (%)||SK AUC (%)||Average AUC (%)|
|ResNet-18 (level 2)||all sizes||89.12||96.26||92.69|
|ResNet-50 (level 2)||all sizes||88.50||96.03||92.27|
|DenseNet-121 (level 2)||all sizes||87.69||95.77||91.73|
|level 3 fusion||all sizes||89.16||96.57||92.86|
We also evaluate another fusion strategy which performs fusion of the three networks’ outputs for a single image resolution and compare this with the proposed three-level fusion approach. The results of this comparison are shown in Table IV.
|Method||Image size||MM AUC (%)||SK AUC (%)||Average AUC (%)|
|fusion of all nets|| ||80.59||89.67||85.13|
|fusion of all nets|| ||86.39||94.26||90.32|
|fusion of all nets|| ||86.86||94.42||90.64|
|fusion of all nets|| ||88.99||96.05||92.52|
|fusion of all nets|| ||88.70||95.64||92.17|
|three-level fusion||all sizes||89.16||96.57||92.86|
Fig. 2 shows the receiver operating characteristic (ROC) curve of the best approach, which combines the results from all fusion levels.
Fig. 3 and Fig. 4 give examples of correctly and incorrectly classified skin lesion images of our three-level fusion approach, respectively. For these, to convert the three-element prediction vectors into class labels, we choose the class with the highest probability in each prediction vector as the class predicted by the model.
We compare the performance of our proposed fusion approach (i.e., last row in Table III) with the top three performers of the ISIC 2017 competition as well as with other state-of-the-art algorithms which have been applied on the same dataset and have shown superior classification performance compared to the ISIC 2017 challenge winners. The results of the comparison are given in Table V in terms of AUC.
|Method||Image size||MM AUC (%)||SK AUC (%)||Average AUC (%)|
|Matsunaga et al.||n/a||86.8||95.3||91.1|
|Menegola et al.|| ||87.4||94.3||90.8|
|Mahbod et al.|| ||87.3||95.5||91.4|
|Zhang et al.|| ||87.5||95.8||91.7|
|Yan et al.|| ||88.3||n/a||n/a|
|Guo et al.|| ||87.4||95.9||91.7|
Matsunaga et al., top-ranked in the competition, made use of colour constancy as a pre-processing step and two separate classifiers for the two binary classification problems (i.e., MM vs. all and SK vs. all). For each classifier, a fine-tuned ResNet-50 was used as the backbone model. As a post-processing step, sex and age information were fused with the classifier’s output to yield the final classification results. The down-sampling factor was not reported in their approach. Gonzalez-Diaz, the runner-up, performed classification in a multi-step approach that employed three deep models, including a full CNN to segment lesion areas in the images, a constrained CNN to add more clinical features for better categorisation, and a modified fine-tuned ResNet-50 to perform the final classification. They used images resized to pixels for network training. Menegola et al., the third-ranked team, used extensive external data sources from the ISIC archive and ensembled seven fine-tuned models (six models based on the Inception-v4 architecture and one model based on the ResNet-101 architecture). Images down-sampled to pixels were used, which made the algorithm faster compared to the other winners of the competition.
Several methods were developed after the ISIC 2017 competition and reported better performance compared to the challenge winners. In our earlier work, we used inter- and intra-architecture network fusion to extract deep features from several fine-tuned deep models. However, a single image resolution of pixels was used in this approach. Zhang et al. proposed a novel attention residual learning CNN whose residual blocks aim to prevent the degradation problem and whose attention mechanism forces the network to focus on lesion areas. A pre-trained ResNet-50 model served as the backbone model, and image patches of size pixels acquired by central cropping of the original images at different scales were used. Similar to Zhang et al., Yan et al. also used the idea of a learnable attention mechanism. They added two attention modules to a pre-trained VGG16 network and concatenated the features from the attention modules and the last convolutional layer by an average pooling layer before performing classification. Images down-sampled to pixels were used in their study. Guo et al. proposed a multi-channel ResNet to perform skin lesion classification. In their approach, an OverFeat model was used to crop skin lesion images. Then, the images were pre-processed by various techniques and used to fine-tune a number of ResNet-50 models. The image features from the different models were concatenated and used to perform skin lesion classification. Similar to the other aforementioned methods, all images were resized to a fixed image size of pixels in their method.
Our algorithm is implemented in MATLAB (ver. 2018a) based on the MatConvNet framework and the MATLAB Neural Network Toolbox. All experiments were conducted on a single workstation with an Intel Core i5-6600K 3.50 GHz CPU, 16 GB of RAM and a single NVIDIA GTX 1070 card with 8 GB of installed memory. The average training times for each deep architecture and each image resolution are reported in Table VI. The training times for the different optimisers vary slightly, and the reported results in Table VI are the average training times in minutes.
In this study, we explicitly investigate the effect of image re-scaling on skin lesion classification performance of several CNNs. Moreover, we achieve state-of-the-art classification performance on the ISIC 2017 challenge dataset by proposing a straightforward three-level ensemble strategy that uses multiple fine-tuned CNNs and multi-resolution dermoscopic images.
From the results in Table I, we can see that fine-tuning pre-trained network models with different optimisers delivers comparable classification performance. However, combining the results from the different optimisers leads to a better average AUC compared to the individual AUCs of all three models. This fusion step was inspired by our earlier work, where a combination of 18 models was used for inter-network fusion. In contrast, here we fuse the results of only nine models to reduce training time (the other nine models in that work were trained with the same parameters, but with a different pre-processing step).
The results in Table II show the effect of input image size on the classification performance. The obtained results are of interest for several reasons. First, down-sampled images, even at a drastically reduced image size of pixels, still hold valuable information for classification as even the lowest AUC obtained (82.99%) is useful. However, the general performance of the models trained on pixel images is significantly lower compared to the results obtained using images with higher resolution. Thus, it is evident that heavy downsampling causes a loss of valuable information, which is also the reason we exclude the lowest image resolution from the subsequent fusion schemes.
Second, Table II shows a tendency of improved classification performance with increasing image size. The average results over the different models in Table II are 89.31%, 89.41%, 90.81% and 91.44% for input image resolutions of , , and pixels, respectively. If we fuse the results from all three networks at a specific image resolution (i.e., the fusion of all networks reported in Table IV), we obtain improved average AUC values of 90.32%, 90.64%, 92.52% and 92.17%, respectively. Thus, in both cases, an increase in image resolution generally leads to an improvement in classification performance. Since the smallest image resolution in the dataset was pixels and also considering computational limitations, we did not conduct further experiments with images larger than . To our knowledge, fine-tuning any model for skin lesion classification with resized images of pixels is performed for the first time in this work.
Third, Table II allows for a comparison of the individual performances of the employed fine-tuned network models. Considering the average performance of each model over the various image resolutions (i.e., 90.93%, 90.11% and 89.68% for ResNet-18, ResNet-50 and DenseNet-121, respectively), ResNet-18 shows the best performance. The ranking of the same models in the ImageNet Large Scale Visual Recognition Challenge is reversed (i.e., DenseNet-121 is the best and ResNet-18 the worst). However, considering the number of training examples in the utilised dataset, we can infer that deeper models such as DenseNet-121 have a greater potential to overfit to the small training set in this study, while shallower models such as ResNet-18 generalise better.
The results in Table III show the effect of the second- and third-level fusion schemes of our approach. It is apparent that the multi-resolution fusion approach delivers better classification performance than any network trained on a single image resolution. Our proposed three-level fusion approach, which combines the results of 108 models, yields better classification performance still, outperforming all single networks and all lower-level fusion schemes.
The results in Table IV show that fusing different networks at a single image resolution can also lead to improved classification. In general, the fusion of the three network models, i.e., ResNet-18, ResNet-50 and DenseNet-121, gives better results than any single network, although, interestingly, this is not the case for the highest image resolution of pixels, where ResNet-18 performs slightly better than the combination of the three networks. Our proposed three-level approach is superior to all networks fused this way.
The comparative results in Table V show that our proposed fusion scheme outperforms other state-of-the-art algorithms for both MM and SK recognition, with an improvement of at least 1.2 percentage points in average AUC, confirming it to be a powerful approach for skin lesion classification.
While all reported results are derived from the ISIC 2017 challenge test dataset, a direct comparison of the classification performance is not trivial as different training sets were used in the different approaches. However, our method exploits fewer external training samples compared to most of the other approaches; 1444 external training samples were used in , 900 in , 7544 in , and 1320 in , while we utilised only 187 external training images (the same number as in ).
Looking at Table VI, it is apparent that the training time required for DenseNet-121 is higher compared to the other networks, which was expected since this architecture is significantly deeper than the other ones. Also as expected, more time is required for training networks with higher resolution images since more convolutions need to be performed in each layer of the networks.
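The dependence of training cost on input resolution can be illustrated with a rough count of multiply-accumulate operations for a single convolutional layer. This is a back-of-the-envelope sketch: the layer dimensions and image sides below are illustrative values, not the resolutions used in this study, and it assumes that the output feature map scales proportionally with the input.

```python
def conv_macs(h_out: int, w_out: int, kernel: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate operations of one convolutional layer:
    one kernel x kernel x c_in dot product per output position and filter."""
    return h_out * w_out * kernel * kernel * c_in * c_out


def relative_cost(side: int, reference_side: int) -> float:
    """Cost of a square input relative to a reference input, assuming the
    feature maps of every layer scale with the input side length."""
    return (side / reference_side) ** 2


# Illustrative example: doubling the side length of the feature map
# quadruples the work of the layer.
small = conv_macs(56, 56, 3, 64, 64)
large = conv_macs(112, 112, 3, 64, 64)
```

Under these assumptions, the number of operations grows quadratically with the image side length, which is consistent with the longer training times observed for higher-resolution inputs.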
Finally, Fig. 3 illustrates that our proposed algorithm is able to correctly classify challenging skin lesion images that contain various artefacts such as skin hair or ruler charts as well as images that would be difficult to automatically segment correctly. From Fig. 4 we can see that some even more challenging images are still misclassified, including some images where the lesion borders are not well defined and samples where the lesion occupies only a small part of the image.
While we have evaluated the effects of input image sizes as well as the effects of multi-model and multi-resolution image fusion for skin lesion classification, this work has some limitations. The biggest limitation of our proposed fusion approach is the training time required to derive the classification models, which may not be suitable for application in a clinical setting. However, as it is possible to train different networks in parallel, the overall training time can be significantly reduced by using several suitable computational devices. Another consideration is the fusion scheme of our ensembling method, which is simple averaging. In earlier work on phenotyping high-content cellular images, a multi-scale CNN (M-CNN) was proposed that used images at multiple scales in a single network. However, as the network width increased drastically, the authors could only use three convolutional layers, which led to a very shallow network. Moreover, because a new architecture was proposed, they had to train the model from scratch and hence were unable to take advantage of transfer learning, and their approach did not allow the contribution of each image scale to the final classification performance to be evaluated. Nevertheless, with sufficient computational power, the classification performance of an M-CNN with pre-trained deep models for each image scale can be investigated. Another issue that can be addressed in future work is the resizing factor. While in this paper we utilised five downsampling factors, the effect of other image resolutions between the minimum and maximum sizes can also be investigated. Finally, the number of pre-trained networks that we use is limited to three. Exploring other architectures may be addressed in future studies.
In this paper, we have investigated the effect of image resolutions for transfer learning classification performance in the context of skin lesion analysis. The results of our study show that while down-sampling images to a very low resolution may not be optimal for fine-tuning pre-trained convolutional neural networks, even low-resolution images yield acceptable classification results. In contrast, images with higher resolution support further improved classification performance. In addition, we have presented a three-level fusion approach that combines results from different networks and different image resolutions and is demonstrated to result in the best classification performance compared to a number of state-of-the-art algorithms for skin lesion analysis and evaluated on the ISIC 2017 skin lesion classification challenge dataset.
This research has received funding from the Marie Sklodowska-Curie Actions of the European Union’s Horizon 2020 programme under REA grant agreement no. 675228. The authors thank NVIDIA Corporation for their generous GPU donation.
[1] (2015) Improving dermoscopy image classification using color constancy. 19 (3), pp. 1146–1152.
[2] (2018) Skin cancer classification using convolutional neural networks: systematic review. 20 (10), pp. e11936.
[3] (2017) Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1710.05006.
[4] (2017) Skin lesion classification using deep multi-scale convolutional neural networks. arXiv preprint arXiv:1703.01402.
[5] (2017) Incorporating the knowledge of dermatologists to convolutional neural networks for the diagnosis of skin lesions. arXiv preprint arXiv:1703.01976.
[6] (2018) Skin lesion diagnosis using ensembles, unscaled multi-crop evaluation and loss weighting. arXiv preprint arXiv:1808.01694.
[7] (2017) A multi-scale convolutional neural network for phenotyping high-content cellular images. 33 (13), pp. 2010–2019.
[8] (2018) Multi-channel-ResNet: an integration framework towards skin lesion analysis. Informatics in Medicine Unlocked 12, pp. 67–74.
[9] (2016) Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1605.01397.
[10] (2016) Deep residual learning for image recognition. In , pp. 770–778.
[11] (2017) Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition.
[12] (2016) Deep features to classify skin lesions. In International Symposium on Biomedical Imaging, pp. 1397–1400.
[13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[14] (2004) Dermatoscopy of pigmented skin lesions. 139 (6), pp. 541–546.
[15] (2014) Epidemiology of skin cancer. In Sunlight, Vitamin D and Skin Cancer, pp. 120–140.
[16] (2018) Breast cancer histological image classification using fine-tuned deep network fusion. In Image Analysis and Recognition, A. Campilho, F. Karray, and B. ter Haar Romeny (Eds.), Cham, pp. 754–762.
[17] (2019) Fusing fine-tuned deep features for skin lesion classification. 71, pp. 19–29.
[18] (2019-05) Skin lesion classification using hybrid deep neural networks. In International Conference on Acoustics, Speech and Signal Processing, pp. 1229–1233.
[19] (2017) Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble. arXiv preprint arXiv:1703.03108.
[20] (2017) RECOD titans at ISIC challenge 2017. arXiv preprint arXiv:1703.04819.
[21] (2012) Machine learning: a probabilistic perspective. MIT Press.
[22] (2015) ImageNet large scale visual recognition challenge. 115 (3), pp. 211–252.
[23] (2018) Melanoma. 392 (10151), pp. 971–984.
[24] (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
[25] Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Association for the Advancement of Artificial Intelligence, pp. 4278–4284.
[26] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
[27] (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. 4 (2), pp. 26–31.
[28] (2019) Melanoma recognition via visual attention. In Information Processing in Medical Imaging, A. C. S. Chung, J. C. Gee, P. A. Yushkevich, and S. Bao (Eds.), Cham, pp. 793–804.
[29] (2017) Aggregating deep convolutional features for melanoma recognition in dermoscopy images. In International Workshop on Machine Learning in Medical Imaging, pp. 238–246.
[30] (2019) Attention residual learning for skin lesion classification. 38 (9), pp. 2092–2103.