Melanoma is the leading cause of deaths due to skin cancer. Its prognosis is very good when detected early, but deteriorates rapidly as the disease progresses. Therefore, early diagnosis is critical and screening — the search for new cases — must be a continuous process.
Image processing can help melanoma screening programs by differentiating malignant from benign skin lesions. Nowadays, image classification is mostly done through deep learning techniques. Unfortunately, the large amounts of annotated medical data that deep learning requires are rarely available. Researchers therefore usually employ transfer learning techniques to deal with the lack of annotated images.
In this report we aim to clarify how transfer schemes may influence the final results of automated melanoma screening. The main aspects under investigation are: (1) if (and how) consecutive transfer schemes, which specialize the classification tasks along the pipeline, improve the results; (2) if transfer learning done between similar datasets/tasks improves the results; and (3) how much fine tuning improves results for small datasets.
Our main contributions are not in understanding fine-level details of transfer learning (e.g. parametrization), but in clarifying how transfer schemes should be organized to improve final results. Although we focus on the automated melanoma screening problem, we understand that the main findings may generalize to other specific-context datasets.
In visual tasks, the low-level layers of a deep neural network (DNN) tend to be fairly general; specialization increases as we move up in the network.
Therefore, transfer learning is commonly used in a straightforward way: you freeze the weights of a pre-trained original DNN up to a chosen layer, then replace and retrain the remaining layers for the new task, or plug an SVM classifier (or any other classifier) on the top layer. This approach is called vanilla transfer learning without fine-tuning.
In transfer learning with fine-tuning, we also update the network weights to adapt the original model to the desired task, improving classification results.
In this report, we forgo comparisons with the state-of-the-art, in order to focus on these questions: what is a good transfer learning scheme? Which original models work best for transfer? What is the relative impact of the original model choice versus the use of fine-tuning?
We investigate two approaches to transfer: simple and combined. For simple transfer learning, we analyse the results with and without fine tuning, selecting either a model trained on a huge general-context dataset (e.g. ImageNet) or a model trained on a smaller specific-domain dataset (e.g. images from the medical domain). The combined approach concatenates two sequential transfer steps: the first starts from a powerful general-context dataset and transfers the knowledge to a specific-domain dataset used to refine the model. The refined model is then used to transfer knowledge again, now to the target dataset (in our case, melanoma images).
As a baseline, we also use a DNN trained from scratch. That setup does not involve transfer learning: the model is generated from and tested only with melanoma images.
All approaches (from scratch, and with transfer learning) use the same DNN architecture: we adopt the VGG-M model proposed by Chatfield et al. When we use transfer learning, we copy all layers but the last from the pre-trained VGG-M model and adapt the final layer for the target task. When we fine-tune, we exchange the model output layer for a softmax output layer with two or three classes, according to the experiment, and train that complete neural model as usual, backpropagating the errors and updating the weights throughout the network. However, in all networks, including the fine-tuned ones, we ignore the output layer and employ an SVM classifier to make the decision, using the next-to-last layer output as features (Section IV). We describe the overall procedures next.
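The feature-extraction idea described above (reuse every pre-trained layer except the output layer, and read the next-to-last activations as features) can be sketched in a few lines. This is a toy illustration in plain Python, not the Lasagne code used in the experiments: the layers are stand-in functions, and `make_feature_extractor` is a hypothetical helper introduced only for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_feature_extractor(layers: Sequence[Layer]) -> Callable[[Vector], Vector]:
    """Return a function that runs every layer except the last (output) one."""
    frozen = list(layers[:-1])  # pre-trained layers, reused without modification

    def extract(x: Vector) -> Vector:
        for layer in frozen:
            x = layer(x)
        return x

    return extract


# Toy stand-ins for pre-trained layers (hypothetical, for illustration only).
def scale(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def output_layer(x: Vector) -> Vector:
    return [sum(x)]  # never called by the extractor


pretrained = [scale, relu, output_layer]
extract_features = make_feature_extractor(pretrained)
features = extract_features([1.0, -3.0, 0.5])  # these would be fed to the SVM
```

In the real pipeline the extractor would be the VGG-M network truncated before its output layer, and the resulting vectors would go to the SVM classifier.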
II-A Simple transfer learning WITHOUT fine tuning
1) The VGG-M network model is defined using the Lasagne library (https://lasagne.readthedocs.io/en/latest/);
2) We load the weights of a pre-trained network. In our approaches we have two options:
   - ImageNet: we used a pre-trained MatConvNet VGG-M model (http://www.vlfeat.org/matconvnet/pretrained/), trained on the ImageNet dataset. Since the available model is a .mat file, it was necessary to convert the weights to a Lasagne-readable format (we distribute the script that implements this conversion);
   - Medical domain: the VGG-M model is trained from scratch using a dataset of medical images. See specific details in II-D;
3) To match the input images to the size required by VGG-M, we resize all images to 224 × 224 pixels, using Pillow/ANTIALIAS (Pillow Imaging Library, version 2.3.0: https://pillow.readthedocs.io/), distorting the aspect ratio to fit when needed;
4) As a centering step, we subtract from every input image the mean of the dataset used to train the model. For the ImageNet case, the mean is available in the .mat file, whereas the mean of the medical dataset was calculated inside the code. These centered images are fed to the pre-trained model;
5) We use the weights of the pre-trained network to extract the outputs of Group 7/Layer 19, which are vectors of 4096 dimensions that describe the input images;
6) We ℓ2-normalize those vectors, which serve as mid-level features for the classification step;
7) We set aside 10% of the training set for validation;
8) We use the mid-level features from the remainder of the training set to train a linear SVM classifier. We chose the Sklearn implementation (http://scikit-learn.org/stable/). We performed a grid search over the margin hardness, using internal cross-validation to find the SVM classifier that maximizes the F-score (since this score can be used for all experimental designs; see Section III-B);
9) We incorporate the validation set into the training set and create a final SVM model using the best parameters from the grid search;
10) Finally, we use that “final SVM model” and the mid-level features from the melanoma testing set to obtain the reported mean Average Precision scores.
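The centering, normalization and validation hold-out steps of the procedure above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the normalization is taken to be Euclidean (L2), images and features are represented as flat lists of floats, and the helper names (`center`, `l2_normalize`, `hold_out`) are hypothetical. It is not the code distributed with the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def center(image: Sequence[float], mean: Sequence[float]) -> Vector:
    """Subtract the training-set mean from a flattened image."""
    return [p - m for p, m in zip(image, mean)]


def l2_normalize(v: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale a feature vector to unit Euclidean length (assumed L2 norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def hold_out(indices: Sequence[int], fraction: float = 0.1,
             seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split indices into (train, validation), keeping `fraction` for validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * fraction))
    return shuffled[n_val:], shuffled[:n_val]


centered = center([10.0, 20.0], [4.0, 5.0])
feature = l2_normalize([3.0, 4.0])
train_idx, val_idx = hold_out(range(50))
```

The `eps` guard only avoids a division by zero for an all-zero vector; it does not change the result for ordinary features.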
II-B Simple transfer learning WITH fine tuning
1) We follow steps 1) to 4) of II-A;
2) We augment the training set, generating new perturbed images in order to balance the classes. The perturbations are zoom, rotation, shear, translation, flipping and stretching transformations. We apply those transformations to each image, with parameters chosen at random. We add the newly generated images to the training set, while images of the majority class are excluded at random until the classes are balanced;
3) We train the loaded model for 200 epochs with a learning rate schedule: the rate starts at an initial value and is reduced at epoch 100 and again at epoch 150. While the model is being fine-tuned, we save the weights that minimize the validation loss;
4) Then, we follow steps 5) to 10) of II-A, but in step 5) we use the weights with the lowest validation loss to extract the outputs of Group 7/Layer 19.
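The fine-tuning schedule in the steps above (200 epochs, with the learning rate reduced at epochs 100 and 150, keeping the weights with the lowest validation loss) can be sketched as below. The learning-rate values are placeholders chosen only for illustration, since the actual values are not stated here; epochs are counted from 0, which is also an assumption.

```python
from typing import Dict, List, Tuple


def learning_rate(epoch: int, base_lr: float, drops: Dict[int, float]) -> float:
    """Piecewise-constant schedule: from each epoch in `drops` on, use that rate."""
    lr = base_lr
    for start in sorted(drops):
        if epoch >= start:
            lr = drops[start]
    return lr


def best_epoch(val_losses: List[float]) -> Tuple[int, float]:
    """Index and value of the lowest validation loss (the checkpoint to keep)."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx, val_losses[idx]


# Placeholder rates (hypothetical): only the drop epochs come from the text.
BASE_LR = 1e-3
DROPS = {100: 1e-4, 150: 1e-5}

schedule = [learning_rate(e, BASE_LR, DROPS) for e in range(200)]
kept_epoch, kept_loss = best_epoch([0.9, 0.6, 0.4, 0.45, 0.5])
```

In a real training loop, `best_epoch` would be replaced by saving the weights whenever the validation loss improves, so that only one checkpoint needs to be stored.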
II-C Combined transfer learning
The combined transfer learning is a sequence of two consecutive simple transfers. First we load the ImageNet model and fine tune it to a smaller medical domain dataset. Then we use that fine tuned model to perform a second fine tuning step, now over the melanoma dataset.
II-D Training a DNN from scratch
As mentioned before, our baseline is a DNN model trained from scratch using melanoma images. The steps are similar to those of transfer learning with fine tuning (II-B), with small differences:
We initialize the network with random weights;
The mean used to center the images is now calculated over the training set of the melanoma dataset.
III Experimental Details
We used the dataset of the Interactive Atlas of Dermoscopy. This Atlas is a multimedia guide (booklet + CD-ROM) designed for training medical personnel to diagnose skin lesions. The CD-ROM contains 1000+ clinical cases, each with at least two images of the lesion: a close-up clinical image and a dermoscopic image. Most images are 768 pixels wide × 512 high. Besides the images, each case comprises clinical data, histopathological results, diagnosis, and level of difficulty. The latter measures how difficult (low, medium or high) the case is considered to diagnose by a trained human. The diagnoses include, besides melanoma (several subtypes), basal cell carcinoma, blue nevus, Clark’s nevus, combined nevus, congenital nevus, dermal nevus, dermatofibroma, lentigo, melanosis, recurrent nevus, Reed nevus, seborrheic keratosis, and vascular lesion. There is also a small number of cases classified simply as ‘miscellaneous’.
III-A2 Other datasets
The following datasets were used only to train the original models for transfer learning.
Diabetic Retinopathy: the other specific-domain dataset used in our experiments is the training set of the Kaggle Challenge for Diabetic Retinopathy Detection (https://www.kaggle.com/c/diabetic-retinopathy-detection/data). This dataset is composed of more than 35,000 high-resolution retina images taken under a variety of imaging conditions. More information can be found at the challenge website;
ImageNet: as the general-context dataset, we employed the ILSVRC-2012 challenge dataset, containing about 1.2M training images of 1,000 object categories from ImageNet. As mentioned before, we did not train on this dataset from scratch, but used the MatConvNet VGG-M pretrained model, just converting the .mat file (containing the complete description of the network and all the layer weights pre-trained for the ImageNet task) to our framework.
III-B Experimental Designs
We investigated three protocols, trying to identify if (and how) label variations can impact the method. The protocols are:
1) Malignant vs. Benign lesions: melanomas and basal cell carcinomas were considered positive cases and all other diagnoses were negative cases;
2) Melanoma vs. Benign lesions: melanomas were positive cases while all other diagnoses were negative ones, with basal cell carcinomas removed;
3) Basal cell carcinoma vs. Melanoma vs. Benign lesions: here we have three classes, one for melanomas, another for basal cell carcinomas, and a third grouping all other diagnoses under a single label.
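The three protocols differ only in how a diagnosis is mapped to a class label, which the sketch below makes explicit. It assumes the melanoma subtypes have already been collapsed into the single string "melanoma"; the function name `label_for` and the label strings are hypothetical, chosen for illustration.

```python
from typing import Optional

MALIGNANT = {"melanoma", "basal cell carcinoma"}


def label_for(diagnosis: str, protocol: int) -> Optional[str]:
    """Map a diagnosis to its class under protocol 1, 2 or 3.

    Returns None when the case is removed from the protocol.
    """
    d = diagnosis.strip().lower()
    if protocol == 1:  # Malignant vs. Benign
        return "positive" if d in MALIGNANT else "negative"
    if protocol == 2:  # Melanoma vs. Benign, basal cell carcinomas removed
        if d == "basal cell carcinoma":
            return None
        return "positive" if d == "melanoma" else "negative"
    if protocol == 3:  # three classes
        return d if d in MALIGNANT else "benign"
    raise ValueError(f"unknown protocol: {protocol}")


examples = ["Melanoma", "Basal cell carcinoma", "Blue nevus"]
labels_p1 = [label_for(d, 1) for d in examples]
labels_p2 = [label_for(d, 2) for d in examples]
labels_p3 = [label_for(d, 3) for d in examples]
```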
For each protocol we employed 5×2-fold cross-validation, that is, the data was ‘randomly’ split into two groups, A and B. We trained on group A and tested on group B. Then we reversed the roles: we trained on B and tested on A. We performed 5 such semi-random splits, making an effort to balance each group according to the diagnosis of the case (that is, close to 50% of the cases of each diagnosis in each of groups A and B).
We only used dermoscopic images, removing the ones with acral lesions. We used images of all difficulties: low, medium and high. We did not remove images with hair, dots, rulers and other artifacts that are not part of the lesions. Some images contained a black “frame” around the picture, which we removed by cropping by hand.
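A 5×2-fold split balanced by diagnosis, as described above, can be sketched as follows. This is one plausible way to build such splits, not the authors' fold-assembly code: each list entry stands for one case, and the helper names and the toy case counts are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, test indices)


def stratified_halves(diagnoses: Sequence[str], rng: random.Random) -> Split:
    """Split case indices into groups A and B, about 50% of each diagnosis in each."""
    by_diag: Dict[str, List[int]] = defaultdict(list)
    for idx, diag in enumerate(diagnoses):
        by_diag[diag].append(idx)
    group_a: List[int] = []
    group_b: List[int] = []
    for diag in sorted(by_diag):
        members = by_diag[diag]
        rng.shuffle(members)
        half = len(members) // 2
        group_a.extend(members[:half])
        group_b.extend(members[half:2 * half])
        if len(members) % 2:  # odd count: give the extra case to the smaller group
            smaller = group_a if len(group_a) <= len(group_b) else group_b
            smaller.append(members[-1])
    return group_a, group_b


def five_by_two(diagnoses: Sequence[str], seed: int = 0, repeats: int = 5) -> List[Split]:
    """Return 2 * repeats (train, test) pairs: A->B and B->A for each split."""
    rng = random.Random(seed)
    folds: List[Split] = []
    for _ in range(repeats):
        a, b = stratified_halves(diagnoses, rng)
        folds.append((a, b))  # train on A, test on B
        folds.append((b, a))  # train on B, test on A
    return folds


# Toy data: 10 melanoma cases followed by 31 nevus cases (made-up counts).
cases = ["melanoma"] * 10 + ["Clark's nevus"] * 31
folds = five_by_two(cases, seed=1)
```

Splitting by case rather than by image keeps all images of one lesion in the same group, which avoids leaking a test lesion into training.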
We used the mean Average Precision (mAP) for protocols (1) and (2). For protocol (3) (with three classes) we employed the macro mean Average Precision. The reference implementations adopted for both metrics were those of PASCAL VOC 2007 (PASCAL VOC 2007 Development Kit: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/#devkit).
We performed six experiments with each protocol, always using the VGG-M DNN architecture. The feature vectors were extracted from the 19th layer and the classification was done using an SVM. The experiments are:
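For reference, the sketch below computes average precision with the 11-point interpolation used by the PASCAL VOC 2007 development kit, plus a macro average over classes. It is a plain-Python illustration, not the devkit code itself, and it assumes that "macro" means the unweighted mean of the per-class (one-vs-rest) average precisions.

```python
from typing import Dict, List, Sequence


def average_precision_11pt(scores: Sequence[float], labels: Sequence[int]) -> float:
    """11-point interpolated AP; `labels` are 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank examples by decreasing classifier score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    true_pos = 0
    for rank, i in enumerate(order, start=1):
        true_pos += labels[i]
        precisions.append(true_pos / rank)
        recalls.append(true_pos / n_pos)
    # Average the best precision reachable at recall >= 0.0, 0.1, ..., 1.0.
    total = 0.0
    for k in range(11):
        threshold = k / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= threshold]
        total += max(reachable) if reachable else 0.0
    return total / 11


def macro_average_precision(scores_per_class: Dict[str, Sequence[float]],
                            true_labels: Sequence[str]) -> float:
    """Unweighted mean of one-vs-rest APs (assumed meaning of 'macro')."""
    aps = []
    for cls, scores in scores_per_class.items():
        binary = [1 if y == cls else 0 for y in true_labels]
        aps.append(average_precision_11pt(scores, binary))
    return sum(aps) / len(aps)


ap_mixed = average_precision_11pt([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_perfect = average_precision_11pt([0.9, 0.8, 0.1], [1, 1, 0])
```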
1) Training and testing a network from scratch with melanoma images;
2) Training a network from scratch with diabetic retinopathy images, transfer learning for melanoma without fine tuning;
3) Training a network from scratch with diabetic retinopathy images, transfer learning for melanoma with fine tuning;
4) Loading a network pretrained on ImageNet images, transfer learning for melanoma without fine tuning;
5) Loading a network pretrained on ImageNet images, transfer learning for melanoma with fine tuning;
6) Loading a network pretrained on ImageNet images, transfer learning for diabetic retinopathy with fine tuning, then transferring again for melanoma images with fine tuning.
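The six experiments can be summarized as a small configuration table: which datasets the network saw before melanoma, in order, and whether its weights are updated on melanoma images. The `Experiment` class and its field names are hypothetical, introduced only to make the design explicit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Experiment:
    name: str
    sources: Tuple[str, ...]       # datasets seen before melanoma, in order
    train_on_melanoma: bool        # True for from-scratch training and fine tuning


EXPERIMENTS: List[Experiment] = [
    Experiment("from scratch", (), True),
    Experiment("retina, no fine tuning", ("diabetic retinopathy",), False),
    Experiment("retina, with fine tuning", ("diabetic retinopathy",), True),
    Experiment("ImageNet, no fine tuning", ("ImageNet",), False),
    Experiment("ImageNet, with fine tuning", ("ImageNet",), True),
    Experiment("combined", ("ImageNet", "diabetic retinopathy"), True),
]

# In every experiment the final decision is made by an SVM on Layer 19 features.
transfer_experiments = [e for e in EXPERIMENTS if e.sources]
```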
A reference implementation of our approaches is available in the repository linked from our website (https://sites.google.com/site/robustmelanomascreening).
We show our results in Table I. The main findings are:
As expected, training a DNN from scratch is not essentially better than performing transfer learning with fine-tuning. Moreover, training a DNN from scratch is time consuming;
Performing fine-tuning improves the classification results. This holds for transfer learning both from the specific-domain dataset and from the general-context dataset. This result was already expected, according to the literature;
Surprisingly, transfer learning between less related tasks (from ImageNet to melanoma) performed better than transfer between tasks in the same domain (medical images, from retina to melanoma). This result is consistent across all protocols, regardless of whether fine-tuning is employed;
Even more surprisingly, the combined transfer performed worse than the simple transfer from ImageNet. Moreover, the combined transfer had a result similar to that of the transfer from the retina dataset, as if the whole knowledge learned from ImageNet was “erased” by the specific-domain dataset in its fine-tuning step;
Melanoma and basal cell carcinoma are two types of skin cancer. When both types are grouped under the positive class (first line), the classification improves. This may occur because the imbalance between positive and negative classes is smaller;
Removing basal cell carcinomas from the experiments diminishes the classification results (second line, compared with the other ones), maybe because the models have fewer images from which to learn the differences between the classes.
Table I: Experiments (mAP, %). FT: fine tuning.
| Protocol | Baseline | Transfer from retina, no FT | Transfer from retina, with FT | Transfer from ImageNet, no FT | Transfer from ImageNet, with FT | Combined transfer |
|---|---|---|---|---|---|---|
| Malignant vs. Benign | 55.4 ± 2.5 | 49.3 ± 1.2 | 57.1 ± 4.0 | 69.8 ± 2.4 | 73.0 ± 2.9 | 57.4 ± 1.1 |
| Melanoma vs. Benign | 53.3 ± 4.3 | 47.5 ± 1.9 | 53.1 ± 2.7 | 60.5 ± 3.1 | - | 54.4 ± 3.9 |
| Basal cell carcinoma vs. Melanoma vs. Benign lesions | 53.1 ± 1.6 | 48.9 ± 1.6 | 51.6 ± 2.6 | 61.6 ± 2.9 | - | 52.9 ± 1.3 |
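To make the comparisons in the findings concrete, the short sketch below computes differences in mAP for the Malignant vs. Benign protocol, using the values reported in the table. It ignores the reported deviations, so the differences are indicative only; the dictionary keys and the `gain` helper are hypothetical names.

```python
# mAP (%) for the Malignant vs. Benign protocol, copied from the results table.
MAP = {
    "baseline": 55.4,
    "retina_no_ft": 49.3,
    "retina_ft": 57.1,
    "imagenet_no_ft": 69.8,
    "imagenet_ft": 73.0,
    "combined": 57.4,
}


def gain(after: str, before: str) -> float:
    """Difference in mAP percentage points, rounded to one decimal."""
    return round(MAP[after] - MAP[before], 1)


ft_gain_retina = gain("retina_ft", "retina_no_ft")          # effect of fine tuning
ft_gain_imagenet = gain("imagenet_ft", "imagenet_no_ft")    # effect of fine tuning
source_gap_no_ft = gain("imagenet_no_ft", "retina_no_ft")   # effect of source model
combined_vs_imagenet_ft = gain("combined", "imagenet_ft")   # effect of the extra step
```

For this protocol, the choice of source model changes mAP by more points than fine tuning does, which matches the findings listed above.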
As mentioned before, although fine-tuning occurs in a fully neural pipeline, we chose to report the final results using an SVM classifier in order to enable fair comparisons with past experiments: the results of transfer learning from ImageNet without fine tuning are the most comparable to the ones reported in our previous publication (allowing for small differences in experimental design). Since the results are very similar, with both codes employing VGG-M DNN + SVM classifier, we infer that the code used in this paper is correct.
Some possible explanations for transfer from the retina dataset being worse than transfer from ImageNet are that (a) the ImageNet model was much more optimized than the retina model, (b) the ImageNet dataset is much bigger than the retina dataset, and (c) the retina model was created with an unbalanced training set.
Maybe the combined transfer did not perform as well as expected because diabetic retinopathy may be easier to diagnose than melanoma. So in the first transfer step the network did not learn to be as specialized as needed to “see” details/differences on skin lesion images. This can also justify why transfer from retina is worse than transfer from ImageNet.
In this report we investigate how different transfer learning schemes influence automated melanoma classification results. We evaluated transfer learning from a general-context dataset (ImageNet) and from a specific-domain dataset (diabetic retinopathy). We also investigated if sequential transfer steps improve the final classification result.
We show results consistent with the literature regarding training a DNN from scratch and the differences between doing fine-tuning or not. We conclude that general findings of the deep learning state of the art also apply to automated melanoma screening, thus guiding future research.
Although we expected that transfer learning from related tasks (in our case, from and to medical domain datasets) could lead to better results, this was not observed. Some conditions that may have influenced the results are the dataset sizes, the parametrization used for training the models, and the quality of the datasets (in terms of annotations, standardization and image acquisition processes). In this case, further investigation is needed.
We also conclude that the experimental design is sensitive to the image annotation, that is, small changes in the fold assembling can cause huge impacts in the final results. This finding is particularly important and will be discussed in future experiments.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. We are also grateful to Prof. Dr. M. Emre Celebi for kindly providing the machine-readable metadata of The Interactive Atlas of Dermoscopy. A. Menegola is funded by CNPq; S. Avila is funded by PNPD/CAPES; R. Pires is funded by CAPES; M. Fornaciali, R. Pires and E. Valle are partially funded by Google Research Awards for Latin America 2016.
- Fornaciali et al.  M. Fornaciali, M. Carvalho, F. V. Bittencourt, S. Avila, and E. Valle, “Towards automated melanoma screening: Proper computer vision & reliable results,” arXiv preprint arXiv:1604.04024, 2016.
- Fornaciali et al.  M. Fornaciali, S. Avila, M. Carvalho, and E. Valle, “Statistical learning approach for robust melanoma screening,” in Conference on Graphics, Patterns and Images (SIBGRAPI), 2014, pp. 319–326.
- Yosinski et al.  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
- Chatfield et al.  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference, 2014.
- Argenziano et al.  G. Argenziano, H. P. Soyer, V. De Giorgi, D. Piccolo, P. Carli, M. Delfino et al., “Dermoscopy: a tutorial,” EDRA, Medical Publishing & New Media, 2002.
- Deng et al.  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
- Tajbakhsh et al.  N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1299–1312, 2016.