Skin Lesions Classification Using Convolutional Neural Networks in Clinical Images

12/06/2018 ∙ by Danilo Barros Mendes, et al. ∙ University of Brasilia

Skin lesions are conditions that appear on a patient for many different reasons. One of these is an abnormal growth in skin tissue, defined as cancer. This disease affects more than 14.1 million patients and has been the cause of more than 8.2 million deaths worldwide. Therefore, this work proposes the construction of a classification model for 12 lesions, including Malignant Melanoma and Basal Cell Carcinoma. A ResNet-152 architecture is used, trained on 3,797 images that were later augmented by a factor of 29 using positional, scale, and lighting transformations. Finally, the network was tested with 956 images and achieved an area under the curve (AUC) of 0.96 for Melanoma and 0.91 for Basal Cell Carcinoma.




I Introduction

Today, skin cancer is a public health and economic issue that for many years has been approached with the same methodology by the dermatology field [1]. This is troubling given that, over the last 30 years, the number of cases diagnosed with skin cancer has increased significantly [2]. It is even more troubling when cost enters the equation, since millions of dollars are being spent in the public sector [3]. A major part of this is spent on the individual analysis of the patient, in which the doctor examines the lesion and acts on the evidence seen. If any of these steps were optimized, it could mean a decrease in expenditure for the whole dermatology sector.

Dermatology is one of the most important fields of medicine, with cases of skin disease outnumbering those of hypertension, obesity and cancer combined. This is because skin diseases are among the most common human illnesses, affecting every age and gender, pervading many cultures, and reaching between 30% and 70% of people in the United States. This means that at any given time at least 1 person in 3 will have a skin disease [4]. Skin diseases are therefore an issue on a global scale, ranking 18th in health burden worldwide [5].

Furthermore, medical imaging can be a resource of high value, as dermatology has an extensive list of illnesses to treat. In addition, the field has developed its own vocabulary to describe these lesions. However, verbal descriptions have their limitations; a good picture can successfully replace many sentences of description and is not susceptible to the bias of the message carrier.

In addition, the recommended way to detect skin diseases early is to be aware of new or changing skin growths [6]. Analysis with the naked eye is still the first resource used by specialists, along with techniques such as ABCDE, which consists of scanning the skin area of interest for asymmetry, border irregularity, non-uniform colors, large diameters and patches of skin that evolve over time [7]. The analysis of medical images is thus analogous to analysis with the naked eye, and the same techniques and implications can be applied. This supports the idea that skin cancer is often detectable through the naked eye and medical photography.

Worldwide, the most common type of cancer is skin cancer, with melanoma, basal cell carcinoma (BCC) and squamous cell carcinoma (SCC) being its most frequent types [8]. The disease is most frequent in countries where the population is predominantly white-skinned, or in countries like Australia and New Zealand [9].

In Brazil, it is estimated that for the biennium of 2018-2019, there will be

new cases of non-melanoma skin cancer (mostly BCC and SCC) [10]. Moreover, the incidence of these types of skin cancer has risen for many years. This increase may be due to a combination of factors, such as greater longevity of the population, more people being exposed to the sun, and better cancer detection [8].

In the United States, the numbers add up to deaths estimated for 2017 [6]. Skin cancer accounts for more than cases (not including carcinoma in situ, nor non-melanoma cancers) in the US alone in the year of 2017 [6].

Despite skin cancer being the most common type of cancer in society, it does not represent a great death rate in its first stages, since the patient has a survival rate of . However, if the patients are diagnosed in the later stages the 5-year survival rate decreases to .

In Brazil, were expected to occur new cases of non-melanoma skin cancer in 2010. From that, it was expected that were diagnosed in early stages. However, even with early diagnosis this amount of cases means around R$ million (Reais) to the health public system and R$ million to the private system per year [3].

Moreover, we can divide skin lesions into two major groups: malignant lesions and benign lesions. The former is composed mostly of skin cancers and the latter of any lesion that does not pose a major threat. One exception to this division is actinic keratosis, which has the potential to develop into SCC. Thus, actinic keratosis is classified as a precancerous lesion [11]. This work analyzed and chose 12 lesions in total, 4 malignant and 8 benign (1 of them precancerous), as seen in Figure 1. The lesions were chosen mainly based on the public data available online, described in subsection III-A.

Fig. 1: Skin lesions groups.
Source: Authors.

Given the problems involved in diagnosing skin lesions, this work aims to create a learning model that classifies skin lesions into one of 12 conditions of interest. To this end, the classifier aims to correctly distinguish lesions by analyzing clinical images of the condition. This can prove to be a useful tool to aid patients and doctors in daily practice.

Furthermore, this work is intended as a stepping stone for newer approaches that democratize and distribute access to health care. A good lesion classification model may be the engine that accelerates the construction of tools that put early diagnosis and alerts in patients' hands, even for isolated patients whom few doctors can reach. These tools may save many lives and reduce the costs of treating late-stage diseases.

The related work in this field shows that many algorithms are capable of tackling this problem, but there is a striking difference between shallow and deep methods in machine learning. With that in view, this work focuses on using deep neural networks to achieve its main objective. For this, it is necessary to gather the good practices and techniques used to approach the classification of clinical images.

II Related Work

Many algorithms and tools have been created to aid medical professionals in detecting diseases in many fields [12, 13, 14, 15, 16]. This has been shown to add reliability and confidence to doctors' practice, as they have more information with which to diagnose patients. Dermatology and skin lesion detection are no different. Many approaches have been tried over the years: applications with shallow algorithms such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) [18] have achieved good results, but building applications around such approaches is laborious.

Seeing this, some researchers have been applying deep learning to classify skin lesions with success. One common issue in this domain is the low quality and scarcity of open data. It is common to see works with only a couple of hundred examples. This is characteristic of the medical field: many hospitals and clinics hold huge amounts of data and do not make it public, mainly because of patient privacy. Nevertheless, many authors still work to push the technology forward in such fields, overcoming these barriers. For the purposes of this work, we list some related research that uses deep learning in dermatology, applying neural networks to skin lesions.

Matsunaga et al. (2017) proposed an approach to classify melanoma, seborrheic keratosis, and nevocellular nevus using dermoscopic images. They proposed an ensemble solution with two binary classifiers that also leveraged the age and sex of the patients when available. Furthermore, they used data augmentation with a combination of 4 transformations (rotation, translation, scaling and flipping). For the architecture, they chose the ResNet-50 implementation in the Keras framework, with their own modifications. The model was initialized with weights pre-trained for generic object recognition and trained with two optimizers, AdaGrad and RMSProp. This work was submitted to the ISBI Challenge 2017 and won first place, ahead of 22 other competitors.

Nasr-Esfahani et al. (2016) showed a technique that uses image processing as a step before training. This results in normalization and noise reduction of the dataset, since non-dermoscopic images are prone to non-homogeneous lighting and thus noise. Moreover, the work uses a pre-processing step with the k-means algorithm to identify the borders of a lesion and extract a binary mask of the region in which the lesion is present. This is done to minimize the interference of healthy skin in the classification. Nasr-Esfahani et al. (2016) also used data augmentation to enlarge the dataset, applying three transformations (cropping, scaling and rotation) and multiplying the dataset by a factor of 36. Finally, a pre-trained convolutional neural network (CNN) was used to classify between melanoma and melanocytic nevus, trained with a batch size of 64.

Menegola et al. (2017) presented a thorough study for the 2017 ISIC Challenge in skin-lesion classification. The work presents experiments with deep-learning models pre-trained on ImageNet for a three-class model classifying melanoma, seborrheic keratosis, and other lesions. Models such as ResNet-101 and Inception-v4 were extensively tested with several configurations of the dataset, using 6 data sources to compose the final dataset. The authors also report the use of data augmentation with at least 3 different transformations (cropping, flipping, and zooming). The points reported as critical to the success of the project were the volume of data gathered, the normalization of the input images, and the use of meta-learning. The latter is described as an SVM layer on the final output of the deep-learning models that maps the outputs to the three classes proposed in the challenge. Finally, this work won first place in the 2017 ISIC Challenge for skin lesion classification.

Kwasigroch et al. (2017) present a solution similar to the previous three, owing to the inherent problem of this domain, data scarcity. Transfer learning is applied using two different models, VGG-19 and ResNet-50, both pre-trained on ImageNet. These were used to classify between malignant and benign lesions using dermoscopic images. To support the learning process, the underrepresented class was up-sampled. This was done using a random number of transformations chosen among rotation, shifting, zooming, and flipping. The paper presents 3 experiments: first, the VGG-19 architecture with the addition of two extra convolutional layers, two fully connected layers, and one neuron with a sigmoid function; second, the ResNet-50 model; and finally an implementation of VGG-19 with an SVM classifier as the fully connected layer. The modified implementation of VGG-19 had the best results. The main reason given for the poor results of the ResNet-50 model was the small amount of training data. Perhaps, with larger amounts of data, it would be possible to train a small model and produce better results.

Esteva et al. (2017) presented a major breakthrough in the classification of skin lesions. This research compared the results of the learning model with those of 21 board-certified dermatologists and showed the model to be more accurate in this task. The model classifies clinical images, indicating whether a lesion is benign or malignant. The images used covered many different diseases and included dermoscopic images. A data-augmentation approach was used to mitigate problems such as variability in zoom, angle, and lighting present in clinical images. The augmentation factor was 720, using rotation, cropping, and flipping. An Inception-v3 pre-trained model was used as the main classifier, fine-tuning every layer and training the final fully connected layer. Training ran for more than 30 epochs, with the learning rate decayed by a factor of 16 every 30 epochs. The model was trained to classify between 757 fine-grained classes, and the predicted probabilities were then fed into an algorithm that selected between the two classes (malignant or benign). Using this approach, this work achieved a new state-of-the-art result.

Seog Han et al. (2018) proposed to classify skin lesions as individual classes rather than composing meta-classes such as benign and malignant. They used a ResNet-152 model pre-trained on ImageNet to classify 12 lesions. For training, however, 248 additional classes were used, added to decrease false positives and improve the analysis of the middle layers of the model. This was done in such a way that the training samples for the 248 additional diseases did not outgrow the main 12; thus, at inference, the model predicted one of the 12 illnesses even when the lesion did not belong to any of them. The training images were augmented approximately 20 to 40 times using zooming and rotation. These images were gathered from two Korean hospitals, two publicly available biopsy-proven datasets, and one dataset constructed from 8 dermatologic atlas websites. Training lasted for 2 epochs using a batch size of 6 and a learning rate without decay; this early stopping was done to avoid overfitting on the dataset. Finally, it was reported that ethnic differences were responsible for poor results on different datasets; it was thus necessary to gather data from different ethnicities and ages to correctly shape the solution to reflect the real-world problem of skin lesion classification.

Finally, we can observe that all of these works have one aspect in common: data scarcity. This is characteristic of the medical domain, where very few annotated examples are publicly available. The works that proved to have more impact had to collect data from other sources, mainly private hospitals or clinics. Even so, data collection did not fully mitigate the problem; it was still necessary to use techniques such as transfer learning [24, 25] and data augmentation [26, 27, 28].

III Methods and Materials

III-A Datasets

Due to the scarcity of data in the medical field, the datasets were not selected as the best among a collection of options. The choice mainly took into account the criterion of public availability. Aside from that, the only prerequisite was that the dataset be composed only of clinical images (photos taken with cameras, without other tools or distorting lenses).

Only two datasets met these criteria. Together they contained 10 (ten) distinct lesions, including at most 4 malignant illnesses. An additional dataset had been gathered from dermatologic websites using a page-scraping script. This dataset was acquired from the work of Seog Han et al. (2018) and is not publicly available due to copyrights owned by the websites. These datasets are discussed below.


The first dataset, MED-NODE, is provided by the Department of Dermatology at the University Medical Center Groningen (UMCG) [29]. It contains 170 images, divided between 70 melanoma and 100 nevus cases. These images were processed with an algorithm for hair removal.


The second dataset is provided by the Edinburgh Dermofit Image Library [17] and is available for purchase under a license agreement. This dataset is the most complete one found on the web. It contains 1,300 images, divided into 10 lesions, including melanoma, BCC, and SCC. All images are diagnosed based on expert opinion. In addition, a binary segmentation of the lesion is provided for each image. Note that the images are not all the same size.

The lesions and their respective numbers are listed in Table I.

Lesion Type                  Number of images
Actinic Keratosis            45
Basal Cell Carcinoma         239
Melanocytic Nevus (mole)     331
Seborrhoeic Keratosis        257
Squamous Cell Carcinoma      88
Intraepithelial Carcinoma    78
Pyogenic Granuloma           24
Haemangioma                  97
Dermatofibroma               65
Malignant Melanoma           76
TOTAL                        1,300
TABLE I: Lesion sampling for Edinburgh dataset.


The last dataset was acquired by running several scripts to scrape different dermatological websites, which is why it was named Atlas. It was obtained from Seog Han et al. (2018) through a personal request. It contains 3,816 images downloaded from websites and distributed among six lesions.

The difference from the Edinburgh dataset is that it contains two lesions not present in the former, Wart and Lentigo (both benign), as can be seen in Table II. Together, the Edinburgh, Atlas and MED-NODE datasets sum up to the 12 lesions of interest in this work.

Lesion Type                  Number of images
Basal Cell Carcinoma         1,561
Lentigo                      69
Malignant Melanoma           228
Melanocytic Nevus (mole)     626
Seborrheic Keratosis         897
Wart                         435
TOTAL                        3,816
TABLE II: Lesion sampling for Atlas dataset.

One difference between Atlas and the first two datasets is the quality of the images: since the dataset was collected from web pages, not all images have the same quality or the common viewpoints observed in the Edinburgh dataset. This dataset is therefore the most heterogeneous in terms of image quality, viewpoint, patient age and ethnicity. The dataset as a whole is not officially diagnosed by specialists; on the other hand, the photos were displayed on websites that are considered reliable and are used by students. It is thus assumed that the images were reviewed before being displayed on these websites and can be trusted.

III-B Handling data scarcity

As noted previously, a large amount of data is needed for the weights and biases of a network to generalize correctly. However, the medical field lacks this amount of images, and if only the publicly provided data is used, a deep neural network cannot be trained to generalize well on the problem.

III-B1 Transfer Learning

In practice, the domains faced in industry, as opposed to academia, usually have small amounts of labeled data. This poses a major obstacle to training a deep convolutional neural network from scratch, since the data may not be a true representation of the real world. It is thus common to see works that use the pre-trained weights of a previously trained architecture, which leads to 2 major approaches.

The approaches are using a CNN as a fixed feature extractor or fine-tuning the trained model. The first is mostly used to collect features of images and then use them to train a linear classifier on a new dataset. The second is to continue training the network, completely replacing the final layer and updating the parameters through backpropagation.

A common use of transfer learning in computer vision, more specifically in object classification, is to use models pre-trained on the ImageNet dataset. Recent work by Kornblith et al. (2018) shows that ResNets take the lead in performance when treated as feature extractors. By only fine-tuning some models on other datasets, they achieved a new state of the art. All these tests used pre-trained weights, fine-tuned with Nesterov momentum. Finally, it was shown empirically that the Inception-v4 architecture achieves overall better results for this task than the other 12 pre-trained classification models.

Transfer learning therefore cuts most of the training time of new applications. However, it can add some constraints to the work. For example, when using a pre-trained network it is not possible to arbitrarily remove or change the layers of the network. In addition, small learning rates are normally applied to CNN weights that are being fine-tuned, because the weights are already expected to be good and should not be distorted too much [25].

III-B2 Data augmentation

Data augmentation is a technique used when we do not have an unlimited amount of data to train our models. It works by introducing random transformations to the data. In image classification, this can mean rotating, flipping and cropping the image. These perturbations add variability to the input, which can reduce overfitting by teaching the model about invariances in the data domain [28, 31, 32]. Because these transformations do not change the meaning of the input, the label originally attributed to it remains valid.

Although some image transformations are agnostic to the field of application (e.g. translation), others depend on domain-specific characteristics. For this work we used an additional transformation that randomizes the natural light effect in the picture, to mimic the variation seen in indoor clinics due to different light sources. Furthermore, to increase the variability added by augmentation, each transformation is given a probability of application and a variable magnitude.

Given the need for augmentation, the Augmentor Python library [33] was used to implement the process of augmenting the dataset. The library has predefined transformations and an extension point for implementing new ones. This was quite useful when implementing the method that adds light variance to the augmentations.

Each transformation was chosen based on general guidelines for data augmentation [31, 32] or on the nature of the data. The transformations were arranged in a pipeline, where each had a probability that defined its likelihood of being applied to the image, and at the end the new image was saved to the destination. The operations used for this work are listed in Table III.

Transformation        Probability
Rotation              0.5
Random zoom           0.4
Flip horizontally     0.7
Flip vertically       0.5
Random distortion     0.8
Lighting variance     0.5
TABLE III: Transformations applied for data augmentation.
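The pipeline idea can be illustrated with a minimal sketch in plain Python. This is not the Augmentor API and not the authors' implementation: the helper names (`flip_horizontally`, `lighting_variance`, `augment_dataset`), the toy image representation (a list of pixel rows) and the `max_shift` value are assumptions made for illustration. Only the per-operation probabilities are taken from Table III, and only three of the six operations are mimicked.

```python
import random

# Hypothetical stand-ins for image operations; an "image" is a list of pixel rows.
def flip_horizontally(img):
    return [row[::-1] for row in img]

def flip_vertically(img):
    return img[::-1]

def lighting_variance(img, rng, max_shift=0.2):
    # Scale brightness by a random factor in [1 - max_shift, 1 + max_shift],
    # clipping the result to the 0..255 range. max_shift is an assumed value.
    factor = 1.0 + rng.uniform(-max_shift, max_shift)
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def build_pipeline(rng):
    # (operation, probability) pairs; probabilities follow Table III.
    return [
        (flip_horizontally, 0.7),
        (flip_vertically, 0.5),
        (lambda img: lighting_variance(img, rng), 0.5),
    ]

def augment(img, pipeline, rng):
    # Apply each operation in order, each with its own probability.
    for op, prob in pipeline:
        if rng.random() < prob:
            img = op(img)
    return img

def augment_dataset(images, factor, seed=0):
    # Produce `factor` augmented copies of every input image.
    rng = random.Random(seed)
    pipeline = build_pipeline(rng)
    return [augment(img, pipeline, rng) for img in images for _ in range(factor)]
```

With `factor=29`, each original image yields 29 augmented copies, so the originals plus the copies amount to 30 times the original count.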

III-C Datasets Preparation

The first step, before applying transformations to the dataset, was to separate a test set, usually 10% to 20% of each lesion, depending on the experiment. Then, if the experiment required it, the data augmentation process was carried out. The remaining sample was analyzed to see how much each class needed to be augmented. The process usually augmented the remaining dataset by a factor of 29, which, added to the original images, amounted to 30 times the original amount.

After the training and test images were prepared, the training images were processed to create an LMDB file [34] for fast data access at training time. In this process the training dataset is divided into a training set and a validation set, with 80% of the data used for training and 20% for validation. The split is stratified, so that each part has a fair share of each class.
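A stratified 80/20 split of this kind can be sketched as follows. This is an illustrative sketch in plain Python, not the script used in this work; the function name `stratified_split` and the fixed seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, val_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Cut each class at the same fraction so class proportions are preserved.
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return train, val
```

Because the cut is made per class, a rare lesion keeps roughly the same share in both the training and the validation parts.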

Finally, these slices of the dataset are kept separate and are used as such for the experiment.

III-D Architecture

The architecture used for this work was ResNet-152, with weights pre-trained on the ImageNet database. This architecture was chosen mainly for the results that the family (ResNet-50, ResNet-101, and ResNet-152) had achieved in related works.

III-E Metrics

The metrics used in the experiments were consistent throughout this work, in order to allow comparison of results between different experiments. Two metrics were used at training time and three for the testing step.

Training Time

At training time, the main metric was accuracy. Since the model classifies 12 classes, the accuracy is reported in two variants: top-1 accuracy and top-5 accuracy (or accuracy@5).
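As an illustration of the two variants, the sketch below computes top-k accuracy from per-class scores in plain Python. The function name and the input layout (one list of class scores per example) are assumptions made for this example, not the Caffe accuracy layer used in training.

```python
def top_k_accuracy(scores, true_labels, k):
    """scores: one list of per-class scores per example; true_labels: class indices."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Indices of the k highest-scoring classes for this example.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if truth in ranked:
            hits += 1
    return hits / len(true_labels)
```

Top-1 accuracy counts a prediction as correct only when the highest-scoring class is the true one, whereas top-5 accuracy accepts the true class anywhere among the five highest scores.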

Testing Step

For the testing step, a process was created to generate the predictions for both the validation and the test datasets. With these predictions and the true labels of the examples, it was possible to build a confusion matrix for the model. From the confusion matrix it was then simple to compute other metrics, such as precision, recall (or sensitivity), and accuracy.
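The derivation of these metrics from a confusion matrix can be sketched as below. The function names are illustrative, and the convention that rows are true classes and columns are predicted classes is an assumption of this example.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    # matrix[t][p] counts examples of true class t predicted as class p.
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def per_class_metrics(matrix, cls):
    """Return (precision, recall) for one class, treating it as the positive class."""
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                  # rest of the row: missed cases
    fp = sum(row[cls] for row in matrix) - tp   # rest of the column: false alarms
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def overall_accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total
```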

Another metric used to evaluate the models was the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve. The ROC curve plots the sensitivity (probability of detection) against 1 − specificity (probability of false alarm) at various threshold points. Typically, this metric is used to analyze how accurately a system diagnoses a patient's state (diseased or healthy) [35]. The AUC summarizes the ROC curve and effectively combines the specificity and the sensitivity that describe the validity of the diagnosis [36].
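For a single lesion treated as a binary problem (that lesion versus all others), the ROC curve and its AUC can be computed as in the sketch below. This is a plain-Python illustration, assuming that labels are 1 for the positive class and 0 otherwise and that higher scores mean "more likely positive"; the function names are not from the original work.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists; labels are 1 (diseased) or 0 (healthy)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep the threshold from high to low; tied scores are added as one step.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            fpr.append(fp / negatives)   # 1 - specificity
            tpr.append(tp / positives)   # sensitivity
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal rule over consecutive ROC points.
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

An AUC of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking.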

Alongside the ROC curve analysis, it is common to calculate the optimal cut-off point. This is used to separate the test results so that a diagnosis of diseased or not diseased is provided. The best possible result is achieved when this point is closest to where the sensitivity equals one and 1 − specificity equals zero [37, 38].
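One common way to pick such a cut-off is to choose the threshold whose ROC point lies closest to the ideal corner, where sensitivity is 1 and the false-alarm rate is 0. The sketch below assumes this closest-to-corner criterion and a "score at or above the threshold means diseased" rule; the source does not state which exact criterion was used, so this is an illustration only.

```python
import math

def optimal_cutoff(scores, labels):
    """Return (threshold, sensitivity, specificity) closest to the ideal corner."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for threshold in sorted(set(scores)):
        # Predict "diseased" when the score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        sensitivity = tp / positives
        specificity = tn / negatives
        # Distance from (1 - specificity, sensitivity) to the ideal point (0, 1).
        distance = math.hypot(1 - specificity, 1 - sensitivity)
        if best is None or distance < best[0]:
            best = (distance, threshold, sensitivity, specificity)
    return best[1:]
```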

III-F Best Experiment

The best results in this work were achieved with the ResNet-152 architecture, trained on an augmented dataset combining the MED-NODE, Edinburgh and Atlas datasets. Each class was augmented by a factor of 29, leaving the classes unbalanced.

Furthermore, the ResNet architecture had to be modified to fit the problem at hand: the last layer was changed from 1,000 classes to 12 classes. The final architecture follows the schema shown in Figure 2.

Fig. 2: ResNet-152 architecture used.
Source: Authors.

Moreover, transfer learning was applied to reach the best results more rapidly. To that end, the hyperparameters of the network had to be carefully tuned.

III-F1 Dataset

The dataset used for the experiment was derived from the previous experiment. The original dataset consisted of a mixture of MED-NODE, Edinburgh and Atlas images; it had not undergone augmentation and was divided into three separate directories for testing, validation, and training.

The training and validation datasets were then augmented by a factor of 29, using the transformations listed in Table III. The testing dataset did not undergo any transformation. The final numbers for the datasets can be seen in Table IV.

Number of images
Lesion Type Train Validation Test
Actinic Keratosis
Basal Cell Carcinoma
Intraepithelial Carcinoma
Malignant Melanoma
Melanocytic Nevus (mole)
Pyogenic Granuloma
Seborrheic Keratosis
Squamous Cell Carcinoma
TABLE IV: Number of images used in the dataset for the final experiment.

III-F2 Training

For the training phase, transfer learning was used. It was thus necessary to first obtain the ResNet-152 weights pre-trained on the ImageNet dataset (last accessed on June 26th, 2018), and then modify the network for the purpose of this work.

The learning rate was chosen to be higher than that used in the related works, for two main reasons. First, one of the early experiments showed that with a low learning rate training reached a plateau at the very start, so the network was unable to learn the features of the skin lesions. Second, increasing the learning rate often helps to reduce underfitting [39].

Additionally, the learning rate of the final dense layer is multiplied by a factor of 10 compared to the other layers of the network. However, unlike the process of freezing the early layers, used in the same research, this work is closer to the approach implemented in [13], which fine-tuned all the layers of the network.

This was done on the premise that, although the ImageNet dataset is very diverse and covers many different objects, it has no classes that approximate the characteristics and problems encountered in this dataset of skin lesions. Furthermore, the weights in the early layers may not be properly trained to extract fine features such as those found in the problem faced in this work. It was therefore necessary to fine-tune the learnable parameters starting from the early layers and to learn the final classifier from scratch.
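The effect of fine-tuning every layer while giving the new final layer a 10 times larger learning rate can be sketched in plain Python. In Caffe this kind of per-layer scaling is usually expressed through the `lr_mult` field of a layer's `param` block; the code below is only a toy illustration with one scalar weight per layer, and the layer names (`conv1`, `res5c`, `fc12`) and the base learning rate are hypothetical.

```python
def layer_learning_rates(layer_names, base_lr, final_layer, final_multiplier=10.0):
    """Every layer stays trainable; only the new final layer gets a larger rate."""
    return {
        name: base_lr * (final_multiplier if name == final_layer else 1.0)
        for name in layer_names
    }

def sgd_step(weights, gradients, rates):
    # Plain SGD update, with one scalar weight per layer for illustration.
    return {name: weights[name] - rates[name] * gradients[name] for name in weights}

# Hypothetical layer names and base learning rate, chosen only for this example.
rates = layer_learning_rates(["conv1", "res5c", "fc12"], base_lr=0.001, final_layer="fc12")
weights = sgd_step(
    {"conv1": 1.0, "res5c": 1.0, "fc12": 1.0},
    {"conv1": 1.0, "res5c": 1.0, "fc12": 1.0},
    rates,
)
```

With identical gradients, the final layer moves ten times further per step than the pre-trained layers, which lets the classifier learned from scratch catch up while the early layers are only adjusted gently.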

III-F3 Hyperparameters

For this work, the Caffe framework [40] was used, since it allowed and simplified the changes needed at the layer level. All the hyperparameters were defined in a separate configuration file called “Solver”, a .prototxt file with the free parameters used in training. These hyperparameters can be seen in Table V.

Hyperparameter Value
TABLE V: Hyperparameters used.

Due to infrastructure limitations it was only possible to set the batch_size to 5. However, the Caffe framework provides a hyperparameter that delays the update of the gradients: iter_size defines how many iterations the gradients are accumulated for before the update. Although using this hyperparameter may affect the batch normalization layers used in the architecture, the final results did not show this effect. Furthermore, the maximum iteration parameter was chosen so that the number of epochs was 10.
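The relationship between batch_size, iter_size and the maximum number of iterations can be sketched as simple arithmetic. The values below for `iter_size` and the number of training images are hypothetical, since the source does not state them here, and the sketch assumes that one solver iteration processes batch_size × iter_size images before a single weight update.

```python
import math

def effective_batch_size(batch_size, iter_size):
    # Gradients are accumulated over iter_size passes before one update.
    return batch_size * iter_size

def iterations_for_epochs(num_train_images, batch_size, iter_size, epochs):
    # Number of weight updates needed to see the training set `epochs` times.
    per_update = effective_batch_size(batch_size, iter_size)
    updates_per_epoch = math.ceil(num_train_images / per_update)
    return updates_per_epoch * epochs

# Hypothetical values: 1,000 training images and iter_size = 4.
max_iter = iterations_for_epochs(num_train_images=1000, batch_size=5, iter_size=4, epochs=10)
```

Under these assumed values the effective batch is 20 images, one epoch takes 50 updates, and 10 epochs correspond to 500 iterations.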

III-F4 Infrastructure

All the experiments were conducted in the same environment, which consisted of Antergos 18.3 (Linux kernel 4.16) running BVLC Caffe [40] with support for an NVIDIA GTX 1070 GPU (CUDA 9.1 and cuDNN 7.1).

IV Results

Finally, the model used in the testing phase was the product of the iteration number . This training phase took an uninterrupted total time of 35 hours (approximately seconds for every 50 iterations).

IV-A Metric Results

From the confusion matrix generated for the predictions on the test dataset, it was found that all 11 lesions, with the exception of Actinic Keratosis, achieved an accuracy higher than 80%, and the model reached a total accuracy of 78%. However, this metric is biased, since the classes are not evenly distributed, and it can therefore be misleading when analyzed on its own.
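
The bias mentioned above can be made concrete with a small sketch. The confusion matrix below is invented purely for illustration and is not the one obtained in this work. Treating the per-lesion accuracy as a one-vs-rest accuracy is also an assumption of this sketch.

```python
from typing import List

Matrix = List[List[int]]  # cm[i][j]: samples of true class i predicted as j


def overall_accuracy(cm: Matrix) -> float:
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def recall(cm: Matrix, k: int) -> float:
    """Fraction of the samples of class k that were predicted as k."""
    return cm[k][k] / sum(cm[k])


def one_vs_rest_accuracy(cm: Matrix, k: int) -> float:
    """Accuracy of the binary problem 'class k' versus 'any other class'."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return (tp + tn) / total


# Illustrative, imbalanced 3-class matrix: class 0 has 100 samples,
# classes 1 and 2 have only 10 samples each.
toy_cm = [
    [90, 5, 5],
    [4, 4, 2],
    [3, 2, 5],
]
acc = overall_accuracy(toy_cm)
minority_recall = recall(toy_cm, 1)
minority_ovr = one_vs_rest_accuracy(toy_cm, 1)
```

In this toy case the overall accuracy is 82.5% and the one-vs-rest accuracy of class 1 is above 89%, although only 40% of the samples of class 1 are recognized. This is the kind of distortion that motivates reporting the AUC per lesion as well.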

Finally, the AUC and cut-off values for each ROC curve were calculated (Figure 3). Table VI shows a comparison between the results of these three works.

Lesion                      Esteva et al.   Seog Han et al.   This work
Actinic Keratosis           -               0.83              0.96
Basal cell carcinoma        -               0.90              0.91
Dermatofibroma              -               0.90              0.90
Hemangioma                  -               0.83              0.99
Intraepithelial carcinoma   -               0.83              0.99
Lentigo                     -               0.95*             0.95
Malignant Melanoma          0.96            0.88              0.96
Melanocytic nevus           -               0.94              0.95
Pyogenic granuloma          -               0.97              0.99
Seborrheic keratosis        -               0.89              0.90
Squamous cell carcinoma     -               0.91              0.95
Wart                        -               0.94*             0.89
TABLE VI: Comparison of AUC metrics.
* Metric calculated with an Asian dataset, and thus may not serve as a comparison in the strict sense.
Fig. 3: AUC and Optimal Cut-off for each lesion. Panels: (a) Actinic Keratosis; (b) Basal Cell Carcinoma; (c) Dermatofibroma; (d) Haemangioma; (e) Intraepithelial Carcinoma; (f) Lentigo; (g) Malignant Melanoma; (h) Melanocytic Nevus; (i) Pyogenic Granuloma; (j) Seborrhoeic Keratosis; (k) Squamous Cell Carcinoma; (l) Wart.
Source: Authors.
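
As an illustration of how an AUC and a cut-off can be obtained for one lesion in a one-vs-rest setting, the sketch below builds the ROC curve from scores, integrates it with the trapezoidal rule and picks the threshold that maximizes Youden's J statistic (TPR minus FPR). The text does not state which criterion was used for the optimal cut-off, so Youden's J is an assumption here, and the labels and scores are invented.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (false positive rate, true positive rate, threshold)


def roc_points(labels: List[int], scores: List[float]) -> List[Point]:
    """ROC points obtained by lowering the threshold over the sorted scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0, float("inf"))]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample that shares this score (handles ties).
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos, threshold))
    return points


def auc(points: List[Point]) -> float:
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def youden_cutoff(points: List[Point]) -> float:
    """Threshold that maximizes TPR - FPR (Youden's J statistic)."""
    best = max(points[1:], key=lambda p: p[1] - p[0])
    return best[2]


# Invented example: 1 marks the lesion of interest, 0 marks any other lesion.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.2, 0.1]
curve = roc_points(labels, scores)
area = auc(curve)
cutoff = youden_cutoff(curve)
```

Samples scoring at or above the returned cut-off would be labeled as the lesion of interest. Other criteria, such as the point closest to the top-left corner of the ROC plot, may select a different threshold.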

IV-B Interpretability

For this work, it was considered important to address the model's interpretability, since the application is sensitive and deals with human life. Moreover, this work aims to raise awareness of the importance of interpretability techniques for machine learning in medical applications. In addition, a model that offers explanations may be better received by medical practitioners, since it demystifies the decisions taken by the model.

Moreover, model interpretability is important at several points. For example, if a model behaves unexpectedly during the training phase, an engineer must understand what is happening in order to correct the situation. With the interpretability features of the data at hand, the engineer may find out that the behavior is caused by a hindsight bias [41] present in the dataset. So, when debugging an application that contains a machine learning model, it is important to have the proper tools to do so.

Another important point to discuss is trust. When people use systems, especially life-critical systems such as medical software, they want to trust that nothing unexpected will happen. Trust in machine learning models can be gained in two ways. The first is through evidence from daily use, which can be achieved by obtaining a high accuracy, for example. The second is to explain how the model reached its decision, so that the end-user has not only proof that the decision was correct, but also an account of how the model knew it was the correct decision to make. The second approach can lead to a scenario in which the “black-box” becomes less opaque and less intimidating, thus making the user more inclined to act on the prediction [42]. According to Gunning and DARPA (2016), it is necessary to create a new approach to how machine learning models present their predictions.

IV-B1 Preliminary Results

An analysis of the model predictions was made to discover which examples in the validation dataset the model got most right and most wrong. This serves as an artifact for interpretability, since it can bring new insights about the heuristics driving the model's decisions. Additionally, this technique is commonly used even in simpler applications because it is simple to apply.
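
One simple way to carry out such an analysis is sketched below. It assumes that “most wrong” means misclassified examples with the highest confidence in the wrong class and that “most right” means correctly classified examples with the highest confidence. The record layout and the example values are hypothetical.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    image_id: str
    true_label: str
    predicted_label: str
    confidence: float  # probability assigned to predicted_label


def most_wrong(preds: List[Prediction], k: int) -> List[Prediction]:
    """Misclassified examples, ordered from most to least confident."""
    wrong = [p for p in preds if p.predicted_label != p.true_label]
    return sorted(wrong, key=lambda p: p.confidence, reverse=True)[:k]


def most_correct(preds: List[Prediction], k: int) -> List[Prediction]:
    """Correctly classified examples, ordered from most to least confident."""
    right = [p for p in preds if p.predicted_label == p.true_label]
    return sorted(right, key=lambda p: p.confidence, reverse=True)[:k]


# Hypothetical validation records, for illustration only.
examples = [
    Prediction("img_01", "melanoma", "melanoma", 0.99),
    Prediction("img_02", "melanoma", "haemangioma", 0.97),
    Prediction("img_03", "melanoma", "melanocytic nevus", 0.61),
    Prediction("img_04", "wart", "wart", 0.55),
]
worst = most_wrong(examples, 2)
best = most_correct(examples, 1)
```

The selected images can then be inspected by eye or passed to a visualization technique to look for the cause of the error.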

Furthermore, the GradCAM technique [44] was applied in this work as a way to obtain visual feedback on where the network's activations, prior to the softmax layer, were most predominant. This provides more artifacts with which to build an explanation for the model's decisions. It can also be used to identify problems where the network learns adjacent features that may be present in lesion samples but do not necessarily hold meaning for the lesion (e.g. nails, or hair on a scalp near the lesion); see the “Husky vs Wolf” experiment in [45], where background snow was a predominant feature in classifying a Wolf.
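
The core computation of GradCAM can be sketched in a few lines. The sketch assumes that the feature maps of the last convolutional layer and the gradients of the class score (before the softmax) with respect to those maps have already been extracted from the network. It uses nested Python lists instead of a deep learning framework, and the tiny 2×2 maps are invented for illustration.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(feature_maps: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Compute a normalized GradCAM heat-map.

    feature_maps[k] is the k-th activation map of the chosen layer and
    gradients[k] is the gradient of the class score with respect to it.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]

    # Weighted sum of the activation maps.
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]

    # ReLU keeps only the regions with a positive influence on the class.
    cam = [[max(0.0, v) for v in row] for row in cam]

    # Scale to [0, 1] so the map can be fused with the original image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Invented example: channel 0 supports the class, channel 1 opposes it.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
grads = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heat_map = grad_cam(maps, grads)
```

In practice the resulting map is upsampled to the input resolution before being overlaid on the image, a step that is omitted here.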

For the most wrong predictions, it was found that the causes may fall under two factors: either the lesion analyzed is indeed easily confused with other lesions (Figure 4(a)), or the model did not generalize well and struggled to extract predominant features in some of the images, thus giving more importance to areas that were not relevant from a practical perspective (Figures 4(b) and 4(c)).

(a) Predicted as Haemangioma with 100% confidence.
Source: Edinburgh dataset.
(b) Predicted as Basal Cell Carcinoma with 100% confidence.
Source: Edinburgh dataset.
(c) Predicted as Melanocytic Nevus with 97% confidence.
Source: MED-NODE dataset.
Fig. 4: GradCAM applied to the most wrong predictions for Malignant Melanoma. Columns from left to right: original image; original image fused with heat-map; heat-map produced by GradCAM.

The most correct predictions brought more information on what the model was already good at, and how this translates into image features. For example, in Figures 5(a) and 5(b), the model found that the regions that most strongly identify the lesions as Malignant Melanomas are indeed the ones that would be most relevant to the doctor's decision-making process. However, there are still some examples, such as Figure 5(c), that are harder to interpret and need an expert's eye to shed light on them. Moreover, we can speculate that the model took advantage of the geometric and color asymmetry in the lesion to make an accurate decision.

(a) Melanoma with 100% confidence.
Source: Edinburgh dataset.
(b) Melanoma with 100% confidence.
Source: Edinburgh dataset.
(c) Melanoma with 100% confidence.
Source: MED-NODE dataset.
Fig. 5: GradCAM applied to the most correct predictions for Malignant Melanoma. Columns from left to right: original image; original image fused with heat-map; heat-map produced by GradCAM.

Furthermore, it was found that the model generalizes well for the examples that it was shown, correctly activating the regions that contained the lesion, even in images where the pose made the lesion difficult to localize.

Nonetheless, this kind of interpretation is important not only from a developer's point of view but also for doctors who would otherwise receive a bare prediction and have to act on another human life based on it. With these additional tools, the doctor has much more to support the next decision that must be taken for the patient. Therefore, making models as interpretable as they are accurate can turn a tool into a good counselor.

V Discussions

In this paper, we discussed the importance of automatic classification methods to support the diagnosis of skin lesions. Furthermore, we listed a group of studies and the results they achieved for the same problem. However, the problem still presents several difficulties, even more so for clinical images, which may present immense diversity due to variables such as cameras and environments.

In view of this, this work presented a model capable of classifying 12 skin lesions that reached results comparable with the state of the art. Additionally, studies on the model's decision-making process using interpretability techniques were presented. However, regardless of the excellent results obtained in this work, it is necessary to further test the model with more data and more diversity (different ethnicities and ages), and then investigate the results for improvements.