Skin cancer is one of the most common cancer types in the white population Leiter et al. (2014). Among the different types of skin cancers, malignant melanoma (MM) accounts for only a small percentage of cases, nevertheless, it is responsible for the majority of skin cancer deaths Apalla et al. (2017). When detected at an early stage, MM can be cured by excision of the lesion, while diagnosis at later stages is associated with a greater risk of death Leiter et al. (2014); Schadendorf et al. (2018). Thus, early detection and accurate diagnosis of MM are crucial for the patient.
Histological examination of a skin lesion is the gold standard for diagnosis and prognosis Scolyer et al. (2020). But, as it is an invasive, costly, and time-consuming procedure, clinicians and patients alike want to reduce the number of necessary diagnostic skin biopsies Ibrahim et al. (2017). The diagnostic process starts with a visual inspection of suspicious lesions by analysing the skin lesion patterns Kittler (2007). The non-invasive and optical technique of dermatoscopy allows for a more detailed examination of the skin compared to examination by the naked eye alone and it can improve the skin lesion classification performance up to 50% Youl et al. (2007). However, even with dermatoscopes, the diagnostic performance correlates with the experience of the dermatologist Tschandl et al. (2019a).
Automated skin lesion classification with computer-aided diagnostic (CAD) systems has been attempted in dermatology for over 30 years Oliveira et al. (2018). Such systems could serve as a decision aid for clinicians, particularly in enhancing the decision-making of less-experienced clinicians. Many semi or fully automatic methods have been proposed for this task Oliveira et al. (2018)
. However, proposing an accurate computerized skin lesion classification method is a challenging task due to the similar morphological appearance of different skin lesion types and also due to the various artefacts contained in dermatoscopic images. Originally, CAD systems for skin lesion classification were mainly based on (1) image pre-processing and artefact removal, (2) lesion segmentation, (3) feature extraction from the lesion area and (4) lesion classification using classical image processing or machine learning approachesOliveira et al. (2018); Celebi et al. (2015). Various image pre-processing techniques such as colour space transformation, contrast enhancement, and image filtering were used to prepare or normalise the images for the classification Celebi et al. (2015, 2009). Image border detection and segmentation were also considered as important steps for image cropping or artefact removal. Classical image processing techniques such as histogram thresholding, clustering, or active contours were widely used in the literature for skin lesion segmentation Celebi et al. (2015). More advanced segmentation techniques were also proposed in recent years using supervised machine learning approaches Oliveira et al. (2018); Celebi et al. (2015, 2009). However, proposing an accurate skin lesion segmentation technique is a very challenging task due to several reasons such as low contrast between the lesion and its surrounding background, irregular border shapes, fuzzy borders, and fragmentation Celebi et al. (2009)
. After performing segmentation, skin lesion features can be extracted from the lesion area. Intensity-based features, shape-based features, and textural-based features were among the most used features for skin lesion classification. The extracted features were then used for training classifiers such as decision trees, artificial neural networks, and support vector machinesOliveira et al. (2018); Celebi et al. (2015, 2009); Mahbod et al. (2018a).
With the advent of convolutional neural networks (CNNs) and considering their excellent performance for a variety of medical classification tasks Vu et al. (2019); Mahbod et al. (2018b), many CNN-based approaches have been developed to perform skin lesion classification with superior performances compared to other classical techniques Dick et al. (2019). In contrast to the conventional methods, many CNN-based approaches for skin lesion classification were directly applied on raw or pre-processed skin lesion images without prior image segmentation. Top performers of the International Skin Imaging Collaboration (ISIC) challenges in 2016, 2017, 2018 and 2019111https://www.isic-archive.com/#!/topWithHeader/tightContentTop/challenges (Accessed on 2020-08-04) are examples of such approaches.
Despite excellent classification performance of the CNN-based approaches for skin lesion classification without using any lesion segmentation masks, the potential impact of skin lesion segmentation on the performance of CNN-based classifiers has not been systematically investigated Celebi et al. (2015). There are only a few studies that exploited lesion segmentation information in the CNN-based classification workflow to improve the performance.
In a study by Yu et al. Yu et al. (2017), a single network was proposed that performed lesion classification in two stages. In the first stage, a very deep fully convolutional residual network was used to perform lesion segmentation. Then, the images were cropped based on the predicted segmentation masks and the cropped images were sent to a deep residual network to perform classification. The results obtained from the ISIC 2016 challenge dataset Gutman et al. (2016) showed improved classification when the segmentation stage was used (accuracy of 85.5% with both stages vs. accuracy of 82.8% with a single classification stage). However, in terms of AUC, there was only a slight improvement in the classification performance (AUC of 78.3% with both stages vs. AUC of 78.2% with a single classification stage). This method achieved the first rank in the ISIC 2016 challenge for the defined binary skin lesion classification task. A similar approach was proposed in Al-masni et al. (2020). Again, two stages were used to perform lesion segmentation and the cropped images were used to perform classification. However, in this work a full resolution CNN was used as the segmentation network and other pre-trained CNNs were used in the classification network. Evaluated on the ISIC 2016 challenge test set and by setting the best hyper-parameters, an accuracy of 81.1% and an AUC of 76.6% were achieved by this method. This approach was also trained and evaluated on the subsequent ISIC challenge datasets. Applied on the ISIC 2017 Codella et al. (2017) and the ISIC 2018 Codella et al. (2019) challenge datasets an average accuracy of 81.6% (in comparison to 88.8% of the ISIC 2017 challenge top performer) and 89.3% were achieved, respectively. As the reported results for the ISIC 2018 challenge were based on the random split of the training set to training, validation, and test set, comparison to the ISIC 2018 challenge top performers is not feasible.
Guo et al. Guo and Yang (2018) proposed a multi-channel ResNet to classify images from the ISIC 2017 challenge dataset. They performed an experiment to compare the classification results with and without a lesion detection model. The results were slightly better when the lesion detection network was used (average AUC of 87.4% vs. 87.1%). By utilising various ensembling approaches, the result was further improved to an average AUC of 91.7%.
In the work of Chen et al. Chen et al. (2018), a multi-task framework was proposed to perform segmentation and classification within the same model. A special feature passing gate was proposed, which linked the segmentation network to the classification network to exploit useful features in the workflow. The method was trained and tested on the ISIC 2017 challenge dataset. Although the achieved results were superior in comparison to a single classification network (accuracy of 80.1% vs. 77.2%), they were inferior compared to the top performer of the ISIC 2017 challenge222The actual evaluation index of the ISIC 2017 challenge was AUC. As AUC results were not reported in Chen et al. (2018), we compared it to the top performer in terms of accuracy (accuracy of 80.1% vs. 88.8% Bi et al. (2017)).
Yang et al. Yang et al. (2018) proposed a multi-target CNN with three different branches to perform segmentation and two binary classification tasks for the ISIC 2017 challenge dataset. A pre-trained GoogleNet CNN was used in the encoder part of the network while the U-Net-like decoder model was used for the segmentation branches. The reported results were superior compared to a single GoogleNet-based classification model (AUC of 88.6% vs. 85.7%), but inferior compared to the ISIC 2017 challenge top performer (AUC of 88.6% vs. 91.1% Matsunaga et al. (2017)).
Díaz González-Díaz (2019) proposed a network that incorporated the useful clinical information in the classification workflow. Using two segmentation networks (lesion segmentation net to produce binary lesion masks and structure segmentation net to produce eight feature segmentation masks), a very good classification result was achieved with an average AUC of 91.0%, which ranked the approach at the 2nd place in the ISIC 2017 challenge leaderboard. However, no comparative results were reported to investigate the added value of the utilised segmentation networks for classification performance improvement.
Burdick et al. Burdick et al. (2018)
used a subset of the ISIC 2016 challenge dataset to perform a binary classification (MM vs. benign skin lesions). In their best approach, a pre-trained Inception-V3 model was fine-tuned. The training and test images were pre-processed by applying a disk morphological operation around the skin lesions of different sizes. Then, images were zero-padded and resized and finally sent to the classifier. They observed a better classification performance when the skin lesion border enlargement was applied (accuracy of 69.3% vs. 57.3%). However, in their study, just a small test set of 75 images was used, which could potentially lead to unreliable results.
Tang et al. Tang et al. (2020) proposed a novel Global-Part CNN to perform skin lesion classification. Their developed algorithm consisted of two sub-models which were trained sequentially. The first model (Global-CNN) was trained on resized skin lesion images using a fine-tuned Xception network. The results from this part were used for fusion and also to create class activation maps (CAMs). The created CAMs from the first model were used for probabilistic cropping of the original image to train the second model (Part-CNN). The results from these two models were fused using a weighted ensembling strategy. Applied on the ISIC 2017 challenge test set, a very good classification performance with an average AUC of 91.7% was achieved which was further improved to 92.6% through a data-transformed ensemble strategy.
In the studies of Yan et al. Yan et al. (2019) and Zhang et al. Zhang et al. (2019), attention-based models were used to guide the model towards the lesion area of the image. Although no segmentation masks were directly used, the main idea of both works was to force the models to focus on the lesion area of the dermatoscopic images. Both methods were trained and evaluated on the ISIC 2017 challenge dataset and both achieved very good classification scores outperforming the top performer of the ISIC 2017 challenge. An average AUC of 91.7% was reported in Zhang et al. (2019) in comparison to 91.1% average AUC of the ISIC 2017 challenge top performerMatsunaga et al. (2017) and a MM AUC of 88.3% was reported in Yan et al. (2019) in comparison to 87.4% MM AUC of the ISIC 2017 challenge top performer for MM classification Menegola et al. (2017).
In contrast to studies that showed improved performances of skin lesion classification when segmentation masks were used, some other studies reported adverse effects. This adverse effect was specifically evident in the ISIC 2016 challenge where all top performers showed better classification performances when no lesion segmentation masks were used (c.f. by comparison of the results in sections 3 and 3B of the challenge). Li et al. Li et al. (2019) showed qualitatively that areas surrounding the skin lesions could be useful for classification so removing those areas could degrade the classification performance. Celebi et al. Celebi et al. (2008) showed quantitatively that relative colour feature (the colour of a skin lesion pixel in comparison to the average background colour) is an important feature for lesion detection and classification. Bissoto et al. Bissoto et al. (2019) proposed a method to perform skin lesion classification by removing the actual skin lesion from the images and just using the surrounding background information. Although their classification performance was much inferior compared to the traditional methods (i.e. using both lesion and surrounding area), it still delivered acceptable results compared to pure chance (an AUC of 77.4% vs. an AUC of 50% from pure chance).
In summary, the segmentation masks for skin lesion classification were either employed directly (e.g. for cropping or background removal) or indirectly (e.g. for attention-based models) to lead the networks towards the lesion area to perform the classification. However, there are some unsolved issues with the reported results. The proposed methods were either applied on a small dataset of skin lesion images or they showed inferior classification performance with a high margin in comparison to the state-of-the-art models. When the reported results showed an improvement in the classification performance, no statistical tests were performed to confirm a significant contribution of lesion segmentation to the classification performance. Therefore, the benefit of using skin lesion segmentation for improved skin lesion classification remains unsettled.
To address these issues in this study, we first developed a baseline classification model without using any segmentation masks. We used this model as the reference and applied it on a public skin lesion dataset. Then, we utilised either perfect segmentation masks (i.e. segmentation masks created manually by the medical experts) or automatically created segmentation masks (by using one of the state-of-the-art models for skin lesion segmentation) in both training and test phases in different scenarios and investigated the classification performances. We report all classification scores as well as the significance level of the performance changes. Evaluated on the ISIC 2017 challenge test set which contained two classification tasks (i.e. MM vs. all and seborrheic keratosis (SK) vs. all), we observed several interesting outcomes. Most importantly, we observed that segmentation masks did not significantly improve the MM classification performance in any scenario while having a significant positive impact on the SK classification performance with one of the approaches where lesion segmentation masks were used for image cropping. Using the same methodical approach, we noted no significant differences in the classification performance when either manually or automatically created segmentation masks were used. Moreover, the results showed that removing all background information from the skin lesion images significantly degraded the overall classification results.
2 Materials and Methods
The generic flowchart of the proposed method is shown in Fig. 1. Each part is described in detail in the following subsections.
We used a publicly available dataset to be able to compare the classification results with other state-of-the-art models that had previously been applied on the same dataset. ISIC archive is one of the biggest publicly available data sources for dermatoscopic skin lesion images. It includes the images of the well-known ISIC 2016 Gutman et al. (2016), ISIC 2017 Codella et al. (2017), ISIC 2018 Codella et al. (2019); Tschandl et al. (2018); Combalia et al. (2019) and ISIC 2019 challenges Codella et al. (2017); Tschandl et al. (2018). As we needed the labels and also the segmentation masks of both training and test sets to perform the experiments and conduct statistical tests, we chose the ISIC 2017 challenge dataset in this work. This dataset includes most of the images from the ISIC 2016 challenge dataset as well as many additional images. The dataset comprises 2000 training images, 150 validation images, and 600 test images. We used the 2150 training and validation images in the training phase and evaluated the classification performance on the 600 test images. The ISIC 2017 challenge dataset contains three skin lesion types including MM, SK, and benign nevi (BN) classes. The 2150 images utilised in the training phase included 404 MM, 296 SK, and 1450 BN images, while the 600 test images were comprised of 117 MM, 90 SK, and 393 BN images. The images in both training and test sets contained various image artefacts and had different image resolutions ranging from to pixels.
For pre-processing, we first applied the gray world colour constancy algorithm Amidi (2019); Barata et al. (2015) on all training and test images to deal with various lightening conditions in the dataset. This pre-processing step was shown to be beneficial for skin lesion classification and was used by the former ISIC challenge top performers Matsunaga et al. (2017); Gessert et al. (2018)
. Next, we subtracted the mean RGB intensity value of the ImageNet datasetDeng et al. (2009)
from the RGB channels of all training and test images. This is a standard pre-processing technique for transfer learningMahbod et al. (2019, 2019, 2020b). To create a baseline dataset without using any segmentation mask, we resized all training and test images to a fixed image size of pixels. We used the results from this dataset as the benchmark and for comparison to all other results. We refer to this dataset as ”reference dataset” in the paper.
To investigate the effect of using image segmentation on the classification performance, we designed two sets of experiments. In the first set of experiments, we created five transformed datasets using the manual segmentation masks provided by human experts with the following details:
In the first dataset, we used the manual segmentation masks, i.e. the provided ground truth masks for the training and test sets of the ISIC 2017 challenge, to mask out the background (set all background pixels to zero) in all training and test images. Then, we resized the images to a fixed size of pixels. We refer to this dataset as ”background removed dataset”.
In the second dataset, we used the manual segmentation masks to remove the background (set all background pixels to zero). Here, however, we used the exact lesion dimensions to crop the images. After cropping, we resized all images to pixels. We subsequently refer to this dataset as ”background removed and cropped dataset”.
To create the third dataset, we did not remove the background, but similar to the second dataset, we used the exact lesion dimensions to crop the images. Again, we resized all images to a fixed size of pixels. We subsequently refer to this dataset as ”cropped dataset”.
To create the fourth dataset, the lesions inside the segmentation masks were first dilated by a factor of 1.4 along each image dimension. Then, the dilated lesion masks were used for cropping the skin lesion images. The resulting images were then resized to pixels. We subsequently refer to this dataset as ”dilated cropped dataset”.
To create the fifth dataset, we used the same raw images as described for the reference dataset. Here, we also added an additional channel to incorporate the lesion segmentation mask as the fourth channel for each individual training and test image. The images and masks were also resized to a fixed size of pixels like in the other datasets. We subsequently refer to this dataset as ”four-channel input dataset”.
For the second set of experiments, we created additional five transformed test datasets similar to the aforementioned datasets in the first set of experiments. However, this time instead of using the manual segmentation masks to extract the lesion information in the test images, we used one of the state-of-the-art models to perform segmentation (further details about the developed segmentation model, referred to as SkinLinkNet can be found in Section 2.4). This step was important for two reasons. First, to investigate any difference in the classification performances by comparing the results from the perfect segmentation masks and the segmentation masks predicted by an automatic lesion segmentation model. Second, using computer-generated segmentation masks would be more suitable for the clinical practice, where the manual annotations of the lesions of the unseen dermatoscopic images are usually not provided.
2.3 Classification model
Our method developed for classification was inspired by former studies Gessert et al. (2018); Mahbod et al. (2019, 2020a) which had shown excellent classification performances. In this work, we avoided using any external dataset or sophisticated ensembling strategies as we did not aim to achieve the best classification performance. Instead, we developed a baseline single classification model to be used as a reference model in all experiments. For the pre-trained model selection, we used the shallowest version of the EfficientNet family (EfficientNetB0) Tan and Le (2019)
and fine-tuned it with the training set. For the fine-tuning, we removed the FC layer of the pre-trained network and then used a global average pooling layer to connect it to two blocks of batch normalisation layers, dropout layers, and FC layers. We used a drop out factor of 0.3 in both dropout layers. 64 and 3 neurons were used in the first and second FC layers, respectively. For thefour-channel input dataset, where four-channel input images (RGB channels plus the mask channel) were used, we added a convolutional layer to convert the 4 channel inputs to 3 channel data and then connected it to the utilised EfficientNetB0 model. We initialised the weights of the newly added layers by Xavier initialisation method Glorot and Bengio (2010) and kept the learning rate of these newly added layers 10 times larger compared to all other learnable layers. We used Adam optimisation method Kingma and Ba (2015) et al. (2017b)
. We used five-fold cross-validation and for each fold, we saved the best model by monitoring the average AUC score for the validation set. In the inference phase, first, we applied 50-fold test time augmentation and then, the augmented test images were sent to the saved five sub-models (one model for each fold) and the average results were used as the final prediction vectors. The utilised augmentation techniques in training and test phases included both morphological and colour augmentations such as random scaling (scale limit of 0.1 with the probability of 0.3), random rotations (0, 90, 180, 270 degrees with the probability of 0.5), vertical and horizontal flipping (with the probability of 0.5), brightness and contrast shifts (brightness and contrast limit of 0.15 with the probability of 0.4), and random adaptive histogram equalisation (tile grid size of 8 with the probability of 0.1). Further details about the utilised approach for fine-tuning and test time augmentation can be found inMahbod et al. (2020a). This developed classification model was used in all experiments in both training and test phases.
2.4 Segmentation masks
We used two types of segmentation masks to create the datasets as described in Section 2.2 (i.e. manually and automatically created segmentation masks).
For the first set of experiments, we used the provided ground truth segmentation masks of the ISIC 2017 challenge for both training and test images. Although using manually annotated segmentation masks in the test phase is not a practical approach in the clinical setting, it enabled us to reveal the potential impact of the perfect segmentation masks on the classification performance.
In the second set of experiments, we used one of the state-of-the-art skin lesion segmentation models. We developed an identical approach as explained in Tschandl et al. (2019b). We used the LinkNet-152 segmentation model Chaurasia and Culurciello (2017) and pre-trained the encoder part of the model with the HAM10000 dataset Tschandl et al. (2018). For training the full segmentation model, we used the 2000 training images of the ISIC 2017 challenge and monitored the segmentation performance on the 150 images of the validation set of the ISIC 2017 challenge dataset. We used identical hyper-parameters, pre-processing, and data augmentation for training as described in Tschandl et al. (2019b). We resized all images and their corresponding masks to
pixels using bilinear interpolation as a pre-processing step. For data augmentation, we used horizontal and vertical flipping as well as random 90-degrees rotations. A combination of binary cross-entropy and Jaccard loss was used for model training. We applied the trained model to perform segmentation of the 600 test images of the ISIC 2017 challenge. We used the predicted segmentation masks to create the datasets (as explained in Section2.2) for the second set of experiments. We refer to this model as SkinLinkNet in the rest of the paper. For reporting the results for the automatic segmentation model in the results section, we mainly used SkinLinkNet.
To extend our comparison, we also developed two other segmentation models, namely SkinUNet and SkinFPN+ and compared their performance with the SkinLinkNet model.
For SkinUNet, we developed a modified U-Net model Ronneberger et al. (2015). In the encoder part of the model, a pre-trained ResNet34 He et al. (2016) network was used. We used Adam optimiser and a combination of dice loss and focal loss to train the model. Similar to SkinLinkNet, we used the resized pixel images from the ISIC 2017 dataset to train the model. As augmentation, horizontal and vertical flipping (with the probability of 0.5), random brightness and contrast shift (brightness and contrast limit of 0.15 with the probability of 0.4), random 0-, 90-, 180-, and 270-degree rotations (with the probability of 0.5), and random adaptive histogram equalization (tile grid size of 8 with the probability of 0.1) were used.
For the SkinFPN+, we used the training scheme similar to the SkinUNet with two main differences. First, we used the feature pyramid network (FPN) Lin et al. (2017a); Kirillov et al. (2017) as the main architecture with the pre-trained ResNet34 network in the encoder part of the model. Second, we used extensive external data to train this segmentation model (Hence we used the ”+” sign in the model’s name). Besides the ISIC 2017 training and validation images, we used the recently released HAM10000 segmentation masks and images to train this network Tschandl et al. (2020). Similar to the ISIC 2017 dataset, all 10015 images and segmentation masks of the HAM10000 dataset were resized to pixel as a pre-processing step.
We developed these two additional segmentation models to have slightly superior and slightly inferior segmentation models in comparison to the SkinLinkNet model (refer to Table 2 for quantitative comparison). It is worth mentioning that in the first set of experiments, we used perfect manual segmentation masks that can be considered as the best possible segmentation model.
For evaluating the classification performances, we used AUC as the main evaluation index identical to the ISIC 2017 challenge. Although we trained all models to solve a ternary classification task, as suggested in the ISIC 2017 challenge, we calculated the AUCs for two binary classification problems (i.e. MM vs. all and SK vs. all). To convert the ternary classification vectors to two binary classification vectors, we used a one-versus-all approach. In addition to AUC, we also calculated accuracy, sensitivity, and specificity as additional evaluation indexes. To measure those indexes, we converted the classification probability vectors to binarize classification vectors by using a threshold of 0.5. For comparing the statistical differences between different AUCs, we employed the method described by Delonget al. DeLong et al. (1988) and to measure the statistical significance levels of the other evaluation indexes, we used McNemar statistical test Fagerland et al. (2013). Delong et al. non-parametric approach is widely used for comparing empirical ROC curves and AUC scores of paired samples. This method became popular as it does not have the normality assumption for the sample distribution among the two classes. The statistical significance level (p-value) of the Delong et al. method can be derived from the following formulas:
where and are the empirical AUCs of the first and second tests and and
are the variance and covariance functions. The p-value is then calculated aswhere
is the standard normal cumulative distribution functionDeLong et al. (1988); Demler et al. (2012). McNemar is a method that statistically compares predicted labels against ground truth labels for binary matched-pairs data then detects whether the misclassification rates between the two tests are statistically significant or not Fagerland et al. (2013). As this method can be used for binary matched-pair samples, it is a good choice for comparing the accuracy, sensitivity, and specificity of two models.
For measuring the segmentation performances, we calculated the average dice score and average Jaccard score as the main evaluation indexes identical to the ISIC 2017 challenge. To perform statistical tests on the segmentation results, we used Wilcoxon signed-rank test method Gibbons and Chakraborti (2014)
. Wilcoxon signed-rank statistical test method is a non-parametric test for two populations when samples are paired (in this study we have a dice score and Jaccard score for each sample). The test statistic in the method is the sum of the ranks of positive differences between the Jaccard score and the dice score in the two populations. Further details about this method can be found inGibbons and Chakraborti (2014). In all statistical tests, a two-sided significance level of 5% (p-value = 0.05) was used as the threshold.
Keras333https://keras.io/ (Accessed on 2020-08-04) deep learning framework was used in the development of the classification model as well as the development of the SkinUNet and SkinFPN+ segmentation models. For developing the SkinLinkNet
segmentation model, PyTorch framework was used. All pre-processing steps were performed offline and with MATLAB software (version 2018a). We used MedCalc (version 19.1) and MATLAB software (version 2018a) to perform the statistical tests. All experiments were conducted on a single workstation with an Intel Core i7-8700 3.20 GHz CPU, 32 GB of RAM, and a TITIAN V NVIDIA GPU card with 12 GB of installed memory. Training and test time for each experiment wereminutes and minutes, respectively.
All reported results in this section are based on the 600 images of the ISIC 2017 challenge test set. The SkinLinkNet segmentation model was used to report the results for the cases where automatically created segmentation masks were used unless stated otherwise in the text.
We started the experiments by evaluating the classification and segmentation performances of the utilised models.
As mentioned in Section 2, we did not aim to improve the classification results compared to the existing models as it was not the primary aim of this study. As the consequence, we did not use any external datasets or sophisticated ensembling strategy to improve the classification performance. On the other hand, we did not want to have a model that delivers much inferior performance in comparison to the other state-of-the-art algorithms. We developed a baseline classification model that produced results comparable to the state-of-the-art models. Table 1 shows the comparison between the classification performance of the utilised model and the top three performers of the ISIC 2017 classification challenge (rows 1–3)Matsunaga et al. (2017); Menegola et al. (2017); Díaz (2017) as well as four other methods (rows 4–7) González-Díaz (2019); Tang et al. (2020); Yan et al. (2019); Mahbod et al. (2019). The aforementioned four methods had been developed after the competition and all had shown better overall classification performances in comparison to the ISIC 2017 challenge top-ranked team. The last row in Table 1 shows our baseline classifier performance that was trained with the reference dataset.
Table 2 compares the results from the main utilised segmentation algorithm (i.e. SkinLinkNet in the second set of experiments) with the top three performers of the ISIC 2017 segmentation challenge Bi et al. (2017); Yuan (2017); Berseth (2017) as well as the two additional developed segmentation models explained in Section 2.4. The significance levels (p-value) in Table 2 show whether results obtained with state-of-the-art algorithms are significantly different from the SkinLinkNet model.
|Matsunga et al. Matsunaga et al. (2017)||86.8||n/a||95.3||n/a||91.1|
|Gonzales et al. Díaz (2017)||85.6||0.33||96.5||0.17||91.0|
|Menegola et al. Menegola et al. (2017)||87.4||n/a||94.3||n/a||90.8|
|Mahbod et al. Mahbod et al. (2019)||87.3||0.97||95.5||0.63||91.4|
|Gonzales et al. González-Díaz (2019)||87.3||0.97||96.2||0.27||91.7|
|Yan et al. Yan et al. (2019)||88.3||0.47||n/a||n/a||n/a|
|Tang et al. Tang et al. (2020)||88.9||0.28||96.4||0.14||92.6|
|Mahbod et al. (This study)||87.2||–||95.1||–||91.2|
|Yading Yuan Yuan (2017)||76.5 19.6||0.052||84.9 16.6||0.052|
|Matt Berseth Berseth (2017)||76.2 19.7||0.21||84.7 16.4||0.27|
|Bi et al. Bi et al. (2017)||76.0 20.6||0.99||84.4 17.4||0.79|
|SkinUNet||73.3 23.1||0.022||81.9 21.0||0.025|
|SkinFPN+||77.3 18.8||0.001||85.6 15.5||0.001|
|SkinLinkNet||76.0 19.5||–||84.5 16.5||–|
(last row), with the top three performers of the ISIC 2017 segmentation challenge (row 1-3) as well as two additional developed segmentation models (row 4-5) based on the average Dice and average Jaccard index. The reported p-values were calculated with Wilcoxon signed-rank test methodGibbons and Chakraborti (2014) by pair-wise comparison of the Jaccard and Dice scores of the SkinLinkNet model with the other approaches. Information regarding the Jaccard and Dice scores of the top three performers were derived from the ISIC 2017 challenge leaderboard. For those cases where the results were significantly inferior compared to the SkinLinkNet results, the p-values are shown in red and for those cases where the results were significantly superior compared to the SkinLinkNet results, the p-values are shown in blue.
Results displayed in Table 3 show the main findings of this study. The classification performances of the different approaches are compared based on the AUC scores. All reported p-values were derived from comparisons of the results obtained from the reference dataset with the other approaches using Delong et al. method DeLong et al. (1988). The comparison of the results based on the evaluation indexes accuracy, sensitivity and specificity is displayed in Table 4. For these results, the reported significance levels were calculated by comparing the results from the reference dataset with the other approaches using McNemar statistical test method Fagerland et al. (2013).
|bg rm & crop||Manual||87.5||0.89||93.6||0.24||90.5|
|bg rm & crop||SkinLinkNet||85.0||0.25||95.1||0.99||90.1|
|bg rm & crop||Manual||86.7||91.3||70.9||64.4||90.5||96.1||0.50||0.78|
|bg rm & crop||SkinLinkNet||85.7||92.0||57.3||56.7||92.5||98.2||0.99||0.40|
As apparent from Table 3, the best overall classification performance was achieved when the dilated cropped dataset was used. Using dilated cropped dataset, the results also showed slightly superior performance when manual segmentation masks were used in comparison with automatically created segmentation masks by SkinLinkNet (an average AUC of 93.0% vs. 92.6%). To have more extensive comparison for dilated cropping, we also performed additional experiments with the other two developed segmentation models (i.e. SkinUNet and SkinFPN+) and report the results in Table 5. As the results from various approaches with the dilated cropped dataset were very competitive, we performed statistical tests to investigate the significance level of the performance differences. The results of these tests are depicted in Table 6.
|ref (no Segmentation model)||87.2||–||95.1||–||91.2|
|dilated crop (by manual masks)||89.4||0.09||96.7||0.02||93.0|
|dilated crop (by SkinLinkNet)||88.7||0.22||96.6||0.03||92.6|
|dilated crop (by SkinUNet)||88.3||0.16||96.6||0.02||92.7|
|dilated crop (by SkinFPN+)||89.2||0.08||96.9||0.01||93.0|
Fig. 3 and Fig. 4 show some examples of test images which are only classified correctly when the reference dataset was used but incorrectly classified when the dilated cropped dataset was utilised. On the other hand, Fig. 5 and Fig. 6 show some examples that are only classified correctly when dilated cropped dataset was used but incorrectly classified when the reference dataset was used. For visual comparison, we selected the dilated cropped dataset as it has shown better overall classification performances compared to the other datasets.
In this study, we explicitly investigated and explored the effect of using skin lesion segmentation on classification performance through a systematic process and using one of the well-known publicly available datasets (the ISIC 2017 challenge dataset). First, we developed a baseline classifier (without using any segmentation mask, any sophisticated ensemble strategy, and any external training dataset) that delivers a good classification performance comparable to the other state-of-the-art classification models. Then we conducted a comprehensive investigation to explore the effects of using segmentation masks on the skin lesion classification performance through two sets of experiments. We investigated the effects of both manually created segmentation masks (using the ground truth segmentation masks) and automatically created segmentation masks (using our developed SkinLinkNet segmentation model) on the classification performance in 10 different scenarios. In addition to reporting the actual classification scores, we also performed statistical tests to evaluate whether the reported values were significantly different. All derived classification prediction vectors from different scenarios and also the automatically created segmentation masks are available from this Github repository444Upon acceptance of the paper, we will make the repository publicly available: https://github.com/masih4/Skin-lesion-segmentation-effects-of-the-classification-perfromnce
The results reported in Table. 1 show the comparative results of our single classification network and the top three performers of the ISIC 2017 challenge as well as four other state-of-the-art algorithms. While the overall classification performance of the developed model seemed slightly inferior compared to some other approaches, we could not find a significant difference in the performance. We were not able to perform statistical tests to compare the results with the first and third rank approaches as the prediction vectors of those two studies were not available. We found the classification prediction vector of the other studies either through the corresponding Github repositories or by asking from the paper’s authors to share the prediction vectors.
As explained in Section 2.2, we chose the ISIC 2017 challenge dataset in this study as the ground truth for the test set of both segmentation and classification tasks were available. However, to further evaluate the performance of the baseline classification model, we used a similar approach and applied it on the ISIC 2018 challenge dataset. We used the training set of the ISIC 2018 dataset for training the baseline classifier and tested on the test set of the ISIC 2018 challenge. As the ISIC 2018 challenge images had a fixed image size of pixels, we extracted random crops with a fixed size of for the model training. Besides this difference, we used the identical training scheme as explained in Section 2.3. Our single classification model, without using any external dataset and any ensembles, achieved an average recall score of 83.6% on the test set of the ISIC 2018 challenge dataset. Using external datasets for training and a straightforward fusion approach (as explained in our former study in Mahbod et al. (2020a)), the performance was improved to 87.2% which confirms a good classification performance of the developed method. Our single classification model and our fusion model currently rank 27th and 4th, respectively out of more than 200 participating teams and more than 11000 submissions in the ISIC 2018 online evaluation platform for skin lesion diagnosis. 555https://challenge2018.isic-archive.com/live-leaderboards/ Task 3: Lesion Diagnosis (Accessed on 2020-08-04).
As explained in Section 2.2, in the first set of experiments, we used manual segmentation masks provided by the experts, which are the ideal segmentation masks. In the second set of experiments, we made use of automatically created segmentation masks generated by the developed SkinLinkNet segmentation model. To make sure that our segmentation model could produce comparable results to the other state-of-the-art models, we compared the segmentation results with the top three performers of the ISIC 2017 challenge in the segmentation part as shown in Table. 2 (rows 1–3). As the results show, our utilised SkinLinkNet segmentation model delivered slightly inferior segmentation performance compared to the top three performers of the ISIC 2017 challenge. However, after performing statistical tests, we observed no significant differences between the results. To extend the comparison, we also developed two additional segmentation models. One delivered slightly but still significantly inferior segmentation performance and the other one delivered slightly but still significantly superior segmentation performance as shown in Table. 2. It is worth mentioning that a direct comparison of the SkinLinkNet and SkinFPN+ is not rational as the latter one was trained with extensive external data. As explained in Section 2.4, the main idea behind developing the additional segmentation models was to investigate if there is any significant differences between using manually created segmentation masks and automatically created segmentation masks (either through using SkinLinkNet, SkinUNet, or SkinFPN+) on the classification performance (details in Table 5 and Table 6).
The results in Table. 3 show the main finding of this study for both sets of experiments. Several interesting outcomes can be inferred from the results of this table. First, for most of the cases where segmentation masks were used, we could not observe significant differences in the performance in comparison to the reference results (i.e. using no segmentation masks). Second, using the segmentation masks did not significantly improve the MM classification performance in any scenario but significantly degraded it in some cases. Third, using the dilated cropping strategy, the SK classification performance was significantly improved which suggested that this method is the best way to use the segmentation masks in the classification workflow. Finally, for the scenarios where background information was completely removed or 4-channel input images were used, the overall classification performance was inferior compared to the reference results.
We performed similar comparisons between the results based on the other evaluation indexes as shown in Table. 4. Here, only in one of the cases (using dilated cropped dataset), the SK classification result was significantly improved. However, we used a threshold of 0.5 (as suggested in the ISIC 2017 challenge) to measure the accuracy, sensitivity, and specificity which may not be the optimal threshold. As only 600 images were used in the test phase, misclassifying only a few images with the utilised threshold could change the accuracy, sensitivity, and specificity drastically and hence the results may not show the effect of each utilised approach reliably. For a better evaluation, a bigger test set should be used in future studies to investigate the effect of using segmentation masks on the accuracy, sensitivity, and specificity.
For visual evaluation, we show some examples in Fig. 3 and Fig. 4 that are only classified correctly when the reference dataset was used but incorrectly classified when the dilated cropped dataset was utilised. We selected dilated cropped dataset for comparison as it was shown to have the best classification performance among other scenarios (refer to Table 3 and Table 4). The manual and automatic segmentation masks (using SkinLinkNet) are shown as green and red overlaid contours on the raw images.
From these examples, we can assume that when the lesions are a very small part of the images, using segmentation masks for dilated cropping may lead to better classification performance in comparison to the baseline classifier. To investigate this effect quantitatively, we calculated the accuracy of the models using only the dermatoscopic images containing small skin lesions (i.e. ratio of the skin lesion to the entire image was less than 2%). We observed a superior classification performance of the model that was trained with dilated cropped dataset in comparison to the baseline classifier for both MM vs. all (94.3% vs. 91.4%) and SK vs. all (97.1% vs. 94.3%) classification tasks. However, further studies are needed to find out the underlying reasoning of the model prediction. A complete list of images that are classified only with one of the approaches for both classification problems (i.e. MM vs. all and SK vs. all) can be found in our Github repository.
As apparent from the results in Table 3, using the dilated cropped dataset delivered the best overall classification performance. In Table 5, we explicitly compare the classification results from the reference dataset with the dilated cropped dataset when manually (through manual segmentation masks) or automatically (through using either SkinLinkNet, SkinUNet, or SkinFPN+) created segmentation masked were used. The results in this table show that for all cases the overall classification performance of dilated cropped dataset is better over the reference dataset. The results also show a very competitive performance for the dilated cropped dataset in different scenarios. As the results in Table 6 suggest there are no statistical differences when manual segmentation masks or automatically created segmentation masks were used. This means, upon a proper usage of segmentation masks, one can rely on the automated segmentation model to crop the images and then perform classification. Interestingly, it can be also inferred that there is no need to use the best possible automatic segmentation model to perform dilated cropping as it does not have a significant impact on the final classification performance. While the segmentation performance of the SegUNet is statistically inferior compared to SkinLinkNet and the segmentation performance of the SkinLinkNet is statistically inferior compared to SkinFPN+, all delivered comparable classification performance to the manual segmentation masks.
In this paper, we have explored the effects of using skin lesion segmentation masks on the performance of dermatoscopic image classification. Using a baseline classification network and using manually or automatically created segmentation masks in different scenarios, we observed several interesting outcomes. Our results suggest that using segmentation masks in a proper way can significantly improve the overall classification performance. However, using the masks in an inappropriate manner by removing all background information can significantly degrade the classification results. Moreover, we show that by proper exploitation of segmentation masks, there is no significant difference in the classification performance when manually or automatically created segmentation masks are used.
Conflict of interest statement
There are no conflicts of interest to disclose for publication of this paper.
This work was supported by the EU Horizon 2020 CaSR Biomedicine project, No. 675228 and the Austrian Research Promotion Agency (FFG), No. 872636. The authors would like to thank the TissueGnostics Research and Development team666https://www.tissuegnostics.com for valuable suggestions. Moreover, we thank NVIDIA corporation for their generous GPU donation.
- Multiple skin lesions diagnostics via integrated deep convolutional networks for segmentation and classification. Computer Methods and Programs in Biomedicine 190, pp. 105351. External Links: Cited by: §1.
- Machine learning tips and tricks cheatsheet (data accessed august 21, 2019). Note: https://stanford.edu/ shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks Cited by: §2.2.
- Epidemiological trends in skin cancer. Dermatology Practical & Conceptual 7 (2), pp. 1–6. External Links: Cited by: §1.
- Improving dermoscopy image classification using color constancy. 19 (3), pp. 1146–1152. External Links: Cited by: §2.2.
- ISIC 2017-skin lesion analysis towards melanoma detection. arXiv preprint arXiv:1703.00523. Cited by: Table 2, §3.
- Automatic skin lesion analysis using large-scale dermoscopy images and deep residual networks. arXiv preprint arXiv:1703.04197. Cited by: §1, Table 2, §3.
- (De)Constructing bias on skin lesion datasets. In , Vol. , pp. 2766–2774. External Links: Cited by: §1.
- Rethinking skin lesion segmentation in a convolutional classifier. Journal of Digital Imaging 31 (4), pp. 435–440. External Links: Cited by: §1.
- A state-of-the-art survey on lesion border detection in dermoscopy images. In Dermoscopy Image Analysis, Vol. 10, pp. 97–129. External Links: Cited by: §1, §1.
- Automatic detection of blue-white veil and related structures in dermoscopy images. Computerized Medical Imaging and Graphics 32 (8), pp. 670 – 677. External Links: Cited by: §1.
- Lesion border detection in dermoscopy images. Computerized Medical Imaging and Graphics 33 (2), pp. 148 – 153. External Links: Cited by: §1.
- LinkNet: exploiting encoder representations for efficient semantic segmentation. In IEEE Visual Communications and Image Processing, pp. 1–4. External Links: Cited by: §2.4.
- A multi-task framework with feature passing module for skin lesion classification and segmentation. In 15th International Symposium on Biomedical Imaging, pp. 1126–1129. External Links: Cited by: §1, footnote 2.
- Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1710.05006. Cited by: §1, §2.1.
- Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1902.03368. Cited by: §1, §2.1.
- BCN20000: dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288. Cited by: §2.1.
- Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44 (3), pp. 837–845. External Links: Cited by: §2.5, §2.5, Table 1, Table 3, Table 5, Table 6, §3.
- Misuse of DeLong test to compare AUCs for nested models. Statistics in Medicine 31 (23), pp. 2577–2587. External Links: Cited by: §2.5.
- ImageNet: a large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: Cited by: §2.2.
- Incorporating the knowledge of dermatologists to convolutional neural networks for the diagnosis of skin lesions. arXiv preprint arXiv:1703.01976. Cited by: Table 1, §3.
- Accuracy of computer-aided diagnosis of melanoma: a meta-analysis. JAMA Dermatology 155 (11), pp. 1291–1299. External Links: Cited by: §1.
- The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional. BMC Medical Research Methodology 13 (1), pp. 91. External Links: Cited by: §2.5, §2.5, Table 4, §3.
- Skin lesion diagnosis using ensembles, unscaled multi-crop evaluation and loss weighting. arXiv preprint arXiv:1808.01694. Cited by: §2.2, §2.3.
- Nonparametric statistical inference: revised and expanded. CRC Press. Cited by: §2.5, Table 2.
Understanding the difficulty of training deep feedforward neural networks.
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington (Eds.), Proceedings of Machine Learning Research, Vol. 9, Chia Laguna Resort, Sardinia, Italy, pp. 249–256. Cited by: §2.3.
- DermaKNet: incorporating the knowledge of dermatologists to convolutional neural networks for skin lesion diagnosis. IEEE Journal of Biomedical and Health Informatics 23 (2), pp. 547–559. External Links: Cited by: §1, Table 1, §3.
- Multi-channel-ResNet: an integration framework towards skin lesion analysis. Informatics in Medicine Unlocked 12, pp. 67 – 74. External Links: Cited by: §1.
- Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1605.01397. Cited by: §1, §2.1.
- Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §2.4.
- Dermoscopy versus skin biopsy in diagnosis of suspicious skin lesions. Al-Azhar Assiut Medical Journal 15 (4), pp. 203–209. External Links: Cited by: §1.
- Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, Y. Bengio and Y. LeCun (Eds.), Cited by: §2.3.
- A unified architecture for instance and semantic segmentation. Note: http://presentations.cocodataset.org/COCO17-Stuff-FAIR.pdf Cited by: §2.4.
- Dermatoscopy: introduction of a new algorithmic method based on pattern analysis for diagnosis of pigmented skin lesions. Dermatopathology: Practical and Conceptual 13 (1), pp. 3. Cited by: §1.
- Epidemiology of skin cancer. In Sunlight, Vitamin D and Skin Cancer, pp. 120–140. External Links: Cited by: §1.
- From deep learning towards finding skin lesion biomarkers. In 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 2797–2800. External Links: Cited by: §1.
- Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 936–944. Cited by: §2.4.
- Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. External Links: Cited by: §2.3.
- Automatic brain segmentation using artificial neural networks with shape context. Pattern Recognition Letters 101, pp. 74–79. External Links: Cited by: §1.
- Breast cancer histological image classification using fine-tuned deep network fusion. In Image Analysis and Recognition, A. Campilho, F. Karray, and B. ter Haar Romeny (Eds.), Cham, pp. 754–762. External Links: Cited by: §1.
Fusing fine-tuned deep features for skin lesion classification. 71, pp. 19–29. External Links: Cited by: §2.2, §2.3, Table 1, §3.
- Transfer learning using a multi-scale and multi-network ensemble for skin lesion classification. Computer Methods and Programs in Biomedicine, pp. 105475. External Links: Cited by: §2.3, §4.
- Investigating and exploiting image resolution for transfer learning-based skin lesion classification. arXiv preprint arXiv:2006.14715. Cited by: §2.2.
- Skin lesion classification using hybrid deep neural networks. In International Conference on Acoustics, Speech and Signal Processing, pp. 1229–1233. External Links: Cited by: §2.2.
- Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble. arXiv preprint arXiv:1703.03108. Cited by: §1, §1, §2.2, Table 1, §3.
- RECOD titans at ISIC challenge 2017. arXiv preprint arXiv:1703.04819. Cited by: §1, Table 1, §3.
- Computational methods for pigmented skin lesion classification in images: review and future trends. 29 (3), pp. 613–636. External Links: Cited by: §1.
- U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. External Links: Cited by: §2.4.
- Melanoma. 392 (10151), pp. 971–984. External Links: Cited by: §1.
- Melanoma pathology reporting and staging. Modern Pathology 33, pp. 15–24. External Links: Cited by: §1.
- EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §2.3.
- GP-CNN-DTEL: global-part CNN model with data-transformed ensemble learning for skin lesion classification. IEEE Journal of Biomedical and Health Informatics (), pp. 1–1. External Links: Cited by: §1, Table 1, §3.
- Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. The Lancet Oncology 20 (7), pp. 938 – 947. External Links: Cited by: §1.
- Human–computer collaboration for skin cancer recognition. Nature Medicine. External Links: Cited by: §2.4.
- The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5, pp. 180161. External Links: Cited by: §2.1, §2.4.
- Domain-specific classification-pretrained fully convolutional network encoders for skin lesion segmentation. Computers in Biology and Medicine 104, pp. 111 – 116. External Links: Cited by: §2.4.
- Methods for segmentation and classification of digital microscopy tissue images. Frontiers in Bioengineering and Biotechnology 7, pp. 53. External Links: Cited by: §1.
- Melanoma recognition via visual attention. In Information Processing in Medical Imaging, A. C. S. Chung, J. C. Gee, P. A. Yushkevich, and S. Bao (Eds.), Cham, pp. 793–804. External Links: Cited by: §1, Table 1, §3.
- Skin lesion analysis by multi-target deep neural networks. In 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1263–1266. External Links: Cited by: §1.
- The effect of an educational programme to improve the skills of general practitioners in diagnosing melanocytic/pigmented lesions. 32 (4), pp. 365–370. External Links: Cited by: §1.
- Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Transactions on Medical Imaging 36 (4), pp. 994–1004. External Links: Cited by: §1.
- Automatic skin lesion segmentation with fully convolutional-deconvolutional networks. arXiv preprint arXiv:1703.05165. Cited by: Table 2, §3.
- Attention residual learning for skin lesion classification. 38 (9), pp. 2092–2103. External Links: Cited by: §1.