Skin Disease Classification versus Skin Lesion Characterization: Achieving Robust Diagnosis using Multi-label Deep Neural Networks

12/09/2018 ∙ by Haofu Liao, et al. ∙ University of Rochester

In this study, we investigate what constitutes a practically useful approach to robust skin disease diagnosis. A direct approach is to target the ground truth diagnosis labels, while an alternative approach instead focuses on determining skin lesion characteristics that are more visually consistent and discernible. We argue that, for computer-aided skin disease diagnosis, it is both more realistic and more useful to treat lesion type tags as the target of an automated diagnosis system, so that the system can first achieve a high accuracy in describing skin lesions and in turn facilitate disease diagnosis using lesion characteristics in conjunction with other evidence. To meet this objective, we employ convolutional neural networks (CNNs) for both disease-targeted and lesion-targeted classification. We have collected a large-scale and diverse dataset of 75,665 skin disease images from six publicly available dermatology atlantes, and we train and compare disease-targeted and lesion-targeted classifiers. For disease-targeted classification, only 27.6% top-1 and 57.9% top-5 accuracies are achieved, with a mean average precision (mAP) of 0.42. In contrast, for lesion-targeted classification, we can achieve a much higher mAP of 0.70.


I Introduction

The diagnosis of skin diseases is challenging. To diagnose a skin disease, a variety of visual clues may be used, such as the individual lesional morphology, the body site distribution, color, scaling, and arrangement of lesions. When the individual elements are analyzed separately, the recognition process can be quite complex [1]. For example, the well-studied skin cancer melanoma has four major clinical diagnosis methods: the ABCD rule, pattern analysis, the Menzies method, and the 7-Point Checklist. To use these methods and achieve a satisfactory diagnostic accuracy, a high level of expertise is required, as the differentiation of skin lesions demands a great deal of experience [2].

Unlike the diagnosis by human experts, which depends essentially on subjective judgment and is not always reproducible, a computer-aided diagnostic system is more objective and reliable. Traditionally, one can use human-engineered feature extraction algorithms in combination with a classifier to complete this task. For some skin diseases, such as melanoma and basal cell carcinoma, this solution is feasible as their features are regular and predictable. However, when we extend the scope to a broader range of skin diseases, where the features are so complex that hand-crafted feature design becomes infeasible, the traditional approach fails.

In recent years, deep convolutional neural networks (CNNs) have become very popular in feature learning and object classification. The use of high-performance GPUs makes it possible to train a network on a large-scale dataset so as to yield better performance. Many studies [3, 4, 5, 6] from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7] have shown that state-of-the-art CNN architectures are able to surpass humans in many computer vision tasks. Therefore, we propose to construct a skin disease classifier with CNNs.

Fig. 1: Some visually similar skin diseases. First row (left to right): malignant melanoma, dermatofibroma, basal cell carcinoma, seborrheic keratosis. Second row (left to right): compound nevus, intradermal nevus, benign keratosis, Bowen's disease.

However, training CNNs directly using the diagnosis labels may not be viable. 1) For some diseases, their lesions are so similar that they cannot be distinguished visually. Figure 1 shows the dermatology images of eight different skin diseases. We can see that the two diseases in each column have very similar visual appearances, so it is very difficult to make a judgment between the two diseases using only the visual information. 2) Many skin diseases are not so common, and only a few images are available for training. Table I shows the dataset statistics of the dermatology atlantes used in this study. We can see that there are over two thousand distinct diagnoses, yet most of them contain very few images. 3) Skin disease diagnosis is a complex procedure that often involves many other modalities, such as palpation, smell, temperature changes, and microscopy examinations [1].

On the other hand, lesion characteristics, which inherently describe the visual aspects of skin diseases, arguably should be considered as the ideal ground truth for training. For example, the two images in the first column of Figure 1 can both be labeled with hyperpigmented and nodular lesion tags. Compared with using the sometimes ambiguous disease diagnosis labels for these two images, the use of the lesion tags can give a more consistent and precise description of the dermatology images.

In this paper, we investigate the performance of CNNs trained with disease and lesion labels, respectively. We collected skin disease images from six different publicly available dermatology atlantes. We then train a multi-class CNN for disease-targeted classification and another multi-label CNN for lesion-targeted classification. Our experimental results show that the top-1 and top-5 accuracies for the disease-targeted classification are 27.6% and 57.9%, with a mean average precision (mAP) of 0.42, while for the lesion-targeted skin disease classification a much higher mAP of 0.70 is achieved.

II Related Work

Much work has been done on computer-aided skin disease classification. However, most of it uses human-engineered feature extraction algorithms and restricts the problem to certain skin diseases, such as melanoma [8, 9, 10, 11, 12]. Some other works [13, 14, 15] use CNNs for unsupervised feature learning from histopathology images and only focus on the detection of mitosis, an indicator of cancer. Recently, Esteva et al. [16] proposed a disease-targeted skin disease classification method using a CNN. They used the dermatology images from the Dermnet atlas, one of the six atlantes used in this study, and reported the top-1 and top-k accuracies their CNN achieved. However, they performed the CNN training and testing on the same dataset without cross-validation, which makes their results less persuasive.

III Datasets

We collect dermatology photos from the following dermatology atlas websites:

  • AtlasDerm (www.atlasdermatologico.com.br)

  • Danderm (www.danderm-pdv.is.kkh.dk)

  • Derma (www.derma.pw)

  • DermIS (www.dermis.net)

  • Dermnet (www.dermnet.com)

  • DermQuest (www.dermquest.com)

These atlantes are maintained by professional dermatology resource providers and are used by dermatologists for training and teaching purposes. All of the dermatology atlantes have diagnosis labels for their images, and each dermatology image is assigned only one disease diagnosis label. We use these diagnosis labels as the ground truth to train the disease-targeted skin disease classifier.

However, each atlas maintains its own skin disease taxonomy and naming convention for the diagnosis labels. This means different atlantes may have different labels for the same diagnosis, and some diagnoses may have several variations. To address this problem, we adopt the skin disease taxonomy used by the DermQuest atlas and merge the diagnosis labels from the other atlantes into it. We choose the DermQuest atlas because of the completeness and professionalism of its dermatology resources. In most cases, labels for the same diagnosis follow similar naming conventions. Therefore, we merge them by looking at the string similarity of two diagnosis labels. We use the string pattern matching algorithm described in [17], where the similarity ratio is

$$r = \frac{2M}{T},\quad (1)$$

where $M$ is the number of matches and $T$ is the total number of characters in both strings. The statistics of the merged atlantes are given in Table I. Note that the total number of diagnoses in our dataset is 2,113, which is significantly higher than in any single atlas. This is because we use a conservative merging strategy: we merge two diagnosis labels only when their string similarity is very high, so that no two diagnosis labels are incorrectly merged. The redundant diagnosis labels that remain contain only a few dermatology images each; we can discard them by choosing a threshold that filters out small diagnosis labels.
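As an illustration, the ratio in Eq. (1) is the same quantity computed by Python's difflib.SequenceMatcher. The following minimal sketch shows the conservative merging strategy under that assumption; the similarity threshold and the label names are illustrative, not the exact settings used in our experiments.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratcliff/Obershelp ratio 2M/T, as in Eq. (1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def merge_labels(base_taxonomy, other_labels, threshold=0.9):
    """Map each label from another atlas onto the closest base
    (DermQuest) label, merging only when similarity is very high."""
    mapping = {}
    for label in other_labels:
        best = max(base_taxonomy, key=lambda t: similarity(label, t))
        if similarity(label, best) >= threshold:
            mapping[label] = best        # merge into the base taxonomy
        else:
            mapping[label] = label       # keep as a separate diagnosis
    return mapping

# Example: a spelling variant of the same diagnosis merges,
# while a genuinely different diagnosis is kept.
taxonomy = ["bowen's disease", "malignant melanoma"]
print(merge_labels(taxonomy, ["bowens disease", "dermatofibroma"]))
```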

For the disease-targeted skin disease classification, we choose the AtlasDerm, Danderm, Derma, DermIS, and Dermnet datasets as the training set and the DermQuest dataset as the test set. Due to the inconsistency of the taxonomy and naming conventions between the atlantes, most diagnosis labels have only a few images. As our goal is to investigate the feasibility of using CNNs for disease-targeted skin disease classification, we remove these noisy diagnosis labels and keep only those with enough images per label; this label refinement and cleaning yields the final training set, test set, and set of diagnosis labels.

For the skin lesions, only the DermQuest dataset contains lesion tags. Unlike the diagnosis, which is unique for each image, multiple lesion tags may be associated with one dermatology image. However, most lesion tags cover only a few images, and some of the tags are duplicated. After merging duplicates and removing infrequent lesion tags, we retain a compact set of lesion tags.

Atlas       # of Images   # of Diagnoses
AtlasDerm        —             478
Danderm          —              97
Derma            —            1195
DermIS           —             651
Dermnet          —             488
DermQuest        —             657
Total         75,665          2113
TABLE I: Dataset statistics

Since only the DermQuest dataset has the lesion tags, we use the images from the DermQuest dataset, restricted to those that have lesion tags, to perform training and testing. As the training and test sets are sampled from the same dataset, we use k-fold cross-validation in our experiment to obtain reliable performance estimates. We first split our dataset into k evenly sized, non-overlapping “folds”. Next, we rotate each fold as the test set and use the remaining k - 1 folds as the training set.
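A minimal sketch of this rotation scheme follows; the dataset size, the number of folds, and the random shuffle are illustrative assumptions.

```python
import numpy as np

def kfold_indices(num_images, num_folds, seed=0):
    """Split image indices into evenly sized, non-overlapping folds,
    then rotate each fold as the test set (Section III)."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(num_images)
    folds = np.array_split(order, num_folds)
    for i in range(num_folds):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# Example with a hypothetical dataset size and 5 folds.
for train_idx, test_idx in kfold_indices(num_images=1000, num_folds=5):
    assert len(set(train_idx) & set(test_idx)) == 0   # folds do not overlap
```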

IV Methodology

We use CNNs for both the disease-targeted and lesion-targeted skin disease classifications. For the disease-targeted classification, we train a multi-class image classifier; for the lesion-targeted classification, we train a multi-label image classifier.

Our CNN architecture is based on AlexNet [18], which we modify according to our needs. AlexNet was one of the early winning entries of the ILSVRC challenges and is considered sufficient for this study; readers may refer to the latest winning entry (MSRA [19], as of ILSVRC 2015) for better performance. Implementation details of training and testing the CNNs are given in the following sections.

IV-A Disease-Targeted Skin Disease Classification

For the disease-targeted skin disease classification, each dermatology image is associated with only one disease diagnosis. Hence, we train a multi-class classifier using a CNN. We fine-tune the CNN from the BVLC AlexNet model [20], which is pre-trained on the ImageNet dataset [7]. Since the number of classes we predict differs from that of the ImageNet images, we replace the last fully-connected layer (1000-dimensional) with a new fully-connected layer whose number of outputs is set to the number of skin diagnoses in our dataset. We also increase the learning rate of the weights and bias of this layer, as the parameters of the newly added layer are randomly initialized. For the loss function, we use the softmax function [21, Chapter 3] and connect a new softmax layer to the newly added fully-connected layer. Formally, let $z^L_j$ be the weighted input of the $j$th neuron of the softmax layer, where $L$ is the total number of layers in the CNN (for AlexNet, $L = 8$). The $j$th activation of the softmax layer is

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}},\quad (2)$$

and the corresponding softmax loss is

$$C = -\frac{1}{n} \sum_{i=1}^{n} \ln a^L_{y_i},\quad (3)$$

where $n$ is the number of images in a mini-batch, $y_i$ is the ground truth of the $i$th image, and $a^L_{y_i}$ is the $y_i$th activation of the softmax layer. In the test phase, we choose the label that yields the largest activation as the prediction, i.e.,

$$\hat{y} = \operatorname*{arg\,max}_j\, a^L_j.\quad (4)$$
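For concreteness, the following NumPy sketch implements Eqs. (2)-(4) on a toy mini-batch; the class count and scores are illustrative, and the actual computation in our experiments is performed by Caffe's layers.

```python
import numpy as np

def softmax(z):
    """Eq. (2): softmax activations from the weighted inputs z^L."""
    z = z - z.max(axis=1, keepdims=True)      # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_loss(z, y):
    """Eq. (3): mean negative log-likelihood over a mini-batch,
    where y[i] is the ground-truth class index of the i-th image."""
    a = softmax(z)
    n = z.shape[0]
    return -np.mean(np.log(a[np.arange(n), y]))

def predict(z):
    """Eq. (4): the label with the largest activation."""
    return softmax(z).argmax(axis=1)

# Toy mini-batch: 4 images, 3 hypothetical diagnosis classes.
z = np.random.randn(4, 3)
y = np.array([0, 2, 1, 0])
print(softmax_loss(z, y), predict(z))
```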

IV-B Lesion-Targeted Skin Disease Classification

As we mentioned earlier, multiple lesion tags may be associated with a dermatology image. Therefore, to classify skin lesions we need to train a multi-label CNN. Similar to the disease-targeted skin disease classification, we fine-tune the multi-label CNN from the BVLC AlexNet model. To train a multi-label CNN, two data layers are required: one loads the dermatology images and the other loads the corresponding lesion tags. Given an image $x$ from the first data layer, its corresponding lesion tags from the second data layer are represented as a binary vector $y = (y_1, \dots, y_K)$, where $K$ is the number of lesion tags in our dataset and $y_j$ is given as

$$y_j = \begin{cases} 1 & \text{if } x \text{ carries lesion tag } j, \\ 0 & \text{otherwise.} \end{cases}\quad (5)$$
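As a small illustration of Eq. (5), the sketch below encodes a set of lesion tags into the binary target vector; the tag vocabulary shown is hypothetical.

```python
import numpy as np

# Hypothetical tag vocabulary; the real vocabulary comes from DermQuest.
LESION_TAGS = ["erythematous", "hyperpigmented", "nodular", "scales"]

def encode_tags(image_tags, vocabulary=LESION_TAGS):
    """Eq. (5): binary target vector y, with y[j] = 1 iff the image
    carries the j-th lesion tag."""
    y = np.zeros(len(vocabulary), dtype=np.float32)
    for tag in image_tags:
        y[vocabulary.index(tag)] = 1.0
    return y

# e.g., the images in the first column of Figure 1 carry these two tags.
print(encode_tags(["hyperpigmented", "nodular"]))   # [0. 1. 1. 0.]
```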

We replace the last fully-connected layer of AlexNet with a new fully-connected layer to accommodate the lesion tag vector. The learning rate of the parameters of this layer is also increased so that the CNN can learn features of the dermatology images rather than retain those of the ImageNet images. For the multi-label CNN, we use the sigmoid cross-entropy [21, Chapter 3] as the loss function and replace the softmax layer with a sigmoid cross-entropy layer. Let $z^L_j$ be the weighted input defined in Section IV-A; then the $j$th activation of the sigmoid cross-entropy layer can be written as

$$a^L_j = \sigma(z^L_j) = \frac{1}{1 + e^{-z^L_j}},\quad (6)$$

and the corresponding cross-entropy loss is

$$C = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K} \left[ y_{ij} \ln a^L_{ij} + (1 - y_{ij}) \ln\!\left(1 - a^L_{ij}\right) \right].\quad (7)$$

For a given image $x$, the output of the multi-label CNN is a confidence vector $f(x) = (a^L_1, \dots, a^L_K)$, where $a^L_j$ denotes the confidence of $x$ being related to lesion tag $j$. In the test phase, we use a threshold function $t(x)$ to determine the lesion tags of the input image $x$, i.e., $\hat{y} = (\hat{y}_1, \dots, \hat{y}_K)$ where

$$\hat{y}_j = \begin{cases} 1 & \text{if } a^L_j \ge t(x), \\ 0 & \text{otherwise.} \end{cases}\quad (8)$$

For the choice of the threshold function $t(x)$, we adopt the method recommended in [22], which picks a linear function of the confidence vector by maximizing the multi-label accuracy on the training set.
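The following NumPy sketch implements Eqs. (6)-(8) on toy inputs. The constant threshold shown is a simplification of the linear threshold function fitted as in [22], and the real computation is again performed by Caffe layers.

```python
import numpy as np

def sigmoid(z):
    """Eq. (6): element-wise sigmoid activations."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(z, y):
    """Eq. (7): mean sigmoid cross-entropy over a mini-batch of
    binary tag vectors y (one row per image)."""
    a = np.clip(sigmoid(z), 1e-7, 1 - 1e-7)   # clip to avoid log(0)
    return -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=1))

def predict_tags(confidence, threshold):
    """Eq. (8): tag j is predicted iff its confidence reaches the
    threshold; a constant threshold stands in for the fitted t(x)."""
    return (confidence >= threshold).astype(int)

z = np.random.randn(2, 4)                     # 2 images, 4 lesion tags
y = np.array([[1, 0, 1, 0], [0, 1, 0, 0]], dtype=float)
print(sigmoid_cross_entropy(z, y))
print(predict_tags(sigmoid(z), threshold=0.5))
```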

V Experimental Results

In this section, we investigate the performance of the CNNs trained for the disease-targeted and lesion-targeted skin disease classifications, respectively. For both classifications, we use transfer learning [23] (we use transfer learning and fine-tuning interchangeably in this paper) to train the CNNs. However, note that the ImageNet pre-trained models are trained on images containing mostly artifacts, animals, and plants, which is very different from our skin disease images. To investigate the features learned only from skin diseases and to avoid relying on irrelevant features, we also train the CNNs from scratch.

We conduct all the experiments using the Caffe deep learning framework [20] and run the programs on a GeForce GTX 970 GPU. For the hyper-parameters, we follow the settings used by AlexNet, i.e., its batch size, a momentum of 0.9, and a weight decay of 0.0005. We use a lower base learning rate for fine-tuning than for training from scratch.

V-A Performance of Disease-Targeted Classification

Learning Type Top-1 Accuracy Top-5 Accuracy mAP
Fine-tuning 27.6% 57.9% 0.42
Scratch 21.1% 48.9% 0.35
TABLE II: Accuracies and mAP of the disease-targeted classification
Fig. 2: The confusion matrix of the disease-targeted skin disease classifier with the CNN trained using fine-tuning. Row: actual diagnosis. Column: predicted diagnosis.

Fig. 3: Macro-averages of precisions, recalls, and F-measures, as well as mAP.

To evaluate the performance of the disease-targeted skin disease classifier, we use the top-1 and top-5 accuracies, the mAP score, and the confusion matrix as the metrics. Following the notations in Section IV, let $f(x)$ be the output of the multi-class CNN when the input is $x$, and let $\mathrm{top}_k(f(x))$ be the labels of the $k$ largest elements in $f(x)$. The top-$k$ accuracy of the multi-class CNN on the test set is given as

$$\mathrm{acc}_k = \frac{1}{N} \sum_{i=1}^{N} \delta_k(x_i, y_i),\quad (9)$$

where $\delta_k$ is

$$\delta_k(x, y) = \begin{cases} 1 & \text{if } y \in \mathrm{top}_k(f(x)), \\ 0 & \text{otherwise,} \end{cases}\quad (10)$$

and $N$ is the total number of images in the test set. For the mAP, we adopt the definition described in [24]:

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \sum_{r=1}^{K} P_i(r)\, \Delta R_i(r),\quad (11)$$

where $P_i(r)$ and $R_i(r)$ denote the precision and recall of the $i$th image at rank $r$, $\Delta R_i(r)$ denotes the change in recall from $r-1$ to $r$, and $K$ is the total number of possible labels. Finally, for the confusion matrix $M$, its elements are given as

$$M_{pq} = \frac{\left|\{i : y_i = p,\ \hat{y}_i = q\}\right|}{N_p},\quad (12)$$

where $p$ is the ground truth, $q$ is the prediction, and $N_p$ is the number of images whose ground truth is $p$.
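As a reference for Eqs. (9), (10), and (12), the NumPy sketch below computes the top-k accuracy and a row-normalized confusion matrix; the scores and labels are toy inputs.

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Eqs. (9)-(10): fraction of test images whose ground-truth label
    is among the k highest-scoring predictions."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return np.mean(hits)

def confusion_matrix(labels, predictions, num_classes):
    """Eq. (12): M[p, q] is the fraction of images with ground truth p
    that were predicted as q (each row sums to 1)."""
    M = np.zeros((num_classes, num_classes))
    for y, y_hat in zip(labels, predictions):
        M[y, y_hat] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return M / np.maximum(row_sums, 1)

scores = np.random.randn(6, 5)              # 6 test images, 5 classes
labels = np.array([0, 1, 2, 3, 4, 0])
print(topk_accuracy(scores, labels, k=5))   # always 1.0 when k = #classes
print(confusion_matrix(labels, scores.argmax(axis=1), num_classes=5))
```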

Fig. 4: Label-based precisions (a), recalls (b), and F-measures (c).

Table II shows the accuracies and mAP of the disease-targeted skin disease classifiers with the CNNs trained from scratch or using fine-tuning. It is interesting to note that the CNN trained using transfer learning performs better than the CNN trained from scratch on skin disease images alone. This suggests that the more general features learned from the richer set of ImageNet images still benefit the more specific classification of skin diseases, and that training from scratch does not necessarily help the CNN learn more useful features for skin diseases. However, even for the CNN trained with fine-tuning, the accuracies and mAP are not satisfactory: only a 27.6% top-1 accuracy, a 57.9% top-5 accuracy, and a 0.42 mAP score are achieved.

The confusion matrix computed for the fine-tuned CNN is given in Figure 2. The row indices correspond to the actual diagnosis labels and the column indices denote the predicted diagnosis labels. Each cell is computed using Equation (12), i.e., the percentage of the prediction $q$ among images with ground truth $p$. A good multi-class classifier should have high diagonal values. We find in Figure 2 that there are some off-diagonal cells with relatively high values. This is because some skin diseases are visually similar, and the CNNs trained with diagnosis labels still cannot distinguish among them. For example, one off-diagonal cell with a relatively high value lies at the row for “compound nevus” and the column for “malignant melanoma”, meaning that a substantial fraction of the “compound nevus” images are incorrectly labeled as “malignant melanoma”. If we look at the two images in the first column of Figure 1, we can see that these two diseases look so similar in appearance that, not surprisingly, the disease-targeted classifier fails to distinguish them.

V-B Performance of Lesion-Targeted Classification

As we use a multi-label classifier for the lesion-targeted skin disease classification, the evaluation metrics used in this experiment differ from those used in the previous section. To evaluate the performance of the classifier on each label, we use the label-based precision, recall, and F-measure; to evaluate the overall performance, we use the macro-averages of the precision, recall, and F-measure. In addition, the mAP is also used as an overall evaluation metric.

Let $G_j$ be the set of images whose ground truth contains lesion tag $j$ and $P_j$ be the set of images whose prediction contains lesion tag $j$. Then, the label-based and the macro-averaged precision, recall, and F-measure can be defined as

$$\mathrm{Precision}_j = \frac{|G_j \cap P_j|}{|P_j|},\quad \mathrm{Recall}_j = \frac{|G_j \cap P_j|}{|G_j|},\quad F_j = \frac{2\,\mathrm{Precision}_j\,\mathrm{Recall}_j}{\mathrm{Precision}_j + \mathrm{Recall}_j},$$

$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{K} \sum_{j=1}^{K} \mathrm{Precision}_j,\quad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{K} \sum_{j=1}^{K} \mathrm{Recall}_j,\quad F_{\mathrm{macro}} = \frac{1}{K} \sum_{j=1}^{K} F_j,\quad (13)$$

where $K$ is the total number of possible lesion tags.
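Eq. (13) can be computed directly from binary ground-truth and prediction matrices, as in the following NumPy sketch (the matrices are toy inputs):

```python
import numpy as np

def label_based_metrics(Y_true, Y_pred):
    """Eq. (13): per-tag precision, recall, and F-measure, plus their
    macro-averages. Y_true and Y_pred are binary matrices of shape
    (num_images, num_tags)."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0).astype(float)
    pred_pos = np.maximum(Y_pred.sum(axis=0), 1)   # |P_j|, guarded
    true_pos = np.maximum(Y_true.sum(axis=0), 1)   # |G_j|, guarded
    precision = tp / pred_pos
    recall = tp / true_pos
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-7)
    return precision, recall, f1, precision.mean(), recall.mean(), f1.mean()

Y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])
Y_pred = np.array([[1, 0, 1], [0, 0, 1], [0, 0, 1]])
p, r, f, macro_p, macro_r, macro_f = label_based_metrics(Y_true, Y_pred)
print(p, r, f, macro_f)
```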

Figure 3 shows the overall performance of the lesion-targeted skin disease classifiers. The mean average precision is about 0.70, which is quite good for a multi-label problem. The label-based precisions, recalls, and F-measures are given in Figure 4. We can see that, for the lesion-targeted skin disease classification, the fine-tuned CNN performs better than the CNN trained from scratch, which is consistent with our observation in Table II. This means that, for the lesion-targeted skin disease classification problem, it is still beneficial to initialize with weights from ImageNet pre-trained models. We also see that the label-based metrics are mostly high in the fine-tuning case, with exceptions including atrophy, erythemato-squamous, excoriation, oozing, and vesicle. These failures are mostly due to 1) lesions that are not visually salient or are masked by other larger lesions, or 2) sloppy labeling of the ground truth.

Some failure cases are shown in Figure 5. The first image is labeled as atrophy; however, the atrophic characteristic is not obvious and the lesion looks more erythematous. For the second image, the ground truth is excoriation, namely the little white scars on the back; however, the red erythematous lesion is more apparent, so the CNN incorrectly classified it as an erythematous lesion. A similar case can be found in the third image. For the fourth image, the ground truth is actually incorrect.

Fig. 5: Failure cases. Ground truth (left to right): atrophy, excoriation, hypopigmented, vesicle. Top prediction (left to right): erythematous, erythematous, ulceration, edema.

Figure 6 shows image retrievals using the lesion-targeted classifier. Here, we take the output of the second-to-last fully-connected layer (4096-dimensional) as the feature vector. For each query image from the test set, we compare its features with those of all the images in the training set and output the 5 nearest neighbors (in Euclidean distance) as the retrievals. The retrieved images with green solid frames match at least one lesion tag of the query image, while those with red dashed frames have no lesion tags in common with the query image. We can see that the retrieved images are visually and semantically similar to the query images.
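A minimal sketch of this retrieval procedure, assuming 4096-dimensional fc7-style feature vectors and random toy data, is:

```python
import numpy as np

def retrieve(query_feature, gallery_features, k=5):
    """Return indices of the k nearest training images in Euclidean
    distance on the feature vectors (one row per gallery image)."""
    distances = np.linalg.norm(gallery_features - query_feature, axis=1)
    return np.argsort(distances)[:k]

gallery = np.random.randn(100, 4096)        # 100 training images, fc7 dim
query = np.random.randn(4096)
print(retrieve(query, gallery, k=5))        # 5 retrieved neighbors
```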

Fig. 6: Images retrieved by the lesion-targeted classifier. Row 1: the query images from the test set. Rows 2-6: the retrieved images from the training set. Dotted borders annotate errors. Ground truth of the test images from columns A to F: (crust, ulceration), (hyperpigmented, tumour), (scales), (erythematous, telangiectasis), (nail hyperpigmentation, onycholysis), (edema, erythematous).

VI Conclusion

In this study, we have shown that, for skin disease classification using CNNs, lesion tags rather than diagnosis tags should be considered as the target for automated analysis. To achieve better diagnosis results, computer-aided skin disease diagnosis systems could use lesion-targeted CNNs as the cornerstone component to facilitate the final disease diagnosis in conjunction with other evidence. We have built a large-scale dermatology dataset from six professional dermatology atlantes, and we have trained and tested disease-targeted and lesion-targeted classifiers using CNNs, investigating both fine-tuning and training from scratch. We found that, for skin disease images, CNNs fine-tuned from pre-trained models perform better than those trained from scratch. The disease-targeted classification achieves only a 27.6% top-1 accuracy and a 57.9% top-5 accuracy, with a 0.42 mAP; the corresponding confusion matrix contains some high off-diagonal values, which indicates that some skin diseases cannot be distinguished using diagnosis labels. For the lesion-targeted classification, a 0.70 mAP score is achieved, which is remarkable for a multi-label classification problem. Image retrieval results also confirm that CNNs trained using lesion tags learn the dermatology features very well.

Acknowledgment

This work was supported in part by New York State through the Goergen Institute for Data Science at the University of Rochester. We thank VisualDX for discussions related to this work.

References