Skin diseases come in many types, ranging from itching caused by mosquito bites to skin cancer. Traditionally, diagnosis is based on a comprehensive assessment of the size, shape, color, and other visual features of the lesion area. Recently, deep learning methods have been applied to many medical tasks [15, 14] with remarkable results, which makes computer-aided diagnosis of skin diseases a promising direction. Many studies [13, 21] on dermoscopic images have achieved promising results. Although there are some works [16, 19] on clinical skin disease images, their datasets are small. To fill this gap, we first build a clinical skin disease dataset, namely Skin-10, which has 10,218 images of 10 common skin diseases, with the skin lesions manually marked in each image. We then expand Skin-10 to Skin-100, which covers 100 classes of skin diseases and contains 19,807 clinical images. To the best of our knowledge, Skin-100 is larger than all existing clinical skin disease datasets, and Skin-10 is the first clinical skin disease dataset that provides bounding boxes. With the bounding boxes, more discriminative features can be extracted, which in turn can improve classification performance.
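| Dataset | PH2 [13] | ISIC | HAM10000 [21] | UCI [4] | / | | SD-128 / SD-198 [19] | Google [12] | Skin-10 / Skin-100 |
|---|---|---|---|---|---|---|---|---|---|
| Type (D: dermoscopic, C: clinical) | D | D | D | C | C / C | C | C / C | C | C / C |
| #Classes | 3 | - | 7 | 6 | 3 / 7 | 44 | 128 / 198 | 26 | 10 / 100 |
| #Images | 200 | 23,801 | 10,015 | 366 | 90 / 706 | 2,309 | 5,619 / 6,584 | 17,777 | 10,218 / 19,807 |
| Annotation? | Y | Y | N | N | N / N | N | N / N | N | Y / N |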
Unlike dermoscopic images, which are hard to obtain because of their high cost and inconvenient access [20], clinical images can be captured by readily available devices such as smartphones. However, recognizing skin lesions from a clinical image poses several difficulties. (a) The background of clinical images is more complex, so reducing background noise is an important issue. (b) Clinical images cover far more classes of diseases than dermoscopic images, because the dermatoscope is designed primarily for skin cancers such as melanoma and basal cell carcinoma. (c) Clinical skin disease image classification can be regarded as a fine-grained classification task, which is more difficult than an ordinary classification task (e.g. CIFAR10 [9]). On the one hand, lesions of a single class may appear on different parts of the body and vary greatly in visual characteristics, which indicates high intra-class variance. On the other hand, two lesions that look very similar may belong to different categories, which indicates low inter-class variance. Despite these difficulties, applying deep learning techniques to our datasets to address these problems would be helpful for both doctors and patients.
To verify whether deep neural networks work well for clinical skin disease classification, we first benchmark four SOTA CNNs on Skin-10 and Skin-100: ResNet50 [6], DenseNet121 [8], Nasnetamobile [24], and Pnasnet5large [11], and we implement ensemble methods based on these four models. In addition, we apply two SOTA object detection models (RetinaNet [10] and Faster-RCNN [18]) to Skin-10 to detect candidate lesion regions in the raw images and then classify the disease based on the region with the highest confidence score. The experimental results show that the object detection based method can reduce the influence of the background and achieve higher accuracy than plain CNN models.
In summary, this paper has the following contributions. (a) We build two versions of clinical skin disease datasets: Skin-10 and Skin-100. As far as we know, Skin-10 is the first clinical skin disease dataset providing bounding boxes and the scale of Skin-100 is larger than most existing clinical skin disease datasets. (b) We establish the baseline performance of four SOTA CNN models on our datasets. (c) We verify the effectiveness of the ensemble method which achieves the best accuracy on both datasets. (d) We propose an object detection based approach and evaluate its performance on Skin-10.
The remainder of this paper is organized as follows. Section II reviews existing skin disease datasets and methods for recognizing skin diseases. Section III presents the details of Skin-10 and Skin-100. Section IV describes the CNN-based, ensemble, and object detection based methods, and analyzes the experimental results. Finally, Section V offers concluding remarks and directions for future work.
II Related Work
II-A Skin Disease Datasets
There are two types of skin disease images: dermoscopic and clinical. The former are obtained with a dermatoscope, which requires a high-quality magnifying lens and a powerful lighting system. Hence, dermoscopic images have a simpler background, and the distribution, number, size, shape, and color of skin lesions are much clearer than in clinical images. Table I summarizes some existing skin disease datasets. The PH2 dataset [13] contains 200 dermoscopic images of melanocytic lesions, including 80 common nevi, 80 atypical nevi, and 40 melanomas. HAM10000 [21] collects dermoscopic images from different populations and consists of 10,015 images from 7 classes. The ISIC Archive (https://isic-archive.com/) provides both dermoscopic and clinical images, which are categorized into many classes based on different attributes, such as diagnostic and clinical attributes. ISIC has 23,801 dermoscopic images but only 100 clinical skin images. The dataset in [4] is from the UCI Machine Learning Repository and contains six classes of skin diseases. Three further clinical datasets are proposed in [16, 17]: 90 dermatological images covering 3 skin disease classes, 706 images covering 7 classes, and 2,309 images covering 44 classes. Another two datasets are proposed in [19]: SD-128, which has 128 classes and 5,619 images, and SD-198, which has 198 classes and 6,584 images. The authors report a classification accuracy of 52.15% on SD-128 and 50.27% on SD-198. A recent work by Google [12] develops a deep learning system for diagnosing 26 classes of clinical skin diseases using 17,777 cases. Each case consists of one or several clinical images and metadata, which includes patient demographic information and medical history.
II-B Skin Disease Classification
Methods for skin disease image classification fall into two groups. The first relies on hand-crafted features, such as texture (SIFT, HOG), color (ColorSIFT, ColorHistogram), and edge (Gabor, Sobel) features. The second relies on features learned by deep neural networks. In [19], a pretrained CNN model is used to extract deep features, and the results show that the CNN performs better than methods based on hand-crafted features when the background is complex. In [3], the authors use a large ensemble of SOTA CNN models and place second at the ISIC 2018 challenge for skin lesion diagnosis. Verma et al. [22] also propose an ensemble method that combines five different data mining techniques. Zhang's work [23] proposes an attention residual learning CNN model to effectively locate skin lesions in dermoscopic images. Models generated by neural architecture search (NAS) have recently been shown to achieve results comparable to human-designed models in many tasks [7, 2]. In this paper, we compare NAS-designed and human-designed CNN models and build an ensemble from both types. We also evaluate two SOTA object detection models, which are expected to locate the skin lesion area and reduce the influence of the background.
| Index | Class name | #Training set | #Testing set |
|---|---|---|---|
| 2 | Atopic Dermatitis Eczema | 590 | 145 |
| 3 | Basal Cell Carcinoma | 3249 | 826 |
Our datasets are built by scraping images from the Internet. We build Skin-10 by selecting 10 classes from the most common skin diseases [5]. As a result, Skin-10 has 10,218 images; its statistics are presented in Table II. Additionally, we use a graphical image annotation tool, LabelImg (https://github.com/tzutalin/labelImg), to mark skin lesions with bounding boxes for each image in Skin-10. We further build a larger dataset based on Skin-10, namely Skin-100, which has 19,807 images of 100 skin diseases, with over 40 images per class. The data distribution of Skin-100 is long-tailed, as shown in Fig 1. In both Skin-10 and Skin-100, the ratio of the training set to the testing set is set at 3:1.
In some images, the skin lesion is covered or has already healed. Hence, we perform data cleaning to remove a total of 290 noise images. The experimental results in Section IV show that data cleaning improves the performance of all CNN models.
Scale. To the best of our knowledge, the scale of Skin-100 is much larger than that of existing clinical skin disease datasets. As Table I shows, Skin-100 contains roughly 3 to 200 times as many clinical skin disease images as other datasets. Besides, Skin-10 is the first clinical skin disease dataset providing bounding boxes for skin lesion detection [19, 16, 17].
Diversity. Our datasets cover different ages, genders, and lesion locations. The difference within a class can be significant, e.g. images from the same class may differ in skin color and lesion shape, whereas the difference between classes can be subtle. In short, our datasets are highly diverse, so it is worthwhile to evaluate whether deep learning techniques are feasible on them.
IV Clinical Skin Disease Image Classification
To establish the baseline performance of CNN models on our datasets without loss of generality, we select four representative models of two types: (a) models designed by human experts and (b) models generated by the NAS algorithm. We also verify the feasibility of the ensemble method using these four models and further evaluate the effectiveness of two SOTA object detection models. The implementation details and results are described below.
IV-A CNN based Classification
In this experiment, two types of SOTA CNN models are used as the baseline models: (a) ResNet50 and DenseNet121, which are designed by human experts, and (b) Nasnetamobile and Pnasnet5large, which are generated automatically by the NAS technique. The pretrained models of (a) and (b) are obtained from torchvision (https://github.com/pytorch/vision/tree/master/torchvision/models) and pretrained-models.pytorch (https://github.com/Cadene/pretrained-models.pytorch), respectively. After fine-tuning the four base CNN models, we ensemble them into a strong classifier, namely EnsembleNet (shown in Fig 2), by summing the probability predictions of the base models:

$$P_{\mathrm{ens}}(x) = \sum_{i=1}^{n} P_i(x)$$

where $x$ indicates the input image, $P_i(x)$ is the probability prediction of the $i$-th base model, and $n$ represents the number of base models. In this work, the best result is obtained by setting $n = 4$, i.e. combining the prediction results of all base models.
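The probability-summing ensemble described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `ensemble_predict` and the toy probabilities are assumptions made for the example.

```python
from typing import Sequence


def ensemble_predict(prob_lists: Sequence[Sequence[float]]) -> int:
    """Sum per-class probabilities over base models and return the arg-max class.

    prob_lists[m][c] is the probability that base model m assigns to class c
    for a single input image.
    """
    if not prob_lists:
        raise ValueError("need at least one base model")
    num_classes = len(prob_lists[0])
    summed = [0.0] * num_classes
    for probs in prob_lists:
        if len(probs) != num_classes:
            raise ValueError("all base models must predict the same classes")
        for c, p in enumerate(probs):
            summed[c] += p
    # The class with the largest summed probability is the ensemble prediction.
    return max(range(num_classes), key=summed.__getitem__)


# Hypothetical softmax outputs of four base models for one image (3 classes).
base_outputs = [
    [0.50, 0.40, 0.10],
    [0.20, 0.70, 0.10],
    [0.45, 0.35, 0.20],
    [0.30, 0.60, 0.10],
]
predicted_class = ensemble_predict(base_outputs)
```

In this toy case two of the four models individually prefer class 0, but the summed probabilities favor class 1, which illustrates how confident models can outvote uncertain ones.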
Before feeding the training set to the model, we apply a series of data augmentations, including resizing, random cropping, flipping, rotation, and normalization; for the testing set, we only perform resizing and normalization. We use the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, which is multiplied by a factor of 0.1 every 10 epochs. The batch size is 64 and the input image size is fixed to 224×224. Cross-entropy is used as the loss function. All baseline models are fine-tuned until convergence.
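The step-decay learning rate schedule described above can be written out as a short worked example. The helper `step_lr` is a hypothetical name used only for illustration; the text does not say which scheduler implementation was used, although the same behavior is what PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.1` would produce.

```python
def step_lr(epoch: int, base_lr: float = 0.01, gamma: float = 0.1,
            step_size: int = 10) -> float:
    """Learning rate at a given (0-indexed) epoch under step decay.

    The rate starts at base_lr and is multiplied by gamma once every
    step_size epochs.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * gamma ** (epoch // step_size)


# Learning rates at a few representative epochs.
schedule = {epoch: step_lr(epoch) for epoch in (0, 9, 10, 25)}
```

Epochs 0 to 9 train at 0.01, epochs 10 to 19 at 0.001, and epochs 20 to 29 at 0.0001 (up to floating-point rounding).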
IV-A2 Results and Analysis
Because the class distribution of our datasets is imbalanced, we use top-k weighted accuracy as the metric. Let $f(x_i)$ be the output of the CNN model when the input is $x_i$, and let $T_k(x_i)$ be the labels of the $k$ largest elements in $f(x_i)$. $y_i$ is the ground-truth label of the $i$-th sample and $w_i$ is the corresponding sample weight. The top-k weighted accuracy on the testing set is given as

$$\mathrm{Acc}_{k} = \frac{1}{N}\sum_{i=1}^{N} w_i \cdot \mathbb{1}\left[\, y_i \in T_k(x_i) \,\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function and $N$ is the number of images in the test set.
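A small sketch may help make the top-k weighted accuracy concrete. Two points here are assumptions rather than facts from the text: the sample weights are taken to be class-balanced, $w_i = N / (C \cdot n_{y_i})$ with $n_{y_i}$ the number of test images in the class of sample $i$, and the weighted sum of hits is normalized by the number of test images. The function names are likewise hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def balanced_sample_weights(labels: Sequence[int]) -> List[float]:
    """ASSUMPTION: class-balanced weights w_i = N / (C * n_c).

    With this choice the weights sum to N, so every class contributes
    equally to the metric regardless of its size.
    """
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return [n / (c * counts[y]) for y in labels]


def top_k_weighted_accuracy(scores: Sequence[Sequence[float]],
                            labels: Sequence[int],
                            weights: Sequence[float],
                            k: int = 1) -> float:
    """Weighted fraction of samples whose true label is among the k top scores."""
    if not (len(scores) == len(labels) == len(weights)):
        raise ValueError("scores, labels and weights must have equal length")
    total = 0.0
    for s, y, w in zip(scores, labels, weights):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        if y in top_k:
            total += w
    return total / len(labels)


# Toy test set: three images of class 0 and one image of class 1.
toy_labels = [0, 0, 0, 1]
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
toy_weights = balanced_sample_weights(toy_labels)
acc_top1 = top_k_weighted_accuracy(toy_scores, toy_labels, toy_weights, k=1)
acc_top2 = top_k_weighted_accuracy(toy_scores, toy_labels, toy_weights, k=2)
```

On the toy data the plain accuracy is 3/4, while the weighted top-1 accuracy is 5/6, because the single correctly classified minority-class image carries a larger weight than each majority-class image.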
Table III presents the results of the baseline models on Skin-10, Skin-100, and Skin-100-Noise. Skin-100-Noise denotes the version of the dataset whose training set still contains the noise images, whereas in Skin-100 they have been removed. Both share the same testing set, which allows a fair comparison. Table III shows that removing the noise images improves the classification accuracy of all CNN models, especially ResNet50, whose top-1 accuracy improves by 7.34%. We also find that, after data cleaning, the NAS-designed models improve less than the human-designed models, which suggests that the NAS-designed models are more robust to noise and able to extract more discriminative features.
We further employ an ensemble method and evaluate all possible combinations of base models. The experimental results show that the ensemble of all four base models (EnsembleNet) achieves the highest top-1 accuracy (79.01%), and the ensemble of DenseNet121, Nasnetamobile, and Pnasnet5large obtains the best top-3 accuracy (95.42%). Moreover, all ensemble models outperform every single base model.
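Evaluating every combination of base models can be sketched with `itertools.combinations`. The sketch below is illustrative only: the helper names and the toy probabilities are made up, and it scores each ensemble with plain (unweighted) top-1 accuracy for brevity rather than the weighted metric used in the paper.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def ensemble_accuracy(model_probs: Dict[str, List[List[float]]],
                      labels: Sequence[int],
                      members: Tuple[str, ...]) -> float:
    """Top-1 accuracy of the probability-sum ensemble of the given members."""
    correct = 0
    for i, y in enumerate(labels):
        num_classes = len(model_probs[members[0]][i])
        summed = [sum(model_probs[m][i][c] for m in members)
                  for c in range(num_classes)]
        pred = max(range(num_classes), key=summed.__getitem__)
        correct += int(pred == y)
    return correct / len(labels)


def evaluate_combinations(model_probs: Dict[str, List[List[float]]],
                          labels: Sequence[int],
                          min_size: int = 2):
    """Score every subset of base models with at least min_size members."""
    names = sorted(model_probs)
    results = {}
    for size in range(min_size, len(names) + 1):
        for members in combinations(names, size):
            results[members] = ensemble_accuracy(model_probs, labels, members)
    best = max(results, key=results.get)
    return best, results


# Hypothetical per-image probabilities (2 images, 2 classes) for each model.
toy_labels = [0, 1]
toy_probs = {
    "ResNet50": [[0.60, 0.40], [0.70, 0.30]],
    "DenseNet121": [[0.40, 0.60], [0.20, 0.80]],
    "Nasnetamobile": [[0.70, 0.30], [0.40, 0.60]],
    "Pnasnet5large": [[0.55, 0.45], [0.45, 0.55]],
}
best_members, all_results = evaluate_combinations(toy_probs, toy_labels)
```

With four base models there are 6 pairs, 4 triples, and 1 full ensemble, so 11 combinations are scored in total.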
Although EnsembleNet improves classification accuracy by integrating the base models, Fig 3 (left) shows that EnsembleNet and the base CNN models behave similarly on Skin-10, e.g. they all achieve high scores on classes 3, 5, and 8, but perform poorly on classes 1, 2, 4, and 9. Fig 3 (right) plots the confusion matrix of EnsembleNet, from which we can see that the classes that confuse EnsembleNet fall into two groups: (a) classes 0, 2, and 9, and (b) classes 1, 3, 4, 6, and 7. The accuracy of classes 0 and 3 is relatively high probably because they have many more images. But why is the accuracy of classes 5 and 8 high even though they have fewer images than classes 0 and 3? Inspecting the images, we find that most lesions of classes 5 and 8 are located on the nails and legs, respectively, while the lesions of other classes are distributed over similar regions (e.g. face, chest, or an unknown area) and also look similar. Hence, we conclude that because most lesions of classes 5 and 8 occur in specific regions, the CNN models can classify them correctly even without correctly locating the lesion regions. For classes with similar lesion locations, however, the CNN models may misclassify images because they focus on the wrong region or fail to extract effective and discriminative features from the lesion area.
IV-B Object Detection based Classification
To alleviate this problem of the plain CNN models, we evaluate two SOTA object detection models (RetinaNet and Faster-RCNN), which are expected to learn to detect the skin lesion region and classify the image based on it. The process is shown in Fig 4.
Implementations of both RetinaNet and Faster-RCNN are publicly available in mmdetection (https://github.com/open-mmlab/mmdetection). The backbone of both models is ResNeXt101_64x4d. The input images are resized to 400×400. The models are optimized by SGD with an initial learning rate of 0.001, and the learning rate is dropped by a factor of 10 every 10 epochs. Besides, for RetinaNet, the focal loss [10] is used to replace the commonly used cross-entropy loss, and we set .
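For readers unfamiliar with the focal loss, its standard binary form is $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, which down-weights easy, well-classified examples relative to cross-entropy. The sketch below is a scalar illustration of that formula only. The function name is hypothetical, and the values $\gamma = 2$ and $\alpha = 0.25$ used in the example are common illustrative choices, not settings taken from this paper.

```python
import math


def focal_loss(p: float, target: int, gamma: float, alpha: float) -> float:
    """Binary focal loss for a single prediction.

    p      : predicted probability of the positive class, strictly in (0, 1)
    target : 1 for a positive example, 0 for a negative example
    gamma  : focusing parameter; gamma = 0 removes the modulating factor
    alpha  : weight of the positive class (negatives get 1 - alpha)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative values only (not the paper's settings).
easy = focal_loss(0.9, target=1, gamma=2.0, alpha=0.25)
hard = focal_loss(0.6, target=1, gamma=2.0, alpha=0.25)
```

With `gamma=0` the function reduces to an alpha-weighted cross-entropy, and increasing `gamma` shrinks the loss of the easy example (`p = 0.9`) much more than that of the hard one (`p = 0.6`).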
IV-B2 Results and Analysis
Mean Average Precision (mAP) is one of the most commonly used metrics for object detection. However, in Skin-10 each image corresponds to only one class of skin disease, and we care more about whether the image is classified correctly. Therefore, we take the prediction of the box with the highest confidence as the prediction for the image.
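Let $\hat{y}_i$ and $y_i$ be the prediction and the ground truth for the $i$-th image, respectively. $w_i$ is the sample weight for the $i$-th image, as defined in Section IV-A2. $N$ and $C$ indicate the number of testing images and classes, respectively. Then the top-1 classification accuracy is defined as follows:

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} w_i \cdot \mathbb{1}\left[\, \hat{y}_i = y_i \,\right], \qquad \hat{y}_i = \arg\max_{c \in \{1,\dots,C\}} \; \max_{j \in \{1,\dots,B_i\}} p_{i,j,c}$$

where $\mathbb{1}[\cdot]$ is the indicator function, $B_i$ represents the number of predicted boxes in the $i$-th image, and $p_{i,j,c}$ indicates the probability of being predicted as the $c$-th class for the $j$-th box in the $i$-th image.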
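The image-level prediction rule above (take the class of the most confident box) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the weighted sum is normalized by the number of test images, and an image with no detected box is assumed to count as a misclassification.

```python
from typing import List, Optional, Sequence


def image_prediction(box_probs: Sequence[Sequence[float]]) -> Optional[int]:
    """Class of the single most confident (box, class) pair.

    box_probs[j][c] is the probability of class c for box j.
    Returns None when the detector produced no boxes for the image.
    """
    best_class, best_score = None, float("-inf")
    for probs in box_probs:
        for c, p in enumerate(probs):
            if p > best_score:
                best_class, best_score = c, p
    return best_class


def detection_top1_accuracy(all_box_probs: Sequence[Sequence[Sequence[float]]],
                            labels: Sequence[int],
                            weights: Sequence[float]) -> float:
    """Weighted top-1 accuracy of the detection-based classifier."""
    total = 0.0
    for box_probs, y, w in zip(all_box_probs, labels, weights):
        # ASSUMPTION: an image with no boxes (prediction None) is a miss.
        if image_prediction(box_probs) == y:
            total += w
    return total / len(labels)


# Hypothetical detector outputs for three images and three classes.
toy_boxes: List[List[List[float]]] = [
    [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]],  # two boxes, best score 0.8 for class 1
    [],                                  # no lesion detected
    [[0.2, 0.2, 0.6]],                   # one box, class 2
]
toy_labels = [1, 0, 2]
toy_weights = [1.0, 1.0, 1.0]
toy_acc = detection_top1_accuracy(toy_boxes, toy_labels, toy_weights)
```

The second toy image shows the failure mode discussed in the analysis: when no box is detected, the image cannot be classified, so the accuracy here is 2/3.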
Table IV shows that detecting skin lesions improves the classification accuracy of most classes. However, the accuracy of some classes (e.g. class 8) decreases significantly for both RetinaNet and Faster-RCNN. There are several possible explanations. First, the prediction is based only on the box with the highest confidence, so it loses the global features that are important for classification. This supports the conclusion of the previous experiment that base CNN models achieve higher accuracy on some classes (e.g. class 8) because the lesion locations of these classes are distinctive and not easily confused with those of other classes. Second, although both object detection models detect lesions well in most cases, they may detect the wrong area or even fail to detect any target when the skin lesion area is unclear or affected by background noise, which leads to a failed prediction.
V Conclusions and Future Work
In this paper, we investigated how deep learning techniques could help image-based skin disease diagnosis. We developed two versions of clinical skin disease datasets from Internet images: Skin-10, which consists of 10,218 images of 10 common skin disease classes with bounding boxes surrounding the lesion, and Skin-100, which contains 19,807 images of 100 skin disease classes. We found that data cleaning is very important as it can help improve the top-1 accuracy by 4% on average in our experiments. We also found that the ensemble method makes more efficient use of the dataset and outperforms any single CNN model on both Skin-10 and Skin-100. We further evaluated two SOTA object detection models that are used to reduce the influence of image background. Our results showed that the object detection models outperform the classification based solutions.
Although deep learning techniques can achieve satisfactory performance in skin disease image classification, there is still room for improvement. Based on the analysis of the experimental results, we summarize the following directions for future study. First, the models generated by AutoML techniques may extract more discriminative features and be more robust to noisy data than human-designed models, so AutoML could be explored to generate a task-specific architecture for clinical skin disease datasets. Second, although the ensemble method is easy to implement and effective, it does not improve the accuracy of some hard-to-classify classes; improving the accuracy of these classes remains challenging. Third, the object detection based approach can reduce the influence of the background by detecting the local region of skin lesions, but it loses the global features; combining global and local features is therefore another promising direction. Finally, our dataset is built from Internet images and cannot be released to the public because of copyright issues. We hope the healthcare community can join hands to develop an open and large skin disease dataset.
References
- [1] (2009) Imagenet: a large-scale hierarchical image database. pp. 248–255.
- [2] (2018) Neural architecture search: a survey. arXiv preprint arXiv:1808.05377.
- [3] (2018) Skin lesion diagnosis using ensembles, unscaled multi-crop evaluation and loss weighting. arXiv preprint arXiv:1808.01694.
- [4] (1998) Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine 13 (3), pp. 147–165.
- [5] (2014) The global burden of skin disease in 2010: an analysis of the prevalence and impact of skin conditions. Journal of Investigative Dermatology 134 (6), pp. 1527–1534.
- [6] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [7] (2019) AutoML: a survey of the state-of-the-art. arXiv preprint arXiv:1908.00709.
- [8] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- [9] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
- [10] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
- [11] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
- [12] (2019) A deep learning system for differential diagnosis of skin diseases. arXiv preprint arXiv:1909.05382.
- [13] (2013) PH2 - a dermoscopic image database for research and benchmarking. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, pp. 5437–5440.
- [14] (2012) Inbreast: toward a full-field digital mammographic database. Academic Radiology 19 (2), pp. 236–248.
- [15] (2017) Mura: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957.
- [16] (2012) Skin lesion image recognition with computer vision and human in the loop. Medical Image Understanding and Analysis (MIUA), Swansea, UK, pp. 167–172.
- [17] (2013) Interactive skin condition recognition. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
- [18] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- [19] (2016) A benchmark for automatic visual classification of clinical skin disease images. In European Conference on Computer Vision, pp. 206–222.
- [20] (2010) Global access to dermatopathology services: physician survey of availability and needs in sub-saharan africa. Journal of the American Academy of Dermatology 63 (2), pp. 346–348.
- [21] (2018) The ham10000 dataset: a large collection of multi-source dermatoscopic images of common pigmented skin lesions. arXiv preprint arXiv:1803.10417.
- [22] (2019) Classification of skin disease using ensemble data mining techniques. Asian Pacific Journal of Cancer Prevention 20 (6), pp. 1887–1894.
- [23] (2019) Attention residual learning for skin lesion classification. IEEE Transactions on Medical Imaging.
- [24] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.