Deep learning (DL) has found increasing use in the healthcare domain in recent years. With its ability to find patterns imperceptible to human evaluation, it promises to be a powerful tool for improving the quality of treatment. The US Food and Drug Administration (FDA) has recently approved diagnostic tools such as QuantX and EchoMD AutoEF, which significantly improve the diagnosis of breast cancer and cardiac abnormalities, respectively (Allen, 2019). DeepMind is collaborating with the National Health Service (NHS) of the UK to streamline patient care (Powles and Hodson, 2017). Quality dermatological attention is also an established need. Approximately 1.9 billion people worldwide have some skin condition, with skin diseases being the fourth most common source of morbidity (Liu et al., 2019). However, there is a shortage of dermatologists to treat this growing patient pool. In the US, only 3.6 dermatologists are reported for every 10,000 people (Kimball and Resneck Jr, 2008). Canada and Japan are exploring telemedicine as a means of providing specialist opinion in remote areas (Lanzini et al., 2012; Dekio et al., 2010; Imaizumi et al., 2017).
The variety of skin conditions is fairly large. Diseases such as contact dermatitis and ringworm, though not life-threatening, spread virulently. According to estimates by the National Institutes of Health (NIH) in the US, one out of five US citizens is at risk of morbidity due to debilitating skin conditions (Stern, 2010). With timely intervention, many of these conditions can be resolved. The survival rate is close to 98% with remedial therapeutics.
At a time when the need for dermatological expertise is increasing due to growing incidence, a constant undersupply of specialists is leading to long waiting queues (Mishra et al., 2018; Suneja et al., 2001). In the absence of immediate specialist attention, patients tend to consult general practitioners, whose diagnostic accuracy is reported to be between 24% and 70%, with around 57% concurrence with specialist opinion (Liu et al., 2019; Lowell et al., 2001). Specialist opinion can also vary from person to person, which increases the risk to the patient.
Computer vision has gained traction in several domains due to the advent of deep learning (LeCun et al., 2015). Deep learning methods have displaced rule-based approaches, which made the identification of visual patterns tedious. There have been a few successful case studies in dermatology using deep learning. Esteva et al. and Haenssle et al. demonstrated that deep learning could match or even surpass dermatologists in detecting Melanoma (Esteva et al., 2017; Haenssle et al., 2018). Shrivastava et al. demonstrated similar efficacy for DL-based Psoriasis detection (Shrivastava et al., 2015). These models were limited to identifying single diseases. Park et al. detected several anomalies with the help of crowd-sourcing (Park et al., 2018). Liu et al. achieved about 70% accuracy over 26 skin conditions by differential diagnosis (Liu et al., 2019).
At present, we require some automation via expert systems to address the growing requirement for specialist attention. Although there have been recent forays into building such systems, the results are based on carefully curated datasets (Liu et al., 2019; Mishra et al., 2019a; Codella et al., 2017). For machine-aided diagnostics, we believe that handpicked multimedia obtained under ideal conditions are not a true representation of the samples seen in clinical workflow. In this paper, we attempt to assess the robustness of deep learning models that have been well trained on user-submitted images over ten classes. We evaluate their performance on common workflow issues such as noise and blur. The paper is organized as follows. Section 2 introduces the data and methods employed in the current study. Section 3 elaborates on the results. We present a brief discussion and summary in Section 4 before concluding. The contributions of this paper are as follows:
We have attempted to understand a multi-class dermatological classifier’s decision on wrong predictions via attention based mechanisms.
We have investigated the effect of common artifacts such as shot noise and motion blur in changing the decisions.
We have assessed the performance of such classifiers in the presence of distribution shift and out-of-distribution samples.
The source code will be made publicly available so that researchers in medical image recognition can evaluate the robustness of their methods to such non-ideal conditions.
A systematic collection of dermatological images was made with the consent and cooperation of volunteers belonging to the East Asian race. These images came from different sources, but usually larger than pixels and in JPEG format. Some additional samples were sourced from medical centers and affiliated institutions within agreed frameworks of data reuse. These samples were anonymized and labeled by registered clinicians without any modifications. Photographs with identifying features such as face, birthmark, tattoos, hospital tags etc. were excluded from the study. With the advice of clinical practitioners, we chose the following ten classes to build and investigate our classifier: (i) Acne, (ii) Alopecia, (iii) Blister, (iv) Crust, (v) Erythema, (vi) Leukoderma, (vii) Pigmented Maculae, (viii) Pustules, (ix) Wheal, and (x) Ulcer. Table 1 presents information about the selected labels and their sizes.
To examine the effect of distribution shift and out-of-distribution samples, we chose the SD-198 dataset (Sun et al., 2016). It was an ideal candidate for our study since it is also composed of user-submitted images. Since a one-to-one correspondence did not exist between the classes in this dataset and our collected images, a dermatologist helped group the classes relevant to our experimental design. One hundred samples were selected at random from each of these new composite classes and tested with our model. Information about this grouping is shown in Table 2.
|Class||Grouped SD-198 classes|
|Acne||Acne Keloidalis Nuchae, Acne Vulgaris, Steroid Acne, Favre Racouchot, Nevus Comedonicus,|
|Alopecia||Alopecia Areata, Androgenetic Alopecia, Follicular Mucinosis, Kerion, Scar Alopecia.|
|Blister||Dyshidrosiform Eczema, Hailey Disease, Herpes Simplex, Herpes Zoster, Varicella, Mucha Habermann disease|
|Crust||Angular Cheilitis, Bowen’s Disease, Impetigo|
|Erythema||Acute Eczema, Candidiasis, Erythema Ab Igne, Ery. Annulare Centrifigum, Ery. Craqule, Ery. Multiforme, Rosacea, Exfoliative Erythroderma|
|Leukoderma||Balanitis Xerotica Obl., Beau’s Lines, Halo Nevus, Leukonychia, Pityriasis Alba, Vitiligo|
|P. Macula||Actinic Solar Damage, Becker’s Nevus, Blue Nevus, Cafe Au Lait Macula, Compound Nevus, Congenital Nevus Dermatosis Nigra, Epidermal Nevus, Green Nail|
|Tumor||Angioma, Apocrine Hydrocystoma, Lipoma, Dermatofibroma, Digital Fibroma, Fibroma,|
|Ulcer||Aphthous Ulcer, Behcet’s Disease, Ulcer, Stasis Ulcer, Mal Perforans, Pyoderma Gangrenosum,|
|Wheal||Urticaria, Stasis Edema|
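The grouping and sampling step for the composite test set can be sketched as follows. This is a minimal illustration rather than the authors' code: the `GROUPING` dictionary holds only a hypothetical excerpt of the class mapping, and the function name, the `(image_id, label)` sample format, and the seed are assumptions introduced for the example.

```python
import random

# Hypothetical excerpt of the SD-198 -> composite class grouping.
GROUPING = {
    "Acne": ["Acne Keloidalis Nuchae", "Acne Vulgaris", "Steroid Acne"],
    "Wheal": ["Urticaria", "Stasis Edema"],
}

def build_composite_test_set(samples, grouping, per_class=100, seed=0):
    """Map SD-198 labels to composite classes and draw a random subset.

    samples: iterable of (image_id, sd198_label) pairs.
    Returns a dict: composite class -> list of sampled image ids.
    Classes with fewer than `per_class` images keep all of them.
    """
    rng = random.Random(seed)
    fine_to_composite = {
        fine: composite
        for composite, fines in grouping.items()
        for fine in fines
    }
    pools = {composite: [] for composite in grouping}
    for image_id, fine_label in samples:
        composite = fine_to_composite.get(fine_label)
        if composite is not None:  # labels outside the grouping are skipped
            pools[composite].append(image_id)
    return {
        composite: rng.sample(ids, min(per_class, len(ids)))
        for composite, ids in pools.items()
    }

# Toy data: 150 Acne images, 71 Wheal images, 10 images of an ungrouped class.
toy_samples = [(f"acne_{i}", "Acne Vulgaris") for i in range(150)]
toy_samples += [(f"wheal_{i}", "Urticaria") for i in range(71)]
toy_samples += [(f"other_{i}", "Keloid") for i in range(10)]

composite_test_set = build_composite_test_set(toy_samples, GROUPING)
print({name: len(ids) for name, ids in composite_test_set.items()})
```

Capping each class with `min(per_class, len(ids))` lets a class with fewer available images contribute everything it has instead of raising an error.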
2.2. Model learning
Since the dataset is much smaller than conventional computer vision datasets, we chose previously established learning paradigms for rapid training as our starting point (Mishra et al., 2019a, b). We provide a brief overview of the process in this section. The model was built on PyTorch using a single GPU (NVIDIA V100 16GB HBM2) (Paszke et al., 2019). Pre-trained ResNet-34, ResNet-50, ResNet-101, and ResNet-152 were chosen as the candidate architectures since they exhibit feature re-use and propagation, which were essential to our fine-tuning. We normalized the data with the recommended mean and standard deviation. The data was split in a 5:1 ratio into training and validation sets. We performed dynamic in-memory augmentation such as crop, random zoom, and horizontal and vertical flips in the data-loader.
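The normalization and 5:1 split described above can be sketched in plain Python. The sketch makes two assumptions that the text does not confirm: that the "recommended" statistics are the ImageNet channel means and standard deviations commonly used with pre-trained ResNets, and that the split is a simple shuffled split without stratification. All function and variable names are illustrative.

```python
import random

# Assumption: the "recommended" statistics are the ImageNet channel values
# commonly used with pre-trained ResNet weights.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))

def split_train_val(items, ratio=(5, 1), seed=0):
    """Shuffle items and split them into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = len(shuffled) * ratio[1] // sum(ratio)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical image identifiers standing in for the collected photographs.
image_ids = [f"img_{i:04d}" for i in range(600)]
train_ids, val_ids = split_train_val(image_ids)
print(len(train_ids), len(val_ids))
```

With class sizes as uneven as those in a clinical collection, a per-class (stratified) split may be preferable; the sketch keeps the simpler global shuffle for brevity.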
To find a good learning rate (LR), we calculated suitable values through a range test prior to model training (Smith, 2017). The implementation ran several mini-batches with increasing values of the learning rate, and the loss was computed for each. The validation loss was observed until it dropped significantly and reached a point of inflexion. The learning rate was chosen in the neighborhood of this inflexion. Figures 2 and 2 illustrate this process.
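The selection step of the range test can be illustrated with a small sketch. It assumes the losses have already been recorded for an exponentially increasing sequence of learning rates, and it uses the point of steepest loss descent per unit of log10(LR) as a stand-in for the inflexion described above. The loss values are synthetic and the function names are hypothetical.

```python
import math

def lr_sweep(lr_min, lr_max, num_steps):
    """Exponentially spaced learning rates from lr_min to lr_max."""
    factor = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    return [lr_min * factor ** i for i in range(num_steps)]

def suggest_lr(lrs, losses):
    """Return the LR at which the loss falls fastest per unit of log10(LR)."""
    best_index, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_index, best_slope = i, slope
    if best_index is None:
        raise ValueError("the loss never decreased during the sweep")
    return lrs[best_index]

# Synthetic loss curve: flat, then a sharp drop, then divergence.
sweep = lr_sweep(1e-5, 1.0, 11)
synthetic_losses = [2.30, 2.30, 2.29, 2.25, 2.10, 1.70, 1.40, 1.30, 1.50, 2.50, 5.00]
chosen_lr = suggest_lr(sweep, synthetic_losses)
print(f"suggested learning rate: {chosen_lr:.5f}")
```

In practice the recorded losses are noisy, so smoothing the curve before taking differences is a common precaution.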
The model learning was conducted in two phases using stochastic gradient descent with warm restarts (SGD-R) (Loshchilov and Hutter, 2016). The first phase used a cyclical learning rate with the network frozen except for the final fully-connected layer. Cosine rate annealing was performed at the beginning of each epoch. The LR modulation was governed by

$$\eta_t = \frac{\eta_0}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right),$$

where $\eta_0$ indicates the initial LR, $t$ is the current iteration, and $T$ is the total number of iterations to cover an epoch. This scheduling operation is illustrated in Figure 4.
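A minimal sketch of cosine annealing with a restart every epoch is given below. It assumes the standard SGDR form with a minimum rate of zero, in which the rate starts at the initial LR and decays to zero over the iterations of one epoch. The function names are illustrative.

```python
import math

def cosine_annealed_lr(initial_lr, t, T):
    """LR at iteration t of a cycle that spans T iterations."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * t / T))

def lr_with_restarts(initial_lr, iteration, T):
    """Restart the cosine cycle every T iterations (one epoch)."""
    return cosine_annealed_lr(initial_lr, iteration % T, T)

# Print the rate at a few points across two epochs of 100 iterations each.
for iteration in (0, 25, 50, 75, 99, 100, 150):
    print(iteration, round(lr_with_restarts(0.1, iteration, 100), 5))
```

The rate returns to its initial value at the first iteration of each new epoch, which is the "warm restart".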
In the subsequent phase, all the layers were unfrozen and SGD with restarts was used to train the full network. Two additional changes were also introduced. First, SGD-R was used in conjunction with cycle length multiplication (CLM) such that each successive cosine annealing cycle lasted longer. The modified SGD-R scheme is illustrated in Figure 4. Second, discriminative learning rates were used to assign different scales of change to different parts of the network (Howard and Ruder, 2018). Doing so helped preserve valuable pre-trained information. The outcomes of our model learning are discussed in Sec. 3.1.
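Both second-phase changes can be sketched in a few lines. The cycle multiplier of 2, the number of layer groups, and the divisor of 2.6 (the value suggested by Howard and Ruder for language models) are assumptions for illustration; the text does not state the values actually used.

```python
import math

def sgdr_with_clm(initial_lr, iters_per_epoch, num_cycles, cycle_mult=2):
    """Cosine-annealed LRs in which each cycle is cycle_mult times longer."""
    schedule = []
    cycle_epochs = 1
    for _ in range(num_cycles):
        total = cycle_epochs * iters_per_epoch
        for t in range(total):
            schedule.append(0.5 * initial_lr * (1.0 + math.cos(math.pi * t / total)))
        cycle_epochs *= cycle_mult
    return schedule

def discriminative_lrs(top_lr, num_groups, divisor=2.6):
    """Per-group LRs, smallest for the earliest layers, top_lr for the last."""
    return [top_lr / divisor ** (num_groups - 1 - g) for g in range(num_groups)]

schedule = sgdr_with_clm(initial_lr=0.1, iters_per_epoch=10, num_cycles=3)
group_lrs = discriminative_lrs(top_lr=0.01, num_groups=3)
print(len(schedule), [round(lr, 5) for lr in group_lrs])
```

With three cycles and a multiplier of 2, the cycles last 1, 2, and 4 epochs, so restarts occur at iterations 0, 10, and 30 of this toy schedule. Giving earlier layer groups smaller rates changes the generic pre-trained features less than the task-specific final layers.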
2.3. Adversarial tests
Imaging systems are prone to imperfections. Shot noise can manifest as a result of defects in the CCD sensor or spots on the camera lens. The presence of noisy pixels can adversely affect classifier performance (Goodfellow et al., 2014). Unless a stable platform is used for image capture, a photograph can also present varying amounts of motion blur. Sometimes both artifacts manifest in the same image.
When working with user submitted images, it is not unreasonable to expect these imperfections. Dermatological classifiers in clinical workflow should ideally be able to disregard the presence of minor flaws towards reliably identifying the skin lesions. Although benchmarks for adversarial robustness do not exist for the current application, we attempted to simulate the above mentioned artifacts on our dataset and compare the changes in accuracy.
To simulate the effect of shot noise, we introduced varying levels of salt and pepper (S&P) pixels. At most 5% of the image pixels were changed, with the ratio of black to white noise pixels set at 3:1. In the case of motion blur, we chose a Gaussian kernel. Sample images demonstrating the degree of changes are shown in Figures 5 and 6.
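The S&P corruption can be sketched on a grayscale image stored as nested lists. This is an illustration under stated assumptions: whether the original experiments corrupted whole pixels or individual color channels is not specified, and the function name and seed are hypothetical.

```python
import random

def add_salt_and_pepper(image, fraction=0.05, black_to_white=(3, 1), seed=0):
    """Return a copy of a grayscale image with a fraction of pixels corrupted.

    image: list of rows, each a list of intensities in 0..255.
    black_to_white: ratio of pepper (0) to salt (255) pixels.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    n_noisy = int(fraction * height * width)
    n_black = n_noisy * black_to_white[0] // sum(black_to_white)
    all_coords = [(r, c) for r in range(height) for c in range(width)]
    chosen = rng.sample(all_coords, n_noisy)  # distinct pixel positions
    noisy = [row[:] for row in image]
    for index, (r, c) in enumerate(chosen):
        noisy[r][c] = 0 if index < n_black else 255
    return noisy

# A uniform mid-gray 40 x 40 image makes the corrupted pixels easy to count.
clean_image = [[128] * 40 for _ in range(40)]
noisy_image = add_salt_and_pepper(clean_image)
flat = [value for row in noisy_image for value in row]
print(flat.count(0), flat.count(255))
```

Sampling positions without replacement keeps the corrupted fraction exact, so the stated 5% ceiling and the 3:1 ratio hold for every image.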
We tested the effects of these artifacts independent of each other. Additionally, we trained our classifier on noisy data to assess any change in the performance. The results of these adversarial experiments are discussed in Sec. 3.3.
3.1. Model fitting
In Sec. 2.2, we trained several ResNet models of different sizes to their best fits. Given sufficient epochs, we observed their accuracy converging to similar scores. All the models exhibited an aggregate Top-1 accuracy higher than 85%. The differences due to architecture size were not stark. The accuracy change between the smallest and the largest model was found to be approximately 3.5%. The learning stability was evidenced by the loss curves (training and validation) in each case, which indicated limited possibility for further fit. The results of model learning averaged over three trials are shown in Table 3. A plot indicating the loss changes during ResNet-152 model learning is illustrated in Figure 7. The confusion matrix of its classification is shown in Figure 8. The receiver operating characteristic (ROC) curves and the area under the curve (AUROC) for each class are indicated in Figure 9.
|Model||Top-1 Accuracy (in %) ()|
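For reference, the aggregate Top-1 accuracy and the confusion matrix reported in this section can be computed from predicted and true labels as sketched below. The labels are toy values over three classes, not results from the study, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

def top1_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def per_class_recall(matrix):
    """Recall of each class: correct predictions divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Toy labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_matrix = confusion_matrix(toy_true, toy_pred, num_classes=3)
print(toy_matrix, top1_accuracy(toy_matrix), per_class_recall(toy_matrix))
```

Off-diagonal cells identify the label pairs that are confused with each other, which the aggregate accuracy alone does not reveal.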
3.2. Effect of lesion variety
Given the stable convergence of the different models, we could hypothesize that several incorrect classifications arose from the nature of the data and from incorrect labeling, if any. Certain label pairs exhibited a high degree of error. The ensemble statistics are listed in Table 4.
|Labels||Avg. Erroneous predictions ()|
|Ulcer and Tumor|
|P. Macula and Erythema|
|Erythema and Wheal|
|Crust and Ulcer|
We investigated some erroneous predictions using GradCAM (Selvaraju et al., 2017). An example of Hyperplasic Pigmented Macula is exhibited in Figure 13. The classifier had a shape bias to detect it as Erythema, since a majority of the latter present themselves in contrasting patches. An example of Erythema incorrectly classified as P. Macula is shown in Figure 13. The presence of hyperpigmented spots in the periphery drew the classifier to believe it to be of the wrong class. Texture bias in vision models discussed by Geirhos et al. was evident in the examples of Ulcer misclassified as Tumor, and vice versa (Figures 13 and 13) (Geirhos et al., 2019). The absence of any identifying texture led to higher confusion while categorizing between Erythema and Wheal.
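The core Grad-CAM computation (Selvaraju et al., 2017) can be sketched without a deep learning framework: each feature map of the last convolutional layer is weighted by the spatial average of the class-score gradient with respect to that map, the weighted maps are summed, and negative values are clipped. The sketch assumes the activations and gradients have already been extracted from the network; the tiny arrays below are synthetic.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K feature maps and their class-score gradients.

    activations, gradients: lists of K maps, each a list of H rows of W floats.
    Returns an H x W heatmap (before upsampling to the input image size).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: global average of the gradient over spatial positions.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for r in range(height):
            for c in range(width):
                heatmap[r][c] += weight * feature_map[r][c]
    # ReLU keeps only regions with a positive influence on the class score.
    return [[max(0.0, value) for value in row] for row in heatmap]

# Two synthetic 2 x 2 feature maps: the first supports the class (positive
# gradients), the second opposes it (negative gradients).
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
print(toy_heatmap)
```

Only the top-left cell stays active in this toy case, since the second map's contribution is negative and is removed by the final clipping.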
3.3. Effect of adversarial artifacts
In Sec. 2.3, we described the means of creating adversarial examples to measure the robustness of the model against shot noise and motion blur. Using a ResNet-50 trained on uncorrupted data, we measured the classification performance of the model on a test set with up to 5% corrupted pixels. The Top-1 accuracy under this setting was observed to be about 68%. Figure 14 shows a correct prediction for the sample shown in Figure 5 despite the presence of S&P noise. However, many otherwise obvious cases show sharp prediction changes, such as Figure 15 for a sample of Ulcer. We also observed the emergence of a classifier bias towards a few labels, such as Acne, Blister and Wheal.
In the presence of motion blur, the performance of the classifier was marginally better at 74.85%. Figure 16 shows the previously illustrated example (Fig. 6) being correctly assigned to its true class label. Although a bias did emerge due to the introduction of blur, it was not as severe as that seen in the presence of shot noise. In both cases, however, we observed the classification quality change significantly, as illustrated in the confusion matrices shown in Figures 17 and 18.
We also trained our model on shot-noise-corrupted images and evaluated the performance on both clean and corrupted test sets. The Top-1 accuracies measured in these cases were 75.6% and 80.92%, respectively. We observed that the confusion between classes changed significantly when the model was trained on clean data and tested on corrupted images. The pattern of original class confusions was retained to some degree on both the clean and corrupted test sets when the model was trained in the presence of shot noise. Table 5 gives a brief summary of the adversarial experiments conducted.
|Train Status||Test Status||Top-1 Acc. (in %)|
|Clean w/o Noise||N.A.||86.90|
|Clean w/o Noise||S&P Corrupted||68.31|
|Clean w/o Noise||Blurred||74.85|
|S&P Corrupted||Clean w/o Noise||75.60|
|S&P Corrupted||S&P Corrupted||80.92|
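The size of each degradation can be read off Table 5 by subtracting from the clean baseline. The short sketch below does this arithmetic, assuming the first row (test status N.A.) is the clean-trained model evaluated on clean data; the dictionary keys are shorthand introduced for the example.

```python
# Top-1 accuracies (in %) copied from Table 5.
BASELINE_CLEAN = 86.90
RESULTS = {
    ("Clean", "S&P Corrupted"): 68.31,
    ("Clean", "Blurred"): 74.85,
    ("S&P Corrupted", "Clean"): 75.60,
    ("S&P Corrupted", "S&P Corrupted"): 80.92,
}

# Drop in percentage points relative to the clean baseline.
drops = {
    setting: round(BASELINE_CLEAN - accuracy, 2)
    for setting, accuracy in RESULTS.items()
}

for (train_status, test_status), drop in drops.items():
    print(f"train={train_status:14s} test={test_status:14s} drop={drop:.2f}")
```

The drops are 18.59, 12.05, 11.30, and 5.98 percentage points, respectively. Training on S&P-corrupted data recovers 12.61 points on corrupted inputs (80.92 versus 68.31) at a cost of 11.30 points on clean inputs.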
3.4. Effect of distribution shift
In real-world scenarios, any dermatological classifier is expected to work on a diverse set of input images. A supplied image may differ from the kind of images the model was trained on. Moreover, input images may also belong to classes which the classifier was not designed for. In cases of distribution shift and out-of-distribution samples, models are required to behave predictably.
As described in Sec. 2.1, we grouped matching classes of the SD-198 dataset according to our experimental specification to create a composite test set with a hundred samples in each class (the only exception being Wheal, which had 71 samples). We trained a ResNet-50 model to optimal accuracy on our dataset and performed inference on the test set to evaluate the predictions. Generalization was seen to be poor, with only 32% aggregate Top-1 accuracy. Only Acne and Alopecia fared well, with 70% and 73% accuracy, respectively. Classifier bias was seen to favor Acne and Blister over the other classes. Figure 19 shows a confusion matrix for the classifier performance on this test set.
In Sec. 3.1, we demonstrated means to perform a good model fit on homogeneous skin disease images. From the computer vision frame of reference, although background homogeneity may be an excellent option, it poses several challenges in dermatological classification. Despite robust training, a sufficiently large gap exists between our results and ideal, error-free performance. We ascribe it predominantly to the nature of the data itself. The visual attributes of skin diseases are not discrete by nature. Many diseases belong to a wide spectrum and families of abnormalities, which makes their detection and grading difficult. Several of these disease classes are also chronologically related. Instances of Acne and Blister can eventually lead to the formation of either Ulcers or Crusts. Different pathways exist for lesion progression, increasing the complexity. Although they may be obvious to a trained clinician, intermediate transitional states between classes can also be a difficult challenge in machine diagnosis. Skin diseases also rarely occur in isolation. Conditions such as Erythema or Hyperpigmentation present concurrently with several other diseases. Dermatological classifiers can be biased towards detecting one over the other. In the absence of valuable patient history, the information presented is only skin-deep.
Unlike radiological images, skin disease images present very few landmarks. There are only a handful of diseases with strong visual markers or patterns to aid classification. This issue is compounded by the large spectrum of skin tones in the patient pool. Lesion contrast and color are variable attributes when considering racial differences. Although some remedial measures have been discussed previously, contrast and illumination correction may not be sufficient (Mishra et al., 2019a). Our models were trained on a single racial type. With the introduction of more racial cohorts, we strongly believe the class confusions will worsen. We demonstrated this effect to some extent in Sec. 3.4, where testing the model on composite similar classes dramatically reduced the accuracy of the classifier. Irregularities tied to the nature of the data may not be alleviated in the near future.
Adversarial attacks can cause wrong categorizations with seemingly unchanged input images. We presented adversarial examples via shot noise & simulated motion blur which degraded the performance of classification. User submitted images typically contain some degree of such artifacts. Dermatological classifiers have not yet demonstrated confidence in overcoming them. We plan to study the effects of adversarial perturbations in skin images to design better classification models in the future. This might also help in reducing vulnerabilities from malicious attacks in real world systems.
In light of these deficiencies, human expertise cannot be removed from the clinical workflow. The margin of error in healthcare is slim. In addition to better and more varied skin data, we require clinicians to correctly label skin conditions and verify diagnostic outcomes. In our experience, computer vision techniques have been useful in detecting mislabeled samples or the presence of a novel category in our experimental data. But a trained practitioner is eventually required to appraise this information. At present, even if deep learning based classifiers cannot replace human expertise, they can be valuable physician aids in screening patients.
In this paper, we demonstrated that several skin diseases can be identified from user submitted images with deep learning based classifiers. We showed that given sufficient training, the accuracy levels become architecture agnostic. There exists a significant gap between error-free detection and the peak performance achieved by contemporary methods. This gap may not be bridged easily since it manifests from the nature of skin disease presentation. We also showed that the performance dipped by at least 10% in non-ideal conditions such as noise, blur and distribution shift, which are reasonable scenarios in any field trial. We emphasize the role of trained practitioners in conjunction with these methods to improve the quality of dermatological services. There is a long road yet to be travelled to achieve unassisted, reliable diagnosis of skin conditions.
- The role of the FDA in ensuring the safety and efficacy of artificial intelligence software and devices. Journal of the American College of Radiology 16 (2), pp. 208–210.
- Deep learning ensembles for melanoma recognition in dermoscopy images. IBM Journal of Research and Development 61 (4/5), pp. 5–1.
- Usefulness and economic evaluation of ADSL-based live interactive teledermatology in areas with shortage of dermatologists. International Journal of Dermatology 49 (11), pp. 1272–1275.
- Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115.
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology 29 (8), pp. 1836–1842.
- Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
- Hippocra: doctor-to-doctor teledermatology consultation service towards future AI-based diagnosis system in Japan. In 2017 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW), pp. 51–52.
- The US dermatology workforce: a specialty remains in shortage. Journal of the American Academy of Dermatology 59 (5), pp. 741–745.
- Impact of the number of dermatologists on dermatology biomedical research: a Canadian study. Journal of Cutaneous Medicine and Surgery 16 (3), pp. 174–179.
- Deep learning. Nature 521 (7553), pp. 436.
- A deep learning system for differential diagnosis of skin diseases. arXiv preprint arXiv:1909.05382.
- SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
- Dermatology in primary care: prevalence and patient disposition. Journal of the American Academy of Dermatology 45 (2), pp. 250–255.
- Interpreting fine-grained dermatological classification by deep learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.
- Supervised classification of dermatological diseases by deep learning. arXiv preprint arXiv:1802.03752, pp. 1–6.
- Improving image classifiers for small datasets by learning rate adaptations. arXiv preprint arXiv:1903.10726.
- Crowdsourcing dermatology: DataDerm, big data analytics, and machine learning technology. Journal of the American Academy of Dermatology 78 (3), pp. 643–644.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Google DeepMind and healthcare in an age of algorithms. Health and Technology 7 (4), pp. 351–367.
- Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- Reliable and accurate psoriasis disease classification in dermatology images using comprehensive feature space in machine learning paradigm. Expert Systems with Applications 42 (15-16), pp. 6184–6195.
- Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472.
- Prevalence of a history of skin cancer in 2007: results of an incidence-based model. Archives of Dermatology 146 (3), pp. 279–282.
- A benchmark for automatic visual classification of clinical skin disease images. In European Conference on Computer Vision, pp. 206–222.
- Waiting times to see a dermatologist are perceived as too long by dermatologists: implications for the dermatology workforce. Archives of Dermatology 137 (10), pp. 1303–1307.