Multiclass Wound Image Classification using an Ensemble Deep CNN-based Classifier

10/19/2020 ∙ by Behrouz Rostami, et al. ∙ University of Wisconsin-Milwaukee

Acute and chronic wounds are a challenge to healthcare systems around the world and affect many people's lives annually. Wound classification is a key step in wound diagnosis that would help clinicians to identify an optimal treatment procedure. Hence, having a high-performance classifier assists specialists in the field to classify wounds with lower financial and time costs. Different machine learning and deep learning-based wound classification methods have been proposed in the literature. In this study, we have developed an ensemble Deep Convolutional Neural Network-based classifier to classify wound images, including surgical, diabetic, and venous ulcers, into multiple classes. The output classification scores of two classifiers (patch-wise and image-wise) are fed into a Multi-Layer Perceptron to provide a superior classification performance. A 5-fold cross-validation approach is used to evaluate the proposed method. We obtained maximum and average classification accuracy values of 96.4% and 94.28% for binary and 91.9% and 87.7% for 3-class classification problems. The results show that our proposed method can be used effectively as a decision support system in the classification of wound images or other related clinical applications.


I Introduction

Acute and chronic wounds are a challenge and burden to healthcare systems worldwide. In the United States alone, acute wounds affect 11 million people and chronic wounds affect more than 6 million people annually, with an estimated Medicare burden of $28.1 billion to $96.8 billion (US) [demidova2012acute, sen2009human, sen2019human]. The characterization of a wound is a key step in wound diagnosis that helps clinicians identify an optimal treatment procedure and assess the efficacy of that treatment. Wound classification assigns a wound as a whole to one of several types (e.g., venous, diabetic, pressure) or conditions (e.g., ischemia vs. non-ischemia, infection vs. non-infection) and constitutes an essential part of wound characterization and assessment.

Wounds are traditionally classified manually by wound specialists and recorded as part of the electronic health record (EHR). With the advances in Artificial Intelligence (AI) over the past decades, however, intelligent algorithms have become widely used in healthcare in fields such as drug discovery, eye care, and medical image diagnostic systems [yu2018artificial], in addition to the wound classification under investigation in the current study. In recent years, AI algorithms have evolved into so-called data-driven techniques that require no human or expert intervention, as opposed to the early generations of AI that were rule-based and relied largely on expert knowledge [yu2018artificial]. Radiology, ophthalmology, immunology, genetics, and wound care are just a few examples of areas in healthcare where AI algorithms such as Machine Learning (ML) methods have been applied [yu2018artificial, lakhani2018machine, figgett2019machine, andreatta2017machine, bari2017machine, rahman2018machine, collier2019lotus].

A representative of these data-driven developments is Deep Learning (DL), which can analyze complicated data automatically and extract the needed information, relationships, and patterns [ohura2019convolutional, lakhani2018machine]. DL models appear in different forms, such as Deep Convolutional Neural Networks (DCNNs), according to the building blocks used in their structures. DCNNs, based on deep learning theory, are neural networks with multiple hidden layers, unlike their ancestors, which used a limited number of layers [jiang2017artificial]. In these networks, convolution is the main mathematical operation used to process the input information [rostami2019survey]. Many studies have been conducted in the wound care field using DL [wang2015unified, li2018composite, rajathi2019varicose] and DCNNs, specifically in wound image analysis tasks such as segmentation [goyal2019skin, veredas2015wound] and classification [abubakar2019discrimination, zahia2018tissue, zhao2019fine]. These studies have clearly shown the effectiveness and efficiency of deep convolutional neural networks in wound diagnosis and analysis.

The main contribution of this research is an end-to-end ensemble DCNN-based classifier that classifies entire wound images into multiple wound types. To the best of our knowledge, this is the first study in which an ensemble method is used for image-wise classification of wound images into different types. We propose an innovative strategy that combines a sliding window-based patch classifier with a common DCNN, AlexNet, to classify the wound images. Moreover, as far as we are aware, this is the first time a deep learning-based method has been proposed for classifying wound images into more than two types. By accepting the entire wound image as the input, our proposed approach generates the wound type as the output. Figure 1 displays an overview of our proposed method. Additionally, we utilized a new dataset of real wound images containing more wound types than those considered in prior publications. Our dataset contains 538 images from four different types of wounds: diabetic, venous, pressure, and surgical. All these images, along with their types and corresponding extracted ROIs, will be made publicly available.

The rest of this manuscript is organized as follows: Section II gives a literature review. In Section III, we present the materials used and discuss the details of our proposed ensemble deep CNN-based classification method. Section IV describes the experimental results. We give a summary and discussion of the results in Section V, followed by our conclusions in Section VI.

Fig. 1: Classification process.

II Related works

In this section, we review previous wound image classification studies. We have organized the articles into two major subcategories: papers that used a feature extraction method followed by an SVM, and articles that proposed an end-to-end DCNN-based classification approach. The complete organization chart for the reviewed studies can be found in Figure 2.

II-A Feature generation + SVM

II-A1 Traditional ML algorithm + SVM

Yadav et al. proposed a method for binary classification of burn wound images using machine learning tools [yadav2019feature]. The authors classified the images into two categories, graft and non-graft wounds, using a classic color-based feature extraction approach followed by an SVM. The dataset included 94 images with different burn depths, including full-thickness, deep dermal, and superficial dermal. A classification accuracy of 82.43% was reported. Testing the proposed method on a very small dataset is the main weakness of this research.

Goyal et al. suggested detection and localization methods for Diabetic Foot Ulcers (DFUs) on mobile devices [goyal2018robust]. For the classification part, they tried both conventional and DCNN-based methods. A dataset with 1775 DFU images was used, and the ground truth was generated by creating bounding boxes around the ROIs using annotation software. For the traditional machine learning techniques, patches were extracted from normal skin and abnormal areas, and the number of samples was increased using data augmentation methods. Different traditional feature extraction algorithms were applied, and the best three methods were selected. They then extracted 209 features from each patch, which were used for training a Quadratic SVM classifier. Finally, for a new image, a sliding window technique was used to classify each patch of the image as normal or abnormal using the trained SVM.

Fig. 2: Organization chart for the wound classification papers.

II-A2 Deep CNN + SVM

In another study, Abubakar et al. proposed a machine learning-based approach to distinguish between burn wounds and pressure ulcers [abubakar2019can]. Pre-trained deep architectures such as VGG-face, ResNet101, and ResNet152 were utilized for feature extraction. The features were fed into an SVM for the classification task. The dataset included 29 pressure and 31 burn images, which were augmented using cropping, rotation, and flipping transformations. After augmentation, they performed binary and 3-class classification experiments. In the binary classification experiment, the images were classified into burn or pressure categories, and in the 3-class classification problem the goal was to classify the images into burn, pressure, or healthy skin. The ResNet152 architecture generated the best results for both classification problems, with an accuracy value of 99.9%.

Goyal et al. predicted the presence of infection or ischemia in DFUs using a deep learning-based classification method [goyal2020recognition]. A new dataset with 1459 DFU images was introduced, and the samples were augmented using Faster-RCNN and InceptionResNetV2 networks. Binary classification experiments were performed to classify the samples into infection or non-infection, and ischemia or non-ischemia classes. In more detail, some color-based descriptors were extracted from each patch before classification. ResNet50, InceptionV3, and InceptionResNetV2 architectures were used in this study. In addition, the authors used an ensemble CNN approach to combine the outputs of the three deep networks and fed the result into an SVM for classification. They used the MATLAB and TensorFlow frameworks. In both binary classification problems, the deep learning-based methods showed a better performance than the traditional classifiers. The authors reported accuracy values of 90% for the ischemia and 73% for the infection experiments.

II-B End-to-end deep CNN-based methods

In [goyal2018robust], for the deep learning-based classification methods, a two-tier transfer learning approach was utilized for training the deep architectures, including MobileNet, InceptionV2, ResNet101, and InceptionResNetV2. This method uses both partial and full transfer learning, i.e., transferring either only the lower-level features or all features from a pre-trained model to the new model. TensorFlow was used as the framework. Object localization algorithms such as R-FCN and Faster R-CNN were utilized, followed by the trained deep architectures, for tasks such as classification. The combination of Faster R-CNN and InceptionV2 was reported as the best model.

In another work, Goyal et al. used convolutional neural networks to classify diabetic foot ulcers [goyal2018dfunet]. A DFU image dataset with 397 images was presented. Data augmentation techniques were utilized to increase the number of samples. They proposed DFUNet, a deep neural network, for patch-wise classification of foot ulcers into either normal or abnormal classes. DFUNet concatenates the outputs of three parallel convolutional layers that use different filter sizes. The authors claimed that with this idea, multiple-level features were extracted from the input, resulting in a network with higher discriminative strength. An accuracy value of 92.5% was reported for the proposed method. The main limitation of this research is that the authors proposed a patch classifier, which is of limited use in medical image classification tasks; a whole-image classifier is more meaningful and useful to clinicians than a patch classification model.

Nilsson et al. proposed a CNN-based method for venous ulcer image classification [aguirre2018classification]. The dataset included 300 samples, and a VGG-19 network was used to classify the images into venous or non-venous categories. The methodology included pre-training the VGG-19 network on a separate dataset of dermoscopic images and then fine-tuning it on the venous ulcer dataset. The values obtained for accuracy, precision, and recall were reported as 85%, 82%, and 75%, respectively. Caffe, TensorFlow, and Keras were used as the frameworks.

Alaskar et al. applied deep convolutional neural networks to intestinal ulcer detection in wireless capsule endoscopy images [alaskar2019application]. AlexNet and GoogleNet architectures were utilized to classify the input images into ulcer (abnormal) or non-ulcer (normal) categories. The dataset consisted of 1875 images obtained from wireless capsule endoscopy video frames, and the experiments were implemented in the MATLAB environment. The authors reported a classification accuracy of 100% for both networks.

In another work, Shenoy et al. proposed a method to classify wound images into multiple classes using deep CNNs [shenoy2018deepwound]. A dataset with 1335 wound images collected via smartphones as well as the internet was used in this study. After pre-processing and augmentation, nine different labels were created, and for each label positive and negative subcategories were considered. The authors created WoundNet, a modified form of the VGG-16 network, and three different versions of WoundNet were pre-trained on the ImageNet dataset. In addition, an ensemble model named Deepwound was designed to average the outputs of the three individual WoundNet architectures. The algorithms were implemented in Keras. A mobile phone application was also created to facilitate patient-to-physician consultation and wound healing evaluation.

Alzubaidi et al. presented a DCNN for binary classification of diabetic foot ulcers [alzubaidi2019dfu_qutnet]. A new dataset consisting of 754 smartphone-captured foot images was introduced in this study. The goal was to classify the samples into normal or abnormal (DFU) skin categories. Normal and abnormal patches were extracted from the images, and the number of samples was increased using data augmentation techniques. The proposed network, DFU_QUTNet, is a deep architecture with 58 layers, including 17 convolutional layers. In comparison with common DCNNs, the width of the proposed model has been increased without adding computational complexity, allowing the network to extract more information from the input, which results in higher classification accuracy. In one experiment, DFU_QUTNet was applied to an end-to-end classification task, and in another, it was utilized as a feature extractor along with SVM and KNN classifiers. The maximum reported F1-score was 94.5%, obtained by combining DFU_QUTNet and an SVM. Although designing a high-performance patch classifier is a good achievement, a whole-image classification system would be more useful in clinical environments, which is the weakness of this research.

Ref | Classification | Feature(s) | Methods | Dataset | Limitation(s)
[goyal2018dfunet] | Binary (DFU/normal skin) | N/A | A novel CNN architecture named DFUNet | DFU dataset with 397 images (292 abnormal, 105 normal cases) | Small DFUs and DFUs with colors similar to the surrounding skin are hard for this network to classify; the same holds for normal skin with wrinkles or high red tones.
[yadav2019feature] | Binary (graft/non-graft burn wound) | Color, texture, and depth | Support Vector Machine (SVM) | Burns BIP_US Database | Very small evaluation set (74 images) was used.
[aguirre2018classification] | Binary (venous/non-venous) | N/A | A pre-trained VGG-19 network | A dataset with 300 images with specialist annotation | Classification accuracy depends on camera distance from the ulcer surface.
[alaskar2019application] | Binary (ulcer/non-ulcer endoscopy image) | N/A | AlexNet and GoogleNet | 1875 images obtained from wireless capsule endoscopy video frames | An unbalanced test set (3:1 ratio) was used.
[shenoy2018deepwound] | Binary (positive/negative cases for different labels such as wound, infection, granulation, etc.) | N/A | WoundNet (modified version of VGG-16) and Deepwound (an ensemble model) | 1335 wound images collected via smartphones and the internet | Accuracy varies from 72% to 97% across the binary classes; for some classes (e.g., drainage at 72%) the model does not work well.
[abubakar2019can] | Binary (burn/pressure), 3-class (burn/pressure/healthy skin) | Features extracted from VGG-face, ResNet101, and ResNet152 | Support Vector Machine (SVM) | 29 pressure and 31 burn images | Very small dataset used.
[alzubaidi2019dfu_qutnet] | Binary (normal/abnormal skin (diabetic ulcer)) | N/A | A novel deep CNN (DFU_QUTNet) | 754 foot images | Only precision, recall, and F1-score are used as evaluation metrics, which may not reflect all aspects of performance.
[goyal2020recognition] | Binary (infection/non-infection, ischemia/non-ischemia) | Bottleneck features extracted from InceptionV3, ResNet50, and InceptionResNetV2 | Support Vector Machine (SVM) | 1459 DFU images | Depending on lighting conditions (shadows), marks, and skin tone, the model can show poor performance.
[goyal2018robust] | Binary (normal/abnormal skin (DFU)) | 209 features extracted using LBP, HOG, and color descriptors | Quadratic SVM, InceptionV2, MobileNet, ResNet101, InceptionResNetV2 | 1775 DFU images collected from a hospital within five years | No evaluation of the classification task is given.
TABLE I:
Summary of Wound Image Classification Works.

Table I summarizes the reviewed studies. Only a few papers were identified that discuss wound analysis from the wound type classification point of view. Moreover, most of the publications on wound type classification address only binary problems, such as classifying samples into normal and abnormal cases. Difficulty accessing a reliable dataset is one reason for this gap, and providing data to fill it was one of the motivations for our research. In addition, many papers discussed only patch-wise or ROI classification instead of image-wise wound classification. In the rest of this paper, we propose an ensemble classifier for image-wise multi-class wound type classification using deep convolutional neural networks.

III Materials and Methods

III-A Dataset

In this research, we used a new wound image dataset collected over a two-year clinical period at the AZH Wound and Vascular Center in Milwaukee, Wisconsin. The dataset includes 400 wound images in JPG format and of various sizes, from four different wound types: venous, diabetic, pressure, and surgical (100 images per class). The images were captured using an iPad Pro (software version 13.4.1) and a Canon SX 620 HS digital camera and were labeled by a wound specialist from the AZH Wound and Vascular Center. The dataset can be accessed on GitHub at https://github.com/uwm-bigdata/wound_classification/blob/main/data/Dataset.rar.
Figure 3 shows some sample images from different classes of the dataset.

Fig. 3: Sample images from the AZH Wound and Vascular Center database. The rows from top to bottom display diabetic, venous, pressure, and surgical samples, respectively.

III-B Methods

This section describes the method used in this research and is organized into two subsections: patch-wise and image-wise wound classification. The best-performing patch classifier is used as a building block of the whole-image classification model.

III-B1 Patch-wise classification

Pre-processing: ROI Extraction

We selected 100 images for each wound type as the training samples and manually extracted 100 unique Regions of Interest (ROIs) per class for each of the six categories: diabetic, venous, pressure, surgical, background, and normal skin. The ROIs are rectangular and have different sizes. Figure 4 displays some of the extracted ROIs from different classes of the dataset.

Fig. 4: Sample ROIs. The columns from left to right display diabetic, venous, pressure, surgical, normal skin, and background ROIs, respectively.
Pre-processing: Data splitting

After extracting the ROIs, we split the samples into training (70% of samples), testing (15%), and validation (15%) sets.

Pre-processing: Patch generation

In this step, 17 patches were generated from each ROI, resulting in 1700 patches per class. The patches were extracted such that each patch covers between 75% and 85% of the original ROI. After this step, each class had 1190, 255, and 255 patches in the training, validation, and test sets, respectively.
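As an illustration, a Python sketch of this step is shown below (the original pipeline was implemented in MATLAB). Since the exact layout of the 17 patches is not specified beyond the coverage constraint, the random placement of the crops and the function name generate_patches are assumptions made purely for illustration.

```python
import random
from PIL import Image

def generate_patches(roi: Image.Image, n_patches: int = 17,
                     min_cov: float = 0.75, max_cov: float = 0.85):
    """Draw n_patches crops, each covering roughly 75-85% of the ROI area."""
    w, h = roi.size
    patches = []
    for _ in range(n_patches):
        cov = random.uniform(min_cov, max_cov)   # target area coverage ratio
        scale = cov ** 0.5                       # area ratio -> side-length ratio
        cw, ch = int(w * scale), int(h * scale)
        left = random.randint(0, w - cw)         # random placement inside the ROI
        top = random.randint(0, h - ch)
        patches.append(roi.crop((left, top, left + cw, top + ch)))
    return patches

# Example usage: patches = generate_patches(Image.open("roi_0001.jpg"))
```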

Pre-processing: Data Augmentation

Augmentation of the training set samples was performed by generating 16 samples from each patch using image transformation methods such as rotating, flipping, cropping, and mirroring. Following augmentation, there were 19040 training samples in each class.
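A rough Python/torchvision equivalent of the listed transformations is sketched below; the specific rotation angles, crop scale, and flip probabilities are our own assumptions, and the original augmentation was done in MATLAB.

```python
import torchvision.transforms as T
from PIL import Image

# One randomized augmentation pipeline; applied 16 times per training patch.
augment = T.Compose([
    T.RandomRotation(degrees=30),                      # rotation (angle range assumed)
    T.RandomHorizontalFlip(p=0.5),                     # mirroring
    T.RandomVerticalFlip(p=0.5),                       # flipping
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # cropping to the network input size
])

def augment_patch(patch: Image.Image, n_copies: int = 16):
    """Return 16 augmented variants of a single training patch."""
    return [augment(patch) for _ in range(n_copies)]
```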

Fig. 5: Training process of the patch classifier.
Training

Following the pre-processing step, we trained a deep convolutional neural network on the training samples. After studying various networks, we selected AlexNet as the classifier, since our training set is small and more modern CNNs require an extensive number of samples and are computationally expensive. Due to its simplicity and effectiveness, AlexNet is still one of the most common deep networks used by researchers [litjens2017survey]. AlexNet is a deep CNN architecture proposed in 2012 that won the ILSVRC over the traditional machine learning methods of the time [krizhevsky2012imagenet, alom2018history]. The network has 8 layers and 60 million parameters. A modified version of AlexNet was utilized in this research to suit our classification problems: we changed the final fully connected layer so that its output size matches the number of classes in our data. In addition, we used transfer learning to increase the training accuracy while reducing the training time; that is, the AlexNet architecture was pre-trained on a massive dataset of general images called ImageNet and fine-tuned using our wound image patches. Figure 5 shows the described steps for training the patch classifier.
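The fine-tuning was carried out in MATLAB; a minimal PyTorch sketch of the same idea (load ImageNet weights, resize the final fully connected layer to the number of wound classes, then fine-tune) might look as follows.

```python
import torch.nn as nn
from torchvision import models

def build_patch_classifier(num_classes: int) -> nn.Module:
    """Pre-trained AlexNet with its last FC layer resized to our class count."""
    model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    in_features = model.classifier[6].in_features       # original 1000-way ImageNet layer
    model.classifier[6] = nn.Linear(in_features, num_classes)
    return model

# e.g., a 6-class patch classifier: diabetic, venous, pressure, surgical, BG, N
patch_classifier = build_patch_classifier(num_classes=6)
```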

III-B2 Image-wise classification using an ensemble classifier

Several DCNN-based ensemble classification methods have been proposed for medical and non-medical image classification tasks in the literature [chen2019deep, xia2019transferring, kassani2019classification]. Various strategies have been used to construct ensemble classifiers, such as voting, concatenating, and averaging [savelli2020multi, cha2019automated, hussain2020comprehensive, bermejo2020classification]. In all of these studies, the final conclusion was that the ensemble model outperformed the individual classifiers. To this end, we designed an ensemble classifier in which the trained patch classifier described in Section III-B1 is used as a building block. The classification scores obtained from two classifiers (patch-wise and image-wise) are fed into an MLP classifier to obtain a better classification performance. Our hypothesis is that the proposed ensemble classifier will outperform each of the individual classifiers in terms of classification accuracy. The different components of the proposed classifier are explained individually below.

Classifier A - Whole image classifier

This classifier is a pre-trained AlexNet architecture that we fine-tuned using our own dataset. Figure 6 displays the training phase for this classifier.

Fig. 6: Training process of Classifier A.
Classifier B - Sliding window + Patch classifier

This classifier applies the sliding window technique to the input wound image to extract 9 equal-size patches, followed by the patch classification step to predict their wound types. The wound type for the whole image is then predicted by majority voting on the predicted labels of the patches detected as wound. Figure 7 describes the entire process for this classifier.
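A hedged Python sketch of Classifier B under our reading of the text is given below: the image is split into a 3x3 grid of equal-size windows (the paper states nine equal patches but not the exact grid), each window is classified by the trained patch classifier, and a majority vote over the windows predicted as wound tissue gives the image label. The function and variable names are illustrative.

```python
import torch
from collections import Counter

WOUND_LABELS = {"D", "V", "P", "S"}   # wound classes; BG and N patches are ignored in the vote

def classify_image_b(image_tensor, patch_classifier, class_names, grid=3):
    """Classifier B sketch: 3x3 window grid + patch classifier + majority voting."""
    _, h, w = image_tensor.shape                       # (C, H, W) tensor
    ph, pw = h // grid, w // grid
    votes = []
    patch_classifier.eval()
    with torch.no_grad():
        for i in range(grid):
            for j in range(grid):
                patch = image_tensor[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                patch = torch.nn.functional.interpolate(    # resize to the network input size
                    patch.unsqueeze(0), size=(224, 224), mode="bilinear")
                label = class_names[patch_classifier(patch).argmax(dim=1).item()]
                if label in WOUND_LABELS:                   # keep only patches detected as wound
                    votes.append(label)
    # majority vote over the patches detected as wound tissue
    return Counter(votes).most_common(1)[0][0] if votes else "N"
```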

Fig. 7: Classifier B. The first step is extracting equal size patches out of the input image using the sliding window technique. Then the patch classifier is used for detecting the patch labels. The final step is majority voting for predicting the whole image label.
Classification scores

Each of Classifiers A and B generates two classification scores for every input image. For example, for the surgical vs. venous binary classification problem, they output S and V scores, which stand for the classification scores of the surgical and venous labels, respectively. For Classifier B, these scores are calculated by averaging the S and V scores over all the patches detected as wounds by the patch classifier. In the end, for the binary classification case, we create a three-element feature vector from these classification scores, where the subscripts A and B indicate the corresponding classifier; one of the four scores is not included in the feature vector because of its correlation with the other score from the same classifier. Finally, we feed the feature vector into the MLP classifier for the final classification task, as described below.
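The following sketch shows one way the feature vector could be assembled. Which of the four scores is dropped is not fully recoverable from the text; dropping a Classifier A score (its two softmax outputs are complementary) is our own assumption for illustration.

```python
import numpy as np

def ensemble_features(scores_a, wound_patch_scores_b):
    """Assemble the three-element MLP input for the surgical-vs-venous case.

    scores_a             -- (S_A, V_A): softmax scores from Classifier A.
    wound_patch_scores_b -- list of (S, V) score pairs for the patches that the
                            patch classifier detected as wound tissue (Classifier B).
    """
    s_a, v_a = scores_a
    s_b, v_b = np.mean(np.asarray(wound_patch_scores_b), axis=0)  # average patch scores
    # One score is dropped because it is strongly correlated with the other score
    # from the same classifier; dropping V_A here is an illustrative assumption.
    return np.array([s_a, s_b, v_b])
```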

MLP Classifier

The MLP classifier is a four-layer MLP with two hidden layers that have 8 and 7 neurons, respectively. The numbers of nodes in the input and output layers are determined by the type of classification problem; for the binary case, there are 3 and 2 nodes in these layers, respectively. The output of the MLP classifier is the wound type of the input image. Figure 8 displays how the proposed ensemble classifier works for a binary classification problem. The idea behind combining Classifiers A and B is to consider both patch-level and whole-image-level information for classification.
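A minimal PyTorch sketch of the described MLP (two hidden layers with 8 and 7 neurons; 3 inputs and 2 outputs in the binary case) is given below. The activation function is not stated in the text, so ReLU is assumed.

```python
import torch.nn as nn

class EnsembleMLP(nn.Module):
    """Four-layer MLP: input -> 8 -> 7 -> output, per the description above."""
    def __init__(self, in_features: int = 3, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 8), nn.ReLU(),
            nn.Linear(8, 7), nn.ReLU(),
            nn.Linear(7, num_classes),        # class scores for the wound types
        )

    def forward(self, x):
        return self.net(x)

mlp = EnsembleMLP(in_features=3, num_classes=2)   # binary case: surgical vs venous
```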

III-C Performance metrics

In this research, we used the accuracy, precision, recall, and F1-score metrics to assess the performance of the classifiers. Equations 1 to 4 show the formulae for these evaluation metrics. For the binary classification problem, we also used the Area Under the ROC Curve (AUROC or AUC) metric. In these equations, TP, TN, FP, and FN represent the True Positive, True Negative, False Positive, and False Negative counts, respectively. More details about these equations and the related concepts can be found in [fawcett2006introduction].

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
$\text{Precision} = \dfrac{TP}{TP + FP}$ (2)
$\text{Recall} = \dfrac{TP}{TP + FN}$ (3)
$\text{F1-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (4)
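For reference, the same metrics can be computed with scikit-learn; this is only a convenience sketch, not the MATLAB evaluation code used in this work.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score=None):
    """Accuracy, precision, recall, F1, and (for binary problems) AUC."""
    metrics = {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }
    if y_score is not None:                    # AUC needs continuous scores, not labels
        metrics["auc"] = roc_auc_score(y_true, y_score)
    return metrics
```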
Fig. 8: Whole image classification process using our proposed ensemble classifier. The classifier accepts the wound image as the input and predicts the wound type as the output.

IV Results

This section presents the results obtained from the patch-wise and image-wise classification experiments. The classifiers were implemented in MATLAB R2019b and R2020a using an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory. In the diagrams and tables presented in this section, the abbreviations D, V, P, S, BG, and N represent the classes diabetic, venous, pressure, surgical, background, and normal skin, respectively (Table II). Several experiments were conducted to find the optimum structure for the deep networks. The optimum epoch number obtained was 20, and we used a learning rate of 10e-6. Adam was utilized as the optimization algorithm [kingma2014adam]. Further details for each experiment are provided below.
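As a point of reference, the stated settings would look roughly like the following in PyTorch (the actual training was done in MATLAB); the batch size is not reported in the paper and is assumed purely for illustration.

```python
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)  # paper states "10e-6", read as 10^-6
criterion = torch.nn.CrossEntropyLoss()
num_epochs = 20                                            # reported optimum epoch number
batch_size = 32                                            # not reported; assumed for illustration
```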

Abbreviation Description
BG Background
N Normal skin
V Venous
D Diabetic
P Pressure
S Surgical
TABLE II:
Class label abbreviations

IV-A Patch classification

To evaluate the patch classifier's performance, we used 255 test patches per class. For the 4-class classification experiments, the goal was to classify the wound patches into one of four classes: BG, N, and two wound labels. In the 5-class classification problem, we had three wound labels as well as the BG and N classes. The last group of patch-wise classification experiments is the 6-class case, in which the wound patches are classified into one of the six classes diabetic, venous, pressure, surgical, BG, and N. Table III shows the test accuracy values for all the experiments mentioned above. Figures 9 to 11 display some sample confusion matrices for the patch-wise classification experiments. It should be noted that we performed and compared all the experiments with and without data augmentation. As data augmentation always yielded better results, we only report the experiments with data augmentation.

Num of Classes | Classes | Test accuracy (%)
4-class | BGNVD | 89.41
4-class | BGNVP | 86.57
4-class | BGNVS | 92.20
4-class | BGNDP | 80.29
4-class | BGNDS | 90.98
4-class | BGNPS | 84.12
5-class | BGNDVP | 79.76
5-class | BGNDVS | 84.94
5-class | BGNDPS | 81.49
5-class | BGNVPS | 83.53
6-class | BGNDVPS | 68.69
TABLE III:
Patch-wise classification results.
Fig. 9: Confusion matrices of the best (left) and worst (right) case in 4-class classification experiments.
Fig. 10: Confusion matrices of the best (left) and worst (right) case in 5-class classification experiments.
Fig. 11: Confusion matrix for the 6-class classification experiment.

IV-B Whole image classification

To assess our proposed ensemble classifier's efficiency, we performed two types of experiments: binary classification and 3-class classification. Based on the patch classification results presented in Subsection IV-A, the surgical vs. venous and surgical vs. venous vs. diabetic classifiers, which showed the best binary and 3-class outcomes, were selected. For the rest of the manuscript, we refer to the whole-image classifier (AlexNet trained on whole wound images) as Classifier A and to the patch classifier (AlexNet trained on wound patches) as Classifier B. We also obtained 138 extra wound images from the three classes surgical (28 samples), diabetic (54 samples), and venous (56 samples) from the AZH Wound and Vascular Center to be used as test images. Table IV shows the binary classification results obtained from applying Classifier A, Classifier B, and the proposed ensemble classifier to the test set, which included 84 wound images. The 3-class classification results for the three classifiers are provided in Table V. These results show the superiority of our proposed ensemble classifier over the other two classifiers for both the binary and 3-class classification problems.

Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
A | 83.3 | 76.9 | 71.4 | 74.04
B | 82.1 | 71 | 78.6 | 74.60
Our ensemble classifier | 96.4 | 93.1 | 96.4 | 94.72
TABLE IV:
Whole image binary classification (surgical vs venous) results obtained from applying the classifiers on the test set images.
Classifier | Class | Precision (%) | Recall (%) | F1-score (%) | ACC (%)
A | D | 88 | 81.5 | 84.62 | 83.3
A | S | 67.9 | 67.9 | 67.9 | 83.3
A | V | 86.7 | 92.9 | 89.69 | 83.3
B | D | 70.9 | 72.2 | 71.54 | 67.4
B | S | 42.1 | 28.6 | 34.06 | 67.4
B | V | 71.9 | 82.1 | 76.66 | 67.4
Our ensemble classifier | D | 86.2 | 92.6 | 89.28 | 89.1
Our ensemble classifier | S | 81.5 | 78.6 | 80.02 | 89.1
Our ensemble classifier | V | 96.2 | 91.1 | 93.58 | 89.1
TABLE V:
Whole image 3-class classification results obtained from applying the classifiers on the test set images. ACC is the overall accuracy of each classifier.

V Discussion

Acute and chronic wounds are a challenge and burden to healthcare systems in all countries. The wound diagnosis and treatment process can be facilitated by an efficient classification method, and ML and DL are powerful candidate algorithms for wound image analysis tasks such as classification. Prior works in the literature mainly dealt with binary classification or studied only specific types of wounds, such as diabetic ulcers. Additionally, most previous studies classified extracted ROIs or wound patches rather than whole wound images. Also, many of them had difficulty accessing valid, reliable, and high-quality images, as some collected their data from the web. For these reasons, we proposed a novel end-to-end ensemble deep learning-based classification method for classifying chronic wounds into multiple categories based on their type.

In the patch classification, as expected, increasing the number of classes from four to five and six decreases the classification accuracy. The explanation for this phenomenon is that increasing the number of classes accordingly increases the number of network parameters, which makes it more challenging for the deep architecture to train all parameters to the same standard as before using the same number of training samples. Another interesting observation is that in the 4-class and 5-class classification experiments, the lowest classification accuracy is associated with the diabetic and pressure wounds, showing that these two wound types are very similar in appearance. The confusion matrices confirm this by showing that many diabetic wounds were classified into the pressure class and vice versa. We also observed in these experiments that the highest classification accuracy is associated with the surgical wounds. It can be concluded that surgical wounds are the most distinguishable wound type; this could be related to their shapes, which are usually more elongated and therefore easier to separate from other wound types. We see that, in most of the experiments, the pressure wound is the most challenging wound type to classify. Looking closely at the confusion matrices, we find that the low recall value for this wound type often comes from misclassifying the wound into the venous or diabetic class instead of the pressure class. Another observation is that the venous wounds typically show the highest recall values among all of the wound types. This could be related to having samples with wider variability and, consequently, better training of the classifier for this wound type. We expect that increasing the number of samples in each dataset category would improve the recall value for all wound types. In all of the patch classification experiments, the background and normal skin classes have the highest recall values. This is important because our proposed ensemble classifier needs a patch classifier that can distinguish background and normal skin from wounded tissue with high accuracy.

Regarding the image-wise classification experiments, we see that for the binary case, Classifiers A and B generate almost similar results, while our proposed ensemble classifier showed an accuracy of 96.4%, which is 13.1% higher than Classifier A and 14.3% higher than Classifier B. Also, for the 3-class classification case, the accuracy of the ensemble classifier is higher than that of the other two classifiers. This last observation is very interesting because Classifier B displays a low classification performance, specifically for the surgical wounds, but after combining it with Classifier A, the ensemble accuracy is 5.8% higher than that of Classifier A alone.

V-A Robustness experiments

To investigate the robustness of our proposed classifier, we used 5-fold cross-validation as a standard evaluation method. For these experiments, we added the test set (84 images for the binary and 138 images for the 3-class classification experiment) to our training set (200 images for the binary and 300 images for the 3-class case). Thereafter, for each of the three classes diabetic, surgical, and venous, we randomly selected 20% of the samples as the test set and the rest as the training samples. We trained our classifier on the training samples, tested it on the test set, and repeated this strategy for 5 rounds. Tables VI to X display the classification accuracy, AUC, precision, recall, and F1-score values obtained for the binary classification problem. Figures 12 and 13 compare the classifiers on the accuracy and AUC metrics, and the ROC plots are presented in Figure 14. The 3-class classification results are summarized in Tables XI to XIV and Figure 15. In all tables, R1 to R5 denote the round number of the experiments.
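A hedged Python sketch of the described procedure (class-stratified random 80/20 splits repeated for five rounds) follows; the names and the train_and_test callback, which is expected to train the ensemble and return its test accuracy, are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def five_round_evaluation(images, labels, train_and_test):
    """Repeat a class-stratified random 80/20 split five times (rounds R1-R5)."""
    splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    accuracies = []
    for round_id, (train_idx, test_idx) in enumerate(splitter.split(images, labels), 1):
        acc = train_and_test(images[train_idx], labels[train_idx],
                             images[test_idx], labels[test_idx])
        accuracies.append(acc)
        print(f"Round {round_id}: accuracy = {acc:.1f}%")
    return np.mean(accuracies)    # average accuracy over the five rounds
```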

Classifier | R1 | R2 | R3 | R4 | R5
A | 91.1 | 89.3 | 87.5 | 85.7 | 85.7
B | 67.9 | 69.6 | 73.2 | 83.9 | 75
Our ensemble classifier | 94.6 | 94.6 | 96.4 | 92.9 | 92.9
TABLE VI:
Whole image binary classification (S vs V) accuracy percentages obtained from 5-fold cross-validation.
Classifier | R1 | R2 | R3 | R4 | R5
A | 0.9497 | 0.9677 | 0.9548 | 0.9548 | 0.9303
B | 0.7806 | 0.7716 | 0.7677 | 0.8439 | 0.7084
Our ensemble classifier | 0.9806 | 0.9845 | 0.9613 | 0.9561 | 0.9535
TABLE VII:
Whole image binary classification (S vs V) AUC values obtained from 5-fold cross-validation.
Fig. 12: Accuracy values obtained from 5-fold cross-validation for the binary classification problem.
Fig. 13: AUC values obtained from 5-fold cross-validation for the binary classification problem.
Classifier | R1 | R2 | R3 | R4 | R5
A | 91.7 | 91.3 | 100 | 84 | 87
B | 65.2 | 68.2 | 77.8 | 83.3 | 76.2
Our ensemble classifier | 95.8 | 89.3 | 100 | 86.2 | 92
TABLE VIII:
Whole image binary classification (S vs V) precision percentages obtained from 5-fold cross-validation.
Classifier | R1 | R2 | R3 | R4 | R5
A | 88 | 84 | 72 | 84 | 80
B | 60 | 60 | 56 | 80 | 64
Our ensemble classifier | 92 | 100 | 92 | 100 | 92
TABLE IX:
Whole image binary classification (S vs V) recall percentages obtained from 5-fold cross-validation.
Classifier | R1 | R2 | R3 | R4 | R5
A | 89.81 | 87.49 | 83.72 | 84 | 83.35
B | 62.49 | 63.83 | 65.12 | 81.61 | 69.56
Our ensemble classifier | 93.86 | 94.34 | 95.83 | 92.58 | 92
TABLE X:
Whole image binary classification (S vs V) F1-score percentages obtained from 5-fold cross-validation.
Fig. 14: ROC plots obtained from the 5-fold cross-validation experiments; panels (a) through (e) correspond to Rounds 1 to 5.
Classifier | R1 | R2 | R3 | R4 | R5
A | 79.1 | 76.7 | 82.6 | 83.7 | 81.4
B | 68.6 | 55.8 | 64 | 72.1 | 61.6
Our ensemble classifier | 84.9 | 81.4 | 88.4 | 91.9 | 91.9
TABLE XI:
Whole image 3-class classification (S vs V vs D) accuracy percentages obtained from 5-fold cross-validation.
Fig. 15: Accuracy values obtained from 5-fold cross-validation for the 3-class classification problem.
Classifier | Class | R1 | R2 | R3 | R4 | R5
A | D | 80 | 80 | 83.3 | 81.3 | 78.8
A | S | 73.9 | 76.2 | 84.2 | 94.7 | 80
A | V | 81.8 | 74.3 | 81.1 | 80 | 84.8
B | D | 63.6 | 70.6 | 68 | 67.7 | 56.7
B | S | 65 | 41.4 | 44 | 66.7 | 63.2
B | V | 75.8 | 60 | 75 | 79.4 | 64.9
Our ensemble classifier | D | 81.8 | 92.6 | 84.4 | 89.7 | 85.3
Our ensemble classifier | S | 85.7 | 70.4 | 90.5 | 100 | 91.7
Our ensemble classifier | V | 87.5 | 81.3 | 90.9 | 87.9 | 100
TABLE XII:
Whole image 3-class classification (S vs V vs D) precision percentages obtained from 5-fold cross-validation.
Classifier | Class | R1 | R2 | R3 | R4 | R5
A | D | 80 | 80 | 83.3 | 86.7 | 86.7
A | S | 68 | 64 | 64 | 72 | 64
A | V | 87.1 | 83.9 | 96.8 | 90.3 | 90.3
B | D | 70 | 40 | 56.7 | 70 | 56.7
B | S | 52 | 48 | 44 | 56 | 48
B | V | 80.6 | 77.4 | 87.1 | 87.1 | 77.4
Our ensemble classifier | D | 90 | 83.3 | 90 | 86.7 | 96.7
Our ensemble classifier | S | 72 | 76 | 76 | 96 | 88
Our ensemble classifier | V | 90.3 | 83.9 | 96.8 | 93.5 | 90.3
TABLE XIII:
Whole image 3-class classification (S vs V vs D) recall percentages obtained from 5-fold cross-validation.
Classifier | Class | R1 | R2 | R3 | R4 | R5
A | D | 80 | 80 | 83.3 | 83.91 | 82.56
A | S | 70.82 | 69.56 | 72.72 | 81.80 | 71.11
A | V | 84.36 | 78.80 | 88.25 | 84.83 | 87.46
B | D | 66.64 | 51.06 | 61.83 | 68.83 | 56.7
B | S | 57.77 | 44.45 | 44 | 60.88 | 54.56
B | V | 78.12 | 67.59 | 80.59 | 83.07 | 70.60
Our ensemble classifier | D | 85.70 | 87.70 | 87.11 | 88.17 | 90.64
Our ensemble classifier | S | 78.25 | 73.09 | 82.61 | 97.95 | 89.81
Our ensemble classifier | V | 88.87 | 82.57 | 93.75 | 90.61 | 94.90
TABLE XIV:
Whole image 3-class classification (S vs V vs D) F1-score percentages obtained from 5-fold cross-validation.

From the reported results, we find that our proposed ensemble classifier beats both Classifiers A and B, displaying better classification performance. This finding confirms our initial hypothesis that combining two individual classifiers, one of which captures patch-level information while the other contains image-wise information, results in a stronger classifier with higher classification accuracy and better performance. Figure 16 presents some of the samples that were classified wrongly by Classifier A or B but that the proposed ensemble classifier assigned to the correct class. An interesting observation is that most of the samples misclassified by Classifier A were images in which the wound constitutes a small proportion of the entire image. On the other hand, Classifier B missed samples in which the wound occupies a large part of the image. These observations show how the two weak classifiers cooperate to fix each other's shortcomings, resulting in a superior classifier. Indeed, having objects at different scales in the dataset has always been one of the challenges in deep learning-based tasks, as discussed in several studies [van2017learning, hesamian2019deep, zhao2019object]. Specifically, in the field of wound care, it is not guaranteed that high-quality images can be taken from an optimal viewpoint and desired distance to the wound surface, because of medical concerns such as infection control as well as the patient's comfort [goyal2018dfunet]. Our results show that Classifier B can partially overcome the scale problem for images taken from a greater distance to the wound; it of course loses this advantage for photos zoomed in on the wound area.

Fig. 16: Some of the misclassified samples. The top row shows the samples misclassified by Classifier B, and the bottom row displays the images wrongly classified by Classifier A.

VI Conclusion

In this paper, we proposed an end-to-end ensemble deep CNN-based classifier for the classification of wound images into multiple classes based on wound type. To the best of our knowledge, our proposed classifier is the first model that classifies wound images into more than two types. We initially designed patch classifiers with a fine-tuned AlexNet architecture to efficiently classify wound patches into different wound types. The influence of different wound types on classification accuracy was investigated by running numerous experiments and testing different combinations of the wound types. For the image-wise classification task, for each input image, a feature vector is first created from the designed patch classifier and from another AlexNet trained on whole images. The feature vector is then fed into an MLP to obtain an ensemble image-wise classifier with higher accuracy and better performance. The results show that our proposed ensemble classifier can be used successfully as a decision support system in wound image classification tasks to assist physicians in related clinical applications. We have made the dataset used for the current research publicly available. As a future study, we plan to improve the performance and classification accuracy of our proposed method by trying different combinations of the patch-wise and image-wise classifiers. In addition, testing the proposed approach on classifying images into more classes using a larger dataset of wound images will be one of the subsequent steps of this research.

Acknowledgment

This work is partially supported by the Discovery and Innovation Grant (DIG) award and the Catalyst Grant Program at The University of Wisconsin-Milwaukee.

References