A critical issue in the healthcare industry, particularly in the United States, is the effective management of postoperative wounds. The World Health Organization estimates that 359.5 million surgical operations were performed in 2012, an increase of 38% over the preceding eight years. Surgeries expose patients to an array of possible afflictions at the surgical site. Surgical site infection (SSI) is an expensive healthcare-associated infection: the difference between the mean unadjusted costs for patients with and without SSI is approximately $21,000. Thus, individual SSIs have a significant financial impact on healthcare providers, patients, and insurers. In the United States alone, SSIs occur in 2 to 5 percent of patients undergoing inpatient surgery, resulting in approximately 160,000 to 300,000 SSIs each year, as summarized by .
Currently, most wound findings are documented via visual assessment by surgeons. Patients revisit their surgeon a few days after the operation for this checkup. This takes up valuable time that a surgeon could use to help out other patients. Infections can also set in earlier and the delay until the checkup can exacerbate the issue. Moreover, there is a lack of quantification of surgical wounds. An automated analysis of a wound image can provide a complementary opinion and draw the attention of a surgeon to particular issues detected in a wound. Thus, a rapid and portable computer aided diagnosis (CAD) tool for wound assessment will greatly assist surgeons in determining the status of a wound in a timely manner.
Advances in software and hardware, in the form of powerful algorithms and computing units, have allowed deep learning algorithms to solve a wide variety of tasks that were previously deemed difficult for computers to tackle. Challenging problems such as playing strategic games like Go and poker, and visual object recognition, are now tractable using modern compute environments. A type of artificial neural network called a convolutional neural network (CNN) has demonstrated highly accurate image classification after being trained on a large dataset of samples. In the past decade, research efforts have led to impressive results on medical tasks, such as automated skin lesion inspection and X-ray based pneumonia identification.
In this paper, we propose a novel approach to identifying the onset of wound ailments through the simple means of a picture. We introduce a CNN architecture, WoundNet, and train it with a HIPAA compliant dataset of wound images collected by patients and doctors using smartphones. Finally, we build a mobile application for the iOS software ecosystem that presents a user implementation of our CAD system. It includes clinically relevant features, such as the daily documentation of patient health and generation of wound assessments. The app enables patients to generate wound analysis reports and send them to the surgeon on a regular basis from a remote location, such as their home.
II Prior Work
While the applications of machine learning in healthcare are numerous, few have attempted to solve the problem of postoperative wound analysis and surgical site monitoring. We summarize two key pieces of research that sought to build models similar to the one presented in this paper.
Wang et al. showcased a comprehensive pipeline for wound analysis, from wound segmentation to infection scoring and healing prediction. For binary infection classification, they obtained an F1 score of 0.348 and an accuracy of 95.7% with a Kernel Support Vector Machine (SVM) trained on CNN-generated features. Their dataset consisted of 2,700 images with 150 cases positive for SSI.
Another paper by Sanger et al. used classical machine learning to predict the onset of SSI in a wound. Their model is trained on baseline risk factors (BRF), such as pre-operative labs (e.g., blood tests), type of operation, and a multitude of other features. Their best classifier achieved a sensitivity of 0.8, a specificity of 0.64, and a receiver operating characteristic (ROC) area-under-curve (AUC) of 0.76. By computing the harmonic mean of their sensitivity and specificity, we determine that their F1 score is 0.71.
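The 0.71 figure can be reproduced with a short calculation. The sketch below is illustrative only; the function name `harmonic_mean` is ours. Note that F1 is conventionally the harmonic mean of precision and sensitivity, so the value computed here follows the substitution of specificity for precision described in the text.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative rates."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Values reported for the best classifier of Sanger et al.
sensitivity = 0.80
specificity = 0.64

score = harmonic_mean(sensitivity, specificity)
print(round(score, 2))  # 0.71
```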
While the infection scoring model presented by Wang et al. achieves an accuracy of 95.7%, we believe that this metric is insufficient given the severe class imbalance in their dataset. Sensitivity, specificity, F1 score, and ROC curves are better suited to imbalanced data. Our work improves upon the results of Wang et al. on these metrics. While Sanger et al. built a predictive methodology based on BRF, our approach leverages pixel data from wound images. Thus, our research complements any analysis using BRF.
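To see why accuracy alone can mislead under class imbalance, consider a hypothetical classifier that labels every image as uninfected. Using only the dataset counts quoted above for Wang et al. (2,700 images, 150 positive), the sketch below computes what such a classifier would score; the classifier itself is our illustration, not something from either paper.

```python
total_images = 2700    # dataset size reported for Wang et al.
positive_cases = 150   # images positive for SSI

# Hypothetical classifier that always predicts "no infection".
true_positives = 0
true_negatives = total_images - positive_cases

accuracy = (true_positives + true_negatives) / total_images
sensitivity = true_positives / positive_cases

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}")
```

The accuracy is high even though no infected wound is ever detected, which is the situation that sensitivity and F1 score are designed to expose.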
According to our literature search, no prior work has computationally modeled dressing identification or wound ailments other than SSIs. Thus, we believe we have built the most robust and comprehensive wound classification algorithm to date.
III Materials and Methods
III-A Data Collection and Description
Prior to this research, a dataset of 1,335 smartphone wound images was collected primarily from patients and surgeons at the Palo Alto VA Hospital and the Washington University Medical Center in St. Louis. The dataset also includes images from searching the internet to counteract class imbalance. All images were anonymized and cropped into identical squares.
Figure 1 shows a few examples of images from the dataset. As can be seen, images are very diverse and contain high variability. Images ranged from open wounds with infections to closed wounds with sutures. Table I shows the breakdown of the entire dataset.
Many tools went into the development of this research. Our CNNs were engineered using the Keras deep learning framework in the Python 3.7 programming language. The neural networks were trained on an NVIDIA Tesla K80 GPU hosted on the Amazon Web Services Elastic Compute Cloud (EC2) platform. The OpenCV computer vision library was used for histogram equalization and image inspection. Scikit-learn was leveraged for its variety of built-in metrics for model evaluation. The final model was deployed on a server using Flask.
A standard iOS development setup was used for the mobile application. The app was built using the Swift 4 programming language and Xcode integrated development environment.
Figure 2 summarizes the four steps in the development of our model.
In this section, we cover the first three blocks of this pipeline. The “Model Testing and Analysis” block is covered in the Results and Discussion section of this paper.
III-C Data Preprocessing
Figure 3 gives a summary of the data preprocessing steps. Once data is loaded into memory, images are resized to 224 by 224 pixels to fit the input of our CNN architecture. The input layer is 224 by 224 by 3, the final dimension accounting for the three color channels. We then partition the dataset into training and validation sets: 80% of the data is used to train our model and 20% is held out for model evaluation and testing.
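A minimal sketch of the 80/20 partition follows. The text does not say how images were assigned to each set, so the random shuffle, the fixed seed, the helper name `split_dataset`, and the placeholder file names are all assumptions made for illustration.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and split it into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1,335 wound images.
image_paths = [f"wound_{i:04d}.jpg" for i in range(1335)]
train_paths, val_paths = split_dataset(image_paths)

print(len(train_paths), len(val_paths))  # 1068 267
```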
A critical component of the preprocessing stage is to compensate for vast differences in lighting and position found in smartphone images. To accommodate this, we apply contrast limited adaptive histogram equalization (CLAHE) to each image. Histogram equalization (HE) takes a low-contrast image and increases the contrast between the image’s relative highs and lows, bringing out subtle differences in shade and producing a higher-contrast image. CLAHE applies HE to individual 8x8 pixel tiles across the image. Contrast limiting is used to prevent noise from being amplified; we use a contrast limiting factor of 1. Finally, bilinear interpolation is applied to remove artifacts at the tile borders.
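The core histogram equalization step can be sketched in pure Python as below. This is plain global HE only: it omits the per-tile processing, contrast limiting, and bilinear blending that make up CLAHE. The function name `equalize_histogram` and the sample image are illustrative. In OpenCV, CLAHE is available through `cv2.createCLAHE`; mapping the contrast limiting factor of 1 onto its `clipLimit` argument would be an assumption on our part.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization of a grayscale image (list of rows of ints)."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]

    # Map each intensity so that the output spans the full range.
    scale = (levels - 1) / (total - cdf_min)
    lookup = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lookup[p] for p in row] for row in image]


# A low-contrast image whose intensities span only 100..104.
low_contrast = [
    [100, 101, 102, 103],
    [100, 101, 102, 103],
    [101, 102, 103, 104],
    [101, 102, 103, 104],
]
equalized = equalize_histogram(low_contrast)
print(equalized[0], equalized[3])
```

After equalization the five input intensities are spread across the full 0 to 255 range while their ordering is preserved.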
III-D Model Generation
The second step in the development of our model is model generation, shown in Figure 4. We take the preprocessed images and generate three slightly different CNNs using the WoundNet architecture. We also prepare them for transfer learning by initializing the CNNs with ImageNet weights.
Convolutional neural networks vastly outperform other current machine learning models for large scale image processing and classification. We make adjustments to a current state of the art CNN, VGG-16, to better suit our specific problem. The resulting configured model is known as WoundNet, illustrated in Figure 5. Three models were initialized using the VGG-16 CNN architecture. While we did try other deeper network architectures, we found them to overfit on the data almost immediately, unlike VGG-16. We believe that this is due to the combination of imbalance and small size of our dataset. For example, some deeper networks
Rather than creating nine individual binary classifiers, we train each neural network to label images with all nine classes. This enables our model to find inter-label correlations through shared knowledge in the deep learning model.
We first remove the final output layer along with the two 4096-neuron fully connected (FC) layers prior to it. We append two smaller 1024-neuron FC layers, each with a dropout of 0.5, and an output layer with the sigmoid activation function. Dropout is a form of regularization that randomly deactivates a chosen fraction of the units in a layer during training, thereby reducing overfitting.
Our motivation for these changes is as follows. Each wound image can be positive or negative for nine different classes, whereas every image in the ImageNet dataset has only a single label. The sigmoid function treats every class as an independent binary decision, while softmax converts the outputs of the layer prior to the output layer into probabilities that add up to 1. We determined that two 1024-element layers are the best choice to replace the FC layers, as they keep the model lightweight, make it faster to train, and reduce the computational complexity present in the original VGG-16 architecture.
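The difference between the two activations is easy to see numerically. The following sketch applies both to the same made-up logits for three labels; the values and function names are illustrative and do not come from the trained model.

```python
import math


def sigmoid(logits):
    """Independent per-class probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pre-activation outputs for three labels of one image.
logits = [2.0, 1.5, -3.0]

sig = sigmoid(logits)
soft = softmax(logits)

print([round(p, 2) for p in sig])
print([round(p, 2) for p in soft])
```

With sigmoid, the first two labels can both be reported as present with high confidence. With softmax they compete for a shared probability mass, which does not suit an image that may carry several labels at once.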
III-E Model Training
After generating our models, we fine-tune them on our data. The entire process is outlined in Figure 6. Our training phase can be divided into four critical components: transfer learning, data augmentation, training specifics, and ensembling multiple models. We will now cover each of these pieces in depth.
III-E1 Transfer Learning
In practice, it is very difficult to train a CNN end to end starting from randomly initialized weights. Huge datasets with upwards of a million images are typically necessary to train an accurate neural network from scratch. Too little data, as in our case, would cause a model to overfit.
We employ transfer learning, also known as fine-tuning, to leverage pre-learned features from the ImageNet database. The original VGG-16 model was trained end to end using approximately 1.3 million images (1000 object classes) from the 2014 ImageNet Large Scale Visual Recognition Challenge. We use the weights and layers from the original VGG-16 model as a starting point.
Transfer learning leverages previously learned low-level features (such as lines, edges, and curves). Since these features are common to most image classification tasks, transfer learning requires less data to arrive at a satisfactory CNN. For optimum generalization and to prevent overfitting, we freeze the three WoundNet models at layers 6, 10, and 14, respectively. This prevents the low-level features from being overwritten while the CNNs are trained on the training set of wound images.
III-E2 Data Augmentation
In order to make the most out of our training set, we utilize aggressive data augmentation prior to feeding images into the CNN. Data augmentation improves the generalization and performance of a deep neural network by copying images in the training set and performing a variety of random transformations on them. Each copy is randomly rotated from 0° to 360°, shifted by 10 pixels in any direction, zoomed into by a factor of 30%, and sheared by a factor of 20%. Copies are flipped vertically or horizontally with equal probability of 50%. The copied data is given the same labeling as its original and is added to the current training batch. Some images generated via data augmentation are shown in Figure 7.
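As a rough sketch, the random parameters for one augmented copy could be drawn as follows. The ranges are our reading of the description above: we assume the shift is up to 10 pixels per axis, and that the zoom and shear figures are symmetric ranges around no change. The dictionary keys and the function name `sample_augmentation` are hypothetical, and applying the geometric transforms to pixel data is left out.

```python
import random


def sample_augmentation(rng):
    """Draw one random set of transform parameters for a copied image."""
    return {
        "rotation_deg": rng.uniform(0.0, 360.0),
        "shift_x_px": rng.uniform(-10.0, 10.0),   # assumed: up to 10 px per axis
        "shift_y_px": rng.uniform(-10.0, 10.0),
        "zoom": rng.uniform(0.7, 1.3),            # assumed: +/-30% around 1.0
        "shear": rng.uniform(-0.2, 0.2),          # assumed: +/-20% around 0
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [sample_augmentation(rng) for _ in range(200)]
print(samples[0])
```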
III-E3 Training Specifics
Our CNNs are trained using the backpropagation algorithm with a batch size of 64. Backpropagation is an application of the chain rule of calculus to compute loss gradients for all weights in the network. Once an image is passed through the CNN during the training phase, the error is calculated using a loss function. The loss gradient is then propagated backwards through the CNN and used to adjust its weights. This way, the next time the CNN sees the same image, its outputs will be closer to the correct labels.
We use the binary cross-entropy loss function. Layers are first trained using the Adam optimizer for 30 epochs with a learning rate of 1e-3. We then continue to train the model using the stochastic gradient descent (SGD) optimizer with a learning rate of 1e-4 for 50 epochs. The noise in SGD updates can help the optimizer escape poor local minima of the loss function, making it easier for the CNN to reach a lower final loss.
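For reference, binary cross-entropy for a multi-label prediction can be written out in a few lines. The sketch below averages the loss over the labels of a single image; the clipping constant `eps` and the example values are assumptions for illustration, not the settings used in training.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all labels of one image."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical ground truth and sigmoid outputs for three labels.
y_true = [1, 0, 1]
good = binary_cross_entropy(y_true, [0.9, 0.1, 0.8])
bad = binary_cross_entropy(y_true, [0.2, 0.7, 0.4])

print(round(good, 3), round(bad, 3))
```

Predictions close to the true labels give a small loss, while confident mistakes are penalized heavily.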
III-E4 Ensemble Multiple Models
We use the process of ensemble averaging to combine our three separate models into one. This approach is superior to generating only one classifier as the various errors among each model due to overfitting or underfitting will average out, resulting in higher overall scores. This ensemble is called Deepwound.
When a new image is fed into Deepwound, it is independently delegated to each member WoundNet CNN for classification. The results from each algorithm are consolidated into one result matrix through majority-voting for the presence or absence of each label.
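The per-label majority vote can be sketched as follows, assuming each member model has already produced a thresholded 0/1 vector over the nine labels. The example vectors and the function name `majority_vote` are made up for illustration.

```python
def majority_vote(member_predictions):
    """Combine binary label vectors from several models by per-label majority."""
    n_models = len(member_predictions)
    n_labels = len(member_predictions[0])
    combined = []
    for j in range(n_labels):
        votes = sum(pred[j] for pred in member_predictions)
        combined.append(1 if votes * 2 > n_models else 0)
    return combined


# Hypothetical thresholded outputs of the three member CNNs
# for the nine labels of a single image.
model_a = [1, 0, 0, 1, 0, 0, 1, 0, 0]
model_b = [1, 0, 1, 1, 0, 0, 0, 0, 0]
model_c = [0, 0, 0, 1, 0, 1, 1, 0, 0]

print(majority_vote([model_a, model_b, model_c]))
# [1, 0, 0, 1, 0, 0, 1, 0, 0]
```

With three members, a label is reported as present only when at least two models agree, so an isolated error by one model does not change the result.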
IV Mobile Application Pipeline
With a predicted 6.8 billion smartphones in the world by 2022, mobile health monitoring platforms can be leveraged to provide the right care at the right time. In this research, we have developed a comprehensive mobile application, Theia, as a way to deliver our Deepwound model to patients and providers. Screenshots from the final app are shown in Figure 8.
Theia is a proof-of-concept of how Deepwound can assist physicians and patients in postoperative wound surveillance. The first component of the app is the “Quick Test.” Physicians or patients can quickly photograph a wound and generate a wound assessment. A wound assessment provides positive or negative values for each label affiliated with a wound.
The other component of the app is the ability for a patient to track wounds over a period of time. Every day, the patient receives a notification to complete a daily wound assessment, in which they provide an image of the surgical site, record their current weight, and rate their pain in that area. This data is accumulated over a period of 30 days. Furthermore, patients can track many different variables that can affect their wound recovery, such as medicine intake, the changing of their wound dressing, weight, and pain level. All of this information is charted over time and can be converted into a PDF that can be sent to physicians or family. Finally, we provide easy access for the patient to directly contact their surgeon through the app itself.
With permission from the patient, this app can also be used to collect wound images to add to our dataset. The enlarged dataset can be used to further improve our deep learning algorithms. As more patients and surgeons use the app, more image data can be collected. This newly accumulated data can be used to train our CNNs even further, leading to a virtuous cycle of improving accuracy.
Deepwound is used to classify every image the user takes. The image is stored securely within the app and anonymized when sent to the server for image processing and classification via a multipart HTTP request.
V Results and Discussion
We now present results for our computational model. We evaluate our CNN ensemble by calculating a variety of classification metrics (e.g., accuracy, sensitivity, specificity, and F1 score), analyzing receiver operating characteristic curves, and generating saliency maps. This process is diagrammed in Figure 9.
V-A Classification Metrics
We use several metrics to evaluate the performance of the ensemble as a whole: accuracy, sensitivity, specificity, and F1 score (see Equations 1 through 4). The latter three are more reliable than accuracy for this paper, as they reflect how well the model discerns both the presence and the absence of a particular ailment. Table II displays all of our scores.
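The four metrics can be computed from confusion-matrix counts using their standard definitions, as in the sketch below. The counts in the example are invented for illustration and are not results from this paper; the function name `classification_metrics` is ours.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for one label on a validation set of 267 images.
metrics = classification_metrics(tp=40, fp=10, tn=200, fn=17)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```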
V-B Receiver Operating Characteristic and Area Under Curve
The area under the curve (AUC) of a receiver operating characteristic (ROC) curve is a useful metric in determining the performance of a binary classifier. ROC curves graphically represent the trade-off at every possible cutoff between sensitivity and specificity. Better classifiers have higher AUC values for their ROC curves while worse classifiers have lower AUC values. We chart ROC curves and calculate AUC values for each possible label for an image, as seen in Figure 10.
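One way to compute the AUC without plotting the curve uses its equivalence to the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below implements that pairwise form for a single label; the labels and scores are made up, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores for one class.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print(round(roc_auc(y_true, scores), 3))  # 0.889
```

This pairwise form is quadratic in the number of samples, which is acceptable for a validation set of a few hundred images.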
V-C Saliency Maps
When analyzing digital images using machine learning, it is important to understand why a certain classifier works. Saliency maps have been used to visualize the inner workings of CNNs in the form of a heat map that highlights the features within the image on which the classifier is focused. We generate saliency maps from one of our CNNs on a couple of images in the validation set to check that our classifiers are identifying the regions of interest for a particular label in an image (see Figure 11). We can confirm that the attention of the model is drawn to the correct regions in these images.
In summary, our work describes a new machine learning based approach using CNNs to analyze an image of a wound and document its wellness. Our implementation achieves scores that improve upon prior work by Wang et al. and Sanger et al. for F1 score and ROC AUC.
We acknowledge that our data set size is small and has some imbalance. This is a common problem in medical research as the data needs to be gathered over a sustained period of time with health compliant processes. We overcome these hurdles through the use of aggressive data augmentation, transfer learning, and an ensemble of three CNNs.
Our approach for analysis and delivery with a smartphone is a unique contribution. It enables several key benefits: tracking a patient remotely, ease of communication with the medical team, and the ability to detect the early onset of infection. Widespread use of such tools can also enable automated data collection and classification at a lower cost, which in turn allows the machine learning algorithm to be improved through re-training with a larger data set. Our mobile app can also generate comprehensive wound reports that can be used for billing insurers, thus saving surgeons time.
VII Future Work
There are many ways to improve our algorithm. On a larger scale, it is necessary to gather more images for both training and testing. Creating a robust corpus of images will enable us to improve the performance of our method. More labeled images generally lead to higher performance in deep learning.
We would like to consider blur detection prior to analyzing our image. If the image is too blurry, we can send a message back to the user, requesting a clearer picture. There are many well-known techniques to accurately measure blur within an image.
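One commonly used blur measure is the variance of the Laplacian: sharp images have strong edge responses and therefore a high variance, while blurry images do not. The sketch below is a pure-Python illustration of that idea on a small grayscale grid. It is one candidate technique rather than a choice made in this work, and the threshold of 100 is an arbitrary placeholder that would need tuning on real wound photographs.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over the interior of a grayscale image.

    `image` is a list of rows of intensities and must be at least 3x3.
    """
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_blurry(image, threshold=100.0):
    """Flag an image as blurry when its Laplacian variance is below `threshold`."""
    return laplacian_variance(image) < threshold


# Synthetic test images: a high-contrast checkerboard and a flat gray patch.
sharp = [[255 if (x + y) % 2 else 0 for x in range(6)] for y in range(6)]
flat = [[128] * 6 for _ in range(6)]

print(is_blurry(sharp), is_blurry(flat))  # False True
```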
We would also like to look into embedding our model into mobile devices directly without the need for a server. This will drastically increase speed for users and enable them to use the app in locations without access to the internet. Finally, we would like to extend our wound assessment framework by developing a computational model to track the healing of a wound using a time-series of images which can be collected using the current version of the mobile app.
-  T. G. Weiser, A. B. Haynes, G. Molina, S. R. Lipsitz, M. M. Esquivel, T. Uribe-Leitz, R. Fu, T. Azad, T. E. Chao, W. R. Berry and A. A. Gawande, ”Size and distribution of the global volume of surgery in 2012,” Bulletin of the World Health Organization, 2016.
-  M. L. Schweizer, J. J. Cullen, E. N. Perencevich and M. S. Vaughan Sarrazin, ”Costs Associated With Surgical Site Infections in Veterans Affairs Hospitals,” JAMA Surg, vol. 146, no. 9, pp. 575-581, 2014.
-  D. J. Anderson, K. Podgorny, S. I. Berríos-Torres, D. W. Bratzler, E. P. Dellinger, L. Greene, A. C. Nyquist, L. Saiman, D. S. Yokoe, L. L. Maragakis and K. S. Kaye, ”Strategies to Prevent Surgical Site Infections in Acute Care Hospitals: 2014 Update,” Infect Control Hosp Epidemiol, vol. 35, no. 6, pp. 605- 627, 2014.
-  D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel and D. Hassabis, ”Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
-  N. Brown and T. Sandholm, ”Superhuman AI for heads-up no-limit poker: Libratus beats top professionals,” Science, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ”ImageNet: A large-scale hierarchical image database,” IEEE Conference on Computer Vision and Pattern Recognition, 2009.
-  Y. LeCun, Y. Bengio and G. Hinton, ”Deep learning,” Nature, vol. 521, pp. 436-444, 2015.
-  A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau and S. Thrun, ”Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, pp. 115-118, 2017.
-  P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, M. P. Lungren and A. Y. Ng, ”CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning,” arXiv preprint arXiv:1711.05225, 2017.
-  C. Wang, X. Yan, M. Smith, K. Kochhar, M. Rubin, S. M. Warren, J. Wrobel and H. Lee, ”A unified framework for automatic wound segmentation and analysis with deep convolutional neural networks,” 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, 2015.
-  P. C. Sanger, G. H. van Ramshorst, E. Mercan, S. Huang, A. L. Hartzler, C. A. Armstrong, R. J. Lordon, W. B. Lober and H. L. Evans, ”A Prognostic Model of Surgical Site Infection Using Daily Clinical Wound Assessment,” J Am Coll Surg, vol. 223, no. 2, pp. 259-270, 2016.
-  F. Chollet, ”Keras,” 2015. [Online]. Available: https://github.com/keras-team/keras.
-  G. Bradski, ”The OpenCV Library,” Dr. Dobb’s Journal: Software Tools for the Professional Programmer, vol. 25, no. 11, pp. 120-123, 2000.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and É. Duchesnay, ”Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825-2830, 2011.
-  S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. ter Haar Romeny, J. B. Zimmerman and K. Zuiderveld, ”Adaptive Histogram Equalization and its Variations,” Computer Vision, Graphics and Image Processing, vol. 39, pp. 355-368, 1987.
-  K. Simonyan and A. Zisserman, ”Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  A. Karpathy, ”Transfer Learning,” Stanford University. [Online]. Available: http://cs231n.github.io/transfer-learning/.
-  D. P. Kingma and J. Ba, ”Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2017.
-  Ericsson, Inc., ”Ericsson Mobility Report: June 2017”, June 2017. [Online]. Available: https://www.ericsson.com/assets/local/mobility-report/documents/2017/ericsson-mobility-report-june-2017.pdf.
-  K. Simonyan, A. Vedaldi and A. Zisserman, ”Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” arXiv preprint arXiv:1312.6034, 2014.