The World Health Organization (WHO) indicates that hand hygiene is extremely important to prevent the transmission of bacteria and viruses, because it avoids their transfer from contaminated surfaces to the mucous membranes of the mouth, nose, or eyes. With regard to COVID-19 as well, it is estimated that a non-negligible share of infections occurs through contact between our hands and contaminated surfaces santarpia2020. For these reasons, one of the most important measures that anyone can take to prevent the transmission of harmful germs is to take care of hand hygiene. To ensure proper hand hygiene, WHO suggests either handrubbing, using an alcohol-based formulation, or handwashing with soap and water. The two suggested procedures, described in Figure 1, entail different steps with different durations. In particular, handwashing with soap and water comprises eleven steps and should last between 40 and 60 seconds, while handrubbing includes only eight steps and lasts between 20 and 30 seconds. WHO also suggests that alcohol-based handrub should be used for routine decontamination of hands, while handwashing with soap and water is recommended when hands are visibly soiled.
Despite the proven effectiveness of these two procedures, most people ignore them or simply do not follow them because they are not trivial to carry out. As a result, a significant number of people simply wash or rub their hands as they have always been accustomed to. For this reason, in this work we focus on the recognition of unstructured handwashing/handrubbing, with the aim of proposing an automatic smartwatch-based system capable of monitoring hand hygiene in the general population.
Wearable devices, such as modern smartwatches, are equipped with several sensors capable of continuously measuring characteristic parameters of our body movement. For instance, Wang et al., in 2020, measured the accuracy of wearable bracelets equipped with accelerometers, gyroscopes, and electrodes for surface electromyography (sEMG) in identifying and monitoring the handwashing/handrubbing procedures suggested by WHO, reaching an accuracy of over 96% in recognizing the true hand activity Wang2020. Before them, several authors demonstrated the effectiveness of wearable devices in the classification and identification of general human activities such as running, walking, climbing/descending stairs, jumping, and sitting zhang2013; sztyler2016; sztyler2017; bhat2018; koping2018.
Furthermore, current machine learning techniques have given a further boost to these studies by significantly increasing their classification accuracy which, for general macro-activities, now reaches values up to 99% cheng2010; hassan2018; hou2020.
In this paper, we investigate the ability of an automatic machine-learning-based system to distinguish handwashing and handrubbing gestures from the rest of the activities that a person performs every day, without the use of invasive instruments and relying only on commonly used wearable devices such as commercial smartwatches. Once validated, this tool will provide users with automatic and continuous indirect monitoring of hand hygiene, in an attempt to reduce the spread of COVID-19 and of other diseases caused by pathogens transmissible through direct contact.
The paper is organized according to the following structure: in Section 2 we describe state-of-the-art approaches related to our work, according to the scientific literature; in Section 3 we report a background description of the proposed machine learning tools; in Section 4 we illustrate the proposed method and the related design choices; in Section 5 we present the results of the experimental evaluation; in Section 6 we report some conclusive remarks.
2 Previous work
At present there are no scientifically validated devices or applications that are able to recognize the activity of washing/rubbing hands by means of wearable tools. A commercially available system called SureWash, produced by Glanta Ltd surewash2021, detects the hand movements of hospital staff through video cameras in order to provide information about the correct execution of the procedure defined by WHO. However, camera-based systems face two main problems: the first is privacy, as such systems inevitably require the installation of cameras in several rooms; the second is that they are not ubiquitous (it is not possible, for instance, to monitor a person who rubs their hands with sanitizing gel away from the cameras).
As for approaches based on wearable sensors, the relevant scientific contributions are few, and most of them rely on multiple sensors with the very high sensitivity and accuracy typical of scientific instrumentation Galluzzi2015; bal2017; Li2018. These preliminary works show that the automatic recognition of hand washing through inertial sensors (accelerometers and gyroscopes) is feasible but, on the other hand, they study neither the potential of commercial smartwatches in common use nor the application of modern deep-learning techniques.
In this work, we evaluate the feasibility of a smartwatch-based system for recognizing unstructured handwashing/handrubbing activities by means of modern machine learning techniques. Moreover, we aim to determine whether the reduction in accuracy due to the use of non-professional devices can be compensated for by complex machine learning classification techniques.
Two relevant works which make use of commercial smartwatches were published in 2015 and in 2021. The first one, presented by Mondol et al., describes a handwash monitoring and reminder system which interacts with a Bluetooth-enabled soap dispenser to recognize the start of the washing procedure Mondol2015. Thanks to these broadcast advertisements, the smartwatch can easily start processing accelerometer and gyroscope data in order to recognize each step of the procedure described by WHO. Notice that, besides the evident help provided by the broadcast message from the soap dispenser, which among other things requires the installation of specific devices, the authors do not deal with unstructured washing/rubbing.
The second relevant work was published in 2021 by Samyoun et al. Samyoun2021, who present a smartwatch-based handwashing quality assessment system. In particular, it measures handwashing quality in terms of similarity to the standard WHO guidelines, without taking unstructured washing/rubbing into consideration. Moreover, the presented system identifies when the user enters home from outside in order to immediately provide a reminder to wash hands. Finally, the start and the end of the washing events are marked through voice interaction with the user, which facilitates the recognition of the activity.
In this section we report some background information about the machine learning tools investigated with the proposed method. In particular, two standard machine learning tools and two deep learning tools have been tested. As for standard learning, we evaluated Support Vector Machines (SVM) and Ensemble subspace with k-nearest neighbors (ES-KNN), while, in the deep-learning domain, we considered a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM).
3.1 Ensemble subspace with k-nearest neighbors (ES-KNN)
The k-nearest neighbors (KNN) algorithm is one of the simplest and easiest to implement supervised machine learning algorithms, and it can be used in both regression and classification problems. It assigns an unknown observation to the class most common among its k nearest neighbors in the training data, as measured by a distance metric fix1989; altman1992.
Despite its simplicity, KNN gives competitive results and in some cases even outperforms more complex learning algorithms. However, one common problem affecting KNN is the possible presence of non-informative features in the data, which can increase misclassification errors. This is more likely in the case of high-dimensional data.
To improve KNN classification performance, ensemble techniques have been proposed in the literature. In general, the ensemble method entails creating multiple models and combining them (for instance by averaging) to produce the desired output, as opposed to creating just one model. Several studies show that, frequently, an ensemble of models performs better than any individual model, because the various errors of the models average out naftaly1997.
One way to generate an ensemble in machine learning is to train the classifiers on different sets of data, obtained by creating several subsets from the original training set. This technique, which is commonly called Ensemble subspace, has been widely explored by several authors, among which the contributions of Breiman and Freund et al., known respectively as bagging and boosting subspace ensembles, certainly stand out breiman1996; freund1996.
In this work, we focus on a particular class of ensemble subspace tools applied to KNN algorithms, called Ensemble Random Subspace KNN (ERS-KNN). According to this technique, the features are randomly sampled, with replacement, for each learner, which prevents it from over-focusing on features that appear highly predictive/descriptive in the training set but which can fail on unseen data ho1998; li2011.
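The random subspace idea can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the function names, the number of learners, the subspace size, and the value of k are all assumptions chosen for illustration, and the final decision is taken by simple majority vote among the learners.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k, feature_idx):
    """Classify x by majority vote among its k nearest training points,
    measuring squared Euclidean distance only on the features in feature_idx."""
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in feature_idx)

    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def random_subspace_knn(train_X, train_y, n_learners=15, n_features=2, k=3, seed=0):
    """Return a predict(x) function backed by an ensemble of KNN learners,
    each restricted to its own random feature subset (hypothetical defaults)."""
    rng = random.Random(seed)
    n_total = len(train_X[0])
    # Features are drawn with replacement, as described in the text; a
    # duplicated index simply weighs that feature more in the distance.
    subspaces = [[rng.randrange(n_total) for _ in range(n_features)]
                 for _ in range(n_learners)]

    def predict(x):
        votes = [knn_predict(train_X, train_y, x, k, idx) for idx in subspaces]
        return Counter(votes).most_common(1)[0][0]

    return predict
```

Because each learner sees only a few features, a single misleading feature can influence only part of the votes, which is the intuition behind the robustness claimed for this family of methods.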
3.2 Support Vector Machines (SVM)
SVMs are another class of supervised learning models, traditionally used for regression and classification problems with a reduced number of samples.
An SVM model represents the input data as points in space, in such a way that the data belonging to different classes are separated by a margin that is as large as possible. New data are then mapped into the same space, and the category to which they belong is predicted on the basis of the side of the margin on which they fall. From the practical point of view, an SVM defines the hyperplane that best divides the dataset into the desired classes.
Moreover, in addition to simple linear classification, SVMs can effectively carry out nonlinear classification using nonlinear kernel methods, which implicitly map the input data into a higher-dimensional feature space (steinwart2008).
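The implicit mapping performed by a kernel can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: it uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which the explicit feature map is short enough to write down, and the function names are hypothetical. Higher-degree polynomial kernels follow the same principle with a larger feature space.

```python
import math


def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel: (x . z) ** degree."""
    return sum(a * b for a, b in zip(x, z)) ** degree


def explicit_map(x):
    """Explicit feature map of the degree-2 kernel for 2-D inputs:
    (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    """Plain inner product in the explicit feature space."""
    return sum(a * b for a, b in zip(u, v))
```

Evaluating `poly_kernel(x, z)` gives the same number as `dot(explicit_map(x), explicit_map(z))`, but without ever building the mapped vectors, which is what makes kernels practical when the feature space is large.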
3.3 Convolutional neural network (CNN)
A CNN is a kind of multi-layer neural network which follows a computer vision approach in order to exploit any spatial or temporal information in the data. The CNN, in fact, was inspired by the biological process that occurs in the animal visual cortex, where neurons respond only to separate regions of the visual field. In the same way, a CNN uses convolving filters to handle local regions within the data. A CNN is mainly composed of an input layer and of several convolutional layers, pooling layers, and fully connected layers. The input layer has the task of collecting the data and forwarding them to the subsequent layer. The convolutional layer represents the core of a CNN, as it contains several convolution filters, called kernels, which are convolved with the input data. The convolution operation automatically extracts useful features from the input data and reduces their dimension. Moreover, a pooling layer, also called subsampling layer, is inserted to further reduce the number of parameters and the resulting computational cost. It includes max-pooling and/or average-pooling operations which take, respectively, the maximum and the average value from the input. Finally, one or more fully connected layers act as a traditional Perceptron network which takes as input the features produced by the previous layer.
A CNN is traditionally built by stacking several batteries of layers, and it is used in the deep-learning approach also because it removes the need for feature extraction and feature selection, often at the cost of an increase in computational complexity and memory usage Albawi2017.
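The three basic operations described above (convolution, rectification, and max pooling) can be sketched on a one-dimensional signal. This is a didactic illustration in plain Python, not the network used in this work; the function names and the example kernel are assumptions.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation, the operation that CNN
    layers conventionally call convolution."""
    n_out = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n_out)]


def relu(values):
    """Rectified linear activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling: keep the largest value of each block."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]
```

For instance, the kernel `[-1, 1]` responds to rising edges: applied to `[0, 1, 3, 2, 5, 1]` it yields `[1, 2, -1, 3, -4]`, which becomes `[1, 2, 0, 3, 0]` after the ReLU and `[2, 3]` after max pooling with blocks of size 2.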
3.4 Long short-term memory (LSTM)
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. LSTM networks are mostly designed to recognize patterns within sequences of data such as numerical time series. RNNs and LSTMs differ from classical artificial neural networks in that they have a temporal dimension: they can process not only single data points, such as images, but also entire sequences of data such as speech or video.
A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. The unit thus decides what to store and when to allow reads, writes, and erasures, via gates that open and close hochreiter1997.
LSTM networks are well-suited to classifying, processing and making predictions based on time series data, and they have been used in many complex problems such as handwriting recognition, language modeling and translation, speech synthesis, audio analysis, protein structure prediction and many others yu2019.
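How the three gates act on the cell can be made concrete with a single time step of a scalar LSTM unit. The sketch below assumes the common formulation with sigmoid gates and a tanh candidate value; the parameter names (`wf`, `uf`, `bf`, and so on) are hypothetical and the weights are scalars only to keep the example short.

```python
import math

# Hypothetical parameter names: w* weighs the input, u* the previous
# hidden state, b* is the bias; f, i, o, g denote the forget gate,
# input gate, output gate, and candidate value.
PARAM_NAMES = ["wf", "uf", "bf", "wi", "ui", "bi",
               "wo", "uo", "bo", "wg", "ug", "bg"]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM unit; returns the new (h, c)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # what to erase
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # what to write
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # what to read
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g      # cell state: keep part of the old, add new
    h = o * math.tanh(c)        # hidden state exposed to the next layer
    return h, c
```

With a large forget-gate bias and a very negative input-gate bias the cell state is carried over almost unchanged, which is how the unit can remember a value over many time steps.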
4 The proposed method
In this work we evaluate the suitability of four different supervised classification methods, namely SVM, ERS-KNN, CNN, and LSTM, for classifying handwashing and handrubbing activities starting from gyroscopic and accelerometer data sampled in real-life conditions by means of a common smartwatch.
4.1 Experimental protocol and data gathering
As there is no publicly available dataset for handwashing or handrubbing quality assessment, we collected sensor data from a wearable Inertial Measurement Unit (IMU) positioned on the wrist of the dominant hand of four participants during real-life activities. In particular, each subject was asked to annotate the start and the end of each handwashing or handrubbing activity performed during the day. Each subject wore the IMU sensor for several hours on different days, leading to a total of about 40 hours of recording containing about 1 hour and 40 minutes spent washing hands and about 2 hours and 10 minutes spent rubbing. The wearable device was programmed to sample its triaxial accelerometer and gyroscope at a frequency of 100 Hz and to store the collected data on the internal SD card. In order to remove sensor bias, the device was calibrated once at the start of the study by placing it on a stable surface and recording the accelerometer and gyroscope measurements for 30 seconds.
Notice that the subjects were not instructed on how to wash or rub their hands; they were left completely free to follow their usual habits, so as to collect data about the unstructured way in which people normally wash their hands. Table 1 shows the average duration, together with the standard deviation, of each activity performed by the four subjects.
As we collect data in an unstructured way, the average duration and repeatability of each activity significantly depend on the subject. While this can be an advantage in recognizing a particular subject, since her/his way of washing the hands could act as a kind of fingerprint, it could also be a problem, because it reduces the ability to generalize the recognition of the activity itself.
To collect data about daily activities we use a Shimmer3 IMU unit equipped with two accelerometers, a gyroscope, and a magnetometer Shimmer3. Each embedded sensor samples information on three independent orthogonal axes (x, y, and z), for a total of nine degrees of freedom (DoF).
The two separate accelerometers allow the user to tailor the IMU configuration to the needs of the application. In particular, the first accelerometer is a wide-range sensor sampled at 14 bits, which can be configured to operate over several full-scale ranges, with a sensitivity that depends on the selected range.
The second accelerometer, on the other hand, is a low-noise sensor based on a capacitive MEMS element capable of suppressing measurement noise, and it offers a single full-scale range.
Notice that, since in this work we try to recognize the particular wrist dynamics during hand washing, we focus only on the data extracted from the first accelerometer, because rapid hand rotations can briefly induce high acceleration values which can go out of range for the low-noise one. Moreover, the wide-range accelerometer is a classical sensor with which most smartwatches are equipped.
4.2 Signal windowing
The recorded tracks, composed of six distinct signals (i.e. 3 accelerometer and 3 gyroscope waveforms), have been divided into time windows, and each window has been considered as a sample to be used to train and test the classifiers. Furthermore, each sample has been labeled using the annotations provided by each subject in accordance with the following categories: i) washing; ii) rubbing; iii) other.
Deciding the size of the time window is a non-trivial task, because it can influence the performance of classification models in different ways. In fact, it must be large enough to capture the "fingerprint" of the particular activity that we want to recognize, but not so large as to include consecutive activities. As for human activity recognition (HAR), different window lengths have been used in the literature, from 1s up to 30s cheng2010; hassan2018; hou2020. In particular, as for handwashing recognition, both Li et al. in 2018 and Samyoun et al. in 2021 use a very short time window (only 0.06 seconds), with 70% overlap between subsequent windows, because they aim at recognizing each step of the structured handwashing procedure Li2018; Samyoun2021. In 2015, Galluzzi et al. and Mondol et al. use larger windows of 0.5 and 1 seconds respectively, while Wang et al. in 2020 found that a window 0.2 seconds long with 75% overlap gave the best classification accuracy Galluzzi2015; Mondol2015; Wang2020. To better highlight the impact of the window length on the overall classification performance, we present in this work an extensive sensitivity analysis of the classifiers with respect to this parameter.
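The interplay between sampling frequency, window length, and overlap can be sketched as follows. The function name and the `overlap` parameter are assumptions introduced for illustration; the overlap used for the windows of this work is not stated here, so it is left as a free parameter.

```python
def sliding_windows(n_samples, fs_hz, window_s, overlap=0.0):
    """Return the (start, end) sample indices of fixed-length windows.

    n_samples: length of the recording in samples
    fs_hz:     sampling frequency in Hz (100 Hz in the recordings above)
    window_s:  window length in seconds
    overlap:   fraction of each window shared with the next one, in [0, 1)
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    win = int(round(window_s * fs_hz))
    # The hop between consecutive windows shrinks as the overlap grows.
    step = max(1, int(round(win * (1.0 - overlap))))
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, step)]
```

For example, 10 seconds of data at 100 Hz split into 2-second windows with 50% overlap gives nine windows of 200 samples each, the first covering samples 0 to 199 and the last covering samples 800 to 999.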
Notice that, because the proposed gathering protocol continuously records sensor data during real-life activity, the number of samples containing no washing/rubbing events is much greater than the number of samples which contain them. For this reason, the samples labeled as other have been randomly undersampled in order to rebalance the occurrence of each class.
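Random undersampling of the majority class can be sketched in a few lines. The target size is an assumption: here the other class is reduced to the size of the largest remaining class, which is one plausible way to rebalance the data and not necessarily the ratio adopted in this work. The function name and the fixed seed are also hypothetical.

```python
import random
from collections import defaultdict


def undersample_majority(samples, labels, majority="other", seed=0):
    """Randomly discard samples of the majority class until it is as
    large as the biggest remaining class (assumed rebalancing target)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    minority_sizes = [len(v) for k, v in by_class.items() if k != majority]
    if not minority_sizes or majority not in by_class:
        return list(samples), list(labels)  # nothing to rebalance

    target = min(max(minority_sizes), len(by_class[majority]))
    by_class[majority] = rng.sample(by_class[majority], target)

    # Shuffle so that classes are interleaved in the returned lists.
    pairs = [(s, y) for y, group in by_class.items() for s in group]
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]
```

All washing and rubbing samples are kept; only samples of the other class are dropped.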
4.3 The classifiers
As machine learning classifier models we used multi-class SVM and ERS-KNN (standard tools), and CNN and LSTM (neural networks). As for the SVM tool, a cubic polynomial kernel has been chosen for performance reasons steinwart2008. We also considered other kernels (i.e. linear, quadratic, or Gaussian functions); however, these did not reach the performance of the cubic kernel.
In order to train and test the standard machine learning tools (i.e. those not based on neural network approaches), the input signal needs to be processed to extract synthetic features. In particular, in this work three sets of descriptors have been computed for each window. The first set, hereafter referred to as Base, contains basic statistical descriptors aimed at capturing data tendency and variability. These are the following classical descriptors: i) average; ii) maximum value; iii) standard deviation; iv) median value. The second set contains the so-called Hjorth parameters, which are: i) Activity; ii) Mobility; iii) Complexity. The third set, hereafter referred to as Shape, contains the Kurtosis and Skewness of the data.
While the Base set describes the sample tendency, the Hjorth parameters can capture the main characteristics of the signal in the frequency domain. In fact, the Hjorth activity represents the power of the signal, the mobility its mean frequency, and the complexity its change in frequency (Hjorth1970).
Kurtosis and Skewness describe, respectively, the tailedness and the asymmetry of the data distribution. In particular, Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution, while Skewness measures how much the data differ from a completely symmetrical distribution (kim2004).
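The three groups of descriptors can be sketched for a single signal window as follows. This is a minimal illustration that assumes the textbook definitions: Hjorth parameters computed from the variances of the signal and of its first and second differences, and Kurtosis and Skewness computed as standardized fourth and third central moments (population form, without bias correction). The function names are hypothetical, and constant windows (zero variance) are not handled.

```python
import math
import statistics


def _central_moment(x, order):
    mu = sum(x) / len(x)
    return sum((v - mu) ** order for v in x) / len(x)


def _diff(x):
    """First difference, a discrete stand-in for the time derivative."""
    return [b - a for a, b in zip(x, x[1:])]


def base_features(x):
    return {"mean": statistics.fmean(x), "max": max(x),
            "std": statistics.pstdev(x), "median": statistics.median(x)}


def hjorth_features(x):
    dx = _diff(x)
    ddx = _diff(dx)
    var_x = _central_moment(x, 2)
    var_dx = _central_moment(dx, 2)
    var_ddx = _central_moment(ddx, 2)
    mobility = math.sqrt(var_dx / var_x)
    # Complexity is the mobility of the derivative over that of the signal.
    complexity = math.sqrt(var_ddx / var_dx) / mobility
    return {"activity": var_x, "mobility": mobility, "complexity": complexity}


def shape_features(x):
    m2 = _central_moment(x, 2)
    return {"kurtosis": _central_moment(x, 4) / m2 ** 2,
            "skewness": _central_moment(x, 3) / m2 ** 1.5}
```

Under these assumptions each window yields 4 + 3 + 2 = 9 values per signal, so the six accelerometer and gyroscope signals would give 54 features per window. A pure sinusoid makes a handy sanity check: its complexity is close to 1, its skewness is 0, and its kurtosis is 1.5.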
The architectures of the CNN and LSTM networks are presented, respectively, in Figure 2.(a) and Figure 2.(b). In the case of deep learning approaches no feature extraction is needed, and the samples of the signals that make up the time window can be directly used as input for the classification tool. In the case of the CNN, which is commonly applied to visual imagery and which has been designed to work with bidimensional data, a preprocessing step has been added to represent time series data by means of visual cues. This possibility has recently attracted widespread attention, so that the literature offers several strategies for re-coding time series into images in order to enable computer vision techniques and perform classification Wang2015; baldini2017; qin2020. In this paper, in particular, we investigate the method proposed by Wang et al. in 2015, which encodes time series as images called Gramian Angular Summation/Difference Fields (GASF/GADF). This method represents time series in a polar coordinate system instead of the typical Cartesian coordinates, with the advantage of preserving spatial and temporal relations Wang2015. Because this method produces two distinct images for each signal, one for the Gramian Angular Summation Field (GASF) and one for the Gramian Angular Difference Field (GADF), in this work we obtain 12 distinct images (six from the accelerometer and six from the gyroscope data). As a consequence, the CNN model takes as input a 12-channel square image, reconstructed from these data, whose height and width depend on the chosen processing window size. The image is then convolved by four subsequent convolutional layers with decreasing size and number of filters. Furthermore, each convolutional layer is followed by a batch normalization layer and by a rectified linear activation function (ReLU). Batch normalization is used to standardize the input before forwarding it to the following layer, and it has the effect of stabilizing the learning process and reducing the number of training epochs required to train the network ioffe2015. The results of the ReLU layer are then processed by a subsequent pooling layer which selects the most activated features (max pooling). At the end of the convolutional structures, three fully connected layers, with a decreasing number of neurons, have been added. The output of the last layer is then processed by a softmax function, which converts the output signals into a probability for each class.
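The encoding of one signal window into its two Gramian angular images can be sketched as follows, assuming the usual formulation of the method: the series is rescaled to [-1, 1], each value x is mapped to the angle φ = arccos(x), and the images are GASF[i][j] = cos(φ_i + φ_j) and GADF[i][j] = sin(φ_i - φ_j). The function name is hypothetical, and any resizing of the images before they are fed to the network is left out.

```python
import math


def gramian_angular_fields(series):
    """Return (GASF, GADF) as square lists of lists for one 1-D series."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled to [-1, 1]")
    # Min-max rescaling to [-1, 1] so that arccos is defined everywhere.
    scaled = [(2.0 * v - hi - lo) / (hi - lo) for v in series]
    # Clipping guards against tiny floating point excursions outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf
```

A window of n samples produces two n-by-n images; applying the function to the three accelerometer and three gyroscope signals and stacking the results gives the 12 channels mentioned above.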
The LSTM network, on the other hand, receives as input six sequences extracted from the original time series, whose length is the size of the processing window. The input is then processed by three subsequent LSTM layers with a decreasing number of hidden units. The output of the last LSTM layer is then forwarded to three fully connected layers, as in the case of the CNN.
4.4 Classification performance metrics
For the proposed classifiers we calculate several performance metrics, together with their standard deviations, during a k-fold cross-validation test with k=5. Dealing with multi-class classifiers entails the evaluation of the following quantities for each of the N classes ($i$ is an index that identifies a specific class): $TP_i$, the number of true positives predicted for class $i$; $TN_i$, the number of true negatives predicted for class $i$; $FP_i$, the number of false positives predicted for class $i$; $FN_i$, the number of false negatives predicted for class $i$.
Subsequently, these indicators have been used to compute the macro-averaged Accuracy, Precision, Recall, and F1-score sokolova2009.
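As a sketch, and assuming the standard macro-averaging definitions, the measures can be written as $Accuracy = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i+TN_i}{TP_i+TN_i+FP_i+FN_i}$, $Precision = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i+FP_i}$, $Recall = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i+FN_i}$, and $F1 = \frac{2\,Precision\,Recall}{Precision+Recall}$. The code below computes them from a confusion matrix; the function name and the convention that rows are true classes and columns are predicted classes are assumptions.

```python
def macro_metrics(conf):
    """Macro-averaged metrics from a square confusion matrix where
    conf[t][p] counts samples of true class t predicted as class p."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    accuracy = precision = recall = 0.0
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                        # class i, predicted elsewhere
        fp = sum(conf[r][i] for r in range(n)) - tp   # other classes, predicted i
        tn = total - tp - fn - fp
        accuracy += (tp + tn) / total
        precision += tp / (tp + fp) if tp + fp else 0.0
        recall += tp / (tp + fn) if tp + fn else 0.0
    accuracy, precision, recall = accuracy / n, precision / n, recall / n
    # F1 is taken as the harmonic mean of macro precision and macro recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

In a 5-fold cross-validation the function would be called once per fold, and the mean and standard deviation of each metric would then be taken over the five results.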
5 Experimental results
In this section we report the results obtained. First of all we show the best classification metrics calculated with the four machine learning tools described in section 4.3. Then, for each classifier we report the sensitivity analysis with respect to the processing window length. The last part of the section reports the feature selection results for SVM and ERS-KNN.
5.1 Classification results
Table 2 reports the best values of the classification metrics obtained with the four proposed models. These values are the highest results obtained for each model when varying the processing window size and, for the standard learning tools, also the number of selected features. Each value is reported as the average calculated during the 5-fold cross-validation test. For each metric, the highest value obtained overall is highlighted in bold. For instance, the SVM classifier obtains the highest Recall, while ERS-KNN shows the highest Precision and F1-score. LSTM, on the other hand, reaches the best accuracy. This suggests that the classification of handwashing/handrubbing activities using signals gathered from a common smartwatch is a feasible task which can be accomplished with either standard or deep learning techniques. Notice that the best results reported here have been obtained with the following processing window sizes: SVM=12s; ERS-KNN=8s; LSTM=2s; CNN=6s. Moreover, in the case of the SVM and ERS-KNN tools, all the proposed features have been used.
Figure 3 reports the average confusion matrices calculated from the results obtained during the 5-fold cross-validation tests. All four models show a great ability to correctly classify the other activity. For instance, the LSTM network reaches the highest value of about 97.2%. The washing activity has also been correctly classified, reaching the highest values of about 95% with standard tools, while deep learning models do not exceed 91%. The rubbing activity, on the other hand, appears to be the most difficult to classify, with the lowest value of about 82% obtained with the CNN model. Here, in fact, the rubbing class has been misclassified as other about 16% of the time. This is probably because hand rubbing is a less dynamic activity than hand washing and therefore produces fewer accelerations and rotations of the wrist. Furthermore, the use of running water for washing could introduce vibrations that are more easily identifiable by the classifiers.
A second set of classification experiments has been carried out with the aim of identifying the person washing or rubbing their hands instead of the performed activity. For this purpose, each sample related to the other activity has been removed from the database, while washing and rubbing samples have been merged into a single class to which a label containing a unique person identifier has been added. Table 3 shows the best values of the classification metrics obtained with the four models. Also in this case, the best results have been obtained with the following processing window sizes: SVM=12s; ERS-KNN=8s; LSTM=2s; CNN=6s. In the case of standard tools, all the proposed features have been used. As expected, recognizing the person who is washing/rubbing the hands is a much easier task, thanks to the fact that we collect data in an unstructured way where each subject is free to wash her/his hands as she/he is used to. Our results, with the highest accuracy obtained by the SVM classifier (Table 3), in fact suggest that the hand washing/rubbing activity can represent a kind of subject fingerprint. Another interesting result from this experiment is that SVM and ERS-KNN seem to outperform the deep learning methods.
5.2 Sensitivity to the window length
The size of the processing window influences the performance of the classification models in several ways. In this section the results of an in-depth analysis of this dependence are reported. In particular, Figure 4 plots the classification metrics obtained by the four classifiers when varying the size of the processing window. Each point represents the average value, together with the standard deviation, calculated over a 5-fold cross-validation test. Both SVM and ERS-KNN (Figure 4.(a) and Figure 4.(b)) show an almost flat trend of the measured accuracy, even if at some point the other metrics (precision, recall, and F1-score) begin to deteriorate as the window size increases. In particular, the SVM classifier improves its performance up to a window of about 12 seconds. Further increasing the size of the window leads to an average decrease of Precision, Recall, and F1-score and to more unstable results (higher standard deviations).
Similarly, the performance of ERS-KNN increases up to a window size of about 8 seconds, beyond which it markedly decreases together with the stability of the results.
An opposite trend is found, however, in the results obtained by the two deep learning classifiers (Figure 4.(c) and Figure 4.(d)). In this case, in fact, the four performance metrics show an almost monotonically decreasing trend for increasing values of the window size. Moreover, for the CNN classifier we found a local maximum at a window size of about 6 seconds.
5.3 Feature selection results
In order to evaluate the relative influence of the proposed features on the classification performance we use the forward feature selection method liu2005. Forward feature selection is based on an objective function (e.g. the accuracy) which is used as a criterion to evaluate the impact of adding a feature from a candidate subset, starting from an empty set and continuing until adding other features does not induce any improvement in the objective function. We applied this strategy to highlight how the proposed features contribute to the overall performance of the two standard classifiers. In particular, each group of features, namely Base, Hjorth, and Shape, has been treated as an atomic unit which can be added or removed as a whole. First of all, we tested each classifier using only one of the three groups, and then we added the other groups to explore all possible combinations.
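The greedy procedure applied to whole groups of features can be sketched as follows. The sketch is not the evaluation code of this work: `score` is a hypothetical callable that would train and cross-validate a classifier on the given groups and return the objective (e.g. the accuracy), and the function name is an assumption. Note that the greedy search stops at the first step without improvement, whereas the experiments reported here also explore all combinations of the three groups.

```python
def forward_group_selection(groups, score):
    """Greedy forward selection over atomic feature groups.

    groups: list of group names, e.g. ["Base", "Hjorth", "Shape"]
    score:  callable taking a list of group names and returning the
            objective to maximize (hypothetical; supplied by the caller)
    Returns the selected groups in order of inclusion and the best score.
    """
    selected = []
    remaining = list(groups)
    best = float("-inf")
    while remaining:
        # Try adding each remaining group to the current selection.
        candidates = [(score(selected + [g]), g) for g in remaining]
        top_score, top_group = max(candidates, key=lambda t: t[0])
        if top_score <= best:
            break  # no candidate improves the objective: stop
        selected.append(top_group)
        remaining.remove(top_group)
        best = top_score
    return selected, best
```

With three groups the search evaluates at most 3 + 2 + 1 = 6 subsets, against the 7 non-empty subsets of an exhaustive exploration, so the saving only becomes relevant when the number of groups grows.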
Table 4 shows the activity classification performance, together with its standard deviation, of the SVM when varying the adopted features. For each performance metric the maximum value achieved has been highlighted in bold. All metrics showed a monotonically increasing trend when consecutively adding one of the three groups of features, reaching the highest performance when all the proposed features are used together (Base+Hjorth+Shape). This suggests that all features provide original information content useful for the classification process. Furthermore, the Hjorth group seems to be the most informative group of features, producing the highest classification performance among the groups when tested alone.
The same experiment conducted with the ERS-KNN classifier produces comparable results, reported in Table 5, with the only difference that, in this case, the measured performance is slightly higher.
Notice that, for both classifiers, these results suggest that a good trade-off between classification performance and real-time computational complexity is to calculate only Base+Hjorth, giving up only about 0.2% of classification performance. Moreover, if an accuracy of about 92% is acceptable for a particular real-time application, it would even be sufficient to calculate only the Hjorth features, saving considerable computational resources and power.
Hand hygiene is extremely important in breaking the chain of pathogen transmission by contact. In fact, contaminated hands are a privileged route to the mucous membranes of the mouth, nose, or eyes. With regard to COVID-19 as well, it is estimated that a non-negligible share of infections occurs through contact between our hands and contaminated surfaces.
In this work we proposed and evaluated four machine learning classification models to distinguish unstructured handwashing/handrubbing gestures from the rest of the daily activities, starting from commonly used wearable devices. The proposed models lay the foundations for the creation of a system which provides users with automatic and continuous indirect monitoring of hand hygiene, in an attempt to reduce the contact transmission of pathogens including the SARS-CoV-2 coronavirus.
The experimental results, calculated over more than 40 hours of recorded daily activities, show that both standard and deep learning techniques can be considered a viable solution to the classification problem, reaching an average accuracy of about 94% and 95%, respectively.