Hand hygiene, i.e., the hand wash process, is essential for preventing infectious diseases in surgery. Even healthcare professionals frequently fail to follow hand hygiene guidelines and hence raise the chance of infection transmission. Fortunately, the World Health Organization (WHO) has published a hand hygiene technique to help medical staff keep all surfaces of their hands clean while working. However, people may find it difficult to remember these steps correctly. Hence, it is important to automatically control the quality of the hand wash process in clinical environments. A computer vision system is one of the most efficient ways to accomplish this. In particular, hand gesture recognition technologies have already been used to assess hand hygiene compliance [1, 25, 17].
In this paper, we consider the task of hand gesture recognition in a hand hygiene system. Our goal is to interpret the gestures of medical staff while they are washing their hands. However, different from other domains, hand gesture recognition for the hand wash process has two main distinguishing aspects. First, hand hygiene is a process that consists of fine-grained actions. This means that a deep learning agent must deal with both significant intra-class differences and subtle inter-class differences during prediction. Regarding the large intra-class differences, the agent must recognize hand gestures that belong to the same category but may present significantly different poses and viewpoints (a viewpoint, or scene, is the apparent distance and angle from which the camera views and records the subject). Regarding the subtle inter-class differences, the agent must cope with gestures that belong to different steps but are very similar apart from minor differences. Hand gesture recognition for fine-grained actions has been addressed by many approaches [20, 41, 46, 10]. However, this aspect has not yet received attention in the hand hygiene domain.
Second, the hand hygiene data are mismatched in distribution between the training phase and the inference phase. Indeed, the behavior of medical staff differs across locations and camera views. Some steps are missing or not performed correctly, which requires a lot of human effort to annotate. As a result, although similar data from other data distributions might be readily available, only a limited amount of data from the target distribution can be collected. Many works on hand gesture recognition have mentioned this data-driven issue [42, 18, 33]. However, no current dataset for the hand hygiene domain is challenging enough to compare various gesture recognition methods.
Considering the two aspects above, our paper makes two contributions. The main contribution is to simulate both preceding problems by introducing a multi-viewpoint fine-grained hand hygiene dataset, named the MFH dataset (Fig. 1). It contains samples in total, which are collected by camera views in different locations. All samples are split into classes in total. The MFH dataset is distinguished from existing datasets in three aspects: the large intra-class difference, the subtle inter-class difference, and the data mismatch in distribution between the training phase and the inference phase. This dataset thus provides a more realistic benchmark. For performance evaluation, besides Accuracy, we recommend using the Macro F1-score for a more comprehensive measurement.
As a secondary contribution, we address the preceding problems by applying a self-supervised learning approach to recognize hand gestures in a hand hygiene system. Intuitively, the method is designed to maximize the mutual information between features extracted from multiple views of a shared context. This method is expected to handle the fine-grained nature of hand hygiene gestures and to reduce the negative effect of distribution mismatch. To our knowledge, no previous approach leverages self-supervised learning for multi-viewpoint hand gesture recognition.
II Literature Review
Hand Gesture Recognition. Hand gesture recognition in a hand hygiene system is not a trivial learning task. There are two groups of approaches to this problem. The first, the sensor-based approach, leverages information from different sensors to perform the recognition [47, 13, 40, 29]. The second, the image-based approach, mainly leverages input images and their correlated data to predict hand gestures [25, 17].
Specifically, in [36, 47, 8, 28, 20, 12, 37], the authors use information from depth sensors as input to predict hand gestures. In [7, 44, 42, 21, 20, 31, 6, 43], skeleton information is used as input. Recently, Leap Motion has also been considered an essential sensor for hand gesture recognition [12, 26]. Although sensors play a crucial role in many situations, most sensors are costly and may not be easy to configure. Different from sensor-based approaches, image-based ones mainly take images as the input. In particular, one line of work extracts HOG and HOF features from images and then applies an SVM for hand gesture classification. In the most recent approach, the authors propose hand hygiene monitoring based on segmentation that separates the hand parts of interacting and self-occluded hands. To our knowledge, almost no works consider the fine-grained characteristic of hand gestures and the data mismatch over different viewpoints when performing recognition in a hand hygiene system. Moreover, the datasets used in these previous works are either private [25, 11] or do not handle the fine-grained problem or the data mismatch problem.
Self-supervised learning (SSL) aims to generate robust representations from unlabeled data according to the structure or characteristics of the data itself. SSL acts as a form of supervision and benefits almost all types of downstream tasks, e.g., classification, recognition, or image retrieval [19, 32, 14, 5, 35, 24, 23, 15, 30, 39]. To deal effectively with the image classification task, one approach implements a hybrid system of self-supervised learning and semi-supervised learning. In another, Pretext-Invariant Representation Learning (PIRL) is applied to the jigsaw-puzzle and rotation pretext tasks to enhance the quality of the learned image representations. Recently, SSL has been made more general by maximizing the mutual information between features extracted from multiple views of a shared context. Inspired by this line of work, we leverage the generalization ability of the AmDim setup [2] to deal with the data mismatch problem in hand gesture recognition from different viewpoints.
III The MFH dataset
III-A Dataset description
The introduced MFH dataset is built on top of images from [17]. These images were collected monthly, and the total deployment duration was months. During dataset collection, a total of cameras were placed in different locations. All cameras are set at pixels resolution, and their frame rate is 30 fps. Seven different hand washing movements are defined, as recommended by the WHO. These movements are as follows: palm to palm (Step 1), palm over dorsum with fingers interlaced (Step 2), palm to palm with fingers interlaced (Step 3), back of fingers to opposing palm (Step 4), rotational rubbing of the thumb (Step 5), fingertips to palm (Step 6), and turning off the faucet with a paper towel (Step 7). Fig. 1 visualizes these movements from different viewpoints. Additionally, it was necessary to identify whether a person is washing hands while wearing a watch or a ring, or has lacquered nails. The reason is that these factors interfere with basic handwashing procedures and can be regarded as inappropriate for medical professionals in their work environment.
The dataset consists of annotated video files, each of which corresponds to a single hand-wash episode. The video files are split into frames that are easier to access. For each video file, there is a matching .json file, which contains the annotations of each frame in JSON format. Overlapping exists among all viewpoints (scenes); the dataset contains up to samples (frames).
Table I presents the detailed statistics of the MFH dataset. Specifically, each row gives the number of samples of one specific scene. There are scenes in total. The columns indicate the steps of the hand wash process, and the rows index the different scenes. In addition, the final column gives the total number of samples of each scene. The statistics show a bias in the number of samples over classes in each viewpoint, i.e., the data are imbalanced. Moreover, the distribution of this bias over classes differs markedly between viewpoints. Our dataset is expected to provide an opportunity to compare and evaluate the performance of different hand gesture recognition networks under two challenging aspects: fine-grained hand gestures and data distribution mismatch over different viewpoints. These aspects are consistent with practical usage. Hence, our dataset provides a testbed for methods applied in open systems.
|Step 1||Step 2||Step 3||Step 4||Step 5||Step 6||Step 7|
|Scene \ Class||Step 1||Step 2||Step 3||Step 4||Step 5||Step 6||Step 7||
III-B Evaluation protocols
We typically use Accuracy to evaluate the effectiveness of different models in the hand gesture recognition task. Accuracy is the ratio of the number of correct predictions to the total number of predictions, as defined in (1).

$$\mathrm{Accuracy} = \frac{N_{c}}{N} \tag{1}$$

where $N_{c}$ is the number of samples that are recognized correctly and $N$ is the number of all samples in the test data.
The Accuracy metric is essential in most cases. However, if the benchmarking dataset is not balanced, this metric has little reference value. Since the MFH dataset is imbalanced, we introduce the Macro F1-score. Unlike Accuracy, which gives every sample the same importance, the Macro F1-score gives every class the same importance. A model that performs well only on the common classes while performing poorly on the rare classes will receive a low Macro F1-score.
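The Macro F1-score, also called the macro-averaged F1-score, is defined as the mean of the class-wise (label-wise) F1-scores. Let $TP_i$, $FP_i$, $FN_i$, $P_i$, $R_i$, and $F1_i$ be the true positives, false positives, false negatives, precision, recall, and F1-score with regard to class $i$, where the F1-score is the harmonic mean of precision and recall. The precision $P_i$, the recall $R_i$, and the F1-score $F1_i$ are computed as in (2), (3), and (4), respectively.

$$P_i = \frac{TP_i}{TP_i + FP_i} \tag{2}$$

$$R_i = \frac{TP_i}{TP_i + FN_i} \tag{3}$$

$$F1_i = \frac{2\, P_i\, R_i}{P_i + R_i} \tag{4}$$

In the final step, the Macro F1-score is then calculated using (5).

$$\text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} F1_i \tag{5}$$

where $C$ is the number of classes and $F1_i$ is the F1 value calculated on class $i$.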
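To illustrate why the two metrics can disagree on imbalanced data, the following minimal sketch computes both on a toy example. The labels are invented for illustration and are not taken from the MFH dataset; treating an undefined precision or recall as 0 is an assumption of this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1-scores."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Assumption: an undefined ratio (zero denominator) counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced data: 8 samples of class 1 and 2 samples of class 2.
y_true = [1] * 8 + [2] * 2
# A degenerate model that always predicts the common class.
y_pred = [1] * 10

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred, classes=[1, 2]))
```

In this toy case the Accuracy stays high because the common class dominates, whereas the Macro F1-score is pulled down by the rare class that is never predicted.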
III-C Comparing with existing datasets
A statistical comparison with existing datasets is shown in Table II. Our dataset contains samples, which is double the size of the current largest existing dataset. Different from the dataset that uses Depth Infrared images as inputs, our MFH dataset uses RGB images since they provide good visual information. The highlight of the MFH dataset, and the key difference from other datasets, is that it contains sub-datasets from non-overlapping viewpoints. Under realistic Healthcare Industry settings, our dataset serves as an ideal benchmark for learning methods that focus on generalization capacity and data mismatch, since it offers two different evaluation protocols for testing.
|Type of Input||RGB image||RGB image||Depth Infrared image||RGB image|
|Num. View Split|
IV Self-supervised learning
Self-supervised learning derives from unsupervised learning and can be applied to any recognition or classification task. It aims to learn semantically meaningful representations from unlabeled data. Generally, some portion of the data is withheld, and the network is tasked with predicting it. One of the most effective approaches is to design a pretext task that maximizes the mutual information between features extracted from multiple views of a shared context. The context here is the input image, and the views are augmentations of this input.
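First, we determine the mutual information (MI), which measures the shared information between two random variables $X$ and $Y$. MI is defined as the Kullback-Leibler (KL) divergence between the joint distribution $p(x, y)$ and the product of the marginals $p(x)$ and $p(y)$.

$$I(X; Y) = D_{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right] \tag{6}$$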
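As a small numerical illustration of this definition, the sketch below evaluates the KL divergence between a discrete joint distribution and the product of its marginals. The two toy distributions are assumptions chosen for illustration; they are unrelated to the image features used in this paper.

```python
import math


def mutual_information(joint):
    """MI in nats for a discrete joint distribution given as a 2-D list."""
    p_x = [sum(row) for row in joint]        # marginal over rows
    p_y = [sum(col) for col in zip(*joint)]  # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # terms with zero probability contribute nothing
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


# Independent variables: the joint equals the product of the marginals.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Fully dependent variables: knowing one determines the other.
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(dependent))
```

The independent case gives zero MI, while the fully dependent binary case gives log 2 nats, which is the intuition behind using MI to tie together two views of the same image.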
Since the underlying distribution cannot easily be accessed directly to estimate MI, we instead maximize a lower bound on MI by minimizing the Noise Contrastive Estimation (NCE) loss based on negative sampling.
Our objective is to maximize the MI between global features and local features extracted from two views of the same hand gesture input image. The features involved are the global feature, the encoder's local feature map, and the encoder's feature map. The NCE loss between a pair of such features is defined in (7); it depends on the negative samples of the input image and on the distance metric function.
The overall loss between the two views is the total of the NCE losses and is written in (8).
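The role of the negative samples in an NCE-style loss can be sketched as follows. This is a generic InfoNCE-style formulation under stated assumptions, not necessarily the exact loss of the paper: the function names are hypothetical, features are plain vectors, and a dot product stands in for the scoring function.

```python
import math


def dot(u, v):
    """Dot-product score between two feature vectors (an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def nce_loss(anchor, positive, negatives, score=dot):
    """Negative log-probability of picking the positive among all candidates.

    anchor:    feature from one view (e.g., a global feature)
    positive:  feature from the other view of the same image
    negatives: features taken from other images
    """
    logits = [score(anchor, positive)]
    logits += [score(anchor, n) for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# A positive aligned with the anchor gives a small loss ...
aligned_loss = nce_loss(anchor, [1.0, 0.0], negatives)
# ... while a positive pointing away from the anchor gives a larger one.
misaligned_loss = nce_loss(anchor, [-1.0, 0.0], negatives)
print(aligned_loss, misaligned_loss)
```

Minimizing such a loss pushes the two views of the same image to score higher than any pairing with a negative sample, which is how the lower bound on MI is tightened in practice.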
It is worth noting that, after training, the self-supervised learning network is leveraged as an encoder to extract features for the subsequent classification task. To achieve hand gesture recognition, we need to train a supervised learning network, i.e., a classifier, on top of the extracted features using the annotated hand gesture data. The structure of the self-supervised learning network is the standard ResNet [16]. For the classifier, a linear layer or a multilayer perceptron is used as the structure. For more details about these structures, please visit.
V-A Implementation details, data setup and baselines
Implementation details. All experiments are conducted on an NVIDIA Titan V GPU with GB RAM. All models are trained using Stochastic Gradient Descent with a momentum of. The initial learning rate is set to , with exponential decay of after every two epochs. The maximum number of epochs is set at.
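A learning rate that decays after every two epochs can be sketched as below. The function name and the numeric values are hypothetical placeholders, since the actual initial learning rate and decay factor are not reproduced here, and reading the schedule as a multiplicative step applied every two epochs is an assumption.

```python
def learning_rate(epoch, initial_lr, decay, step=2):
    """Learning rate at a 0-indexed epoch, multiplied by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)


# Hypothetical values chosen only to show the shape of the schedule.
schedule = [learning_rate(e, initial_lr=0.1, decay=0.5) for e in range(6)]
print(schedule)
```

With these placeholder values the rate stays constant within each pair of epochs and is halved at epochs 2 and 4.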
Data setup. In the MFH dataset, the data from each viewpoint are split into a train set and a test set. Each set contains percent of data, and samples in each are not overlapping. There are two scenarios for the evaluation phase. In the first scenario, the model is trained and tested on data from the same camera, i.e., the “same scenes” scenario. In the second scenario, the model is trained on a specific scene while its effectiveness is evaluated on data collected from other scenes, i.e., the “cross scenes” scenario.
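The two evaluation scenarios amount to enumerating (train scene, test scene) pairs. The sketch below makes this explicit; the scene identifiers and the function name are hypothetical and do not reflect the actual naming in the dataset.

```python
from itertools import product


def evaluation_pairs(scenes):
    """Return the (train_scene, test_scene) pairs for both scenarios."""
    # "Same scenes": train and test on data from the same camera view.
    same = [(s, s) for s in scenes]
    # "Cross scenes": train on one view, test on every other view.
    cross = [(a, b) for a, b in product(scenes, repeat=2) if a != b]
    return same, cross


# Hypothetical scene identifiers used only for illustration.
same_pairs, cross_pairs = evaluation_pairs(["scene_a", "scene_b", "scene_c"])
print(same_pairs)
print(cross_pairs)
```

For n scenes this yields n same-scene pairs and n(n - 1) cross-scene pairs, so the cross-scene protocol dominates the evaluation as the number of viewpoints grows.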
Baselines. Mobilenet [34], ResNet [16], and InceptionNet [38] are leveraged as the baseline networks for our analysis. These models are pretrained on the Imagenet dataset and then fine-tuned on a specific sub-dataset so as to maximize their performance. AmDim [2], a self-supervised representation learning baseline, is expected to work well under the limitations of our dataset. Following the setup of the other baselines, the model is pre-trained on the Imagenet dataset. Finally, we train a classifier on top of the features for the recognition of hand gestures.
V-B Experimental details
Fine-grained action analysis. To assess the effectiveness of deep networks on hand wash actions, we use the well-known InceptionV3, pre-trained on the ImageNet dataset, as the baseline. An analysis is established in the -st scene data of the MFH dataset. In Figure 2, we present the confusion matrix over the tested samples. The results show visible confusion between classes, e.g., the model tends to give out the predicted Step when it meets images from Step . The main reason is that gestures belonging to different steps can be very similar apart from some minor differences.
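A confusion matrix of the kind discussed above can be computed as in the following sketch, which also reports the most frequently confused pair of classes. The labels are toy values invented for illustration and do not reproduce the results in Figure 2.

```python
def confusion_matrix(y_true, y_pred, classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def most_confused_pair(matrix, classes):
    """Return (true_class, predicted_class, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (classes[i], classes[j], count)
    return best


# Toy labels: two of the three class-2 samples are mistaken for class 3.
classes = [1, 2, 3]
y_true = [1, 1, 2, 2, 2, 3]
y_pred = [1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, classes)
print(cm)
print(most_confused_pair(cm, classes))
```

Large off-diagonal cells point to pairs of steps whose gestures look alike, which is the subtle inter-class difference described earlier.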
Data mismatch analysis. To further understand the MFH dataset and its challenges, we provide the recognition results between all scene pairs in Figure 3, where InceptionV3 is used as the baseline. Each value in the figure is the Accuracy (Figure 3-a) or the Macro F1-score (Figure 3-b) obtained when a specific model is tested on the corresponding test set. Note that the rows denote the train data and the columns denote the test data. As the figure shows, hand gesture recognition within the same camera view, i.e., the “same scenes” testing scenario, yields the highest accuracy. On the other hand, as expected, the performance across different camera pairs, i.e., the “cross scenes” testing scenario, varies considerably. In most cases, InceptionV3 achieves low results due to the mismatch in distribution between the train set and the test set, even though the number of classes is small and the network itself is strong. Besides, the Macro F1-score is far lower than the Accuracy in most “cross scenes” experiments. This indicates that the imbalance over classes further aggravates the data mismatch between viewpoints.
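Given a train-scene by test-scene score matrix like the one described above, per-scenario averages can be obtained as in this sketch. It assumes that the same-scene average is the mean of the diagonal and the cross-scene average is the mean of the off-diagonal entries; the numbers are invented and are not results from the paper.

```python
def scenario_averages(score_matrix):
    """Average the diagonal (same scenes) and off-diagonal (cross scenes) scores.

    score_matrix[i][j] is the score of a model trained on scene i
    and tested on scene j.
    """
    n = len(score_matrix)
    same = [score_matrix[i][i] for i in range(n)]
    cross = [score_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(same) / len(same), sum(cross) / len(cross)


# Invented scores for three scenes, rows = train scene, columns = test scene.
toy_scores = [
    [0.9, 0.4, 0.5],
    [0.3, 0.8, 0.6],
    [0.2, 0.4, 0.7],
]
same_avg, cross_avg = scenario_averages(toy_scores)
print(same_avg, cross_avg)
```

A large gap between the two averages is a compact indicator of distribution mismatch between viewpoints.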
Table III reports the results obtained by different typical deep network structures, including Mobilenet [34], ResNet [16], and InceptionNet [38]. None of the preceding networks achieves good results during the inference phase of “cross scenes” in either the Accuracy or the Macro F1-score. These results imply that none of the benchmarked networks copes well with the data mismatch problem.
|Method||Avg. Accuracy||Avg. Macro F1|
|same scenes||cross scenes||same scenes||cross scenes|
Self-supervised learning analysis. AmDim [2] is leveraged as a self-supervised learning baseline to deal with both analyzed problems. Table III compares the performance of AmDim with that of the other learning methods. In the “same scenes” testing scenario, AmDim outperforms the other networks by a large margin. This result indicates that AmDim can learn robust features for recognizing fine-grained hand gestures. In the “cross scenes” testing scenario, AmDim also achieves significant improvements in the Accuracy metric. In particular, when compared with InceptionV3, the most effective baseline, the Avg. Accuracy gap between AmDim and InceptionV3 is . Hence, the self-supervised learning approach can deal with the mismatch in data distribution over different viewpoints. It is worth noting that AmDim also outperforms the other baselines in Macro F1-score, which validates the effectiveness of AmDim on imbalanced data (see Figure 4 for quantitative results of AmDim over viewpoints).
Table IV presents the AmDim performance with and without pretraining the model on the ImageNet dataset. In the “cross scenes” scenario, the results show that both setups give better scores than the baseline InceptionV3 in both metrics, even though InceptionV3 is pre-trained on the ImageNet dataset. This again validates the effectiveness of self-supervised learning. Besides, through empirical experiments, we have found that using a deeper neural network, e.g., a multilayer perceptron (MLP), for classification achieves better results than using a linear one.
|Avg. Accuracy||Avg. Macro F1|
We introduce a multi-viewpoint fine-grained hand hygiene dataset (MFH) that comes closer to realistic settings, especially in the Healthcare Industry. Our new dataset will enable research in multiple directions, e.g., deep learning, fine-grained learning, multi-view learning, and data distribution learning. Besides, self-supervised learning (SSL) is presented to deal with fine-grained hand gestures and data mismatch problems. Extensive experiments show that SSL yields the best performance compared with various competitive baselines in both Accuracy and Macro F1-score.
Special thanks to AIOZ Singapore company and Blood Transfusion Hematology Hospital Vietnam for the valuable support on the cooperation.
-  (2011) A vision-based system for handwashing quality assessment with realtime feedback. In The Eighth IASTED International Conference on Biomedical Engineering, Biomed 2011, Innsbruck, Austria, Cited by: §I.
-  (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §II, §IV, §IV, §V-A, §V-B, TABLE III, TABLE IV.
-  (2015) Deep learning with non-medical training used for chest pathology identification. In Medical Imaging: Computer-Aided Diagnosis,
-  (2021) Self-supervised learning for few-shot image classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1745–1749. Cited by: §IV.
-  (2020) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §II.
-  (2017) Motion feature augmented recurrent neural network for skeleton-based dynamic hand gesture recognition. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 2881–2885. Cited by: §II.
-  (2019) Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. arXiv preprint arXiv:1907.08871. Cited by: §II.
-  (2015) Survey on 3d hand gesture recognition. IEEE transactions on circuits and systems for video technology 26 (9), pp. 1659–1673. Cited by: §II.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §V-A, §V-B.
-  (2020) Towards domain-independent complex and fine-grained gesture recognition with rfid. Proceedings of the ACM on Human-Computer Interaction 4 (ISS), pp. 1–22. Cited by: §I.
-  (2018) Hand hygiene monitoring based on segmentation of interacting hands with convolutional networks. In Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications, Vol. 10579, pp. 1057914. Cited by: §II, §III-C, TABLE II.
-  (2017) Hand gesture recognition with leap motion. arXiv preprint arXiv:1711.04293. Cited by: §II.
-  (2017) Monitoring hand-washing practices using structural vibrations. Structural Health Monitoring. Cited by: §II.
-  (2019) Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6391–6400. Cited by: §II.
-  (2020) Self-supervised co-training for video representation learning. arXiv preprint arXiv:2010.09709. Cited by: §II.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV, §V-A, §V-B, TABLE III.
-  (2020) Automated quality assessment of hand washing using deep learning. arXiv preprint arXiv:2011.11383. Cited by: §I, §II, §II, §III-A, §III-C, TABLE II.
-  (2019) Synthetic video generation for robust hand gesture recognition in augmented reality applications. arXiv preprint arXiv:1911.01320. Cited by: §I.
-  (2019) Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1920–1929. Cited by: §II.
-  (2018) CNN+ rnn depth and skeleton based dynamic hand gesture recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3451–3456. Cited by: §I, §II.
-  (2020) An ensemble of knowledge sharing models for dynamic hand gesture recognition. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §II.
-  (2018) A dataset of clinically generated visual questions and answers about radiology images. Nature.
-  (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4242–4251. Cited by: §II.
-  (2020) MS2L: multi-task self-supervised learning for skeleton based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2490–2498. Cited by: §II.
-  (2011) A vision-based system for automatic hand washing quality assessment. Machine Vision and Applications 22 (2), pp. 219–234. Cited by: §I, §II, §II, TABLE II.
-  3D dynamic hand gestures recognition using the leap motion sensor and convolutional neural networks. In International Conference on Augmented Reality, Virtual Reality and Computer Graphics, pp. 420–439. Cited by: §II.
-  (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §II.
-  (2015) Hand gesture recognition with 3d convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1–7. Cited by: §II.
-  (2020) HAWAD: hand washing detection using wrist wearable inertial sensors. In 2020 16th International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 11–18. Cited by: §II.
-  (2018) Improvements to context based self-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9339–9348. Cited by: §II.
-  (2019) A neural network based on spd manifold learning for skeleton-based hand gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12036–12045. Cited by: §II.
-  (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §II.
-  (2020) FS-hgr: few-shot learning for hand gesture recognition via electromyography. arXiv preprint arXiv:2011.06104. Cited by: §I.
-  (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §V-A, §V-B, TABLE III.
-  (2020) Adversarial self-supervised learning for semi-supervised 3d action recognition. In European Conference on Computer Vision, pp. 35–51. Cited by: §II.
-  (2020) Automatic detection of hand hygiene using computer vision technology. Journal of the American Medical Informatics Association 27 (8), pp. 1316–1320. Cited by: §II.
-  (2012) Hand gesture recognition with depth images: a review. In 2012 IEEE RO-MAN: the 21st IEEE international symposium on robot and human interactive communication, pp. 411–417. Cited by: §II.
-  (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §V-A, §V-B, TABLE III, TABLE IV.
-  (2020) Extending and analyzing self-supervised learning across domains. In European Conference on Computer Vision, pp. 717–734. Cited by: §II.
-  (2020) Accurate measurement of handwash quality using sensor armbands: instrument validation study. JMIR mHealth and uHealth 8 (3), pp. e17001. Cited by: §II.
-  (2016) Interacting with soli: exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pp. 851–860. Cited by: §I.
-  (2020) A prototype-based generalized zero-shot learning framework for hand gesture recognition. arXiv preprint arXiv:2009.13957. Cited by: §I, §II.
-  (2019) LE-hgr: a lightweight and efficient rgb-based online gesture recognition network for embedded ar devices. In 2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 274–279. Cited by: §II.
-  (2019) Make skeleton-based action recognition model smaller, faster and better. In Proceedings of the ACM Multimedia Asia, Cited by: §II.
-  (2019) S4l: self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1476–1485. Cited by: §II.
-  (2018) HandSense: smart multimodal hand gesture recognition based on deep neural networks. Journal of Ambient Intelligence and Humanized Computing, pp. 1–16. Cited by: §I.
-  (2016) WashInDepth: lightweight hand wash monitor using depth sensor. In Proceedings of the 13th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 28–37. Cited by: §II, §II.