I Introduction
In general, hand-gesture recognition systems can be classified into two categories, according to the type of sensors they employ, namely, the camera-based and the wearable-based ones. Camera-based systems can achieve high recognition accuracy, but at a relatively high computational cost [1]. Their performance is sensitive to the background, light conditions, and room geometry, and constrained by the camera's field of view. On the other hand, sensor-based systems, worn on the wrist, leg, arm, chest, ankle, head and/or waist, employ accelerometers, gyroscopes, magnetometers, barometers, body sensor networks [2], electromyography sensors [3], or even sound sensors [4]. They have relatively small cost, are not particularly sensitive to environmental conditions (e.g., light or room geometry), and can function in both indoor and outdoor spaces. Although wearable sensors are, in general, energy-constrained, rapid advances in micro-electromechanical technologies have reduced their size and enhanced their energy efficiency, so they have gained a lot of attention in hand-gesture recognition (HGR), human activity recognition (HAR) [5, 6, 7, 8, 9], and even human writing recognition [10]. These systems collect data from the on-board sensors and apply either machine learning algorithms [11, 12, 13, 14, 15, 16], mathematical models [4, 17, 18], fuzzy control techniques [19], or simple threshold-based algorithms [20] to recognize the user's gestures. Inertial measurement units are extremely useful and commonly used for orientation/heading estimation [21, 22]. The prevalence of accelerometers, gyroscopes, and magnetometers in smartphones for estimating their orientation has enabled various interesting applications in the gesture recognition domain.

This paper introduces and evaluates GestureKeeper, an innovative, robust hand-gesture identification and recognition system based on a wearable inertial measurement unit (IMU). In the context of daily activities, the user can control an appliance by performing a specific gesture. Identifying the starting and ending points of the time windows in which gestures occur, without relying on an explicit user action, as in [11], or a special gesture marker, as in [23], is challenging. To address this problem, GestureKeeper identifies the start of a gesture by exploiting the underlying dynamics of the associated time series using recurrence quantification analysis (RQA). RQA is a powerful method for nonlinear time-series analysis, which enables the detection of critical transitions in the system's dynamics (e.g., deterministic, stochastic). Most importantly, it makes no assumption about the underlying distribution or model that governs the data. Moreover, it can be used even for relatively small and non-stationary datasets. More specifically, our method capitalizes on the efficiency of RQA to extract the underlying dynamics of a recorded sensor data stream by mapping the associated time series to a higher-dimensional phase space of trajectories. A major advantage of RQA is its fully self-tuned nature, in the sense that no manual parameter fine-tuning is required.
To the best of our knowledge, GestureKeeper is the first automatic hand-gesture identification system based only on accelerometers. Furthermore, it can accurately recognize a dictionary of 12 hand-gestures by applying support vector machines (SVM) to a hybrid set of statistical and sample-based features. The evaluation was performed in a small-scale pilot study at FORTH. This paper demonstrates that GestureKeeper can recognize gestures from our 12-hand-gesture dictionary with a mean accuracy of about 96%. The analysis also reveals the predictive power of the features and the system's robustness in the presence of additive noise. We performed a sensitivity analysis to examine the impact of various parameters. Finally, to comparatively assess the performance of SVM, we also applied random forests; SVM still achieves a higher accuracy. The rest of the paper is organized as follows: Section II overviews our system, its architecture, and the two main subsystems, while Section III summarizes our main conclusions and future research directions.
II GestureKeeper System Design
GestureKeeper consists of a wearable sensor and a server. The wearable sensor periodically sends its collected measurements to the server, which performs the gesture identification and recognition. Our dictionary consists of 12 gestures; their names and short descriptions are listed below.

Up: Vertical movement towards the ceiling

Down: Vertical movement towards the ground

Left: Horizontal movement to the left

Right: Horizontal movement to the right

CW: Clockwise rotational movement

CCW: Counterclockwise rotational movement

Z: Z trajectory starting from above

AZ: Mirror Z trajectory starting from below

S: Wave trajectory to the right

AS: Wave trajectory to the left

Push: Horizontal movement away from the body

Pull: Horizontal movement towards the body
Experiments and Data Collection. We employed the Shimmer3 [24], a wearable device equipped with a 3-axis accelerometer, gyroscope, and magnetometer. We first calibrated the sensors and then configured each of them according to the needs of our implementation. For example, due to the relatively low expected acceleration, we chose a relatively small detection range for the accelerometer, thereby increasing its resolution. The sampling frequency was set to 50 Hz, which is sufficient given our observations from a number of fast, steep movement experiments. A further increase of the frequency would only increase the power consumption of the device and the measurement size without enhancing the amount of information about the user's movement. Finally, the data streaming and logging applications collect the data during the experiments.
For the recognition subsystem, we performed a small field study with 15 subjects (9 female, 6 male). Each subject repeated a number of gestures from the predefined dictionary, yielding in total 900 repetitions. For each repetition, 63 statistical features and 30 acceleration-sample features, which represent values of acceleration in the x, y, and z axes, were extracted. This dataset contained the isolated periods during which a gesture was performed. For the identification subsystem, a new dataset was produced, containing gestures as well as activities of daily living (ADL). The new dataset has a total duration of 3 hours and 45 minutes, with measurements collected from 4 new subjects (2 female, 2 male).
II-A Gesture Identification
We first focus on the problem of identifying the start of a gesture in a recorded data stream. We speculate that different gestures are characterized by distinct dynamics of the associated time series. This motivated the use of RQA, which enables the detection of transitions in the dynamical behavior (e.g., deterministic, chaotic) of the observed system. A major advantage of RQA is its fully self-tuned nature, in the sense that no manual parameter fine-tuning is required. More specifically, a recurrence plot (RP) is derived first, which depicts the times at which a state of a dynamical system recurs, thus revealing all the times when the phase space trajectory of the system visits roughly the same area in the phase space. To this end, RPs enable the investigation of an $m$-dimensional phase space trajectory through a two-dimensional representation of its recurrences. The recurrence of a state occurring at time $i$ at a different time $j$ is represented within a two-dimensional square matrix with ones (recurrence) and zeros (non-recurrence), where both axes are time axes. Given a time series $\{u_i\}$ of length $N$, a phase space trajectory can be reconstructed via time-delay embedding,
$\vec{x}_i = \left(u_i,\, u_{i+\tau},\, \dots,\, u_{i+(m-1)\tau}\right), \quad i = 1, \dots, N'$,   (1)
where $m$ is the embedding dimension, $\tau$ is the delay, and $N' = N - (m-1)\tau$ is the number of states. Having constructed a phase space representation, an RP is defined as follows,
$R_{i,j} = \Theta\left(\varepsilon - \lVert \vec{x}_i - \vec{x}_j \rVert\right), \quad i, j = 1, \dots, N'$,   (2)
where $\vec{x}_i$, $\vec{x}_j$ are the states, $\varepsilon$ is a threshold, $\lVert \cdot \rVert$ denotes a general norm, and $\Theta(\cdot)$ is the Heaviside step function, whose discrete form is defined by
$\Theta(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases}$   (3)
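As an illustration of Eqs. (1)-(3), the following sketch reconstructs the phase space trajectory of a scalar series and builds the corresponding binary recurrence matrix. The signal and parameter values are illustrative assumptions, not taken from the paper's data.

```python
import numpy as np

def embed(u, m, tau):
    """Time-delay embedding (Eq. 1): map a scalar series to m-dimensional states."""
    n_states = len(u) - (m - 1) * tau
    return np.column_stack([u[i * tau : i * tau + n_states] for i in range(m)])

def recurrence_plot(u, m, tau, eps):
    """Binary recurrence matrix R_ij = Theta(eps - ||x_i - x_j||) (Eqs. 2-3)."""
    x = embed(u, m, tau)
    # Pairwise Euclidean distances between all reconstructed states.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return (d <= eps).astype(int)

# A periodic signal produces diagonal line structures in its RP.
t = np.linspace(0, 8 * np.pi, 400)
R = recurrence_plot(np.sin(t), m=4, tau=1, eps=0.1)
print(R.shape)  # (397, 397): N - (m-1)*tau states on each axis
```

Note that the RP is symmetric and its main diagonal (the line of identity) is always filled, since every state trivially recurs with itself.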
Typically, several linear (and/or curvilinear) structures appear in RPs, which provide hints about the time evolution of the high-dimensional phase space trajectories. Moreover, a major advantage of RPs is that they can also be applied to rather short and even non-stationary data. The visual interpretation of RPs, which is often difficult and subjective, is enhanced by means of several numerical measures that quantify the structure and complexity of RPs [25]. These quantification measures provide a global picture of the underlying dynamical behavior during the entire period covered by the recorded data. By employing a windowed version of RQA on each recorded sensor stream, we can track the temporal evolution of the RQA measures and thus detect transient dynamics. In this approach, the quantification measures are computed in small windows, which are then merged to form our feature matrix. Note that the length of the sliding window is a compromise between resolving small-scale local fluctuations and detecting more global recurrence structures.
The gesture identification of GestureKeeper employs two RQA metrics, namely, the recurrence rate (RR) and the transitivity (TRA) (see [26] for the definitions), obtained from the y-axis acceleration data, in order to form the feature matrix. (We initially employed an extended set of RQA measures but observed that these two are sufficient to identify the gesture time-windows.)
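A minimal sketch of this windowed feature extraction is shown below, assuming the window size (125), step (25), and embedding parameters of Table I, and computing transitivity as the clustering coefficient of the recurrence network, following the definitions in [26].

```python
import numpy as np

def rqa_features(u, m=4, tau=1, eps=0.1, win=125, step=25):
    """Sliding-window RR and TRA features from a 1-D acceleration stream.
    Parameter defaults follow Table I; TRA is the recurrence-network
    clustering coefficient (see [26])."""
    feats = []
    for s in range(0, len(u) - win + 1, step):
        w = u[s:s + win]
        n = len(w) - (m - 1) * tau
        # Time-delay embedding and thresholded distance matrix (the RP).
        x = np.column_stack([w[i * tau:i * tau + n] for i in range(m)])
        A = (np.linalg.norm(x[:, None] - x[None, :], axis=-1) <= eps).astype(int)
        rr = A.mean()                        # recurrence rate
        A = A - np.eye(n, dtype=int)         # drop the line of identity
        k = A.sum(axis=1)
        tri, trp = np.trace(A @ A @ A), (k * (k - 1)).sum()
        tra = tri / trp if trp else 0.0      # transitivity
        feats.append((rr, tra))
    return np.array(feats)

rng = np.random.default_rng(0)
F = rqa_features(rng.standard_normal(500))
print(F.shape)  # (16, 2): one (RR, TRA) pair per window
```

Each row of the returned matrix corresponds to one sliding window, matching the feature-matrix construction described above.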
Estimation of Embedding Parameters. In our implementation, the optimal time delay $\tau$ is estimated as the first minimum of the average mutual information (AMI) function [27]. Concerning the embedding dimension $m$, a minimal sufficient value is estimated using the method of false nearest neighbours (FNN) [28]. Furthermore, the Euclidean norm is used as the distance metric for the construction of the RP, while a rule of thumb is currently used to set the threshold $\varepsilon$. The window length refers to the size of the signal on which RQA is performed each time before being shifted by a step size to the next values. We selected a window size of 125 samples, which represents approximately 2.5 seconds of information, sufficient to capture even the longest of the dictionary's gestures. The step was set to 25 samples (i.e., 80% overlap between consecutive windows). By applying the above criteria to our data, the estimated embedding parameters are $\tau = 1$ and $m = 4$ (Table I). Although the corresponding empirical rule suggested a slightly different value, a higher accuracy was observed for the selected setting.
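A sketch of the AMI criterion for the delay is given below (a histogram-based mutual information estimate; the bin count and maximum lag are illustrative assumptions). An analogous scan over increasing $m$ with the FNN criterion would select the embedding dimension.

```python
import numpy as np

def average_mutual_info(u, lag, bins=16):
    """Histogram estimate of the mutual information between u_t and u_{t+lag}."""
    pxy, _, _ = np.histogram2d(u[:-lag], u[lag:], bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def delay_from_ami(u, max_lag=50):
    """tau = first local minimum of the AMI curve (Fraser & Swinney [27])."""
    ami = [average_mutual_info(u, lag) for lag in range(1, max_lag + 1)]
    for i in range(1, len(ami) - 1):
        if ami[i] < ami[i - 1] and ami[i] <= ami[i + 1]:
            return i + 1  # lags start at 1
    return max_lag        # no interior minimum found within max_lag

u = np.sin(np.linspace(0, 8 * np.pi, 400))
print(delay_from_ami(u))
```

For a strongly periodic signal like the one above, the first AMI minimum falls well below the signal period, which keeps the reconstructed coordinates close to independent.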
An SVM classifier based on these two features was then employed to classify the data into two classes, namely, gestures and ADL, thus distinguishing gestures from the rest of the hand movements.
II-B Gesture Recognition
The gesture recognition is based on an SVM classifier using the radial kernel, which our sensitivity analysis showed to achieve the best accuracy among the polynomial, linear, and sigmoid kernels. For the classification, we employed two types of features, namely, statistical features and samples of the acceleration signal. The statistical features include the mean, median, root mean square (RMS), standard deviation, variance, skewness, and kurtosis of the 3D acceleration, angular velocity, and magnetism time series provided by the sensor. The sample-based features are formed by resampling the x-, y-, and z-axis acceleration. (The original acceleration signal consists of 3 time series, one for each axis, of unknown length, since it depends on the particular gesture and the user who performed the movement.) After resampling, the new signal is composed of a fixed number of samples for each acceleration time series. The final set of features includes the statistical ones (introduced earlier) along with this fixed number of samples of the resampled acceleration signal for each of the x-, y-, and z-axis time series.

II-C Performance Analysis
Gesture Identification: GestureKeeper first utilizes RQA to extract features for the identification subsystem, and then uses an SVM model, trained with these features, to identify the windows that contain gestures. Note that the ADL and gesture classes are highly unbalanced: only 0.5% of the data belongs to the class "gestures" and the remaining 99.5% is ADL. To analyze the performance of the identification process, we trained the SVM classifier (with a polynomial kernel of degree = 3, coefficient = 2, cost = 3, and gamma = 0.95; we also examined classifiers with other kernels, such as linear, sigmoid, and radial, and other parameter values, and the reported results were obtained with the above ones) using all subjects except one, which was then used for testing. Given the class imbalance, we randomly selected a subset of the ADL class of equal size to the gesture one and performed the training and testing on this dataset. This process was repeated for 100 iterations, and we report the mean accuracy for each test subject. The accuracy per subject varies from 76.9% to 91.1% (with a mean accuracy of 87.21%).
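The balanced-subsampling protocol can be sketched as follows, using scikit-learn's SVC with the Table I parameters. The dataset names and the simplified in-sample scoring are illustrative assumptions; the paper scores a held-out subject instead.

```python
import numpy as np
from sklearn.svm import SVC

def balanced_identification_accuracy(X_gest, X_adl, iters=100, seed=0):
    """Subsample the majority ADL class to the gesture-class size, train the
    polynomial-kernel SVM of Table I, and average accuracy over the draws."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(iters):
        idx = rng.choice(len(X_adl), size=len(X_gest), replace=False)
        X = np.vstack([X_gest, X_adl[idx]])
        y = np.r_[np.ones(len(X_gest)), np.zeros(len(X_gest))]
        clf = SVC(kernel="poly", degree=3, coef0=2, C=3, gamma=0.95)
        clf.fit(X, y)
        accs.append(clf.score(X, y))  # in-sample; the paper scores a held-out subject
    return float(np.mean(accs))
```

Each draw sees a 50/50 class mix, so the reported accuracy is not inflated by the 99.5% ADL majority.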
TABLE I: Parameters of the RQA and of the SVM identification and recognition models.

RQA                          SVM Identification      SVM Recognition
Parameter         Value      Parameter     Value     Parameter    Value
Distance metric   Eucl. norm Kernel        polyn.    Kernel       radial
Window size       125        Gamma (γ)     0.95      Gamma (γ)    0.005
Window step       25         Cost (c)      3         Cost (c)     1
Delay (τ)         1          Degree        3
Dimension (m)     4          Coefficient   2
Threshold (ε)     0.1
Gesture Recognition: As mentioned in Sec. II-B, an SVM model with different parameters is employed for the final classification of the gestures into one of the 12 classes of our dictionary. To assess the predictive power of the statistical features, we performed the following process: First, we permuted one feature, keeping the values of the remaining features of the dataset fixed. Then, the model was trained and tested with this dataset. The mean accuracy is based on 100 repetitions. We then repeated the same process, each time permuting a different feature of the original dataset. The multi-class classification model employs the "one-against-one" approach, in which pairwise binary classifiers are trained and the appropriate class is selected by a voting scheme. The mean accuracy for the different permuted features is shown in Fig. 2. The horizontal line indicates the mean accuracy for the original dataset without any feature permutation. The larger the decrease in accuracy, the more significant the information that the corresponding feature provides for gesture recognition. It appears that the mean and skewness of the z-axis angular velocity are the most significant ones.
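A sketch of this permutation test is shown below. It permutes a feature of an evaluation copy rather than retraining from scratch, a common lighter-weight variant of the procedure described above; model and data names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def permutation_accuracy_drop(model, X, y, n_rep=100, seed=0):
    """Accuracy drop when each feature column is shuffled in turn; larger
    drops indicate more predictive features. `model` must expose .score()."""
    rng = np.random.default_rng(seed)
    base = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        accs = []
        for _ in range(n_rep):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # destroy only feature j
            accs.append(model.score(Xp, y))
        drops[j] = base - np.mean(accs)
    return drops

# Demo: only feature 0 is informative, so only its permutation hurts accuracy.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = (X[:, 0] > 0).astype(int)
drops = permutation_accuracy_drop(SVC(kernel="rbf").fit(X, y), X, y, n_rep=10)
```

A large positive drop for a column plays the role of the accuracy decrease plotted in Fig. 2.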
A similar procedure was applied to the acceleration-sample-based features. We examined a different number of samples per time series (in the range of 4-12). In the case of ten samples per time series, the fourth, fifth, and sixth values of the resampled x-axis acceleration have a significant impact on accuracy, while the remaining features exhibit a similar predictive power for the gesture recognition (as shown in Fig. 3). However, unlike the statistical features, all the acceleration-sample-based ones contribute substantially to the accuracy (as the accuracy using the original dataset is greater than the one using the permuted ones). Note that the 10 samples that refer to the x-axis acceleration carry more information than the others; specifically, the most important feature tends to be the middle value of the x-axis signal (sample 5). Fig. 4 shows the accuracy of the gesture classification, when only the sample features are used, as a function of the number of samples. The impact on the accuracy is prominent for 5 samples or fewer.
Our gesture recognition classifier employs 73 features in total, namely, the 43 most significant statistical features and 10 sample-based features from each of the three acceleration time series. The above analysis was performed using the default SVM hyperparameters, namely, the default cost and a gamma determined by the number of features.
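The hybrid feature vector described above (63 statistics plus 30 resampled acceleration values, before the selection step that keeps the 43 most significant statistics) can be sketched as follows; the function and argument names are illustrative.

```python
import numpy as np

def axis_stats(c):
    """Seven statistics per axis: mean, median, RMS, std, var, skewness, kurtosis."""
    mu, sd = c.mean(), c.std()
    return [mu, np.median(c), np.sqrt((c ** 2).mean()), sd, c.var(),
            ((c - mu) ** 3).mean() / sd ** 3,       # skewness
            ((c - mu) ** 4).mean() / sd ** 4 - 3]   # excess kurtosis

def resample_axis(c, n=10):
    """Reduce a variable-length axis signal to n evenly spaced samples."""
    return np.interp(np.linspace(0, len(c) - 1, n), np.arange(len(c)), c)

def gesture_feature_vector(acc, gyro, mag, n=10):
    """63 statistical features (7 stats x 3 axes x 3 sensors) followed by
    3*n resampled acceleration samples."""
    stats = [s for sig in (acc, gyro, mag) for ax in sig.T for s in axis_stats(ax)]
    samples = [v for ax in acc.T for v in resample_axis(ax, n)]
    return np.array(stats + samples)

# A gesture of arbitrary length maps to a fixed 93-dimensional vector.
rng = np.random.default_rng(0)
v = gesture_feature_vector(rng.standard_normal((80, 3)),
                           rng.standard_normal((80, 3)),
                           rng.standard_normal((80, 3)))
print(v.shape)  # (93,)
```

The resampling step is what makes gestures of different durations comparable to the classifier.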
Experiments and SVM Tuning. We trained the gesture recognition classifier using the data collected from 14 out of the 15 subjects and tested on the remaining one. We performed tests for all 15 combinations of training and testing partitions.
The hyperparameters that we tuned are the cost (c) and gamma (γ) values. The cost represents the weight for penalizing the "soft margin". Consequently, a large cost value penalizes the SVM for data points within the margin or on the wrong side of the dividing hyperplane. For this reason, for a large c, the SVM will try to find more complex hyperplanes, possibly with smaller margins that leave fewer data points on the wrong side, but it is also more prone to overfitting. In contrast, for low cost values, the margins of the SVM will be larger and the hyperplane less complex, making it more robust but also less accurate. This motivates the need to carefully address the trade-off between accuracy and robustness. The gamma parameter refers to the kernel and depends on the number of features of each specific implementation. Intuitively, gamma "controls" the number of the SVM's support vectors and, by extension, the sensitivity of the decision boundary (hyperplane), which affects whether or not some data points near the margins will be ignored. A sensitivity analysis with different cost and gamma values reports the best performance for $c = 1$ and $\gamma = 0.005$ (Table I). Fig. 5 shows the accuracy with each subject's data as the testing set for the following feature sets: all statistical features, only the significant statistical ones, only the samples, and both the significant statistical features and the samples (proposed set), with a mean accuracy of 94.44%, 94.44%, 92.88%, and 96.22%, respectively. The best performance is obtained when all the significant statistical features and the sample-based ones are employed. For reproducibility, Table I summarizes the parameters of the RQA and of the SVM models of the two subsystems.
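The cost/gamma sensitivity analysis can be reproduced with a standard grid search, e.g. via scikit-learn; the grid and the synthetic stand-in data below are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the gesture feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))
y = (X[:, 0] + 0.3 * rng.standard_normal(150) > 0).astype(int)

param_grid = {"C": [0.1, 1, 3, 10], "gamma": [0.001, 0.005, 0.05, 0.5]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # best (C, gamma) pair under 5-fold cross-validation
```

Cross-validated selection like this guards against picking a (c, γ) pair that merely overfits one train/test split.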
To comparatively evaluate the recognition subsystem, we also developed classifiers based on random forests and evaluated them using the same (first) dataset that was used for the SVM-based recognition subsystem. After training and tuning (the sensitivity analysis for the random forests reported the best accuracy for 100 trees and a maximum tree depth of 10; the minimum number of samples required to split an internal node and the minimum number required at a leaf node were left at their default values of 2 and 1, respectively, as they did not seem to affect the accuracy), random forests reported a best accuracy of 88%, as opposed to 96% for the proposed method using SVM.
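The random-forest baseline with the configuration reported above can be set up as follows (scikit-learn; the synthetic data is only a stand-in for the gesture features).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # toy 2-class problem

# 100 trees, max depth 10; split/leaf minimums left at their defaults (2 and 1).
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=0)
rf.fit(X[:200], y[:200])
print(round(rf.score(X[200:], y[200:]), 2))
```

Capping the tree depth is the main regularizer here; the split/leaf minimums matter little once the depth is bounded, consistent with the observation above.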
Finally, we assessed the impact of noise on the accuracy of the gesture recognition subsystem. The mean accuracy increased to 97.89% when each feature vector of the original (clean) training dataset was augmented with a copy corrupted by additive Gaussian noise with a standard deviation of 0.5.
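The augmentation step amounts to appending a noise-corrupted copy of each training vector (σ = 0.5, as in the text); a minimal sketch:

```python
import numpy as np

def augment_with_gaussian_noise(X, y, sigma=0.5, seed=0):
    """Double the training set by appending a copy corrupted with
    zero-mean Gaussian noise of standard deviation `sigma`."""
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(0.0, sigma, size=X.shape)
    return np.vstack([X, X_noisy]), np.concatenate([y, y])

X = np.zeros((5, 3))
y = np.arange(5)
Xa, ya = augment_with_gaussian_noise(X, y)
print(Xa.shape, ya.shape)  # (10, 3) (10,)
```

The labels are simply duplicated, so the classifier sees each gesture both clean and perturbed.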
III Conclusion
This paper presents GestureKeeper, which employs the accelerometer, gyroscope, and magnetometer of a wearable IMU to first identify the time-windows that contain a gesture and then recognize which specific gesture it is. GestureKeeper uses features based on statistical properties and acceleration samples. It can accurately recognize gestures from our 12-hand-gesture dictionary, exhibiting its best performance when the combination of features is used (about 96% mean accuracy). With the noise addition and feature selection, the mean accuracy increases to over 97%. The system is modular and can be extended to a larger gesture dictionary. The pilot field study was performed in a relatively controlled, small-scale environment; we plan to extend the evaluation under more realistic conditions. Moreover, it is critical to correctly center the gesture within each identified time-window, as we observed a significant accuracy drop otherwise. We will explore the use of long short-term memory (LSTM) networks and conditional random fields in the gesture recognition to address these challenges.
References
 [1] F. Dominio, M. Donadeo, G. Marin, P. Zanuttigh, and G. M. Cortelazzo, ”Hand gesture recognition with depth data,” ACM/IEEE Intl. Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream, 2013.
 [2] V. Jaijongrak, S. Chantasuban, and S. Thiemjarus, "Towards a BSN-based gesture interface for intelligent home applications," ICCAS-SICE, 2009.
 [3] V. E. Kosmidou and L. J. Hadjileontiadis, "Intrinsic mode entropy: An enhanced classification means for automated Greek Sign Language gesture recognition," Annual Intl. Conference of the IEEE Engineering in Medicine & Biology Society, 2008.
 [4] H. Basanta, Y. Huang and T. Lee, ”Assistive design for elderly living ambient using voice and gesture recognition system,” IEEE Intl. Conf. on Systems, Man & Cybernetics, 2017.
 [5] O. A. B. Penatti and M. F. S. Santos, "Human activity recognition from mobile inertial sensors using recurrence plots," arXiv preprint, 2017.
 [6] E. Garcia-Ceja, Md. Z. Uddin, and J. Torresen, "Classification of recurrence plots' distance matrices with a convolutional neural network for activity recognition," Procedia Computer Science, vol. 130, 2018.
 [7] A. Bulling, U. Blanke, and B. Schiele, "A tutorial on human activity recognition using body-worn inertial sensors," ACM Comput. Surv., vol. 46, no. 3, Art. 33, 2014.
 [8] S. Savvaki, G. Tsagkatakis, A. Panousopoulou, and P. Tsakalides, "Matrix and tensor completion on a human activity recognition framework," IEEE J. Biomedical and Health Informatics, vol. 21, no. 6, 2017.
 [9] N. Twomey, T. Diethe, X. Fafoutis, A. Elsts, R. McConville, P. A. Flach, and I. Craddock, "A comprehensive study of activity recognition using accelerometers," Informatics, vol. 5, 2018.
 [10] S. Agrawal, I. Constandache, S. Gaonkar, R. R. Choudhury, K. Caves, and F. DeRuyter, ”Using mobile phones to write in air,” ACM Mobisys, 2011.
 [11] J. Ducloux, P. Colla, P. Petrashin, W. Lancioni, and L. Toledo, "Accelerometer-based hand gesture recognition system for interaction in digital TV," IEEE Intl. Instrumentation & Measurement Technology Conference, 2014.
 [12] Y. Ma et al., ”Hand gesture recognition with convolutional neural networks for the multimodal UAV control,” Workshop on Research, Education and Development of Unmanned Aerial Systems, 2017.
 [13] F. Alemuda and F. J. Lin, "Gesture-based control in a smart home environment," IEEE Intl. iThings & IEEE GreenCom & IEEE CPSCom & IEEE SmartData, 2017.
 [14] C. Zhu, Q. Cheng, and W. Sheng, "Human intention recognition in smart assisted living systems using a hierarchical hidden Markov model," IEEE Intl. Conf. on Automation Science and Engineering, 2008.
 [15] T. Tai, Y. Jhang, Z. Liao, K. Teng, and W. Hwang, "Sensor-based continuous hand gesture recognition by long short-term memory," IEEE Sen. Lett., vol. 2, no. 3, 2018.
 [16] L. Xu, W. Yang, Y. Cao, and Q. Li, ”Human activity recognition based on random forests,” 13th Intl. Conf. on Natural Computation, Fuzzy Systems & Knowledge Discovery, 2017.
 [17] D. Zhang, X. Wu, and C. Wang, "Fine-grained and real-time gesture recognition by using IMU sensors," IEEE 23rd Intl. Conf. on Parallel and Distributed Systems, 2017.
 [18] T. Do, R. Liu, C. Yuen, and U. Tan, ”Design of an infrastructureless indoor localization device using an IMU sensor,” IEEE Intl. Conf. on Robotics & Biomimetics, 2015.
 [19] H. Chu, S. Huang, and J. Liaw, "An acceleration feature-based gesture recognition system," IEEE Intl. Conf. on Systems, Man & Cybernetics, 2013.
 [20] H. Yun and B. Song, ”Dynamic characteristic analysis of users’ motions for human smartphone interface,” 8th Intl. Conf. on Computing & Networking Tech. (INC, ICCIS and ICMIC), 2012.
 [21] S. O. H. Madgwick, A. J. L. Harrison, and R. Vaidyanathan, ”Estimation of IMU & MARG orientation using a gradient descent algorithm,” IEEE Intl. Conf. on Rehabilitation Robotics, 2011.
 [22] C. Jagadish and B.C. Chang, ”Diversified redundancy in the measurement of Euler angles using accelerometers and magnetometers,” 46th IEEE Conf. on Decision and Control, 2007.
 [23] D. Lee, H. Yoon, and J. Kim, ”Continuous gesture recognition by using gesture spotting,” 16th ICCAS, 2016.
 [24] Shimmer products, shimmer3, (accessed October 29, 2018), http://www.shimmersensing.com/products/shimmer3imusensor.
 [25] J. Zbilut and C. Webber, ”Embeddings and delays as derived from quantification of recurrence plots,” Physics Letters A, vol. 171, no. 3–4, 1992.
 [26] N. Marwan, M. Romano, M. Thiel, and J. Kurths, "Recurrence plots for the analysis of complex systems," Physics Reports, vol. 438, no. 5-6, 2007.
 [27] A. Fraser and H. Swinney, “Independent coordinates for strange attractors from mutual information,” Phys. Rev. A, vol. 33, 1986.
 [28] M. B. Kennel, R. Brown, and H. D. Abarbanel, "Determining embedding dimension for phase-space reconstruction using a geometrical construction," Phys. Rev. A, vol. 45, 1992.