
Virtual Reality to Study the Gap Between Offline and Real-Time EMG-based Gesture Recognition

by Ulysse Côté-Allard, et al.

Within sEMG-based gesture recognition, a chasm exists in the literature between offline accuracy and real-time usability of a classifier. This gap mainly stems from the four main dynamic factors in sEMG-based gesture recognition: gesture intensity, limb position, electrode shift and transient changes in the signal. These factors are hard to include within an offline dataset as each of them exponentially increases the number of segments to be recorded. On the other hand, online datasets are biased towards the sEMG-based algorithms providing feedback to the participants, limiting the usability of such datasets as benchmarks. This paper proposes a virtual reality (VR) environment and a real-time experimental protocol from which the four main dynamic factors can more easily be studied. During the online experiment, the gesture recognition feedback is provided through the Leap Motion camera, enabling the proposed dataset to be re-used to compare future sEMG-based algorithms. Twenty able-bodied persons took part in this study, completing three to four sessions over a period spanning between 14 and 21 days. Finally, TADANN, a new transfer learning-based algorithm, is proposed for long-term gesture classification and significantly (p<0.05) outperforms fine-tuning a network.



I Introduction

Muscle activity as a control interface has been extensively applied to a wide range of domains from assistive robotics [1] to serious gaming for rehabilitation [2] and artistic performances [3]. This activity can be recorded non-invasively through surface electromyography (sEMG), a widely adopted technique both in research and clinical settings [1, 4]. Intuitive interfaces can then be created by applying machine learning on the sEMG signal to perform gesture recognition.


Despite decades of research in the field [6], however, an important gap still exists between offline classifiers' performance and real-time applications [5]. This disconnect mainly stems from the four main dynamic factors of sEMG signals [7]: gesture intensity, limb position, electrode shift and the transient nature of the EMG signal. Myoelectric signals are also time-consuming to obtain and must be recorded for each user, as extensive variability exists between subjects [8]. This last factor means that, in practice, sEMG datasets used as benchmarks for offline classification rarely contain even a single one of these dynamic factors. On the other hand, online myoelectric control naturally provides feedback to the participant. In turn, this feedback biases the recorded online dataset towards the algorithm used for control, as the participants learn to adapt their behavior to improve the system's usability [9, 10, 11]. Consequently, obtaining a fair comparison of EMG-based gesture recognition algorithms is problematic, and recording a new online dataset is often needed to test a new algorithm fairly. Recording such a dataset, however, is not only time-consuming but can also require expensive hardware (e.g. prosthetic arm, robotic arms) [12]. A common alternative to using such costly equipment is computer simulation (e.g. Fitts' law test [13]) running on a 2D computer screen. This type of simulation, however, limits the number of degrees of freedom that can be intuitively controlled. In contrast, virtual reality (VR) offers an attractive and affordable environment for sEMG-based real-time 3D control simulations [14, 15, 16].

As such, this work's main contribution is a virtual reality environment from which an online dataset, featuring 20 participants and recorded specifically to contain the four main dynamic factors, is made publicly available. An important innovation of this dataset is that the real-time gesture recognition feedback is provided solely by a Leap Motion camera [17]. In other words, the proposed online dataset is not biased towards a particular sEMG-based gesture recognition algorithm and can thus be re-used as a benchmark to compare new algorithms. The VR environment, in conjunction with the Leap Motion, tracks the participant's limb orientation in 3D, allowing for a more precise understanding of the effect of limb position. The recording protocol, which was "gamified" to better engage the participants, features three to four recording sessions (equally spaced in time) per participant, spanning a period of 14 to 21 days.

This work proposes an analysis of the effect of the four main dynamic factors on a deep learning classifier. The feature learning paradigm offered by deep learning allows the classifier to directly receive the raw sEMG data as input and achieve classification results comparable with the state of the art [11, 18], something considered "impractical" before [4]. This type of input can be viewed as a sequence of one-dimensional images. While ConvNets have been developed to encode spatial information, recurrent neural network-based architectures (RNN) have been particularly successful in classifying sequences of information [19]. Hybrid architectures combining these two types of network are particularly well suited when working with sequences of spatial information [20, 21]. In particular, such hybrids have been successfully applied to sEMG-based gesture recognition [22]. Compared to the hybrid ConvNet-RNN, Temporal Convolutional Networks (TCN) [23, 24] are a purely convolutional approach to the problem of sequence classification; they are parallelizable, less complex to train and have low memory requirements. Within the context of real-time sEMG-based gesture recognition, especially if applied to prosthetic control, these computational advantages are particularly important. Additionally, TCNs have been shown to outperform RNN-based architectures in a variety of domains and benchmarks using sequential data [25]. Consequently, this work proposes leveraging a TCN-based architecture to perform gesture recognition.

Another contribution of this work is a new transfer learning algorithm for long-term recalibration, named TADANN, combining the transfer learning algorithm presented in [26, 11] and the multi-domain learning algorithm presented in [27].

This paper is divided as follows. The VR experimental protocol and environment are first presented in Section II. Sections III and IV then present the deep learning classifiers and the transfer learning method used in this work. Finally, the results and the associated discussion are covered in Sections V and VI respectively. A flowchart of the material, methods and experiments presented in this work is shown in Figure 1.

Fig. 1: Diagram of the workflow of this work. The two types of recording session from the Long-term 3DC Dataset are first preprocessed. Then, a Temporal Convolutional Network is used with different training schemes. The data from both the evaluation and training sessions are used in various comparisons/experiments based on the different learning schemes. In the diagram, the blue rectangles represent experiments, while the arrows show which methods/algorithms are required to perform them.

II Long-term sEMG Dataset

This work provides a new, publicly available, multimodal dataset to study the four main dynamic factors in sEMG-based hand gesture recognition. The dataset, referred to as the Long-term 3DC dataset, features 20 able-bodied participants (5F/15M) aged between 18 and 34 years old (average 26 ± 4 years old) performing the eleven hand/wrist gestures depicted in Figure 2. For each participant, the experiment was recorded in virtual reality over three sessions spanning 14 days (see Section II-C for details). In addition to this minimum requirement, six of them completed a fourth session, so that the experiment spanned 21 days. Note that 22 persons originally took part in this study; however, two of them (both male) had to drop out before completing three sessions due to external circumstances. Consequently, the incomplete data of these two individuals are not included in the results and analysis of this work.

Fig. 2: The eleven hand/wrist gestures recorded in the Long-term 3DC dataset (image re-used from [28])

The data acquisition protocol was approved by the Comités d’Éthique de la Recherche avec des êtres humains de l’Université Laval (approval number: 2017-026 A2-R2/26-06-2019). Informed consent was obtained from all participants.

II-A sEMG Recording Hardware

The electromyographic activity of each participant's forearm was recorded with the 3DC Armband [28], a wireless, 10-channel, dry-electrode, 3D printed sEMG armband. The device, which is shown in Figure 3, samples data at 1000 Hz per channel, making it possible to take advantage of the full spectrum of sEMG signals [29]. In addition to the sEMG acquisition interface, the armband also features a 9-axis Magnetic, Angular Rate, and Gravity (MARG) sensor sampled at 50 Hz. The dataset features the data of both the sEMG and MARG sensors, at 1000 and 50 Hz respectively, for each session of every participant.

Fig. 3: The 3DC Armband used in this work records electromyographic and orientation (9-axis Magnetic, Angular rate and gravity sensor) data. The wireless, dry-electrode armband features 10 channels, each cadenced at 1 kHz.

II-B Stereo-Camera Recording Hardware

During the experiment, in addition to the 3DC Armband, the Leap Motion camera [17] mounted on a VR headset was also used for data recording. The Leap Motion is a consumer-grade sensor using infrared emitters and two infrared cameras [30] to track a subject's forearm, wrist, hand and fingers in 3D. In addition to the software-generated representation of the hand, the Long-term 3DC dataset also contains the raw output of the stereo-camera recorded at 10 Hz.

II-C Experimental Protocol in Virtual Reality

Each recording session is divided into two parts: the Training Session and the Evaluation Session, both of which are conducted in VR. Figure 4 helps visualize the general interface of the software, while an accompanying video shows the experiment in action. Note that, for every training session, two evaluation sessions were also performed. All three sessions were recorded within a timespan of an hour.

Fig. 4: The VR environment during the evaluation session. The scenery (trees, horizon) helps orient the participants. The requested gesture is written on the participant's head-up display and shown as an animation (the blue hand model). The ring indicates the desired hand position while its color (and the color of the blue hand) indicates the requested gesture's intensity. The yellow hand represents the participant's virtual prosthetic hand and changes color based on the intensity at which the participant is performing the gesture. The score increases if the participant is performing the correct gesture. Bonus points are given if the participant is performing the gesture at the right position and intensity. Note that the software's screenshot only shows the right eye's view and thus does not reflect the depth information seen by the participant.

Before any recording started, the 3DC Armband was placed on the dominant arm of the participant. The armband was slid up until its circumference matched that of the participant’s forearm. A picture was then taken to serve as reference for the armband placement. In subsequent sessions, the participant placed the armband on their forearm themselves, aided only with the reference picture. Hence, electrode displacement between sessions is expected.

II-C1 Training Session

The training session's main purpose was to generate labeled data, while familiarizing the participants with the VR setup. To do so, the participants were asked to put on and adjust the VR headset to maximize comfort and minimize blurriness. The VR platform employed in this work is the Vive headset. After an adjustment period of a few minutes, the recording started. All in all, the delay between a participant putting the armband on their forearm and the start of the recording was approximately five minutes on average.

The VR environment showed the participant the gesture to perform using an animation of a 3D arm performing the gesture. All gesture recordings were made with the participants standing up with their forearm parallel to the floor and unsupported. Starting from the neutral gesture, they were instructed, with an auditory cue, to hold the depicted gesture for five seconds. The cue given to the participants was in the following form: "Gesture X, 3, 2, 1, Go". The data recording began just before the movement was started by the participant so as to capture the ramp-up segment of the muscle activity, and always started with the neutral gesture. The recording of the eleven gestures for five seconds each was referred to as a cycle. A total of four cycles (220 s of data) were recorded with no interruption between cycles (unless requested by the participant). When recording the second cycle, the participants were asked to perform each gesture (except the neutral gesture) with maximum intensity. This second cycle serves as a baseline for the maximum intensity of each gesture on a given day, on top of providing labeled data. For the other three cycles, a "normal" level of intensity was requested from the participants (43.43% of their perceived maximum intensity on average).

II-C2 Evaluation Session

The evaluation session's main purpose was to generate data containing the four main dynamic factors within an online setting. The sessions took the form of a "game", where the participants were randomly requested to hold a gesture at a given intensity and position in 3D. Figure 4 provides an overview of the evaluation session.

The evaluation session always took place after a training session within the VR environment, without removing the armband between the two sessions. The participants were first asked to stand with their arm stretched forward to calibrate the user's maximum reach. Then, the user was requested to bend their elbow 90 degrees, with their forearm parallel to the floor (this was the starting position). Once the participant was ready, the researcher started the experiment, which displayed a countdown to the participant in the game. When the game started, a random gesture was requested through text on the participant's head-up display. Additionally, a floating ring appeared at a random position within reach of the participant, with a maximum angle of 45 and 70 degrees in pitch and yaw respectively. The floating ring's color (blue, yellow and red) told the participant at what level of intensity to perform the requested gesture. Three levels of intensity were used: (1) less than 25%, (2) between 25% and 45% and (3) above 45% of the participant's maximal intensity as determined from the participant's first training session. A new gesture, position and intensity were randomly requested every five seconds, with a total of 42 gestures asked during an evaluation session (210 seconds).
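As an illustration of the three intensity bands described above, the following minimal sketch maps a measured activation to a level. The function name `intensity_level`, the way the activation is quantified and the handling of the exact 25% and 45% boundaries are assumptions made for this example; they are not specified in the text.

```python
def intensity_level(activation: float, max_activation: float) -> int:
    """Return 1, 2 or 3 for the three intensity bands of the evaluation session.

    `activation` and `max_activation` are hypothetical scalar measures of
    muscle activity (e.g. a mean absolute value); the text does not say
    how intensity is quantified.
    """
    if max_activation <= 0:
        raise ValueError("max_activation must be positive")
    ratio = activation / max_activation
    if ratio < 0.25:
        return 1  # less than 25% of the maximal intensity
    if ratio <= 0.45:
        return 2  # between 25% and 45% (boundary handling is an assumption)
    return 3      # above 45%


print(intensity_level(10.0, 100.0))  # 1
print(intensity_level(30.0, 100.0))  # 2
print(intensity_level(80.0, 100.0))  # 3
```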

During the experiment, and using the Leap Motion, a virtual prosthetic arm is mapped to the participant's arm, matching its position and pitch/yaw angles. The participant is thus able to intuitively know where their arm is in the VR environment and how to best reach the floating ring. However, the virtual prosthetic does not match the participant's hand/wrist movements nor their forearm's roll angle. Instead, the Leap Motion's data is leveraged to predict the subject's current gesture using a ConvNet (see Section III-A for details). The hand of the virtual prosthetic then moves to perform the predicted gesture (including supination/pronation with the roll angle) based on the data recorded during the training session, providing direct feedback to the participant. Note that the sEMG data has no influence on the gesture's prediction, so as not to bias the dataset toward a particular EMG classification algorithm. The virtual prosthetic also changes color (blue, yellow, red) based on the currently detected gesture intensity from the armband. Finally, a score is shown to the participant in real-time during the experiment. The score increases when the detected gesture matches the requested gesture. Bonus points are given when the participant correctly matches the requested gesture's intensity and is performing the gesture at the right position.

II-D Data Pre-processing

This work aims at studying the effect of the four main dynamic factors in myoelectric control systems. Consequently, the input latency is a critical factor to consider. As the optimal guidance latency was found to be between 150 and 250 ms [31], within this work, the data from each participant is segmented into 150 ms frames with an overlap of 100 ms. The raw data is then band-pass filtered between 20 and 495 Hz using a fourth-order Butterworth filter.
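The segmentation described above can be sketched as follows. At the armband's 1000 Hz sampling rate, a 150 ms frame holds 150 samples and a 100 ms overlap corresponds to a stride of 50 samples. The function name `sliding_windows` and the list-of-channels data layout are assumptions made for illustration, and the band-pass filtering step is deliberately left out of this sketch.

```python
def sliding_windows(signal, fs_hz=1000, window_ms=150, overlap_ms=100):
    """Split a multi-channel recording into overlapping frames.

    `signal` is a list of channels, each a list of samples (hypothetical
    layout). Returns a list of frames shaped [channel][time].
    """
    window = fs_hz * window_ms // 1000                 # 150 samples at 1 kHz
    stride = fs_hz * (window_ms - overlap_ms) // 1000  # 50 samples at 1 kHz
    n_samples = len(signal[0])
    frames = []
    for start in range(0, n_samples - window + 1, stride):
        frames.append([channel[start:start + window] for channel in signal])
    return frames


# Five seconds of synthetic 10-channel data (one gesture repetition).
recording = [[float(t) for t in range(5000)] for _ in range(10)]
frames = sliding_windows(recording)
print(len(frames), len(frames[0]), len(frames[0][0]))  # 98 10 150
```

Each frame has the Channel × Time layout expected by the classifier, and a new frame becomes available every 50 ms.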

II-E Experiments with the Long-term 3DC Dataset

The training sessions will be employed to compare the algorithms described in this work in an offline setting. When using the training sessions for comparison, the classifiers will be trained on the first and third cycle and tested on the fourth cycle. The second cycle, comprised of the maximal intensity gestures recording, is omitted as to only take into account electrode shift/non-stationarity of the signal and to allow an easier comparison with the literature.

The evaluation session is employed to study the impact of the four main dynamic factors on EMG-based gesture recognition. Classifiers will be trained on cycles 1, 3 and 4 of the training sessions and tested on the two evaluation sessions.

II-F 3DC Dataset

A second dataset, referred to as the 3DC Dataset and featuring 22 able-bodied participants, is used for architecture building, hyperparameter selection and pre-training. This dataset, presented in [28], features the same eleven gestures and is also recorded with the 3DC Armband. Its recording protocol closely matches the training session description (Section II-C1), with the difference being that two such sessions were recorded for each participant (single-day recording). This dataset was preprocessed as described in Section II-D. Note that when recording the 3DC Dataset, participants were wearing both the Myo and 3DC Armband; however, in this work, only the data from the 3DC Armband is employed.

III Deep Learning Classifiers

The following section presents the deep learning architectures employed in this work for the classification of both EMG data and images from the Leap Motion camera. The PyTorch [32] implementation of the networks is publicly available.

III-A Leap Motion Convolutional Network

For real-time myoelectric control, visual feedback helps the participant to produce more consistent and discriminative signals [33, 11]. Such feedback is also natural to have as the participant should, in most cases, be able to see the effect of their control. To avoid biasing the proposed dataset toward a particular EMG-based classification algorithm, the gesture feedback was provided using solely the Leap Motion.

Image classification is arguably the domain in which ConvNet-based architectures have had the greatest impact due, in part, to the vast amount of labeled data available [19]. However, within this work, and so as to provide consistent feedback, training data was limited to the first training session of each participant. Consequently, the network had to be trained with a low amount of data (around 200 examples per gesture). Additionally, while the training session was recorded with a constant point of view of the participant's hand, the evaluation session generated, by design, widely different points of view that the network had to contend with during inference.

The variable point-of-view problem was addressed using the capability of the Leap Motion camera to generate a 3D model of the participant's hand in the virtual environment. Three virtual depth-cameras were then placed around the arm's 3D representation at three different, fixed points of view to capture images of the 3D model (see Figure 5 (A) for an example). The three images were then merged together by having each image encoded within one channel of a three-channel color image (see Figure 5 (B) for examples). Finally, pixel intensity was inverted (so that a high value corresponds to a part of the hand being close to the camera) before being fed to the ConvNet. Note that one of the main reasons to use images as input instead of 3D point clouds is to reduce the computational requirement during both training and inference.
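The merging and inversion steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `merge_depth_views`, the nested-list image layout and the 8-bit pixel range (`max_value=255`) are hypothetical and are not taken from the published implementation.

```python
def merge_depth_views(view_a, view_b, view_c, max_value=255):
    """Stack three single-channel depth images into one three-channel image.

    Each view is a list of rows of pixel values (hypothetical layout). Pixel
    intensity is inverted so that a high value means "close to the camera".
    The result is indexed as [channel][row][column].
    """
    views = (view_a, view_b, view_c)
    height, width = len(view_a), len(view_a[0])
    for view in views:
        if len(view) != height or any(len(row) != width for row in view):
            raise ValueError("all three views must share the same shape")
    return [
        [[max_value - view[r][c] for c in range(width)] for r in range(height)]
        for view in views
    ]


front = [[0, 255], [10, 20]]
side = [[5, 5], [5, 5]]
top = [[255, 255], [0, 0]]
image = merge_depth_views(front, side, top)
print(len(image), image[0][0])  # 3 [255, 0]
```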

Fig. 5: (A) The depth images (darker pixels are closer) of the three virtual cameras taken at the same moment. The gesture captured is Wrist Flexion. Note that, regardless of the participant's movement, the three cameras are always placed so that they have the same point-of-view in relation to the forearm. (B) Examples of images fed to the ConvNet. The represented gestures from left to right: Wrist Flexion, Open Hand, Radial Deviation.

To address the data sparsity problem, the transfer learning algorithm described in [26, 11] was employed using the data from the 3DC Dataset for pre-training.

The leap motion ConvNet’s architecture is based on EfficientNet-B0 [34] and presented in Table I.

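| Level | Layer Type | Input Dimension (Height x Width) | # Layers Source Network | # Layers Target Network |
|---|---|---|---|---|
| 1 | Conv3x3 | 33 | 1 | 1 |
| 2 | ConvBlock3x3 | 16 | 2 | 1 |
| 3 | ConvBlock5x5 | 24 | 2 | 1 |
| 4 | ConvBlock3x3 | 32 | 2 | 1 |
| 5 | ConvBlock5x5 | 48 | 2 | 1 |
| 6 | ConvBlock5x5 | 64 | 2 | 1 |
|  | Conv1x1 & Pooling & FC | 64 | 1 | 1 |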

Each row describes a level of the ConvNet. The pooling layer is a global average pooling layer (giving one value per channel), while ”FC” refers to a fully connected layer.

TABLE I: Leap motion ConvNet’s architecture

III-B EMG-based Temporal Convolutional Network

TCNs generally differ in two aspects from standard ConvNets. First, TCNs leverage stacked layers of dilated convolutions to achieve large receptive fields with few layers. A dilated convolution (also known as convolution à trous or convolution with holes) is a convolutional layer where the kernel is applied over a longer range by skipping input values by a constant amount [23]. Typically, the dilation coefficient ($d$) is defined as $d = 2^i$, where $i$ is the $i$th layer from the input (starting with $i=0$). The second difference is that TCNs are built with dilated causal convolutions, where the causal part means that the output at time $t$ is convolved only with elements from time $t$ or earlier. In practice, such a behavior is achieved (in the 1D case with PyTorch) by padding the left side (assuming time flows from left to right) of the vector to be convolved by $(k-1) \times d$, where $k$ is the kernel's size. This also ensures a constant output size throughout the layers.
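To make the causal padding concrete, the sketch below implements a single-channel dilated causal convolution in plain Python rather than PyTorch. It assumes the usual TCN convention of left-padding by (k − 1) × d zeros and, for the receptive-field helper, a dilation of 2^i at layer i with one convolution per layer. The function names are hypothetical, and the kernel indexing convention (tap 0 on the current sample) is a choice made for this example.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Dilated causal convolution of a 1D sequence.

    The input is left-padded with (k - 1) * dilation zeros, so the output has
    the same length as `x` and out[t] only depends on x[t], x[t - d],
    x[t - 2d], ... (never on future samples).
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            # padded[t + pad] is x[t]; tap j reaches back j * dilation steps.
            acc += kernel[j] * padded[t + pad - j * dilation]
        out.append(acc)
    return out


def receptive_field(kernel_size, n_layers):
    """Receptive field of n stacked layers with dilation 2**i at layer i."""
    return 1 + (kernel_size - 1) * (2 ** n_layers - 1)


print(causal_dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0]
print(receptive_field(kernel_size=3, n_layers=3))  # 15
```

The output keeps the input length, and changing a later sample never alters an earlier output, which is the property that makes the layer usable for real-time prediction.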

The proposed TCN receives the sEMG data with shape Channel × Time (10 × 150). The architecture is based on [27, 25]. The PyTorch implementation is derived from [25].

The TCN's architecture (see Figure 6) contains three blocks followed by a global average pooling layer before the output layer. Each block encapsulates a dilated causal convolutional layer [23] followed by batch normalization, leaky ReLU [36] and dropout [37].

Fig. 6: The ConvNet's architecture employing 104 788 learnable parameters. In this figure, the blocks are indexed by i (the ith block). Conv refers to a convolutional layer while Chomp removes the padding after the convolution.

Adam [38] is employed for the TCN's optimization with an initial learning rate of 0.0404709 and batch size of 512. 10% of the training data is held out as a validation set which is used for early stopping (with a ten epochs threshold) and learning rate annealing (factor of five and a patience of five). Note that all architecture choices and hyperparameter selection were performed using the 3DC Dataset or previous works.
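The interplay between early stopping and learning rate annealing can be sketched as below. This is one possible reading of the settings given above, stated as assumptions: the learning rate is divided by five after five consecutive epochs without validation improvement, and training stops after ten epochs without improvement. The function name `run_schedule` is hypothetical and the optimizer itself is not modeled.

```python
def run_schedule(val_losses, lr=0.0404709, anneal_factor=5.0,
                 anneal_patience=5, stop_patience=10):
    """Replay a sequence of validation losses.

    Returns (stop_epoch, final_lr); stop_epoch is None when early stopping
    is never triggered.
    """
    best = float("inf")
    since_best = 0     # epochs since the best validation loss
    since_anneal = 0   # non-improving epochs since the last learning rate cut
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_best = 0
            since_anneal = 0
        else:
            since_best += 1
            since_anneal += 1
        if since_best >= stop_patience:
            return epoch, lr            # early stopping
        if since_anneal >= anneal_patience:
            lr /= anneal_factor         # learning rate annealing
            since_anneal = 0
    return None, lr


# One good epoch followed by a plateau: one learning rate cut at epoch 5,
# then early stopping at epoch 10.
stop_epoch, final_lr = run_schedule([1.0] + [2.0] * 20)
print(stop_epoch)  # 10
```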

III-C Calibration Training Methods

This work considers three calibration methods for long-term classification of sEMG signals: No Calibration, Re-Calibration and Delayed Calibration. In the first case, the network is trained solely from the data of the first session. In the Re-Calibration case, the model is re-trained at each new session with the new labeled data. To leverage previous data, fine-tuning [39] is applied. That is, during re-calibration, the weights of the network are first initialized with the weights found from the previous session. Note that the proposed transfer learning (Section IV) will also use the Re-Calibration setting. Delayed Calibration is similar to Re-Calibration, but the network is re-calibrated on the previous session instead of the newest one. The purpose of Delayed Calibration is to see how the classifier’s degradation evolves when there is a similar amount of days since each previous calibration.
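The three schemes differ only in which session supplied the labeled data for the most recent calibration. The sketch below encodes one reading of the description above, with sessions indexed from 0; the function name `last_calibration_session` and the string identifiers are hypothetical, and for the first session all three methods coincide by assumption.

```python
def last_calibration_session(method: str, test_session: int) -> int:
    """Index of the session whose labeled data last calibrated the classifier.

    Sessions are 0-indexed; `test_session` is the session being evaluated.
    """
    if method == "no_calibration":
        return 0                          # trained on the first session only
    if method == "re_calibration":
        return test_session               # re-trained with the newest session
    if method == "delayed_calibration":
        return max(test_session - 1, 0)   # re-trained on the previous session
    raise ValueError(f"unknown method: {method}")


for method in ("no_calibration", "re_calibration", "delayed_calibration"):
    print(method, [last_calibration_session(method, s) for s in range(4)])
```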

IV Transfer Learning

Over multiple re-calibration sessions, a large amount of labeled data is recorded. However, standard training methods are limited to the data from the most recent session as they cannot take into account the signal drift between each recording. Transfer learning algorithms, on the other hand, can be developed to account for such signal disparity. Consequently, this work proposes to combine the Adaptive Domain Adversarial Neural Network (ADANN) training presented in [27] and the transfer learning algorithm presented in [11] for inter-session gesture recognition. This new algorithm is referred to as Transferable Adaptive Domain Adversarial Neural Network (TADANN). For simplicity's sake, the ensemble of calibration sessions prior to the most recent one is referred to as the pre-calibration sessions, whereas the most recent one is referred to as the calibration session.

The proposed algorithm contains a pre-training and a training step. During pre-training, each session within the pre-calibration sessions is considered as a separate labeled domain dataset. At each epoch, pre-training is performed by sharing the weights of a network across all the domains (i.e. pre-calibration sessions), while the Batch-Normalization (BN) statistics are learned independently for each session [27]. The idea behind ADANN is then to extract a general feature representation from this multi-domain setting. To do so, a domain classification head (with two neurons) is added to the network. At each epoch, a batch is created containing examples from a single, randomly selected session at a time (referred to as the source batch). A second batch (the target batch) is then created from another randomly selected session (different from the one used to create the source batch). The examples from the source batch are assigned the domain-label 0, while the domain-label 1 is assigned to the examples from the target batch. Then, a gradient reversal layer [40] is used right after the domain-head during backpropagation to force the network to learn a session-independent feature representation. Note that the BN statistics used by the network correspond to the session from which the source or target batch originates, but that they are updated only with the source batch. Similarly, the classification head is used to back-propagate the loss only with the source batch.
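The source/target batch construction can be sketched as follows. This is a minimal illustration that only covers sampling and domain labeling; the network, the per-session BN statistics and the gradient reversal layer are not modeled. The function name `sample_domain_batches` and the `(example, gesture_label)` data layout are assumptions made for this example.

```python
import random


def sample_domain_batches(sessions, batch_size, rng):
    """Draw a source and a target batch from two distinct sessions.

    `sessions` is a list of sessions, each a list of (example, gesture_label)
    pairs (hypothetical layout). Source examples get domain label 0 and keep
    their gesture label; target examples get domain label 1 and their gesture
    label is dropped, since only the source batch feeds the classification
    loss.
    """
    if len(sessions) < 2:
        raise ValueError("at least two pre-calibration sessions are required")
    source_idx, target_idx = rng.sample(range(len(sessions)), 2)
    source = rng.sample(sessions[source_idx],
                        min(batch_size, len(sessions[source_idx])))
    target = rng.sample(sessions[target_idx],
                        min(batch_size, len(sessions[target_idx])))
    source_batch = [(x, y, 0) for x, y in source]
    target_batch = [(x, None, 1) for x, _ in target]
    return source_idx, source_batch, target_idx, target_batch


rng = random.Random(0)
# Three synthetic sessions of 20 examples each; examples are (session, id).
sessions = [[((s, i), i % 11) for i in range(20)] for s in range(3)]
src_idx, src_batch, tgt_idx, tgt_batch = sample_domain_batches(sessions, 8, rng)
print(src_idx != tgt_idx, len(src_batch), len(tgt_batch))  # True 8 8
```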

After pre-training is completed, the learned weights are frozen, except for the BN parameters, which allow the network to adapt to a new session. Then, a second network is initialized (in this work, the second network is identical to the pre-trained network) and connected with an element-wise summation operation in a layer-by-layer fashion to the pre-trained network (see [11] for details). Additionally, all outputs from the pre-trained network are multiplied by a learnable coefficient (clamped between 0 and 2) before the summation, so as to provide an easy mechanism to neuter or increase the influence of the pre-trained network at a layer-wise level.
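The layer-wise merge can be sketched as below for a single layer, using plain lists in place of tensors. The function name `merge_layer_outputs` is hypothetical, and applying the clamp at merge time (rather than on the stored parameter) is an assumption made for this example.

```python
def merge_layer_outputs(second_net_out, pretrained_out, coefficient):
    """Element-wise sum of a layer's output with the scaled pre-trained output.

    `coefficient` stands for the learnable scalar of this layer; it is clamped
    to [0, 2] so the pre-trained contribution can be silenced (0) or at most
    doubled (2).
    """
    scale = min(max(coefficient, 0.0), 2.0)
    return [s + scale * p for s, p in zip(second_net_out, pretrained_out)]


print(merge_layer_outputs([1.0, 2.0], [10.0, 20.0], 0.5))   # [6.0, 12.0]
print(merge_layer_outputs([1.0, 2.0], [10.0, 20.0], 5.0))   # [21.0, 42.0]
print(merge_layer_outputs([1.0, 2.0], [10.0, 20.0], -1.0))  # [1.0, 2.0]
```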

V Results

V-A Training Sessions: Over-time classification accuracy

Figure 7 shows the average accuracy over-time across all participants for the three calibration methods and with TADANN (which uses the Re-Calibration method).

Fig. 7: Average accuracy over-time calculated on the last cycle of the training sessions. The values given on the x-axis represent the average time (in days) elapsed between the current session and the first session across all participants.

Based on Cohen's d, the effect size of using Re-Calibration vs No-Calibration varies from large to very large [41] (0.95 and 1.32 for session two and three respectively). Overall, TADANN was the best performing method, achieving an average accuracy of 84.44% ± 19.15% and 89.04% ± 6.49% compared to 79.96% ± 18.40% and 80.49% ± 21.58% for session two and three respectively. Using the Wilcoxon signed rank test [42] shows that TADANN significantly outperforms Re-Calibration (adjusted p-value < 0.05 for both session two and three). The effect size was small (0.40) and medium (0.54) using Cohen's d on session two and three respectively. Note that statistical tests were not performed for session four due to the sample size (n=6).
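For reference, Cohen's d can be computed as sketched below. The text does not state which variant was used, so the pooled-standard-deviation form shown here is an assumption, and the accuracy values are made up purely to exercise the function.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d between two samples using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical per-participant accuracies (not data from this study).
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 3.0, 5.0]
print(cohens_d(method_a, method_b))  # 0.5
```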

V-B Evaluation Session

Figure 8 shows the scores obtained for all participants on the evaluation sessions with respect to TADANN's accuracy from the corresponding session. The Pearson correlation coefficient r was computed between the score and the accuracy. The average score obtained during the first recording session was 5634 ± 1521, which increased to 6615 ± 1661 on session three, showing that the participants improved.

Fig. 8: Score obtained by each participant at each evaluation session with respect to TADANN's accuracy on the evaluation sessions. The translucent band around the regression represents the standard deviation.

V-B1 Over-time classification accuracy

Figure 9 shows the average accuracy over-time on the evaluation sessions across all participants for the three calibration methods and with TADANN (which uses the Re-Calibration method). Re-Calibration again outperforms No-Calibration, and the effect was small (0.31 and 0.45) according to Cohen's d for session two and three respectively. TADANN again significantly outperformed Re-Calibration (adjusted p-value < 0.05 for both sessions), and the effect size was 0.11 and 0.19 for session two and three respectively.

Fig. 9: Average accuracy over-time calculated on the evaluation sessions. The values given on the x-axis represent the average time (in days) elapsed between the current session and the first session across all participants.

V-B2 Limb orientation

The impact of limb position on the Re-Calibrated ConvNet's accuracy is shown in Figure 10. Accuracies were computed on the online dataset across all sessions and all participants. The first 1.5 s after a new gesture was requested were removed from the data used to generate Figure 10, so as to reduce the impact of gesture transitions.

Fig. 10: Accuracy with respect to the pitch and yaw angles. The dotted line indicates the neutral orientation. Note that a minimum threshold of 600 examples per pitch/yaw combination was set to show the accuracy.

V-B3 Gesture intensity

Figure 11 shows the impact of gesture intensity on the Re-Calibration classifier’s accuracy. Accuracies were computed on the online dataset across all sessions and all participants (excluding the neutral gesture). The first 1.5s after a new gesture was requested were again removed from the data used to generate Figure 11.

Fig. 11: Average accuracy obtained from the re-calibrated ConvNet with respect to the percentage of the maximum activation when performing the gestures over all evaluation sessions across all participants. Note that the data from the gesture transition period were ignored when computing the accuracy (by removing one second of data whenever a new gesture was requested).

VI Discussion

This paper leverages the leap motion for gesture recognition to avoid biasing the real-time dataset toward a particular sEMG-based algorithm. Figure 8 shows that the score obtained from a session correlates with the accuracy obtained from the same session. Note that the three lowest scores come from sessions where the leap motion lost tracking of the hand particularly often. Comparing the Delayed Calibration with the No Calibration from Figures 7 and 9 shows that participants were able to learn to produce more consistent gestures across sessions (from an sEMG-based classifier's perspective). Thus, feedback provided by the leap motion seems to act as a good proxy, while also removing the bias normally present in online datasets. Qualitatively, the participants enjoyed the gamification of the experiment, as almost all of them were trying to beat their own high score and to climb to the top of the leaderboard. Additionally, several participants requested to do "one more try" to try to achieve a high score (only allowed after their last session). As such, virtual reality can provide an entertaining environment in which to perform complex 3D tasks [14, 15, 16] at an affordable cost when compared to using robotic arms or myoelectric prostheses.

Inter-day classification was shown to have a significant impact both offline and online. With standard classification algorithms, the need for periodic re-calibration is thus apparent. The proposed TADANN algorithm was shown to consistently achieve higher accuracy than simple fine-tuning re-calibration. In this particular dataset, on a per-subject basis, TADANN routinely outperformed fine-tuning by more than 5%, whereas when fine-tuning outperformed TADANN, a margin of 1% or less was the most common. The difference between the two also grew as TADANN could pre-train on more sessions. Thus, future work will consider even more sessions per participant to evaluate TADANN.

Figure 10 shows that gestures which were performed while the participant’s arm was externally rotated were, in general, the hardest for the classifier to correctly predict. This is likely due to the fact that the origin of the brachioradialis muscle (which is under the area of recording) is the lateral supracondylar ridge of the humerus. It is possible, therefore, that as the humerus becomes more externally rotated, the geometry of the brachioradialis changes, affecting the observed signals. In addition, the arm may tend to supinate slightly for higher levels of external humeral rotation, which is known to create a worse limb position effect than the overall arm position. In contrast, when the participant’s arm was internally rotated, no such drastic drop in performance was noted. As shown in [43], training a classifier by including multiple limb positions can improve inter-position performance. Consequently, it might be beneficial for future studies to focus on including externally rotated forearm positions within the training dataset. Note, however, that while the participants were instructed to limit torso rotation as much as possible, they were not restrained, and consequently such rotations are likely present within the dataset. This might explain the decrease in accuracy observed for the external rotation. Participants accepted an external rotation until they felt uncomfortable and then rotated their torso. This also explains the lower number of examples with an external yaw and a downward pitch, as such combinations tend to be uncomfortable (the software considered all angle combinations with equal probability).

The impact of gesture intensity obtained within this study corroborates past findings in the literature [43]. The classifier is relatively unaffected by different levels of gesture intensity between 17% and 50%. Additionally, at lower intensities, the main source of error is confusion with the neutral gesture. However, it has been shown that rejection-based classifiers can improve a classifier’s usability [44]. The problematic intensities are thus all above 50% of maximal gesture intensity.
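To make the idea of a rejection-based classifier concrete, the sketch below thresholds the highest class probability and withholds the decision when the classifier is not confident enough. This is a generic illustration and not necessarily the scheme of [44]; the function name, the 0.8 threshold and the use of `None` as the rejection label are all assumptions.

```python
def classify_with_rejection(probabilities, threshold=0.8, reject_label=None):
    """Return the most probable class index, or `reject_label` if confidence is too low."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best if probabilities[best] >= threshold else reject_label


# Hypothetical classifier outputs over three gestures, for illustration only.
print(classify_with_rejection([0.10, 0.85, 0.05]))  # confident: class index 1
print(classify_with_rejection([0.40, 0.35, 0.25]))  # ambiguous: rejected
```

In a control setting, a rejected decision could for example be mapped to the neutral gesture, so that low-confidence predictions produce no movement rather than a wrong one.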

The main limitation of this study is the relatively large gap between sessions. While such a scenario is realistic (e.g. for a consumer-grade armband used to play video games or make a presentation), it does not allow the change in signals within a day to be observed smoothly. As such, future work will expand upon the current dataset to include more frequent evaluation sessions for each participant (including multiple sessions within the same day).

VII Conclusion

This paper presented a new VR experimental protocol for sEMG-based gesture recognition leveraging the leap motion camera so as not to bias the online dataset. Quantitatively and qualitatively, the participants were shown to improve over time and were motivated to take part in the experiment. Overall, TADANN was shown to significantly outperform fine-tuning. The VR environment in conjunction with the leap motion allowed the impact of limb position to be quantified with, to the best of the authors’ knowledge, the highest resolution yet.

Future work will use self-calibrating algorithms based on domain adversarial training [45] to hopefully reduce the impact of transient change in sEMG signals.


The authors would like to thank Alexandre Campeau-Lecours for his support without which this manuscript would not have been possible.


This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) [funding reference numbers 401220434, 376091307, 114090] and the Institut de recherche Robert-Sauvé en santé et en sécurité du travail (IRSST).


  • [1] M. Hakonen, H. Piitulainen, and A. Visala, “Current state of digital signal processing in myoelectric interfaces and related applications,” Biomedical Signal Processing and Control, vol. 18, pp. 334–359, 2015.
  • [2] X. Shusong and Z. Xia, “Emg-driven computer game for post-stroke rehabilitation,” in 2010 IEEE Conference on Robotics, Automation and Mechatronics.   IEEE, 2010, pp. 32–36.
  • [3] D. St-Onge, U. Côté-Allard, K. Glette, B. Gosselin, and G. Beltrame, “Engaging with robotic swarms: Commands from expressive motion,” ACM Transactions on Human-Robot Interaction (THRI), vol. 8, no. 2, p. 11, 2019.
  • [4] M. A. Oskoei and H. Hu, “Myoelectric control systems—a survey,” Biomedical signal processing and control, vol. 2, no. 4, pp. 275–294, 2007.
  • [5] A. Phinyomark, F. Quaine, S. Charbonnier, C. Serviere, F. Tarpin-Bernard, and Y. Laurillau, “Emg feature evaluation for improving myoelectric pattern recognition robustness,” Expert Systems with applications, vol. 40, no. 12, pp. 4832–4840, 2013.
  • [6] P. Parker, K. Englehart, and B. Hudgins, “Myoelectric signal processing for control of powered limb prostheses,” Journal of electromyography and kinesiology, vol. 16, no. 6, pp. 541–548, 2006.
  • [7] E. Scheme and K. Englehart, “Electromyogram pattern recognition for control of powered upper-limb prostheses: state of the art and challenges for clinical use.” Journal of Rehabilitation Research & Development, vol. 48, no. 6, 2011.
  • [8] C. Castellini, A. E. Fiorilla, and G. Sandini, “Multi-subject/daily-life activity emg-based control of mechanical hands,” Journal of neuroengineering and rehabilitation, vol. 6, no. 1, p. 41, 2009.
  • [9] S. M. Radhakrishnan, S. N. Baker, and A. Jackson, “Learning a novel myoelectric-controlled interface task,” Journal of neurophysiology, vol. 100, no. 4, pp. 2397–2408, 2008.
  • [10] M. A. Powell, R. R. Kaliki, and N. V. Thakor, “User training for pattern recognition-based myoelectric prostheses: Improving phantom limb movement consistency and distinguishability,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 3, pp. 522–532, 2013.
  • [11] U. Côté-Allard, C. L. Fall, A. Drouin, A. Campeau-Lecours, C. Gosselin, K. Glette, F. Laviolette, and B. Gosselin, “Deep learning for electromyographic hand gesture signal classification using transfer learning,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 4, pp. 760–771, 2019.
  • [12] U. C. Allard, F. Nougarou, C. L. Fall, P. Giguère, C. Gosselin, F. Laviolette, and B. Gosselin, “A convolutional neural network for robotic arm guidance using semg based frequency-features,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 2464–2470.
  • [13] A. Ameri, M. A. Akhaee, E. Scheme, and K. Englehart, “Real-time, simultaneous myoelectric control using a convolutional neural network,” PloS one, vol. 13, no. 9, p. e0203835, 2018.
  • [14] J. L. Pons, R. Ceres, E. Rocon, S. Levin, I. Markovitz, B. Saro, D. Reynaerts, W. Van Moorleghem, and L. Bueno, “Virtual reality training and emg control of the manus hand prosthesis,” Robotica, vol. 23, no. 3, pp. 311–317, 2005.
  • [15] A. Al-Jumaily and R. A. Olivares, “Electromyogram (emg) driven system based virtual reality for prosthetic and rehabilitation devices,” in Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services.   ACM, 2009, pp. 582–586.
  • [16] D. Blana, T. Kyriacou, J. M. Lambrecht, and E. K. Chadwick, “Feasibility of using combined emg and kinematic signals for prosthesis control: A simulation study using a virtual reality environment,” Journal of Electromyography and Kinesiology, vol. 29, pp. 21–27, 2016.
  • [17] D. Holz, K. Hay, and M. Buckwald, “Electronic sensor,” Apr. 14 2015, US Patent App. 29/428,763.
  • [18] M. Zia ur Rehman, A. Waris, S. Gilani, M. Jochumsen, I. Niazi, M. Jamil, D. Farina, and E. Kamavuako, “Multiday emg-based classification of hand motions with deep learning techniques,” Sensors, vol. 18, no. 8, p. 2497, 2018.
  • [19] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [20] Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition using cnn-rnn and c3d hybrid networks,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 445–450.
  • [21] Y. Guo, Y. Liu, E. M. Bakker, Y. Guo, and M. S. Lew, “Cnn-rnn: a large-scale hierarchical image classification framework,” Multimedia Tools and Applications, vol. 77, no. 8, pp. 10 251–10 271, 2018.
  • [22] Y. Hu, Y. Wong, W. Wei, Y. Du, M. Kankanhalli, and W. Geng, “A novel attention-based hybrid cnn-rnn architecture for semg-based gesture recognition,” PloS one, vol. 13, no. 10, p. e0206049, 2018.
  • [23] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [24] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
  • [25] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [26] U. Cote-Allard, C. L. Fall, A. Campeau-Lecours, C. Gosselin, F. Laviolette, and B. Gosselin, “Transfer learning for semg hand gestures recognition using convolutional neural networks,” in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC).   IEEE, 2017, pp. 1663–1668.
  • [27] U. Côté-Allard, E. Campbell, A. Phinyomark, F. Laviolette, B. Gosselin, and E. Scheme, “Interpreting deep learning features for myoelectric control: A comparison with handcrafted features,” arXiv preprint arXiv:1912.00283, 2019.
  • [28] U. Côté-Allard, G. Gagnon-Turcotte, F. Laviolette, and B. Gosselin, “A low-cost, wireless, 3-d-printed custom armband for semg hand gesture recognition,” Sensors, vol. 19, no. 12, p. 2811, 2019.
  • [29] A. Phinyomark and E. Scheme, “A feature extraction issue for myoelectric control based on wearable emg sensors,” in 2018 IEEE Sensors Applications Symposium (SAS).   IEEE, 2018, pp. 1–6.
  • [30] F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler, “Analysis of the accuracy and robustness of the leap motion controller,” Sensors, vol. 13, no. 5, pp. 6380–6393, 2013.
  • [31] L. H. Smith, L. J. Hargrove, B. A. Lock, and T. A. Kuiken, “Determining the optimal window length for pattern recognition-based myoelectric control: balancing the competing effects of classification error and controller delay,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 19, no. 2, pp. 186–192, 2010.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.
  • [33] T. Pistohl, D. Joshi, G. Ganesh, A. Jackson, and K. Nazarpour, “Artificial proprioceptive feedback for myoelectric control,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 23, no. 3, pp. 498–507, 2014.
  • [34] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
  • [35] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [36] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
  • [37] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059.
  • [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [39] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
  • [40] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [41] S. S. Sawilowsky, “New effect size rules of thumb,” Journal of Modern Applied Statistical Methods, vol. 8, no. 2, p. 26, 2009.
  • [42] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine learning research, vol. 7, no. Jan, pp. 1–30, 2006.
  • [43] E. Scheme, A. Fougner, Ø. Stavdahl, A. D. Chan, and K. Englehart, “Examining the adverse effects of limb position on pattern recognition based myoelectric control,” in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology.   IEEE, 2010, pp. 6337–6340.
  • [44] E. J. Scheme, B. S. Hudgins, and K. B. Englehart, “Confidence-based rejection for improved pattern recognition myoelectric control,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 6, pp. 1563–1570, 2013.
  • [45] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand, “Domain-adversarial neural networks,” arXiv preprint arXiv:1412.4446, 2014.