Introduction
Reliable decoding of intended movements from sEMG signals is crucial for restoring movement to disabled persons. Hence, the development of a reliable and robust MMI decoder remains one of the primary goals of the BMI community. Real-life applications such as prosthesis control, rehabilitation-aiding exoskeletons, and MMIs for gaming are active research areas. Recently published studies on movement classification based on sEMG showed promising results with high accuracy and real-time processing capabilities [8512820, 8911244, faust2018deep]. However, intra- and interpersonal variability in sEMG signals can still complicate the calibration of a model to a new user.
This paper investigates the effects of interpersonal variability on the classification accuracy of deep learning based sEMG classifiers.
In previously published studies, deep neural networks showed promising classification accuracy on sEMG signals at inference time and outperformed conventional statistical and machine learning models, especially on more complex classification tasks requiring less manual feature engineering [faust2018deep, tao2019multi, xiong2021deep]. Even though the results of recent studies seem very promising in lab settings, real-life applications are much more complicated because adaptation to unseen users is challenging. sEMG signals exhibit high variance between subjects [araujo2000inter], which is even larger between healthy and amputated subjects [campbell2020comparison, campbell2019differences]. Other factors such as electrode placement [hogrel1998variability], fatigue [linssen1993variability], heat [herrera2020temperature], skin fat ratio [nordander2003influence] and gender [meduri2016inter] also influence the inter- and intra-subject variance of sEMG measurements. This high inter-subject variability makes it difficult for machine learning models to generalize well to unseen subjects. The effect of this variability has been studied to some extent for other aspects of the signal processing pipeline (e.g. normalization [burden2010should]
) or the classification algorithms (e.g. Random Forests
[palermo2017repeatability], SVM [kanoga2021semi, 6977190] or Muscle Source Activation Models [kim2020subject]). While simple models can relatively easily be retrained on each new user, this is more challenging for larger neural networks due to their huge parameter space, which results in comparatively long training times and a high risk of overfitting on small sample sizes. This limits the applicability of deep learning in a rehabilitation framework, where a short calibration using only a few repetitions from each patient is desirable. So far, however, there is a lack of research on the effects of interpersonal differences on the performance of neural network based sEMG classifiers.
Some recent works addressed the problem using few-shot learning [rahimian2021fs], adaptation layers with RNNs [ketyko2019domain] and adaptive batch normalization on CNNs [lin2020normalisation, cote2017transfer, du2017surface]. In this paper, we thoroughly evaluated the effectiveness of basic weight-initialization based transfer learning for user adaptation and compared its performance and training times against user-specific deep learning models. To the best of our knowledge, weight-initialization based transfer learning has not previously been investigated for EMG-based decoders.
Possible benefits of weight-initialization based transfer learning are its straightforward implementation and higher flexibility during model building, because fine-tuning on new user data can in principle be done with any pretrained model. This paper thoroughly investigates how well this basic form of transfer learning is suited for the task of sEMG classification and how much knowledge transfer takes place. Most papers so far only compared the performance of classifiers with and without transfer learning in fixed settings. In this paper, we compare subject-specific, pretrained and fine-tuned models under varying conditions. We aim to gain insight into the question under which circumstances transfer learning based user calibration of sEMG classifiers performs well.
Method
Transfer Learning
Generally, transfer learning denotes machine learning applications where knowledge learned in one situation is transferred to a different but related situation. Domain adaptation is a more specific term, referring to the case where the task (input-output mapping) stays the same, but the input or output data distribution differs slightly between training and inference time, usually necessitating a slight change of the model parameters. In this sense, applying a pretrained deep learning model for sEMG classification to a new user is an example of domain adaptation.
There are several approaches for transferring knowledge to a new domain. In this paper, we utilize the commonly used approach of weight-initialization and subsequent fine-tuning. For calibrating our deep learning models to a new user, we transfer the weights from a pretrained model instead of random initialization and subsequently fine-tune it on data from the new user.
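As an illustration, the sketch below shows this calibration step in Keras, which we also use for our implementation; the data arrays, learning rate and training settings shown here are placeholders rather than the tuned values reported later.

```python
import tensorflow as tf

def calibrate_to_new_user(pretrained_model, x_new_user, y_new_user):
    # Clone the architecture and copy the pretrained weights instead of
    # starting from a random initialization.
    model = tf.keras.models.clone_model(pretrained_model)
    model.set_weights(pretrained_model.get_weights())

    # Fine-tune on the new user's data; early stopping guards against
    # overfitting to the small calibration set.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_new_user, y_new_user,
              validation_split=0.1, epochs=50,
              callbacks=[tf.keras.callbacks.EarlyStopping(
                  patience=5, restore_best_weights=True)])
    return model
```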
The effect of transfer learning can be seen by comparing the model performance during training with and without transfer learning. Effective knowledge transfer manifests as a higher starting performance, a steeper learning curve (faster learning), and/or a higher asymptotic performance.
We compare deep learning models for sEMG classification trained and evaluated on only one subject (subject-specific models) with transfer learning models, where we pretrain the model on other users' data and then fine-tune the weights on the subject used for evaluation. If transfer learning takes place, we expect curves similar to figure 1.
Data
For the experiments described in this paper, we utilized several publicly available databases published by the Ninapro (Non Invasive Adaptive Prosthetics) Project [atzori2014electromyography].
The aim of the project was the creation and dissemination of benchmark scientific databases of surface electromyography data for hand prostheses. The databases contain recordings from similar experiments (same number of repetitions, movements etc.), from different subjects using different recording equipment.
In this paper, we perform the experiments on Databases 2, 3 and 4. We focus on these three because the other Ninapro databases cover more specific hardware settings and experimental situations, making them less easy to compare. The key differences between the databases used are described in table 1.
For each database, participants were asked to repeat a number of movements and force patterns. A total of 61 movement patterns (plus rest) were used for the project and split into 4 exercise groups (see figure 2). Not every database covers every exercise group. We restrict our experiments to exercise groups A and B, which cover basic movements of the fingers and the wrist.
The data acquisition procedure stays as similar as possible for every database (see figure 2
). The participants were asked to repeat previously known movements shown via video. Each repetition took 5 seconds, with a 3 second resting period in between. The number of repetitions may differ between databases. Intact subjects performed the exercises with the right arm, while amputated subjects were asked to imagine the shown movements as naturally as possible with their missing limb. Different kinds of electrodes were placed to record sEMG signals. Data was recorded using up to 12 electrodes simultaneously. These electrodes were attached to the triceps brachii, biceps brachii, extensor digitorum superficialis and flexor digitorum superficialis. In addition, 8 electrodes were equally spaced around the forearm at the height of the radio-humeral joint. For intact subjects, hand kinematics were additionally measured with a CyberGlove system. Depending on the sEMG-capturing system used, a noise filter may be applied. All databases are published with synchronized data, super-sampled to either 2 kHz or 100 Hz using linear or nearest-neighbor interpolation.


 | DB2 | DB3 | DB4
---|---|---|---
Citation | [atzori2014electromyography, gijsberts2014measuring] | [atzori2014electromyography, atzori2016clinical] | [atzori2014electromyography, pizzolato2017comparison]
Subjects | 28 male / 12 female | 11 male | 6 male / 4 female
Amputation | no | yes | no
Age | 29.9 ± 3.9 | 42.36 ± 11.96 | 29.6 ± 9.24
Exercise used | B (17 classes) | B (17 classes) | A (12 classes)
Repetitions | 6 | 6 | 6
sEMG system | Delsys Trigno | Delsys Trigno | Cometa/Dormo
Experiment
Models and Hyperparameters
Data Processing
The raw data is preprocessed before classification. The following steps are involved. First, we standardize the sEMG signal and denoise it using a 4th-order Butterworth bandpass filter (20 Hz to 400 Hz). The resulting signal is then rectified and sliced using a 200 ms sliding window with 10 ms overlap. Lastly, the order of the slices is randomly shuffled, which was observed to improve generalisation in pre-tests.
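The following sketch illustrates this preprocessing chain with NumPy and SciPy. It assumes a 2 kHz sampling rate and reads the 10 ms figure as the overlap between consecutive 200 ms windows; the array names are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 2000                  # sampling rate in Hz
WIN = int(0.200 * FS)      # 200 ms window -> 400 samples
OVERLAP = int(0.010 * FS)  # 10 ms overlap -> 20 samples
STEP = WIN - OVERLAP

def preprocess(emg):
    """emg: array of shape (n_samples, n_channels)."""
    emg = (emg - emg.mean(axis=0)) / emg.std(axis=0)          # standardize
    sos = butter(4, [20, 400], btype="bandpass", fs=FS, output="sos")
    emg = sosfiltfilt(sos, emg, axis=0)                       # denoise
    emg = np.abs(emg)                                         # rectify
    starts = range(0, emg.shape[0] - WIN + 1, STEP)           # slice
    return np.stack([emg[s:s + WIN] for s in starts])         # (n_windows, 400, n_channels)
```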
Models
We chose two simple deep learning models for sEMG classification.
The first model is a multilayer perceptron (MLP), which classifies the sEMG signal using extracted features. We chose a set of 18 commonly used sEMG features (as e.g. in [pereira2022automatic]): 11 time domain features, 6 frequency domain features and one wavelet based feature:
The time domain features are Variance, Root Mean Squared, Integral, Mean Absolute Value, Log Detector, Waveform Length, Average Amplitude Change, Difference Absolute Standard Deviation Value, Zero Crossings, Willison Amplitude and Myopulse Percentage Rate. The frequency domain features are Frequency Ratio, Mean Frequency, Median Frequency, Peak Frequency, Mean Power, Total Power and the wavelet based feature is the Wavelet Histogram.
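As an illustration, the snippet below computes a handful of these features for a single window using common textbook definitions (applied here to the band-passed signal, which is an assumption); the remaining features follow the same per-channel pattern.

```python
import numpy as np

def example_features(win, fs=2000):
    """Compute a subset of the listed features for one window.

    win: array of shape (n_samples, n_channels).
    """
    rms = np.sqrt(np.mean(win ** 2, axis=0))                        # Root Mean Squared
    mav = np.mean(np.abs(win), axis=0)                              # Mean Absolute Value
    wl = np.sum(np.abs(np.diff(win, axis=0)), axis=0)               # Waveform Length
    zc = np.sum(np.abs(np.diff(np.sign(win), axis=0)) > 0, axis=0)  # Zero Crossings
    power = np.abs(np.fft.rfft(win, axis=0)) ** 2                   # power spectrum
    freqs = np.fft.rfftfreq(win.shape[0], d=1.0 / fs)
    mnf = (freqs[:, None] * power).sum(axis=0) / power.sum(axis=0)  # Mean Frequency
    cum = np.cumsum(power, axis=0)
    mdf = freqs[np.argmax(cum >= cum[-1] / 2, axis=0)]              # Median Frequency
    return np.concatenate([rms, mav, wl, zc, mnf, mdf])
```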
The second model is a one-dimensional convolutional neural network (1D-CNN), which takes the preprocessed signal (12 channels × 400 samples, i.e. 200 ms at 2 kHz) as input, learns feature representations hierarchically and finally classifies it.
The motivation behind this was to enable a comparison between a feature learning and a feature engineering approach, and to see how much of the inter-subject variance translates to commonly used feature sets. We wanted to investigate how stable learned features are in the face of inter-subject variability and fine-tuning. We chose a one-dimensional CNN over the more common two-dimensional CNN because restricting it to channel-based features makes the results more comparable.
This work focuses on relatively simple models compared to the existing literature (e.g. [pancholi2021robust, ding2018semg, 8641445, 8969418]). This is a deliberate choice to keep inference time and computational cost low. It allows us to run more experiments and keeps the option open to run our models on embedded hardware in the future.
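For concreteness, a compact Keras sketch of such a 1D-CNN is shown below. The actual layer counts, filter sizes and dropout factor were determined by the hyperparameter tuning described in the next subsection (figure 3), so the numbers here are illustrative placeholders.

```python
import tensorflow as tf

def build_raw_cnn(window_len=400, n_channels=12, n_classes=17):
    # Small 1D-CNN operating on raw, preprocessed sEMG windows.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window_len, n_channels)),
        tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
```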
Hyperparameter Tuning
We used the first 10 subjects from DB2 to find the appropriate architecture of the employed deep learning models and to tune the parameters of the optimizer. These subjects were only used for this step and excluded from later experiments.
The following parameters were tuned: number of layers, number of neurons per layer, dropout factor, number of filters, filter size, presence of batch normalization, and batch size, as well as the four parameters of the Adam optimizer.
Hyperparameter tuning was implemented using Keras [chollet2015keras] and Optuna [optuna_2019].
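A minimal sketch of such an Optuna study is shown below; the search ranges and the train_and_validate helper (which would build a model from the sampled parameters, train it on the 10 tuning subjects and return a validation accuracy) are hypothetical stand-ins for the actual pipeline.

```python
import optuna

def objective(trial):
    params = {
        "n_layers":   trial.suggest_int("n_layers", 1, 4),
        "n_units":    trial.suggest_int("n_units", 32, 512, log=True),
        "dropout":    trial.suggest_float("dropout", 0.0, 0.5),
        "batch_norm": trial.suggest_categorical("batch_norm", [True, False]),
        "batch_size": trial.suggest_categorical("batch_size", [128, 256, 512]),
        # The four Adam parameters.
        "lr":         trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "beta_1":     trial.suggest_float("beta_1", 0.1, 0.999),
        "beta_2":     trial.suggest_float("beta_2", 0.1, 0.999),
        "epsilon":    trial.suggest_float("epsilon", 1e-8, 1.0, log=True),
    }
    return train_and_validate(params)  # hypothetical helper

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```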
The final learning parameters can be seen in table 2 and the network architecture of both models is presented in figure 3.

Input | Learning rate | Beta1 | Beta2 | Epsilon | Batch size
---|---|---|---|---|---
Features | 0.001 | 0.4 | 0.1 | 0.38 | 256
Raw | 0.0002 | 0.2 | 0.9 | 0.0001 | 512
Approach
The aim of this paper is to evaluate the effectiveness of the transfer learning approach for calibrating a deep learning model to a new user. To answer this question, we performed several analyses on each of the three databases individually.
For the evaluation, we apply leave-one-subject-out cross-validation. We go through each of the N subjects in a database and individually train an artificial neural network on the remaining N-1 subjects. The task of the model is the classification of the first exercise group of each database, meaning 17 (DB2 and DB3) and 12 (DB4) classes.
The initial training is done on 5 out of 6 repetitions (repetition 3 is used for testing), with 10 percent of the training data used as a validation set for the early stopping criterion. During training, we use a fixed set of hyperparameters, described in table 2.
When retraining the classifier on the new subject's data, a variable number of repetitions is kept unseen for evaluation. The hyperparameters stay the same as during the initial training.
We compared the performance of the pretrained models, before and after fine-tuning on the new subject, with subject-specific models, which are trained only on data from one subject.
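The overall protocol can be summarized by the sketch below; subjects, build_model and fine_tune are hypothetical stand-ins for the actual data handling and training code, and the per-repetition data layout is assumed for illustration.

```python
import numpy as np

def loso_evaluation(subjects, build_model, fine_tune, n_calib_reps=5):
    """Leave-one-subject-out evaluation with transfer learning."""
    accuracies = {}
    for left_out in subjects:
        # Pretrain on the remaining N-1 subjects (5 of 6 repetitions each,
        # with 10% of the data used for early stopping).
        others = [s for s in subjects if s != left_out]
        x_pre = np.concatenate([subjects[s]["x_train"] for s in others])
        y_pre = np.concatenate([subjects[s]["y_train"] for s in others])
        model = build_model()
        model.fit(x_pre, y_pre, validation_split=0.1)

        # Fine-tune on a variable number of repetitions from the left-out
        # subject and evaluate on that subject's held-out repetitions.
        calib = subjects[left_out]
        x_ft = np.concatenate(calib["x_by_rep"][:n_calib_reps])
        y_ft = np.concatenate(calib["y_by_rep"][:n_calib_reps])
        model = fine_tune(model, x_ft, y_ft)
        accuracies[left_out] = model.evaluate(calib["x_test"], calib["y_test"])[1]
    return accuracies
```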
Evaluation
DB2
We report the evaluation performance of the two models with and without transfer learning. In the transfer learning case, we fine-tuned a model, pretrained on every other subject in the dataset as described above, on data from the new subject. As a comparison, we also train a model with random weight initialization only on this new subject's data (subject-specific model). If transfer learning is effective for the task of sEMG classification, these subject-specific models should show a noticeable difference in their learning dynamics compared to the models applying knowledge transfer.
As shown in figure 1, signs of successful knowledge transfer are higher starting accuracy, higher slope/faster learning and/or higher asymptotic performance. We plotted the performance of subject-specific and fine-tuned models on the validation set in Figure 4.
Transferring weights from a pretrained model has a large influence on the training dynamics of both models, the one using features and the one using raw sEMG data as input, as shown in figure 4. The validation accuracy starts at a higher point and the asymptotic accuracy is reached after fewer epochs. The early stopping criterion is also fulfilled earlier in the case of pretrained weights. The pretrained models also reached a slightly higher final validation accuracy than the subject-specific models.

While knowledge transfer seems to be taking place, we also want to evaluate the effect of fine-tuning the pretrained weights. We need to assess the performance on the new subject's data before and after fine-tuning to see in which situations our proposed recalibration method is successful.
Figure 5 shows the percentage improvement due to retraining with a variable amount of training data, evaluated on the test data (all repetitions from the new subject not used for training).
We observe a large improvement over the pretrained model thanks to fine-tuning in situations where we have enough data from the new subject. The percentage improvement drops quickly when the number of repetitions used for fine-tuning is reduced. In the case of the model using raw input, we even notice a decrease in accuracy on the test set, which shows that fine-tuning may lead to overfitting in such situations. In general, the models using features as input appear to be more stable, and their performance degrades less quickly than that of the model using feature learning.
However, both models start to overfit when further reducing the retraining data, so that it covers less than one repetition of each movement. The simple fine-tuning approach does not seem to be appropriate for calibration with only a few selected movements.

The most general approach to fine-tuning deep neural networks is to initialize the network with the pretrained weights and retrain all layers on the new subject's data. Often it can be helpful to freeze parts of the network, keeping their weights unchanged, and use only selected layers for calibration. This is especially the case if we assume part of the model to be less affected by the domain change. Works like [yosinski2014transferable] show that for many common machine learning tasks, lower layers in artificial neural networks learn more general and more easily transferable features.
In the case of sEMG, it seems likely that it is not necessary to retrain every weight to adapt the model to a new user.
Figure 6 shows that neither of the models tested in this paper benefits from retraining all layers compared to retraining only the first or the last layer. In practice this means that the calibration procedure can be accelerated, because we can work with fewer trainable parameters.
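In Keras, restricting calibration to the last layer amounts to freezing all other layers before recompiling, roughly as in the sketch below; the data arrays and optimizer settings are placeholders.

```python
import tensorflow as tf

def fine_tune_last_layer(pretrained_model, x_calib, y_calib):
    model = tf.keras.models.clone_model(pretrained_model)
    model.set_weights(pretrained_model.get_weights())

    # Freeze everything except the final classification layer, which
    # drastically reduces the number of trainable parameters.
    for layer in model.layers[:-1]:
        layer.trainable = False

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_calib, y_calib, validation_split=0.1, epochs=20,
              callbacks=[tf.keras.callbacks.EarlyStopping(
                  patience=3, restore_best_weights=True)])
    return model
```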

Table 3 shows the average classification accuracy on the test repetitions of the new subject with a varying number of repetitions used for retraining. As shown in figure 5, the average test accuracy of the pretrained model increases even if just one repetition is used for fine-tuning. The feature-based, subject-specific model outperforms the pretrained model but is in turn outperformed by the fine-tuned model. The same can be seen with the model using raw sEMG data, with the difference that the subject-specific models overfit in the low-data setting and perform worse than the pretrained model.
Overall, a sharp drop between the validation accuracy (seen in figure 4) and the test accuracy can be observed. This is due to the large intra-subject variance, as the validation set is randomly drawn from the repetitions used for training.
Input | Repetitions used | Subject Specific | Before Retraining | After Retraining
---|---|---|---|---
Features | 1 Repetition | 0.51 | 0.49 | 0.58
 | 2 Repetitions | 0.59 | 0.51 | 0.62
 | 4 Repetitions | 0.65 | 0.50 | 0.67
 | 5 Repetitions | 0.65 | 0.47 | 0.67
raw sEMG | 1 Repetition | 0.21 | 0.56 | 0.58
 | 2 Repetitions | 0.23 | 0.57 | 0.64
 | 4 Repetitions | 0.51 | 0.54 | 0.68
 | 5 Repetitions | 0.64 | 0.49 | 0.68
DB3
The data acquisition setup for database 3 was identical to database 2, but with a smaller number of subjects, all of whom were amputees. We expected to observe higher variability between subjects, which may result in lower overall performance but also in a stronger impact of fine-tuning.
Table 4 shows that the overall performance of the classifiers is indeed degraded. However, the leave-one-subject-out cross-validation shows low variability, as well as a lower percentage improvement between the pretrained and the fine-tuned model.
This might be explained by the amputation having a lower impact on between-subject variability than expected. Other factors such as sex, which are more diverse in database 2, could have a higher influence.
Despite this, figures 7, 8 and 9 show trends similar to those already observed on database 2.
Again, we observe quicker learning and higher asymptotic performance compared to the subject-specific models, and we see no noticeable difference between retraining different layers. The improvement due to fine-tuning drops even faster when reducing the number of repetitions, with the feature-based models being more stable in low-data situations.
Why the models start to overfit earlier is difficult to explain. One possibility is that the pretrained base model is unable to generalize well. This could be caused by the lower number of subjects used during pretraining, or by the hyperparameter values that were kept fixed (taken from database 2).



Input | Repetitions used | Subject Specific | Before Retraining | After Retraining
---|---|---|---|---
Features | 1 Repetition | 0.33 | 0.46 | 0.36
 | 2 Repetitions | 0.41 | 0.47 | 0.42
 | 4 Repetitions | 0.46 | 0.43 | 0.47
 | 5 Repetitions | 0.50 | 0.41 | 0.50
raw sEMG | 1 Repetition | 0.15 | 0.59 | 0.45
 | 2 Repetitions | 0.16 | 0.58 | 0.45
 | 4 Repetitions | 0.48 | 0.54 | 0.51
 | 5 Repetitions | 0.52 | 0.43 | 0.52
DB4
With database 4, we observe similar training curves (figure 10) and effects of the retraining method (figure 12) as in the previous two databases. Here, the subject-specific models again perform worse than the pretrained models.
Looking at the percentage improvement (figure 11) and the overall performance (table 5), we notice a different pattern than before.
On database 4, the feature learning (raw) model does not appear to be less stable than the one using manually extracted features. This is the only case where the model does not overfit using raw sEMG inputs with five repetitions for training.
A possible reason for this behaviour might be the different movement tasks compared to DB2 and DB3. The previous databases used various hand configurations and wrist movements, while DB4 uses finger movements. It seems plausible that this task benefits more strongly from being able to learn features, because the measured activity takes place further away from the sEMG electrode locations and might therefore be less well captured by the feature engineering approach.
However, table 5 also shows that the before-retraining performance varies more strongly over the experimental settings than on the other two databases. This suggests that the intra-subject variability between repetitions might be higher than in those databases. Possible causes are the different movement tasks and recording equipment.
The overall lower accuracy (compared to database 2) might be due to the hyperparameters not being tuned specifically for this database and the lower number of subjects.



Input | Repetitions used | Subject Specific | Before Retraining | After Retraining
---|---|---|---|---
Features | 1 Repetition | 0.39 | 0.55 | 0.45
 | 2 Repetitions | 0.34 | 0.55 | 0.37
 | 4 Repetitions | 0.45 | 0.51 | 0.47
 | 5 Repetitions | 0.45 | 0.46 | 0.46
raw sEMG | 1 Repetition | 0.22 | 0.71 | 0.50
 | 2 Repetitions | 0.29 | 0.69 | 0.57
 | 4 Repetitions | 0.38 | 0.63 | 0.55
 | 5 Repetitions | 0.50 | 0.47 | 0.54
Comparison with other works on NinaPro
Comparing our models with other works on the same database is difficult, due to differences in the chosen exercise group, input features, window size and the unbalanced nature of the NinaPro databases (e.g. in [josephs2020semg] a drop of more than ten percentage points was noticed due to balancing issues in database 5).
However, when comparing the accuracy achieved by models in this work with previously achieved performance (see table 6), our models lie on the lower end of the spectrum. This is not surprising due to our focus on simpler models with fewer tunable parameters. Looking at the only other paper working with the same simple 1D-CNN, our raw model performs comparatively well.
We have no reason to assume that applying larger models would affect the general tendencies we found regarding the effects of transfer learning. Increasing the parameter space would likely only widen the gap between subject-specific models and pretrained models, while a better performing base model might further strengthen the regularization provided by transfer learning. In pre-tests performed using 2D-CNNs, we did not notice any effects of transfer learning that deviate from those presented in this paper.
Citation | Database | Classes | Balanced | Input | Window | Model | Subject-specific | Result
---|---|---|---|---|---|---|---|---
[8641445] | DB2 | 50 | no | sEMG features | 200 ms | Multiview 2D-CNN | yes | 83.7
[pancholi2021robust] | DB2 (10 subjects) | 49 | yes | spectrogram | 150 ms | 2D-CNN | yes? | 73.12
 | | | | MPP+MZP | | | | 89.45
 | DB3 (5 subjects) | | | spectrogram | | | | 66.31
 | | | | MPP+MZP | | | | 81.67
[xiemovement] | DB2 | 18 | no | raw sEMG | 64 ms | 1D-CNN | no? | 52.17
 | | | | | | 1D-CNN+RNN | | 63.74
[10.3389/fnins.2017.00379] | DB2 | 17 | yes | raw sEMG | 200 ms | 2D-CNN | yes? | 82.22
 | | 49 | | | | | | 78.71
[8969418] | DB2 | 17 | yes | raw sEMG | 100 ms | Dilated Causal CNN | no? | 92.5
[ding2018semg] | DB2 | 17 | yes | raw sEMG | 100 ms | 2-Block 2D-CNN | yes? | 83.17
 | | 49 | | | | | | 78.66
The few other papers working on deep transfer learning on NinaPro data can be seen in table 7.
On database 2, our models outperform [9495836] in base classification accuracy in settings where we can retrain on many repetitions, and in percentage improvement in every tested setting. On the other two databases, their models outperformed those tested in this paper, which again might be due to our hyperparameters being tuned on database 2. Their distribution alignment approach has the additional advantage that it only needs unlabeled data for retraining.
The approach in [ketyko2019domain] is quite similar to ours, as it also uses weight initialization. However, they freeze all pretrained layers and add a new 'adaptation layer'. It is not apparent how the two approaches to fine-tuning should differ in principle. Our results are difficult to compare because they use database 1, which differs considerably from the databases used in our work. They report a lower base performance but a higher percentage improvement using 5 repetitions.
Citation | Database | Classes | Balanced | Method | Model | without TL | with TL
---|---|---|---|---|---|---|---
[9495836] | DB2 | 50 | no | distribution alignment + DNM | (TL-)MKCNN | 63.73 | 65.29
 | DB3 | | | | | 76.84 | 82.47
 | DB4 | 53 | | | | 60.10 | 62.33
[s17030458] | DB1 | 52 | yes | AdaBN | ensemble CNN | - | 67.4
[ketyko2019domain] | DB1 | 12 | yes | domain adaptation layer | LSTM | 35.10 | 65.29
Conclusion
The conducted experiments indicate that relying on training user-specific deep learning models for sEMG classification is not advisable. Fine-tuning a pretrained model on a new user's data outperformed the user-specific models in every setting covered by this paper. This is likely caused by the additional regularization provided through weight-initialization based transfer learning. Given the right circumstances, even this relatively simple approach to transfer learning performs comparatively well at improving the classification accuracy on sEMG over subject-specific models.
Being able to reduce the number of trainable parameters by freezing all but the last layer, and stopping after a few epochs of retraining, makes this a feasible calibration method in practical settings.
While it would be desirable to calibrate on just a few data points, reducing the number of repetitions used for retraining can quickly lead to overfitting to the given samples. It should be noted that we did not change the hyperparameters for the fine-tuning stage, so a better set of learning parameters could possibly lead to better performance in low-data situations.
The better performance on the larger database 2 indicates that a better performing base model, pretrained on more subjects, might lead to stronger regularization and more stable behaviour during retraining. It is, however, difficult to know upfront which model could work as a stable basis, as the performance before retraining does not appear to serve as a clear indicator for successful fine-tuning. While the model using raw data and the model using manual features showed similar classification accuracy, the latter showed more stability during fine-tuning and was less prone to overfitting on fewer data points. Following from this, future research could investigate which properties make a model a promising basis for further fine-tuning.
Another open question is the effect of intra-personal variability of the sEMG signal. This work focused on interpersonal variability and model transfer between subjects. However, as we saw especially in database 4, the same model showed very different base performance depending on which repetition of the same person the samples came from. This calls into question whether a one-time calibration on a few data points gathered from a new user can ever be enough to generate a stable model that accounts for the variability of the sEMG signal. Future work should focus on whether inter- or intra-personal variance of sEMG signals has a stronger influence on deep learning based classifiers. These results could help find appropriate calibration schemes for sEMG classifiers in practical settings.
References
Acknowledgements
This work is supported by the Ministry of Economics, Innovation, Digitization and Energy of the State of North Rhine-Westphalia and the European Union, grants GE-2-2-023A (REXO) and IT-2-2-023 (VAFES)
Author contributions statement
Conceptualization: SL MS TG II.
Funding acquisition: II.
Investigation: SL.
Writing – original draft: SL
Writing – review & editing: SL MS TG II
Corresponding author
Correspondence to Stephan Lehmler
Competing interests statement
The authors declare no competing interests.