1 Introduction
Machine Learning (ML) is widely used for several tasks with time-series and biosensor data, such as human activity recognition, electronic health record data-based predictions [Ismail Fawaz et al., 2019], and real-time biosensor-based decisions. Various classification goals are addressed related to electrocardiography (ECG) [Jambukia et al., 2015], electroencephalography (EEG) [Craik et al., 2019, Dose et al., 2018], and electromyography (EMG) [Ketykó et al., 2019, Hu et al., 2018, Patricia et al., 2014, Du et al., 2017].
Sensing hand gestures can be done by means of wearables or by means of image or video analysis of hand or finger motion. A wearable-based detection can physically rely on measuring the acceleration and rotations of body parts (arms, hands or fingers) with Inertial Measurement Unit (IMU) sensors, or on measuring the myoelectric signals generated by the various muscles of the arms or fingers with EMG sensors. Surface EMG (sEMG) records muscle activity from the surface of the skin above the muscle being evaluated. The signal is collected via surface electrodes.
We are interested in sEMG sensor placement on the forearm and in performing hand gesture recognition with ML. In this context, all ML prediction models suffer from inter-session and inter-subject domain shifts (see Figure 1).

Intra-session scenario: the device is not removed, and the training and validation data are recorded together in the same session of the same subject. In this situation the gesture recognition accuracy is generally above .

Inter-session scenario: the device is reattached, and the validation data is recorded separately in a new session of the same subject. Under this domain shift the validation accuracy degrades below .

Inter-subject scenario: the validation data is recorded from another subject. In this case, the validation accuracy degrades below as well.
Our focus is to investigate: 1) the metrics of these domain discrepancies, and 2) the adaptation solutions, with special attention to those that do not rely on source data samples.
This paper is organized as follows. Section 2 provides a summary of ML, model risks, domains, domain divergences, and domain adaptation methods. Then our source data-absent metric and adaptation model are introduced in Section 3. Next, we validate our approaches using publicly available sEMG datasets: the experimental setup and results are described in Section 4. Finally, we conclude and summarize our results.
2 Related Work
2.1 Machine Learning
At the most basic level, ML seeks to develop methods for computers to improve their performance at certain tasks on the basis of observed data. Almost all ML tasks can be formulated as making inferences about missing or latent data from the observed data. To make inferences about unobserved data from the observed data, the learning system needs to make some assumptions; taken together, these assumptions constitute a model. The probabilistic approach to modeling uses probability theory to express all forms of uncertainty. Since any sensible model will be uncertain when predicting unobserved data, uncertainty plays a fundamental part in modeling. The probabilistic approach to modeling is conceptually very simple: probability distributions are used to represent all the uncertain unobserved quantities in a model (including structural, parametric and noise-related) and how they relate to the data. Then the basic rules of probability theory are used to infer the unobserved quantities given the observed data. Learning from data occurs through the transformation of the prior probability distributions (defined before observing the data) into posterior distributions (after observing data) [Ghahramani, 2015].
We define an input space $\mathcal{X}$ which is a subset of the $d$-dimensional real space $\mathbb{R}^d$. We also define a random variable $X$ with probability distribution $p(X)$ which takes values drawn from $\mathcal{X}$. We call the realisations of $X$ feature vectors and denote them $x$.
A generative model describes the marginal distribution over $X$: $p_\theta(X)$, where samples of $X$ are observed at learning time in a dataset $\mathcal{D}$ and the probability distribution depends on some unknown parameter $\theta$. A generative model family which is important for time-series analysis is the autoregressive one. Here, we fix an ordering of the variables, and the distribution of the $i$-th random variable depends on the values of all the preceding random variables in the chosen ordering [Bengio et al., 2015]. By the chain rule of probability, we can factorize the joint distribution over the $d$ dimensions as:
$$p_\theta(x) = \prod_{i=1}^{d} p_\theta(x_i \mid x_1, \dots, x_{i-1}) \tag{1}$$
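As an illustrative sketch (not part of the original work), the chain-rule factorization can be checked numerically on a small discrete example. The joint table over three binary variables is randomly generated and purely hypothetical, and the helper names `conditional` and `chain_rule_probability` are introduced here only for illustration.

```python
import itertools
import random


def conditional(joint, prefix, value):
    """Return p(x_i = value | x_1..x_{i-1} = prefix) by marginalising the joint table."""
    i = len(prefix)
    numerator = sum(p for x, p in joint.items() if x[:i] == prefix and x[i] == value)
    denominator = sum(p for x, p in joint.items() if x[:i] == prefix)
    return numerator / denominator


def chain_rule_probability(joint, x):
    """Multiply the autoregressive conditionals p(x_i | x_1..x_{i-1}) in a fixed ordering."""
    prob = 1.0
    for i in range(len(x)):
        prob *= conditional(joint, x[:i], x[i])
    return prob


# Hypothetical joint distribution over d = 3 binary variables (strictly positive entries).
random.seed(0)
states = list(itertools.product((0, 1), repeat=3))
weights = [random.random() + 0.1 for _ in states]
total = sum(weights)
joint = {state: weight / total for state, weight in zip(states, weights)}

# The product of conditionals should reproduce every entry of the joint table.
max_error = max(abs(chain_rule_probability(joint, s) - joint[s]) for s in states)
```

Here `max_error` should be at the level of floating-point rounding, since the factorization is exact for any ordering of the variables.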
We also define an output space $\mathcal{Y}$ and a random variable $Y$ taking values drawn from $\mathcal{Y}$ with distribution $p(Y)$. In the supervised learning setting, $Y$ is conditioned on $X$ (i.e., $p(Y \mid X)$), so the joint distribution is actually $p(X, Y) = p(Y \mid X)\,p(X)$.
A discriminative model is relaxed to the posterior conditional probability distribution $p(Y \mid X)$, and it directly reflects the discrimination/classification task, with lower asymptotic errors than the generative models. This transductive learning setting has been introduced by Vapnik [Ng and Jordan, 2001].
For all modeling approaches, learning means fitting their distributions to the observed variables in our dataset $\mathcal{D}$. In other words (those of a statistician), a good estimate of the unknown parameter $\theta$ would be the value $\hat{\theta}$ that maximizes the likelihood of getting the data we observed in our dataset $\mathcal{D}$. Formally, the goal of Maximum Likelihood Estimation (MLE) is to find $\hat{\theta}$:
$$\hat{\theta} = \arg\max_{\theta} \sum_{x \in \mathcal{D}} \log p_\theta(x) \tag{2}$$
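A minimal sketch of MLE, assuming a Bernoulli model (our choice for illustration, not a model used in this paper): the closed-form estimate (the sample mean) is compared with a brute-force search over a grid of candidate parameters. The data values are hypothetical.

```python
import math


def bernoulli_log_likelihood(theta, data):
    """Log-likelihood of i.i.d. binary observations under a Bernoulli(theta) model."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)


# Hypothetical binary observations.
data = [1, 0, 1, 1, 0, 1, 1, 0]

# Closed-form maximum likelihood estimate: the fraction of ones.
theta_closed_form = sum(data) / len(data)

# Brute-force maximisation of the log-likelihood over a grid in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda theta: bernoulli_log_likelihood(theta, data))
```

Both estimates should coincide here because the grid contains the sample mean and the Bernoulli log-likelihood has a single maximum.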
Learning in neural networks is about solving optimisation problems. In the case of probabilistic (and differentiable) cost functions, the backpropagation method [Rumelhart et al., 1986] is an estimator for MLE.
2.2 Risk of Machine Learning Models
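We start with the general, probabilistic description, then we move on to the discriminative models.
Let us introduce the concept of a loss function $\ell$. A loss function $\ell(x, \theta)$ measures the loss that a model distribution $p_\theta$ makes on a particular instance $x$. Our goal is to find the model that minimizes the expected loss or risk:
$$R(\theta) = \mathbb{E}_{x \sim p(X)}\left[\ell(x, \theta)\right] \tag{3}$$
Note that the loss function which corresponds to MLE is the log loss $-\log p_\theta(x)$. The risk evaluated on the dataset $\mathcal{D}$ is the in-sample error or empirical risk:
$$\hat{R}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \ell(x, \theta) \tag{4}$$
The generalization gap is the difference $|R(\theta) - \hat{R}(\theta)|$, and our model intends to have a small probability that this difference is larger than a very small value $\epsilon$.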
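In the supervised learning setting, we have a paired dataset of observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. The loss function becomes the (negative) conditional log-likelihood and the risk is described as:
$$R(\theta) = \mathbb{E}_{(x, y) \sim p(X, Y)}\left[-\log p_\theta(y \mid x)\right] \tag{5}$$
Let us introduce the hypothesis space $\mathcal{H}$, which is the set/class of predictors $h: \mathcal{X} \to \mathcal{Y}$. A hypothesis $h$ estimates the target function $f$ from $\mathcal{D}$. The target function is a proxy for the conditional distribution $p(Y \mid X)$, such as $y = f(x) + \varepsilon$, where $\varepsilon$ is the noise term. Substituting $h$ into Equation (5), we get the transductive version of the risk:
$$R(h) = \mathbb{E}_{(x, y) \sim p(X, Y)}\left[\ell(h(x), y)\right] \tag{6}$$
Substituting into Equation (4):
$$\hat{R}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), y_i) \tag{7}$$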
In the transductive setting the generalisation gap can be quantified:
(8) 
2.3 Domain and Domain-Shift Concepts
First, it is necessary to clarify what a domain is and what kinds of domain discrepancies there can be. Several good survey papers describe this field in depth, e.g., [Kouw, 2018], [Kouw and Loog, 2019], and [Csurka, 2017]. In this paper, the domain adaptation-related problem statement and notations follow [Kouw, 2018].
The problem statement is introduced from a classification point of view to simplify the definitions, but it can be generalized to other supervised machine learning tasks. A domain contains three elements: the input space $\mathcal{X}$, the output space $\mathcal{Y}$, and the joint distribution $p(x, y)$ over $\mathcal{X}$ and $\mathcal{Y}$.
Two domains are different if at least one of their above-mentioned components is not equal. In the case of domain adaptation, the input spaces and output spaces of the domains are the same but the distributions are different. More general cases belong to different fields of transfer learning; a detailed taxonomy of transfer learning tasks can be found in [Pan and Yang, 2010]. During domain adaptation there is a machine learning model trained on a so-called source domain (S) and there is an intent to apply it to a target domain (T). From this point onwards, S and T in subscript refer to the source and target domains.
Let us analyze the risk of a source classifier (Equation (6)) on a target domain in the cross-domain setting:
$$R_T(h) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} \ell(h(x), y)\,\frac{p_T(x, y)}{p_S(x, y)}\,p_S(x, y)\,dx \tag{9}$$
It can be seen that the ratio of the target and source joint distributions ($p_T(x, y)/p_S(x, y)$) defines the risk $R_T(h)$. The investigation of this ratio allows us to define domain shift cases [Moreno-Torres et al., 2012]: prior shift, covariate shift and concept shift.
In the case of prior shift, the marginal distributions of the labels differ between the source domain and the target domain, $p_S(y) \neq p_T(y)$, but the conditional distributions are equal, $p_S(x \mid y) = p_T(x \mid y)$. A typical example of prior shift: the symptoms of a disease are usually population independent, but the distribution of the diseases is population dependent. These conditions allow us to simplify the risk:
$$R_T(h) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} \ell(h(x), y)\,\frac{p_T(y)}{p_S(y)}\,p_S(x, y)\,dx \tag{10}$$
This means that a complete labeled dataset from the target domain is not needed; only an estimate of the marginal distribution of the labels on the target domain is needed.
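The prior-shift case can be sketched as an importance-weighted empirical risk: source losses are reweighted by the ratio of target to source label priors. This is only an illustration under the prior-shift assumption; the labels, loss values, and target prior below are hypothetical, and `prior_shift_risk` is a name introduced here.

```python
from collections import Counter


def label_distribution(labels):
    """Empirical marginal distribution of the labels."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def prior_shift_risk(losses, source_labels, target_prior):
    """Estimate the target risk from source losses, weighting each by p_T(y) / p_S(y)."""
    source_prior = label_distribution(source_labels)
    weighted = [
        loss * target_prior[label] / source_prior[label]
        for loss, label in zip(losses, source_labels)
    ]
    return sum(weighted) / len(weighted)


# Hypothetical per-sample losses of a source classifier on labelled source data.
source_labels = ["a", "a", "a", "b"]
losses = [0.1, 0.1, 0.1, 0.9]

# Hypothetical estimate of the label marginal on the target domain.
target_prior = {"a": 0.5, "b": 0.5}

risk_source = sum(losses) / len(losses)
risk_target = prior_shift_risk(losses, source_labels, target_prior)
```

In this toy case the rare, badly classified class "b" is more frequent in the target domain, so the estimated target risk is higher than the source risk even though no target sample was used.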
Covariate shift is a well-studied domain shift; for further reference see [Kouw, 2018]. It is defined as follows: the posterior distributions are equivalent, that is, $p_S(y \mid x) = p_T(y \mid x)$, but the marginal distributions of the samples are different, $p_S(x) \neq p_T(x)$. The typical cause of covariate shift is sample selection bias. Only the sample distributions determine the risk:
$$R_T(h) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} \ell(h(x), y)\,\frac{p_T(x)}{p_S(x)}\,p_S(x, y)\,dx \tag{11}$$
In the case of concept shift, the marginal distributions of the input vectors are similar on both source and target domains, $p_S(x) = p_T(x)$; on the other hand, the posterior distributions differ from each other, $p_S(y \mid x) \neq p_T(y \mid x)$. Usually, a non-stationary environment causes this data drift [Widmer and Kubat, 1996]. It is not possible to simplify the cross-domain risk significantly, and the domain adaptation cannot be done without labeled target data:
$$R_T(h) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} \ell(h(x), y)\,\frac{p_T(y \mid x)}{p_S(y \mid x)}\,p_S(x, y)\,dx \tag{12}$$
In general, none of the above-mentioned assumptions is valid, thus it is not possible to simplify the risk on the target domain. The differing posterior distributions cause the major domain shift-related issues. The optimal transport approach assumes that there is a transport map $t$ that satisfies $p_S(y \mid x) = p_T(y \mid t(x))$ [Courty et al., 2016]. Finding this transportation map is intractable, but it is possible to relax it to a simpler optimization problem, where $t$ is estimated via a Wasserstein distance minimization between the two domains [Courty et al., 2016, Kouw and Loog, 2019].
2.4 Divergence Metrics and Theoretical Bounds
As the input space and output space are common in the case of domain adaptation, distance and divergence metrics between the distributions can measure and quantify the domain discrepancies. We elaborate on the most common metrics in the field of domain adaptation.
The Kullback-Leibler (KL) divergence [Cover and Thomas, 1991] is a well-known information theory-based metric between two distributions. It measures the relative entropy between two distributions. One of its main disadvantages is that it is difficult to calculate from samples in some cases [Ben-David et al., 2010].
$$D_{KL}(p_S \,\|\, p_T) = \int_{\mathcal{X}} p_S(x) \log \frac{p_S(x)}{p_T(x)}\,dx \tag{13}$$
In general, the KL divergence is an asymmetric metric, as $D_{KL}(p_S \,\|\, p_T) \neq D_{KL}(p_T \,\|\, p_S)$. A commonly used symmetric version is the Jensen-Shannon (JS) divergence [Lin, 1991]. It measures the total divergence to the average of the two distributions.
$$D_{JS}(p_S \,\|\, p_T) = \frac{1}{2} D_{KL}(p_S \,\|\, p_M) + \frac{1}{2} D_{KL}(p_T \,\|\, p_M), \qquad p_M = \frac{1}{2}(p_S + p_T) \tag{14}$$
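For discrete distributions, the two divergences can be sketched in a few lines. The example assumes the standard definitions (KL as the expected log-ratio, JS as the average KL divergence to the mixture) and that the second argument is non-zero wherever the first is; the probability vectors are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical source and target distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
```

The two KL values differ, which illustrates the asymmetry, while the two JS values agree and stay below $\log 2$.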
The origin of the Wasserstein distance is the optimal transport problem: a distribution of mass should be transported to another distribution of mass with minimal cost. Usually the Wasserstein-1 distance is used with the Euclidean distance [Arjovsky et al., 2017].
$$W_1(p_S, p_T) = \inf_{\gamma \in \Pi(p_S, p_T)} \mathbb{E}_{(x, x') \sim \gamma}\left[\lVert x - x' \rVert\right] \tag{15}$$
where $\Pi(p_S, p_T)$ is the set of all joint distributions with marginals $p_S$ and $p_T$. This distance metric allows us to construct a continuous and differentiable loss function [Arjovsky et al., 2017]. In the case of domain adaptation, this distance is calculated between the marginal distributions $p_S(x)$ and $p_T(x)$ to get a tractable problem [Kouw, 2018].
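In one dimension and for two samples of equal size, the empirical Wasserstein-1 distance has a simple closed form: the optimal coupling pairs the sorted samples. The sketch below covers only this special case (it is not the general multivariate problem), and the sample values are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """Empirical Wasserstein-1 distance between two equally sized 1-D samples.

    For equal sample sizes the optimal transport plan matches the i-th smallest
    value of one sample with the i-th smallest value of the other.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# A pure shift by one unit costs exactly one unit of transport per sample.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])

# The same values in a different order describe the same empirical distribution.
permuted = wasserstein_1d([3.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```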
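The $\mathcal{H}$-divergence allows us to find an upper bound on the cross-domain risk [Kifer et al., 2004, Ben-David et al., 2006, Ben-David et al., 2010]. The definitions and formulas are provided for binary classification for simplicity, but they can be generalized to multiclass problems as well:
$$d_{\mathcal{H}}(p_S, p_T) = 2 \sup_{h \in \mathcal{H}} \left| \Pr_{x \sim p_S}\left[h(x) = 1\right] - \Pr_{x \sim p_T}\left[h(x) = 1\right] \right| \tag{16}$$
[Ben-David et al., 2006] provide two different techniques to estimate the $\mathcal{H}$-divergence: from finite samples and from the empirical risk of a domain classifier. If the hypothesis space is symmetric, the empirical $\mathcal{H}$-divergence can be calculated from finite samples $\mathcal{D}_S$ and $\mathcal{D}_T$ of the source and target domains:
$$\hat{d}_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \left( 1 - \min_{h \in \mathcal{H}} \left[ \frac{1}{|\mathcal{D}_S|} \sum_{x \in \mathcal{D}_S} I[h(x) = 0] + \frac{1}{|\mathcal{D}_T|} \sum_{x \in \mathcal{D}_T} I[h(x) = 1] \right] \right) \tag{17}$$
where $I[a]$ is an indicator function which gives 1 if predicate $a$ is correct, otherwise 0. For the computation of $\hat{d}_{\mathcal{H}}$ during the minimization, the whole hypothesis space must be tackled. [Ben-David et al., 2006] introduced an approximation to the empirical $\mathcal{H}$-divergence, which is called the Proxy A-Distance (PAD):
$$\hat{d}_{\mathcal{A}} = 2\,(1 - 2\epsilon) \tag{18}$$
where $\epsilon$ is the empirical risk of a linear domain classifier, which is trained (in a supervised fashion) to distinguish the source and target domains.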
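A small sketch of the Proxy A-Distance, assuming the commonly used form $2(1 - 2\epsilon)$ with $\epsilon$ the error rate of a domain classifier. The domain predictions are hypothetical; training the domain classifier itself is outside the scope of the example.

```python
def domain_classifier_error(predicted_domains, true_domains):
    """Fraction of samples whose domain (0 = source, 1 = target) is predicted wrongly."""
    wrong = sum(p != t for p, t in zip(predicted_domains, true_domains))
    return wrong / len(true_domains)


def proxy_a_distance(error):
    """Proxy A-Distance computed from the error rate of a domain classifier."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must be a rate in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)


# Hypothetical held-out predictions of a domain classifier.
true_domains = [0, 0, 1, 1]
predicted_domains = [0, 1, 1, 1]

error = domain_classifier_error(predicted_domains, true_domains)
pad = proxy_a_distance(error)
```

An error of 0.5 (domains indistinguishable) gives a distance of 0, while a perfect domain classifier gives the maximum value of 2.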
The cross-domain risk can be estimated by the empirical $\mathcal{H}$-divergence:
(19) 
where the additional terms are a complexity measure of the hypothesis space and the risk of the so-called single good hypothesis. The single good hypothesis $h^*$ is the best classifier that can generalize on both domains:
$$h^* = \arg\min_{h \in \mathcal{H}} \left[ R_S(h) + R_T(h) \right] \tag{20}$$
Minimizing the $\mathcal{H}$-divergence gives better results; however, the risk of the single good hypothesis can ruin the performance of the domain adaptation. In other words, if there is no single good hypothesis, the domains are too far from each other to build an efficient domain adaptation.
2.5 Domain Adaptation Techniques
All the adaptive learning strategies focus on identifying how to leverage the information coming from both the source and target domains. Incorporating exclusively the target domain information is disadvantageous because sometimes there is no labelled target data at all, or typically the amount of labelled target data is small. Building on information present in the source domain and adapting it to the target is generally expected to be the superior solution [Patricia et al., 2014].
We split the methods according to source-sample availability at domain adaptation (DA) time. We discuss separately the methods that assume source sample availability and those that do not, first generally, and later in the context of sEMG-based gesture recognition.
2.5.1 Source Data-based
The majority of the approaches incorporate the unlabeled source data samples at DA time. Cycle Generative Adversarial Network (CycleGAN) [Zhu et al., 2017] is a state-of-the-art deep generative model which is able to learn and implicitly represent the source and target distributions to pull them close together in an adversarial fashion. It is composed of two Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] and learns two mappings (source-to-target and target-to-source) to achieve cycle-consistency between the source and target distributions ($p_S$ and $p_T$) via the minimax game.
Besides the GANs, the auto-associative Autoencoder (AE) models are capable of building domain-invariant representations in their latent space. The nonlinear Denoising AE (DAE) [Glorot et al., 2011] builds a strong representation of the input distribution by learning to denoise the input (augmented with noise or corruptions). As a side effect, the multi-domain input ends up with a domain-invariant latent representation in the model. Inspired by the DAE, a linear counterpart, the Marginalized Denoising AE (mDA) [Chen et al., 2012], has been proposed to keep the optimization convex with a closed-form solution and achieve orders-of-magnitude faster computation (at the expense of limiting the representation power to linear).
Data augmentation with marginalized corruptions has also been studied for the transductive learning setting [Ng and Jordan, 2001]: the Marginalized Corrupted Features (MCF) classifier [van der Maaten et al., 2014] has strong performance when validated under domain shift. In particular, the corrupting distribution may be used to shift the data distribution in the source domain towards the data distribution in the target domain, potentially by learning the parameters of the corrupting distribution using maximum likelihood.
In the transductive learning setting a classifier can be explicitly guided to learn a domain-invariant representation of the conditional distribution of class labels among two domains. The Domain-Adversarial Neural Network (DANN) [Ganin et al., 2016] adversarially connects a binary domain classifier into the neural network, directly exploiting the idea exhibited by Equation (19).
The binary domain classifier of the DANN [Ganin et al., 2016] and the mDA [Chen et al., 2012] have been paired in [Clinchant et al., 2016] to get domain-adaptation regularization for the linear mDA model. Hence, the mDA has been explicitly guided to develop a latent representation space which is domain-invariant. Linear classifiers built in that latent space have had comparable performance results in several image classification tasks.
The 2-Stage Weighting framework for Multi-Source Domain Adaptation (2SW-MDA) and the Geodesic Flow Kernel (GFK) methods in [Patricia et al., 2014] tackle inter-subject DA for sEMG-based gesture recognizers. In 2SW-MDA, all the data of each source subject are weighted and combined with the target subject samples with a linear supervised method; for GFK, the source and target data are embedded in a low-dimensional manifold (with PCA) and the geodesic flow is used to reduce the domain shift when evaluating the cross-domain sample similarity.
2.5.2 Source Data-absent
The overwhelming majority of existing domain adaptation methods make an assumption of freely available source domain data. Equal access to both source and target data makes it possible to measure the discrepancy between their distributions and to build representations common to both target and source domains. In reality, such a simplifying assumption rarely holds, since source data are routinely subject to legal and contractual constraints between data owners and data customers [Chidlovskii et al., 2016].
Despite the absence of available source samples it is still possible to rely on: 1) statistical information of the source retrieved in advance, 2) model(s) trained on the source data.
CORrelation ALignment (CORAL) [Sun et al., 2016] minimizes domain shift by aligning the second-order statistics of the source and target distributions, without requiring any target labels. In contrast to subspace manifold methods (e.g., [Fernando et al., 2013]), it aligns the original feature distributions of the source and target domains, rather than the bases of lower-dimensional subspaces. CORAL performs a linear whitening transformation on the source data, then a linear coloring transformation (based on the second-order statistics of the target data). If the statistical parameters of the source data are retrieved in advance of the DA, then it can be considered a source data-absent method.
Adaptive Batch Normalization (AdaBN) (which is an approximation of the whitening transformation, usually applied in deep neural networks) is utilised for DA in [Du et al., 2017] for sEMG-based gesture classification. Furthermore, it builds upon a deep Convolutional Neural Network (CNN) architecture to extract spatial information from the high-density sEMG sensor input. However, it does not model the possible temporal information in the time-series data. Apart from that, it has state-of-the-art unsupervised DA performance, which has been validated under inter-session and inter-subject domain shifts on several datasets.
[Fernando et al., 2013] introduces (linear) subspace alignment between the source and target domains with PCA. In the common linear subspace, classifiers can be trained with comparable performance. The alignments (i.e., the PCA transformations) are learned on the source and target data, respectively. If the source alignment is learned (as a model of the source) in advance of the DA, then it can be considered a source data-absent method.
[Farshchian et al., 2019] introduces the Adversarial Domain Adaptation Network (ADAN) with an AE (trained on the source data) for Brain-Machine Interfaces (BMIs). With the representation power of the AE it is possible to capture the source distribution and then continuously align the shifting target distributions back to it. ADAN is trained in an adversarial fashion with an Energy-based GAN architecture [Zhao et al., 2019] where the "energy" is the reconstruction loss of the AE, and the domain shifts are represented as the residual-loss distributions of the AE. ADAN learns via the minimax game to pull the target residual distributions to those of the source.
In the transductive learning setting [Ng and Jordan, 2001] there are several source data-absent DA approaches building on the pre-trained source classifier(s).
The Transductive Domain Adaptation (TDA) in [Chidlovskii et al., 2016] utilizes the representation capabilities of the mDA [Chen et al., 2012] to (linearly) adapt the output of a trained source classifier to the target domain. TDA performs unsupervised DA in closed form without the presence of any extra source information.
The transductive Multi-Adapt and the Multi-Kernel Adaptive Learning (MKAL) in [Patricia et al., 2014] both tackle the inter-subject DA for sEMG-based gesture recognizers by the adaptation of trained source classifiers. In Multi-Adapt, an SVM is learned from each source and used as a reference (resulting from a convex optimization) when performing supervised learning on the target dataset. In MKAL, each SVM source classifier predicts on the target samples and the scores are used as extra input features for the learning of the gesture classifier on the target dataset. Multi-Adapt and MKAL had comparable performance at that time, even though these models do not capture the available temporal information in the time-series data.
[Dose et al., 2018] builds a BMI and investigates DA for multivariate EEG time-series data classification. The time-series classification of multivariate EEG signals is a very similar challenge to that of multivariate sEMG signals. [Dose et al., 2018] captures both the spatial and temporal correlations in the data with a CNN architecture. However, the DA is a supervised fine-tuning of all the model parameters on the target subject (as in [Donahue et al., 2014]), which is suboptimal as highlighted by [Du et al., 2017, Ketykó et al., 2019].
The 2-Stage Recurrent Neural Network (2SRNN) model for sEMG gesture recognition and DA in [Ketykó et al., 2019] can be viewed as the deep neural, autoregressive modeling analogy of the MKAL [Patricia et al., 2014]. It utilizes a trained source classifier and performs supervised DA to the target (session or subject) via learning a linear transformation between the domains. The transformation is then applied to the input (samples coming from the target). Learning is on the divergent (inter-session or inter-subject) domains via the backpropagation [Rumelhart et al., 1986] of the classifier's cross-entropy loss to its DA layer (which is a linear readout layer of the input). The size of its DA layer is less than of the overall 2SRNN (in terms of the trainable parameters), so it achieves fast computation of the DA, and the 2SRNN has state-of-the-art performance in inter-session and inter-subject domain shift validations.
3 Our Divergence Metric and Adaptation Method
We provide a sequential, source data-absent, transductive, probability-based divergence metric as well as a DA method. First, we introduce the RNN architecture for temporal modeling, then the source data-absent and transductive 2SRNN model in detail.
3.1 Recurrent Neural Network
A Recurrent Neural Network (RNN) [Jordan, 1986] is an autoregressive neural network architecture in which there are feedback loops in the system. Feedback loops allow processing the previous output with the current input, thus making the network stateful, i.e., influenced by the earlier inputs in each step (see Figure 2). A hidden layer that has feedback loops is also called a recurrent layer. The mathematical representation of a simple recurrent layer can be seen in Equation (21):
$$h_t = \phi\left(W_x x_t + W_h h_{t-1} + b\right) \tag{21}$$
The hidden state $h_t$ depends on the input $x_t$ and the previous hidden state $h_{t-1}$. There is a nonlinear dependency (via the $\phi$ wrapper) between them.
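The recurrence of a simple recurrent layer can be sketched in plain Python as below. The choice of `tanh` as the nonlinearity, the weight names (`w_x`, `w_h`, `b`), and the toy values are assumptions made for illustration only.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a simple recurrent layer: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for j in range(len(b)):
        pre_activation = b[j]
        pre_activation += sum(w_x[j][k] * x_t[k] for k in range(len(x_t)))
        pre_activation += sum(w_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre_activation))
    return h_t


def run_rnn(sequence, h_0, w_x, w_h, b):
    """Feed a sequence through the layer and return the final hidden state."""
    h = h_0
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
    return h


# Toy layer with one input feature and one hidden unit (hypothetical weights).
w_x, w_h, b = [[1.0]], [[0.5]], [0.0]

# The same inputs in a different order lead to different final states,
# because the hidden state carries information about earlier inputs.
h_early = run_rnn([[1.0], [0.0]], [0.0], w_x, w_h, b)
h_late = run_rnn([[0.0], [1.0]], [0.0], w_x, w_h, b)
```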
However, regular RNNs suffer from the vanishing or exploding gradient problems, which means that the gradient of the loss function decays/rises exponentially with time, making it difficult to learn long-term temporal dependencies in the input data [Pascanu et al., 2013]. Long Short-Term Memory (LSTM) recurrent cells have been proposed to solve these problems [Hochreiter and Schmidhuber, 1997].
$$\begin{aligned} f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\ i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\ o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\ \tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \tag{22}$$
LSTM units contain a set of (learnable) gates that are used to control the stages when information enters the cell (input gate: $i_t$), when it is output (output gate: $o_t$) and when it is forgotten (forget gate: $f_t$), as seen in Equation (22). This architecture allows the neural network to learn longer-term dependencies because it also learns how to incorporate an additional information channel $c_t$. In Figure 3, yellow rectangles represent neural network layers, circles are pointwise operations and arrows denote the flow of data. Merging lines denote concatenation (the notation $[h_{t-1}, x_t]$ in Equation (22)), while a forking line denotes its content being copied and the copies going to different locations.
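To make the role of the gates concrete, the following sketch implements one step of a scalar LSTM cell (one input feature, one hidden unit) in the commonly used formulation with input, forget, and output gates. The parameter layout (`params` as a dictionary of weight pairs and biases) and the numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name ("f", "i", "o", "c") to ((w_h, w_x), bias), where the
    weight pair is applied to the concatenation [h_prev, x_t].
    """
    concatenated = (h_prev, x_t)

    def affine(name):
        weights, bias = params[name]
        return weights[0] * concatenated[0] + weights[1] * concatenated[1] + bias

    forget_gate = sigmoid(affine("f"))
    input_gate = sigmoid(affine("i"))
    output_gate = sigmoid(affine("o"))
    candidate = math.tanh(affine("c"))

    c_t = forget_gate * c_prev + input_gate * candidate  # update the cell state
    h_t = output_gate * math.tanh(c_t)                   # expose part of it as output
    return h_t, c_t


# Hypothetical parameters: forget gate saturated open, input gate saturated closed.
params = {
    "f": ((0.0, 0.0), 20.0),
    "i": ((0.0, 0.0), -20.0),
    "o": ((0.0, 0.0), 20.0),
    "c": ((0.0, 0.0), 0.0),
}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, params=params)
```

With these gate settings the cell state is carried over almost unchanged, which is the mechanism that lets the cell preserve information over long time spans.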
For autoregressive modeling of time-series data, RNN with LSTM cells is widely adopted [Hu et al., 2018, Ketykó et al., 2019].
3.2 2-Stage Recurrent Neural Network-based Domain Divergence Metric
Similarly to ADAN [Farshchian et al., 2019], we build a source data-absent, probability-based divergence metric on the validation loss of the source model to measure domain shifts. In ADAN, the distribution of the residual loss (of the AE) is incorporated to express the divergence of the target distributions from that of the source. However, we follow the transductive learning setting and directly take the (cross-entropy) loss of the source classifier (exhibiting Equation (5)). Our source classifier is a sequential model (i.e., built on the autoregressive RNN architecture to have temporal modeling capabilities). For this task, we utilise the sequence classifier of the 2SRNN architecture [Ketykó et al., 2019].
The sequence classifier of 2SRNN (visualised as block 2 in Figure 4) is a deep stacked RNN with the many-to-one setup, followed by a $G$-way fully-connected layer ($G$ is the number of gestures to be recognized) and a softmax transformation at the output. The sequence classifier directly models the conditional distribution $p(y \mid x)$, where $y$ belongs to a categorical distribution with $G$ (gesture) classes. Learning is via the categorical cross-entropy loss (of the ground truth $y$ and the predicted $\hat{y}$):
$$L_{CCE}(y, \hat{y}) = -\sum_{c=1}^{G} \mathbb{1}_{[y = c]} \log \hat{y}_c \tag{23}$$
where $\mathbb{1}_{[y = c]}$ is the indicator function of whether class label $c$ is the correct classification for the given observation and $\hat{y}_c$ is the predicted probability that the observation is of class $c$.
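The categorical cross-entropy of a single observation can be sketched as follows, assuming the classifier outputs raw scores (logits) that are turned into probabilities by a softmax. The logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    largest = max(logits)
    exponentials = [math.exp(z - largest) for z in logits]
    total = sum(exponentials)
    return [e / total for e in exponentials]


def categorical_cross_entropy(true_class, probabilities):
    """Only the term of the correct class survives the indicator in the sum."""
    return -math.log(probabilities[true_class])


# An uninformative classifier over 8 gestures has a loss of log(8) per observation.
uniform_loss = categorical_cross_entropy(3, softmax([0.0] * 8))

# A confident and correct prediction gives a loss close to zero.
confident_loss = categorical_cross_entropy(0, softmax([10.0, 0.0, 0.0]))
```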
For the divergence measure of distributions (between the source and target domains), we take the categorical crossentropy losses of the sequence classifier in the following way: the classifier is trained on the source distribution then evaluated on a target one. Hence, the resulting expresses the domain shift in the loss space of the two domains. The crossentropy between and :
(24) 
The expresses the empirical by the model. A valid source classifier is expected to model the source with the entropy of , so in fact the cross-entropy captures the actual Kullback-Leibler divergence between and (Equation (13)).
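Furthermore, let $m_S$ and $m_T$ be the corresponding means of the loss distributions $\mathcal{L}_S$ and $\mathcal{L}_T$ (the $L_{CCE}$ values of the source classifier on the source and target domains). We measure the dissimilarity between these two distributions by a lower bound to the Wasserstein distance (Equation (15)), provided by the absolute value of the difference between the means [Berthelot et al., 2017]:
$$W_1(\mathcal{L}_S, \mathcal{L}_T) \geq |m_S - m_T| \tag{25}$$
The difference of the empirical means of the $L_{CCE}$ values approximates Equation (25).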
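A minimal sketch of the mean-based divergence measure: given the probabilities that a source-trained classifier assigns to the correct class on source and target validation samples, the per-sample cross-entropy losses are computed and the absolute difference of their means is reported. All probability values are hypothetical and `loss_mean_gap` is a name introduced here.

```python
import math


def per_sample_losses(true_class_probabilities):
    """Cross-entropy loss of each sample, given the probability of its correct class."""
    return [-math.log(p) for p in true_class_probabilities]


def mean(values):
    return sum(values) / len(values)


def loss_mean_gap(source_losses, target_losses):
    """Absolute difference between the mean losses on the two domains."""
    return abs(mean(source_losses) - mean(target_losses))


# Hypothetical outputs of a source classifier: confident on the source domain,
# much less so on a shifted target domain.
source_losses = per_sample_losses([0.9, 0.8, 0.95])
target_losses = per_sample_losses([0.3, 0.5, 0.2])

gap = loss_mean_gap(source_losses, target_losses)
```

A gap near zero suggests that the classifier sees the two domains similarly in loss space, whereas a large gap indicates a pronounced shift.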
3.3 2-Stage Recurrent Neural Network-based Domain Adaptation
We build a source data-absent DA driven by the probability distribution of $L_{CCE}$. [Ketykó et al., 2019] implements a linear version (L2SRNN); we extend it to a deep, nonlinear one and name it the Deep 2SRNN (D2SRNN). Generally, the DA is applied to the input of the sequence classifier at each time step (visualised as block 1 in Figure 4). L2SRNN learns the weights of a linear transformation:
$$x'_t = M x_t + b \tag{26}$$
D2SRNN learns the weights of chained nonlinear transformations:
(27) 
Figure 5 presents the two consecutive stages of the DA process:

1) The DA component initially is the identity transformation, and its weights are frozen. The sequence classifier is trained from scratch on the labelled source dataset.

2) The weights of the sequence classifier are frozen and the DA component's weights are trained on a minor subset of the labelled target dataset: $L_{CCE}$ is backpropagated [Rumelhart et al., 1986] to the DA component during the process. Hence, the divergence in Equation (24), or as expressed via the lower bound in Equation (25), gets minimized.
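The two stages can be illustrated with a deliberately simplified toy model. This is a sketch under strong assumptions and not the 2SRNN implementation: the sequence classifier is replaced by a one-dimensional logistic classifier, the DA component is a scalar linear map `m * x + b` initialised to the identity, and the domain shift is a hypothetical sign flip of the input.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def forward(x, da, clf):
    """DA layer (m, b) on the input, followed by the classifier (w, c)."""
    m, b = da
    w, c = clf
    return sigmoid(w * (m * x + b) + c)


def gradients(data, da, clf):
    """Mean cross-entropy gradients w.r.t. the DA layer and the classifier."""
    m, b = da
    w, c = clf
    g_m = g_b = g_w = g_c = 0.0
    for x, y in data:
        err = forward(x, da, clf) - y  # gradient of the loss w.r.t. the logit
        g_m += err * w * x
        g_b += err * w
        g_w += err * (m * x + b)
        g_c += err
    n = len(data)
    return (g_m / n, g_b / n), (g_w / n, g_c / n)


def train(data, da, clf, train_da, steps=200, lr=0.5):
    """Gradient descent that updates either the DA layer or the classifier, never both."""
    for _ in range(steps):
        g_da, g_clf = gradients(data, da, clf)
        if train_da:
            da = (da[0] - lr * g_da[0], da[1] - lr * g_da[1])
        else:
            clf = (clf[0] - lr * g_clf[0], clf[1] - lr * g_clf[1])
    return da, clf


def accuracy(data, da, clf):
    return sum((forward(x, da, clf) >= 0.5) == (y == 1) for x, y in data) / len(data)


source = [(-1.2, 0), (-1.0, 0), (-0.8, 0), (0.8, 1), (1.0, 1), (1.2, 1)]
target = [(-x, y) for x, y in source]  # hypothetical shift: sign-flipped inputs
identity = (1.0, 0.0)

# Stage 1: DA layer frozen at identity, classifier trained on the source data.
_, classifier = train(source, identity, (0.0, 0.0), train_da=False)
acc_before = accuracy(target, identity, classifier)

# Stage 2: classifier frozen, DA layer trained on labelled target data.
da_layer, _ = train(target, identity, classifier, train_da=True)
acc_after = accuracy(target, da_layer, classifier)
```

In this toy setting the frozen classifier fails on the shifted target until the DA layer learns a negative slope that maps target inputs back into the source input space; the classifier weights are never touched in the second stage.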
4 Results
We perform experiments to validate our divergence metric and DA for sEMG-based gesture recognition in the inter-session and inter-subject scenarios. We follow the exact same hyperparametrization and network implementations as in [Ketykó et al., 2019]. The parameters in Equations (26) and (27) are of size $f \times f$ (weights) and $f$ (biases), where $f$ is equal to the size of the input features (number of sEMG channels). The nonlinearity in Equation (27) is the Rectified Linear Unit (ReLU) [Nair and Hinton, 2010].
For the sequence classifier we use a 2-stack RNN with LSTM cells. Each LSTM cell has 512 hidden units and a dropout with a probability of 0.5. The RNN is followed by a $G$-way fully-connected layer with 512 units (dropout with a probability of 0.5) and a softmax classifier. Adam [Kingma and Ba, 2014] with a learning rate of 0.001 is used for the stochastic gradient descent optimization. The size of the DA component in both the linear and deep cases is less than of the total trainable network parameters. The gesture recognition accuracy is calculated as given below:
$$\text{Accuracy} = \frac{\text{number of correct classifications}}{\text{total number of classifications}} \times 100\% \tag{28}$$
We investigate the inter-session and inter-subject divergences and validate our DA method on public sparse and dense sEMG datasets. We follow the experiment setups of previous works for comparability. Since we do sequential modeling in all experiments, we decompose the sEMG signals into small sequences using the sliding window strategy with an overlapped windowing scheme. The sequence length must be shorter than 300 ms to satisfy real-time usage constraints. To compare our current experiments with previous works, we follow the segmentation strategy in previous studies.
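The overlapped sliding-window segmentation can be sketched as below. The window length and stride are hypothetical values chosen only to respect the 300 ms constraint at a 1 kHz sampling rate; they are not the settings of the experiments.

```python
def sliding_windows(frames, window_length, stride):
    """Split a recording into overlapping windows of consecutive frames."""
    if window_length <= 0 or stride <= 0:
        raise ValueError("window_length and stride must be positive")
    last_start = len(frames) - window_length
    return [frames[start:start + window_length] for start in range(0, last_start + 1, stride)]


# Hypothetical settings: 1 kHz sampling, 150-frame (150 ms) windows, 70-frame stride.
sampling_rate_hz = 1000
window_length = 150
stride = 70

# A stand-in for one trial of 1000 frames; each frame would hold one value per channel.
trial = list(range(1000))

windows = sliding_windows(trial, window_length, stride)
window_duration_ms = 1000.0 * window_length / sampling_rate_hz
```

Consecutive windows overlap whenever the stride is smaller than the window length, and frames at the end of a trial that do not fill a complete window are dropped.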
The dense-electrode sEMG CapgMyo dataset has been thoroughly analysed by [Du et al., 2017, Hu et al., 2018, Ketykó et al., 2019], as has the sparse-electrode sEMG NinaPro dataset by [Patricia et al., 2014, Du et al., 2017, Ketykó et al., 2019].
The CapgMyo dataset [Du et al., 2017] includes HD-sEMG data for 128 electrode channels. The sampling rate is 1 kHz:

DB-b: 8 isometric, isotonic hand gestures from 10 subjects in two recording sessions on different days.

DB-c: 12 basic movements of the fingers obtained from 10 subjects.
We downloaded the pre-processed version from http://zjucapg.org/myo/data to work with the exact same data as [Du et al., 2017, Ketykó et al., 2019] for fair comparison. In that version, the power-line interference was removed from the sEMG signals by using a band-stop filter (45–55 Hz, second-order Butterworth). Only the static part of the movements was kept (for each trial, the middle one-second window, 1000 frames of data), to ensure that no transition movements are included. We rescaled the data to have zero mean and unit variance, then we rectified it and applied smoothing (as low-pass filtering).
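The rescaling, rectification, and smoothing steps can be sketched for a single channel as follows. The moving average is only an assumed stand-in for the low-pass smoothing (the exact filter is not specified here), and the raw values and window width are hypothetical.

```python
import math


def standardize(signal):
    """Rescale a channel to zero mean and unit variance (assumes a non-constant signal)."""
    mean = sum(signal) / len(signal)
    variance = sum((v - mean) ** 2 for v in signal) / len(signal)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in signal]


def rectify(signal):
    """Full-wave rectification: keep the magnitude of each sample."""
    return [abs(v) for v in signal]


def moving_average(signal, width):
    """Causal moving average; the first samples use the shorter available history."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(signal):
        running_sum += value
        if i >= width:
            running_sum -= signal[i - width]
        smoothed.append(running_sum / min(i + 1, width))
    return smoothed


# Hypothetical raw samples of one sEMG channel.
raw = [2.0, 4.0, 6.0, 8.0]

standardized = standardize(raw)
envelope = moving_average(rectify(standardized), width=3)
```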
The NinaPro DB1 dataset [Patricia et al., 2014] contains sparse 10-channel sEMG recordings:

Gesture numbers 1–12: 12 basic movements of the fingers (flexions and extensions). These are equivalent to the gestures in CapgMyo DB-c.
The data is recorded at a sampling rate of 100 Hz, using 10 sparsely located electrodes placed on the subjects' upper forearms. The sEMG signals were rectified and smoothed by the acquisition device. We downloaded the version from http://zjucapg.org/myo/data/ninaprodb1.zip to use the exact same data as [Du et al., 2017, Ketykó et al., 2019] for fair comparison. For each trial, we used the middle 1.5-second window, 180 frames of data, to get the static part of the gestures.
4.1 Divergence Metric Validation
We validate the proposed domain divergence metric in Section 3.2 on the CapgMyo DBb dataset which covers both the intersession and intersubject scenarios.
The divergence results are shown in Figure 6 and Figure 7. In both figures, the empirical distributions of the cross-entropy losses are illustrated by their histograms and mean values.
Figure 6 presents the intersession and intersubject divergences before DA. The values in red represent the means of the loss distributions: the mean loss is low in the intra-session case, whereas the intersession and intersubject means (along with their histograms) show high domain shifts. The intra-session result also shows the power of the sequence classifier: its mean loss is close to the theoretical lower bound of the cross-entropy.
Figure 7 presents the intersession and intersubject divergences after DA. The intra-session statistics represent the source distribution, towards which the divergent distributions are to be adapted. The histograms and corresponding mean values in gray represent the validation loss after L2SRNN DA; those in green represent the validation loss after D2SRNN DA. In all cases, the post-DA distributions appear to be close to that of the source, which is in line with the improved recognition accuracy results in Section 4.2.
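The quantities plotted in the two figures (per-sequence cross-entropy losses, their histograms, and their means) can be computed as in the sketch below. It does not reproduce the divergence defined in Section 3.2; it only builds the empirical loss distributions that such a divergence would compare. The classifier outputs are simulated by a hypothetical `toy_probs` helper, and the bin edges are arbitrary.

```python
import numpy as np

def per_sample_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy of each sample given class probabilities and integer labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(labels.shape[0]), labels]
    return -np.log(np.clip(picked, eps, 1.0))

def normalized_histogram(losses, bin_edges):
    """Empirical distribution of the losses over shared bin edges."""
    counts, _ = np.histogram(losses, bins=bin_edges)
    return counts / max(counts.sum(), 1)

def toy_probs(rng, labels, n_classes, low, high):
    """Hypothetical classifier output: true-class probability drawn from [low, high]."""
    conf = rng.uniform(low, high, size=labels.shape[0])
    probs = np.repeat(((1.0 - conf) / (n_classes - 1))[:, None], n_classes, axis=1)
    probs[np.arange(labels.shape[0]), labels] = conf
    return probs

rng = np.random.default_rng(0)
labels = rng.integers(0, 8, size=1000)  # 8 gesture classes, as in DBb

# Simulated intra-session (confident) and shifted (less confident) predictions.
intra_loss = per_sample_cross_entropy(toy_probs(rng, labels, 8, 0.90, 0.99), labels)
shifted_loss = per_sample_cross_entropy(toy_probs(rng, labels, 8, 0.20, 0.60), labels)

bin_edges = np.linspace(0.0, 3.0, 31)
intra_hist = normalized_histogram(intra_loss, bin_edges)
shifted_hist = normalized_histogram(shifted_loss, bin_edges)
print(f"mean loss: intra={intra_loss.mean():.3f}, shifted={shifted_loss.mean():.3f}")
```

Only target inputs, target labels for validation, and the pretrained source classifier are needed for this computation, which is why the approach remains usable when the source data samples are unavailable.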
4.2 Domain Adaptation Validation
For comparison purposes, we take the exact same pretrained source classifiers from [Ketykó et al., 2019] and perform D2SRNN DA (described in Section 3.3). The D2SRNN DA is evaluated in exactly the same way as the L2SRNN and the AdaBN [Du et al., 2017] approaches. Furthermore, the comparison to the MKAL [Patricia et al., 2014] is also exactly the same as in [Du et al., 2017, Ketykó et al., 2019].
Table 1 presents the intersession recognition accuracy results on the dense CapgMyo DBb dataset. The L2SRNN and D2SRNN share the exact same pretrained source classifier models. The D2SRNN DA reaches 85.9% post-DA accuracy, which is 2.1 percentage points better than the L2SRNN (83.8%).
Table 2 shows the intersubject recognition accuracy results on the dense CapgMyo DBb & DBc and the sparse NinaPro DB1 datasets. The L2SRNN and D2SRNN share the exact same pretrained source classifier models. Relative to the pre-DA accuracy, the D2SRNN DA achieves an improvement of 39.4 percentage points on the DBb, 54.4 percentage points on the DBc, and 37.7 percentage points on the DB1. The advantage of the deep over the linear solution is small on the dense datasets (92.0% vs. 89.9% on DBb, 89.2% vs. 85.4% on DBc) and larger on the sparse one (72.8% vs. 65.2% on DB1), which suggests that nonlinear adaptation brings a higher gain in a sparse-electrode situation.
Table 1: Intersession recognition accuracy on CapgMyo DBb.
Method                         pre-DA   post-DA
AdaBN [Du et al., 2017]        47.9%    63.3%
L2SRNN [Ketykó et al., 2019]   54.6%    83.8%
D2SRNN                         54.6%    85.9%
Table 2: Intersubject recognition accuracy on CapgMyo DBb, DBc and NinaPro DB1.
                               pre-DA                  post-DA
Method                         DBb     DBc     DB1     DBb     DBc     DB1
AdaBN [Du et al., 2017]        39.0%   26.3%   -       55.3%   35.1%   -
MKAL [Patricia et al., 2014]   -       -       30%     -       -       55%
L2SRNN [Ketykó et al., 2019]   52.6%   34.8%   35.1%   89.9%   85.4%   65.2%
D2SRNN                         52.6%   34.8%   35.1%   92.0%   89.2%   72.8%
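The gains implied by the two tables can be recomputed with a few lines of arithmetic. The accuracies below are copied from Tables 1 and 2; the dictionary keys and the helper name `gains` are only illustrative.

```python
# Accuracies in percent, copied from Table 1 (intersession) and Table 2 (intersubject).
pre_da = {"DBb intersession": 54.6, "DBb": 52.6, "DBc": 34.8, "DB1": 35.1}
post_l2s = {"DBb intersession": 83.8, "DBb": 89.9, "DBc": 85.4, "DB1": 65.2}
post_d2s = {"DBb intersession": 85.9, "DBb": 92.0, "DBc": 89.2, "DB1": 72.8}

def gains(before, after):
    """Difference in percentage points, rounded to one decimal."""
    return {name: round(after[name] - before[name], 1) for name in before}

l2s_gain = gains(pre_da, post_l2s)        # L2SRNN improvement over pre-DA
d2s_gain = gains(pre_da, post_d2s)        # D2SRNN improvement over pre-DA
d2s_over_l2s = gains(post_l2s, post_d2s)  # extra gain of the deep variant

for name in pre_da:
    print(f"{name}: L2SRNN +{l2s_gain[name]} pp, "
          f"D2SRNN +{d2s_gain[name]} pp, deep vs. linear +{d2s_over_l2s[name]} pp")
```

The last column of the printout makes the sparse-versus-dense contrast explicit: the extra gain of the deep variant is largest on the sparse NinaPro DB1 data.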
5 Conclusions
We showed that the divergence between the empirical distributions of the cross-entropy losses of a source classifier, trained on the source distribution and evaluated on the target one, is a valid measure of the domain shift between source and target. It works in the absence of source data, and a domain adaptation method built on minimizing that divergence is an effective solution in the transductive learning setting. Furthermore, we pointed out that this metric and the corresponding adaptation method are applicable to investigating and improving sEMG-based gesture recognition performance in intersession and intersubject scenarios under severe domain shifts. The proposed deep/nonlinear transformation component enhances the performance of the 2SRNN architecture, especially in a sparse sEMG setting.
The code is available at https://github.com/ketyi/Deep2SRNN.
References
 [Arjovsky et al., 2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia. PMLR.
 [Ben-David et al., 2010] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1):151–175.
 [Ben-David et al., 2006] Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2006). Analysis of representations for domain adaptation. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, pages 137–144, Cambridge, MA, USA. MIT Press.
 [Bengio et al., 2015] Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1171–1179. Curran Associates, Inc.
 [Berthelot et al., 2017] Berthelot, D., Schumm, T., and Metz, L. (2017). BEGAN: Boundary Equilibrium Generative Adversarial Networks. ArXiv, abs/1703.10717.

 [Chen et al., 2012] Chen, M., Xu, Z., Weinberger, K. Q., and Sha, F. (2012). Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 1627–1634, USA. Omnipress.
 [Chidlovskii et al., 2016] Chidlovskii, B., Clinchant, S., and Csurka, G. (2016). Domain adaptation in the absence of source domain data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 451–460, New York, NY, USA. ACM.
 [Clinchant et al., 2016] Clinchant, S., Csurka, G., and Chidlovskii, B. (2016). A domain adaptation regularization for denoising autoencoders. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 26–31, Berlin, Germany. Association for Computational Linguistics.
 [Courty et al., 2016] Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865.
 [Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York, NY, USA.
 [Craik et al., 2019] Craik, A., He, Y., and Contreras-Vidal, J. L. (2019). Deep learning for electroencephalogram (EEG) classification tasks: a review. Journal of Neural Engineering, 16(3):031001.
 [Csurka, 2017] Csurka, G. (2017). Domain adaptation for visual applications: A comprehensive survey. CoRR, abs/1702.05374.
 [Donahue et al., 2014] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 647–655, Beijing, China. PMLR.
 [Dose et al., 2018] Dose, H., Møller, J. S., Iversen, H. K., and Puthusserypady, S. (2018). An end-to-end deep learning approach to MI-EEG signal classification for BCIs. Expert Systems with Applications, 114:532–542.
 [Du et al., 2017] Du, Y., Jin, W., Wei, W., Hu, Y., and Geng, W. (2017). Surface EMG-based inter-session gesture recognition enhanced by deep domain adaptation. Sensors (Basel, Switzerland), 17(3):458. PMID: 28245586.
 [Farshchian et al., 2019] Farshchian, A., Gallego, J. A., Cohen, J. P., Bengio, Y., Miller, L. E., and Solla, S. A. (2019). Adversarial domain adaptation for stable brain-machine interfaces. In International Conference on Learning Representations.
 [Fernando et al., 2013] Fernando, B., Habrard, A., Sebban, M., and Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In ICCV.
 [Ganin et al., 2016] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030.
 [Ghahramani, 2015] Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459.
 [Glorot et al., 2011] Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In Getoor, L. and Scheffer, T., editors, ICML, pages 513–520. Omnipress.
 [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.
 [Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Comput., 9(8):1735–1780.
 [Hu et al., 2018] Hu, Y., Wong, Y., Wei, W., Du, Y., Kankanhalli, M., and Geng, W. (2018). A novel attention-based hybrid CNN-RNN architecture for sEMG-based gesture recognition. PLOS ONE, 13(10):1–18.
 [Ismail Fawaz et al., 2019] Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A. (2019). Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4):917–963.
 [Jambukia et al., 2015] Jambukia, S. H., Dabhi, V. K., and Prajapati, H. B. (2015). Classification of ECG signals using machine learning techniques: A survey. In 2015 International Conference on Advances in Computer Engineering and Applications, pages 714–721.
 [Jordan, 1986] Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 531–546. Hillsdale, NJ: Erlbaum.
 [Ketykó et al., 2019] Ketykó, I., Kovács, F., and Varga, K. Z. (2019). Domain adaptation for sEMG-based gesture recognition with Recurrent Neural Networks. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7.
 [Kifer et al., 2004] Kifer, D., Ben-David, S., and Gehrke, J. (2004). Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB ’04, pages 180–191. VLDB Endowment.
 [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
 [Kouw, 2018] Kouw, W. M. (2018). An introduction to domain adaptation and transfer learning. CoRR, abs/1812.11806.
 [Kouw and Loog, 2019] Kouw, W. M. and Loog, M. (2019). A review of single-source unsupervised domain adaptation. CoRR, abs/1901.05335.
 [Lin, 1991] Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
 [Moreno-Torres et al., 2012] Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. (2012). A unifying view on dataset shift in classification. Pattern Recogn., 45(1):521–530.

 [Nair and Hinton, 2010] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 807–814, USA. Omnipress.
 [Ng and Jordan, 2001] Ng, A. Y. and Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 841–848, Cambridge, MA, USA. MIT Press.
 [Olah, 2015] Olah, C. (2015). Understanding LSTM Networks.
 [Pan and Yang, 2010] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
 [Pascanu et al., 2013] Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1310–III–1318. JMLR.org.
 [Patricia et al., 2014] Patricia, N., Tommasi, T., and Caputo, B. (2014). Multi-source Adaptive Learning for Fast Control of Prosthetics Hand. In Proceedings of the 22nd International Conference on Pattern Recognition, pages 2769–2774. IEEE.
 [Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, Cambridge, MA, USA.
 [Sun et al., 2016] Sun, B., Feng, J., and Saenko, K. (2016). Return of frustratingly easy domain adaptation. In AAAI.
 [van der Maaten et al., 2014] van der Maaten, L., Chen, M., Tyree, S., and Weinberger, K. Q. (2014). Marginalizing corrupted features. CoRR, abs/1402.7001.
 [Widmer and Kubat, 1996] Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Mach. Learn., 23(1):69–101.
 [Zhao et al., 2019] Zhao, J., Mathieu, M., and LeCun, Y. (2019). Energy-based generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, 24–26 April 2017.
 [Zhu et al., 2017] Zhu, J., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251.