I Introduction
A brain-computer interface (BCI) provides a direct communication pathway between a user's brain and a computer [1, 2]. The electroencephalogram (EEG), a multi-channel time series, is the most frequently used BCI input signal. There are three common paradigms in EEG-based BCIs: motor imagery (MI) [3], event-related potentials (ERPs) [4], and steady-state visual evoked potentials [2]. The first two are the focus of this paper.
In MI tasks, the user imagines movements of his/her body parts, which modulates the brain rhythms in the involved cortical areas. In ERP tasks, the user is presented with many non-target stimuli and a few target stimuli; a characteristic ERP pattern appears in the EEG response after the user perceives a target stimulus. EEG-based BCI systems have been widely used to help people with disabilities, and also the able-bodied [1].
A standard EEG signal analysis pipeline consists of temporal (band-pass) filtering, spatial filtering, and classification [5]. Spatial filters such as common spatial patterns (CSP) [6] are widely used to enhance the signal-to-noise ratio. Recently, there has been a trend to utilize the covariance matrices of EEG trials, which are symmetric positive definite (SPD) and can be viewed as points on a Riemannian manifold, in EEG signal analysis [7, 8, 9]. For MI tasks, the discriminative information is mainly spatial, and can be directly encoded in the covariance matrices. On the contrary, the main discriminative information of ERP trials is temporal. A novel approach was proposed in [10] to augment each EEG trial by the mean of all target trials that contain the ERP, and then compute the covariance matrices of the augmented trials. However, Riemannian space based approaches are computationally expensive, and not compatible with Euclidean space machine learning approaches.
A major challenge in BCIs is that different users have different neural responses to the same stimulus, and even the same user can have different neural responses to the same stimulus at different times/locations. Besides, when calibrating the BCI system, acquiring a large number of subject-specific labeled training examples for each new subject is time-consuming and expensive. Transfer learning [11, 12, 13, 14, 15], which uses data/information from one or more source domains to help the learning in a target domain, can be used to address these problems. Some representative applications of transfer learning in BCIs can be found in [16, 17, 18, 19, 20, 21]. Many researchers [19, 20, 21] attempted to seek a set of subject-invariant CSP filters to increase the signal-to-noise ratio. Another pipeline is Riemannian geometry based. Zanini et al. [22] proposed a Riemannian alignment (RA) framework to align the EEG covariance matrices from different subjects. He and Wu [23] extended RA to Euclidean alignment (EA) in the Euclidean space, so that any Euclidean space classifier can be used after it.
To utilize the excellent properties of the Riemannian geometry and avoid its high computational cost, as well as to leverage knowledge learned from the source subjects, this paper proposes a manifold embedded knowledge transfer (MEKT) framework, which first aligns the covariance matrices of the EEG trials in the Riemannian manifold, then performs domain adaptation in the tangent space by minimizing the joint probability distribution shift between the source and the target domains, while preserving their geometric structures, as illustrated in Fig. 1. Additionally, we propose a domain transferability estimation (DTE) approach to select the most beneficial subjects in multisource transfer learning. Experiments on four datasets from two different BCI paradigms (MI and ERP) verified the effectiveness of MEKT and DTE.
The remainder of this paper is organized as follows: Section II introduces related work on spatial filters, Riemannian geometry, tangent space mapping, RA, EA, and subspace adaptation. Section III describes the details of the proposed MEKT and DTE. Section IV presents experiments to compare the performance of MEKT with several stateoftheart data alignment and transfer learning approaches. Finally, Section V draws conclusions.
II Related Work
This section introduces background knowledge on spatial filters, Riemannian geometry, tangent space mapping, RA, EA, and subspace adaptation, which will be used in the next section.
II-A Spatial Filters
Spatial filtering can be viewed as a data-driven dimensionality reduction approach that promotes the variance difference between two conditions [24]. It is common in MI-based BCIs to use CSP filters [25] to simultaneously diagonalize the two intra-class covariance matrices.

Consider a binary classification problem. Let $(X_i, y_i)$ be the $i$th labeled training example, where $X_i \in \mathbb{R}^{c \times t}$, in which $c$ is the number of EEG channels, and $t$ the number of time domain samples. For Class $k$ ($k = 0, 1$), CSP finds a spatial filter matrix $W_k^* \in \mathbb{R}^{c \times f}$, where $f$ is the number of spatial filters, to maximize the variance difference between Class $k$ and Class $1-k$:

$W_k^* = \arg\max_{W_k} \dfrac{\operatorname{tr}\left(W_k^T \bar{C}_k W_k\right)}{\operatorname{tr}\left(W_k^T \bar{C}_{1-k} W_k\right)}$  (1)

where $\bar{C}_k$ is the mean covariance matrix of all EEG trials in Class $k$, and $\operatorname{tr}(\cdot)$ is the trace of a matrix. The solution $W_k^*$ is the concatenation of the $f$ leading eigenvectors of the matrix $\bar{C}_{1-k}^{-1} \bar{C}_k$. Finally, we concatenate the spatial filters from both classes to obtain the complete CSP filters:

$W^* = \left[W_0^*,\ W_1^*\right] \in \mathbb{R}^{c \times 2f}$  (2)

and compute the spatially filtered $X_i'$ by:

$X_i' = (W^*)^T X_i$  (3)

The log-variances of the filtered trial can be extracted:

$x_i = \log\left(\dfrac{\operatorname{diag}\left(X_i' (X_i')^T\right)}{\operatorname{tr}\left(X_i' (X_i')^T\right)}\right)$  (4)

and used as input features in classification.
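The CSP pipeline in (1)-(4) can be sketched in a few lines of NumPy/SciPy. This is an illustrative implementation, not the authors' code; the function names and the trace-normalization of the covariance estimates are our own choices, and the generalized eigenproblem of $(\bar{C}_0, \bar{C}_0 + \bar{C}_1)$ is used as a numerically friendlier equivalent of the eigendecomposition of $\bar{C}_{1-k}^{-1}\bar{C}_k$:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_0, trials_1, f=3):
    """CSP filters (1)-(2) for two classes of EEG trials (c channels x t samples)."""
    def mean_cov(trials):
        covs = [X @ X.T / np.trace(X @ X.T) for X in trials]
        return np.mean(covs, axis=0)
    C0, C1 = mean_cov(trials_0), mean_cov(trials_1)
    # Generalized eigendecomposition of (C0, C0 + C1): leading eigenvectors
    # maximize the class-0 variance ratio, trailing ones the class-1 ratio.
    w, V = eigh(C0, C0 + C1)
    order = np.argsort(w)[::-1]
    return np.hstack([V[:, order[:f]], V[:, order[-f:]]])  # (2): both classes

def log_variance_features(X, W):
    """Spatial filtering (3) followed by log-variance feature extraction (4)."""
    Xf = W.T @ X
    var = np.var(Xf, axis=1)
    return np.log(var / var.sum())
```

With six filters ($f = 3$ per class), each trial reduces to a six-dimensional feature vector, which matches the number of CSP variances used as Euclidean space features in the experiments.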
II-B Riemannian Geometry
All SPD matrices form a differentiable Riemannian manifold, and Riemannian geometry is used to manipulate them. Some basic definitions are provided below.

The Riemannian distance between two SPD matrices $C_1$ and $C_2$ is:

$\delta(C_1, C_2) = \left\| \log\left(C_1^{-1} C_2\right) \right\|_F = \left[ \sum_{i=1}^{c} \log^2 \lambda_i \right]^{1/2}$  (5)

where $\|\cdot\|_F$ is the Frobenius norm, and $\{\lambda_i\}_{i=1}^{c}$ denotes the eigenvalues of $C_1^{-1} C_2$.

The Riemannian mean of $n$ SPD matrices $\{C_i\}_{i=1}^{n}$ minimizes the sum of the squared Riemannian distances:

$\bar{C}_R = \arg\min_{C} \sum_{i=1}^{n} \delta^2(C, C_i)$  (6)

The Euclidean mean is:

$\bar{C}_E = \dfrac{1}{n} \sum_{i=1}^{n} C_i$  (7)

The Log-Euclidean mean, a computationally cheaper approximation of the Riemannian mean, is:

$\bar{C}_L = \exp\left(\dfrac{1}{n} \sum_{i=1}^{n} \log(C_i)\right)$  (8)
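These quantities are straightforward to compute with NumPy/SciPy. The sketch below (helper names are our own, assuming well-conditioned SPD inputs) evaluates the distance (5) through the generalized eigenvalues of $(C_2, C_1)$, and the Log-Euclidean mean through eigendecompositions:

```python
import numpy as np
from scipy.linalg import eigvalsh

def _eig_fun(C, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(fun(w)) @ V.T

def riemannian_distance(C1, C2):
    """Riemannian distance (5): 2-norm of the log-eigenvalues of C1^{-1} C2."""
    lam = eigvalsh(C2, C1)  # generalized eigenvalues of (C2, C1)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def euclidean_mean(covs):
    """Euclidean (arithmetic) mean (7)."""
    return np.mean(covs, axis=0)

def log_euclidean_mean(covs):
    """Log-Euclidean mean: matrix exponential of the mean of matrix logarithms."""
    return _eig_fun(np.mean([_eig_fun(C, np.log) for C in covs], axis=0), np.exp)
```

The Riemannian mean (6) has no closed form; it is usually computed by an iterative gradient-style procedure initialized at the Euclidean mean, which is why the Log-Euclidean mean is an attractive cheaper surrogate.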
II-C Tangent Space Mapping
Tangent space mapping, also known as the logarithmic mapping, maps a Riemannian space SPD matrix $C_i$ to a Euclidean tangent space vector $s_i$ around an SPD matrix $C$, which is usually the Riemannian or Euclidean mean:

$s_i = \operatorname{upper}\left( \log\left( C^{-1/2} C_i C^{-1/2} \right) \right)$  (9)

where $\operatorname{upper}(\cdot)$ takes the upper triangular part of an SPD matrix and forms a vector $s_i \in \mathbb{R}^{c(c+1)/2}$, and $C$ is a reference matrix. To obtain a tangent space locally homomorphic to the manifold, a weight of $\sqrt{2}$ on the off-diagonal elements of the vectorized matrix is needed [24].
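A minimal NumPy sketch of the mapping (9), with the conventional $\sqrt{2}$ weighting of the off-diagonal elements (function names are ours):

```python
import numpy as np

def _eig_fun(C, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(fun(w)) @ V.T

def tangent_vector(Ci, Cref):
    """Logarithmic map (9) of the SPD matrix Ci at reference Cref. upper() keeps
    the upper triangular part, scaling off-diagonal entries by sqrt(2) so that
    the Euclidean norm of the vector matches the Riemannian metric at Cref."""
    R = _eig_fun(Cref, lambda w: w ** -0.5)   # Cref^{-1/2}
    S = _eig_fun(R @ Ci @ R, np.log)          # log(Cref^{-1/2} Ci Cref^{-1/2})
    iu = np.triu_indices(S.shape[0])
    weights = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return weights * S[iu]
```

For $c$ channels the vector has $c(c+1)/2$ entries, e.g. 253 for a 22-channel montage, and the norm of the vector equals the Riemannian distance from the reference to the mapped matrix.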
Congruent transform and congruence invariance [27] are two important properties in the Riemannian space:

$\bar{\mathfrak{C}}\left(W^T C_1 W, \ldots, W^T C_n W\right) = W^T\, \bar{\mathfrak{C}}\left(C_1, \ldots, C_n\right) W$  (10)

$\delta\left(P C_1 P,\ P C_2 P\right) = \delta\left(C_1, C_2\right)$  (11)

where $\bar{\mathfrak{C}}(\cdot)$ is the Euclidean or Riemannian mean operation, $W$ is a nonsingular square matrix, and $P$ is an invertible symmetric matrix. (11) suggests that the Riemannian distance between two SPD matrices does not change, if both are left and right multiplied by an invertible symmetric matrix.
II-D Riemannian Alignment (RA)
RA [22] first computes the covariance matrices of some resting (or non-target) trials, in which the subject is not performing any task (or not performing the target task), and then the Riemannian mean $\bar{R}$ of these matrices, which is used as the reference matrix to reduce the inter-session or inter-subject variations, by the following transformation:

$\widetilde{C}_i = \bar{R}^{-1/2} C_i \bar{R}^{-1/2}$  (12)

where $C_i$ is the covariance matrix of the $i$th trial, and $\widetilde{C}_i$ the corresponding aligned covariance matrix. Then, all $\widetilde{C}_i$ can be classified by a minimum distance to mean (MDM) classifier [7].
II-E Euclidean Alignment (EA)
Although RA-MDM has demonstrated promising performance, it still has some limitations [23]: 1) it processes the covariance matrices in the Riemannian space, whereas there are very few Riemannian space classifiers; 2) it computes the reference matrix from the non-target stimuli in ERP-based BCIs, which requires some labeled trials from the new subject.
EA [23] extends RA and solves the above problems by transforming each EEG trial $X_i$ in the Euclidean space:

$\widetilde{X}_i = \bar{C}_E^{-1/2} X_i$  (13)

where $\bar{C}_E$ is the Euclidean mean of the covariance matrices of all EEG trials, computed by (7).

However, EA only considers the marginal probability distribution shift, and works best when the number of EEG channels is small. When there are a large number of channels, computing $\bar{C}_E^{-1/2}$ may be numerically unstable.
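EA amounts to whitening each subject's trials by the inverse square root of that subject's mean covariance matrix. A sketch, assuming NumPy (the function name is ours):

```python
import numpy as np

def euclidean_alignment(trials):
    """EA (13): transform every trial X_i to Cbar^{-1/2} X_i, where Cbar is the
    arithmetic mean (7) of this subject's trial covariance matrices."""
    Cbar = np.mean([X @ X.T for X in trials], axis=0)
    w, V = np.linalg.eigh(Cbar)
    R = V @ np.diag(w ** -0.5) @ V.T          # Cbar^{-1/2}
    return [R @ X for X in trials]
```

After EA, the arithmetic mean covariance of each subject's aligned trials is exactly the identity matrix, so trials from different subjects become directly comparable; the near-zero eigenvalues of $\bar{C}_E$ that arise in high-channel settings are what make this inversion numerically delicate.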
II-F Subspace Adaptation
Tangent space vectors usually have very high dimensionality, so they cannot be used easily in transfer learning. An intuitive approach is to align them in a lower dimensional subspace. Pan et al. [11] proposed transfer component analysis (TCA) to learn the transferable components across domains in a reproducing kernel Hilbert space using maximum mean discrepancy (MMD) [28]. Joint distribution adaptation (JDA) [14] improves TCA by considering the conditional distribution shift using pseudo label refinement. Joint geometrical and statistical alignment (JGSA) [15] further improves JDA by adding two regularization terms, which minimize the within-class scatter matrix and maximize the between-class scatter matrix.

III Manifold Embedded Knowledge Transfer (MEKT)
This section proposes the MEKT approach. Its goal is to use one or multiple source subjects' data to help the target subject, given that they have the same feature space and label space. For ease of illustration, we focus on a single source domain first.
Assume the source domain has $n_s$ labeled instances $\{(X_s^i, y_s^i)\}_{i=1}^{n_s}$, where $X_s^i \in \mathbb{R}^{c \times t}$ is the $i$th feature matrix, and $y_s^i$ is the corresponding one-hot label vector, in which $c$, $t$ and $L$ denote the number of channels, time domain samples, and classes, respectively. Let $Y_s \in \mathbb{R}^{n_s \times L}$ be the matrix of the one-hot label vectors. Assume also the target domain has $n_t$ unlabeled feature matrices $\{X_t^j\}_{j=1}^{n_t}$, where $X_t^j \in \mathbb{R}^{c \times t}$.
MEKT consists of the following three steps:

Covariance matrix centroid alignment (CA): Align the centroids of the covariance matrices of $\{X_s^i\}$ and $\{X_t^j\}$, so that their marginal probability distributions are close.

Tangent space feature extraction: Map the aligned covariance matrices to tangent space feature matrices $Z_s \in \mathbb{R}^{d \times n_s}$ and $Z_t \in \mathbb{R}^{d \times n_t}$, where $d$ is the dimensionality of the tangent space features.

Mapping matrices identification: Find projection matrices $A, B \in \mathbb{R}^{d \times p}$, where $p$ is the dimensionality of a shared subspace, such that the distributions of $A^T Z_s$ and $B^T Z_t$ are close.
After MEKT, a Euclidean space classifier can be trained on $(A^T Z_s, Y_s)$ and applied to $B^T Z_t$.
Next, we describe the details of the above three steps.
III-A Covariance Matrix Centroid Alignment (CA)
CA serves as a preprocessing step to reduce the marginal probability distribution shift of different domains, and enables transfer from multiple source domains.
Let $C_s^i$ be the $i$th covariance matrix in the source domain, and $\bar{C}_s$ their mean, where $\bar{C}_s$ can be the Riemannian mean in (6), the Euclidean mean in (7), or the Log-Euclidean mean in (8). Then, we align the covariance matrices by

$\widetilde{C}_s^i = \bar{C}_s^{-1/2} C_s^i \bar{C}_s^{-1/2}$  (14)

Similarly, we can obtain the aligned covariance matrices $\widetilde{C}_t^j$ of the target domain.
CA has two desirable properties:

Marginal probability distribution shift minimization. From the properties of congruent transform and congruence invariance, we have

$\bar{\mathfrak{C}}\left(\widetilde{C}^1, \ldots, \widetilde{C}^n\right) = \bar{C}^{-1/2}\, \bar{\mathfrak{C}}\left(C^1, \ldots, C^n\right) \bar{C}^{-1/2} = I$  (15)

i.e., if we choose $\bar{C}$ as the Riemannian (or Euclidean) mean, then different domains' geometric (or arithmetic) centers all equal the identity matrix. Therefore, the marginal distributions of the source and the target domains are brought closer on the manifold.

EEG trial whitening. In the following, we show that each aligned covariance matrix is approximately an identity matrix after CA.

If we write the columns of the symmetric matrix $\bar{C}^{-1/2}$ as $\bar{C}^{-1/2} = [v_1, v_2, \ldots, v_c]$, then the $(j, k)$th element of $\widetilde{C}^i$ is:

$\widetilde{C}^i_{(j,k)} = v_j^T C^i v_k$  (16)

From (15) we have

$\bar{\mathfrak{C}}\left(\widetilde{C}^1, \ldots, \widetilde{C}^n\right) = I$  (17)

The above equation holds no matter whether $\bar{C}$ is the Riemannian mean, or the Euclidean mean.

For CA using the Euclidean mean, the average of the $j$th diagonal element of $\widetilde{C}^i$ is

$\dfrac{1}{n} \sum_{i=1}^{n} \widetilde{C}^i_{(j,j)} = \dfrac{1}{n} \sum_{i=1}^{n} v_j^T C^i v_j = 1$  (18)

Meanwhile, for each diagonal element, we have $\widetilde{C}^i_{(j,j)} = v_j^T C^i v_j \ge 0$, therefore the diagonal elements of $\widetilde{C}^i$ are around 1. Similarly, the off-diagonal elements of $\widetilde{C}^i$ are around 0. Thus, $\widetilde{C}^i$ is approximately an identity matrix, i.e., the aligned EEG trials are approximately whitened.

CA with the Riemannian mean is an iterative process initialized by the Euclidean mean. CA with the Log-Euclidean mean is an approximation of CA with the Riemannian mean, with reduced computational cost [9]. So, (18) also holds approximately for these two means.
This whitening effect will also be experimentally demonstrated in Section IV-E.
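The whitening property is easy to verify numerically. The sketch below performs CA with the Euclidean mean as the reference (the function name is ours, assuming NumPy); for this choice, the arithmetic mean of the aligned covariance matrices is exactly the identity, so the diagonal elements average to 1, as in (17) and (18):

```python
import numpy as np

def centroid_align(covs):
    """CA (14) with the Euclidean mean reference: Cbar^{-1/2} C_i Cbar^{-1/2}."""
    Cbar = np.mean(covs, axis=0)
    w, V = np.linalg.eigh(Cbar)
    R = V @ np.diag(w ** -0.5) @ V.T
    return [R @ C @ R for C in covs]
```

Applying `centroid_align` separately to each source subject and to the target subject pulls every domain's centroid to the identity matrix, which is what makes multi-source transfer possible after CA.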
III-B Tangent Space Feature Extraction
After covariance matrix CA, we map each aligned covariance matrix to a tangent space feature vector in $\mathbb{R}^{c(c+1)/2}$:

$z_s^i = \operatorname{upper}\left(\log\left(\widetilde{C}_s^i\right)\right)$  (19)

$z_t^j = \operatorname{upper}\left(\log\left(\widetilde{C}_t^j\right)\right)$  (20)

Note that this is different from the original tangent space mapping in (9), in that (9) uses the same reference matrix for all subjects, whereas our approach uses a subject-specific reference matrix (the subject's own mean covariance matrix) for each different subject.

Next, we form new feature matrices $Z_s = [z_s^1, \ldots, z_s^{n_s}]$ and $Z_t = [z_t^1, \ldots, z_t^{n_t}]$.
III-C Mapping Matrices Identification
CA does not reduce the conditional probability distribution discrepancies. We next find projection matrices $A, B \in \mathbb{R}^{d \times p}$, which map $Z_s$ and $Z_t$ to lower dimensional matrices $A^T Z_s$ and $B^T Z_t$, with the following desirable properties:

Joint probability distribution shift minimization. In traditional domain adaptation [11, 14], MMD is frequently used to reduce the marginal and conditional probability distribution discrepancies between the source and the target domains, i.e.,

$\min_{A,B}\ \left\| \dfrac{1}{n_s} \sum_{i=1}^{n_s} A^T z_s^i - \dfrac{1}{n_t} \sum_{j=1}^{n_t} B^T z_t^j \right\|_2^2 + \sum_{l=1}^{L} \left\| \dfrac{1}{n_s^l} \sum_{z_s^i \in \mathcal{Z}_s^l} A^T z_s^i - \dfrac{1}{n_t^l} \sum_{z_t^j \in \mathcal{Z}_t^l} B^T z_t^j \right\|_2^2$  (21)

where $\mathcal{Z}_s^l$ and $\mathcal{Z}_t^l$ are the tangent space vectors in the $l$th ($l = 1, \ldots, L$) class of the source domain and the target domain, respectively, and $n_s^l$ and $n_t^l$ are the numbers of examples in the $l$th class of the source domain and the target domain, respectively.

Next, we propose a new measure, joint probability MMD, to quantify the probability distribution shift between the source and the target domains, by considering the joint probability directly, instead of the marginal and the conditional probabilities separately.

Let the source domain one-hot label matrix be $Y_s \in \mathbb{R}^{n_s \times L}$, and the predicted target domain one-hot label matrix be $\hat{Y}_t \in \mathbb{R}^{n_t \times L}$. Then, the joint probability MMD between the source and the target domains is:

$\left\| \dfrac{1}{n_s} A^T Z_s Y_s - \dfrac{1}{n_t} B^T Z_t \hat{Y}_t \right\|_F^2 = \operatorname{tr}\left(W^T Z M Z^T W\right)$  (22)

where $W = \begin{bmatrix} A \\ B \end{bmatrix}$, $Z = \begin{bmatrix} Z_s & 0 \\ 0 & Z_t \end{bmatrix}$, and

$M = \begin{bmatrix} \dfrac{1}{n_s^2} Y_s Y_s^T & -\dfrac{1}{n_s n_t} Y_s \hat{Y}_t^T \\ -\dfrac{1}{n_s n_t} \hat{Y}_t Y_s^T & \dfrac{1}{n_t^2} \hat{Y}_t \hat{Y}_t^T \end{bmatrix}$  (23)

The joint probability MMD is based on the joint probability rather than the conditional probability, which in theory can handle more probability distribution shifts.
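The joint probability MMD can be written compactly with one-hot label matrices: the $l$th column of $\frac{1}{n} Z Y$ is the class-$l$ mean weighted by the empirical class prior $n^l/n$, i.e., an estimate of $P(y = l)\,E[z \mid y = l]$. The sketch below (our own function names, assuming NumPy) confirms that the compact form equals the class-wise sum:

```python
import numpy as np

def joint_mmd_matrix(Zs, Ys, Zt, Yt_hat, A, B):
    """Joint probability MMD: || A^T Zs Ys / ns - B^T Zt Yt_hat / nt ||_F^2."""
    ns, nt = Ys.shape[0], Yt_hat.shape[0]
    D = (A.T @ Zs @ Ys) / ns - (B.T @ Zt @ Yt_hat) / nt
    return float(np.sum(D ** 2))

def joint_mmd_classwise(Zs, ys, Zt, yt_hat, A, B, L):
    """Equivalent class-wise form: class means weighted by class priors,
    i.e., the joint rather than the conditional distribution is matched."""
    ns, nt = len(ys), len(yt_hat)
    total = 0.0
    for l in range(L):
        ms = (A.T @ Zs[:, ys == l]).sum(axis=1) / ns  # prior * class mean
        mt = (B.T @ Zt[:, yt_hat == l]).sum(axis=1) / nt
        total += float(np.sum((ms - mt) ** 2))
    return total
```

Because each class term carries its prior, a class that is rare in one domain but common in the other is penalized, which the purely conditional term in traditional MMD ignores.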

Source domain discriminability. During subspace mapping, the discriminating ability of the source domain can be preserved by:

$\min_{A}\ \operatorname{tr}\left(A^T S_w A\right) \quad \text{s.t.} \quad A^T S_b A = I$  (24)

where $S_w = \sum_{l=1}^{L} \sum_{z_s^i \in \mathcal{Z}_s^l} (z_s^i - \mu_l)(z_s^i - \mu_l)^T$ is the within-class scatter matrix, and $S_b = \sum_{l=1}^{L} n_s^l (\mu_l - \mu)(\mu_l - \mu)^T$ is the between-class scatter matrix, in which $\mu_l$ is the mean of the samples from Class $l$, and $\mu$ is the mean of all samples.

Target domain locality preservation. We also introduce a graph-based regularization term to preserve the local structure of the target domain. Under the manifold assumption [29], if two samples $z_t^i$ and $z_t^j$ are close in the original target domain, then they should also be close in the projected subspace.

Let $G$ be a similarity matrix:

$G_{ij} = \begin{cases} 1, & \text{if } z_t^i \in \mathcal{N}_k(z_t^j) \text{ or } z_t^j \in \mathcal{N}_k(z_t^i) \\ 0, & \text{otherwise} \end{cases}$  (25)

where $\mathcal{N}_k(z_t^j)$ is the set of the $k$ nearest neighbors of $z_t^j$.

Using the normalized graph Laplacian matrix $\mathcal{L} = I - D^{-1/2} G D^{-1/2}$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j} G_{ij}$, graph regularization is expressed as:

$\min_{B}\ \dfrac{1}{2} \sum_{i,j} \left\| \dfrac{B^T z_t^i}{\sqrt{D_{ii}}} - \dfrac{B^T z_t^j}{\sqrt{D_{jj}}} \right\|_2^2 G_{ij} = \min_{B}\ \operatorname{tr}\left(B^T Z_t \mathcal{L} Z_t^T B\right)$  (26)

To remove the scaling effect, we add a constraint on the target embedding [30]:

$B^T Z_t H_t Z_t^T B = I$  (27)

where $H_t = I - \frac{1}{n_t} \mathbf{1} \mathbf{1}^T$ is the centering matrix of the target domain.

Parameter transfer and regularization. Since the source and the target domains have the same feature space, and CA has brought their probability distributions closer, we want the target domain projection matrix $B$ to be similar to the projection matrix $A$ learned in the source domain. Additionally, for better generalization performance, we want to ensure that $A$ and $B$ do not include extreme values. Thus, we have the following constraints on the projection matrices:

$\min_{A,B}\ \|B - A\|_F^2 + \|A\|_F^2 + \|B\|_F^2$  (28)
III-D The Overall Loss Function of MEKT
Integrating all regularizations and constraints above, the formulation of MEKT is:

$\min_{A,B}\ \operatorname{tr}\left(W^T Z M Z^T W\right) + \alpha \operatorname{tr}\left(A^T S_w A\right) + \beta \left( \|B - A\|_F^2 + \|A\|_F^2 + \|B\|_F^2 \right) + \rho \operatorname{tr}\left(B^T Z_t \mathcal{L} Z_t^T B\right), \quad \text{s.t.} \quad W^T E W = I$  (29)

where $\alpha$, $\beta$ and $\rho$ are trade-off parameters. Let $K = Z M Z^T$. Then, the Lagrange function is

$J = \operatorname{tr}\left(W^T (K + P) W\right) + \operatorname{tr}\left(\left(I - W^T E W\right) \Phi\right)$  (30)

where

$P = \begin{bmatrix} \alpha S_w + 2\beta I & -\beta I \\ -\beta I & 2\beta I + \rho Z_t \mathcal{L} Z_t^T \end{bmatrix}$  (31)

$E = \begin{bmatrix} S_b & 0 \\ 0 & Z_t H_t Z_t^T \end{bmatrix}$  (32)

$\Phi = \operatorname{diag}(\phi_1, \ldots, \phi_p)$  (33)

in which $\Phi$ collects the Lagrange multipliers. Setting the derivative $\partial J / \partial W = 0$, we have

$\left(K + P\right) W = E W \Phi$  (34)

(34) can be solved by generalized eigendecomposition, and $W$ consists of the eigenvectors corresponding to the $p$ smallest eigenvalues. Since $\hat{Y}_t$ is needed in $M$ [see (23)], and hence the target domain pseudo-labels must be estimated first, we use a general expectation-maximization like pseudo label refinement procedure [14] to refine the estimation, as shown in Algorithm 1.

Note that for the clarity of explanation, Algorithm 1 only considers one source domain. When there are multiple source domains, we perform CA and compute the tangent space feature vectors for each source domain separately, and then assemble their feature vectors into a single source domain feature matrix $Z_s$.
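The core numerical step of Algorithm 1, solving (34), is a single call to a generalized symmetric eigensolver. A sketch assuming NumPy/SciPy (the function name is ours; in practice $E$ may need a small ridge to stay positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def solve_projections(K, E, p, ridge=1e-6):
    """Solve (34), (K + P) W = E W Phi, here passed in as a single symmetric
    left-hand matrix K; keep the eigenvectors of the p smallest eigenvalues."""
    E_reg = E + ridge * np.trace(E) / E.shape[0] * np.eye(E.shape[0])
    w, V = eigh(K, E_reg)   # eigenvalues returned in ascending order
    return V[:, :p], w[:p]
```

In the refinement loop, the top block rows of the returned $W$ give $A$ and the bottom ones give $B$; the pseudo-labels are then re-estimated with a classifier trained on the projected source features, and the eigenproblem is solved again.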
III-E Domain Transferability Estimation (DTE)
When there are a large number of source domains, estimating domain transferability can indicate which domains are more important, and also reduce the computational cost. In BCIs, DTE can be used to identify source subjects whose data are poorly correlated with the target task, and hence may cause negative transfer. Although source domain selection is important, it is very challenging, and hence very few publications can be found in the literature [13, 4, 31, 32].
Next, we propose an unsupervised DTE strategy.
Assume there are $m$ labeled source domains $\{(Z_i, y_i)\}_{i=1}^{m}$, where $Z_i$ is the feature matrix of the $i$th source domain, and $y_i$ is the corresponding label vector. Assume also there is a target domain with unlabeled feature matrix $Z_t$. Let $S_b^i$ be the between-class scatter matrix of the $i$th source domain, similar to $S_b$ in (24), and $S_{st}^i$ be the scatter matrix between the $i$th source domain and the target domain. We define the discriminability of the $i$th source domain as $d_i = \operatorname{tr}(S_b^i)$, and the difference between the source domain and the target domain as $e_i = \operatorname{tr}(S_{st}^i)$.

Then, the transferability of Source Domain $i$ is computed as:

$T_i = \dfrac{d_i}{e_i}$  (35)

We then select the source subjects with the highest $T_i$.
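A sketch of this ranking, assuming NumPy. The exact scatter definitions may differ from the paper's, so treat the formulas below (trace of the between-class scatter divided by the squared distance between domain centroids) as one plausible instantiation of (35):

```python
import numpy as np

def dte_scores(sources, Zt):
    """Rank source domains: discriminability (between-class scatter) divided by
    the source-target difference (distance between domain centroids).
    `sources` is a list of (Z, y) pairs with Z of shape (d, n)."""
    mt = Zt.mean(axis=1)
    scores = []
    for Z, y in sources:
        mu = Z.mean(axis=1)
        d_i = 0.0
        for l in np.unique(y):
            Zl = Z[:, y == l]
            d_i += Zl.shape[1] * float(np.sum((Zl.mean(axis=1) - mu) ** 2))
        e_i = float(np.sum((mu - mt) ** 2))
        scores.append(d_i / (e_i + 1e-12))   # (35): transferability ratio
    return np.array(scores)
```

Domains with well-separated classes that sit close to the target receive high scores; the top-ranked subjects are kept as sources.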
IV Experiments
In this section, we evaluate our method for both single-source to single-target (STS) transfers and multi-source to single-target (MTS) transfers. The code is available online at https://github.com/chamwen/MEKT.
IV-A Datasets
We used two MI datasets and two ERP datasets in our experiments.
For both MI datasets, a subject sat in front of a computer screen. At the beginning of a trial, a fixation cross appeared on the black screen to prompt the subject to be prepared. Shortly after, an arrow pointing to a certain direction was presented as a visual cue for a few seconds, during which the subject performed a specific MI task. Then, the visual cue disappeared, and the next trial started after a short break. EEG signals were recorded throughout, and used to classify which MI the subject was performing.
For the first MI dataset (MI-1; http://www.bbci.de/competition/iv/desc_1.html), 59-channel EEGs were recorded from seven healthy subjects, each with 100 left hand MIs and 100 right hand MIs. For the second MI dataset (MI-2; http://www.bbci.de/competition/iv/desc_2a.pdf), 22-channel EEGs were recorded from nine healthy subjects, each with 72 left hand MIs and 72 right hand MIs. Both datasets were used for two-class classification.
The first ERP dataset (RSVP; https://www.physionet.org/physiobank/database/ltrsvp/) contained 8-channel EEG recordings from 11 healthy subjects in a rapid serial visual presentation (RSVP) experiment. The images were presented at different rates (5, 6, and 10 Hz) in three different experiments; we only used the 5 Hz version. The goal was to classify from the EEG whether the subject had seen a target image (with an airplane) or a non-target image (without an airplane). The number of images varied between 368 and 565 across subjects, and the target to non-target ratio was around 1:9.
The second ERP dataset (ERN; https://www.kaggle.com/c/inria-bci-challenge) was recorded from a feedback error-related negativity (ERN) experiment [33], which was used in a Kaggle competition for two-class classification. It was collected from 26 subjects with 56 electrodes, and partitioned into a training set (16 subjects) and a test set (10 subjects). We only used the 16 subjects in the training set, as we do not have access to the test set. The average target to non-target ratio was around 1:4.
IV-B EEG Data Preprocessing
EEG signals from all datasets were preprocessed using the EEGLAB toolbox [34]. For the two MI datasets, a causal 50-order finite impulse response band-pass filter (8-30 Hz) was applied to remove muscle artifacts and direct current drift. Next, the EEG signal in a fixed window after the cue appearance was extracted as one trial. The RSVP signal was band-pass filtered, downsampled to 64 Hz, and epoched to a short interval time-locked to the stimulus onset. The ERN signal was downsampled to 200 Hz, band-pass filtered to 1-40 Hz, epoched after the feedback onset, and normalized.

MI-1 had 59 EEG channels, which were not easy to manipulate. Thus, we reduced the number of its tangent space features to the number of source domain samples (144), according to their p-values in one-way ANOVA. For the ERN dataset, we used xDAWN [35] to reduce the number of channels from 56 to 6.
The dimensionalities of different input spaces are shown in Table I.

Feature Space  MI-1   MI-2   ERP
Euclidean      6      6      20
Tangent        144    253    —
Riemannian     59×59  22×22  —
Next, we describe how the Euclidean space features were determined. For the two MI datasets, six CSP variances [see (4)] were used as the features. For the two ERP datasets, after spatial filtering by xDAWN, we assembled each EEG trial into a vector, performed principal component analysis on all vectors from the source subjects, and extracted the scores of the first 20 principal components as the features.
IV-C Baseline Algorithms
We compared our MEKT approaches (MEKT-R: the Riemannian mean is used as the reference matrix; MEKT-E: the Euclidean mean is used as the reference matrix; MEKT-L: the Log-Euclidean mean is used as the reference matrix) with seven state-of-the-art baseline algorithms for BCI classification. According to the feature space type, these baselines can be divided into three categories:

Euclidean space approaches:

CSP-LDA (linear discriminant analysis) [36] for MI, and CSP-SVM (support vector machine) [37] for ERP.

EA-CSP-LDA for MI, and EA-xDAWN-SVM for ERP, i.e., we performed EA [23] as a preprocessing step before spatial filtering and classification.

Riemannian space approach: RA-MDM [22] for MI, and xDAWN-RA-MDM for ERP.

Tangent space approaches (CORAL, GFK, JDA, JGSA), which were proposed for computer vision applications, and have not been used in BCIs before. CA was used before each of them. In each learned subspace, the sLDA classifier [38] was used for MI, and SVM for ERP.

Hyperparameters of all baselines were set according to the recommendations in their corresponding publications. MEKT used a single fixed set of hyperparameters (the trade-off parameters $\alpha$, $\beta$ and $\rho$, the subspace dimensionality $p$, and the number of nearest neighbors $k$) across all four datasets.
IV-D Experimental Settings
We evaluated unsupervised STS and MTS transfers. In STS, one subject was used as the target, and another as the source. In MTS, one subject was used as the target, and all others as the sources, which is similar to leave-one-subject-out cross-validation in traditional BCI classification.

Let $N$ be the number of subjects in a dataset. Then, there are $N(N-1)$ different STS tasks, and $N$ different MTS tasks.
The balanced classification accuracy (BCA) was used as the performance measure:

$\mathrm{BCA} = \dfrac{1}{L} \sum_{l=1}^{L} \dfrac{\mathrm{TP}_l}{n_l}$  (36)

where $\mathrm{TP}_l$ and $n_l$ are the number of true positives and the number of samples in Class $l$, respectively.
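BCA is the mean per-class recall, which is robust to the heavy class imbalance of the ERP datasets (around 1:9 for RSVP). A sketch assuming NumPy:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """BCA (36): average over classes of TP_l / n_l (per-class recall)."""
    classes = np.unique(y_true)
    recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in classes]
    return float(np.mean(recalls))
```

A classifier that always predicts the majority class scores only the chance level $1/L$ under BCA, even though its plain accuracy could be 90% on a 1:9 dataset.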
IV-E Visualization
As explained in Section III-A, CA makes the aligned covariance matrices approximate the identity matrix, no matter whether the Riemannian mean, the Euclidean mean, or the Log-Euclidean mean is used as the reference matrix. To demonstrate that, Fig. 2 shows the raw covariance matrix of the first EEG trial of Subject 1 in MI-2, and the aligned covariance matrices using different references. The raw covariance matrix is nowhere close to identity, but after CA, the covariance matrices are approximately identity, and hence the corresponding EEG trials are approximately whitened.
Next, we used t-SNE [39] to reduce the dimensionality of the EEG trials to two, and visualize whether MEKT can bring the data distributions of the source and the target domains together. Fig. 3 shows the results on transferring Subject 2's data to Subject 1 in MI-2, before and after different data alignment approaches. Before CA, the source domain and target domain samples do not overlap at all. After CA, the two sets of samples have identical means, but different variances. CA-GFK and CA-JDA make the variance of the source domain samples and the variance of the target domain samples approximately identical, but different classes are still not well separated. MEKT-R not only makes the overall distributions of the source domain samples and the target domain samples consistent, but also brings samples from the same class in the two domains close, which should benefit the classification.
IV-F Classification Accuracies
The average BCAs on the four datasets are shown in Table II. All MEKT-based approaches outperformed the baselines.

              STS            MTS
              MI-1   MI-2    MI-1   MI-2    Avg
CSP-LDA       57.61  58.60   59.71  67.82   60.94
RA-MDM        64.98  66.60   73.29  72.07   69.24
EA-CSP-LDA    66.96  65.16   79.79  73.53   71.36
CA            66.17  66.02   76.29  71.84   70.08
CA-CORAL      67.69  67.26   78.86  72.38   71.55
CA-GFK        66.62  65.54   76.79  72.99   70.49
CA-JDA        66.01  66.59   81.07  74.15   71.96
CA-JGSA       65.81  65.90   76.79  73.07   70.39
MEKT-E        69.19  68.34   81.29  76.00   73.71
MEKT-L        70.74  68.56   83.07  76.54   74.73
MEKT-R        70.99  68.74   83.42  76.31   74.87

              RSVP   ERN     RSVP   ERN     Avg
xDAWN-SVM     58.58  54.34   65.36  61.87   60.04
xDAWN-RA-MDM  60.37  56.22   67.29  62.90   61.70
EA-xDAWN-SVM  58.76  55.57   69.07  64.63   62.01
CA            58.34  56.97   67.35  65.89   62.14
CA-CORAL      58.45  57.04   66.94  66.17   62.15
CA-GFK        59.93  57.24   67.75  66.03   62.74
CA-JDA        60.27  57.56   66.06  64.64   62.13
CA-JGSA       55.23  57.17   64.57  57.68   58.66
MEKT-E        61.08  58.01   67.92  66.70   63.43
MEKT-L        61.15  57.91   68.40  65.98   63.36
MEKT-R        61.24  57.85   68.38  66.17   63.41
Fig. 4 shows the BCAs of all tangent space based approaches when different reference matrices were used in CA. The Riemannian mean obtained the best BCA in four out of the six approaches, and also the best overall performance.
We also performed paired t-tests on the BCAs to check whether the performance improvements of MEKT-R over the others were statistically significant. Before each test, we performed a Lilliefors test [40] to verify that the null hypothesis that the data come from a normal distribution could not be rejected. Then, we performed false discovery rate corrections [41] by a linear step-up procedure under a fixed significance level on the paired p-values of each task. The false discovery rate adjusted p-values are shown in Table III. MEKT-R significantly outperformed all baselines in almost all STS transfers. The performance improvements became less significant when there were multiple source domains, which is reasonable, because generally in machine learning the differences between different algorithms diminish as the amount of training data increases.
MEKT-R vs          MI-1   MI-2   RSVP   ERN

STS  CSP-LDA       .0000  .0000  –      –
     xDAWN-SVM     –      –      .0002  .0000
     RA-MDM        .0003  .0340  .0412  .0004
     EA-CSP-LDA    .0044  .0003  –      –
     EA-xDAWN-SVM  –      –      .0000  .0000
     CA            .0000  .0006  .0000  .0010
     CA-CORAL      .0005  .0340  .0000  .0014
     CA-GFK        .0000  .0001  .0016  .0130
     CA-JDA        .0003  .0183  .0386  .2627
     CA-JGSA       .0021  .0006  .0000  .0241
MTS  CSP-LDA       .0329  .1239  –      –
     xDAWN-SVM     –      –      .2077  .0306
     xDAWN-RA-MDM  .0824  .1636  .5347  .0632
     EA-CSP-LDA    .2808  .1636  –      –
     EA-xDAWN-SVM  –      –      .5733  .2632
     CA            .0329  .1260  .4727  .8380
     CA-CORAL      .0897  .1636  .3477  .9914
     CA-GFK        .0824  .1260  .5347  .9117
     CA-JDA        .2379  .1636  .0349  .0632
     CA-JGSA       .1344  .1636  .0323  .0018
IV-G Computational Cost
This subsection compares the computational cost of different algorithms, which were implemented in Matlab 2018a on a laptop with an i7-8550U CPU @ 2.00 GHz and 8 GB memory, running 64-bit Windows 10 Education Edition.

For MTS transfers, base classifier construction took the most time, because there were a large number of source domain training samples. To emphasize the computational cost of different data alignment approaches, we only show the computing time on the MI-2 and RSVP STS tasks in Fig. 5. EA was the most efficient. RA-MDM, CA-JDA and MEKT-R had similar computational cost. MEKT-L and MEKT-E had comparable performance with MEKT-R (Table II), but much shorter computing time. MEKT-L seemed to be the best compromise between classification accuracy and computational cost.
IV-H Effectiveness of the Joint Probability MMD
To validate the superiority of the joint probability MMD over the traditional MMD, we replaced the joint probability MMD term in (29) by the traditional MMD term in (21), and repeated the experiments. The results are shown in Table IV. The joint probability MMD outperformed the traditional MMD in six out of the eight tasks. We expect that the joint probability MMD should also be advantageous in other applications where the traditional MMD is currently used.
IV-I Effectiveness of DTE
This subsection validates our DTE strategy on MTS tasks to select the most beneficial source subjects.
Table V shows the BCAs when different source domain selection approaches were used: RAND randomly selected a subset of the source subjects [because there was randomness, we repeated the experiment 20 times, and report the mean and standard deviation (in the parentheses)], ROD was the approach proposed in [13], and ALL used all source subjects. Table VI shows the computational cost of different algorithms.

Tables V and VI show that the proposed DTE outperformed RAND and ROD in terms of the classification accuracy. Although its BCAs were generally slightly worse than those of ALL, its computational cost was much lower, especially when the number of source subjects was large: on the two datasets with the most subjects (RSVP and ERN), it saved over 50% of the computational cost.
Dataset  N   RAND          ROD    DTE    ALL
MI-1     7   81.53 (1.19)  81.86  82.14  83.42
MI-2     9   75.05 (1.06)  74.38  76.23  76.31
RSVP     11  67.48 (0.31)  67.79  68.70  68.38
ERN      16  65.31 (0.52)  65.36  65.51  66.17

Dataset  N   RAND   ROD    DTE    ALL
MI-1     7   11.55  12.46  11.77  12.84
MI-2     9   0.72   0.91   0.76   1.11
RSVP     11  4.01   4.31   4.08   8.64
ERN      16  7.65   8.28   7.79   15.80
V Conclusions

Transfer learning is popular in EEG-based BCIs to cope with variations among different subjects and/or tasks. This paper has considered offline unsupervised cross-subject EEG classification, i.e., we have labeled EEG trials from one or more source subjects, but only unlabeled EEG trials from the target subject. We proposed a novel MEKT approach, which has three steps: 1) align the covariance matrices of the EEG trials in the Riemannian manifold; 2) extract tangent space features; and 3) perform domain adaptation by minimizing the joint probability distribution shift between the source and the target domains, while preserving their geometric structures. An optional fourth step, DTE, was also proposed to identify the most beneficial source domains, and hence to reduce the computational cost. Experiments on four EEG datasets from two different BCI paradigms demonstrated that MEKT outperformed several state-of-the-art transfer learning approaches. Moreover, DTE can reduce more than half of the computational cost when the number of source subjects is large, with little sacrifice of classification accuracy.
References
 [1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Braincomputer interfaces for communication and control,” Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, 2002.
 [2] R. P. Rao, Braincomputer interfacing: an introduction. Cambridge, England: Cambridge University Press, 2013.
 [3] B. He, B. Baxter, B. J. Edelman, C. C. Cline, and W. W. Ye, “Noninvasive braincomputer interfaces based on sensorimotor rhythms,” Proc. of the IEEE, vol. 103, no. 6, pp. 907–925, May 2015.
 [4] D. Wu, “Online and offline domain adaptation for reducing BCI calibration effort,” IEEE Trans. on HumanMachine Systems, vol. 47, no. 4, pp. 550–563, 2017.
 [5] F. Lotte, L. Bougrain, A. Cichocki, M. Clerc, M. Congedo, A. Rakotomamonjy, and F. Yger, “A review of classification algorithms for EEGbased brain–computer interfaces: a 10 year update,” Journal of neural engineering, vol. 15, no. 3, p. 031005, 2018.
 [6] Z. J. Koles, M. S. Lazar, and S. Z. Zhou, “Spatial patterns underlying population differences in the background EEG,” Brain Topography, vol. 2, no. 4, pp. 275–284, 1990.
 [7] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass braincomputer interface classification by Riemannian geometry,” IEEE Trans. on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, Apr. 2012.
 [8] F. Yger, M. Berar, and F. Lotte, “Riemannian approaches in braincomputer interfaces: a review,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, Nov. 2017.
 [9] A. Barachant and M. Congedo, “A plug & play P300 BCI using information geometry,” arXiv: 1409.0107, 2014.
 [10] L. Korczowski, M. Congedo, and C. Jutten, “Single-trial classification of multi-user P300-based brain-computer interface using Riemannian geometry,” in Proc. 37th Annu. Int’l Conf. IEEE Eng. Med. Biol. Soc., Milan, Italy, Aug. 2015, pp. 1769–1772.

 [11] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. on Neural Networks, vol. 22, no. 2, pp. 199–210, Feb. 2011.
 [12] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. 30th AAAI Conf. on Artificial Intell., Arizona, Feb. 2016.

 [13] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Providence, RI, Jun. 2012, pp. 2066–2073.
 [14] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proc. IEEE Int’l Conf. on Computer Vision, Sydney, Australia, Dec. 2013, pp. 2200–2207.
 [15] J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and statistical alignment for visual domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, Jul. 2017, pp. 1859–1867.
 [16] D. Wu, B. J. Lance, and T. D. Parsons, “Collaborative filtering for brain-computer interaction using transfer learning and active class selection,” PLoS ONE, vol. 8, no. 2, p. e56624, 2013.
 [17] D. Wu, V. J. Lawhern, W. D. Hairston, and B. J. Lance, “Switching EEG headsets made easy: Reducing offline calibration effort using active weighted adaptation regularization,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 24, no. 11, pp. 1125–1137, Mar. 2016.
 [18] V. Jayaram, M. Alamgir, Y. Altun, B. Schölkopf, and M. Grosse-Wentrup, “Transfer learning in brain-computer interfaces,” IEEE Comput. Intell. Mag., vol. 11, no. 1, pp. 20–31, Jan. 2016.
 [19] H. Kang, Y. Nam, and S. Choi, “Composite common spatial pattern for subject-to-subject transfer,” IEEE Signal Processing Letters, vol. 16, no. 8, pp. 683–686, 2009.
 [20] F. Lotte and C. Guan, “Learning from other subjects helps reducing brain-computer interface calibration time,” in Proc. IEEE Int’l Conf. on Acoustics Speech and Signal Processing, Dallas, TX, Mar. 2010, pp. 614–617.
 [21] Y. Jin, M. Mousavi, and V. R. de Sa, “Adaptive CSP with subspace alignment for subject-to-subject transfer in motor imagery brain-computer interfaces,” in Proc. 6th Int’l Conf. on Brain-Computer Interface (BCI), Gangwon, South Korea, 2018, pp. 1–4.
 [22] P. Zanini, M. Congedo, C. Jutten, S. Said, and Y. Berthoumieu, “Transfer learning: a Riemannian geometry framework with applications to brain-computer interfaces,” IEEE Trans. on Biomedical Engineering, vol. 65, no. 5, pp. 1107–1116, Aug. 2018.
 [23] H. He and D. Wu, “Transfer learning for brain-computer interfaces: A Euclidean space data alignment approach,” IEEE Trans. on Biomedical Engineering, Apr. 2019.
 [24] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Classification of covariance matrices using a Riemannian-based kernel for BCI applications,” Neurocomputing, vol. 112, pp. 172–178, 2013.
 [25] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, “Optimal spatial filtering of single trial EEG during imagined hand movement,” IEEE Trans. on Rehabilitation Engineering, vol. 8, no. 4, pp. 441–446, 2000.

 [26] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Log-Euclidean metrics for fast and simple calculus on diffusion tensors,” Magnetic Resonance in Medicine, vol. 56, no. 2, pp. 411–421, 2006.
 [27] R. Bhatia, Positive Definite Matrices. New Jersey: Princeton University Press, 2009.
 [28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. 3, pp. 723–773, Mar. 2012.

 [29] M. Belkin and P. Niyogi, “Semi-supervised learning on Riemannian manifolds,” Machine Learning, vol. 56, nos. 1–3, pp. 209–239, 2004.
 [30] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 [31] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance, and C.-T. Lin, “Driver drowsiness estimation from EEG signals using online weighted adaptation regularization for regression (OwARR),” IEEE Trans. on Fuzzy Systems, vol. 25, no. 6, pp. 1522–1535, 2017.
 [32] C.-S. Wei, Y.-P. Lin, Y.-T. Wang, T.-P. Jung, N. Bigdely-Shamlo, and C.-T. Lin, “Selective transfer learning for EEG-based drowsiness detection,” in Proc. IEEE Int’l Conf. on Systems, Man and Cybernetics, Hong Kong, Oct. 2015, pp. 3229–3232.
 [33] P. Margaux, M. Emmanuel, D. Sébastien, B. Olivier, and M. Jérémie, “Objective and subjective evaluation of online error correction during P300-based spelling,” Advances in Human-Computer Interaction, vol. 2012, p. 4, 2012.

 [34] A. Delorme and S. Makeig, “EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
 [35] B. Rivet, A. Souloumiac, V. Attina, and G. Gibert, “xDAWN algorithm to enhance evoked potentials: application to brain-computer interface,” IEEE Trans. on Biomedical Engineering, vol. 56, no. 8, pp. 2035–2043, Aug. 2009.
 [36] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
 [37] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. on Intell. Systems and Technol., vol. 2, no. 3, p. 27, Apr. 2011.
 [38] R. Peck and J. Van Ness, “The use of shrinkage estimators in linear discriminant analysis,” IEEE Trans. on Pattern Analysis and Machine Intell., vol. 4, no. 5, pp. 530–537, May 1982.
 [39] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, Nov. 2008.
 [40] H. W. Lilliefors, “On the Kolmogorov-Smirnov test for normality with mean and variance unknown,” Journal of the American Statistical Association, vol. 62, no. 318, pp. 399–402, 1967.
 [41] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.