Manifold Embedded Knowledge Transfer for Brain-Computer Interfaces

10/14/2019 ∙ by Wen Zhang, et al. ∙ Huazhong University of Science & Technology

Transfer learning makes use of data or knowledge in one problem to help solve a different, yet related, problem. It is particularly useful in brain-computer interfaces (BCIs), for coping with variations among different subjects and/or tasks. This paper considers offline unsupervised cross-subject electroencephalogram (EEG) classification, i.e., we have labeled EEG trials from one or more source subjects, but only unlabeled EEG trials from the target subject. We propose a novel manifold embedded knowledge transfer (MEKT) approach, which first aligns the covariance matrices of the EEG trials in the Riemannian manifold, extracts features in the tangent space, and then performs domain adaptation by minimizing the joint probability distribution shift between the source and the target domains, while preserving their geometric structures. MEKT can cope with one or multiple source domains, and can be computed efficiently. We also propose a domain transferability estimation (DTE) approach to identify the most beneficial source domains, in case there are a large number of source domains. Experiments on four EEG datasets from two different BCI paradigms demonstrated that MEKT outperformed several state-of-the-art transfer learning approaches, and DTE can reduce more than half of the computational cost when the number of source subjects is large, with little sacrifice of classification accuracy.


I Introduction

A brain-computer interface (BCI) provides a direct communication pathway between a user’s brain and a computer [1, 2]. Electroencephalogram (EEG), a multi-channel time-series, is the most frequently used BCI input signal. There are three common paradigms in EEG-based BCIs: motor imagery (MI) [3], event-related potentials (ERPs) [4], and steady-state visual evoked potentials [2]. The first two are the focus of this paper.

In MI tasks, the user needs to imagine the movements of his/her body parts, which causes modulations of brain rhythms in the involved cortical areas. In ERP tasks, the user is stimulated by a majority of non-target stimuli and a few target stimuli; a special ERP pattern appears in the EEG response after the user perceives a target stimulus. EEG-based BCI systems have been widely used to help people with disabilities, and also the able-bodied [1].

A standard EEG signal analysis pipeline consists of temporal (band-pass) filtering, spatial filtering, and classification [5]. Spatial filters such as common spatial patterns (CSP) [6] are widely used to enhance the signal-to-noise ratio. Recently, there is a trend to utilize the covariance matrices of EEG trials, which are symmetric positive definite (SPD) and can be viewed as points on a Riemannian manifold, in EEG signal analysis [7, 8, 9]. For MI tasks, the discriminative information is mainly spatial, and can be directly encoded in the covariance matrices. On the contrary, the main discriminative information of ERP trials is temporal. A novel approach was proposed in [10] to augment each EEG trial by the mean of all target trials that contain the ERP, and then compute the covariance matrices of the augmented trials. However, Riemannian space based approaches are computationally expensive, and not compatible with Euclidean space machine learning approaches.

A major challenge in BCIs is that different users have different neural responses to the same stimulus, and even the same user can have different neural responses to the same stimulus at different times/locations. Besides, when calibrating the BCI system, acquiring a large number of subject-specific labeled training examples for each new subject is time-consuming and expensive. Transfer learning [11, 12, 13, 14, 15], which uses data/information from one or more source domains to help the learning in a target domain, can be used to address these problems. Some representative applications of transfer learning in BCIs can be found in [16, 17, 18, 19, 20, 21]. Many researchers [19, 20, 21] attempted to seek a set of subject-invariant CSP filters to increase the signal-to-noise ratio. Another pipeline is Riemannian geometry based. Zanini et al. [22] proposed a Riemannian alignment (RA) framework to align the EEG covariance matrices from different subjects. He and Wu [23] extended RA to Euclidean alignment (EA) in the Euclidean space, so that any Euclidean space classifier can be used after it.

Fig. 1: Illustration of our proposed MEKT. Squares and circles represent examples from different classes. Different colors represent different domains. All domains are first aligned on the Riemannian manifold, and then mapped onto a tangent space. $A$ and $B$ are the projection matrices of the source and the target domains, respectively.

To utilize the desirable properties of Riemannian geometry while avoiding its high computational cost, and to leverage knowledge learned from the source subjects, this paper proposes a manifold embedded knowledge transfer (MEKT) framework, which first aligns the covariance matrices of the EEG trials in the Riemannian manifold, then performs domain adaptation in the tangent space by minimizing the joint probability distribution shift between the source and the target domains, while preserving their geometric structures, as illustrated in Fig. 1. Additionally, we propose a domain transferability estimation (DTE) approach to select the most beneficial subjects in multi-source transfer learning. Experiments on four datasets from two different BCI paradigms (MI and ERP) verified the effectiveness of MEKT and DTE.

The remainder of this paper is organized as follows: Section II introduces related work on spatial filters, Riemannian geometry, tangent space mapping, RA, EA, and subspace adaptation. Section III describes the details of the proposed MEKT and DTE. Section IV presents experiments to compare the performance of MEKT with several state-of-the-art data alignment and transfer learning approaches. Finally, Section V draws conclusions.

II Related Work

This section introduces background knowledge on spatial filters, Riemannian geometry, tangent space mapping, RA, EA, and subspace adaptation, which will be used in the next section.

II-A Spatial Filters

Spatial filtering can be viewed as a data-driven dimensionality reduction approach that promotes the variance difference between two conditions [24]. It is common in MI-based BCIs to use CSP filters [25] to simultaneously diagonalize the two intra-class covariance matrices.

Consider a binary classification problem. Let $X_i \in \mathbb{R}^{c \times t}$ be the $i$-th labeled training example, where $c$ is the number of EEG channels, and $t$ the number of time domain samples. For Class $k$ ($k = 1, 2$), CSP finds a spatial filter matrix $W_k \in \mathbb{R}^{c \times f}$, where $f$ is the number of spatial filters, to maximize the variance difference between Class $k$ and Class $k'$ (the other class):

$W_k = \arg\max_{W} \ \frac{\mathrm{tr}(W^\top \bar{C}_k W)}{\mathrm{tr}(W^\top \bar{C}_{k'} W)}$    (1)

where $\bar{C}_k$ is the mean covariance matrix of all EEG trials in Class $k$, and $\mathrm{tr}(\cdot)$ is the trace of a matrix. The solution $W_k$ is the concatenation of the $f$ leading eigenvectors of the matrix $\bar{C}_{k'}^{-1}\bar{C}_k$.

Finally, we concatenate the spatial filters from both classes to obtain the complete CSP filters:

$W = [W_1, W_2] \in \mathbb{R}^{c \times 2f}$    (2)

and compute the spatially filtered $X_i'$ by:

$X_i' = W^\top X_i$    (3)

The log-variances of the filtered trial can be extracted:

$x_i = \log\big(\mathrm{diag}(X_i' X_i'^\top)\big)$    (4)

and used as input features in classification.
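To make (1)-(4) concrete, the following is a minimal numpy/scipy sketch that solves (1) through the equivalent generalized eigenvalue problem; the variable names, the number of filters, and the synthetic data are illustrative only, not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def csp_features(trials, labels, n_filters=3):
    """Minimal CSP sketch. trials: (n_trials, c, t) array, labels in {0, 1}.
    Returns the filter matrix W and (n_trials, 2*n_filters) log-variance features."""
    # Class-mean spatial covariance matrices, cf. the mean covariances in (1)
    covs = [np.mean([X @ X.T / X.shape[1] for X in trials[labels == k]], axis=0)
            for k in (0, 1)]
    # Generalized eigenvalue problem C_0 w = lambda (C_0 + C_1) w; the eigenvectors
    # with largest / smallest eigenvalues maximize the variance ratio for class 0 / 1
    eigvals, eigvecs = eigh(covs[0], covs[0] + covs[1])
    W = np.hstack([eigvecs[:, -n_filters:], eigvecs[:, :n_filters]])   # cf. (2)
    # Spatial filtering (3) and log-variance features (4)
    feats = np.array([np.log(np.var(W.T @ X, axis=1)) for X in trials])
    return W, feats

# Usage on synthetic data
rng = np.random.default_rng(0)
trials = rng.standard_normal((40, 8, 256))
labels = np.repeat([0, 1], 20)
W, feats = csp_features(trials, labels)
print(W.shape, feats.shape)   # (8, 6) (40, 6)
```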

II-B Riemannian Geometry

All SPD matrices form a differentiable Riemannian manifold. Riemannian geometry is used to manipulate them. Some basic definitions are provided below.

The Riemannian distance between two SPD matrices $P_1$ and $P_2$ is:

$\delta(P_1, P_2) = \big\| \log\big(P_1^{-1/2} P_2 P_1^{-1/2}\big) \big\|_F = \Big[ \sum_{i=1}^{c} \log^2 \lambda_i \Big]^{1/2}$    (5)

where $\|\cdot\|_F$ is the Frobenius norm, and $\log(\cdot)$ denotes the matrix logarithm, which takes the logarithm of the eigenvalues $\lambda_i$ of $P_1^{-1/2} P_2 P_1^{-1/2}$.

The Riemannian mean of $\{P_i\}_{i=1}^{n}$ is:

$\bar{P}_R = \arg\min_{P} \sum_{i=1}^{n} \delta^2(P, P_i)$    (6)

The Euclidean mean is:

$\bar{P}_E = \frac{1}{n} \sum_{i=1}^{n} P_i$    (7)

The Log-Euclidean mean [26] is:

$\bar{P}_L = \exp\Big( \sum_{i=1}^{n} w_i \log(P_i) \Big)$    (8)

where $w_i$ is usually set to $1/n$.
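For reference, a minimal numpy/scipy sketch of the Riemannian distance (5), the Euclidean mean (7), and the Log-Euclidean mean (8); the Riemannian mean (6) is usually computed iteratively and is omitted here. Function names and the random SPD matrices are illustrative.

```python
import numpy as np
from scipy.linalg import logm, expm, fractional_matrix_power

def riemannian_distance(P1, P2):
    """Affine-invariant Riemannian distance (5) between two SPD matrices."""
    P1_inv_sqrt = fractional_matrix_power(P1, -0.5)
    M = P1_inv_sqrt @ P2 @ P1_inv_sqrt
    return np.linalg.norm(logm(M), 'fro')

def euclidean_mean(Ps):
    """Arithmetic mean (7) of a stack of SPD matrices."""
    return np.mean(Ps, axis=0)

def log_euclidean_mean(Ps):
    """Log-Euclidean mean (8): exponentiate the average matrix logarithm."""
    return expm(np.mean([logm(P) for P in Ps], axis=0))

# Usage on random SPD matrices
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); P1 = A @ A.T + 4 * np.eye(4)
B = rng.standard_normal((4, 4)); P2 = B @ B.T + 4 * np.eye(4)
print(riemannian_distance(P1, P2))
print(np.allclose(riemannian_distance(P1, P1), 0, atol=1e-8))   # distance to itself is 0
```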

II-C Tangent Space Mapping

Tangent space mapping, also known as the logarithmic mapping, maps a Riemannian space SPD matrix $P_i$ to a Euclidean tangent space vector $s_i$ around an SPD matrix $P$, which is usually the Riemannian or Euclidean mean:

$s_i = \mathrm{upper}\big( \log\big( P^{-1/2} P_i P^{-1/2} \big) \big)$    (9)

where $\mathrm{upper}(\cdot)$ takes the upper triangular part of an SPD matrix and forms a vector of dimension $c(c+1)/2$, and $P$ is a reference matrix. To obtain a tangent space locally homomorphic to the manifold, a weight of $\sqrt{2}$ on the off-diagonal elements (and 1 on the diagonal elements) is needed in $\mathrm{upper}(\cdot)$ [24].

Congruent transform and congruence invariance [27] are two important properties in the Riemannian space:

$\mathbb{M}\big(W P_1 W^\top, \ldots, W P_n W^\top\big) = W\, \mathbb{M}\big(P_1, \ldots, P_n\big)\, W^\top$    (10)

$\delta\big(Q P_1 Q, Q P_2 Q\big) = \delta\big(P_1, P_2\big)$    (11)

where $\mathbb{M}$ is the Euclidean or Riemannian mean operation, $W$ is a nonsingular square matrix, and $Q$ is an invertible symmetric matrix. (11) suggests that the Riemannian distance between two SPD matrices does not change, if both are left and right multiplied by an invertible symmetric matrix.
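A minimal sketch of the tangent space mapping (9), assuming the reference matrix is the Euclidean mean of the trials; the $\sqrt{2}$ weighting of the off-diagonal elements follows the convention in [24], and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import logm, fractional_matrix_power

def upper(S):
    """Vectorize a symmetric matrix: diagonal entries with weight 1,
    upper-triangular off-diagonal entries with weight sqrt(2)."""
    idx = np.triu_indices(S.shape[0])
    weights = np.where(idx[0] == idx[1], 1.0, np.sqrt(2.0))
    return weights * S[idx]

def tangent_space_map(covs, ref):
    """Logarithmic map (9) of covariance matrices around the reference matrix."""
    ref_inv_sqrt = fractional_matrix_power(ref, -0.5)
    return np.array([upper(np.real(logm(ref_inv_sqrt @ P @ ref_inv_sqrt))) for P in covs])

# Usage: map a few random SPD matrices around their Euclidean mean
rng = np.random.default_rng(1)
covs = []
for _ in range(5):
    A = rng.standard_normal((6, 6))
    covs.append(A @ A.T + 6 * np.eye(6))
covs = np.array(covs)
S = tangent_space_map(covs, covs.mean(axis=0))
print(S.shape)   # (5, 21): c(c+1)/2 = 21 features per trial
```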

II-D Riemannian Alignment (RA)

RA [22] first computes the covariance matrices of some resting (or non-target) trials, in which the subject is not performing any task (or not performing the target task), and then the Riemannian mean $\bar{R}$ of these matrices, which is used as the reference matrix to reduce the inter-session or inter-subject variations, by the following transformation:

$\tilde{P}_i = \bar{R}^{-1/2}\, P_i\, \bar{R}^{-1/2}$    (12)

where $P_i$ is the covariance matrix of the $i$-th trial, and $\tilde{P}_i$ the corresponding aligned covariance matrix. Then, all $\tilde{P}_i$ can be classified by a minimum distance to mean (MDM) classifier [7].
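A minimal sketch of the recentering in (12); the reference matrix here is taken as the Euclidean mean of the resting-state covariance matrices purely to keep the example short (the original RA uses the Riemannian mean), and all names and data are illustrative.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def recenter(covs, ref):
    """Align covariance matrices around a reference matrix, cf. (12)."""
    ref_inv_sqrt = fractional_matrix_power(ref, -0.5)
    return np.array([ref_inv_sqrt @ P @ ref_inv_sqrt for P in covs])

# Usage: recenter random SPD "trial covariances" around the mean of a few
# "resting" covariances (stand-ins for the resting/non-target trials)
rng = np.random.default_rng(2)
def make_spd(c=4):
    A = rng.standard_normal((c, c))
    return A @ A.T + c * np.eye(c)

resting = np.array([make_spd() for _ in range(10)])
trials = np.array([make_spd() for _ in range(20)])
aligned = recenter(trials, resting.mean(axis=0))
print(aligned.shape)   # (20, 4, 4)
```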

II-E Euclidean Alignment (EA)

Although RA-MDM has demonstrated promising performance, it still has some limitations [23]: 1) it processes the covariance matrices in the Riemannian space, whereas there are very few Riemannian space classifiers; 2) it computes the reference matrix from the non-target stimuli in ERP-based BCIs, which requires some labeled trials from the new subject.

EA [23] extends RA and solves the above problems by transforming an EEG trial in the Euclidean space:

$\tilde{X}_i = \bar{R}^{-1/2} X_i$    (13)

where $\bar{R}$ is the Euclidean mean of the covariance matrices of all EEG trials, computed by (7).

However, EA only considers the marginal probability distribution shift, and works best when the number of EEG channels is small. When there are a large number of channels, computing $\bar{R}^{-1/2}$ may be numerically unstable.
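A minimal sketch of EA (13), which whitens the raw trials by the inverse square root of the Euclidean mean covariance matrix; names and the synthetic trials are illustrative.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def euclidean_alignment(trials):
    """EA (13): whiten each trial by the Euclidean mean covariance matrix."""
    covs = np.array([X @ X.T / X.shape[1] for X in trials])
    R_bar = covs.mean(axis=0)                        # Euclidean mean, cf. (7)
    R_inv_sqrt = fractional_matrix_power(R_bar, -0.5)
    return np.array([R_inv_sqrt @ X for X in trials])

# After EA, the mean covariance of the aligned trials is (numerically) the identity
rng = np.random.default_rng(3)
trials = rng.standard_normal((30, 8, 200))
aligned = euclidean_alignment(trials)
mean_cov = np.mean([X @ X.T / X.shape[1] for X in aligned], axis=0)
print(np.allclose(mean_cov, np.eye(8), atol=1e-6))   # True
```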

II-F Subspace Adaptation

Tangent space vectors usually have very high dimensionality, so they cannot be used easily in transfer learning. An intuitive approach is to align them in a lower dimensional subspace. Pan et al. [11] proposed transfer component analysis (TCA) to learn the transferable components across domains in a reproducing kernel Hilbert space, using the maximum mean discrepancy (MMD) [28]. Joint distribution adaptation (JDA) [14] improves TCA by also considering the conditional distribution shift, using pseudo label refinement. Joint geometrical and statistical alignment (JGSA) [15] further improves JDA by adding two regularization terms, which minimize the within-class scatter matrix and maximize the between-class scatter matrix.

III Manifold Embedded Knowledge Transfer (MEKT)

This section proposes the MEKT approach. Its goal is to use one or multiple source subjects’ data to help the target subject, given that they have the same feature space and label space. For the ease of illustration, we focus on a single source domain first.

Assume the source domain has $n_s$ labeled instances $\{(X_i^s, y_i^s)\}_{i=1}^{n_s}$, where $X_i^s \in \mathbb{R}^{c \times t}$ is the $i$-th feature matrix, and $y_i^s \in \{0, 1\}^{l}$ is the corresponding one-hot label vector, in which $c$, $t$ and $l$ denote the number of channels, time domain samples, and classes, respectively. Assume also the target domain has $n_t$ unlabeled feature matrices $\{X_j^t\}_{j=1}^{n_t}$, where $X_j^t \in \mathbb{R}^{c \times t}$. Let $n = n_s + n_t$.

MEKT consists of the following three steps:

  1. Covariance matrix centroid alignment (CA): Align the centroids of the covariance matrices of the source and the target domains, so that their marginal probability distributions are close.

  2. Tangent space feature extraction: Map the aligned covariance matrices to tangent space feature matrices $X_s \in \mathbb{R}^{d \times n_s}$ and $X_t \in \mathbb{R}^{d \times n_t}$, where $d$ is the dimensionality of the tangent space features.

  3. Mapping matrices identification: Find projection matrices $A, B \in \mathbb{R}^{d \times p}$, where $p$ is the dimensionality of a shared subspace, such that the embeddings $A^\top X_s$ and $B^\top X_t$ are close.

After MEKT, a Euclidean space classifier can be trained on $A^\top X_s$ and applied to $B^\top X_t$.

Next, we describe the details of the above three steps.

III-A Covariance Matrix Centroid Alignment (CA)

CA serves as a preprocessing step to reduce the marginal probability distribution shift of different domains, and enables transfer from multiple source domains.

Let $C_i^s$ be the $i$-th covariance matrix in the source domain, and $\bar{C}^s$ the mean of all source domain covariance matrices, where $\bar{C}^s$ can be the Riemannian mean in (6), the Euclidean mean in (7), or the Log-Euclidean mean in (8). Then, we align the covariance matrices by

$\tilde{C}_i^s = (\bar{C}^s)^{-1/2}\, C_i^s\, (\bar{C}^s)^{-1/2}$    (14)

Similarly, we can obtain the aligned covariance matrices $\tilde{C}_j^t$ of the target domain, using the target domain mean $\bar{C}^t$.
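A minimal sketch of CA (14), applied independently to the source and the target domains using each domain's own mean covariance matrix (the Euclidean mean is used here for brevity); domain names and data are illustrative.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def center_align(covs):
    """CA (14): recenter a domain's covariance matrices around its own mean."""
    ref_inv_sqrt = fractional_matrix_power(covs.mean(axis=0), -0.5)
    return np.array([ref_inv_sqrt @ P @ ref_inv_sqrt for P in covs])

rng = np.random.default_rng(4)
def trial_covs(n, c, t, scale):
    X = scale * rng.standard_normal((n, c, t))       # crude stand-in for EEG trials
    return np.array([x @ x.T / t for x in X])

source_covs = trial_covs(40, 8, 200, scale=2.0)      # "source subject"
target_covs = trial_covs(30, 8, 200, scale=0.5)      # "target subject", different scale
src_aligned, tgt_aligned = center_align(source_covs), center_align(target_covs)
# After CA, both domains' arithmetic centers are the identity matrix, cf. (15)
print(np.allclose(src_aligned.mean(axis=0), np.eye(8), atol=1e-6),
      np.allclose(tgt_aligned.mean(axis=0), np.eye(8), atol=1e-6))
```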

CA has two desirable properties:

  1. Marginal probability distribution shift minimization. From the properties of congruent transform and congruence invariance, we have

    $\mathbb{M}\big(\tilde{C}_1, \ldots, \tilde{C}_n\big) = \bar{C}^{-1/2}\, \mathbb{M}\big(C_1, \ldots, C_n\big)\, \bar{C}^{-1/2} = \bar{C}^{-1/2}\, \bar{C}\, \bar{C}^{-1/2} = I$    (15)

    i.e., if we choose the reference matrix $\bar{C}$ as the Riemannian (or Euclidean) mean, then different domains' geometric (or arithmetic) centers all equal the identity matrix. Therefore, the marginal distributions of the source and the target domains are brought closer on the manifold.

  2. EEG trial whitening. In the following, we show that each aligned covariance matrix is approximately an identity matrix after CA.

    If we decompose the reference matrix as $\bar{C}^{-1/2} = [u_1, \ldots, u_c]^\top$, i.e., $u_i^\top$ is the $i$-th row of $\bar{C}^{-1/2}$, then the $(i, j)$-th element of $\tilde{C}_k$ is:

    $(\tilde{C}_k)_{ij} = u_i^\top C_k u_j$    (16)

    From (15) we have

    $\mathbb{M}\big(\tilde{C}_1, \ldots, \tilde{C}_n\big) = I$    (17)

    The above equation holds no matter whether $\bar{C}$ is the Riemannian mean, or the Euclidean mean.

    For CA using the Euclidean mean, the average of the $i$-th diagonal element of $\{\tilde{C}_k\}_{k=1}^{n}$ is

    $\frac{1}{n} \sum_{k=1}^{n} (\tilde{C}_k)_{ii} = \frac{1}{n} \sum_{k=1}^{n} u_i^\top C_k u_i = u_i^\top \bar{C} u_i = 1$    (18)

    Meanwhile, for each diagonal element, we have $(\tilde{C}_k)_{ii} = u_i^\top C_k u_i \ge 0$, therefore the diagonal elements of $\tilde{C}_k$ are around 1. Similarly, the off-diagonal elements of $\tilde{C}_k$ are around 0. Thus, $\tilde{C}_k$ is approximately an identity matrix, i.e., the aligned EEG trials are approximately whitened.

    CA with the Riemannian mean is an iterative process initialized by the Euclidean mean. CA with the Log-Euclidean mean is an approximation of CA with the Riemannian mean, with reduced computational cost [9]. So, (18) also holds approximately for these two means.

    This whitening effect will also be experimentally demonstrated in Section IV-E.

III-B Tangent Space Feature Extraction

After covariance matrix CA, we map each aligned covariance matrix to a tangent space feature vector in $\mathbb{R}^{c(c+1)/2}$:

$x_i^s = \mathrm{upper}\big( \log\big( \tilde{C}_i^s \big) \big)$    (19)

$x_j^t = \mathrm{upper}\big( \log\big( \tilde{C}_j^t \big) \big)$    (20)

Note that this is different from the original tangent space mapping in (9), in that (9) uses the same reference matrix for all subjects, whereas our approach uses a subject-specific reference matrix ($\bar{C}^s$ or $\bar{C}^t$) for each different subject.

Next, we form new feature matrices $X_s = [x_1^s, \ldots, x_{n_s}^s] \in \mathbb{R}^{d \times n_s}$ and $X_t = [x_1^t, \ldots, x_{n_t}^t] \in \mathbb{R}^{d \times n_t}$.

III-C Mapping Matrices Identification

CA does not reduce the conditional probability distribution discrepancies. We next find projection matrices $A, B \in \mathbb{R}^{d \times p}$, which map $X_s$ and $X_t$ to lower dimensional embeddings $A^\top X_s$ and $B^\top X_t$, with the following desirable properties:

  1. Joint probability distribution shift minimization. In traditional domain adaptation [11, 14], MMD is frequently used to reduce the marginal and conditional probability distribution discrepancies between the source and the target domains, i.e.,

    $\Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} A^\top x_i^s - \frac{1}{n_t} \sum_{j=1}^{n_t} B^\top x_j^t \Big\|^2 + \sum_{k=1}^{l} \Big\| \frac{1}{n_s^k} \sum_{x_i^s \in \mathcal{D}_s^k} A^\top x_i^s - \frac{1}{n_t^k} \sum_{x_j^t \in \mathcal{D}_t^k} B^\top x_j^t \Big\|^2$    (21)

    where $\mathcal{D}_s^k$ and $\mathcal{D}_t^k$ are the tangent space vectors in the $k$-th ($k = 1, \ldots, l$) class of the source domain and the target domain, respectively, and $n_s^k$ and $n_t^k$ are the number of examples in the $k$-th class of the source domain and the target domain, respectively.

    Next, we propose a new measure, joint probability MMD, to quantify the probability distribution shift between the source and the target domains, by considering the joint probability directly, instead of the marginal and the conditional probabilities separately.

    Let $Y_s$ be the source domain one-hot label matrix, and $\hat{Y}_t$ the predicted target domain one-hot label matrix. Then, the joint probability MMD between the source and the target domains is:

    (22)

    where

    (23)

    The joint probability MMD is based on the joint probability rather than the conditional probability, which in theory can handle more probability distribution shifts.

  2. Source domain discriminability. During subspace mapping, the discriminating ability of the source domain can be preserved by:

    $\max_{A} \ \frac{\mathrm{tr}\big(A^\top S_b A\big)}{\mathrm{tr}\big(A^\top S_w A\big)}$    (24)

    where $S_w = \sum_{k=1}^{l} \sum_{x_i^s \in \mathcal{D}_s^k} (x_i^s - \mu_k)(x_i^s - \mu_k)^\top$ is the within-class scatter matrix, and $S_b = \sum_{k=1}^{l} n_s^k (\mu_k - \mu)(\mu_k - \mu)^\top$ is the between-class scatter matrix, in which $\mu_k$ is the mean of samples from Class $k$, and $\mu$ is the mean of all samples (a numerical sketch of these scatter matrices and of the graph Laplacian used below appears after this list).

  3. Target domain locality preservation. We also introduce a graph-based regularization term to preserve the local structure in the target domain. Under the manifold assumption [29], if two samples $x_i^t$ and $x_j^t$ are close in the original target domain, then they should also be close in the projected subspace.

    Let $W \in \mathbb{R}^{n_t \times n_t}$ be a similarity matrix:

    $W_{ij} = \begin{cases} 1, & x_i^t \in N_k(x_j^t) \ \text{or} \ x_j^t \in N_k(x_i^t) \\ 0, & \text{otherwise} \end{cases}$    (25)

    where $N_k(x_j^t)$ is the set of the $k$-nearest neighbors of $x_j^t$.

    Using the normalized graph Laplacian matrix $L = D^{-1/2}(D - W)D^{-1/2}$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j} W_{ij}$, graph regularization is expressed as:

    $\min_{B} \ \mathrm{tr}\big(B^\top X_t L X_t^\top B\big)$    (26)

    To remove the scaling effect, we add a constraint on the target embedding [30]:

    $B^\top X_t H_t X_t^\top B = I$    (27)

    where $H_t = I - \frac{1}{n_t}\mathbf{1}\mathbf{1}^\top$ is the centering matrix of the target domain.

  4. Parameter transfer and regularization. Since the source and the target domains have the same feature space, and CA has brought their probability distributions closer, we want the projection matrix $B$ to be similar to the projection matrix $A$ learned in the source domain. Additionally, for better generalization performance, we want to ensure that $A$ and $B$ do not include extreme values. Thus, we have the following constraints on the projection matrices:

    $\min_{A, B} \ \| B - A \|_F^2 + \| A \|_F^2 + \| B \|_F^2$    (28)
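The regularization terms above are assembled from standard building blocks. The following sketch shows how the within-/between-class scatter matrices of (24) and the normalized graph Laplacian of (26) might be computed; it is a plain numpy illustration under the stated definitions, not the authors' implementation, and the k-nearest-neighbor rule for the similarity matrix is one common choice.

```python
import numpy as np

def scatter_matrices(Xs, ys):
    """Within-class (S_w) and between-class (S_b) scatter of source features.
    Xs: (d, n_s) tangent space features, ys: (n_s,) integer labels."""
    mu = Xs.mean(axis=1, keepdims=True)
    Sw = np.zeros((Xs.shape[0], Xs.shape[0]))
    Sb = np.zeros_like(Sw)
    for k in np.unique(ys):
        Xk = Xs[:, ys == k]
        mu_k = Xk.mean(axis=1, keepdims=True)
        Sw += (Xk - mu_k) @ (Xk - mu_k).T
        Sb += Xk.shape[1] * (mu_k - mu) @ (mu_k - mu).T
    return Sw, Sb

def knn_graph_laplacian(Xt, k=5):
    """Normalized graph Laplacian of a symmetrized k-NN similarity graph.
    Xt: (d, n_t) target features."""
    n = Xt.shape[1]
    d2 = np.sum((Xt[:, :, None] - Xt[:, None, :]) ** 2, axis=0)   # pairwise squared distances
    W = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]                       # skip self (distance 0)
    for i in range(n):
        W[i, nn[i]] = 1.0
    W = np.maximum(W, W.T)                                        # symmetrize, cf. (25)
    D = np.diag(W.sum(axis=1))
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12)))
    return D_inv_sqrt @ (D - W) @ D_inv_sqrt

# Usage on random features
rng = np.random.default_rng(5)
Xs, ys = rng.standard_normal((10, 60)), rng.integers(0, 2, 60)
Xt = rng.standard_normal((10, 40))
Sw, Sb = scatter_matrices(Xs, ys)
L = knn_graph_laplacian(Xt)
print(Sw.shape, Sb.shape, L.shape)   # (10, 10) (10, 10) (40, 40)
```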

III-D The Overall Loss Function of MEKT

Integrating all regularizations and constraints above, the formulation of MEKT is:

(29)

Let . Then, the Lagrange function is

(30)

where

(31)
(32)
(33)

Setting the derivative of the Lagrange function to zero, we have

(34)

(34) can be solved by generalized eigen-decomposition, and the solution consists of the eigenvectors corresponding to the $p$ smallest eigenvalues. Since the target domain pseudo labels $\hat{Y}_t$ are needed in the joint probability MMD [see (23)], and hence in (34), we use a general expectation-maximization like pseudo label refinement procedure [14] to refine the estimation, as shown in Algorithm 1.

Input: source domain samples $\{(X_i^s, y_i^s)\}_{i=1}^{n_s}$, where $X_i^s \in \mathbb{R}^{c \times t}$ and $y_i^s \in \{0, 1\}^{l}$;
          target domain feature matrices $\{X_j^t\}_{j=1}^{n_t}$, where $X_j^t \in \mathbb{R}^{c \times t}$;
         Number of iterations $T$;
         Weights of the regularization terms;
         Dimensionality of the shared subspace, $p$.
Output: $\hat{y}^t$, the labels for $\{X_j^t\}_{j=1}^{n_t}$.
Calculate the covariance matrices and their mean matrix $\bar{C}^s$ in the source domain, using (6), (7), or (8);
Calculate the aligned covariance matrices $\tilde{C}_i^s$ using (14);
Map each $\tilde{C}_i^s$ to a tangent space feature vector $x_i^s$ using (19) ($i = 1, \ldots, n_s$);
Repeat the above procedure to get $x_j^t$ using (20) ($j = 1, \ldots, n_t$);
Form $X_s = [x_1^s, \ldots, x_{n_s}^s]$ and $X_t = [x_1^t, \ldots, x_{n_t}^t]$;
Construct $S_w$, $S_b$, $L$, and the joint probability MMD matrix in (31)-(33);
for $\tau = 1, \ldots, T$ do
       Solve (34), and construct the solution matrix from the eigenvectors corresponding to the $p$ smallest eigenvalues;
       Construct $A$ as the first $d$ rows of the solution matrix, and $B$ as the last $d$ rows;
       Train a classifier on $(A^\top X_s, Y_s)$ and apply it to $B^\top X_t$ to update $\hat{Y}_t$;
       Update the joint probability MMD matrix in (33).
end for
return $\hat{y}^t$.
Algorithm 1 Manifold Embedded Knowledge Transfer (MEKT)

Note that for clarity of explanation, Algorithm 1 only considers one source domain. When there are multiple source domains, we perform CA and compute the tangent space feature vectors for each source domain separately, and then assemble their feature vectors into a single source domain feature matrix $X_s$.

III-E Domain Transferability Estimation (DTE)

When there are a large number of source domains, estimating domain transferability can advise which domains are more important, and also reduce the computational cost. In BCIs, DTE can be used to find subjects which have low correlations to the tasks and hence may cause negative transfer. Although source domain selection is important, it is very challenging, and hence very few publications can be found in the literature [13, 4, 31, 32].

Next, we propose an unsupervised DTE strategy.

Assume there are $K$ labeled source domains $\{(X_k, y_k)\}_{k=1}^{K}$, where $X_k$ is the feature matrix of the $k$-th source domain, and $y_k$ is the corresponding label vector. Assume also there is a target domain with unlabeled feature matrix $X_t$. Let $S_b^k$ be the between-class scatter matrix of the $k$-th source domain, similar to $S_b$ in (24), and $S_{st}^k$ be the scatter matrix between the $k$-th source domain and the target domain. We define the discriminability of the $k$-th source domain as $\mathrm{tr}(S_b^k)$, and the difference between the source domain and the target domain as $\mathrm{tr}(S_{st}^k)$.

Then, the transferability of Source Domain $k$ is computed as:

$w_k = \frac{\mathrm{tr}(S_b^k)}{\mathrm{tr}(S_{st}^k)}$    (35)

We then select the source subjects with the highest transferability $w_k$.
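Under the trace-based definitions above, a DTE ranking could be sketched as follows. This is a plain numpy illustration, not the authors' implementation: the trace-based scores and the grouping of source and target samples into two classes for the source-target scatter are assumptions of this sketch, and all names are illustrative.

```python
import numpy as np

def between_class_scatter(X, y):
    """Between-class scatter matrix of a labeled set; X is (d, n)."""
    mu = X.mean(axis=1, keepdims=True)
    Sb = np.zeros((X.shape[0], X.shape[0]))
    for k in np.unique(y):
        Xk = X[:, y == k]
        mu_k = Xk.mean(axis=1, keepdims=True)
        Sb += Xk.shape[1] * (mu_k - mu) @ (mu_k - mu).T
    return Sb

def transferability(Xs, ys, Xt):
    """Illustrative DTE score: source discriminability divided by the
    source-target difference, both measured by traces of scatter matrices, cf. (35)."""
    Sb = between_class_scatter(Xs, ys)
    # Source-target scatter: treat "source" and "target" as two groups
    X_all = np.hstack([Xs, Xt])
    domain = np.hstack([np.zeros(Xs.shape[1]), np.ones(Xt.shape[1])]).astype(int)
    Sst = between_class_scatter(X_all, domain)
    return np.trace(Sb) / max(np.trace(Sst), 1e-12)

# Rank hypothetical source subjects by transferability and keep the top half
rng = np.random.default_rng(6)
Xt = rng.standard_normal((10, 50))
sources = [(rng.standard_normal((10, 60)), rng.integers(0, 2, 60)) for _ in range(8)]
scores = [transferability(Xs, ys, Xt) for Xs, ys in sources]
selected = np.argsort(scores)[::-1][:len(sources) // 2]
print(selected)
```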

IV Experiments

In this section, we evaluate our method for both single-source to single-target (STS) transfers and multi-source to single-target (MTS) transfers. The code is available online at https://github.com/chamwen/MEKT.

IV-A Datasets

We used two MI datasets and two ERP datasets in our experiments.

For both MI datasets, a subject sat in front of a computer screen. At the beginning of a trial, a fixation cross appeared on the black screen to prompt the subject to be prepared. Shortly after, an arrow pointing to a certain direction was presented as a visual cue for a few seconds, during which the subject performed a specific MI task. Then, the visual cue disappeared, and the next trial started after a short break. EEG signal was recorded during the course, and used to classify which MI the user was performing.

For the first MI dataset (MI1, http://www.bbci.de/competition/iv/desc_1.html), 59-channel EEGs were recorded from seven healthy subjects, each with 100 left hand MIs and 100 right hand MIs. For the second MI dataset (MI2, http://www.bbci.de/competition/iv/desc_2a.pdf), 22-channel EEGs were recorded from nine healthy subjects, each with 72 left hand MIs and 72 right hand MIs. Both datasets were used for two-class classification.

The first ERP dataset (https://www.physionet.org/physiobank/database/ltrsvp/) contained 8-channel EEG recordings from 11 healthy subjects in a rapid serial visual presentation (RSVP) experiment. The images were presented at different rates (5, 6, and 10 Hz) in three different experiments. We only used the 5 Hz version. The goal was to classify from EEG whether the subject had seen a target image (with an airplane) or a non-target image (without an airplane). The number of images for different subjects varied between 368 and 565, and the target to non-target ratio was around 1:9.

The second ERP dataset (https://www.kaggle.com/c/inria-bci-challenge) was recorded from a feedback error-related negativity (ERN) experiment [33], which was used in a Kaggle competition for two-class classification. It was collected from 26 subjects with 56 electrodes, and partitioned into a training set (16 subjects) and a test set (10 subjects). We only used the 16 subjects in the training set, as we do not have access to the test set. The average target to non-target ratio was around 1:4.

IV-B EEG Data Preprocessing

EEG signals from all datasets were preprocessed using the EEGLAB toolbox [34]. For the two MI datasets, a causal 50-order finite impulse response band-pass filter (8-30 Hz) was applied to remove muscle artifacts and direct current drift, and the EEG signals in a fixed time window after the cue appearance were extracted as one trial. The RSVP signal was band-pass filtered, downsampled to 64 Hz, and epoched to a fixed interval time-locked to the stimulus onset. The ERN signal was downsampled to 200 Hz, band-pass filtered to 1-40 Hz, epoched after the feedback onset, and normalized.

MI1 had 59 EEG channels, which were not easy to manipulate directly. Thus, we reduced the number of its tangent space features to the number of source domain samples (144), according to their one-way ANOVA significance. For the ERN dataset, we used xDAWN [35] to reduce the number of channels from 56 to 6.

The dimensionalities of different input spaces are shown in Table I.

Feature Space    MI1              MI2              ERP
Euclidean        6 x 200          6 x n_k          20 x n_k
Tangent          200 x 200        253 x n_k        -
Riemannian       59 x 59 x 200    22 x 22 x n_k    -
TABLE I: Dimensionalities of different input spaces in STS tasks. $n_k$ is the number of samples in the $k$-th domain, and $c'$ the number of selected channels for the two ERP datasets.

Next, we describe how the Euclidean space features were determined.

For the two MI datasets, six CSP variances [see (4)] were used as the features. For the two ERP datasets, after spatial filtering by xDAWN, we assembled each EEG trial into a vector, performed principal component analysis on all vectors from the source subjects, and extracted the scores of the first 20 principal components as the features.

IV-C Baseline Algorithms

We compared our MEKT approaches (MEKT-R: the Riemannian mean is used as the reference matrix; MEKT-E: the Euclidean mean is used as the reference matrix; MEKT-L: the Log-Euclidean mean is used as the reference matrix) with seven state-of-the-art baseline algorithms for BCI classification. According to the feature space type, these baselines can be divided into three categories:

  1. Euclidean space approaches:

    1. CSP-LDA (linear discriminant analysis) [36] for MI, and CSP-SVM (support vector machine) [37] for ERP.

    2. EA-CSP-LDA for MI, and EA-xDAWN-SVM for ERP, i.e., we performed EA [23] as a preprocessing step before spatial filtering and classification.

  2. Riemannian space approach: RA-MDM [22] for MI, and xDAWN-RA-MDM for ERP.

  3. Tangent space approaches, which were proposed for computer vision applications, and have not been used in BCIs before. CA was used before each of them. In each learned subspace, the sLDA classifier [38] was used for MI, and SVM for ERP.

    1. CA (centroid alignment).

    2. CA-CORAL (correlation alignment) [12].

    3. CA-GFK (geodesic flow kernel) [13].

    4. CA-JDA (joint distribution adaptation) [14].

    5. CA-JGSA (joint geometrical and statistical alignment) [15].

Hyper-parameters of all baselines were set according to the recommendations in their corresponding publications. For MEKT, fixed values of the loss weights, the subspace dimensionality $p$, and the number of iterations $T$ were used in all experiments.

IV-D Experimental Settings

We evaluated unsupervised STS and MTS transfers. In STS, one subject was used as the target, and another as the source. In MTS, one subject was used as the target, and all others as the sources, which is similar to leave-one-subject-out cross-validation in traditional BCI classification.

Let $m$ be the number of subjects in a dataset. Then, there are $m(m-1)$ different STS tasks, and $m$ different MTS tasks.

The balanced classification accuracy (BCA) was used as the performance measure:

$\mathrm{BCA} = \frac{1}{l} \sum_{k=1}^{l} \frac{\mathrm{TP}_k}{n_k}$    (36)

where $\mathrm{TP}_k$ and $n_k$ are the number of true positives and the number of samples in Class $k$, respectively.
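A minimal sketch of (36); the toy example illustrates why BCA, rather than raw accuracy, is informative on the class-imbalanced ERP datasets (scikit-learn's balanced_accuracy_score computes the same quantity).

```python
import numpy as np

def balanced_classification_accuracy(y_true, y_pred):
    """BCA (36): average of per-class recalls (true positives over class size)."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == k] == k) for k in classes]
    return float(np.mean(recalls))

# Example: always predicting the majority class of a 1:9 split
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.zeros(100, dtype=int)
print(balanced_classification_accuracy(y_true, y_pred))   # 0.5, not 0.9
```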

IV-E Visualization

As explained in Section III-A, CA makes the aligned covariance matrices approximate the identity matrix, no matter whether the Riemannian mean, or the Euclidean mean, or the Log-Euclidean mean, is used as the reference matrix. To demonstrate that, Fig. 2 shows the raw covariance matrix of the first EEG trial of Subject 1 in MI2, and the aligned covariance matrices using different references. The raw covariance matrix is nowhere close to identity, but after CA, the covariance matrices are approximately identity, and hence the corresponding EEG trials are approximately whitened.

Fig. 2: The raw covariance matrix (Trial 1, Subject 1, MI2), and those after CA using different reference matrices.

Next, we used t-SNE [39] to reduce the dimensionality of the EEG trials to two, and visualized whether MEKT can bring the data distributions of the source and the target domains together. Fig. 3 shows the results on transferring Subject 2's data to Subject 1 in MI2, before and after different data alignment approaches. Before CA, the source domain and target domain samples do not overlap at all. After CA, the two sets of samples have identical means, but different variances. CA-GFK and CA-JDA make the variance of the source domain samples and the variance of the target domain samples approximately identical, but different classes are still not well separated. MEKT-R not only makes the overall distributions of the source domain samples and the target domain samples consistent, but also brings samples from the same class in the two domains close, which should benefit the classification.

Fig. 3: t-SNE visualization of the data distributions before and after CA, and with different transfer learning approaches, when transferring Subject 2's data (source) to Subject 1 (target) in MI2.
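A minimal sketch of this kind of visualization, assuming the source and target tangent space features are stacked row-wise; scikit-learn's TSNE is used, and the synthetic features and plotting details are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
Xs = rng.standard_normal((120, 50)) + 1.0   # stand-in source features (n_s x d)
Xt = rng.standard_normal((100, 50)) - 1.0   # stand-in target features (n_t x d)

emb = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(np.vstack([Xs, Xt]))
plt.scatter(emb[:len(Xs), 0], emb[:len(Xs), 1], marker='s', label='source')
plt.scatter(emb[len(Xs):, 0], emb[len(Xs):, 1], marker='o', label='target')
plt.legend()
plt.title('t-SNE of source and target features')
plt.show()
```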

IV-F Classification Accuracies

The average BCAs on the four datasets are shown in Table II. All MEKT-based approaches outperformed the baselines.

                  STS               MTS
                  MI1      MI2      MI1      MI2      Avg
CSP-LDA           57.61    58.60    59.71    67.82    60.94
RA-MDM            64.98    66.60    73.29    72.07    69.24
EA-CSP-LDA        66.96    65.16    79.79    73.53    71.36
CA                66.17    66.02    76.29    71.84    70.08
CA-CORAL          67.69    67.26    78.86    72.38    71.55
CA-GFK            66.62    65.54    76.79    72.99    70.49
CA-JDA            66.01    66.59    81.07    74.15    71.96
CA-JGSA           65.81    65.90    76.79    73.07    70.39
MEKT-E            69.19    68.34    81.29    76.00    73.71
MEKT-L            70.74    68.56    83.07    76.54    74.73
MEKT-R            70.99    68.74    83.42    76.31    74.87

                  STS               MTS
                  RSVP     ERN      RSVP     ERN      Avg
xDAWN-SVM         58.58    54.34    65.36    61.87    60.04
xDAWN-RA-MDM      60.37    56.22    67.29    62.90    61.70
EA-xDAWN-SVM      58.76    55.57    69.07    64.63    62.01
CA                58.34    56.97    67.35    65.89    62.14
CA-CORAL          58.45    57.04    66.94    66.17    62.15
CA-GFK            59.93    57.24    67.75    66.03    62.74
CA-JDA            60.27    57.56    66.06    64.64    62.13
CA-JGSA           55.23    57.17    64.57    57.68    58.66
MEKT-E            61.08    58.01    67.92    66.70    63.43
MEKT-L            61.15    57.91    68.40    65.98    63.36
MEKT-R            61.24    57.85    68.38    66.17    63.41

TABLE II: Average BCAs (%) in STS and MTS transfers.

Fig. 4 shows the BCAs of all tangent space based approaches when different reference matrices were used in CA. The Riemannian mean obtained the best BCA in four out of the six approaches, and also the best overall performance.

Fig. 4: Average BCAs (%) of the tangent space approaches on the four datasets, when different reference matrices were used in CA.

We also performed paired t-tests on the BCAs to check if the performance improvements of MEKT-R over the others were statistically significant. Before each t-test, we performed a Lilliefors test [40] to verify that the null hypothesis that the data come from a normal distribution cannot be rejected. Then, we performed false discovery rate corrections [41] by a linear step-up procedure under a fixed significance level ($\alpha = 0.05$) on the paired t-test $p$-values of each task.
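A sketch of this testing procedure on paired BCA vectors, under the assumption that the per-task BCAs of the two methods are stored in equal-length arrays; scipy and statsmodels provide the paired t-test, the Lilliefors test, and the Benjamini-Hochberg (linear step-up) FDR correction. The toy data below are placeholders.

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.diagnostic import lilliefors
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(9)
bca_mekt = 70 + rng.standard_normal((5, 42))           # toy BCAs: 5 baselines x 42 STS tasks
bca_base = bca_mekt - rng.uniform(0, 3, size=(5, 42))

p_values = []
for mekt, base in zip(bca_mekt, bca_base):
    diff = mekt - base
    _, p_norm = lilliefors(diff, dist='norm')          # normality check of the paired differences
    _, p = ttest_rel(mekt, base)                       # paired t-test
    p_values.append(p)

# Benjamini-Hochberg FDR correction at alpha = 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(np.round(p_adjusted, 4), reject)
```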

The false discovery rate adjusted $p$-values are shown in Table III. MEKT-R significantly outperformed all baselines in almost all STS transfers. The performance improvements became less significant when there were multiple source domains, which is reasonable, because generally in machine learning the differences between different algorithms diminish as the amount of training data increases.

MEKT-R vs.          MI1      MI2      RSVP     ERN
STS
  CSP-LDA           .0000    .0000    -        -
  xDAWN-SVM         -        -        .0002    .0000
  RA-MDM            .0003    .0340    .0412    .0004
  EA-CSP-LDA        .0044    .0003    -        -
  EA-xDAWN-SVM      -        -        .0000    .0000
  CA                .0000    .0006    .0000    .0010
  CA-CORAL          .0005    .0340    .0000    .0014
  CA-GFK            .0000    .0001    .0016    .0130
  CA-JDA            .0003    .0183    .0386    .2627
  CA-JGSA           .0021    .0006    .0000    .0241
MTS
  CSP-LDA           .0329    .1239    -        -
  xDAWN-SVM         -        -        .2077    .0306
  xDAWN-RA-MDM      .0824    .1636    .5347    .0632
  EA-CSP-LDA        .2808    .1636    -        -
  EA-xDAWN-SVM      -        -        .5733    .2632
  CA                .0329    .1260    .4727    .8380
  CA-CORAL          .0897    .1636    .3477    .9914
  CA-GFK            .0824    .1260    .5347    .9117
  CA-JDA            .2379    .1636    .0349    .0632
  CA-JGSA           .1344    .1636    .0323    .0018
TABLE III: False discovery rate adjusted $p$-values in paired t-tests ($\alpha = 0.05$). For the CA-based approaches, the sLDA classifier was used for MI, and SVM for ERP.

IV-G Computational Cost

This subsection compares the computational cost of different algorithms, which were implemented in Matlab 2018a on a laptop with i7-8550U CPU@2.00GHz, 8GB memory, running 64-bit Windows 10 Education Edition.

For MTS transfers, base classifier construction took the most time, because there were a large number of source domain training samples. To emphasize the computational cost of different data alignment approaches, we only show the computing time on MI2 and RSVP STS tasks in Fig. 5. EA was the most efficient. RA-MDM, CA-JDA and MEKT-R had similar computational cost. MEKT-L and MEKT-E had comparable performance with MEKT-R (Table II), but much shorter computing time. MEKT-L seemed to be the best compromise between classification accuracy and computational cost.

Fig. 5: Computing time (seconds) in MI2 and RSVP STS tasks.

IV-H Effectiveness of the Joint Probability MMD

To validate the superiority of the joint probability MMD over the traditional MMD, we replaced the joint probability MMD term in (29) by the traditional MMD term in (21), and repeated the experiments. The results are shown in Table IV. The joint probability MMD outperformed the traditional MMD in six out of the eight tasks. We expect that the joint probability MMD should also be advantageous in other applications in which the traditional MMD is currently used.

              MMD in (21)    Joint probability MMD in (22)
STS   MI1     65.33          70.99
      MI2     66.78          68.74
      RSVP    61.11          61.24
      ERN     58.62          57.85
MTS   MI1     73.86          83.42
      MI2     74.23          76.31
      RSVP    69.33          68.38
      ERN     65.59          66.17
Avg           66.86          69.14
TABLE IV: Average BCAs (%) when the traditional MMD in (21) or the joint probability MMD in (22) was used in (29).

IV-I Effectiveness of DTE

This subsection validates our DTE strategy on MTS tasks to select the most beneficial source subjects.

Table V shows the BCAs when different source domain selection approaches were used: RAND randomly selected the same number of source subjects as DTE [because there was randomness, we repeated the experiment 20 times, and report the mean and standard deviation (in the parentheses)], ROD was the approach proposed in [13], and ALL used all source subjects. Table VI shows the computational cost of different algorithms.

Tables V and VI show that the proposed DTE outperformed RAND and ROD in terms of the classification accuracy. Although its BCAs were generally slightly worse than those of ALL, its computational cost was much lower than ALL's, especially when the number of source subjects was large, e.g., it saved over 50% of the computational cost on RSVP and ERN.

         m     RAND            ROD      DTE      ALL
MI1      7     81.53 (1.19)    81.86    82.14    83.42
MI2      9     75.05 (1.06)    74.38    76.23    76.31
RSVP     11    67.48 (0.31)    67.79    68.70    68.38
ERN      16    65.31 (0.52)    65.36    65.51    66.17
TABLE V: Average BCAs (%) with different source domain selection approaches. RAND, ROD and DTE each selected the same number of source subjects; ALL used all source subjects. $m$ is the number of subjects in the dataset.
         m     RAND     ROD      DTE      ALL
MI1      7     11.55    12.46    11.77    12.84
MI2      9     0.72     0.91     0.76     1.11
RSVP     11    4.01     4.31     4.08     8.64
ERN      16    7.65     8.28     7.79     15.80
TABLE VI: Computing time (seconds) of different source domain selection approaches. RAND, ROD and DTE each selected the same number of source subjects; ALL used all source subjects. $m$ is the number of subjects in the dataset.

V Conclusions

Transfer learning is popular in EEG-based BCIs to cope with variations among different subjects and/or tasks. This paper has considered offline unsupervised cross-subject EEG classification, i.e., we have labeled EEG trials from one or more source subjects, but only unlabeled EEG trials from the target subject. We proposed a novel MEKT approach, which has three steps: 1) align the covariance matrices of the EEG trials in the Riemannian manifold; 2) extract tangent space features; and, 3) perform domain adaptation by minimizing the joint probability distribution shift between the source and the target domains, while preserving their geometric structures. An optional fourth step, DTE, was also proposed to identify the most beneficial source domains, and hence to reduce the computational cost. Experiments on four EEG datasets from two different BCI paradigms demonstrated that MEKT outperformed several state-of-the-art transfer learning approaches. Moreover, DTE can reduce more than half of the computational cost when the number of source subjects is large, with little sacrifice of classification accuracy.

References

  • [1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, 2002.
  • [2] R. P. Rao, Brain-computer interfacing: an introduction.   Cambridge, England: Cambridge University Press, 2013.
  • [3] B. He, B. Baxter, B. J. Edelman, C. C. Cline, and W. W. Ye, “Noninvasive brain-computer interfaces based on sensorimotor rhythms,” Proc. of the IEEE, vol. 103, no. 6, pp. 907–925, May 2015.
  • [4] D. Wu, “Online and offline domain adaptation for reducing BCI calibration effort,” IEEE Trans. on Human-Machine Systems, vol. 47, no. 4, pp. 550–563, 2017.
  • [5] F. Lotte, L. Bougrain, A. Cichocki, M. Clerc, M. Congedo, A. Rakotomamonjy, and F. Yger, “A review of classification algorithms for EEG-based brain–computer interfaces: a 10 year update,” Journal of neural engineering, vol. 15, no. 3, p. 031005, 2018.
  • [6] Z. J. Koles, M. S. Lazar, and S. Z. Zhou, “Spatial patterns underlying population differences in the background EEG,” Brain Topography, vol. 2, no. 4, pp. 275–284, 1990.
  • [7] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass brain-computer interface classification by Riemannian geometry,” IEEE Trans. on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, Apr. 2012.
  • [8] F. Yger, M. Berar, and F. Lotte, “Riemannian approaches in brain-computer interfaces: a review,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, Nov. 2017.
  • [9] A. Barachant and M. Congedo, “A plug & play P300 BCI using information geometry,” arXiv: 1409.0107, 2014.
  • [10] L. Korczowski, M. Congedo, and C. Jutten, “Single-trial classification of multi-user P300-based Brain-Computer Interface using riemannian geometry,” in Proc. 37th Annu. Int’l. Conf. IEEE Eng. Med. Biol. Soc., Milan, Italy, Aug. 2015, pp. 1769–1772.
  • [11] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. on Neural Networks, vol. 22, no. 2, pp. 199–210, Feb. 2011.
  • [12] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. 30th AAAI Conf. on Artificial Intell., Arizona, Feb. 2016.
  • [13] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Providence, RI, Jun. 2012, pp. 2066–2073.
  • [14] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proc. IEEE Int’l Conf. on Computer Vision, Sydney, Australia, Dec. 2013, pp. 2200–2207.
  • [15] J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and statistical alignment for visual domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, Jul. 2017, pp. 1859–1867.
  • [16] D. Wu, B. J. Lance, and T. D. Parsons, “Collaborative filtering for brain-computer interaction using transfer learning and active class selection,” PLOS one, vol. 8, no. 2, p. e56624, 2013.
  • [17] D. Wu, V. J. Lawhern, W. D. Hairston, and B. J. Lance, “Switching EEG headsets made easy: Reducing offline calibration effort using active weighted adaptation regularization,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 24, no. 11, pp. 1125–1137, Mar. 2016.
  • [18] V. Jayaram, M. Alamgir, Y. Altun, B. Scholkopf, and M. Grosse-Wentrup, “Transfer learning in brain-computer interfaces,” IEEE Comput. Intell. Mag., vol. 11, no. 1, pp. 20–31, Jan. 2016.
  • [19] H. Kang, Y. Nam, and S. Choi, “Composite common spatial pattern for subject-to-subject transfer,” IEEE Signal Processing Letters, vol. 16, no. 8, pp. 683–686, 2009.
  • [20] F. Lotte and C. Guan, “Learning from other subjects helps reducing brain-computer interface calibration time,” in Proc. IEEE Int’l. Conf. on Acoustics Speech and Signal Processing, Dallas, TX, Mar. 2010, pp. 614–617.
  • [21] Y. Jin, M. Mousavi, and V. R. de Sa, “Adaptive CSP with subspace alignment for subject-to-subject transfer in motor imagery brain-computer interfaces,” in Proc. 6th Int’l. Conf. on Brain-Computer Interface (BCI), GangWon, South Korea, 2018, pp. 1–4.
  • [22] P. Zanini, M. Congedo, C. Jutten, S. Said, and Y. Berthoumieu, “Transfer learning: a Riemannian geometry framework with applications to brain-computer interfaces,” IEEE Trans. on Biomedical Engineering, vol. 65, no. 5, pp. 1107–1116, Aug. 2018.
  • [23] H. He and D. Wu, “Transfer learning for brain-computer interfaces: A Euclidean space data alignment approach,” IEEE Trans. on Biomedical Engineering, Apr. 2019.
  • [24] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Classification of covariance matrices using a riemannian-based kernel for BCI applications,” Neurocomputing, vol. 112, pp. 172–178, 2013.
  • [25] H. Ramoser, J. Muller-Gerking, and G. Pfurtscheller, “Optimal spatial filtering of single trial EEG during imagined hand movement,” IEEE Trans. on Rehabilitation Engineering, vol. 8, no. 4, pp. 441–446, 2000.
  • [26] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Log-Euclidean metrics for fast and simple calculus on diffusion tensors,” Magnetic Resonance in Medicine, vol. 56, no. 2, pp. 411–421, 2006.
  • [27] R. Bhatia, Positive Definite Matrices.   New Jersey: Princeton University Press, 2009.
  • [28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. 3, pp. 723–773, Mar. 2012.
  • [29] M. Belkin and P. Niyogi, “Semi-supervised learning on Riemannian manifolds,” Machine Learning, vol. 56, no. 1-3, pp. 209–239, 2004.
  • [30] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [31] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance, and C.-T. Lin, “Driver drowsiness estimation from EEG signals using online weighted adaptation regularization for regression (OwARR),” IEEE Trans. on Fuzzy Systems, vol. 25, no. 6, pp. 1522–1535, 2017.
  • [32] C.-S. Wei, Y.-P. Lin, Y.-T. Wang, T.-P. Jung, N. Bigdely-Shamlo, and C.-T. Lin, “Selective transfer learning for EEG-based drowsiness detection,” in Proc. IEEE Int’l Conf. on Systems, Man and Cybernetics, Hong Kong, October 2015, pp. 3229–3232.
  • [33] P. Margaux, M. Emmanuel, D. Sébastien, B. Olivier, and M. Jérémie, “Objective and subjective evaluation of online error correction during P300-based spelling,” Advances in Human-Computer Interaction, vol. 2012, p. 4, 2012.
  • [34] A. Delorme and S. Makeig, “EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [35] B. Rivet, A. Souloumiac, V. Attina, and G. Gibert, “xDAWN algorithm to enhance evoked potentials: application to brain-computer interface,” IEEE Trans. on Biomedical Engineering, vol. 56, no. 8, pp. 2035–2043, Aug. 2009.
  • [36] C. M. Bishop, Pattern recognition and machine learning.   New York: springer, 2006.
  • [37] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. on Intell. Systems and Technol., vol. 2, no. 3, p. 27, Apr. 2011.
  • [38] R. Peck and J. Van Ness, “The use of shrinkage estimators in linear discriminant analysis,” IEEE Trans. on Pattern Analysis and Machine Intell., vol. 4, no. 5, pp. 530–537, May 1982.
  • [39] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov., pp. 2579–2605, 2008.
  • [40] H. W. Lilliefors, “On the Kolmogorov-Smirnov test for normality with mean and variance unknown,” Journal of the American statistical Association, vol. 62, no. 318, pp. 399–402, 1967.
  • [41] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal statistical society: series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.