I Introduction
A brain-computer interface (BCI) system [1, 2] acquires the brain signal, decodes it, and then translates it into control commands for external devices, so that a user can interact with his/her surroundings directly using thoughts, bypassing the normal pathway of peripheral nerves and muscles. The electroencephalogram (EEG) may be the most popular BCI input signal due to its convenience, safety, and low cost. The pipeline for decoding EEG signals usually involves:

Signal processing, which includes band-pass filtering and spatial filtering. Band-pass filtering reduces interference and noise such as muscle artifacts, eye blinks, and DC drift. Spatial filtering combines different EEG channels to increase the signal-to-noise ratio. Common spatial patterns (CSP) [3, 4, 5, 6] may be the most frequently used spatial filtering approach.

Feature extraction. Different features, e.g., time domain, frequency domain, time-frequency domain, and Riemannian space features, could be used.

Classification. Popular classifiers include linear discriminant analysis (LDA) and support vector machine (SVM).
Recently, Barachant et al. [7] proposed a novel preprocessing and classification pipeline in the Riemannian space, which integrated spatial filtering and feature extraction into one single step. This Riemannian pipeline uses the covariance matrices of the EEG trials, which are symmetric positive definite and lie on a Riemannian manifold [8]. The covariance matrices encode spatial information of the brain activities, which are useful in many BCI tasks. A popular classifier in the Riemannian space, minimum distance to mean [7], treats the covariance matrices as points on the Riemannian manifold, and uses their Riemannian distances to the class mean for classification. Another more sophisticated approach maps the covariance matrices from the Riemannian space to a Euclidean tangent space (TS) around the Riemannian mean, where the Riemannian space covariance matrices are transformed into Euclidean space vectors, and then used in Euclidean space classifiers as features.
Motor imagery [9] is one of the most frequently used paradigms of BCIs. It is based on the voluntary modulation of the sensorimotor rhythm, which does not need any external stimuli. The imagined movements of different body parts (e.g., hands, feet, and tongue) cause modulations of brain rhythms in the involved cortical areas. So, they can be distinguished by decoding such brain rhythm modulations, and used to control external devices such as powered exoskeletons, wheelchairs, and robots.
Motor imagery-based BCIs were originally designed to help those with neuromuscular impairments [10]. Recent research has extended their application scope to able-bodied users [11, 12]. However, EEG signals are very weak, and easily contaminated by interference and noise. Moreover, individual differences make it difficult, if not impossible, to build a generic machine learning model optimal for all subjects. Usually a calibration session is needed to collect some subject-specific data for each new subject, which is time-consuming and user-unfriendly.
Researchers have proposed many different approaches [13, 14, 15, 16, 17, 18, 19, 20] to reduce this calibration effort. One of them is transfer learning [21], or domain adaptation (DA). Its main idea is to leverage data from auxiliary subjects (called source subjects, or source domains) to improve the learning performance for a new subject (called the target subject, or target domain). A popular idea in DA is to project the source domain and target domain data into low-dimensional subspaces where the geometrical and/or distribution shifts are reduced, such as joint distribution adaptation (JDA) [22], joint geometrical and statistical alignment (JGSA) [23], and manifold embedded distribution alignment (MEDA) [24]. Computational intelligence techniques have also been used in transfer learning, as reviewed by Lu et al. [25]. In BCIs, Zanini et al. [26] proposed a Riemannian geometry framework to align EEG covariance matrices from different subjects in the Riemannian space. Recently, we [27] proposed a Euclidean alignment (EA) approach, which can be used as a preprocessing step before many Euclidean space feature extraction and pattern recognition algorithms.
However, most existing DA approaches assume that the source domains have the same feature space and label space as the target domain, which may not hold in many real-world applications. There have been some heterogeneous feature space DA approaches [14, 28, 29], which address the problem that the source domains have different feature spaces from the target domain. For example, in BCIs, Wu et al. [14] performed transfer learning across heterogeneous feature spaces: the source and target EEG trials were collected from different EEG headsets, with different numbers of channels and channel locations. The main idea is to select the source domain channels closest to the target domain channels.
There have also been a few heterogeneous label space DA approaches [30, 31, 32, 33], as shown in Fig. 1. Busto et al. [30] first proposed the concept of open set DA, assuming the source and target domains have some known classes in common, and also some classes that are different and unknown. Saito et al. [31] considered the case that the target domain contains all classes in the source domain, plus an “unknown” class (different from [30], herein the source domain does not contain an “unknown” class). You et al. [33] proposed universal DA, which classifies a target domain sample into a known class if it belongs to one in the source domain, or marks it as “unknown” otherwise. In summary, both open set DA and universal DA train a model to either classify a target domain sample into a known class that has appeared in the source domain, or mark it as “unknown”. A typical application scenario of open set DA and universal DA is face recognition, where some test samples may not appear in the training database and have to be marked as “unknown”.
This paper considers different set DA in BCIs, i.e., the source domains have different label spaces from the target domain, as shown in Fig. 1. For motor imagery-based BCIs, this means the source subjects and the target subject perform different motor imagery tasks. To our knowledge, this problem has not been studied before in BCIs.
Motivated by EA, a state-of-the-art homogeneous transfer learning approach for BCIs, we propose a label alignment (LA) approach to align the EEG covariance matrices of the source subjects to those of the target subject, so that a model trained on the source subjects can classify each target trial into the category it actually belongs to, even though the source and target subjects have completely different label spaces.
The main contributions of this paper are:

We introduce a practical setting of different set DA in BCIs: The source and target domains have known and different label sets; we need to classify each target trial into the category it actually belongs to, with the help of the source domain data. This setting is different from existing open set DA and universal DA. To our knowledge, it has not been studied before in BCIs.

We propose an effective LA approach for different set DA in BCIs, which has three desirable properties: 1) It only needs as few as one labeled EEG trial from each class of the target subject; 2) It can be used as a preprocessing step in different feature extraction and classification algorithms; and, 3) It can be integrated with other DA approaches to achieve even better performance.
The remainder of this paper is organized as follows: Section II introduces related background knowledge on the Riemannian space and EA. Section III proposes LA. Section IV introduces the datasets used in our experiments. Section V compares the performance of LA with several other DA approaches. Finally, Section VI draws conclusions and points out some future research directions.
II Related Work
This section introduces some basic concepts of the Riemannian space and its TS, and EA, a state-of-the-art data alignment approach for BCIs, which also motivated our proposed LA.
II-A Riemannian Distance
Each symmetric positive definite matrix can be viewed as a point on a Riemannian manifold. The Riemannian distance between two symmetric positive definite matrices $P_1$ and $P_2$ is the length of the geodesic, defined as the minimum length curve connecting $P_1$ and $P_2$ on the Riemannian manifold:

$$\delta(P_1, P_2) = \left\| \log\left(P_1^{-1} P_2\right) \right\|_F = \Big[\sum_{i=1}^{C} \log^2 \lambda_i\Big]^{1/2} \qquad (1)$$

where the subscript $F$ denotes the Frobenius norm, and $\lambda_i$ ($i = 1, \ldots, C$) are the real eigenvalues of $P_1^{-1} P_2$. $\delta(P_1, P_2)$ remains unchanged under linear invertible transformations:

$$\delta\left(W^T P_1 W, W^T P_2 W\right) = \delta(P_1, P_2) \qquad (2)$$

where $W$ is an invertible matrix. This property, called congruence invariance, is useful in both EA and LA.

II-B Tangent Space (TS) Mapping
Most machine learning approaches are applicable only in the Euclidean space, and cannot be used in the Riemannian space. TS mapping maps the covariance matrices from the Riemannian space to a Euclidean TS, so that they can be used by a Euclidean space classifier.
For each point $P$ on the Riemannian manifold, the TS can be defined by a set of tangent vectors at $P$. Each tangent vector $S_i$ is defined as the derivative at $t = 0$ of the geodesic between $P$ and the exponential mapping $\mathrm{Exp}_P(S_i)$:

$$\mathrm{Exp}_P(S_i) = P^{1/2} \exp\left(P^{-1/2} S_i P^{-1/2}\right) P^{1/2} \qquad (3)$$

The inverse mapping is given by the logarithmic mapping:

$$S_i = \mathrm{Log}_P(P_i) = P^{1/2} \log\left(P^{-1/2} P_i P^{-1/2}\right) P^{1/2} \qquad (4)$$
TS mapping converts each 2D EEG trial into a 1D feature vector, so that many machine learning algorithms can be used.
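The Riemannian distance in (1) and the logarithmic mapping in (4) can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration using the standard formulas; the variable names are ours. In practice (e.g., in pyriemann), the symmetric tangent matrix would be further vectorized into a 1D feature vector by stacking its upper triangle, with off-diagonal entries weighted by $\sqrt{2}$:

```python
import numpy as np
from scipy.linalg import eigvals, inv, logm, sqrtm

def riemannian_distance(P1, P2):
    """Riemannian distance between two SPD matrices, as in Eq. (1)."""
    lam = np.real(eigvals(np.linalg.solve(P1, P2)))  # real eigenvalues of P1^{-1} P2
    return np.sqrt(np.sum(np.log(lam) ** 2))

def tangent_space_map(Pi, Pref):
    """Logarithmic mapping of SPD matrix Pi onto the TS at Pref, as in Eq. (4)."""
    R = np.real(sqrtm(Pref))      # Pref^{1/2}
    Rinv = inv(R)                 # Pref^{-1/2}
    return R @ np.real(logm(Rinv @ Pi @ Rinv)) @ R
```

The congruence invariance property in (2) can be checked numerically: the distance between two SPD matrices is unchanged after both are transformed by the same invertible matrix.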
II-C Euclidean Alignment (EA)
EA [27] is a state-of-the-art DA approach for BCIs, which reduces individual differences by aligning the EEG covariance matrices.
Some DA approaches [22, 14] first find a proper discrepancy measure between different distributions, then learn a shared subspace where the distribution discrepancy is explicitly minimized. Maximum mean discrepancy [34] is a popular distribution discrepancy measure, which is defined as the distance between the mean feature embeddings of different distributions.
Similar to these maximum mean discrepancy based DA approaches, EA attempts to find a projection for each subject so that the distance between different subjects is minimized. More specifically, EA views the covariance matrices as the feature embeddings of different EEG trials, and finds projections to minimize the distance between the mean covariance matrices of different subjects.
For Subject $k$ with $n$ trials, EA first computes the mean covariance matrix by

$$\bar{R} = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T \qquad (5)$$

where $X_i \in \mathbb{R}^{C \times T}$ is the $i$th EEG trial of Subject $k$, in which $C$ is the number of channels, and $T$ the number of time domain samples. The projection matrix for Subject $k$ is then computed as:

$$P = \bar{R}^{-1/2} \qquad (6)$$

Finally, EA performs the following projection for each covariance matrix $C_i = X_i X_i^T$ of Subject $k$:

$$\tilde{C}_i = \bar{R}^{-1/2} C_i \bar{R}^{-1/2} \qquad (7)$$

After the alignment, the mean covariance matrix of Subject $k$ is:

$$\frac{1}{n} \sum_{i=1}^{n} \tilde{C}_i = \bar{R}^{-1/2} \Big(\frac{1}{n} \sum_{i=1}^{n} C_i\Big) \bar{R}^{-1/2} = \bar{R}^{-1/2} \bar{R} \bar{R}^{-1/2} = I \qquad (8)$$

i.e., it becomes the identity matrix.
After performing EA for all subjects, they share the same mean covariance matrix, i.e., the distances between the mean covariance matrices of different subjects are minimized (they become zero), and hence data distributions from different subjects become more similar.
We can also understand EA as a correction of data shift. If we view each EEG covariance matrix as a point on a Riemannian manifold, then individual differences cause shifts of these points, although they may entail more than just a simple displacement [26]. In order to correct this shift, EA moves the covariance matrices of each subject to center them at the identity matrix. The congruence invariance property makes sure that the distances among the withinsubject covariance matrices remain unchanged. So, EA makes the data distributions from different subjects closer, while preserving the local distance information of each subject.
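The EA procedure in (5)-(7) amounts to whitening each subject's trials by that subject's mean covariance matrix. A minimal NumPy/SciPy sketch (our own variable names; per-trial covariances are left unnormalized for brevity):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def euclidean_alignment(trials):
    """EA for one subject. trials: (n_trials, n_channels, n_samples) array."""
    covs = np.array([X @ X.T for X in trials])         # per-trial covariance matrices
    R_bar = covs.mean(axis=0)                          # mean covariance matrix, Eq. (5)
    P = np.real(fractional_matrix_power(R_bar, -0.5))  # projection matrix, Eq. (6)
    return np.array([P @ X for X in trials])           # aligned trials; covariances follow Eq. (7)

# toy example: after EA, the subject's mean covariance matrix equals the identity, Eq. (8)
rng = np.random.default_rng(42)
aligned = euclidean_alignment(rng.standard_normal((20, 8, 100)))
mean_cov = np.mean([X @ X.T for X in aligned], axis=0)
```

Applying this independently to every subject makes all subjects share the identity as their mean covariance matrix, which is exactly the alignment property stated above.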
III Label Alignment (LA) for Different Set DA
This section introduces our proposed LA for different set DA, and discusses its relationship with EA and CORAL [35].
III-A Label Alignment (LA)
Generally, there are three types of data shift in transfer learning:

Covariate shift: the distributions of the inputs are different.

Prior probability shift: the distributions of the output are different.

Concept shift [38]: the relationships between the inputs and the output are different.

EA considers only the covariate shift, but ignores the other two. Although it has been shown to significantly improve the cross-subject classification performance in [27], it only aligns the data in the feature space, and may not work well when the source subjects and the target subject have different label spaces.
This section proposes LA, which extends EA to different label spaces by simultaneously considering multiple types of data shift. Its main idea is to independently move the per-class covariance matrices of each source subject, to recenter them at the corresponding class center of the target subject.
More specifically, for an $m$-class classification problem, we assume the source and target subjects have the same number of classes, but their class labels are partially or completely different. Our goal is to use the source data to help the classification of the target trials. LA seeks a transformation matrix $A_c$ for the trials of the $c$th class ($c = 1, \ldots, m$) from Source Subject $k$, such that the distance between the mean covariance matrices of the corresponding classes in different domains is minimized:

$$\min_{A_c} \left\| A_c \bar{C}_{s,c} A_c^T - \bar{C}_{t,c} \right\|^2 \qquad (9)$$

where $\bar{C}_{s,c}$ is the mean covariance matrix of the $c$th class of Source Subject $k$, and $\bar{C}_{t,c}$ the mean covariance matrix of the corresponding class of the target subject. In this paper, we use the Log-Euclidean mean [39], which is frequently used for symmetric positive definite matrices and much easier to compute than the Riemannian mean. A closed-form solution to (9) is

$$A_c = \bar{C}_{t,c}^{1/2} \bar{C}_{s,c}^{-1/2} \qquad (10)$$

Then, each EEG trial $X_i$ of the $c$th class of Source Subject $k$ is transformed to:

$$\tilde{X}_i = A_c X_i \qquad (11)$$

The difference between the mean covariance matrices of the corresponding class of the transformed source subject and the target subject becomes

$$A_c \bar{C}_{s,c} A_c^T - \bar{C}_{t,c} = \bar{C}_{t,c}^{1/2} \bar{C}_{s,c}^{-1/2} \bar{C}_{s,c} \bar{C}_{s,c}^{-1/2} \bar{C}_{t,c}^{1/2} - \bar{C}_{t,c} = \mathbf{0} \qquad (12)$$
where $\mathbf{0}$ is an all-zero matrix, i.e., the objective function in (9) is minimized.

A key question in LA is how to obtain $\bar{C}_{t,c}$, which requires some labeled target domain samples. We consider the following offline classification scenario: we have access to the unlabeled EEG trials of the target subject (the same assumption is also used in EA), and we can label a few of them to estimate $\bar{C}_{t,c}$. To have a good estimate of $\bar{C}_{t,c}$ from only a few labeled trials, we need to select these trials very carefully. In this paper, we perform $k$-medoids clustering based on the Riemannian distances among the target EEG trials, label the cluster centers, and then estimate $\bar{C}_{t,c}$ from them. In the rare case that the cluster centers have fewer than $m$ different labels, we use EA to replace LA.

III-B Label Alignment (LA) versus Euclidean Alignment (EA)
The difference between LA and EA is illustrated in Fig. 2. For clarity, binary classification is used, but both EA and LA can be easily extended to multiclass classification, as shown later in this paper. In Fig. 2, each EEG trial is represented by its covariance matrix, as a point on a Riemannian manifold. The source domain (blue points) and target domain (black points) represent two different subjects, who have trials from different motor imagery tasks (indicated by different shapes of the points. Note that the shapes in the target domain are only used to help understand our approach, but not to suggest that we need to know all target labels). Initially, the source and target domains scatter far away from each other, due to the domain gap and also the category gap. If we build a classifier on the source domain (indicated by the red dashed line) and apply it directly to the target domain, it may not work at all. EA and LA alleviate this problem by reducing the gaps between the two domains before classification:

EA focuses on the domain gap but ignores the category gap completely. It first computes the mean covariance matrix of each domain (indicated by the red stars), from which a transformation matrix of each domain is computed. Using the transformation matrix, EA then recenters each domain at the identity matrix, and makes the source and target domains overlap with each other, i.e., the domain gap between them is reduced. If we build a classifier in the source domain (the red dashed line) and apply it to the target domain, the classification performance would be improved.

LA considers the domain gap and the category gap simultaneously. It first computes the mean covariance matrix of each source domain class (indicated by the red circle and the red diamond), and estimates the mean covariance matrix of each target domain class (indicated by the red triangle and the red square). Then, LA recenters each source domain class at the corresponding estimated class mean of the target domain. If we build a classifier in the source domain (the red dashed line) and apply it to the target domain, the classification performance would be further improved.
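The LA recentering described above can be sketched as follows. This is a minimal illustration with our own variable names; for brevity, the class means here are arithmetic means of unnormalized covariances, whereas the paper uses the Log-Euclidean mean:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def label_alignment(src_trials, src_labels, tgt_class_means, class_map):
    """LA for one source subject.
    src_trials: (n, C, T) EEG trials; src_labels: (n,) class labels;
    tgt_class_means: dict {target_label: estimated target class mean covariance};
    class_map: dict matching each source label to the corresponding target label."""
    aligned = np.empty_like(src_trials)
    for s_lab, t_lab in class_map.items():
        idx = src_labels == s_lab
        C_s = np.mean([X @ X.T for X in src_trials[idx]], axis=0)  # source class mean
        C_t = tgt_class_means[t_lab]                               # target class mean
        # closed-form class-wise transformation satisfying A C_s A^T = C_t
        A = np.real(fractional_matrix_power(C_t, 0.5)
                    @ fractional_matrix_power(C_s, -0.5))
        aligned[idx] = np.array([A @ X for X in src_trials[idx]])
    return aligned
```

After this transformation, the mean covariance matrix of each source class coincides with the estimated mean of the matching target class, so a classifier trained on the relabeled source trials can be applied directly to the target subject.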
III-C Label Alignment (LA) versus CORrelation ALignment (CORAL)
Sun et al. [35] proposed an unsupervised DA approach, CORrelation ALignment (CORAL), to minimize the domain shift by aligning the secondorder statistics of different distributions.
Given a source domain $\{x_i^s\}_{i=1}^{n_s}$ and a target domain $\{x_j^t\}_{j=1}^{n_t}$, where $n_s$ and $n_t$ are the numbers of trials in the source domain and the target domain, respectively, and $d$ the feature dimensionality, CORAL first computes the feature covariance matrix $C_S$ in the source domain and $C_T$ in the target domain. Then, it finds a linear transformation matrix $A$ for the source domain features, so that the Frobenius norm of the difference between the covariance matrices of the two domains is minimized, i.e.,

$$\min_{A} \left\| A^T C_S A - C_T \right\|_F^2 \qquad (13)$$
Although (13) seems similar to the objective function of LA in (9), they are different:

CORAL uses 1D features, and each domain has only one feature covariance matrix, which measures the covariances between different pairs of individual features. LA uses 2D features (EEG trials), and each EEG trial has a covariance matrix, which measures the covariances between different pairs of EEG channels. So, the covariance matrices in CORAL and LA have different meanings.

CORAL minimizes the distance between the covariance matrices in different domains, whereas LA minimizes the distance between the mean covariance matrices of the corresponding class in different domains.

CORAL works when the source domain has the same class labels as the target domain, and it finds one transformation matrix for each source domain. LA considers the case that the source and target domains have different class labels (of course, it also works when the two domains have the same class labels), and it finds one transformation matrix for each class of the source domain.
In summary, LA and CORAL have different inputs, different optimization objectives, and also different application scenarios. When the source and target domains have the same class labels, each 2D EEG trial can be mapped from the Riemannian manifold to the tangent space to obtain a 1D feature vector, and hence be plugged into CORAL. However, CORAL cannot be used when the source and target domains have different labels.
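For reference, the standard whitening-then-recoloring solution of (13) can be sketched as follows (our own variable names; a small regularizer is added to keep the covariances invertible, which is our choice rather than part of the original formulation):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def coral(Xs, Xt, eps=1e-5):
    """Align source features Xs (n_s, d) to target features Xt (n_t, d)."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)  # regularized source feature covariance
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)  # regularized target feature covariance
    # whiten the source features, then re-color them with the target statistics
    A = np.real(fractional_matrix_power(Cs, -0.5) @ fractional_matrix_power(Ct, 0.5))
    return Xs @ A
```

Note that this single transformation acts on 1D feature vectors of a whole domain, whereas LA computes one transformation per source class on the channel covariance matrices, matching the distinctions listed above.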
IV Datasets
This section describes and visualizes the two motor imagery datasets used in our experiments.
IV-A Datasets and Preprocessing
Both datasets were from BCI Competition IV (http://www.bbci.de/competition/iv/), and were collected in a cue-based setting, as shown in Fig. 3. In each experiment, the subject sat in front of a computer and performed motor imagery tasks at the prompt of visual cues. More specifically, each trial began when a fixation cross appeared on the black screen, which prompted the subject to be prepared. After a short period, an arrow pointing in a certain direction was displayed as the visual cue. The cue was displayed for a few seconds, during which the subject was instructed to perform the desired motor imagery task according to the direction of the arrow. The subject stopped the motor imagery when the visual cue disappeared. A short break followed, until the next trial began.
The first dataset (Dataset 1 [40]; http://www.bbci.de/competition/iv/desc_1.html) was recorded from seven healthy subjects by 59 EEG channels at 100 Hz. Each subject was instructed to perform two classes of motor imagery tasks, which were selected from three options: left hand, right hand, and feet. The recording of each subject was divided into three sessions: calibration, evaluation, and special feature. This paper only used the calibration data, because they included complete label information. Each subject had 100 trials from each class.
The second dataset (Dataset 2a; http://www.bbci.de/competition/iv/desc_2a.pdf) was recorded from nine healthy subjects by 22 EEG channels and 3 EOG channels at 250 Hz (we downsampled it to 100 Hz, to be consistent with Dataset 1). Each subject was instructed to perform four classes of motor imagery tasks: left hand, right hand, both feet, and tongue, which were represented by Labels 1, 2, 3 and 4, respectively. A training session and an evaluation session were recorded on different days for each subject. We only used the 22-channel EEG data in the training session, which included complete label information. Each subject had 72 trials from each class, and 288 trials in total.
For both datasets, the EEG signals were preprocessed using the Matlab EEGLAB toolbox [41], following the guideline in [42]. First, a causal band-pass filter (a 20-order linear phase Hamming window FIR filter designed by the Matlab function fir1, with 6 dB cutoff frequencies at [8, 30] Hz) was applied to remove muscle artifacts, line-noise contamination, and DC drift. Next, we extracted the EEG signals within a fixed time window after the cue appearance as our trials.
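The filtering step can be sketched in Python with SciPy. This is an illustrative sketch, not the authors' code: `scipy.signal.firwin` uses a Hamming window by default and, like Matlab's fir1, places its cutoffs at the half-amplitude (6 dB) points; the epoch extraction window is omitted here because it is dataset-specific:

```python
import numpy as np
from scipy import signal

def design_bandpass(fs):
    """20-order (21-tap) linear-phase Hamming-window FIR band-pass at [8, 30] Hz."""
    return signal.firwin(21, [8.0, 30.0], fs=fs, pass_zero=False, window="hamming")

def bandpass_8_30(raw, fs):
    """Causally filter raw EEG of shape (n_channels, n_samples) sampled at fs Hz."""
    # lfilter is causal, matching the paper's causal-filtering requirement
    return signal.lfilter(design_bandpass(fs), 1.0, raw, axis=-1)
```

The symmetric tap vector confirms the linear-phase property; in practice a longer filter may be preferred for sharper transition bands at these sampling rates.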
Table I summarizes the characteristics of the two datasets.
Table I: Characteristics of the two datasets.

Dataset   Channels   Time samples   Subjects   Classes   Trials/class
Data 1    59         300            7          2         100
Data 2a   22         300            9          4         72
IV-B Data Visualization
In order to intuitively show how EA and LA reduce the distribution discrepancies between the target and source subjects, we first projected the EEG covariance matrices from the Riemannian manifold into the tangent space, and then used the 1D tangent vectors as features to represent the EEG trials, as introduced in Section II-B. Finally, we used stochastic neighbor embedding (SNE) [43], a technique for dimensionality reduction and high-dimensional data visualization, to display the EEG trials (tangent vectors) before and after EA/LA in 2D.
More specifically, we first divided Dataset 2a into two datasets with different label spaces: the source dataset consisted of trials with Labels 1 and 4, and the target dataset with Labels 2 and 3. Then, we picked one subject from the target dataset as the target subject, and the remaining eight subjects from the source dataset as the source subjects. Fig. 4 shows two examples when the first two subjects were used as the target subjects, respectively. The red and black dots are trials of Labels 2 and 3 from the target subject, respectively. The blue and green dots are trials of Labels 1 and 4 from the source subjects, respectively. The first column shows the trials without alignment, the second column shows the trials after EA, and the third after LA.
Observe that trials from the source subjects (blue and green dots) are scattered far away from those of the target subject (red and black dots), when no alignment is performed. However, the target and source trials overlap with each other after EA, since their centers are now identical. After LA, the target and source trials are further aligned according to their labels. It’s clear that different classes are more distinguishable after LA.
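The visualization pipeline above can be sketched as follows, assuming scikit-learn is available; its t-SNE implementation is used here as a stand-in for the SNE of [43], and `tangent_vectors` stands for the TS features of Section II-B (the perplexity value is our choice):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(tangent_vectors, seed=0):
    """Project 1D tangent-space feature vectors to 2D for visualization."""
    tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=seed)
    return tsne.fit_transform(tangent_vectors)
```

The resulting 2D coordinates can then be scatter-plotted with one color per subject group and class, as in Fig. 4.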
V Experiments and Results
This section presents performance comparisons of LA with other approaches on the two datasets.
V-A Domain Adaptation (DA) Scenarios
We investigated the problem that the source and target subjects have different label spaces, and considered the following four DA scenarios:

Scenario Ia: The source and target subjects have the same feature space and partially overlapping label spaces (binary classification).

Scenario Ib: The source and target subjects have the same feature space and partially overlapping label spaces (multiclass classification).

Scenario II: The source and target subjects have the same feature space and completely different label spaces.

Scenario III: The source and target subjects have different feature spaces and also completely different label spaces.
For Scenarios Ia, Ib and II, in each experiment we divided Dataset 2a into two sub-datasets, a source dataset and a target dataset, such that they had the same feature space and different label spaces. Each sub-dataset was named by its label space; for example, sub-dataset “1, 2” consisted of trials with Labels 1 and 2 only, and sub-dataset “3, 4” of trials with Labels 3 and 4 only. Then, “1, 2 → 3, 4” denotes the experiment in which sub-dataset “1, 2” was used as the source dataset and sub-dataset “3, 4” as the target dataset.
Then, the datasets used in the four DA scenarios were:

Scenario Ia: We divided Dataset 2a into a source sub-dataset and a target sub-dataset, ensuring they had one identical label and one different label. There were 24 such sub-dataset combinations in total, e.g., “1, 2 → 1, 3” and “1, 2 → 2, 3”.

Scenario Ib: We divided Dataset 2a into a source sub-dataset and a target sub-dataset, ensuring they had two identical labels and one different label. There were 12 such combinations in total, e.g., “1, 2, 3 → 1, 2, 4” and “1, 2, 3 → 2, 3, 4”.

Scenario II: We divided Dataset 2a into a source sub-dataset and a target sub-dataset, ensuring they had completely different labels. There were six such combinations in total, e.g., “1, 2 → 3, 4” and “2, 3 → 1, 4”.

Scenario III: We used Dataset 1 as the source dataset and sub-dataset “3, 4” of Dataset 2a as the target dataset, so that they had different feature spaces (their EEG channels were different) and also completely different label spaces.
Once a dataset choice was made, each time we picked one subject from the target dataset as the target subject, and the remaining subjects from the source dataset as the source subjects. As the target dataset always had nine subjects, we had nine sub-experiments for each dataset combination. Table II summarizes the characteristics of the four scenarios, where $n_l$ is the number of labeled target subject trials.
Table II: Characteristics of the four DA scenarios.

Scenario       No. of dataset    No. of sub-     No. of           No. of
               combinations      experiments     training trials  test trials
Ia             24                216             1152 + n_l       144 - n_l
Ib             12                108             1728 + n_l       216 - n_l
II             6                 54              1152 + n_l       144 - n_l
III            1                 9               1400 + n_l       144 - n_l
V-B Experimental Settings
We first divided the BCI classification pipeline into three stages:

Preprocessing, which performs band-pass filtering and trial extraction, as introduced in Section IV-A.

Alignment, which selectively performs different alignments.

Classification, which extracts features and trains classifiers.

In order to emphasize the effect of LA, the algorithms to be compared shared the same preprocessing and classification stages, but used different alignments. More specifically, three alignment approaches were compared:

Raw, which did not perform any alignment.

EA, which performed EA.

LA, which performed LA.
In each scenario, the experiments were designed to answer the following two questions:

Question 1: Can LA be used as an effective preprocessing step before different feature extraction and classification algorithms?

Question 2: Can LA be integrated with other DA approaches to further improve the classification performance?
For Question 1, we used two feature extraction and classification pipelines:

CSP-LDA, which extracts CSP features and classifies them with LDA.

TS-SVM, which extracts Riemannian tangent space features and classifies them with an SVM.

Combining these two pipelines with the three alignment approaches (Raw, EA, LA), we had six algorithms to be compared. Our goal was to verify whether LA performs the best in both pipelines.
For Question 2, we first extracted the Riemannian TS features, and then used different DA approaches in the classification stage (because they need 1D features):

BL (baseline), which directly applied an SVM classifier to the TS features, without any additional DA approach.

JDA, which applied JDA to the TS features, and then used them in an SVM classifier.

JGSA, which applied JGSA to the TS features, and then used them in an SVM classifier.

MEDA, which applied MEDA to the TS features.
Combining these four approaches with the three alignments (Raw, EA, LA), we had 12 algorithms to be compared. Our goal was to verify whether “LA+JDA/JGSA/MEDA > LA+BL > Raw+JDA/JGSA/MEDA”, where “>” means “outperforms”. For example, “LA+BL > Raw+JDA/JGSA/MEDA” means LA outperforms classical DA approaches such as JDA, JGSA and MEDA, and “LA+JDA/JGSA/MEDA > LA+BL” means the performance can be further improved by integrating LA with other DA approaches, i.e., LA is compatible with and complementary to other DA approaches.
V-C Scenario Ia: Same Feature Space and Partially Overlapping Label Spaces in Binary Classification
This subsection considers the binary classification scenario in which the source and target subjects have the same feature space and partially overlapping label spaces. As introduced in Section V-A, we had 24 sub-dataset combinations to be tested.
Because in Scenario Ia the source subjects had one identical label and one different label from the target subject, we only needed to deal with the different label. For algorithms without LA, we directly assigned the target label to the trials of the corresponding class from the source subjects. For example, in the scenario of “1, 2 → 1, 3”, we assigned Label 3 to the trials of Label 2. For algorithms with LA, we first aligned the trials with the same label, then aligned the trials with different labels, and assigned the target label to the trials of the corresponding class from the source subjects. For example, in the scenario of “1, 2 → 1, 3”, we first aligned the trials with Label 1 from the source subjects to the trials with Label 1 from the target subject, then aligned the trials with Label 2 from the source subjects to the trials with Label 3 from the target subject, and assigned Label 3 to the trials of Label 2.
For algorithms involving LA, we varied the number of labeled target trials $n_l$ obtained from the $k$-medoids clustering of LA introduced in Section III. In the rare case that the labeled target trials all came from the same class, we could not perform LA, as there was not enough information to estimate the two class means of the target subject; thus, we performed EA instead of LA. No matter whether the labeled target trials were used in the alignment or not, they were always combined with the labeled source trials for feature extraction and classification. All labeled target subject trials were excluded from the test set, so all algorithms had the same training set and test set.
Question 1: Can LA be used as an effective preprocessing step before different feature extraction and classification algorithms?
We compared Raw, EA, and LA in the two classification pipelines to answer this question. Fig. 5 shows the performances of the six algorithms on the 24 different subdataset combinations, where each subfigure shows the average accuracies across the nine subjects (each as the target subject once). The last subfigure shows the average performances across the 24 experiments. Observe that:

EA-CSP-LDA outperformed Raw-CSP-LDA in 20 out of the 24 experiments, and EA-TS-SVM outperformed Raw-TS-SVM in 14 out of the 24 experiments. On average, EA-CSP-LDA outperformed Raw-CSP-LDA, and EA-TS-SVM outperformed Raw-TS-SVM. This suggests that EA was generally, but not always, effective when the source and target label spaces were different.

When $n_l$ became large, LA-CSP-LDA outperformed Raw-CSP-LDA in all 24 experiments, and LA-TS-SVM also outperformed Raw-TS-SVM in all 24 experiments. This suggests that LA was able to cope well with partially different label spaces.

When $n_l$ became large, LA-CSP-LDA outperformed EA-CSP-LDA in all 24 experiments, and LA-TS-SVM also outperformed EA-TS-SVM in all 24 experiments. This suggests that LA was more effective and robust than EA.

Generally, the classification accuracies of LA-CSP-LDA and LA-TS-SVM increased when there were more labeled target trials for estimating the class means, which is intuitive.
We also performed statistical tests to determine whether the differences between the LA-based algorithms and the others were statistically significant. We first defined an aggregated performance measure, the area under the curve (AUC). For a particular algorithm on a particular subject, the AUC was the area under its accuracy curve as the number of labeled target subject trials increased from 2 to 20. Subjects from all 24 experiments were concatenated, so we had 9 × 24 = 216 subjects in total, and each algorithm had 216 AUCs. We then performed paired t-tests on these AUCs. The null hypothesis was that the difference between the paired samples has zero mean, which was rejected if p < α, where α = 0.05 was used. The results are shown in Table III, where the statistically significant ones are marked in bold. LA-CSP-LDA significantly outperformed EA-CSP-LDA, and LA-TS-SVM significantly outperformed EA-TS-SVM. These results echo the observations from Fig. 5 and answer Question 1: LA can be used as an effective preprocessing step before different feature extraction and classification algorithms.

              EA-CSP-LDA   EA-TS-SVM
LA-CSP-LDA    0.0000       –
LA-TS-SVM     –            0.0000
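The AUC measure and the paired t-test procedure can be sketched as follows. The accuracy curves here are synthetic, generated only to illustrate the computation:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import ttest_rel

def auc_of_curve(accs, n_labeled):
    """Area under a subject's accuracy curve (trapezoidal rule)."""
    return trapezoid(accs, n_labeled)

# Synthetic accuracy curves for 216 subjects and two algorithms,
# evaluated at 2, 4, ..., 20 labeled target trials.
rng = np.random.default_rng(1)
n_labeled = np.arange(2, 21, 2)
acc_a = rng.uniform(0.6, 0.8, (216, len(n_labeled)))
acc_b = acc_a + rng.normal(0.02, 0.01, acc_a.shape)  # B is consistently better

auc_a = np.array([auc_of_curve(a, n_labeled) for a in acc_a])
auc_b = np.array([auc_of_curve(b, n_labeled) for b in acc_b])

# Paired t-test: the null hypothesis is that the per-subject AUC
# differences have zero mean.
t, p = ttest_rel(auc_b, auc_a)
print(p < 0.05)  # True: the ~0.02 advantage is significant over 216 subjects
```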
Question 2: Can LA be integrated with other DA approaches to further improve the classification performance?
As introduced in the previous subsection, we had 12 algorithms to compare. We used the same target and source subjects as in Question 1, which again resulted in 24 experiments. Fig. 6 shows the performances of the 12 algorithms in the 24 experiments, where each subfigure shows the average accuracies across the nine subjects, and the last subfigure shows the average performance across the 24 experiments.
Observe that:

When the number of labeled target trials was large, LA-BL always outperformed Raw-BL and EA-BL, LA-JDA always outperformed Raw-JDA and EA-JDA, LA-JGSA always outperformed Raw-JGSA and EA-JGSA, and LA-MEDA always outperformed Raw-MEDA and EA-MEDA. These suggest that LA was effective regardless of whether additional DA approaches were used.

LA-BL always outperformed Raw-JDA and Raw-MEDA, and outperformed Raw-JGSA in 23 out of the 24 experiments. These suggest that LA can outperform classical DA approaches such as JDA, JGSA and MEDA.

Both LA-JDA and LA-MEDA always outperformed LA-BL, and LA-JGSA outperformed LA-BL in 23 out of the 24 experiments. These suggest that it may be advantageous to integrate other DA approaches with LA.
We also performed paired t-tests on the AUCs in Fig. 6. The results are shown in Table IV, which indicate that the algorithms involving LA (i.e., LA-BL, LA-JDA, LA-JGSA, LA-MEDA) significantly outperformed those involving EA (i.e., EA-JDA, EA-JGSA, EA-MEDA), and the algorithms combining LA with additional DA approaches (i.e., LA-JDA, LA-JGSA, LA-MEDA) significantly outperformed LA-BL. These results echo the observations from Fig. 6 and answer Question 2: LA can not only outperform EA and classical DA approaches, but its classification performance can also be further improved when integrated with other DA approaches.
           LA-BL    EA-JDA   EA-JGSA  EA-MEDA
LA-BL      –        0.0000   0.0000   0.0000
LA-JDA     0.0000   0.0000   –        –
LA-JGSA    0.0001   –        0.0000   –
LA-MEDA    0.0000   –        –        0.0000
V-D Scenario I-b: Same Feature Space and Partially Overlapping Label Spaces in Multi-Class Classification
This subsection considers the multi-class classification scenario in which the source and target subjects have the same feature space and partially overlapping label spaces. As introduced in Section V-A, we had 12 sub-dataset combinations to be tested.
Question 1: Can LA be used as an effective preprocessing step before different feature extraction and classification algorithms?
Again, we compared Raw, EA, and LA in the two classification pipelines to answer this question. CSP filtering was extended from binary to multi-class classification by the one-versus-the-rest approach [44]. As we had three class centers of the target subject to estimate in LA, we used k=3 in the k-medoids clustering.
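The one-versus-the-rest extension learns one set of binary CSP filters per class (that class versus all others) and stacks the filter sets. A minimal sketch, assuming the trials are already band-pass filtered and using trace-normalized covariance matrices:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(covs_a, covs_b, n_filters=2):
    """Binary CSP: solve the generalized eigenvalue problem
    Ca w = lambda (Ca + Cb) w and keep the extreme eigenvectors,
    which maximize the variance ratio between the two conditions."""
    Ca, Cb = covs_a.mean(axis=0), covs_b.mean(axis=0)
    eigvals, eigvecs = eigh(Ca, Ca + Cb)   # eigenvalues in ascending order
    idx = np.concatenate([np.arange(n_filters), np.arange(-n_filters, 0)])
    return eigvecs[:, idx]

def ovr_csp(trials, labels, n_filters=2):
    """One-versus-the-rest CSP for multi-class problems:
    one binary filter set per class, stacked column-wise."""
    covs = np.array([X @ X.T / np.trace(X @ X.T) for X in trials])
    filters = [csp_filters(covs[labels == c], covs[labels != c], n_filters)
               for c in np.unique(labels)]
    return np.hstack(filters)   # (n_channels, n_classes * 2 * n_filters)

# Random stand-in data: 60 trials, 8 channels, 3 classes.
rng = np.random.default_rng(2)
trials = rng.standard_normal((60, 8, 128))
labels = np.repeat([0, 1, 2], 20)
W = ovr_csp(trials, labels)
print(W.shape)  # (8, 12)
```

Features are then typically the log-variances of the spatially filtered trials, fed to LDA.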
Fig. 7 shows the performances of the six algorithms on the 12 different sub-dataset combinations, where each subfigure shows the average classification accuracies across the nine subjects (each as the target subject once). The last subfigure shows the average performance across the 12 experiments. LA-CSP-LDA always outperformed Raw-CSP-LDA and EA-CSP-LDA, and LA-TS-SVM always outperformed Raw-TS-SVM and EA-TS-SVM. These suggest that LA was effective with different feature extraction and classification algorithms.
Paired t-tests on the AUCs in Fig. 7 were also performed to check whether the differences between the algorithms were statistically significant. Here the AUC was the area under the accuracy curve as the number of labeled target subject trials increased from 3 to 30. Each algorithm had 9 × 12 = 108 AUCs. The results are shown in Table V, which indicate that LA-CSP-LDA significantly outperformed EA-CSP-LDA, and LA-TS-SVM significantly outperformed EA-TS-SVM.
              EA-CSP-LDA   EA-TS-SVM
LA-CSP-LDA    0.0196       –
LA-TS-SVM     –            0.0010
Question 2: Can LA be integrated with other DA approaches to further improve the classification performance?
Again, we combined Raw, EA, and LA with the different DA approaches, and obtained 12 algorithms to compare. Fig. 8 shows their performances on the 12 sub-dataset combinations, and the average across the 12 experiments. Observe that:

When the number of labeled target trials was large, LA-BL always outperformed Raw-BL and EA-BL, LA-JDA always outperformed Raw-JDA and EA-JDA, LA-JGSA always outperformed Raw-JGSA and EA-JGSA, and LA-MEDA always outperformed Raw-MEDA. These suggest that LA was effective regardless of whether additional DA approaches were used.

When the number of labeled target trials was large, LA-BL outperformed Raw-JDA, Raw-JGSA and Raw-MEDA in all 12 experiments, suggesting that LA can outperform classical DA approaches such as JDA, JGSA and MEDA.

Generally, LA-JDA, LA-JGSA and LA-MEDA outperformed LA-BL, suggesting that it may be advantageous to integrate additional DA approaches with LA.
Table VI shows the results of paired t-tests on the AUCs in Fig. 8. The conclusions from binary classification still hold in multi-class classification: LA significantly outperformed EA and the classical DA approaches, and its performance can be further significantly improved when integrated with other DA approaches.
           LA-BL    EA-JDA   EA-JGSA  EA-MEDA
LA-BL      –        0.0000   0.0000   0.0000
LA-JDA     0.0000   0.0000   –        –
LA-JGSA    0.0295   –        0.0000   –
LA-MEDA    0.0000   –        –        0.0000
V-E Scenario II: Same Feature Space and Completely Different Label Spaces
This subsection considers the scenario in which the source and target subjects have the same feature space but completely different label spaces. As introduced in Section V-A, we had six such sub-dataset combinations to be tested. As in binary classification, k=2 was used in the k-medoids clustering of LA.
Because in Scenario II the source subjects had completely different labels from the target subject, the source labels and the target labels were randomly matched for LA. For example, in the experiment “1, 2 → 3, 4”, we could align the trials of Label 1 with those of Label 3, and the trials of Label 2 with those of Label 4. We could also align the trials of Label 2 with those of Label 3, and the trials of Label 1 with those of Label 4. Our experiments showed that LA was effective under both alignment strategies.
Question 1: Can LA be used as an effective preprocessing step before different feature extraction and classification algorithms?
We compared Raw, EA, and LA in the two classification pipelines. Fig. 9 shows the performances of the six algorithms on the six sub-dataset combinations, and the average. Observe that:

LA-CSP-LDA always outperformed Raw-CSP-LDA and EA-CSP-LDA, and LA-TS-SVM always outperformed Raw-TS-SVM and EA-TS-SVM. This suggests that LA was effective with different feature extraction and classification algorithms.

Comparing the last subfigure in Fig. 5 with the last one in Fig. 9, we can observe that the performances of Raw-CSP-LDA and Raw-TS-SVM were lower in Fig. 9, which is intuitive, because the label spaces in Scenario II had larger discrepancies. However, the performances of LA-CSP-LDA and LA-TS-SVM did not change much, suggesting that LA can cope well with large label space discrepancies.
For the most extreme case, in which only one labeled target subject trial from each class is available, the average classification accuracies across the nine subjects in the six experiments are given in Table VII. LA achieved the best performance in all six experiments, regardless of which feature extraction and classification algorithm was used.
Experiment     Approach   Raw     EA      LA
1, 2 → 3, 4    CSP-LDA    55.48   56.42   58.84
               TS-SVM     54.38   53.13   56.42
3, 4 → 1, 2    CSP-LDA    50.70   56.81   58.37
               TS-SVM     52.19   54.77   58.53
1, 3 → 2, 4    CSP-LDA    53.99   60.02   62.44
               TS-SVM     60.09   66.67   68.78
2, 4 → 1, 3    CSP-LDA    53.29   61.50   62.21
               TS-SVM     56.49   60.49   64.79
1, 4 → 2, 3    CSP-LDA    52.35   57.75   61.50
               TS-SVM     55.87   60.17   65.81
2, 3 → 1, 4    CSP-LDA    51.10   64.01   69.95
               TS-SVM     56.65   64.32   71.75
We also performed paired t-tests on the AUCs in Fig. 9. Each algorithm had 9 × 6 = 54 AUCs. The p-values are shown in Table VIII, where the statistically significant ones are marked in bold. LA-CSP-LDA significantly outperformed EA-CSP-LDA, and LA-TS-SVM significantly outperformed EA-TS-SVM.
              EA-CSP-LDA   EA-TS-SVM
LA-CSP-LDA    0.0000       –
LA-TS-SVM     –            0.0000
Question 2: Can LA be integrated with other DA approaches to further improve the classification performance?
Again, we considered the case in which there were additional DA approaches after LA. The results are shown in Fig. 10. Observe that:

LA-BL always outperformed Raw-BL and EA-BL, LA-JDA always outperformed Raw-JDA and EA-JDA, LA-JGSA always outperformed Raw-JGSA and EA-JGSA, and LA-MEDA always outperformed Raw-MEDA and EA-MEDA. These suggest that LA was effective regardless of whether an additional DA approach was used.

LA-BL outperformed Raw-JDA, Raw-JGSA and Raw-MEDA in all six experiments, suggesting that LA can outperform classical DA approaches such as JDA, JGSA and MEDA.

Generally, LA-JDA, LA-JGSA and LA-MEDA outperformed LA-BL, suggesting again that it may be advantageous to integrate an additional DA approach with LA.
The results of paired t-tests on the AUCs in Fig. 10 are shown in Table IX, and are consistent with those of the last two subsections: LA significantly outperformed EA and the classical DA approaches, and the classification performance can be further significantly improved when LA is integrated with other DA approaches.
           LA-BL    EA-JDA   EA-JGSA  EA-MEDA
LA-BL      –        0.0000   0.0000   0.0000
LA-JDA     0.0109   0.0000   –        –
LA-JGSA    0.0116   –        0.0000   –
LA-MEDA    0.0009   –        –        0.0000
V-F Scenario III: Different Feature Spaces and Different Label Spaces
This subsection considers the most challenging scenario: the source and target subjects have different feature spaces and also completely different label spaces. We chose “Classes 3, 4” (“feet” and “tongue”) from Dataset 2a as the target dataset, and Dataset 1 as the source dataset. Each time we picked one subject from “Classes 3, 4” as the target subject, and all seven subjects from Dataset 1 as the source subjects. In this scenario, the source and target datasets were collected with different EEG headsets, with different numbers of channels at different locations, so they had different feature spaces. In addition, in Dataset 1, Subjects 1 and 6 performed “left hand” and “feet” tasks, whereas the other subjects performed “left hand” and “right hand” tasks. So, the source and target subjects also had partially or completely different label spaces.
Question 1: Can LA be used as an effective preprocessing step before different feature extraction and classification algorithms?
We selected the source EEG channels as those closest to the target EEG channels [14], and compared the different algorithms. Fig. 11 shows the experimental results when LA was used before different feature extraction and classification algorithms, and Table X shows the p-values of paired t-tests on the AUCs. LA-CSP-LDA significantly outperformed EA-CSP-LDA, and LA-TS-SVM significantly outperformed EA-TS-SVM. These suggest that LA was effective with different feature extraction and classification algorithms.
              EA-CSP-LDA   EA-TS-SVM
LA-CSP-LDA    0.0082       –
LA-TS-SVM     –            0.0006
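The channel selection step pairs each target channel with the spatially closest source channel, as in [14]. A minimal sketch with hypothetical 2-D electrode coordinates (a real montage would use standard 10-20 positions in 3-D):

```python
import numpy as np

def match_channels(src_pos, tgt_pos):
    """For each target channel, pick the spatially closest source channel.

    src_pos, tgt_pos: arrays of shape (n_channels, n_dims) holding
    electrode coordinates. Returns one source-channel index per target
    channel, which can be used to re-index the source EEG trials.
    """
    # Pairwise Euclidean distances, shape (n_target, n_source).
    d = np.linalg.norm(tgt_pos[:, None, :] - src_pos[None, :, :], axis=2)
    return d.argmin(axis=1)

# Hypothetical coordinates: four source channels, three target channels.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt = np.array([[0.1, 0.1], [0.9, 0.1], [0.5, 0.9]])
print(match_channels(src, tgt))  # [0 1 2] (ties resolve to the first index)
```

After this re-indexing, the source and target trials have the same number of channels, so the alignment and classification pipelines apply unchanged.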
Question 2: Can LA be integrated with other DA approaches to further improve the classification performance?
Fig. 12 shows the experimental results with and without additional DA approaches. Generally, LA was effective regardless of whether additional DA approaches were used. Table XI shows the p-values of paired t-tests on the AUCs in Fig. 12. LA-BL significantly outperformed EA-JDA and EA-MEDA, LA-JDA significantly outperformed EA-JDA, and LA-MEDA significantly outperformed EA-MEDA. However, unlike before, integrating LA with other DA approaches did not significantly outperform LA-BL. Nevertheless, LA did not degrade the performance of these DA approaches, either.
           LA-BL    EA-JDA   EA-JGSA  EA-MEDA
LA-BL      –        0.0017   0.1335   0.0011
LA-JDA     0.3733   0.0026   –        –
LA-JGSA    0.8449   –        0.0691   –
LA-MEDA    0.9777   –        –        0.0012
VI Conclusions and Future Research
Transfer learning, or domain adaptation, has been successfully used to reduce the subject-specific calibration effort in BCIs. However, most existing DA approaches require that the source subjects share the same feature space and the same label space as the target subject, which may not always hold in real-world applications. This paper has proposed a simple yet effective LA approach to cope with different label spaces. Our experiments demonstrated that: 1) LA needs as few as one labeled sample from each class of the target subject; 2) LA can be used as a preprocessing step before different feature extraction and classification algorithms; and 3) LA can be integrated with other DA approaches to achieve even better classification performance.
The current LA approach still has some limitations, which will be addressed in our future research:

The estimation of each class mean in the target domain is crucial to the performance of LA. Currently LA uses k-medoids clustering to select a few trials to label, which could be improved.

LA copes well with different label spaces, but does not pay special attention to different feature spaces (although it can also be used in this case). This may explain why the performance improvements were relatively small when LA was integrated with other DA approaches in Scenario III. We will specifically consider different feature spaces in the future.

The current LA approach was specifically designed for EEG trials, and uses 2D covariance matrices as the input features. We will extend it to 1D features so that it can have broader applications in other domains.
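The k-medoids selection mentioned in the first item can be sketched as a basic alternating scheme over a precomputed distance matrix. Plain Euclidean distances are used here for illustration; for EEG trials one would use a distance appropriate to covariance matrices (e.g., the Riemannian distance):

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Basic alternating k-medoids on a precomputed distance matrix D.
    Returns the indices of the k medoids, i.e., the most representative
    samples -- the trials one would ask the user to label."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, k, replace=False)
    for _ in range(n_iter):
        # Assign every sample to its nearest medoid.
        assign = D[:, medoids].argmin(axis=1)
        # Re-pick each cluster's medoid as the member minimizing
        # the total within-cluster distance.
        new = np.array([
            members[D[np.ix_(members, members)].sum(axis=1).argmin()]
            if len(members := np.flatnonzero(assign == c)) > 0 else medoids[c]
            for c in range(k)
        ])
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

# Two well-separated point clusters; the two medoids should land in
# different clusters, one representative sample per cluster.
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
m = k_medoids(D, 2)
print(sorted((m // 20).tolist()))  # [0, 1]: one medoid per cluster
```

Unlike k-means, the cluster centers are constrained to be actual samples, which is what makes the selected trials directly labelable.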
References
 [1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, 2002.
 [2] B. J. Lance, S. E. Kerick, A. J. Ries, K. S. Oie, and K. McDowell, “Brain-computer interface technologies in the coming decades,” Proc. of the IEEE, vol. 100, no. 3, pp. 1585–1599, 2012.
 [3] Z. J. Koles, M. S. Lazar, and S. Z. Zhou, “Spatial patterns underlying population differences in the background EEG,” Brain Topography, vol. 2, no. 4, pp. 275–284, 1990.
 [4] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, “Designing optimal spatial filters for single-trial EEG classification in a movement task,” Clinical Neurophysiology, vol. 110, no. 5, pp. 787–798, 1999.
 [5] H. Ramoser, J. Muller-Gerking, and G. Pfurtscheller, “Optimal spatial filtering of single trial EEG during imagined hand movement,” IEEE Trans. on Rehabilitation Engineering, vol. 8, no. 4, pp. 441–446, 2000.
 [6] H. He and D. Wu, “Spatial filtering for brain-computer interfaces: A comparison between the common spatial pattern and its variant,” in Proc. IEEE Int’l. Conf. on Signal Processing, Communications and Computing, Qingdao, China, Sep. 2018, pp. 1–6.
 [7] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass brain-computer interface classification by Riemannian geometry,” IEEE Trans. on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, 2012.
 [8] F. Yger, M. Berar, and F. Lotte, “Riemannian approaches in brain-computer interfaces: a review,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, 2017.
 [9] B. He, B. Baxter, B. J. Edelman, C. C. Cline, and W. W. Ye, “Noninvasive brain-computer interfaces based on sensorimotor rhythms,” Proc. of the IEEE, vol. 103, no. 6, pp. 907–925, 2015.
 [10] G. Pfurtscheller, G. R. Müller-Putz, R. Scherer, and C. Neuper, “Rehabilitation with brain-computer interface systems,” Computer, vol. 41, no. 10, pp. 58–65, 2008.
 [11] L. F. Nicolas-Alonso and J. Gomez-Gil, “Brain computer interfaces, a review,” Sensors, vol. 12, no. 2, pp. 1211–1279, 2012.
 [12] J. van Erp, F. Lotte, and M. Tangermann, “Brain-computer interfaces: Beyond medical applications,” Computer, vol. 45, no. 4, pp. 26–34, 2012.
 [13] V. Jayaram, M. Alamgir, Y. Altun, B. Scholkopf, and M. Grosse-Wentrup, “Transfer learning in brain-computer interfaces,” IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 20–31, 2016.
 [14] D. Wu, V. J. Lawhern, W. D. Hairston, and B. J. Lance, “Switching EEG headsets made easy: Reducing offline calibration effort using active weighted adaptation regularization,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 24, no. 11, pp. 1125–1137, 2016.
 [15] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance, and C.-T. Lin, “Driver drowsiness estimation from EEG signals using online weighted adaptation regularization for regression (OwARR),” IEEE Trans. on Fuzzy Systems, vol. 25, no. 6, pp. 1522–1535, 2017.
 [16] D. Wu, “Online and offline domain adaptation for reducing BCI calibration effort,” IEEE Trans. on Human-Machine Systems, vol. 47, no. 4, pp. 550–563, 2017.
 [17] D. Wu, “Active semi-supervised transfer learning (ASTL) for offline BCI calibration,” in Proc. IEEE Int’l. Conf. on Systems, Man and Cybernetics, Banff, Canada, October 2017.
 [18] H. He and D. Wu, “Transfer learning enhanced common spatial pattern filtering for brain-computer interfaces (BCIs): Overview and a new approach,” in Proc. 24th Int’l. Conf. on Neural Information Processing, Guangzhou, China, November 2017.
 [19] H. Kang, Y. Nam, and S. Choi, “Composite common spatial pattern for subject-to-subject transfer,” Signal Processing Letters, vol. 16, no. 8, pp. 683–686, 2009.
 [20] F. Lotte and C. Guan, “Learning from other subjects helps reducing brain-computer interface calibration time,” in Proc. IEEE Int’l. Conf. on Acoustics Speech and Signal Processing, Dallas, TX, March 2010.
 [21] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

 [22] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proc. IEEE Int’l. Conf. on Computer Vision, Sydney, Australia, Dec. 2013, pp. 2200–2207.
 [23] J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and statistical alignment for visual domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, Jul. 2017, pp. 1859–1867.
 [24] J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu, “Visual domain adaptation with manifold embedded distribution alignment,” in Proc. 26th ACM Int’l Conf. on Multimedia, Seoul, South Korea, Oct. 2018, pp. 402–410.
 [25] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, “Transfer learning using computational intelligence: a survey,” Knowledge-Based Systems, vol. 80, pp. 14–23, 2015.
 [26] P. Zanini, M. Congedo, C. Jutten, S. Said, and Y. Berthoumieu, “Transfer learning: a Riemannian geometry framework with applications to brain-computer interfaces,” IEEE Trans. on Biomedical Engineering, vol. 65, no. 5, pp. 1107–1116, 2018.
 [27] H. He and D. Wu, “Transfer learning for brain-computer interfaces: A Euclidean space data alignment approach,” IEEE Trans. on Biomedical Engineering, 2019, in press.
 [28] O. Day and T. M. Khoshgoftaar, “A survey on heterogeneous transfer learning,” Journal of Big Data, vol. 4, no. 1, p. 29, 2017.
 [29] F. Liu, J. Lu, and G. Zhang, “Unsupervised heterogeneous domain adaptation via shared fuzzy equivalence relations,” IEEE Trans. on Fuzzy Systems, vol. 26, no. 6, pp. 3555–3568, 2018.
 [30] P. P. Busto and J. Gall, “Open set domain adaptation,” in Proc. IEEE Int’l Conf. on Computer Vision, Venice, Italy, Oct. 2017, pp. 754–763.

 [31] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada, “Open set domain adaptation by backpropagation,” in Proc. European Conf. on Computer Vision, Munich, Germany, Sep. 2018, pp. 153–168.
 [32] Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang, “Open set domain adaptation: Theoretical bound and algorithm,” arXiv preprint arXiv:1907.08375, 2019.
 [33] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Universal domain adaptation,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Long Beach, CA, Jun. 2019, pp. 2720–2729.
 [34] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2007, pp. 513–520.
 [35] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proc. 30th AAAI Conf. on Artificial Intelligence, vol. 6, no. 7, Phoenix, AZ, Feb. 2016, pp. 2058–2065.
 [36] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227–244, 2000.
 [37] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, “Direct importance estimation with model selection and its application to covariate shift adaptation,” in Proc. 21th Annual Conf. on Neural Information Processing Systems, Vancouver, Canada, Dec. 2007, pp. 1433–1440.
 [38] P. E. Utgoff, Machine learning: An artificial intelligence approach. CA: Morgan Kaufmann, 1986, vol. 2, ch. Shift of bias for inductive concept learning, pp. 107–148.

 [39] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Geometric means in a novel vector space structure on symmetric positive-definite matrices,” SIAM Journal on Matrix Analysis and Applications, vol. 29, no. 1, pp. 328–347, 2007.
 [40] B. Blankertz, G. Dornhege, M. Krauledat, K. R. Muller, and G. Curio, “The non-invasive Berlin brain-computer interface: Fast acquisition of effective performance in untrained subjects,” NeuroImage, vol. 37, no. 2, pp. 539–550, 2007.

 [41] A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, pp. 9–21, 2004.
 [42] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K. R. Muller, “Optimizing spatial filters for robust EEG single-trial analysis,” IEEE Signal Processing Magazine, vol. 25, no. 1, pp. 41–56, 2008.
 [43] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
 [44] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Muller, “Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms,” IEEE Trans. on Biomedical Engineering, vol. 51, no. 6, pp. 993–1002, 2004.