MS-MDA: Multisource Marginal Distribution Adaptation for Cross-subject and Cross-session EEG Emotion Recognition

07/16/2021 ∙ by Hao Chen, et al. ∙ 0

As an essential element for the diagnosis and rehabilitation of psychiatric disorders, the electroencephalogram (EEG) based emotion recognition has achieved significant progress due to its high precision and reliability. However, one obstacle to practicality lies in the variability between subjects and sessions. Although several studies have adopted domain adaptation (DA) approaches to tackle this problem, most of them treat multiple EEG data from different subjects and sessions together as a single source domain for transfer, which either fails to satisfy the assumption of domain adaptation that the source has a certain marginal distribution, or increases the difficulty of adaptation. We therefore propose the multi-source marginal distribution adaptation (MS-MDA) for EEG emotion recognition, which takes both domain-invariant and domain-specific features into consideration. First, we assume that different EEG data share the same low-level features, then we construct independent branches for multiple EEG data source domains to adopt one-to-one domain adaptation and extract domain-specific features. Finally, the inference is made by multiple branches. We evaluate our method on SEED and SEED-IV for recognizing three and four emotions, respectively. Experimental results show that the MS-MDA outperforms the comparison methods and state-of-the-art models in cross-session and cross-subject transfer scenarios in our settings. Codes at




I Introduction

Emotion as physiological information, unlike widely studied logical intelligence, is central to the quality and range of daily human communication [1, 2]. In human-computer interaction (HCI), emotion is crucial in influencing situation assessment and belief information, from cue identification to situation classification, with decision selection for building a friendly user interface [3]. For example, affective brain-computer interfaces (aBCIs), which act as a bridge between the emotions extracted from the brain and the computer, have shown potential for rehabilitation and communication [4, 5, 6]. Besides, many studies have shown a strong correlation between emotions and mental illness. Barrett et al. [7] study the relation between emotion differentiation and emotion regulation. Joormann et al. [8] find that depression is strongly associated with the use of emotion regulation strategies. Bucks et al. [9] investigate the identification of non-verbal communicative signals of emotion in people suffering from Alzheimer’s disease. To quantify emotion, most researchers have focused on conventional methods such as classifying emotions from facial expression or language [10]. In recent years, owing to their reliability, easy accessibility, and high precision, non-invasive BCIs such as the electroencephalogram (EEG) have been widely used for brain signal acquisition and the analysis of psychological disorders [11, 12, 13, 14]. With EEG signals, many works also investigate rehabilitation methods for psychological disorders: [15] uses the spatial information of EEG signals to classify depression, and Zhang et al. [16] propose a brain functional network framework for major depressive disorder using EEG signals. Besides, Hosseinifard et al. [17] investigate nonlinear features from EEG signals for classifying depression patients and normal subjects. The flow of an EEG-based affective BCI (aBCI) for emotion recognition is introduced in Section III-A.

Due to the non-stationarity of EEG signals between individual sessions and subjects [12], it is still challenging to obtain a model that can be shared across different subjects and sessions in EEG-based emotion recognition, which gives rise to two transfer scenarios: cross-subject and cross-session (i.e., data collected from the same subject at the same session can be very biased; a detailed description is given in Section III-B). Besides, the analysis and classification of the collected signals are time-consuming and labor-intensive, so it is important to make use of existing labeled data to analyze new signals in EEG-based BCIs. For this purpose, domain adaptation (DA) is widely used. As a sub-field of machine learning, DA improves learning in the unlabeled target domain through the transfer of knowledge from the source domains, which can significantly reduce the number of labeled samples required [18]. In practice, we often face multiple source domains (i.e., data from different subjects or sessions). Due to the shift between domains, adopting DA for EEG data is difficult, especially with multiple sources. In recent years, researchers have tended to merge all source domains into one single source and then use DA to align the distribution (source-combine DA in Fig. 1). This simple approach may improve performance because it expands the training data for the model, but it ignores and disrupts the non-stationarity of each EEG source domain (i.e., EEG data of different people obey different marginal distributions). Moreover, after the sources are directly merged into one new source domain, it cannot be determined whether the new marginal distribution still obeys an EEG-data distribution, which introduces a larger bias.

Fig. 1: Two strategies of multi-source domain adaptation. a) is a single-branch strategy while b) is a multi-branch strategy. In a), all source domains are combined into one new big source, which is then used to align the distribution with the target domain, while in b), multiple sources are aligned at the same time and are divided into multiple branches, each adopting DA with the target domain. In short, a) is one source, one branch with one-to-one DA; b) is multiple sources, multiple branches with one-to-one DA. The figure is best viewed in color.

To solve the multi-source domain adaptation problems in EEG-based emotion recognition, we propose a multi-source marginal distribution adaptation for cross-subject and cross-session EEG emotion recognition (MS-MDA, as illustrated in Fig. 1). First, we assume all the EEG data share low-level features, especially those taken from the same device, the same subject and the same session. Based on this, we construct a simple common feature extractor to extract domain-invariant features. Then for multiple sources, since each of them has some specific features, we pair every single source domain with the target domain to form a branch for one-to-one DA, and align the distribution and extract domain-specific features. After that, a classifier is trained for each branch, and the final inference is made by these multiple classifiers from multiple branches. The details of MS-MDA are given in Section IV.

In summary, we make the following three contributions:

  1. We propose MS-MDA for EEG-based emotion recognition in a new multi-source adaptation way to avoid disrupting the marginal distributions of EEG data.

  2. Extensive experiments demonstrate that our method outperforms the comparison methods on SEED and SEED-IV, and additional experiments also illustrate that our method generalizes well.

  3. During the experiments, we also notice the importance of normalizing the EEG data; we therefore design and evaluate a few normalization approaches for EEG data in domain adaptation scenarios and draw corresponding conclusions. To our knowledge, we are the first to investigate normalization methods for EEG data, which we believe can serve as a guide for future works and be applied to all data in EEG-based datasets and EEG-related domains.

In the remainder of this paper, we first review related works on domain adaptation in the field of EEG-based emotion recognition in Section II. Section III introduces the materials, including the diagram of an EEG-based affective BCI with transfer scenarios, the datasets, and the pre-processing methods. The details of MS-MDA are given in Section IV, whereas Section V presents the settings, results, and additional experiments. Section VI discusses the results of the experiments and our findings, as well as problems and solutions. Finally, Section VII concludes the work and outlines future extensions.

II Related Work

In recent years, the research of affective computing has become one of the trends of machine learning, neural systems, and rehabilitation study. Among those works, emotions are usually characterized into two types of emotion model: discrete categories (basic emotional states, e.g., happy, sad, neutral [19]) or continuous values (e.g., in 3D space of arousal, valence and dominance [20]). With domain adaptation techniques, many works have achieved significant performance in the field of affective computing.

Zheng et al. [21] first apply Transfer Component Analysis [22] and Kernel Principal Component Analysis based methods on the SEED dataset to personalize EEG-based affective models and demonstrate the feasibility of adopting DA in EEG-based aBCIs. Chai et al. propose adaptive subspace feature matching [23] to decrease the marginal distribution discrepancy between two domains, which requires no labeled samples in the target domain. To solve cross-day binary classification, Lin et al. [24] extend robust principal component analysis (rPCA) [25] to a filtering strategy that can capture EEG oscillations of relatively consistent emotional responses. Different from the above, Li et al. consider the multi-source scenario and propose a Multi-source Style Transfer Mapping (MS-STM) [26] framework for cross-subject transfer. They first take a few labeled training data to learn multiple STMs, which are then used to map the target domain distribution to the space of the sources. Their method is similar to our MS-MDA, but they do not take domain-invariant features into consideration, thus losing low-level information.

In recent years, with the development of deep learning techniques and their usability, many works on EEG-based decoding with neural networks have been proposed. Jin et al. [27] and Li et al. [28] adopt the deep adaptation network (DAN) [29] for EEG-based emotion recognition, which takes maximum mean discrepancy (MMD) [30] as a measure of the distance between the source and the target domain and is trained to reduce it on multiple layers. Extending the original method, Chai et al. propose the subspace alignment auto-encoder (SAAE) [31], which first projects both source and target domains into a domain-invariant subspace using an auto-encoder, after which kernel PCA, graph regularization and MMD are used to align the feature distribution. To adapt the joint distribution, Li et al. [32] propose a domain adaptation method for EEG-based emotion recognition that simultaneously adapts marginal and conditional distributions; they also present a fast online instance transfer (FOIT) for improved EEG emotion recognition [33]. Zheng et al. extend the SEED dataset to the SEED-IV dataset and present EmotionMeter [34], a multi-modal emotion recognition framework that combines two modalities, eye movements and EEG waves. With the concept of the attention-based convolutional neural network (CNN) [35], Fahimi et al. [36] develop an end-to-end deep CNN for cross-subject transfer and fine-tune it using some calibration data from the target domain. To tackle the requirement of amassing extensive EEG data, Zhao et al. [37] propose a plug-and-play domain adaptation method that shortens the calibration time to within a minute while maintaining accuracy. Wang et al. [38] present a domain adaptation SPD matrix network (daSPDnet) to help cut the demand for calibration data in BCIs.

These aBCI works have gained significant improvements in their respective directions and transfer scenarios, and on multiple benchmark databases. However, many of them focus on combining multiple sources into one and adopting one-to-one DA, which ignores the differences between the marginal distributions of different EEG domains (source-combine DA in Fig. 1). This operation may compromise the effectiveness of downstream tasks, and although it extends the training data, the trained models do not generalize well enough. Therefore, inspired by [39], a novel multi-source transfer framework, we propose MS-MDA (multi-source marginal distribution adaptation for EEG-based emotion recognition), which transfers multiple source domains to the target domain separately, thus avoiding the destruction of the marginal distributions of the multiple EEG source domains, and also takes domain-invariant features into consideration. Due to the sensitivity of EEG data and intuition, we do not adopt complex networks but just a combination of a few multi-layer perceptrons (MLPs) [40], which makes our method computationally efficient and easy to extend.

III Materials

III-A Diagram

Fig. 2: The flowchart of EEG-based BCI for emotion recognition. The emotions are first evoked and encoded into EEG data, then the EEG data are pre-processed and extracted to various forms of features for subsequent pattern recognition.

The flow of one EEG-based aBCI for emotion recognition is shown in Fig. 2, which involves five steps:

  • Stimulating emotions. The subjects are first stimulated with stimuli that correspond to a target emotion. The most commonly used stimuli are movie clips with sound, which can better stimulate the desired emotion because they mix sound with images and actions. After each clip, self-assessment is also applied for the subject to ensure the consistency of the evoked emotion and the target emotion.

  • EEG signal acquisition and recording. The EEG data are collected using the dry electrodes on the BCI and then labeled with the target emotion.

  • Signal pre-processing. Since the EEG data is a mixture of various kinds of information containing much noise, the EEG signal must be pre-processed to obtain cleaner data for subsequent recognition. This step often includes down-sampling, band-pass filtering, temporal filtering, and spatial filtering to improve the signal-to-noise ratio (SNR).

  • Feature extraction. In this step, features of the pre-processed signals are extracted in various ways. Most current research works extract features in the time or frequency domain.

  • Pattern recognition. Machine learning techniques are used to classify or regress the data according to the specific application scenario.

III-B Scenarios

Considering the sensitivity of the EEG, domain adaptation in emotion recognition can be divided into several cases: 1) Cross-subject transfer. In one session, new EEG data from a new subject is taken as the target domain, and the rest of existing EEG data from other subjects are taken as the source domains for DA. 2) Cross-session transfer. For one subject, data collected in the previous sessions can be used as the source domain for DA, and data collected in the new session are taken as the target domain.

In our work, since the datasets we evaluate on contain 3 sessions and 15 subjects (refer to Section III-C for details), we take the first 2 sessions of data from one subject as the source domains for cross-session transfer, and the data of the first 14 subjects from one session as the source domains for cross-subject transfer. The results of the cross-session scenario are averaged over 15 subjects, and the results of the cross-subject scenario are averaged over 3 sessions. Standard deviations are also calculated.
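The two transfer scenarios amount to a simple index split, sketched below. The sketch assumes a hypothetical nested container `data[subject][session]` and takes the remaining (last) subject or session as the target, which is what using the first 14 subjects or the first 2 sessions as sources implies. The function names are illustrative only.

```python
def cross_subject_split(data, session):
    """Sources: all but the last subject of one session; target: the last subject."""
    subjects = sorted(data)
    sources = [data[s][session] for s in subjects[:-1]]
    target = data[subjects[-1]][session]
    return sources, target


def cross_session_split(data, subject):
    """Sources: all but the last session of one subject; target: the last session."""
    sessions = sorted(data[subject])
    sources = [data[subject][s] for s in sessions[:-1]]
    target = data[subject][sessions[-1]]
    return sources, target


# Toy stand-in for the recordings: 15 subjects x 3 sessions, labels only.
data = {subj: {sess: f"subject{subj}-session{sess}" for sess in range(1, 4)}
        for subj in range(1, 16)}

sources, target = cross_subject_split(data, session=1)
print(len(sources), target)   # 14 source domains, target is subject 15

sources, target = cross_session_split(data, subject=1)
print(len(sources), target)   # 2 source domains, target is session 3
```

Each element of `sources` then feeds its own branch, rather than being concatenated with the others.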

III-C Datasets

The databases we evaluate on are SEED [19, 41] and SEED-IV [34]; both were established by the BCMI laboratory led by Prof. Bao-Liang Lu at Shanghai Jiao Tong University.

The SEED database contains emotion-related EEG signals that are evoked by 15 film clips (with positive, neutral, and negative emotions) from 15 subjects with 3 sessions each. The signals are recorded by a 62-channel ESI neuroscan system.

SEED-IV is an evolution of SEED that contains 3 sessions, each with 15 subjects and 24 film clips. Compared with SEED, which provides EEG signals only, this database also includes eye movement features recorded by SMI eye-tracking glasses.

III-D Pre-processing

After the raw EEG data are collected, signal pre-processing and feature extraction are applied. For both SEED and SEED-IV, to increase the SNR, the raw EEG signals are first down-sampled to a 200 Hz sampling rate and then processed with a band-pass filter between 1 Hz and 75 Hz. After that, features are extracted.


Recent works extract features from EEG data in the time domain, frequency domain, and time-frequency domain. Among them, Differential Entropy (DE), as in (1), has the ability to distinguish patterns from different bands [42]; we therefore take DE features as the input data of our model. For SEED and SEED-IV, extracted DE features at the five frequency bands of delta (1-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz) and gamma (31-50 Hz) are provided.
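As a rough illustration of what a DE feature is: if a band-filtered EEG segment is assumed to be Gaussian with variance σ² (a common assumption for DE features, stated here as an assumption rather than taken from this text), DE has the closed form 0.5 ln(2πeσ²). The helper name `differential_entropy` is hypothetical.

```python
import math
import statistics


def differential_entropy(segment):
    """DE of a segment under a Gaussian assumption: 0.5 * ln(2 * pi * e * variance)."""
    variance = statistics.pvariance(segment)
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


# A toy segment with variance 1; real inputs would be band-filtered EEG samples.
print(differential_entropy([-1.0, 1.0]))
```

Under this assumption the feature depends only on the signal variance in each band, which is why one value per channel and band is enough.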

The data from one subject in one session, for both databases, is in the form of channel (62) × trial (15 for SEED, 24 for SEED-IV) × band (5). We then merge the channel with the band, and the form becomes trial × 310 (62 × 5). For SEED, the 15 trials contain 3394 samples in total for each session. For SEED-IV, the 24 trials contain 851/832/822 samples for the three sessions, respectively. In the end, all data are formed into 3394 × 310 (SEED) or 851/832/822 × 310 (SEED-IV), with corresponding generated label vectors in the form of 3394 × 1 or 851/832/822 × 1.
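The merge of the channel and band axes can be sketched as follows. The input layout `data[channel][sample][band]` and the channel-major ordering inside the 310-D vector are assumptions made for illustration; the text only fixes the resulting size of 62 × 5 = 310.

```python
def merge_channel_band(data):
    """Turn data[channel][sample][band] into rows[sample][channel * n_bands + band]."""
    n_channels = len(data)
    n_samples = len(data[0])
    rows = []
    for i in range(n_samples):
        row = []
        for c in range(n_channels):
            row.extend(data[c][i])   # append the band values of channel c
        rows.append(row)
    return rows


# Toy input with the sizes given in the text: 62 channels, 5 bands, 4 samples.
toy = [[[float(c * 100 + i * 10 + b) for b in range(5)]
        for i in range(4)] for c in range(62)]
features = merge_channel_band(toy)
print(len(features), len(features[0]))   # 4 310
```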

IV Method

For simplicity of demonstration, Table I lists the symbols and definitions that are used in the following sections.

symbol definition
Instance set (matrix)
Label set (matrix)
Source domain
Target domain
number of source domains
Common feature
Domain-specific feature
Predicted label (matrix)
Mapping function
Reproducing kernel Hilbert space
Common feature extractor
Domain-specific feature extractor
Domain-specific classifier
Feature vector
Label vector
Feature vector after CFE
Feature vector after DSFE
Predicted label vector
TABLE I: Notation Table
Fig. 3: The architecture of our proposed method. Our network consists of a common feature extractor, domain-specific feature extractor, and domain-specific classifier. For each source domain, a branch of DSFE and DSC is conducted for pair-wise domain adaptation. The model receives multiple source domains and leverages their knowledge to transfer to the target domain.

Given a set of pre-existing EEG data and newly collected EEG data, our goal is to learn a model that is trained on these multiple independent source domains using DA and thus predicts the newly collected data better than a model trained by simply combining the existing data into one source domain. The architecture of the proposed method is illustrated in Fig. 3.

As shown in the figure, the inputs to MS-MDA are the independent source domain data and the target domain data. These data are fed into a common feature extractor module to obtain domain-invariant features of the sources and the target. The extracted common features of each source are then fed, together with those of the target, into the branch of the corresponding domain-specific feature extractor to obtain their domain-specific features, on top of which the MMD value is calculated as a measure of the distance between the current source and the target domain. Next, the target domain features and all the source domain features extracted in the last step are passed to the domain-specific classifiers to obtain the corresponding classification predictions, and the predictions for the source domains are used to calculate the classification loss. Since the target domain data are fed into all the source domain classifiers, multiple target domain predictions are generated. These predictions are used to calculate the discrepancy loss. In the end, the average of these target-domain predictions is taken as the output of the model. Details of these modules are given below.

Common Feature Extractor in MS-MDA is used to map the source and target domain data from the original feature spaces to a common shared latent space, in which common representations of all domains are extracted. This module helps to extract low-level domain-invariant features.

Domain-specific Feature Extractor follows the Common Feature Extractor (CFE). After obtaining the features of all domains, we set up one single fully connected layer for each source domain. For each pair of source and target domain, we map the data to a unique latent space via the corresponding Domain-specific Feature Extractor (DSFE) and then obtain the domain-specific features in each branch. To apply DA and bring the two domains close in the latent space, we choose the MMD to estimate the distance between these two domains. MMD is widely used in DA and is formulated in (2). During training, the MMD loss is decreased to bring the source domain and the target domain closer in the feature space, which helps make better predictions for the target domain. This module aims to learn multiple domain-specific features.
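To make the role of the MMD loss concrete, the following is a minimal sketch of a squared-MMD estimate between one batch of source features and one batch of target features. It uses the standard biased estimator with a Gaussian kernel; the kernel choice, the `bandwidth` parameter, and the function names are assumptions for illustration and are not taken from the text.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
shifted = [[x + 2.0 for x in row] for row in source]

print(mmd_squared(source, source))    # 0 up to rounding: same distribution
print(mmd_squared(source, shifted))   # clearly positive: shifted distribution
```

In a multi-branch setting, one such value would be computed per source-target pair, so each branch aligns only its own source with the target.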


Domain-specific Classifier uses the features extracted from the DSFE to predict the result. In the Domain-specific Classifier (DSC), there is one single softmax classifier for each source domain. For training each classifier, we choose cross-entropy to estimate the classification loss, as shown in (3). Besides, since there are multiple classifiers in this module, each trained on a different source domain, simply averaging their predictions as the final result would lead to high variance, especially when the target domain samples are at the decision boundary, which would have a significant negative impact on the results. To reduce this variance, a metric called discrepancy loss is introduced to make the predictions of the classifiers converge, as shown in (4). The average of the predictions of the classifiers is taken as the final result.
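The three quantities in this module can be sketched for a single sample as follows. The cross-entropy is the usual one; the discrepancy is written here as the mean absolute difference between every pair of classifiers' softmax outputs, which is an assumption made for illustration. All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Classification loss for one source sample with an integer class label."""
    return -math.log(probs[label])


def discrepancy(predictions):
    """Mean absolute difference between every pair of classifiers' predictions."""
    pairs = [(i, j) for i in range(len(predictions))
             for j in range(i + 1, len(predictions))]
    total = 0.0
    for i, j in pairs:
        p, q = predictions[i], predictions[j]
        total += sum(abs(a - b) for a, b in zip(p, q)) / len(p)
    return total / len(pairs)


def average_prediction(predictions):
    n = len(predictions)
    return [sum(p[k] for p in predictions) / n for k in range(len(predictions[0]))]


# Three branch classifiers scoring the same target sample (3 classes).
target_logits = [[2.0, 0.5, 0.1], [1.5, 1.4, 0.2], [0.3, 2.2, 0.1]]
target_probs = [softmax(z) for z in target_logits]

print(discrepancy(target_probs))          # grows as the classifiers disagree
final = average_prediction(target_probs)
print(final.index(max(final)))            # predicted class of the ensemble
```

Driving the discrepancy toward zero pulls the branch outputs together, so that the averaged prediction is less sensitive to any single branch near a decision boundary.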


In summary, MS-MDA accepts multiple source domain EEG data and one target domain EEG data, and uses a common feature extractor to obtain the features of every source domain and of the target domain. Next, domain-specific feature extractors are used to compute, pair by pair, the MMD loss between one individual source and the target domain and to extract their domain-specific features. Finally, domain-specific classifiers perform the classification task; the classification loss is calculated on the source domain features, and the discrepancy loss is calculated on the classifiers' predictions for the target domain features produced by the previous feature extractors.


The training is based on (5) and follows Algorithm 1. For the three losses, minimizing the MMD loss yields domain-invariant features for each pair of source and target domains; minimizing the classification loss yields more accurate classifiers for predicting the source domain data; minimizing the discrepancy loss makes the multiple classifiers more convergent.

Input: number of iterations, source domain data, and target domain data
Output: prediction of target domain data
for t = 1, …, number of iterations do
    Take samples from the source domains and from the target domain.
    Update the model by minimizing the total loss.
end for
return prediction of target domain data
Algorithm 1 Overview of MS-MDA
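How the three losses might be combined in one iteration is sketched below. Since (5) is not reproduced in this text, everything about the combination is an assumption: the plain weighted sum, the sigmoid-shaped `ramp` schedule (a common choice in the DA literature for starting with classification and phasing in alignment later), and the hypothetical `disc_weight` parameter.

```python
import math


def ramp(progress):
    """Sigmoid-shaped ramp from 0 to about 1 for progress in [0, 1] (assumed schedule)."""
    return 2.0 / (1.0 + math.exp(-10.0 * progress)) - 1.0


def total_loss(cls_loss, mmd_loss, disc_loss, progress, disc_weight=0.01):
    """Assumed combination: classification first, alignment terms ramped in."""
    gamma = ramp(progress)
    return cls_loss + gamma * (mmd_loss + disc_weight * disc_loss)


n_iterations = 5
for t in range(1, n_iterations + 1):
    # In real training these three values come from one mini-batch of one branch.
    cls_loss, mmd_loss, disc_loss = 1.0, 0.5, 0.2
    loss = total_loss(cls_loss, mmd_loss, disc_loss, progress=t / n_iterations)
    print(t, round(loss, 4))
```

With this schedule the alignment terms contribute nothing at the very start and almost their full weight by the end of training.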

V Experiments

We perform substantial experiments in the task of classification of emotions on two datasets SEED and SEED-IV, with the normalization study to the EEG data for domain adaptation. Besides, we also conduct some exploratory experiments in addition to the evaluation of our proposed methods and comparison methods.

V-A Implementation Details

As mentioned in Section IV, there are many details in the three modules of MS-MDA. First, for the Common Feature Extractor (CFE), since we do not take raw data (i.e., EEG signals) but the extracted DE features as input vectors, complex deep models such as deep convolutional neural networks are not suitable for this module. We therefore choose a 3-layer MLP for simplicity, which reduces the feature dimension from 310 (62 × 5, channel × band) to 64. In the CFE, every linear layer is followed by a LeakyReLU [43] layer. We also evaluate the effect of the ReLU [44] activation function, but due to the sensitivity of the EEG data, much information would be lost with ReLU since values less than zero are dropped, so we choose LeakyReLU as a compromise. Next, the domain-specific feature extractor (DSFE) and the domain-specific classifier (DSC) each consist of a single linear layer, which reduces 64-D to 32-D and 32-D to the corresponding number of categories (3 for SEED, 4 for SEED-IV), respectively. In the DSFE, as in the CFE, the linear layer is followed by a LeakyReLU layer, while in the DSC there is only one linear layer without any activation function. The network is trained using the Adam [45] optimizer with an initial learning rate of 0.01 for 200 epochs. The batch size is 256, which means we take 256 samples from each domain in every iteration (we also evaluate different settings of batch size and epoch in Section V-E). The whole model is trained under (5). For the domain adaptation loss, we choose MMD as the metric of the distance between two domains in the feature space (CORAL loss has a similar effect). As for the discrepancy loss, L1 regularization is used; we also evaluate this loss in Section V-E. Besides, we dynamically adjust the coefficients so that training focuses on the classification results first and then starts aligning MMD and the convergence between the classifiers. As for the training data, we take the DE features and reshape each sample into a 310-D vector, as illustrated in Section III-D. Before feeding the data into the model, we normalize all of it electrode-wise; refer to Section V-D for details.
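The layer layout described above can be sketched as a forward pass in plain Python. The hidden sizes 256 and 128 inside the CFE, the random initialization, and all function names are assumptions for illustration; the text only fixes 310 → 64 for the CFE, 64 → 32 for each DSFE, and 32 → number of classes for each DSC.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weights and zero bias for one fully connected layer."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


def build_model(n_sources, n_classes, rng):
    # Hidden sizes 256 and 128 are assumptions; the text only fixes 310 -> 64.
    cfe = [make_linear(i, o, rng) for i, o in [(310, 256), (256, 128), (128, 64)]]
    branches = [(make_linear(64, 32, rng), make_linear(32, n_classes, rng))
                for _ in range(n_sources)]
    return cfe, branches


def forward(model, x):
    """Return one logit vector per branch for a single 310-D input."""
    cfe, branches = model
    for layer in cfe:                            # common feature extractor
        x = leaky_relu(linear(layer, x))
    outputs = []
    for dsfe, dsc in branches:
        specific = leaky_relu(linear(dsfe, x))   # domain-specific features
        outputs.append(linear(dsc, specific))    # classifier, no activation
    return outputs


rng = random.Random(0)
model = build_model(n_sources=14, n_classes=3, rng=rng)   # SEED, cross-subject
sample = [rng.gauss(0.0, 1.0) for _ in range(310)]
logits = forward(model, sample)
print(len(logits), len(logits[0]))   # 14 3
```

A target sample passes through every branch, which is why 14 logit vectors come back in the cross-subject setting; averaging their softmax outputs gives the final prediction.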

V-B Results

Dataset   Method          Cross-session     Cross-subject
SEED      DDC             81.53 ± 6.83      68.99 ± 3.23
          DAN             79.93 ± 7.06      65.84 ± 2.25
          DAN [28]        -                 83.81 ± 8.56
          DCORAL          76.86 ± 7.61      66.29 ± 4.53
          DANN [28]       -                 79.19 ± 13.14
          PPDA [37]       -                 86.70 ± 7.10
          MS-MDA (Ours)   88.56 ± 7.80      89.63 ± 6.79
SEED-IV   DDC             57.63 ± 11.28     37.41 ± 6.36
          DAN             55.14 ± 12.79     32.44 ± 9.02
          DCORAL          44.63 ± 11.38     37.43 ± 3.08
          MS-MDA (Ours)   61.43 ± 15.71     59.34 ± 5.48
TABLE II: Comparison results on SEED and SEED-IV (mean ± standard deviation, in %)

The experimental results of the comparison methods and our proposed method on SEED and SEED-IV are listed in Table II. All hyper-parameters are the same, except for those results taken directly from the original papers. It should be noted that since many previous works do not make their code publicly available, we implement the comparison methods (in the deep learning domain adaptation field) as described in their papers with our settings, and also include some typical deep learning domain adaptation models for better comparison (DDC [46], DCORAL [47]). The results indicate that our method largely outperforms the comparison methods in most transfer scenarios. On the SEED dataset, our method achieves a minimum improvement of about 7% and 3% in the cross-session and cross-subject scenarios, respectively, while on the SEED-IV dataset it achieves a minimum improvement of about 7% and 18% in the two transfer scenarios. The results also show that our method outperforms the comparison methods most significantly in the cross-subject scenario. A possible reason is that in the cross-subject scenario the number of sources is 14, much larger than the 2 in cross-session, which maximizes the effect of treating multiple sources as multiple individual domains in domain adaptation rather than concatenating them.

Dataset   Method                  Cross-session     Cross-subject
SEED      Ours full               88.56 ± 7.80      89.63 ± 6.79
          w/o MMD loss            82.20 ± 14.33     67.65 ± 17.65
          w/o disc. loss          86.27 ± 9.14      87.27 ± 5.70
          w/o MMD + disc. loss    83.19 ± 9.69      80.48 ± 5.76
SEED-IV   Ours full               61.43 ± 15.71     59.34 ± 5.48
          w/o MMD loss            56.51 ± 18.48     49.71 ± 5.82
          w/o disc. loss          61.63 ± 17.62     55.37 ± 14.38
          w/o MMD + disc. loss    62.66 ± 16.07     55.81 ± 4.17
TABLE III: Ablation study of MS-MDA on SEED and SEED-IV (mean ± standard deviation, in %)

V-C Ablation Study

To understand the effect of each module in MS-MDA, we remove them one at a time and evaluate the performance of the ablated model; the results are shown in Table III. The first row for SEED and SEED-IV shows the performance of the full model (the same as in Table II). The second row ablates the MMD loss in the training process, which makes the model focus only on the classification loss and the discrepancy loss. The significant drop compared to the full model indicates the important effect of domain adaptation. Notice that even the results without the MMD loss are better than many comparison methods, showing the importance of treating multiple sources as multiple individual domains during training. The third row, which takes out the discrepancy loss, shows that this loss affects the performance but that the impact is minimal; the reason is that we want the discrepancy loss to be the icing on the cake rather than have a dominant effect on the model. The fourth row only considers the classification loss, thus removing losses (2) and (4).

V-D Normalization

Fig. 4: Three normalization methods. The dark blue box stands for the sample-wise normalization, while the light blue box stands for the electrode-wise normalization. The big gray box stands for the global-wise normalization.
Fig. 5: Small blue matrices are data from different subjects, and (1), (2) are two operations. The basic process is: multi-source data →(1) →(2). In order A, (1) stands for normalization and (2) stands for concatenation. In order B, (1) stands for concatenation while (2) stands for normalization.
Model           Normalization type   SEED                              SEED-IV
                                     Cross-session    Cross-subject    Cross-session    Cross-subject
DAN (order A)   w/o normalization    33.96 ± 0.23     33.91 ± 0.09     27.23 ± 4.78     27.15 ± 1.31
                electrode-wise       79.93 ± 7.06     65.84 ± 2.25     55.14 ± 12.79    32.44 ± 9.02
                sample-wise          52.51 ± 11.92    51.77 ± 12.61    27.34 ± 2.45     32.03 ± 4.24
                global-wise          54.02 ± 9.29     49.12 ± 12.06    31.72 ± 6.46     29.31 ± 2.40
DAN (order B)   w/o normalization    33.96 ± 0.23     33.91 ± 0.09     27.23 ± 4.78     27.15 ± 1.31
                electrode-wise       79.78 ± 6.97     62.57 ± 5.31     52.18 ± 10.53    34.26 ± 7.98
                sample-wise          52.51 ± 11.92    51.77 ± 12.61    27.34 ± 2.45     32.03 ± 4.24
                global-wise          53.07 ± 10.50    50.22 ± 3.66     31.01 ± 7.56     31.77 ± 2.08
MS-MDA          w/o normalization    80.62 ± 12.22    60.92 ± 3.58     30.11 ± 6.47     29.64 ± 7.26
                electrode-wise       86.94 ± 8.68     86.93 ± 8.24     64.07 ± 14.36    55.21 ± 6.30
                sample-wise          81.84 ± 13.72    74.09 ± 5.79     34.71 ± 10.94    30.25 ± 5.20
                global-wise          81.80 ± 12.75    78.89 ± 10.38    33.87 ± 10.99    31.88 ± 7.73
TABLE IV: Normalization study of MS-MDA and DAN (mean ± standard deviation, in %). DAN (order A) and DAN (order B) denote DAN with normalization order A and order B, respectively.
Training percentage   Weight   SEED                              SEED-IV
                               Cross-session    Cross-subject    Cross-session    Cross-subject
w/o disc. loss                 86.27 ± 9.14     87.27 ± 5.70     61.63 ± 17.62    55.37 ± 14.38
0.2                   1        86.94 ± 8.68     86.93 ± 8.24     64.07 ± 14.36    55.21 ± 6.30
0.2                   0.1      86.99 ± 9.23     87.37 ± 7.64     64.75 ± 13.36    53.15 ± 11.88
0.2                   0.01     86.87 ± 9.30     87.09 ± 8.02     64.04 ± 13.72    50.54 ± 15.59
0.2                   0.001    86.91 ± 9.35     86.93 ± 8.24     64.14 ± 13.88    53.25 ± 9.55
1                     1        85.58 ± 8.19     63.42 ± 2.15     61.88 ± 16.71    57.34 ± 9.07
1                     0.1      85.80 ± 10.05    81.13 ± 11.19    62.42 ± 15.99    56.34 ± 10.18
1                     0.01     88.56 ± 7.80     89.63 ± 6.79     61.43 ± 15.71    59.34 ± 5.48
1                     0.001    86.36 ± 8.68     84.84 ± 3.49     64.41 ± 17.58    48.01 ± 8.66
TABLE V: Performance of MS-MDA on SEED and SEED-IV with different settings of the discrepancy loss (mean ± standard deviation, in %). Training percentage indicates when this loss is added to training: 1 means the whole training process, while 0.2 means the last 20% of the training process. Weight represents the relative weight of this loss.

During the experiments, we also find that different normalization of the data can significantly impact the outcomes, as can the order of operations, i.e., whether multiple sources are concatenated first or each session is normalized individually first. We thus design diagrams and conduct extensive experiments to investigate the effects of different normalization strategies on the input data, i.e., the feature vectors extracted from the two datasets. Since we have reshaped the original 4-D matrices (session × channel × trial × band) into 3-D matrices (session × trial × (channel × band)), for each session there is a 2-D matrix of trial × 310. Following common machine learning normalization approaches and the prior knowledge and intuition about EEG data (i.e., the data acquired by the same electrode are more consistent with the same distribution), the normalization methods for these 2-D matrices can be categorized into three types, as shown in Fig. 4. Besides, since we also take the multi-source situation into consideration, the order of normalization may also influence the performance, as shown in Fig. 5.
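The three normalization types and the two orders can be sketched on a small samples × features matrix. Two points are assumptions made for illustration: the statistic is taken to be a z-score (zero mean, unit variance), and electrode-wise is read as normalizing each column of the trial × 310 matrix on its own. The function names are hypothetical.

```python
import statistics


def zscore(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + 1e-7) for v in values]


def normalize(matrix, mode):
    """Normalize a samples x features matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if mode == "sample":        # each row on its own
        return [zscore(row) for row in matrix]
    if mode == "electrode":     # each column on its own
        columns = [zscore([row[c] for row in matrix]) for c in range(n_cols)]
        return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]
    if mode == "global":        # one mean and std for the whole matrix
        flat = zscore([v for row in matrix for v in row])
        return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    raise ValueError(f"unknown mode: {mode}")


def order_a(sources, mode):
    """Normalize each source first, then concatenate."""
    return [row for src in sources for row in normalize(src, mode)]


def order_b(sources, mode):
    """Concatenate all sources first, then normalize."""
    return normalize([row for src in sources for row in src], mode)


subject_1 = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
subject_2 = [[101.0, 5.0], [102.0, 6.0], [103.0, 7.0]]

print(order_a([subject_1, subject_2], "electrode")[0])
print(order_b([subject_1, subject_2], "electrode")[0])
```

In the toy data the two subjects sit at very different offsets, so order A removes each subject's own offset while order B leaves the between-subject gap in place.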

We evaluate three normalization methods and two normalization orders on SEED and SEED-IV with our proposed method MS-MDA and the representative domain adaptation model DAN [29]. The results are listed in Table IV. In all three sets, electrode-wise normalization significantly outperforms the other three normalization types. Comparing the two DAN variants (order A and order B), the results indicate that the first order, normalizing the data first and then concatenating them, is better. In the third set, MS-MDA, all four normalization types give better results than in the first and second sets, and the improvement is significant. The row w/o normalization in MS-MDA, for example, improves by up to 47%, which also indicates that our proposed method generalizes across normalization types and that treating multiple sources as individual branches for DA has a positive effect.

V-E Additions

V-E1 Coefficient Study

After multiple sets of experiments, we find that the MMD loss is easy to control and plays an influential role in training, as shown in Table III. The disc. loss, however, still poses many problems. Adding this loss to the model too early harms overall performance, while adding it too late forfeits its effect on convergence. Too large a weight causes the training to focus on convergence, so that the few correct classifiers might follow the many incorrect ones; too small a weight may not have enough influence on the model. Also, for the ease of use and simplicity mentioned earlier, we do not run many tests on the disc. loss coefficient, but simply compare a few settings; the results are shown in Table V. Compared to row one (w/o disc. loss), introducing the discrepancy loss increases performance in most cases, especially when it is applied for the whole training process in the cross-subject scenario on SEED-IV. Based on these results, we choose a weight of 0.01 and apply the discrepancy loss for the whole training process.
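The two knobs studied in Table V, when the disc. loss is switched on and how heavily it is weighted relative to the MMD loss, can be sketched as a simple schedule. The function names, the constant `mmd_coef`, and the plain weighted sum are assumptions made for illustration; the actual training code may combine the losses differently.

```python
def disc_coefficient(step, total_steps, training_percentage, weight, mmd_coef=1.0):
    """Coefficient of the disc. loss at a given training step.

    training_percentage: fraction of training, counted from the end, during
        which the disc. loss is active (1 = whole process, 0.2 = last 20%).
    weight: ratio of the disc. loss coefficient to the MMD loss coefficient.
    """
    active_steps = round(training_percentage * total_steps)
    start = total_steps - active_steps
    return weight * mmd_coef if step >= start else 0.0

def total_loss(cls_loss, mmd_loss, disc_loss, step, total_steps,
               training_percentage=1.0, weight=0.01, mmd_coef=1.0):
    """Hypothetical weighted sum of classification, MMD and disc. losses."""
    coef = disc_coefficient(step, total_steps, training_percentage, weight, mmd_coef)
    return cls_loss + mmd_coef * mmd_loss + coef * disc_loss
```

With `training_percentage=0.2` and 100 steps, the disc. loss contributes nothing before step 80 and is scaled by `weight * mmd_coef` afterwards; the defaults correspond to the setting chosen above (weight 0.01, whole training process).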

V-E2 Hyper-parameters and Data Visualization

To better investigate our proposed method, we evaluate it with different hyper-parameters and take the representative method DAN as the comparison. The results are shown in Fig. 10 and Fig. 13. As the batch size increases, both models show a drop in performance, especially at a batch size of 512, where accuracy on SEED-IV decreases significantly compared to 256. As the number of training epochs increases, neither model improves substantially, especially MS-MDA, but our method achieves moderate accuracy and converges faster. Comparing the cross-subject experiments on the two datasets, MS-MDA has a clear advantage over DAN, which indirectly shows that our approach brings a larger performance improvement for multiple source domain adaptation in EEG-based emotion recognition.

For a better understanding of the effect of our proposed method, we randomly pick 100 EEG samples from each subject (domain) in the cross-subject scenario and visualize them with t-SNE [48], as displayed in Fig. 14. We only plot the cross-subject scenario, since it has more sources and therefore makes the visualization more informative. In Fig. 14, each color stands for a source domain, and the target domain is in black. For a clearer plot, we render the target samples semi-transparently to avoid overlap. Note that in the lower left figure we pick 1400 samples, since all sources are concatenated into one.
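The sampling step behind such a plot can be sketched as follows. The helper names and the toy feature arrays are hypothetical, and `embed_2d` assumes scikit-learn is installed; this is only an illustration of picking 100 samples per domain before projecting them to 2-D.

```python
import numpy as np

def sample_per_domain(features_by_domain, n_samples=100, seed=0):
    """Randomly pick n_samples rows per domain; return features and domain ids."""
    rng = np.random.default_rng(seed)
    picked, domain_ids = [], []
    for domain_id, feats in enumerate(features_by_domain):
        idx = rng.choice(len(feats), size=n_samples, replace=False)
        picked.append(feats[idx])
        domain_ids.append(np.full(n_samples, domain_id))
    return np.concatenate(picked), np.concatenate(domain_ids)

def embed_2d(features, seed=0):
    """Project features to 2-D with t-SNE (requires scikit-learn)."""
    from sklearn.manifold import TSNE
    return TSNE(n_components=2, random_state=seed).fit_transform(features)

rng = np.random.default_rng(1)
# Toy stand-in: 14 source domains plus 1 target, 300 vectors of size 32 each.
domains = [rng.normal(loc=i, size=(300, 32)) for i in range(15)]
features, domain_ids = sample_per_domain(domains, n_samples=100)
# points = embed_2d(features)  # scatter-plot points, colored by domain_ids
```

The returned `domain_ids` array is what would drive the per-domain colors, with the target domain drawn in black.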

(a) MS-MDA with different batch size
(b) DAN with different batch size
(c) MS-MDA with different epochs
(d) DAN with different epochs
Fig. 10: Evaluation of MS-MDA and DAN with different batch size and epochs. Each bar stands for one cross scenario for one dataset.
(a) Batch size
(b) Epoch
Fig. 13: Evaluation of MS-MDA and DAN with different settings of batch size and epochs. Each line stands for one cross scenario for one dataset.

Fig. 14: Visualization with t-SNE for raw data (upper left), normalized data (upper right), data using DAN (lower left), and data using MS-MDA (lower right). The input data of the last fully-connected (DSC) layer are used to compute the t-SNE. Target data are drawn as black X markers, and the 14 source domains are drawn in 14 colors. Note that since all the source domains are concatenated for DAN, the lower left figure has only one color for the source domain. All four figures are best viewed in color.

VI Discussion

As can be seen from Table II, compared with the selected methods and prior works, our proposed method achieves a significant improvement, especially for cross-subject DA, in which the number of source domains is large. The ablation experiments in Table III also show that our proposed method requires both the MMD loss and the discrepancy loss in most cases. Eliminating the MMD loss causes a significant performance drop on both datasets, confirming the importance of DA; eliminating the disc. loss has a smaller impact, but still verifies the benefit of multi-source convergence. Also, during the experiments, we find that the type of data normalization has a significant impact on the overall results, so we design experiments to explore the normalization of EEG data in DA and help improve the performance of our model. As can be seen in Table IV, there is not much difference between the two normalization orders, and electrode-wise normalization is the most appropriate, giving a substantial performance improvement over the other three methods; for our method, which does not concatenate data, electrode-wise normalization is also the most effective. This conclusion is in line with our intuition that data collected from the same electrode are relatively more regular or conform to a certain distribution, while data collected from different electrodes differ greatly. In addition, during the experiments, we find that the disc. loss needs to be carefully adjusted, otherwise it easily causes harmful effects. We conjecture that this is because the loss introduces a convergence effect on the multiple classifiers in the model (in other words, it smooths the inferences made by multiple classifiers), and if most of the classifiers are wrong, this convergence effect will cause the correct classifiers to err. Therefore, we also evaluate the impact of the disc. loss coefficient on the model under different settings. From Table V, we can see that the disc. loss achieves the best results when it is set to 0.01 times the MMD loss coefficient and is used throughout model training.
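The convergence effect described above can be made concrete with a small numerical sketch. It assumes, for illustration only, that the disc. loss is the mean absolute difference between the class probabilities of every pair of branch classifiers and that the final inference averages the branch probabilities; the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discrepancy(branch_logits):
    """Mean absolute difference in class probabilities over all branch pairs."""
    probs = [softmax(l) for l in branch_logits]
    pairs = [(i, j) for i in range(len(probs)) for j in range(i + 1, len(probs))]
    if not pairs:  # a single branch has nothing to disagree with
        return 0.0
    return float(np.mean([np.abs(probs[i] - probs[j]).mean() for i, j in pairs]))

def ensemble_prediction(branch_logits):
    """Average the branch probabilities and take the arg-max class."""
    return np.mean([softmax(l) for l in branch_logits], axis=0).argmax(axis=-1)

# One target sample, three classes, three branches:
# two wrong branches agree on class 0, one correct branch picks class 2.
wrong = np.array([[3.0, 0.0, 0.0]])
right = np.array([[0.0, 0.0, 3.0]])
d_mixed = discrepancy([wrong, wrong, right])   # positive: branches disagree
d_agree = discrepancy([wrong, wrong, wrong])   # zero: full (wrong) agreement
pred = ensemble_prediction([wrong, wrong, right])
```

In this toy case the discrepancy is smallest when the correct branch gives up and agrees with the incorrect majority, which illustrates why an overly large coefficient can be harmful.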

After exploring the internal details of the model, we also evaluate its performance under different hyper-parameters. For better comparison, we choose the representative DAN as the comparison method. From Fig. 10 and Fig. 13, we can see that the performance of both models decreases significantly as the batch size increases. We assume the reason is that a small batch size tends to fall into local optimal overfitting. The performance of both models increases slightly with the number of epochs. From Fig. 10 and Fig. 13, we can also clearly see that MS-MDA has a significant advantage over DAN in cross-subject DA, where the number of source domains is large, which confirms the importance of constructing multiple branches so that each source domain is adapted separately.

Although the results make clear that our proposed method brings a significant performance improvement, we also find that the training time increases linearly with the number of source domains: the more source domains there are and the larger the model is, the longer the training takes. This is unlike concatenating all source data into one domain, where the only additional time comes from the increase in the amount of data. To address this problem, our current idea is to selectively discard some less relevant source domains and not build DA branches for them, allowing the disc. loss to play a more prominent role because there is less negative information. In addition, the encoders in the current model are the simplest MLPs, while many works have verified the usability of LSTM for EEG data [49, 50, 51]; we will consider switching to LSTM encoders in future work.

VII Conclusion

In this paper, we propose MS-MDA, a domain adaptation method for EEG-based emotion recognition that is applicable to situations with multiple source domains. Through experimental evaluation, we find that this method adapts better to multiple source domains, which is validated by comparison with the selected approaches and the SOTA models, especially in cross-subject experiments, where our proposed method achieves up to 20% improvement. In addition, we explore the impact of different normalization methods for EEG data in domain adaptation, which we believe can serve as an inspiration for other EEG-based works while improving the effectiveness of the models. As for future work, the current model constructs a DA branch for each source domain without selection, which increases the model size and training time in proportion to the number of source domains and also introduces source-domain information that is not relevant to the target. A more efficient approach may be to selectively build DA branches from a reservoir of source domains, so that the model focuses only on the source-domain information that is relevant to the target domain.


References

  • [1] R. J. Dolan, “Emotion, cognition, and behavior,” Science, vol. 298, no. 5596, pp. 1191–1194, 2002.
  • [2] C. M. Tyng, H. U. Amin, M. N. Saad, and A. S. Malik, “The influences of emotion on learning and memory,” Frontiers in psychology, vol. 8, p. 1454, 2017.
  • [3] M. Jeon, “Emotions and affect in human factors and human–computer interaction: Taxonomy, theories, approaches, and methods,” in Emotions and affect in human factors and human-computer interaction.   Elsevier, 2017, pp. 3–26.
  • [4] N. Birbaumer, “Breaking the silence: brain–computer interfaces (bci) for communication and motor control,” Psychophysiology, vol. 43, no. 6, pp. 517–532, 2006.
  • [5] S.-H. Lee, M. Lee, J.-H. Jeong, and S.-W. Lee, “Towards an eeg-based intuitive bci communication system using imagined speech and visual imagery,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC).   IEEE, 2019, pp. 4409–4414.
  • [6] A. Frisoli, C. Loconsole, D. Leonardis, F. Banno, M. Barsotti, C. Chisari, and M. Bergamasco, “A new gaze-bci-driven control of an upper limb exoskeleton for rehabilitation in real-world tasks,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1169–1179, 2012.
  • [7] L. F. Barrett, J. Gross, T. C. Christensen, and M. Benvenuto, “Knowing what you’re feeling and knowing what to do about it: Mapping the relation between emotion differentiation and emotion regulation,” Cognition & Emotion, vol. 15, no. 6, pp. 713–724, 2001.
  • [8] J. Joormann and I. H. Gotlib, “Emotion regulation in depression: Relation to cognitive inhibition,” Cognition and Emotion, vol. 24, no. 2, pp. 281–298, 2010.
  • [9] R. S. Bucks and S. A. Radford, “Emotion processing in alzheimer’s disease,” Aging & mental health, vol. 8, no. 3, pp. 222–232, 2004.
  • [10] P. Ekman, “Facial expression and emotion.” American psychologist, vol. 48, no. 4, p. 384, 1993.
  • [11] B. Ay, O. Yildirim, M. Talo, U. B. Baloglu, G. Aydin, S. D. Puthankattil, and U. R. Acharya, “Automated depression detection using deep representation and sequence learning with eeg signals,” Journal of medical systems, vol. 43, no. 7, pp. 1–12, 2019.
  • [12] S. Sanei and J. A. Chambers, EEG signal processing.   John Wiley & Sons, 2013.
  • [13] U. R. Acharya, V. K. Sudarshan, H. Adeli, J. Santhosh, J. E. Koh, and A. Adeli, “Computer-aided diagnosis of depression using eeg signals,” European neurology, vol. 73, no. 5-6, pp. 329–336, 2015.
  • [14] Y. Liu, H. Zhang, M. Chen, and L. Zhang, “A boosting-based spatial-spectral model for stroke patients’ eeg analysis in rehabilitation training,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 24, no. 1, pp. 169–179, 2015.
  • [15] C. Jiang, Y. Li, Y. Tang, and C. Guan, “Enhancing eeg-based classification of depression patients using spatial information,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 566–575, 2021.
  • [16] B. Zhang, G. Yan, Z. Yang, Y. Su, J. Wang, and T. Lei, “Brain functional networks based on resting-state eeg data for major depressive disorder analysis and classification,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 215–229, 2020.
  • [17] B. Hosseinifard, M. H. Moradi, and R. Rostami, “Classifying depression patients and normal subjects using machine learning techniques and nonlinear features from eeg signal,” Computer methods and programs in biomedicine, vol. 109, no. 3, pp. 339–345, 2013.
  • [18] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
  • [19] W.-L. Zheng and B.-L. Lu, “Investigating critical frequency bands and channels for eeg-based emotion recognition with deep neural networks,” IEEE Transactions on Autonomous Mental Development, vol. 7, no. 3, pp. 162–175, 2015.
  • [20] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, “Deap: A database for emotion analysis; using physiological signals,” IEEE transactions on affective computing, vol. 3, no. 1, pp. 18–31, 2011.
  • [21] W.-L. Zheng and B.-L. Lu, “Personalizing eeg-based affective models with transfer learning,” in Proceedings of the twenty-fifth international joint conference on artificial intelligence, 2016, pp. 2732–2738.
  • [22] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE transactions on neural networks, vol. 22, no. 2, pp. 199–210, 2010.
  • [23] X. Chai, Q. Wang, Y. Zhao, Y. Li, D. Liu, X. Liu, and O. Bai, “A fast, efficient domain adaptation technique for cross-domain electroencephalography (eeg)-based emotion recognition,” Sensors, vol. 17, no. 5, p. 1014, 2017.
  • [24] Y.-P. Lin, P.-K. Jao, and Y.-H. Yang, “Improving cross-day eeg-based emotion classification using robust principal component analysis,” Frontiers in computational neuroscience, vol. 11, p. 64, 2017.
  • [25] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
  • [26] J. Li, S. Qiu, Y.-Y. Shen, C.-L. Liu, and H. He, “Multisource transfer learning for cross-subject eeg emotion recognition,” IEEE transactions on cybernetics, vol. 50, no. 7, pp. 3281–3293, 2019.
  • [27] Y.-M. Jin, Y.-D. Luo, W.-L. Zheng, and B.-L. Lu, “Eeg-based emotion recognition using domain adaptation network,” in 2017 international conference on orange technologies (ICOT).   IEEE, 2017, pp. 222–225.
  • [28] H. Li, Y.-M. Jin, W.-L. Zheng, and B.-L. Lu, “Cross-subject emotion recognition using deep adaptation networks,” in International conference on neural information processing.   Springer, 2018, pp. 403–413.
  • [29] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning.   PMLR, 2015, pp. 97–105.
  • [30] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
  • [31] X. Chai, Q. Wang, Y. Zhao, X. Liu, O. Bai, and Y. Li, “Unsupervised domain adaptation techniques based on auto-encoder for non-stationary eeg-based emotion recognition,” Computers in biology and medicine, vol. 79, pp. 205–214, 2016.
  • [32] J. Li, S. Qiu, C. Du, Y. Wang, and H. He, “Domain adaptation for eeg emotion recognition based on latent representation similarity,” IEEE Transactions on Cognitive and Developmental Systems, vol. 12, no. 2, pp. 344–353, 2019.
  • [33] J. Li, H. Chen, and T. Cai, “Foit: Fast online instance transfer for improved eeg emotion recognition,” in 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).   IEEE, 2020, pp. 2618–2625.
  • [34] W.-L. Zheng, W. Liu, Y. Lu, B.-L. Lu, and A. Cichocki, “Emotionmeter: A multimodal framework for recognizing human emotions,” IEEE transactions on cybernetics, vol. 49, no. 3, pp. 1110–1122, 2018.
  • [35] W. Yin, H. Schütze, B. Xiang, and B. Zhou, “Abcnn: Attention-based convolutional neural network for modeling sentence pairs,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 259–272, 2016.
  • [36] F. Fahimi, Z. Zhang, W. B. Goh, T.-S. Lee, K. K. Ang, and C. Guan, “Inter-subject transfer learning with an end-to-end deep convolutional neural network for eeg-based bci,” Journal of neural engineering, vol. 16, no. 2, p. 026007, 2019.
  • [37] L.-M. Zhao, X. Yan, and B.-L. Lu, “Plug-and-play domain adaptation for cross-subject eeg-based emotion recognition,” in Proceedings of the 35th AAAI Conference on Artificial Intelligence.   sn, 2021.
  • [38] Y. Wang, S. Qiu, X. Ma, and H. He, “A prototype-based spd matrix network for domain adaptation eeg emotion recognition,” Pattern Recognition, vol. 110, p. 107626, 2021.
  • [39] Y. Zhu, F. Zhuang, and D. Wang, “Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5989–5996.
  • [40] M. W. Gardner and S. Dorling, “Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences,” Atmospheric environment, vol. 32, no. 14-15, pp. 2627–2636, 1998.
  • [41] R.-N. Duan, J.-Y. Zhu, and B.-L. Lu, “Differential entropy feature for eeg-based emotion classification,” in 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER).   IEEE, 2013, pp. 81–84.
  • [42] M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic, “Analysis of eeg signals and facial expressions for continuous emotion detection,” IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 17–28, 2015.
  • [43] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
  • [44] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML, 2010.
  • [45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [46] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
  • [47] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in European conference on computer vision.   Springer, 2016, pp. 443–450.
  • [48] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [49] Y. Jiao, Y. Deng, Y. Luo, and B.-L. Lu, “Driver sleepiness detection from eeg and eog signals using gan and lstm networks,” Neurocomputing, vol. 408, pp. 100–111, 2020.
  • [50] L.-Y. Tao and B.-L. Lu, “Emotion recognition under sleep deprivation using a multimodal residual lstm network,” in 2020 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2020, pp. 1–8.
  • [51] J. Ma, H. Tang, W.-L. Zheng, and B.-L. Lu, “Emotion recognition using multimodal residual lstm network,” in Proceedings of the 27th ACM international conference on multimedia, 2019, pp. 176–183.