Emotion recognition is an important subarea of affective computing, which focuses on recognizing human emotions based on a variety of modalities, such as audio-visual expressions, body language, physiological signals, etc. Compared to other modalities, physiological signals, such as electroencephalogram (EEG), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), etc., have the advantage of being difficult to hide or disguise. In recent years, due to the rapid development of noninvasive, easy-to-use and inexpensive EEG recording devices, EEG-based emotion recognition has received an increasing amount of attention in both research  and applications .
Emotion models can be broadly categorized into discrete models and dimensional models. The former categorize emotions into discrete entities, e.g., anger, disgust, fear, happiness, sadness, and surprise in Ekman’s theory. The latter describe emotions using their underlying dimensions, e.g., valence, arousal and dominance, which measure emotions from unpleasant to pleasant, passive to active, and submissive to dominant, respectively.
EEG signals measure voltage fluctuations from the cortex in the brain and have been shown to reveal important information about human emotional states . For example, greater relative left frontal EEG activity has been observed when experiencing positive emotions . The voltage fluctuations on different brain regions are measured by electrodes attached to the scalp. Each electrode collects EEG signals in one channel. The collected EEG signals are often analyzed in specific frequency bands for each channel, namely delta (1-4 Hz), theta (4-7 Hz), alpha (8-13 Hz), beta (13-30 Hz), and gamma (>30 Hz).
Many existing EEG-based emotion recognition methods are primarily based on the supervised machine learning approach, wherein features are extracted from preprocessed EEG signals in each channel over a time window and a classifier is then trained on the extracted features to recognize emotions. Wang et al. compared power spectral density (PSD) features, wavelet features and nonlinear dynamical features with a Support Vector Machine (SVM) classifier. Zheng and Lu investigated critical frequency bands and channels using PSD, differential entropy (DE) and PSD asymmetry features, and obtained robust accuracy using deep belief networks (DBN). However, most existing EEG-based emotion recognition approaches do not address the following three challenges: 1) the topological structure of EEG signals is not effectively exploited to learn more discriminative EEG representations; 2) EEG signals vary significantly across different subjects, which hinders the generalizability of the trained classifiers; and 3) participants may not always generate the intended emotions when watching emotion-eliciting stimuli, so the emotion labels in the collected EEG data are noisy and may not be consistent with the actual elicited emotions.
Prior studies have incorporated spatial relations in EEG signals using convolutional neural networks (CNN) and recurrent neural networks (RNN), respectively. However, their approaches require a 2D representation of EEG channels on the scalp, which may cause information loss during flattening because channels are actually arranged in 3D space. In addition, using CNNs and RNNs to capture inter-channel relations makes it difficult to learn long-range dependencies. Graph neural networks (GNN) have also been applied to capture inter-channel relations using an adjacency matrix. However, similar to CNNs and RNNs, this approach only considers relations between the nearest channels and thus may lose valuable information between distant channels, such as the PSD asymmetry between channels on the left and right hemispheres in the frontal region, which has been shown to be informative in valence prediction. A recent work applies RNNs to learn EEG representations in the two hemispheres separately and then adopts the asymmetric differences between them to recognize emotions. However, this approach is limited to using only the bi-hemispherical discrepancies and ignores other useful features such as neuronal activities recorded from each channel.
In recent years, several studies [73, 11] investigated the transferability of EEG-based emotion recognition models across subjects. Lan et al. compared several domain adaptation techniques such as maximum independence domain adaptation (MIDA), transfer component analysis (TCA), subspace alignment (SA), etc. They found that the subject-independent classification accuracy can be improved by around 10%. Li et al. applied domain adversarial learning to lower the influence of individual subjects on EEG data and obtained improved performance as well. However, their approaches do not exploit any graph structure and only lead to a small performance improvement (see Section 7.1).
To the best of our knowledge, no attempt has been made to address the problem of noisy labels in EEG-based emotion recognition.
In this paper, we propose a regularized graph neural network (RGNN) aiming to address all three aforementioned challenges. Graph analysis for human brain has been studied extensively in the neuroscience literature [19, 21]. However, making an accurate connectome is still an open question and subject to different scales . Inspired by [9, 56], we consider each channel in EEG signals as a node in our graph. Our RGNN model extends the simple graph convolution network (SGC)  and leverages the topological structure of EEG signals, i.e., according to the economy of brain network organization , we propose a biologically supported sparse adjacency matrix to capture both local and global inter-channel relations. Local inter-channel relations connect nearby groups of neurons and may reveal anatomical connectivity at macroscale [15, 21]. Global inter-channel relations connect distant groups of neurons between the left and right hemispheres and may reveal emotion-related functional connectivity [52, 38].
In addition, we propose a node-wise domain adversarial training (NodeDAT) to regularize our graph model for better generalization in subject-independent classification scenarios. Different from the domain adversarial training adopted by [22, 38], our NodeDAT gives a finer-grained regularization by minimizing the domain discrepancies between features in the source and target domains for each channel/node. Moreover, we propose an emotion-aware distribution learning (EmotionDL) method to address the problem of noisy labels in the datasets. Prior studies have shown that noisy labels can adversely impact classification accuracy . Instead of learning single-label classification, our EmotionDL learns a distribution of labels of the training data and thus acts as a regularizer to improve the robustness of our model against noisy labels. Finally, we conduct extensive experiments to validate the effectiveness of our proposed model and investigate emotion-related informative neuronal activities.
In summary, the main contributions of this paper are as follows:
We propose a regularized graph neural network (RGNN) model to recognize emotions based on EEG signals. Our model is biologically supported and captures both local and global inter-channel relations.
We propose two regularizers: a node-wise domain adversarial training (NodeDAT) and an emotion-aware distribution learning (EmotionDL), which aim to improve the robustness of our model against cross-subject variations and noisy labels, respectively.
We conduct extensive experiments in both subject-dependent and subject-independent classification settings on two public EEG datasets, namely SEED and SEED-IV. Experimental results demonstrate the effectiveness of our proposed model and regularizers. In addition, our RGNN achieves superior performance over the state-of-the-art baselines in most experimental settings.
We investigate the neuronal activities and the results reveal that pre-frontal, parietal and occipital regions may be the most informative regions for emotion recognition. In addition, global inter-channel relations between the left and right hemispheres are important and local inter-channel relations between (FP1, AF3), (F6, F8) and (FP2, AF4) may also provide useful information.
2 Related Work
In this section, we review related work in the fields of EEG-based emotion recognition, graph neural networks, unsupervised domain adaptation and learning with noisy labels.
2.1 EEG-Based Emotion Recognition
EEG feature extractors and classifiers are the two fundamental components in the machine learning approach to EEG-based emotion recognition. EEG features can be broadly divided into single-channel features and multi-channel ones. The majority of existing features are single-channel features such as statistical features [59, 61], fractal dimension (FD), PSD, differential entropy (DE), and wavelet features. A few features are computed on multiple channels to capture the inter-channel relations, e.g., the asymmetry features of PSD and functional connectivity [65, 35], where common indices such as correlation, coherence and phase synchronization were used to estimate brain functional connectivity between channels. However, leveraging functional connectivity requires labor-intensive manual connectivity analysis for each subject and may not be ideal for real-time applications.
EEG classifiers can be broadly divided into topology-invariant classifiers and topology-aware ones. The majority of existing classifiers are topology-invariant classifiers such as SVM, k-Nearest Neighbors (KNN), DBNs and RNNs , which do not take the topological structure of EEG features into account when learning the EEG representations. In contrast, topology-aware classifiers such as CNNs [5, 36, 68, 34] and GNNs  consider the inter-channel topological relations and learn EEG representations for each channel by aggregating features from nearby channels using convolutional operations either in the Euclidean space or in the non-Euclidean space. However, as discussed in Section 1, existing CNNs and GNNs have difficulty in learning the dependencies between distant channels, which may reveal important emotion-related information. Recently, Zhang et al.  and Li et al.  proposed to use RNNs to learn spatial topological relations between channels by scanning electrodes in both vertical and horizontal directions. However, their approaches do not fully exploit the topological structure of EEG channels. For example, two topologically close channels may be far away from each other in the scanning sequence.
2.2 Graph Neural Networks
Graph neural networks (GNN) are a class of neural networks dealing with data in graph domains, e.g., molecular structures, social networks and knowledge graphs. One early work on GNNs aimed to learn a converged static state embedding for each node in the graph using a transition function applied to its neighborhood. Later, inspired by the convolutional operation of CNNs in Euclidean domains, Bruna et al. combined spectral graph theory with neural networks and defined convolutional operations in graph domains using the spectral filters computed from the normalized graph Laplacian. Following this line of research, Defferrard et al. proposed fast localized convolutions by using a recursive formulation of the $K$-order Chebyshev polynomials to approximate the filters. The resulting representation for each node is an aggregation of its $K$-order neighborhood. Kipf and Welling further limited $K = 1$ and proposed the standard graph convolutional network (GCN) with a faster localized graph convolutional operation. The convolutional layers in GCN can be stacked $K$ times to effectively convolve the $K$-order neighborhood of a node. Recently, Wu et al. simplified GCN by removing the nonlinearities between convolutional layers and proposed the simple graph convolution network (SGC), which effectively behaves like a linear feature transformation followed by a logistic regression. SGC performs orders of magnitude faster than GCNs with comparable classification accuracy. In this paper, we extend SGC to model EEG signals and propose a biologically supported adjacency matrix and two regularizers for robust EEG-based emotion recognition.
2.3 Unsupervised Domain Adaptation
Unsupervised domain adaptation aims to mitigate the domain shift in knowledge transfer from a supervised source domain to an unsupervised target domain. The most common approaches are instance re-weighting, domain-invariant feature learning, domain mapping and normalization statistics. Instance re-weighting methods aim to infer the resampling weight directly by feature distribution matching across source and target domains in a non-parametric manner. Domain-invariant feature learning methods align features from both source and target domains to a common feature space. The alignment can be achieved by minimizing divergence, maximizing reconstruction or adversarial training [6]. Normalization statistics are based on the assumption that the batch norm statistics learn domain knowledge. Carlucci et al. performed domain adaptation by modulating the batch norm layers’ statistics from source to target domain. Our proposed NodeDAT regularizer extends domain adversarial training to graph neural networks and achieves finer-grained regularization by minimizing the discrepancies between features in source and target domains for each channel/node individually.
2.4 Learning with Noisy Labels
Commonly adopted approaches to learning with noisy labels are based on the noise transition matrix and robust loss functions. The noise transition matrix specifies the probabilities of transition from each ground-truth label to each noisy label and is often applied to modify the cross-entropy loss. The matrix can be pre-computed as a prior or estimated from noisy data. A few studies tackle noisy labels by using noise-tolerant robust loss functions, such as unhinged loss and ramp loss. Other approaches include bootstrapping, which leverages predicted labels to generate training targets, and alternately updating network parameters and labels during training. Our proposed EmotionDL regularizer is inspired by prior work that applies distribution learning to learn labels with ambiguity in the computer vision domain.
3 Preliminaries
In this section, we introduce the preliminaries of the simple graph convolution network (SGC) and its spectral analysis, which is the basis of our RGNN model.
3.1 Simple Graph Convolution Network (SGC)
Consider a graph $G = (V, E)$, where $V$ denotes a set of nodes and $E$ denotes a set of edges between nodes in $V$. Data on $V$ can be represented by a feature matrix $X \in \mathbb{R}^{n \times d}$, where $n$ denotes the number of nodes and $d$ denotes the input feature dimension. The edge set $E$ can be represented by a weighted adjacency matrix $A \in \mathbb{R}^{n \times n}$ with self-loops, i.e., $A_{ii} = 1$, $i = 1, 2, \ldots, n$. In general, GNNs learn a feature transformation function for $X$ and produce output $Z \in \mathbb{R}^{n \times d'}$, where $d'$ denotes the output feature dimension.
Between adjacent layers in GNNs, the feature transformation can be written as
$$H^{l+1} = f(H^{l}, A) \tag{1}$$
where $l = 0, 1, \ldots, L-1$, $L$ denotes the number of layers, $H^{0} = X$, $H^{L} = Z$, and $f$ denotes the function we want to learn. A simple definition of $f$ would be
$$f(H^{l}, A) = \sigma(A H^{l} W^{l}) \tag{2}$$
where $\sigma$ denotes a non-linear function and $W^{l}$ denotes a weight matrix at layer $l$. For each node $i$, function $f$ simply sums up all node features in its neighborhood including itself, followed by a non-linear transformation. However, one major limitation of $f$ in (2) is that repeatedly applying $f$ along multiple layers may lead to $H^{l}$ with overly large values due to summation. Kipf and Welling alleviated this limitation by proposing the graph convolution network (GCN) as follows:
$$f(H^{l}, A) = \sigma(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} H^{l} W^{l}) \tag{3}$$
where $D$ denotes the diagonal degree matrix of $A$, i.e., $D_{ii} = \sum_{j} A_{ij}$. The normalized adjacency matrix $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ prevents $H^{l}$ from growing overly large. If we ignore $W^{l}$ and $\sigma$ temporarily and expand (3), the hidden state for node $i$, $h_i^{l+1}$, can be computed via
$$h_i^{l+1} = \sum_{j} \frac{A_{ij}}{\sqrt{D_{ii} D_{jj}}} h_j^{l} \tag{4}$$
Note that each neighboring $h_j^{l}$ is now normalized by both the degrees of $i$ and $j$. Therefore, essentially, for each node, the feature transformation function in GCN is a non-linear transformation of the weighted sum of node features of itself and its neighborhood. Successively applying $L$ graph convolutional layers aggregates node features within a neighborhood of size $L$.
To further accelerate training while keeping comparable performance, Wu et al. proposed SGC by removing the non-linear function $\sigma$ in (3) and reparameterizing all linear transformations across all layers into one linear transformation as follows:
$$Z = S^{L} X W \tag{5}$$
where $S = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, and $W = W^{0} W^{1} \cdots W^{L-1}$. Essentially, SGC computes a topology-aware linear transformation $\bar{X} = S^{L} X$, followed by one final linear transformation $Z = \bar{X} W$.
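The propagation rule described in this subsection can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the helper names (`matmul`, `normalize_adjacency`, `sgc_forward`) and the three-node toy graph are hypothetical and are not taken from the authors' implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}, where D_ii is the i-th row sum of A."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def sgc_forward(adj, features, weight, num_layers):
    """Apply the normalized adjacency num_layers times, then one linear map."""
    s = normalize_adjacency(adj)
    hidden = features
    for _ in range(num_layers):
        hidden = matmul(s, hidden)
    return matmul(hidden, weight)

# Toy path graph 0 - 1 - 2 with self-loops (hypothetical example data).
adj = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0],
       [0.0, 1.0, 1.0]]
features = [[1.0], [0.0], [0.0]]   # one scalar feature per node
weight = [[1.0]]                   # identity linear map
one_hop = sgc_forward(adj, features, weight, num_layers=1)
two_hop = sgc_forward(adj, features, weight, num_layers=2)
```

With one propagation step, node 2 receives nothing from node 0 because the two nodes are not adjacent; with two steps it does, which illustrates how the number of propagation steps sets the size of the aggregated neighborhood.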
3.2 Spectral Graph Convolution
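We analyze GCN from the perspective of spectral graph theory. Graph Fourier analysis relies on the graph Laplacian $\mathcal{L} = D - A$ or the normalized graph Laplacian $\hat{\mathcal{L}} = I_n - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. Since $\hat{\mathcal{L}}$ is a symmetric positive semidefinite matrix, it can be decomposed as $\hat{\mathcal{L}} = U \Lambda U^{\top}$, where $U$ is the orthonormal eigenvector matrix of $\hat{\mathcal{L}}$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of corresponding eigenvalues. Given graph data $X$, the graph Fourier transform of $X$ is $\hat{X} = U^{\top} X$, and the inverse Fourier transform of $\hat{X}$ is $X = U \hat{X}$. Hence, the graph convolution between $X$ and a filter $G$ is computed as follows:
$$G * X = U\left((U^{\top} G) \odot (U^{\top} X)\right) = U \hat{G} U^{\top} X \tag{6}$$
where $\odot$ denotes element-wise multiplication, and $\hat{G} = \mathrm{diag}(\hat{g}_1, \ldots, \hat{g}_n)$ denotes a diagonal matrix with spectral filter coefficients.
To reduce the current learning complexity of $O(n)$ to that of conventional CNN, i.e., $O(1)$, (6) can be approximated using the $K$th order polynomials as follows:
$$G * X \approx U \left( \sum_{k=0}^{K} \theta_k \Lambda^{k} \right) U^{\top} X = \sum_{k=0}^{K} \theta_k \hat{\mathcal{L}}^{k} X \tag{7}$$
where $\theta_k$ denotes coefficients. To further reduce computational cost, Defferrard et al. proposed to use Chebyshev polynomials to approximate the filtering operation as follows:
$$G * X \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\mathcal{L}}) X \tag{8}$$
where $\theta_k$ denotes learnable parameters, $\tilde{\mathcal{L}} = \frac{2}{\lambda_{max}} \hat{\mathcal{L}} - I_n$ denotes the scaled normalized Laplacian with its eigenvalues lying within $[-1, 1]$, and $T_k(x)$ denotes the Chebyshev polynomials recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$.
The GCN proposed by Kipf and Welling made a few approximations to simplify the filtering operation in (8): 1) use $K = 1$; 2) set $\lambda_{max} = 2$; and 3) set $\theta = \theta_0 = -\theta_1$. The resulting GCN arrives at (3). Essentially, the graph convolutional operations defined in (3) and (5) behave like a low-pass filter by smoothing the features of each node on the graph using node features in its neighborhood.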
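The Chebyshev recursion used in the polynomial approximation above can be checked numerically on scalars. The sketch below is illustrative only; the function name `chebyshev` is hypothetical, and it evaluates the polynomials on a number rather than on a Laplacian matrix.

```python
import math

def chebyshev(order, x):
    """Evaluate T_order(x) with T_0 = 1, T_1 = x, T_k = 2x T_{k-1} - T_{k-2}."""
    if order == 0:
        return 1.0
    prev, curr = 1.0, x
    for _ in range(order - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

t2 = chebyshev(2, 0.5)   # closed form: 2x^2 - 1
t3 = chebyshev(3, 0.5)   # closed form: 4x^3 - 3x
# For x in [-1, 1] the polynomials satisfy T_k(cos(a)) = cos(k * a).
identity_gap = abs(chebyshev(5, math.cos(0.3)) - math.cos(1.5))
```

The recursion needs only the two previous terms, which is why a filter of order K can be applied with K multiplications by the (sparse) scaled Laplacian instead of an explicit eigendecomposition.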
4 Regularized Graph Neural Network
In this section we present our regularized graph neural network (RGNN), specifically, the biologically supported adjacency matrix, and RGNN with two regularizers, i.e., node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL).
4.1 Adjacency Matrix in RGNN
The adjacency matrix $A \in \mathbb{R}^{n \times n}$ in RGNN represents the topological structure of EEG channels, where $n$ denotes the number of channels in EEG signals or nodes on the graph. Each entry $A_{ij}$ in the adjacency matrix indicates the weight of connection between channels $i$ and $j$. Note that $A$ contains self-loops. To reduce overfitting, we model $A$ as a symmetric matrix by using only $\frac{n(n+1)}{2}$ parameters instead of $n^2$. Salvador et al. observed that the strength of connection between brain regions decays as an inverse square or gravity-law function of physical distance. Hence, we initialize the local inter-channel relations in our adjacency matrix as follows:
$$A_{ij} = \min\left(1, \frac{\delta}{d_{ij}^{2}}\right) \tag{9}$$
where $i, j = 1, 2, \ldots, n$, $d_{ij}$ ($i \neq j$) denotes the physical distance between channels $i$ and $j$, computed from the data sheet of the recording device, and $\delta$ denotes a sparsity hyper-parameter controlling the decay rate of the connection between channels.
Bullmore and Sporns proposed that the brain organization is shaped by an economic trade-off between minimizing wiring costs and network running costs. Minimizing wiring costs encourages local inter-channel connections as modelled in (9). However, minimizing network running costs encourages certain global inter-channel connections for high efficiency of information transfer across the network as a whole. To this end, we add several global connections to our adjacency matrix. The global connections are subject to the specific EEG channel placement adopted in experiments. Fig. 1 depicts the global connections in both SEED and SEED-IV. The selection of global channels is supported by prior studies showing that the asymmetry in neuronal activities between the left and right hemispheres is informative in valence and arousal predictions [17, 52, 70]. To leverage the differential asymmetry information, we initialize the global inter-channel relations in $A$ to $[-1, 0]$ as follows:
$$A_{ij} = A_{ij} - 1 \tag{10}$$
where $(i, j)$ denotes the indices of empirically selected symmetric channel pairs that balance wiring cost and global efficiency: (FP1, FP2), (AF3, AF4), (F5, F6), (FC5, FC6), (C5, C6), (CP5, CP6), (P5, P6), (PO5, PO6), and (O1, O2). Note that our adjacency matrix obtained in (10) aims to represent the brain network which combines both local anatomical connectivity and emotion-related global functional connectivity.
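The two initialization steps of this subsection can be sketched as follows. The sketch assumes an inverse-square decay clipped at 1 for local connections and a shift by -1 for the selected symmetric pairs; the electrode coordinates, the value `delta=5.0` and the function names are hypothetical and serve only as an illustration.

```python
import math

def local_adjacency(coords, delta):
    """Inverse-square decay of connection weight with distance, clipped at 1."""
    n = len(coords)
    adj = [[1.0] * n for _ in range(n)]   # diagonal stays 1 (self-loops)
    for i in range(n):
        for j in range(n):
            if i != j:
                adj[i][j] = min(1.0, delta / math.dist(coords[i], coords[j]) ** 2)
    return adj

def add_global_connections(adj, pairs):
    """Shift the weights of selected symmetric channel pairs into [-1, 0]."""
    for i, j in pairs:
        adj[i][j] -= 1.0
        adj[j][i] = adj[i][j]             # keep the matrix symmetric
    return adj

# Hypothetical 3D electrode positions (arbitrary units), not real sensor data.
names = ["FP1", "FP2", "AF3", "AF4"]
coords = [(-3.0, 9.0, 2.0), (3.0, 9.0, 2.0), (-4.0, 7.0, 5.0), (4.0, 7.0, 5.0)]
adj = local_adjacency(coords, delta=5.0)
adj = add_global_connections(adj, [(0, 1), (2, 3)])  # (FP1, FP2), (AF3, AF4)
```

In this toy layout, nearby channels such as FP1 and AF3 keep a positive local weight, while the left-right pairs end up with negative weights, which matches the later remark that the adjacency matrix contains negative elements for global connections.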
The last step in constructing the adjacency matrix is finding an optimal value of the sparsity hyper-parameter $\delta$ to regularize the weights of connections between local channels. Achard and Bullmore observed that sparse fMRI networks, comprising around 20% of all possible connections, typically maximize the efficiency of the network topology. Thus, we choose $\delta$ such that around 20% of the entries in $A$ exceed, in absolute value, a small threshold that we pick empirically as the level below which connections between channels are considered negligible.
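The selection rule above can be illustrated with a small search over candidate decay constants. Everything specific in this sketch is an assumption: the inverse-square weight with clipping, the threshold value `0.1`, the candidate list, the squared distances, and the function names are all hypothetical; only the 20% target comes from the text.

```python
def dense_fraction(sq_dists, delta, threshold):
    """Fraction of off-diagonal weights min(1, delta / d^2) above threshold."""
    weights = [min(1.0, delta / d2) for d2 in sq_dists]
    return sum(w > threshold for w in weights) / len(weights)

def pick_delta(sq_dists, candidates, target=0.2, threshold=0.1):
    """Return the candidate whose dense fraction is closest to the target."""
    return min(candidates,
               key=lambda d: abs(dense_fraction(sq_dists, d, threshold) - target))

# Hypothetical squared distances between channel pairs.
sq_dists = [4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0, 100.0, 121.0]
best = pick_delta(sq_dists, candidates=[0.5, 1.0, 2.0, 4.0])
```

A larger decay constant keeps more pairs above the threshold, so the search simply trades density against the sparsity target.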
4.2 Dynamics of RGNN
Our RGNN model extends the SGC model. The architecture of RGNN is illustrated in Fig. 2. We are given EEG features $X \in \mathbb{R}^{N \times n \times d}$ and labels $Y \in \mathbb{Z}^{N}$, where $N$ denotes the number of training samples, $n$ denotes the number of nodes or channels, $d$ denotes the input feature dimension, $Y_i \in \{0, 1, \ldots, C-1\}$ denotes the label index, and $C$ denotes the number of classes. Our model aims to minimize the following cross-entropy loss:
$$\Phi = -\sum_{i=1}^{N} \log p(Y_i \mid X_i, \theta) + \alpha \lVert A \rVert_1 \tag{11}$$
where $\theta$ denotes the model parameters we want to optimize, and $\alpha$ denotes the L1 sparse regularization strength of our adjacency matrix $A$.
By passing each feature matrix $X_i$ into our RGNN, the output probability of class $Y_i$ can be computed as
$$p(Y_i \mid X_i, \theta) = \mathrm{softmax}_{Y_i}\left(\mathrm{pool}(\sigma(S^{L} X_i W)) W^{O}\right) \tag{12}$$
where $S$, $L$ and $W$ follow the definitions in (5), $X_i \in \mathbb{R}^{n \times d}$, $W^{O} \in \mathbb{R}^{d' \times C}$ denotes the output weight matrix, and $\mathrm{pool}(\cdot)$ denotes the sum pooling across all nodes on the graph. We choose sum pooling because it demonstrated more expressive power than mean pooling and max pooling. Note that we use the absolute values of $A$ to compute the degree matrix $D$ because $A$ has negative elements, e.g., global connections.
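The output stage described above (degree computation on absolute weights, sum pooling over nodes, linear output layer, softmax) can be sketched as follows. The ReLU activation, the toy numbers and the function names are assumptions for illustration and are not taken from the authors' code.

```python
import math

def abs_degree(adj):
    """Degree of each node computed from absolute edge weights."""
    return [sum(abs(w) for w in row) for row in adj]

def sum_pool(node_features):
    """Sum node representations feature-wise across all nodes."""
    return [sum(col) for col in zip(*node_features)]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(node_features, out_weight):
    """ReLU on node features, sum pooling, linear output layer, softmax."""
    activated = [[max(0.0, v) for v in row] for row in node_features]
    pooled = sum_pool(activated)
    logits = [sum(p * out_weight[k][c] for k, p in enumerate(pooled))
              for c in range(len(out_weight[0]))]
    return softmax(logits)

# Toy adjacency with one negative (global) connection between nodes 0 and 2.
adj = [[1.0, 0.5, -0.8], [0.5, 1.0, 0.2], [-0.8, 0.2, 1.0]]
degrees = abs_degree(adj)
probs = classify([[1.0, -2.0], [0.5, 1.0], [0.0, 3.0]],
                 [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]])
```

Using absolute weights keeps every degree strictly positive, so the inverse square root in the normalization stays well defined even when negative connections would otherwise cancel positive ones in a row sum.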
4.2.1 Node-wise Domain Adversarial Training
EEG signals vary significantly across different subjects, which hinders the generalizability of trained classifiers. To improve subject-independent classification performance, we extend domain adversarial training by proposing a node-wise domain adversarial training (NodeDAT) to reduce the discrepancies between source and target domains, i.e., training and testing sets, respectively. Specifically, a domain classifier is proposed to classify each node representation into either the source domain or the target domain. Compared to prior domain adversarial approaches, which only regularize the pooled representation in the last layer, our NodeDAT has finer-grained regularization because it explicitly regularizes each node representation before pooling (see Section 7.1). During optimization, our model aims to confuse the domain classifier by learning domain-invariant representations for each node.
Specifically, given source/training data (in this subsection, we denote by for better clarity) and unlabelled target/testing data , where in practice can be either oversampled or downsampled to have the same number of samples as , the domain classifier aims to minimize the sum of the following two binary cross-entropy losses:
where 0 and 1 denote source and target domains, respectively. Intuitively, the domain classifier aims to classify source data as 0 and target data as 1. The domain probabilities for node are computed as
where denotes the th node representation in , and denotes the model parameters in the domain classifier. Essentially, our NodeDAT encourages learning domain-invariant node representations by trying to confuse the domain classifier.
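To make the node-wise character of this objective concrete, the sketch below sums a binary cross-entropy term over every node of every sample, with source nodes labelled 0 and target nodes labelled 1. It is a simplified illustration under assumptions: the domain probabilities are passed in directly rather than produced by a learned classifier, and the function name is hypothetical.

```python
import math

def node_domain_loss(source_probs, target_probs):
    """Binary cross-entropy summed over every node of every sample.

    source_probs[i][j] and target_probs[i][j] are the predicted probabilities
    that node j of sample i comes from the target domain (label 1).
    """
    loss = 0.0
    for sample in source_probs:
        loss -= sum(math.log(1.0 - p) for p in sample)   # source nodes: label 0
    for sample in target_probs:
        loss -= sum(math.log(p) for p in sample)         # target nodes: label 1
    return loss

# One source sample and one target sample, two nodes each.
confused = node_domain_loss([[0.5, 0.5]], [[0.5, 0.5]])
separated = node_domain_loss([[0.1, 0.1]], [[0.9, 0.9]])
```

The domain classifier is trained to lower this loss, whereas the feature extractor is pushed in the opposite direction, towards the `confused` regime in which node representations from both domains look alike.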
Note that our domain classifier implements a gradient reversal layer (GRL) to reverse the gradients of the domain classifier during backpropagation. The gradients are further scaled by a GRL scaling factor $\beta$, which gradually increases from 0 to 1 as the training progresses. The gradually increasing $\beta$ allows our domain classifier to be less sensitive to noisy inputs at the early stages of the training process. Specifically, following the schedule suggested in the original GRL work, we let $\beta = \frac{2}{1 + e^{-10p}} - 1$, where $p \in [0, 1]$ denotes the training progress.
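The behaviour of the scaling factor can be sketched numerically. The sigmoid-shaped schedule and the steepness constant 10 follow the schedule commonly used with gradient reversal layers and should be read as an assumption with respect to this paper; the function names are hypothetical.

```python
import math

def grl_scale(progress, gamma=10.0):
    """Scaling factor that rises smoothly from 0 towards 1 as progress goes 0 -> 1."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

def reversed_gradient(grad, progress):
    """Gradient passed back through the reversal layer: negated and scaled."""
    scale = grl_scale(progress)
    return [-scale * g for g in grad]

schedule = [grl_scale(p / 10.0) for p in range(11)]   # progress 0.0, 0.1, ..., 1.0
late_grad = reversed_gradient([1.0, -2.0], progress=1.0)
```

At the start of training the reversed gradient is switched off entirely, so an untrained domain classifier cannot disturb the feature extractor; the adversarial signal is then phased in as the schedule approaches 1.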
4.2.2 Emotion-aware Distribution Learning
Participants may not always generate the intended emotions when watching emotion-eliciting stimuli. To address the problem of noisy emotion labels in the datasets, we propose an emotion-aware distribution learning method (EmotionDL) to learn a distribution of classes instead of one single class for each training sample. Specifically, we convert each training label $Y_i$ into a prior probability distribution $\hat{Y}_i$ over all classes, where $\hat{Y}_{ic}$ denotes the probability of class $c$ in $\hat{Y}_i$. The conversion is dataset-dependent. In SEED, there are three classes: negative, neutral, and positive with corresponding class indices 0, 1, and 2, respectively. We convert $Y_i$ as follows:
where denotes a hyper-parameter controlling the noise level in the training labels. This conversion mechanism is based on our assumption that participants are unlikely to generate opposite emotions when watching emotion-eliciting stimuli. Therefore, the converted class distribution centers on the original class and has non-zero and zero probabilities at its nearest and opposite classes, respectively.
In SEED-IV, there are four classes: neutral, sad, fear, and happy with corresponding class indices 0, 1, 2, and 3, respectively. We can convert as follows:
The intuition behind this conversion is based on the distances between the four emotions on the valence-arousal plane. Specifically, in the self-reported ratings , neutral, sad, fear, and happy movie ratings cluster in the zero valence zero arousal, negative valence negative arousal, negative valence positive arousal, and positive valence positive arousal regions, respectively. Thus, we assume that participants are likely to generate emotions that have similar ratings in either valence or arousal dimensions, e.g., both angry and happy have high arousal, but unlikely to generate emotions that are far away in both dimensions, e.g., sad and happy are different in both valence and arousal.
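One way to realize the kind of conversion described for both datasets is sketched below. It is not the paper's exact conversion table: the sketch assumes that a noise mass `epsilon` is split equally among the classes considered plausible confusions, and the neighbor sets (including treating neutral as close to every other class in SEED-IV) are assumptions read off the description above.

```python
# Plausible confusions per class; these neighbor sets are assumptions derived
# from the description (opposite classes are excluded), not the paper's tables.
SEED_NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}        # negative, neutral, positive
SEED_IV_NEIGHBORS = {0: [1, 2, 3], 1: [0, 2],       # neutral, sad,
                     2: [0, 1, 3], 3: [0, 2]}       # fear, happy

def to_distribution(label, neighbors, epsilon):
    """Keep 1 - epsilon on the given label and split epsilon over its neighbors."""
    dist = [0.0] * len(neighbors)
    dist[label] = 1.0 - epsilon
    for other in neighbors[label]:
        dist[other] = epsilon / len(neighbors[label])
    return dist

seed_negative = to_distribution(0, SEED_NEIGHBORS, epsilon=0.2)
seed_iv_sad = to_distribution(1, SEED_IV_NEIGHBORS, epsilon=0.2)
```

In this sketch a negative SEED label places no mass on the positive class, and a sad SEED-IV label places no mass on happy, reflecting the assumption that opposite emotions are unlikely to be elicited.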
where denotes the output probability distribution computed via (12). Note that our EmotionDL is different from label smoothing, which simply adds uniform noise to other classes.
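The contrast with label smoothing can be made concrete with a small numeric sketch. It assumes a Kullback-Leibler divergence as the distribution-matching loss and a smoothing variant that spreads the noise mass uniformly over the other classes; both choices, the example numbers and the function names are assumptions for illustration.

```python
import math

def kl_divergence(target, predicted):
    """KL(target || predicted), skipping classes with zero target mass."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0)

def label_smoothing(label, num_classes, epsilon):
    """Uniform smoothing: every other class receives the same share of epsilon."""
    dist = [epsilon / (num_classes - 1)] * num_classes
    dist[label] = 1.0 - epsilon
    return dist

emotion_aware = [0.8, 0.2, 0.0]            # negative label, no mass on positive
smoothed = label_smoothing(0, 3, 0.2)      # mass on both other classes
predicted = [0.7, 0.25, 0.05]              # hypothetical model output
loss_aware = kl_divergence(emotion_aware, predicted)
loss_smooth = kl_divergence(smoothed, predicted)
```

The emotion-aware target never rewards probability placed on the opposite class, whereas the uniformly smoothed target does, which is the difference the paragraph above points to.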
4.2.3 Optimization of RGNN
Combining both NodeDAT and EmotionDL, the overall loss function of RGNN is computed as follows:
The detailed algorithm for training RGNN is presented in Algorithm 1.
5 Experimental Settings
In this section, we present the datasets, classification settings and model settings in our experiments.
5.1 Datasets
We use both SEED and SEED-IV datasets in our experiments. The SEED dataset comprises EEG data of 15 subjects (7 males) recorded in 62 channels using the ESI NeuroScan System (https://compumedicsneuroscan.com/). The EEG data were collected while participants watched emotion-eliciting movies in three types of emotions, namely negative, neutral and positive. Each movie lasts around 4 minutes. There are three sessions of data collected and each session comprises 15 trials/movies for each subject. To make a fair comparison with existing studies, we directly use the pre-computed differential entropy (DE) features smoothed by linear dynamic systems (LDS) [54, 72] in SEED. DE extends the idea of Shannon entropy and measures the complexity of a continuous random variable. For a fixed-length EEG segment, DE features are computed as the logarithm energy spectrum in a certain frequency band. In SEED, DE features are pre-computed over five frequency bands (delta, theta, alpha, beta and gamma) for each second of EEG signals (without overlapping) in each channel.
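As a rough illustration of why DE amounts to a logarithm of band energy, the sketch below computes the differential entropy of a segment under the common assumption that the band-passed signal is Gaussian, in which case it equals 0.5 * log(2 * pi * e * variance). The segment values and the function name are hypothetical; the datasets ship pre-computed features, so this is not the pipeline used in the experiments.

```python
import math

def differential_entropy(samples):
    """Differential entropy of a segment under a Gaussian assumption."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

# Hypothetical band-passed one-second segment (arbitrary units).
segment = [0.5, -1.2, 0.9, -0.3, 1.1, -0.8, 0.2, -0.4]
de = differential_entropy(segment)
de_scaled = differential_entropy([2.0 * s for s in segment])
```

Doubling the amplitude multiplies the variance by four and therefore raises the feature by exactly log(2), which shows the logarithmic dependence on signal energy.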
The SEED-IV dataset comprises EEG data of 15 subjects (7 males) recorded in 62 channels (SEED-IV also contains eye movement data, which we do not use in our experiments). The recording device is the same as the one used in SEED. The EEG data were collected while participants watched emotion-eliciting movies in four types of emotions, namely, neutral, sad, fear, and happy. Each movie lasts around 2 minutes. There are three sessions of data collected and each session comprises 24 trials/movies for each subject. Similar to SEED, we adopt the pre-computed DE features from SEED-IV.
5.2 Classification Settings
We conduct both subject-dependent and subject-independent classifications on both SEED and SEED-IV to evaluate our model.
5.2.1 Subject-Dependent Classification
For SEED, we follow the experimental settings in [72, 56, 38] to evaluate our RGNN model using subject-dependent classification, i.e., we evaluate our model for individual subjects. Specifically, for each subject, we train our model using the first 9 trials as the training set and the remaining 6 trials as the testing set. We evaluate the model performance by using the accuracy averaged across all subjects over two sessions of EEG data in SEED . For SEED-IV, we follow the experimental settings in [71, 37] to evaluate our RGNN model using subject-dependent classification. Specifically, for each subject, the first 16 trials are used for training and the remaining 8 trials containing all emotions (each emotion with two trials) are used for testing. We evaluate our model using data from all three sessions.
5.2.2 Subject-Independent Classification
For SEED, we follow the experimental settings in [73, 56, 38] to evaluate our RGNN model using subject-independent classification. Specifically, we adopt leave-one-subject-out cross-validation, i.e., during each fold, we train our model on 14 subjects and test on the remaining subject. We evaluate the model performance using the accuracy averaged across all test subjects over one session of EEG data in SEED. For SEED-IV, we follow the experimental settings in prior work to evaluate our RGNN model using subject-independent classification. We evaluate our model using data from all three sessions.
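The leave-one-subject-out protocol described above can be sketched as a simple fold generator. The function name and the use of integer subject identifiers are assumptions for illustration.

```python
def leave_one_subject_out(subjects):
    """Yield (train_subjects, test_subject) pairs, one fold per subject."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

subjects = list(range(1, 16))          # 15 subjects, as in SEED and SEED-IV
folds = list(leave_one_subject_out(subjects))
```

Each of the 15 folds trains on 14 subjects and tests on the one held out, so every subject serves as the unseen test subject exactly once.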
| Model | SEED: delta band | SEED: theta band | SEED: alpha band | SEED: beta band | SEED: gamma band | SEED: all bands | SEED-IV: all bands |
|---|---|---|---|---|---|---|---|
| BiHDM (SOTA) | - | - | - | - | - | 93.12/06.06 | 74.35/14.09 |
| RGNN (Our model) | 76.17/07.91 | 72.26/07.25 | 75.33/08.85 | 84.25/12.54 | 89.23/08.90 | 94.24/05.95 | 79.37/10.54 |
Table I: Subject-dependent classification accuracy (mean/standard deviation, in %) on SEED and SEED-IV
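| Model | SEED: delta band | SEED: theta band | SEED: alpha band | SEED: beta band | SEED: gamma band | SEED: all bands | SEED-IV: all bands |
|---|---|---|---|---|---|---|---|
| BiHDM (SOTA) | - | - | - | - | - | 85.40/07.53 | 69.03/08.66 |
| RGNN (Our model) | 64.88/06.87 | 60.69/05.79 | 60.84/07.57 | 74.96/08.94 | 77.50/08.10 | 85.30/06.72 | 73.84/08.02 |
Table II: Subject-independent classification accuracy (mean/standard deviation, in %) on SEED and SEED-IV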
5.3 Model Settings in RGNN
For our RGNN in all experiments, we empirically fix the number of convolutional layers, the dropout rate at the output fully-connected layer, and the batch size. We use Adam optimization with its default values, i.e., $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We only tune the output feature dimension, label noise level, learning rate, L1 regularization factor, and L2 regularization for each experiment. Note that we only adopt NodeDAT in subject-independent classification experiments. We compare our model with several baselines, whose results are cited from published work [56, 38, 69, 37].
6 Performance Evaluations
In this section we present model evaluation results in both subject-dependent and subject-independent classification settings on both datasets. We also investigate critical frequency bands and confusion matrix of our model.
6.1 Subject-Dependent Classification
Table I presents the subject-dependent classification accuracy (mean/standard deviation) of our RGNN model and all baselines on both SEED and SEED-IV using the pre-computed DE features. The performance on SEED using DE features in the individual delta, theta, alpha, beta, and gamma bands is reported as well. Our model achieves superior performance on both datasets compared to all baselines, including the state-of-the-art BiHDM, when DE features from all frequency bands are used. Notably, our model improves the accuracy of the state-of-the-art model on SEED-IV by around 5%. In particular, our model performs better than DGCNN, which is another GNN-based model that leverages the topological structure in EEG signals. Besides the proposed two regularizers (see Table III), the main performance improvement can be attributed to two factors: 1) our adjacency matrix incorporates the global inter-channel asymmetry relation between the left and right hemispheres; and 2) our model is less prone to overfitting because it extends SGC, which is much simpler than the ChebNet used in DGCNN.
6.2 Subject-Independent Classification
Similar to Table I, Table II presents the subject-independent classification results. When using features from all frequency bands, our model performs marginally worse than BiHDM on SEED but much better than BiHDM on SEED-IV (nearly 5% improvement). In addition, our model achieves the lowest standard deviation in accuracy compared to all baselines on both datasets, demonstrating the robustness of our model.
Comparing Tables I and II, we find that the accuracy obtained in subject-independent settings is consistently worse than the accuracy obtained in subject-dependent settings by around 5% to 30% for every model. This finding is unsurprising because the variability of EEG signals across subjects makes subject-independent classification more challenging. Interestingly, however, the performance gap between these two settings has gradually decreased from around 27% on SEED and 19% on SEED-IV using SVM to around 9% on SEED and 6% on SEED-IV using our model. One possible reason for the diminishing gap is that recent deep learning models in subject-independent settings are becoming better at leveraging a larger amount of data and learning more subject-invariant EEG representations. This observation seems to indicate that transfer learning may be a necessary tool for emotion recognition in cross-subject settings. With the increasing amount of data available from different subjects and a proper transfer learning tool, it would not be surprising if subject-independent classification accuracy surpassed subject-dependent classification accuracy in the future.
6.3 Performance Comparison of Frequency Bands
We further compare the performance of our model and all baselines using features from different frequency bands, as reported in Tables I and II. In subject-dependent experiments on SEED, STRNN achieves the highest accuracy in delta, theta and alpha bands, BiDANN performs best in beta band, and our model performs best in gamma band. In subject-independent experiments on SEED, BiDANN-S achieves the highest accuracy in theta and alpha bands, and our model performs best in delta, beta and gamma bands.
We investigate the critical frequency bands for emotion recognition. For both subject-dependent and subject-independent settings on SEED, we compare the performance of each model across different frequency bands. In general, most models, including ours, achieve better performance on the beta and gamma bands than on the delta, theta and alpha bands, with the exception of STRNN, which performs worst on the gamma band. This observation is consistent with the literature [47, 72]. One subtle difference between our model and other models is that our model performs consistently better in the gamma band than in the beta band, whereas other models perform comparably in both bands, indicating that the gamma band may be the most discriminative band for our model.
6.4 Confusion Matrix
We present the confusion matrix of our model in Fig. 3. For both subject-dependent and subject-independent settings on SEED, our model recognizes positive and neutral emotions better than negative emotion. When training data from other subjects are combined (see Fig. 3 (a) and (b)), our model becomes much worse at detecting negative emotion, indicating that participants are likely to generate distinct EEG patterns when experiencing negative emotion. A similar phenomenon is observed in SEED-IV for sad emotion as well (see Fig. 3 (c) and (d)). For SEED-IV, our model performs significantly better on sad emotion than on all other emotions in both classification settings. We notice that fear is the only emotion that is recognized better in subject-independent classification than in subject-dependent classification. This finding indicates that participants watching horror movies may generate similar EEG patterns.
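The per-emotion comparison above reads off the diagonal of a row-normalized confusion matrix. The following is a minimal sketch of that computation; the label encoding (0 = negative, 1 = neutral, 2 = positive) and the sample predictions are hypothetical and do not reproduce Fig. 3.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def row_normalize(cm):
    """Divide each row by its total so the diagonal holds per-class recall."""
    row_sums = cm.sum(axis=1, keepdims=True)
    out = np.zeros(cm.shape, dtype=float)
    return np.divide(cm, row_sums, out=out, where=row_sums > 0)

# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
per_class_recall = np.diag(row_normalize(cm))
print(cm)
print(per_class_recall)
```

In this toy example the negative class has the lowest recall, which is the kind of pattern the text describes for SEED.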
7 Model Analysis on RGNN
In this section, we conduct an ablation study and a sensitivity analysis for our model.
7.1 Ablation Study
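Table III: Ablation results in the subject-independent setting (accuracy mean/standard deviation).

| Model variant | SEED | SEED-IV |
|---|---|---|
| - global connection | 82.42/08.24 | 71.13/08.78 |
| - symmetric adjacency matrix | 83.69/07.92 | 72.02/08.66 |
| - NodeDAT + DAT | 83.51/08.11 | 72.40/08.54 |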
We conduct an ablation study to investigate the contribution of each key component in our model. Table III reports the results obtained in the subject-independent setting on both datasets. The two major designs in our adjacency matrix, i.e., the global connection and the symmetric adjacency matrix, are helpful in recognizing emotions. The global connection models the asymmetric difference between neuronal activities in the left and right hemispheres, which has been shown to reveal certain emotions [17, 52, 70]. The symmetric adjacency matrix design is mainly motivated by the need to reduce the number of model parameters and prevent overfitting, especially in subject-dependent classification, where less training data is available.
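To illustrate why a symmetric adjacency matrix reduces the parameter count, the sketch below stores only the lower-triangular entries (including the diagonal) and mirrors them. The function name `build_symmetric_adjacency` is hypothetical, the random values stand in for learned weights, and the 62-channel count is an assumption about the SEED montage that is not stated in this section.

```python
import numpy as np

def build_symmetric_adjacency(tril_values, n):
    """Expand n*(n+1)/2 free parameters into a symmetric n x n matrix."""
    A = np.zeros((n, n))
    rows, cols = np.tril_indices(n)
    A[rows, cols] = tril_values
    # Mirror the lower triangle; subtract the diagonal so it is not doubled.
    return A + A.T - np.diag(np.diag(A))

n_channels = 62                               # assumption: 62 EEG channels
n_free = n_channels * (n_channels + 1) // 2   # parameters with symmetry
n_full = n_channels ** 2                      # parameters without symmetry

rng = np.random.default_rng(0)
A = build_symmetric_adjacency(rng.normal(size=n_free), n_channels)
print(n_free, n_full)
```

Under this assumption, symmetry roughly halves the number of adjacency parameters (1953 instead of 3844).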
Our NodeDAT regularizer has a noticeable positive impact on the performance of our model, which demonstrates that domain adaptation is significantly helpful in cross-subject classification. To further investigate the impact of our node-level domain classifier, we experimented with replacing NodeDAT with a generic domain classifier (DAT) that operates after the pooling operation, i.e., (-NodeDAT + DAT) in Table III. The clear performance gap between (-NodeDAT + DAT) and our RGNN model indicates that NodeDAT regularizes the model better by learning subject-invariant representations at the node level rather than at the graph level. In addition, if NodeDAT is removed, the performance of our model has a greater variance, demonstrating the importance of NodeDAT in improving the robustness of our model against cross-subject variations.
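The difference between a node-level and a graph-level domain classifier can be sketched in terms of where the classifier is applied. The snippet below is only a shape-level illustration: all sizes and weights are hypothetical, sum pooling is an assumption, and the adversarial training procedure (e.g., gradient reversal) is omitted entirely.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_nodes, feat_dim, n_domains = 62, 8, 2      # hypothetical sizes

Z = rng.normal(size=(n_nodes, feat_dim))     # per-channel (node) representations
W = rng.normal(size=(feat_dim, n_domains))   # hypothetical domain-classifier weights

# Node level: one domain prediction per channel, before pooling.
node_level_probs = softmax(Z @ W)

# Graph level: a single domain prediction after pooling (sum pooling assumed).
graph_level_probs = softmax(Z.sum(axis=0) @ W)

print(node_level_probs.shape, graph_level_probs.shape)
```

A node-level classifier therefore provides one domain signal per channel, whereas a graph-level classifier provides a single signal for the whole graph.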
Our EmotionDL regularizer improves the performance of our model by around 3% in accuracy on both datasets. This performance gain supports our assumption that participants do not always generate the intended emotions when watching emotion-eliciting stimuli. In addition, EmotionDL can be easily adopted by other deep learning models.
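The general idea of learning from a label distribution instead of a one-hot label can be sketched as follows. This is not the paper's formulation in (15) and (16): the uniform smoothing scheme, the function names, and the parameter `eps` are assumptions chosen only to show how a soft target and a KL-divergence loss fit together.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps):
    """Turn integer labels into soft distributions (uniform smoothing, assumed)."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def kl_divergence(target, pred, tiny=1e-12):
    """Mean KL(target || pred) over a batch of probability distributions."""
    per_sample = np.sum(target * (np.log(target + tiny) - np.log(pred + tiny)), axis=-1)
    return float(per_sample.mean())

labels = np.array([0, 2])                      # hypothetical class indices
soft = smooth_labels(labels, num_classes=3, eps=0.2)
uniform_pred = np.full((2, 3), 1.0 / 3.0)      # an uninformative prediction

print(soft)
print(kl_divergence(soft, uniform_pred))
```

With a noise level of zero the targets reduce to one-hot labels, and larger values spread more probability mass onto the other classes.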
7.2 Sensitivity Analysis
We analyze the performance of our model across varying values of the L1 sparsity coefficient (see (11)) and the noise coefficient in EmotionDL (see (15) and (16)), as illustrated in Fig. 4. For subject-dependent classification, increasing the L1 sparsity coefficient from 0 to 0.1 generally improves model performance. However, for subject-independent classification, increasing it beyond a certain threshold, i.e., 0.01 in Fig. 4(a), decreases model performance. One possible explanation for this difference in behavior is that there is much less training data in subject-dependent classification, which requires stronger regularization to reduce overfitting, whereas in subject-independent classification, where the amount of training data is less of a concern, stronger regularization may introduce bias and hinder learning.
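How an L1 sparsity coefficient enters the objective can be sketched as a penalty on the adjacency matrix added to the task loss. The exact objective is given in (11), which is not reproduced here; the function name and the argument `alpha` are hypothetical stand-ins for the coefficient discussed above.

```python
import numpy as np

def regularized_loss(task_loss, A, alpha):
    """Task loss plus an L1 penalty on the adjacency matrix entries."""
    return task_loss + alpha * np.abs(A).sum()

# Hypothetical 2 x 2 adjacency matrix and task loss, for illustration only.
A_example = np.array([[1.0, -2.0],
                      [0.0, 3.0]])
loss_weak = regularized_loss(0.5, A_example, alpha=0.01)
loss_strong = regularized_loss(0.5, A_example, alpha=0.1)
print(loss_weak, loss_strong)
```

A larger coefficient puts more weight on driving adjacency entries toward zero, which is the trade-off between overfitting and bias described in the text.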
As illustrated in Fig. 4(b), our model behaves consistently across different experimental settings as the noise coefficient varies. Specifically, as the noise coefficient increases, the performance of our model first increases and then decreases. In particular, our model usually performs best when the noise coefficient is set to 0.2, demonstrating the existence of label noise and the necessity of addressing it on both datasets. Introducing excessive noise in EmotionDL causes a performance drop, which is expected because excessive noise weakens the true learning signal.
8 Neuronal Activity Analysis for Emotion Recognition
In this section we analyze and identify important neuronal activities for emotion recognition.
8.1 Activation Maps of Channels
Fig. 5 shows the heatmap of the diagonal elements in our learned adjacency matrix. Conceptually, as shown in (4), the diagonal values of the adjacency matrix represent the contribution of each channel in computing the final EEG representation. It is clear from Fig. 5 that there are strong activations in the pre-frontal, parietal, and occipital regions, indicating that these regions may be strongly related to emotion processing in the brain. Our finding is consistent with existing studies, which observed that asymmetrical frontal and parietal EEG activity may reflect changes in both valence and arousal [52, 40]. The synchronization between frontal and occipital regions has also been reported to be related to positive and fear emotions [14, 42]. The symmetric pattern on the activation map of channels indicates again that the asymmetry in EEG activity between the left and right hemispheres is critical for emotion recognition.
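Reading channel contributions off the diagonal of an adjacency matrix can be sketched as below. The function name, the three-channel matrix, and its values are hypothetical, and ranking by raw diagonal value (rather than magnitude) is an assumption.

```python
import numpy as np

def rank_channels_by_self_weight(A, channel_names):
    """Sort channels by the diagonal entry of A, largest first."""
    diag = np.diag(A)
    order = np.argsort(-diag)
    return [(channel_names[i], float(diag[i])) for i in order]

# Hypothetical 3-channel adjacency matrix, for illustration only.
names = ["FP1", "FZ", "OZ"]
A_small = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.2, 0.3],
                    [0.0, 0.3, 0.5]])
ranking = rank_channels_by_self_weight(A_small, names)
print(ranking)
```

Plotting the same diagonal values on a scalp layout would give a heatmap of the kind shown in Fig. 5.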
8.2 Inter-channel Relations
Fig. 6 shows the top 10 connections between channels having the largest edge weights in our adjacency matrix. Note that all global connections remain among the strongest connections after the adjacency matrix is learned, demonstrating again that global inter-channel relations are essential for emotion recognition. Fig. 6 shows both similarities and differences between the two plots, indicating that our initialization strategy presented in (9) can capture local inter-channel relations to a certain degree. One notable difference between the two plots is that a few strong connections are gone in Fig. 6(a), e.g., (POZ, PO3), (PO6, PO8), and (P3, P5), indicating that these connections may not be critical for emotion recognition. In addition, it is clear from Fig. 6(b) that the connection between the channel pair (FP1, AF3) is the strongest, followed by (F6, F8), (FP2, AF4), and (PO8, CB2), indicating that local inter-channel relations in the frontal region may be important for emotion recognition.
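Selecting the strongest connections from a symmetric adjacency matrix can be sketched as follows. The function name and the four-channel matrix are hypothetical, and ranking by absolute edge weight is an assumption, since the text only says "largest edge weights".

```python
import numpy as np

def top_k_edges(A, channel_names, k=10):
    """Return the k off-diagonal channel pairs with the largest |weight|."""
    rows, cols = np.triu_indices(A.shape[0], k=1)   # each undirected pair once
    weights = A[rows, cols]
    order = np.argsort(-np.abs(weights))[:k]
    return [(channel_names[rows[i]], channel_names[cols[i]], float(weights[i]))
            for i in order]

# Hypothetical symmetric 4-channel matrix, for illustration only.
names4 = ["FP1", "AF3", "F6", "F8"]
A4 = np.full((4, 4), 0.1)
for i, j, w in [(0, 1, 0.9), (2, 3, 0.7), (0, 2, -0.8)]:
    A4[i, j] = A4[j, i] = w

edges = top_k_edges(A4, names4, k=3)
print(edges)
```

Using only the upper triangle avoids listing each undirected connection twice and excludes self-connections on the diagonal.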
9 Conclusion
In this paper, we propose a regularized graph neural network for emotion recognition based on EEG signals. Our model is biologically supported to capture both local and global inter-channel relations. In addition, we propose two regularizers, namely NodeDAT and EmotionDL, to improve the robustness of our model against cross-subject EEG variations and noisy labels. We evaluate our model in both subject-dependent and subject-independent classification settings on two public datasets, SEED and SEED-IV. Our model obtains better performance than competitive baselines such as SVM, DBN, DGCNN, BiDANN, and the state-of-the-art BiHDM in most classification settings. Notably, our model achieves accuracies of 79.37% and 73.84% in subject-dependent and subject-independent classification on SEED-IV, respectively, outperforming the current state-of-the-art model by around 5%. Our model analysis demonstrates that our proposed biologically supported adjacency matrix and two regularizers contribute consistent and significant gains to the performance of our model. Investigations of the neuronal activities reveal that the pre-frontal, parietal and occipital regions may be the most informative regions in emotion recognition. In addition, global inter-channel relations between the left and right hemispheres are important, and local inter-channel relations between (FP1, AF3), (F6, F8) and (FP2, AF4) may also provide useful information.
In the future, we plan to investigate how to apply our model to EEG signals that have a smaller number of channels. A simpler version of our model may be necessary to avoid overfitting on these datasets. In addition, how to incorporate global connections on these smaller graphs may be worth exploring.
References
-  (2007) Efficiency and cost of economical brain functional networks. PLoS Computational Biology 3 (2), pp. e17. Cited by: §4.1.
-  (2015) Computer-aided diagnosis of depression using EEG signals. European neurology 73 (5-6), pp. 329–336. Cited by: §1.
-  (2002) Comparison of wavelet transform and FFT methods in the analysis of EEG signals. Journal of Medical Systems 26 (3), pp. 241–247. Cited by: §2.1.
-  (2017) Emotions recognition using EEG signals: a survey. IEEE Transactions on Affective Computing. Cited by: §1.
-  (2015) Learning representations from EEG with deep recurrent-convolutional neural networks. arXiv preprint arXiv:1511.06448. Cited by: §2.1.
-  (2017) One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems, pp. 752–762. Cited by: §2.3.
-  (2011) Support vector machines with the ramp loss and the hard margin loss. Operations Research 59 (2), pp. 467–479. Cited by: §2.4.
-  (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §2.2.
-  (2012) The economy of brain network organization. Nature Reviews Neuroscience 13 (5), pp. 336. Cited by: §1, §4.1.
-  (2017) Autodial: automatic domain alignment layers. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5077–5085. Cited by: §2.3.
-  (2017) A fast, efficient domain adaptation technique for cross-domain electroencephalography (EEG)-based emotion recognition. Sensors 17 (5), pp. 1014. Cited by: §1.
-  (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §2.2, §3.2.
-  (2006) Large scale transductive svms. Journal of Machine Learning Research 7, pp. 1687–1712. Cited by: TABLE II.
-  (2006) EEG phase synchronization during emotional response to positive and negative film stimuli. Neuroscience Letters 406 (3), pp. 159–164. Cited by: §8.1.
-  (2013) Imaging human connectomes at the macroscale. Nature methods 10 (6), pp. 524. Cited by: §1.
-  (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §2.2, §3.2, §6.1.
-  (1976) Differing emotional response from right and left hemispheres. Nature 261 (5562), pp. 690. Cited by: §4.1, §7.1.
-  (1997) Universal facial expressions of emotion. Segerstrale U, P. Molnar P, eds. Nonverbal communication: Where nature meets culture, pp. 27–46. Cited by: §1.
-  (1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5 (1), pp. 17–60. Cited by: §1.
-  (2013) Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2960–2967. Cited by: TABLE II.
-  (2013) Graph analysis of the human connectome: promise, progress, and pitfalls. Neuroimage 80, pp. 426–444. Cited by: §1.
-  (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1, §2.3, §4.2.1, §4.2.1, §4.2.1, §7.1.
-  (2017) Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing 26 (6), pp. 2825–2838. Cited by: §2.4.
-  (2016) Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 597–613. Cited by: §2.3.
-  (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §2.3.
-  (2007) Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pp. 601–608. Cited by: §2.3.
-  (2014) Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing 5 (3), pp. 327–339. Cited by: §2.1.
-  (2018) Deep physiological affect network for the recognition of human emotions. IEEE Transactions on Affective Computing. Cited by: §2.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.3.
-  (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §3.1, §3.2.
-  (1951) On information and sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86. Cited by: §4.2.2.
-  (2018) Domain adaptation techniques for EEG-based emotion recognition: a comparative study on two public datasets. IEEE Transactions on Cognitive and Developmental Systems 11 (1), pp. 85–94. Cited by: §1.
-  (2018) Cross-subject emotion recognition using deep adaptation networks. In Proceedings of the International Conference on Neural Information Processing, pp. 403–413. Cited by: TABLE II.
-  (2018) Hierarchical convolutional neural networks for EEG-based emotion recognition. Cognitive Computation, pp. 1–13. Cited by: §2.1.
-  (2019) EEG based emotion recognition by combining functional connectivity network and local activations. IEEE Transactions on Biomedical Engineering. Cited by: §2.1.
-  (2016) Emotion recognition from multi-channel EEG data through convolutional recurrent neural network. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 352–359. Cited by: §2.1.
-  (2019) A novel bi-hemispheric discrepancy model for EEG emotion recognition. arXiv preprint arXiv:1906.01704. Cited by: §1, §2.1, §5.2.1, §5.2.2, §5.3, TABLE I, TABLE II.
-  (2018) A bi-hemisphere domain adversarial neural network model for EEG emotion recognition. IEEE Transactions on Affective Computing. Cited by: §1, §1, §1, §5.2.1, §5.2.2, §5.3, TABLE I, TABLE II.
-  (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.2.
-  (2010) EEG-based emotion recognition in music listening. IEEE Transactions on Biomedical Engineering 57 (7), pp. 1798–1806. Cited by: §2.1, §8.1.
-  (2013) Real-time fractal-based valence level recognition from EEG. In Transactions on Computational Science XVIII, pp. 101–120. Cited by: §2.1.
-  (2016) Timing of emotion representation in right and left occipital region: evidence from combined tms-EEG. Brain and Cognition 106, pp. 13–22. Cited by: §8.1.
-  (1996) Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Current Psychology 14 (4), pp. 261–292. Cited by: §1.
-  (2010) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22 (2), pp. 199–210. Cited by: TABLE II.
-  (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. Cited by: §1.
-  (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §2.4.
-  (1985) EEG alpha activity reflects attentional demands, and beta activity reflects emotional and cognitive processes. Science 228 (4700), pp. 750–752. Cited by: §6.3.
-  (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: §2.4.
-  (2005) Neurophysiological architecture of functional magnetic resonance images of human brain. Cerebral cortex 15 (9), pp. 1332–1342. Cited by: §4.1.
-  (2014) We are not all equal: personalizing models for facial expression analysis with transductive parameter transfer. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 357–366. Cited by: TABLE II.
-  (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.2.
-  (2001) Frontal brain electrical activity (EEG) distinguishes valence and intensity of musical emotions. Cognition & Emotion 15 (4), pp. 487–500. Cited by: §1, §1, §1, §4.1, §7.1, §8.1.
-  (2013) Differential entropy feature for EEG-based vigilance estimation. In 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 6627–6630. Cited by: §1, §2.1, §5.1.
-  (2010) Off-line and on-line vigilance estimation based on linear dynamical system and manifold learning. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 6587–6590. Cited by: §5.1.
-  (2019) MPED: a multi-modal physiological emotion database for discrete emotion recognition. IEEE Access 7, pp. 12177–12191. Cited by: TABLE I, TABLE II.
-  (2018) EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing. Cited by: §1, §1, §2.1, §5.2.1, §5.2.2, §5.3, TABLE I, TABLE II.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §5.3.
-  (2014) Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080. Cited by: §2.4.
-  (2003) Remarks on emotion recognition from multi-modal bio-potential signals. In SMC’03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme-System Security and Assurance (Cat. No. 03CH37483), Vol. 2, pp. 1654–1659. Cited by: §2.1.
-  (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: §2.4.
-  (2017) EEG-based emotion recognition via fast and robust feature smoothing. In International Conference on Brain Informatics, pp. 83–92. Cited by: §2.1.
-  (2015) Learning with symmetric label noise: the importance of being unhinged. In Advances in Neural Information Processing Systems, pp. 10–18. Cited by: §2.4.
-  (2014) Emotional state classification from EEG data using machine learning approach. Neurocomputing 129, pp. 94–106. Cited by: §1.
-  (2019) Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 6861–6871. Cited by: §1, §2.2, §2.2, §3.1, §3, §4.2.
-  (2019) Identifying functional brain connectivity patterns for EEG-based emotion recognition. In 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 235–238. Cited by: §2.1.
-  (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.2.
-  (2019) How powerful are graph neural networks?. In International Conference on Learning Representations (ICLR), Cited by: §4.2.
-  (2018) Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1.
-  (2018) Spatial-temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics (99), pp. 1–9. Cited by: §1, §2.1, §5.3, TABLE I.
-  (2018) Frontal EEG asymmetry and middle line power difference in discrete emotions. Frontiers in Behavioral Neuroscience 12. Cited by: §4.1, §7.1.
-  (2018) Emotionmeter: a multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics (99), pp. 1–13. Cited by: item 3, §4.1, §4.2.2, §5.1, §5.2.1, TABLE I.
-  (2015) Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development 7 (3), pp. 162–175. Cited by: item 3, §1, §2.1, §4.1, §5.1, §5.2.1, TABLE I, §6.3.
-  (2016) Personalizing EEG-based affective models with transfer learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2732–2738. Cited by: §1, §5.2.2.
-  (2014) EEG-based emotion classification using deep belief networks. In 2014 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §2.1.
-  (2016) Multichannel EEG-based emotion recognition via group sparse canonical correlation analysis. IEEE Transactions on Cognitive and Developmental Systems 9 (3), pp. 281–290. Cited by: TABLE I.
-  (2004) Class noise vs. attribute noise: a quantitative study. Artificial intelligence review 22 (3), pp. 177–210. Cited by: §1.