Emotion strongly influences our daily activities, such as interactions between people, decision making, learning, and working. To endow computers with the ability to perceive, understand, and regulate emotion, Picard et al. developed the concept of affective computing, which aims to study and develop systems and devices that can recognize, interpret, process, and simulate human affects [38, 37]. Human emotion recognition is a current hotspot in affective computing research. Since emotion recognition is critical for applications such as affective brain-computer interaction, emotion regulation, and the diagnosis of emotion-related diseases, it is necessary to build a reliable and accurate model for recognizing human emotions.
Traditional emotion recognition systems are built with speech signals, facial expressions, and non-physiological signals. However, in addition to clues from external appearances, emotions contain reactions from the central and peripheral nervous systems. Moreover, an obvious drawback of using behavioral modalities for emotion recognition is the uncertainty that arises in the case of individuals who either consciously regulate their emotional manifestations or are naturally suppressive. In contrast, EEG-based emotion recognition has been shown to be a reliable method because of its high recognition accuracy, objective evaluation, and stable neural patterns [62, 63, 57, 58].
For the above reasons, researchers have tended to study emotions through physiological signals in recent years. These signals are more accurate and are difficult for users to change deliberately. Lin and colleagues evaluated music-induced emotion recognition with EEG signals and attempted to use as few electrodes as possible. Wang and colleagues used EEG signals to classify positive and negative emotions and compared different EEG features and classifiers. Kim and André showed that electromyogram, electrocardiogram, skin conductivity, and respiration changes were reliable signals for emotion recognition. Võ et al. studied the relationship between emotions and eye movement features, and they found that pupil diameters were influenced by both emotion and age.
Emotions are complex cognitive processes that involve subjective experience, expressive behaviors, and psychophysiological changes. Due to the rich characteristics of human emotions, it is difficult for single-modality signals to describe emotions comprehensively. Therefore, recognizing emotions with multiple modalities has become a promising method for building emotion recognition systems with high accuracy [39, 64, 48, 47, 32, 46]. Multimodal data can reflect emotional changes from multiple perspectives, which is conducive to building a reliable and accurate emotion recognition model.
Multimodal fusion is one of the key aspects in taking full advantage of multimodal signals. In the past few years, researchers have utilized various methods to fuse different modalities. Lu and colleagues employed feature-level concatenation, MAX fusion, SUM fusion, and fuzzy integral fusion to merge EEG and eye movement features, and they found that EEG and eye movement features have complementary properties in emotion recognition tasks. Koelstra and colleagues evaluated the feature-level concatenation of EEG features and peripheral physiological features, and they found that participant ratings and EEG frequencies were significantly correlated and that decision fusion achieved the best emotion recognition results. Sun et al. built a hierarchical classifier by combining both feature-level and decision-level fusion for emotion recognition tasks in the wild. The method was evaluated on several datasets and achieved promising results on the validation and test sets.
Currently, with the rapid development of deep learning, researchers are applying deep learning models to fuse multiple modalities. Deep-learning-based multimodal representation frameworks can be classified into two categories: multimodal joint representation and multimodal coordinated representation. Briefly, the multimodal joint representation framework takes all the modalities as input, and each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space. The multimodal coordinated representation framework, instead of projecting the modalities together into a joint space, learns separate representations for each modality and coordinates them into a hyperspace with constraints between different modalities. Various multimodal joint representation frameworks have been applied to emotion recognition in recent years [30, 52, 28, 59]. However, the multimodal coordinated representation framework has not yet been fully studied.
In this paper, we introduce a coordinated representation model named Deep Canonical Correlation Analysis (DCCA) [1, 40] to multimodal emotion recognition. The basic idea behind DCCA is to learn separate but coordinated representations for each modality under canonical correlation analysis (CCA) constraints. Since the coordinated representations are of the same dimension, they lie in a common space, which we refer to as the coordinated hyperspace.
Compared with our previous work, the main contributions of this paper on multimodal emotion recognition can be summarized as follows:
1) We introduce DCCA to multimodal emotion recognition and evaluate the effectiveness of DCCA on five benchmark datasets: the SEED, SEED-IV, SEED-V, DEAP, and DREAMER datasets. Our experimental results on these five datasets reveal that different emotions are disentangled in the coordinated hyperspace, and the transformation process of DCCA preserves emotion-related information and discards unrelated information.
2) We examine the robustness of DCCA and the existing methods on the SEED-V dataset under different levels of noise. The experimental results show that DCCA has higher robustness than the existing methods under most noise conditions.
3) By adjusting the weights of different modalities, DCCA allows users to fuse different modalities with greater flexibility such that various modalities contribute differently to the fused features.
The remainder of this paper is organized as follows. Section II discusses related work. In Section III, we introduce the algorithms for the canonical correlation analysis, DCCA, the baseline models utilized in this paper, and the mutual information neural estimation (MINE) algorithm. The experimental settings are reported in Section IV. Section V presents and analyzes the experimental results. Finally, conclusions are given in Section VI.
II Related Work
One of the key problems in multimodal deep learning is how to fuse data from different modalities. Multimodal fusion has gained increasing attention from researchers in diverse fields due to its potential for innumerable applications such as emotion recognition, event detection, image segmentation, and video classification [24, 5]. According to the level of fusion, traditional fusion strategies can be classified into the following three categories: 1) feature-level fusion (early fusion), 2) decision-level multimodal fusion (late fusion), and 3) hybrid multimodal fusion. With the rapid development of deep learning, an increasing number of researchers are employing deep learning models to facilitate multimodal fusion. In the following, we introduce these multimodal fusion types and their subtypes.
II-A Feature-level fusion
Feature-level fusion is a common and straightforward method to fuse different modalities. The features extracted from the various modalities are first combined into a high-dimensional feature and then sent as a whole to the models [13, 23, 31, 35, 33].
The advantages of feature-level fusion are two-fold: 1) it can utilize the correlation between different modalities at an early stage, which better facilitates task accomplishment, and 2) the fused data contain more information than a single modality, and thus, a performance improvement is expected. The drawbacks of feature-level fusion methods mainly reside in the following: 1) it is difficult to represent the time synchronization between different modality features, 2) this type of fusion method might suffer from the curse of dimensionality on small datasets, and 3) higher-dimensional features might stress computational resources during model training.
II-B Decision-level fusion
Decision-level fusion focuses on the usage of small classifiers and their combination. Ensemble learning is often used to assemble these classifiers. The term decision-level fusion describes a variety of methods designed to merge the outcomes of these classifiers into a single decision.
Rule-based fusion methods are the most widely adopted in multimodal emotion recognition. Lu and colleagues utilized MAX fusion, SUM fusion, and fuzzy integral fusion for multimodal emotion recognition, and they found the complementary nature of EEG and eye movement features by analyzing confusion matrices. Although rule-based fusion methods are easy to use, the difficulty facing rule-based fusion is how to design good rules. If rules are too simple, they might not reveal the relationships between different modalities.
The advantage of decision-level fusion is that the decisions from different classifiers are easily compared and each modality can use the classifier most suitable for the task.
II-C Hybrid fusion
Hybrid fusion is a combination of feature-level fusion and decision-level fusion. Sun and colleagues built a hierarchical classifier by combining both feature-level and decision-level fusion methods for emotion recognition. Guo et al. built a hybrid classifier by combining a fuzzy cognitive map and an SVM to classify emotional states with a compressed sensing representation.
II-D Deep-learning-based fusion
For deep learning models, different types of multimodal fusion methods have been developed, and these methods can be grouped into two categories based on the modality representation: multimodal joint representation and multimodal coordinated representation.
The multimodal joint representation framework takes all the modalities as input, and each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space. Both transformation and fusion processes are achieved automatically by black-box models, and users do not know the meaning of the joint representations. The multimodal joint representation framework has been applied to emotion recognition [30, 52, 34].
The multimodal coordinated representation framework, instead of projecting the modalities together into a joint space, learns separate representations for each modality but coordinates them through a constraint. The most common coordinated representation models enforce similarity between modalities. Frome and colleagues proposed a deep visual semantic embedding (DeViSE) model to identify visual objects. DeViSE is initialized from two pre-trained neural network models: a visual object categorization network and a skip-gram language model. DeViSE combines these two networks with a dot-product similarity metric and a hinge rank loss, such that the model is trained to produce a higher dot-product similarity between the visual model output and the vector representation of the correct label than between the visual output and other randomly chosen text terms.
The deep canonical correlation analysis (DCCA) method, which is another model under the coordinated representation framework, was proposed by Andrew and colleagues. In contrast to DeViSE, DCCA adopts traditional CCA as a similarity metric, which allows us to transform data into a highly correlated hyperspace.
In this section, we first provide a brief description of traditional canonical correlation analysis (CCA) in Section III-A. Based on CCA, we present the building processes of DCCA in Section III-B. The baseline methods used in this paper are described in Section III-C. Finally, the mutual information neural estimation (MINE) algorithm, which is utilized to analyze the properties of the features transformed by DCCA in the coordinated hyperspace, is given in Section III-D.
III-A Canonical Correlation Analysis
Canonical correlation analysis (CCA) was proposed by Hotelling. It is a widely used technique in the statistics community to measure the linear relationship between two multidimensional variables. Hardoon and colleagues applied CCA to machine learning.
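Let $(X_1, X_2) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ denote random vectors with covariance matrices $(\Sigma_{11}, \Sigma_{22})$ and cross-covariance matrix $\Sigma_{12}$. CCA attempts to find linear transformations of $(X_1, X_2)$, namely $(w_1^{*\top} X_1, w_2^{*\top} X_2)$, which are maximally correlated:
$$(w_1^*, w_2^*) = \arg\max_{w_1, w_2} \mathrm{corr}(w_1^\top X_1, w_2^\top X_2) = \arg\max_{w_1, w_2} \frac{w_1^\top \Sigma_{12} w_2}{\sqrt{w_1^\top \Sigma_{11} w_1 \; w_2^\top \Sigma_{22} w_2}},$$
where we assume the projections are constrained to have unit variance, i.e., $w_1^\top \Sigma_{11} w_1 = w_2^\top \Sigma_{22} w_2 = 1$.
To find multiple pairs $(w_1^i, w_2^i)$, subsequent projections are also constrained to be uncorrelated with previous projections, i.e., $w_1^{i\top} \Sigma_{11} w_1^j = w_2^{i\top} \Sigma_{22} w_2^j = 0$ for $i < j$. Combining the top $k$ projection vectors $w_1^i$ into a matrix $A_1 \in \mathbb{R}^{n_1 \times k}$ as column vectors and similarly placing $w_2^i$ into $A_2 \in \mathbb{R}^{n_2 \times k}$, we then identify the top $k \le \min(n_1, n_2)$ projections:
$$\max_{A_1, A_2} \; \mathrm{tr}(A_1^\top \Sigma_{12} A_2) \quad \text{subject to} \quad A_1^\top \Sigma_{11} A_1 = A_2^\top \Sigma_{22} A_2 = I.$$
To solve this objective function, we first define $T = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$, and we let $U_k$ and $V_k$ be the matrices of the first $k$ left singular and right singular vectors of $T$, respectively. Then the optimal objective value is the sum of the top $k$ singular values of $T$, and the optimum is obtained at $(A_1^*, A_2^*) = (\Sigma_{11}^{-1/2} U_k, \Sigma_{22}^{-1/2} V_k)$. This method requires the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ to be nonsingular, which is usually satisfied in practice.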
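As a minimal sketch of the SVD-based solution described above (not the implementation used in this paper), the following NumPy code estimates the covariance matrices from samples, forms the whitened cross-covariance matrix, and reads the projections off its singular vectors. The function names, the small ridge term `reg`, and the synthetic two-view data are assumptions made for illustration.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def linear_cca(x1, x2, k, reg=1e-8):
    """Return projection matrices and the top-k canonical correlations.

    x1: (N, n1) samples of the first view; x2: (N, n2) samples of the second.
    reg is a small ridge term (an assumption) that keeps the covariance
    estimates nonsingular.
    """
    n = x1.shape[0]
    x1c = x1 - x1.mean(axis=0)
    x2c = x2 - x2.mean(axis=0)
    s11 = x1c.T @ x1c / (n - 1) + reg * np.eye(x1.shape[1])
    s22 = x2c.T @ x2c / (n - 1) + reg * np.eye(x2.shape[1])
    s12 = x1c.T @ x2c / (n - 1)
    s11_is, s22_is = inv_sqrt(s11), inv_sqrt(s22)
    t = s11_is @ s12 @ s22_is                # whitened cross-covariance
    u, d, vt = np.linalg.svd(t)
    a1 = s11_is @ u[:, :k]                   # projections for view 1
    a2 = s22_is @ vt.T[:, :k]                # projections for view 2
    return a1, a2, d[:k]


# Synthetic data: both views share one latent factor z (with opposite sign).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
x1 = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
x2 = np.hstack([rng.normal(size=(500, 1)), -z + 0.1 * rng.normal(size=(500, 1))])
a1, a2, corrs = linear_cca(x1, x2, k=1)
```

In this toy setting the first canonical correlation should be close to 1, because each view contains a noisy copy of the same latent factor; the sign difference is absorbed by the projection vectors.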
For the original CCA, the representations in the latent space are obtained by linear transformations, which limits the scope of application of CCA. To address this problem, Lai and Fyfe proposed kernel CCA, in which kernel methods are introduced for nonlinear transformations. Klami and colleagues developed probabilistic canonical correlation analysis (PCCA); later, they extended PCCA to a Bayesian-based CCA named inter-battery factor analysis. There are many other extensions of CCA such as tensor CCA, sparse CCA, and cluster CCA.
III-B Deep Canonical Correlation Analysis
In this paper, we introduce deep canonical correlation analysis (DCCA) to multimodal emotion recognition. DCCA was proposed by Andrew and colleagues, and it computes representations of multiple modalities by passing them through multiple stacked layers of nonlinear transformations. Figure 1 depicts the structure of DCCA used in this paper.
Let $X_1 \in \mathbb{R}^{N \times d_1}$ be the instance matrix for the first modality and $X_2 \in \mathbb{R}^{N \times d_2}$ be the instance matrix for the second modality. Here, $N$ is the number of instances, and $d_1$ and $d_2$ are the dimensions of the extracted features for these two modalities, respectively. To transform the raw features of two modalities nonlinearly, we build two deep neural networks for the two modalities as follows:
$$O_1 = f_1(X_1; W_1), \qquad O_2 = f_2(X_2; W_2),$$
where $W_1$ and $W_2$ denote all parameters for the nonlinear transformations, $O_1, O_2 \in \mathbb{R}^{N \times d}$ are the outputs of the neural networks, and $d$ denotes the output dimension of DCCA. The goal of DCCA is to jointly learn the parameters $W_1$ and $W_2$ for both neural networks such that the correlation of $O_1$ and $O_2$ is as high as possible:
$$(W_1^*, W_2^*) = \arg\max_{W_1, W_2} \mathrm{corr}\big(f_1(X_1; W_1), f_2(X_2; W_2)\big). \tag{6}$$
We use the backpropagation algorithm to update $W_1$ and $W_2$. The solution to calculating the gradients of the objective function in Eq. (6) was developed by Andrew and colleagues. Let $\bar{O}_1 = O_1^\top - \frac{1}{N} O_1^\top \mathbf{1}$ be the centered output matrix, where $\mathbf{1}$ is the $N \times N$ all-ones matrix ($\bar{O}_2$ is defined similarly). We define $\hat{\Sigma}_{12} = \frac{1}{N-1} \bar{O}_1 \bar{O}_2^\top$ and $\hat{\Sigma}_{11} = \frac{1}{N-1} \bar{O}_1 \bar{O}_1^\top + r_1 I$. Here, $r_1$ is a regularization constant ($\hat{\Sigma}_{22}$ is defined similarly with $r_2$). The total correlation of the top $k$ components of $O_1$ and $O_2$ is the sum of the top $k$ singular values of the matrix $T = \hat{\Sigma}_{11}^{-1/2} \hat{\Sigma}_{12} \hat{\Sigma}_{22}^{-1/2}$. In this paper, we take $k = d$, and the total correlation is the trace of $(T^\top T)^{1/2}$:
$$\mathrm{corr}(O_1, O_2) = \mathrm{tr}\big((T^\top T)^{1/2}\big).$$
Finally, we calculate the gradients with the singular value decomposition of $T$, $T = U D V^\top$:
$$\frac{\partial\, \mathrm{corr}(O_1, O_2)}{\partial O_1^\top} = \frac{1}{N-1} \big(2 \nabla_{11} \bar{O}_1 + \nabla_{12} \bar{O}_2\big),$$
$$\nabla_{12} = \hat{\Sigma}_{11}^{-1/2} U V^\top \hat{\Sigma}_{22}^{-1/2}, \qquad \nabla_{11} = -\frac{1}{2} \hat{\Sigma}_{11}^{-1/2} U D U^\top \hat{\Sigma}_{11}^{-1/2},$$
and $\partial\, \mathrm{corr}(O_1, O_2) / \partial O_2^\top$ has a symmetric expression.
After the training of the two neural networks, the transformed features $O_1$ and $O_2$ are in the coordinated hyperspace. In the original DCCA, the authors did not explicitly describe how to use transformed features for real-world applications via machine learning algorithms. Users need to design a strategy to take advantage of the transformed features according to their application.
In this paper, we use a weighted sum fusion method to obtain the fused features as follows:
$$O = \alpha_1 O_1 + \alpha_2 O_2, \tag{11}$$
where $\alpha_1$ and $\alpha_2$ are weights satisfying $\alpha_1 + \alpha_2 = 1$. The fused features are used to train the classifiers to recognize different emotions. In this paper, an SVM classifier is adopted.
According to the construction processes mentioned above, DCCA brings the following advantages to multimodal emotion recognition:
1) By transforming different modalities separately, we can explicitly extract transformed features for each modality ($O_1$ and $O_2$) so that it is convenient to examine the characteristics and relationships of modality-specific transformations.
2) With specified CCA constraints, we can regulate the nonlinear mappings ($f_1$ and $f_2$) and make the model preserve the emotion-related information.
3) By using a weighted sum fusion (under the condition $\alpha_1 + \alpha_2 = 1$), we can assign different priorities to these modalities based on our prior knowledge. A larger weight represents a larger contribution of the corresponding modality to the fused features.
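The weighted sum fusion can be sketched as follows. This is an illustrative example with made-up feature values rather than the code used in the experiments; the function name `weighted_sum_fusion` is hypothetical, and the two inputs are assumed to be transformed features of identical shape.

```python
import numpy as np


def weighted_sum_fusion(o1, o2, alpha1):
    """Fuse two (N, d) feature matrices with weights alpha1 and 1 - alpha1."""
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 must lie in [0, 1]")
    o1 = np.asarray(o1, dtype=float)
    o2 = np.asarray(o2, dtype=float)
    if o1.shape != o2.shape:
        raise ValueError("both modalities must have the same output shape")
    # The second weight is implied by the constraint that the weights sum to 1.
    return alpha1 * o1 + (1.0 - alpha1) * o2


# Hypothetical transformed features for two instances with d = 2.
eeg_features = [[1.0, 2.0], [3.0, 4.0]]
eye_features = [[0.0, 0.0], [2.0, 2.0]]
fused = weighted_sum_fusion(eeg_features, eye_features, alpha1=0.7)
```

The fused matrix has the same shape as each input, so it can be passed directly to a downstream classifier such as an SVM.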
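III-C Baseline methods
III-C1 Concatenation Fusion
Concatenation fusion is a type of feature-level fusion. The feature vectors from two modalities are denoted as $x^1 = [x^1_1, \ldots, x^1_m] \in \mathbb{R}^m$ and $x^2 = [x^2_1, \ldots, x^2_n] \in \mathbb{R}^n$, and the fused features can be calculated with the following equation:
$$x_{\mathrm{fusion}} = [x^1_1, \ldots, x^1_m, x^2_1, \ldots, x^2_n] \in \mathbb{R}^{m+n}.$$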
III-C2 MAX Fusion
The MAX fusion method is a type of decision-level fusion method that chooses the class with the maximum probability as the prediction result. Assuming that we have $K$ classifiers and $C$ categories, there is a probability distribution for each sample, $P_j(Y_i \mid x_t)$, with $j \in \{1, \ldots, K\}$ and $i \in \{1, \ldots, C\}$, where $x_t$ is a sample, $Y_i$ is the predicted label, and $P_j(Y_i \mid x_t)$ is the probability of sample $x_t$ belonging to class $i$ generated by the $j$-th classifier. The MAX fusion rule can be expressed as follows:
$$\hat{Y} = \arg\max_{i} \Big\{ \max_{j} P_j(Y_i \mid x_t) \Big\}.$$
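A minimal sketch of the MAX fusion rule, written in plain Python. The probability values are invented for illustration, and the function name `max_fusion` is hypothetical.

```python
def max_fusion(probabilities):
    """Return the index of the class selected by MAX fusion.

    probabilities[j][i] is the probability of class i from classifier j.
    """
    n_classes = len(probabilities[0])
    # For every class, keep the largest probability over all classifiers.
    best_per_class = [max(p[i] for p in probabilities) for i in range(n_classes)]
    # Predict the class whose best probability is the largest.
    return max(range(n_classes), key=best_per_class.__getitem__)


# Hypothetical outputs of an EEG classifier and an eye movement classifier.
eeg_probs = [0.2, 0.5, 0.3]
eye_probs = [0.7, 0.2, 0.1]
predicted = max_fusion([eeg_probs, eye_probs])
```

Here the eye movement classifier is the most confident one, so its preferred class (index 0) is selected even though the EEG classifier prefers class 1.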
III-C3 Fuzzy Integral Fusion
The fuzzy integral fusion is also a type of decision-level fusion [9, 26]. A fuzzy measure $\mu$ on the set $X$ is a function $\mu: \mathcal{P}(X) \rightarrow [0, 1]$, which satisfies the two axioms: 1) $\mu(\emptyset) = 0$ and 2) $A \subset B \subset X$ implies $\mu(A) \le \mu(B)$. In this paper, we use the discrete Choquet integral to fuse the multimodal features. The discrete Choquet integral of a function $f: X \rightarrow \mathbb{R}^{+}$ with respect to $\mu$ is defined by
$$C_\mu(f) = \sum_{i=1}^{n} \big(f(x_{(i)}) - f(x_{(i-1)})\big)\, \mu(A_{(i)}),$$
where $(i)$ indicates that the indices have been permuted such that $0 \le f(x_{(1)}) \le \cdots \le f(x_{(n)})$, $A_{(i)} = \{x_{(i)}, \ldots, x_{(n)}\}$, and $f(x_{(0)}) = 0$.
In this paper, we utilize the algorithm proposed by Tanaka and Sugeno to calculate the fuzzy measure. The algorithm attempts to find the fuzzy measure $\mu$ that minimizes the total squared error of the model. Tanaka and Sugeno proved that the minimization problem can be solved through a quadratic programming method.
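The discrete Choquet integral itself is short to compute once a fuzzy measure is available. The sketch below uses a hand-picked fuzzy measure purely for illustration; in the paper the measure is learned with the quadratic programming method of Tanaka and Sugeno, which is not reproduced here. The function name and the dictionary-based representation are assumptions.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of non-negative scores.

    values: dict mapping each information source to its score f(x).
    measure: dict mapping frozensets of sources to the fuzzy measure mu(A).
    """
    ordered = sorted(values, key=values.get)      # ascending order of f
    total, previous = 0.0, 0.0
    for idx, source in enumerate(ordered):
        subset = frozenset(ordered[idx:])         # A_(i) = {x_(i), ..., x_(n)}
        total += (values[source] - previous) * measure[subset]
        previous = values[source]
    return total


# Hypothetical fuzzy measure over two modalities (monotone, mu(all) = 1).
mu = {
    frozenset(["eeg"]): 0.6,
    frozenset(["eye"]): 0.3,
    frozenset(["eeg", "eye"]): 1.0,
}
# Hypothetical class probabilities from the two modality-specific classifiers.
score = choquet_integral({"eeg": 0.8, "eye": 0.5}, mu)
```

With these numbers the integral is 0.5 × 1.0 + (0.8 − 0.5) × 0.6 = 0.68, which lies between the two input scores and leans toward the modality with the larger measure.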
III-C4 Bimodal Deep AutoEncoder (BDAE)
A building block of BDAE is the restricted Boltzmann machine (RBM). The RBM is an undirected graphical model that has a visible layer and a hidden layer. Connections exist only between the visible layer $\mathbf{v}$ and the hidden layer $\mathbf{h}$, and there are no connections within the visible layer or within the hidden layer. In this paper, we adopted the BernoulliRBM in Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.BernoulliRBM.html). The visible variables $\mathbf{v} \in \{0, 1\}^{M}$ are binary stochastic units of dimension $M$, which means that the input data should be either binary or real valued between 0 and 1, signifying the probability. The hidden variables $\mathbf{h} \in \{0, 1\}^{F}$ also satisfy a Bernoulli distribution. The energy is calculated with the following function:
$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{i=1}^{M} \sum_{j=1}^{F} W_{ij} v_i h_j - \sum_{i=1}^{M} b_i v_i - \sum_{j=1}^{F} a_j h_j,$$
where $\theta = \{\mathbf{a}, \mathbf{b}, W\}$ are parameters, $W_{ij}$ is the symmetric weight between the visible unit $i$ and the hidden unit $j$, and $b_i$ and $a_j$ are the bias terms of the visible unit and hidden unit, respectively. With an energy function, we can obtain the joint distribution over the visible and hidden units:
$$P(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(\mathbf{v}, \mathbf{h}; \theta)\big), \tag{16}$$
where $Z(\theta) = \sum_{\mathbf{v}} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}, \mathbf{h}; \theta)\big)$ is the normalization constant. Given a set of visible variables $\{\mathbf{v}_n\}_{n=1}^{N}$, the derivative of the log-likelihood with respect to the weight $W$ can be calculated from Eq. (16):
$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log P(\mathbf{v}_n; \theta)}{\partial W_{ij}} = \mathbb{E}_{P_{\mathrm{data}}}[v_i h_j] - \mathbb{E}_{P_{\mathrm{model}}}[v_i h_j].$$
The BDAE training procedure includes encoding and decoding. In the encoding phase, we train two RBMs for EEG features and eye movement features, and the hidden layers are denoted as $\mathbf{h}_{\mathrm{EEG}}$ and $\mathbf{h}_{\mathrm{Eye}}$. These two hidden layers are concatenated together, and the concatenated layer is used as the visible layer of a new upper RBM. In the decoding stage, we unfold the stacked RBMs to reconstruct the input features. Finally, we use a back-propagation algorithm to minimize the reconstruction error.
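III-D Mutual Information Neural Estimation
Mutual information is a fundamental quantity for measuring the relationship between variables. The mutual information quantifies the dependence of two random variables $X$ and $Z$ with the following equation:
$$I(X; Z) = \int_{\mathcal{X} \times \mathcal{Z}} \log \frac{d\mathbb{P}_{XZ}}{d\mathbb{P}_X \otimes \mathbb{P}_Z} \, d\mathbb{P}_{XZ},$$
where $\mathbb{P}_{XZ}$ is the joint probability distribution, and $\mathbb{P}_X$ and $\mathbb{P}_Z$ are the marginals.
The mutual information neural estimation (MINE) was proposed by Belghazi and colleagues. MINE is linearly scalable in dimensionality as well as in sample size, trainable through a back-propagation algorithm, and strongly consistent.
The idea behind MINE is to choose $\mathcal{F}$ to be the family of functions $T_\theta: \mathcal{X} \times \mathcal{Z} \rightarrow \mathbb{R}$ parameterized by a deep neural network with parameters $\theta \in \Theta$. Then, the deep neural network is used to update the estimated mutual information,
$$I(X; Z) \ge I_\Theta(X, Z),$$
where $I_\Theta(X, Z)$ is defined as
$$I_\Theta(X, Z) = \sup_{\theta \in \Theta} \; \mathbb{E}_{\mathbb{P}_{XZ}}[T_\theta] - \log\big(\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{T_\theta}]\big). \tag{21}$$
The expectations in Eq. (21) are estimated using empirical samples from $\mathbb{P}_{XZ}$ and $\mathbb{P}_X \otimes \mathbb{P}_Z$ or by shuffling the samples from the joint distribution, and MINE is defined as
$$\widehat{I(X; Z)}_n = \sup_{\theta \in \Theta} \; \mathbb{E}_{\mathbb{P}^{(n)}_{XZ}}[T_\theta] - \log\big(\mathbb{E}_{\mathbb{P}^{(n)}_X \otimes \hat{\mathbb{P}}^{(n)}_Z}[e^{T_\theta}]\big),$$
where $\hat{\mathbb{P}}^{(n)}$ is the empirical distribution associated with $n$ samples. The details on the implementation of MINE are provided in Algorithm 1.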
IV Experimental settings
To evaluate the effectiveness of DCCA for multimodal emotion recognition, five multimodal emotion recognition datasets are selected for experimental study in this paper.
IV-A1 SEED dataset (http://bcmi.sjtu.edu.cn/home/seed/index.html)
The SEED dataset was developed by Zheng and Lu . A total of 15 Chinese film clips of three emotions (happy, neutral and sad) were chosen from a pool of materials as stimuli used in the experiments. Before the experiments, the participants were told the procedures of the entire experiment. During the experiments, the participants were asked to watch the selected 15 movie clips, and report their emotional feelings. After watching a movie clip, the subjects were given 45 seconds to provide feedback and 15 seconds to rest. In this paper, we use the same subset of the SEED dataset as in our previous work [31, 30, 52] for the comparison study.
The SEED dataset contains EEG signals and eye movement signals. The EEG signals were collected with an ESI NeuroScan system at a sampling rate of 1000 Hz from a 62-channel electrode cap. Eye movement signals were collected with SMI eye tracking glasses (https://www.smivision.com/eye-tracking/product/eye-tracking-glasses/).
IV-A2 SEED-IV dataset
The SEED-IV dataset was first proposed in earlier work. The experimental procedure was similar to that of the SEED dataset, and 72 film clips were chosen as stimuli materials. The dataset contains emotional EEG signals and eye movement signals of four different emotions, i.e., happy, sad, neutral, and fear. A total of 15 subjects (7 male and 8 female) participated in the experiments. For each participant, three sessions were performed on different days, and each session consisted of 24 trials. In each trial, the participant watched one of the movie clips.
IV-A3 SEED-V dataset
The SEED-V dataset was proposed in earlier work. The dataset contains EEG signals and eye movement signals for five emotions (happy, sad, neutral, fear, and disgust). A total of 16 subjects (6 male and 10 female) were recruited to participate in the experiment, and each of them performed the experiment three times. During the experiment, the subjects were required to watch 15 movie clips (3 clips for each emotion). The same devices were used in the SEED-V dataset as in the SEED and SEED-IV datasets. The SEED-V dataset used in this paper will be freely available to the academic community as a subset of SEED (http://bcmi.sjtu.edu.cn/home/seed/index.html).
IV-A4 DEAP dataset
The DEAP dataset was developed by Koelstra and colleagues and is a multimodal dataset for the analysis of human affective states. The EEG signals and peripheral physiological signals (EOG, EMG, GSR, respiration belt, and plethysmograph) of 32 participants were recorded as each watched 40 one-minute-long excerpts of music videos. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity.
IV-A5 DREAMER dataset
The DREAMER dataset is a multimodal emotion dataset developed by Katsigiannis and Ramzan. The DREAMER dataset consists of 14-channel EEG signals and 2-channel ECG signals of 23 subjects (14 males and 9 females). During the experiments, the participants watched 18 film clips to elicit 9 different emotions including amusement, excitement, happiness, calmness, anger, disgust, fear, sadness, and surprise. After watching a clip, the self-assessment manikins were used to acquire subjective assessments of valence, arousal, and dominance.
IV-B Feature extraction
IV-B1 EEG feature extraction
For EEG signals, we extract differential entropy (DE) features using short-time Fourier transforms with a 4-second Hanning window without overlapping [6, 43]. The differential entropy feature is used to measure the complexity of continuous random variables. Its calculation formula can be written as follows:
$$h(X) = -\int_{X} f(x) \log f(x) \, dx,$$
where $X$ is a random variable and $f(x)$ is the probability density function of $X$. For a time series $X$ obeying the Gaussian distribution $N(\mu, \sigma^2)$, its differential entropy can be calculated as follows:
$$h(X) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \log\Big(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\Big) dx = \frac{1}{2} \log(2\pi e \sigma^2). \tag{24}$$
Shi and colleagues showed with the Kolmogorov-Smirnov test that EEG signals within a short time period in different frequency bands follow a Gaussian distribution, so the DE features can be calculated by Eq. (24).
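Under the Gaussian assumption, the DE of a band-limited signal depends only on its variance and equals one half of the logarithm of 2πe times the variance. A minimal sketch, assuming the input is a list of samples that has already been restricted to one frequency band (the short-time Fourier transform step is not shown, and the function name is hypothetical):

```python
import math
import statistics


def differential_entropy(samples):
    """Differential entropy of samples under a Gaussian assumption."""
    variance = statistics.pvariance(samples)
    return 0.5 * math.log(2 * math.pi * math.e * variance)


# Toy segment with unit variance.
segment = [-1.0, 1.0, -1.0, 1.0]
de_value = differential_entropy(segment)
# Doubling the amplitude multiplies the variance by 4 and adds log(2) to DE.
de_scaled = differential_entropy([2 * s for s in segment])
```

Because the feature is a logarithm of the variance, scaling the signal shifts the DE by a constant, which is one reason log-energy-type features are comparatively well behaved across frequency bands with very different power.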
We extract DE features from EEG signals (from the SEED, SEED-IV, and SEED-V datasets) in five frequency bands for all channels: delta (1-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz), and gamma (31-50 Hz). There are in total 310 dimensions (62 channels × 5 bands) for the 62 EEG channels. Finally, we adopt the linear dynamic system method to filter out noise and artifacts.
For the DEAP dataset, the raw EEG signals were downsampled to 128 Hz and preprocessed with a bandpass filter from 4 to 75 Hz. We extract the DE features from four frequency bands (theta, alpha, beta, and gamma). As a result, there are 128 dimensions for the DE features.
IV-B2 ECG feature extraction
In previous ECG-based emotion recognition studies, researchers extracted time-domain features, frequency-domain features, and time-frequency-domain features from ECG signals for emotion recognition [16, 15, 61]. Katsigiannis and Ramzan extracted power spectral density (PSD) features of low frequency and high frequency from ECG signals. Hsu and colleagues extracted power for three frequency bands: a very-low-frequency range (0.0033-0.04 Hz), a low-frequency range (0.04-0.15 Hz), and a high-frequency range (0.15-0.4 Hz).
However, previous studies have shown that ECG signals have a much wider frequency range. In the early stage of ECG research, Scher and Young showed that ECG signals contain frequency components as high as 100 Hz. Recently, Shufni and Mashor also showed that there are high-frequency components (up to 600 Hz) in ECG signals. Tereshchenko and Josephson reviewed studies on ECG frequencies and noted that “the full spectrum of frequencies producing the QRS complex has not been adequately explored”.
Since there are no standard frequency separation methods for ECG signals, we extract the logarithm of the average energy of five frequency bands (1-4 Hz, 4-8 Hz, 8-14 Hz, 14-31 Hz, and 31-50 Hz) from two ECG channels of the DREAMER dataset. As a result, we extract 10-dimensional features from the ECG signals.
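One plausible way to compute such features is sketched below with NumPy. The exact energy estimator used in the paper is not specified, so the FFT-based periodogram, the small constant `eps`, the sampling rate of 256 Hz, and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np

# Band edges in Hz, taken from the text above.
BANDS = [(1, 4), (4, 8), (8, 14), (14, 31), (31, 50)]


def log_band_energy(signal, fs, eps=1e-12):
    """Logarithm of the average spectral energy in each band.

    eps (an assumption) avoids taking the logarithm of zero.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    features = []
    for low, high in BANDS:
        mask = (freqs >= low) & (freqs < high)
        features.append(np.log(spectrum[mask].mean() + eps))
    return np.array(features)


# Synthetic 4-second, two-channel recording (assumed fs = 256 Hz).
fs = 256
t = np.arange(4 * fs) / fs
channel_1 = np.sin(2 * np.pi * 10 * t)    # energy concentrated in 8-14 Hz
channel_2 = np.sin(2 * np.pi * 20 * t)    # energy concentrated in 14-31 Hz
ecg_features = np.concatenate(
    [log_band_energy(channel_1, fs), log_band_energy(channel_2, fs)]
)
```

Concatenating the five band features of the two channels gives the 10-dimensional vector described in the text.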
IV-B3 Eye movement features
The eye movement data in the SEED dataset, recorded using SMI ETG eye-tracking glasses, provide various types of parameters such as pupil diameters, fixation positions and durations, saccade information, blink details, and other event statistics. Although emotional changes cause fluctuations in pupil diameter, environmental luminance is the main cause of pupil diameter changes. Consequently, we adopt a principal component analysis-based method to remove the changes caused by lighting conditions.
The eye movement signals acquired by SMI ETG eye-tracking glasses contain both statistical features, such as blink information, and computational features, such as temporal and frequency features. Table I lists the eye movement features used in this paper; in total, the eye movement features have 33 dimensions.
Table I: Eye movement features used in this paper
|Eye movement parameters||Extracted features|
|Pupil diameter (X and Y)||Mean, standard deviation, DE in four bands|
|Dispersion (X and Y)||Mean, standard deviation|
|Fixation duration (ms)||Mean, standard deviation|
|Blink duration (ms)||Mean, standard deviation|
|Saccade||Mean and standard deviation of saccade duration (ms) and|
|Event statistics||Blink frequency, fixation duration maximum, fixation dispersion total, fixation dispersion maximum, saccade duration average, saccade amplitude average, saccade latency average|
Table II: DCCA structures for the five datasets
|Datasets||#Hidden Layers||#Hidden Units||Output Dimensions|
|SEED||6||400±40, 200±20, 150±20, 120±10, 60±10, 20±2||20|
|SEED-IV||7||400±40, 200±20, 150±20, 120±10, 90±10, 60±10, 20±2||20|
|SEED-V||2||searching for the best numbers between 50 and 200||12|
|DEAP||7||1500±50, 750±50, 500±25, 375±25, 130±20, 65±20, 30±20||20|
|DREAMER||2||searching for the best numbers between 10 and 200||5|
IV-B4 Peripheral physiological signal features
For peripheral physiological signals from the DEAP dataset, we calculate statistical features in the temporal domain, including the maximum value, minimum value, mean value, standard deviation, variance, and squared sum. Since there are 8 channels for the peripheral physiological signals, we extract 48-dimensional features (8 channels × 6 features).
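The six temporal statistics can be sketched in plain Python as follows. Whether the population or the sample form of the standard deviation and variance is used is not stated in the text; the population form below is an assumption, as are the function names and the toy data.

```python
import statistics


def channel_features(signal):
    """Six temporal statistics for one channel, in the order listed above."""
    return [
        max(signal),
        min(signal),
        statistics.fmean(signal),
        statistics.pstdev(signal),        # population form (assumption)
        statistics.pvariance(signal),     # population form (assumption)
        sum(x * x for x in signal),       # squared sum
    ]


def trial_features(channels):
    """Concatenate the statistics of all channels into one feature vector."""
    return [value for channel in channels for value in channel_features(channel)]


# Toy trial with 8 channels of 5 samples each.
toy_channels = [[float(i + j) for j in range(5)] for i in range(8)]
feature_vector = trial_features(toy_channels)
single_channel = channel_features([1.0, 2.0, 3.0])
```

With 8 channels and 6 statistics per channel, the resulting vector has 48 entries, matching the dimensionality stated above.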
IV-C Model training
For the SEED dataset, the DE features of the first 9 movie clips are used as training data, and those of the remaining 6 movie clips are used as test data. In this paper, we build subject-dependent models to classify three types of emotions (happy, sad, and neutral), which is the same as in our previous work [31, 30, 52].
A similar training-testing separation scheme is applied to the SEED-IV dataset. There are 24 trials for each session, and we use the data from the first 16 trials as the training data and the data from the remaining 8 trials as the test data. DCCA is trained to recognize four emotions (happy, sad, fear, and neutral).
For the SEED-V dataset, the training-testing separation strategy is the same as that used by Zhao et al. We adopt three-fold cross-validation to evaluate the performance of DCCA on the five-emotion (happy, sad, fear, neutral, and disgust) recognition task. Since each participant watched 15 movie clips (the first 5 clips, the middle 5 clips, and the last 5 clips) and participated in three sessions, we concatenate the features of the first 5 clips from the three sessions (i.e., we concatenate features extracted from 15 movie clips) as the training data for fold one (with a similar operation for folds two and three).
For the DEAP dataset, we build a subject-dependent model with a 10-fold cross-validation on two binary classification tasks and a four-emotion recognition task:
Binary classifications: arousal-level and valence-level classification with a threshold of 5.
Four-category classification: high arousal, high valence (HAHV); high arousal, low valence (HALV); low arousal, high valence (LAHV); and low arousal, low valence (LALV).
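A minimal sketch of how the self-reported ratings could be mapped to these labels. The text states only that the threshold is 5; treating a rating strictly greater than 5 as "high" is an assumption, and the function names are hypothetical.

```python
def binary_label(rating, threshold=5.0):
    """Return 1 for a high rating and 0 for a low rating."""
    # Strict inequality is an assumption; the text only gives the threshold.
    return 1 if rating > threshold else 0


def quadrant_label(arousal, valence, threshold=5.0):
    """Map an (arousal, valence) pair to HAHV, HALV, LAHV, or LALV."""
    arousal_part = "HA" if binary_label(arousal, threshold) else "LA"
    valence_part = "HV" if binary_label(valence, threshold) else "LV"
    return arousal_part + valence_part


# Hypothetical ratings for one trial.
arousal_class = binary_label(7.1)
four_class = quadrant_label(arousal=7.1, valence=3.0)
```

The same thresholding serves both tasks: the binary tasks use one rating at a time, and the four-category task combines the two binary decisions.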
For the DREAMER dataset, we utilize leave-one-out cross-validation (i.e., 18-fold validation) to evaluate the performance of DCCA on three binary classification tasks (arousal, valence, and dominance), which is the same as that used by Song et al.
For these five different datasets, DCCA uses different hidden layers, hidden units, and output dimensions. Table II summarizes the DCCA structures for these datasets. For all five datasets, the learning rate and batch size of DCCA are set to 0.001 and 100, respectively, and the same regularization parameter is used throughout.
V Experimental results
V-A SEED, SEED-IV, and DEAP Datasets
In this section, we summarize our previous results on the SEED, SEED-IV, and DEAP datasets. Table III lists the results obtained by seven existing methods and DCCA on the SEED dataset. Lu and colleagues applied concatenation fusion, MAX fusion, and fuzzy integral fusion to fuse multiple modalities and demonstrated that the fuzzy integral fusion method achieved an accuracy of 87.59%. Liu et al. and Tang et al. improved multimodal methods, obtaining accuracies of 91.01% and 94.58%, respectively. Recently, Yang and colleagues built a single-layer feedforward network (SLFN) with subnetwork nodes and achieved an accuracy of 91.51%. Song and colleagues proposed DGCNN and obtained a classification accuracy of 90.40%. As seen from Table III, DCCA achieves the best result of 94.58% among the eight different methods.
|SLFN with subnetwork nodes ||91.51||–|
Table IV gives the results of five different methods on the SEED-IV dataset. We can observe from Table IV that for the SVM classifier, the four emotion states are recognized with a 75.88% mean accuracy rate, and the BDAE model improved the result to 85.11%. DCCA outperforms the aforementioned two methods, with an 87.45% mean accuracy rate.
Two classification schemes are adopted on the DEAP dataset. Table V shows the results of binary classifications. As we can observe, DCCA achieves the best results in both arousal classification (84.33%) and valence classification (85.62%) tasks.
For the four-category classification task, a previously proposed three-stage decision framework outperformed KNN and SVM with an accuracy rate of 70.04%. The DCCA model achieved a mean accuracy rate of 88.51%, which is more than 18 percentage points higher than the existing methods.
|Three-stage decision Framework ||70.04/-|
From the experimental results mentioned above, we can see that DCCA outperforms the existing methods on the SEED, SEED-IV, and DEAP datasets.
V-B SEED-V Dataset
We examine the effectiveness of DCCA on the SEED-V dataset, which contains multimodal signals of five emotions (happy, sad, fear, neutral, and disgust).
We perform a series of experiments to choose the best output dimension and fusion coefficients (the two modality weights in Eq. (11)) for DCCA. We adopt the grid search method with output dimensions ranging from 5 to 50 and coefficients for the EEG features ranging from 0 to 1. Since the two weights sum to 1, the weight for the other modality is one minus the EEG weight. Figure 2 shows the heat map of the experimental results of the grid search. Each row in Fig. 2 corresponds to an output dimension, and each column corresponds to a weight of the EEG features. The numbers in the blocks are the accuracy rates, which are rounded to integers for simplicity. According to Fig. 2, we set the output dimension to 12 and the weight of the EEG features to 0.7 (i.e., a weight of 0.3 for the eye movement features).
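The grid search described above can be sketched as follows. This is an illustration under stated assumptions: the fusion is taken to be a weighted sum of the two transformed feature vectors with weights that sum to 1, the weight grid uses a step of 0.1, and `toy_score` is a made-up stand-in for training and evaluating DCCA (it does not reproduce any accuracy reported in this paper). All function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def fuse(eeg: Sequence[float], eye: Sequence[float],
         w_eeg: float) -> List[float]:
    """Weighted-sum fusion of two equally sized feature vectors."""
    w_eye = 1.0 - w_eeg  # the two weights are assumed to sum to 1
    return [w_eeg * a + w_eye * b for a, b in zip(eeg, eye)]


def grid_search(dims: Sequence[int], weights: Sequence[float],
                evaluate: Callable[[int, float], float]
                ) -> Tuple[Tuple[int, float], Dict[Tuple[int, float], float]]:
    """Score every (output dimension, EEG weight) pair and return the best."""
    scores = {(d, w): evaluate(d, w) for d in dims for w in weights}
    best = max(scores, key=lambda key: scores[key])
    return best, scores


def toy_score(dim: int, w_eeg: float) -> float:
    """Made-up score surface; a real run would return validation accuracy."""
    return -((dim - 20) ** 2) - 100.0 * (w_eeg - 0.5) ** 2


dims = list(range(5, 51))              # output dimensions 5..50
weights = [i / 10 for i in range(11)]  # EEG weights 0.0..1.0 (assumed step)
best, scores = grid_search(dims, weights, toy_score)
print("best (dimension, EEG weight):", best)
print("fused example:", fuse([1.0, 2.0], [3.0, 4.0], 0.5))
```

The `scores` dictionary holds one value per grid cell, which is the information a heat map such as Fig. 2 displays.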
Zhao and colleagues adopted feature-level concatenation and the bimodal deep autoencoder (BDAE) for fusing multiple modalities and achieved mean accuracy rates of 73.65% and 79.70%, respectively. In addition to feature-level concatenation, we also implement the MAX fusion and fuzzy integral fusion strategies here. As shown in Table VII, MAX fusion and fuzzy integral fusion yield mean accuracy rates of 73.14% and 73.62%, respectively. The mean accuracy rate of DCCA is 83.08%, which is the best result among the five fusion strategies.
Figure 3 depicts the confusion matrices of the DCCA model and the models adopted by Zhao and colleagues. Figures 3(a), (b), and (c) are the confusion matrices for the EEG features, eye movement features, and the BDAE model, respectively. Figure 3(d) depicts the confusion matrix for the DCCA model. From Figs. 3(a), (b), and (d), for each of the five emotions, DCCA achieves a higher accuracy, indicating that emotions are better represented and more easily classified in the coordinated hyperspace transformed by DCCA.
From Figs. 3(a) and (c), compared with the unimodal results of the EEG features, the BDAE model achieves worse classification results on the happy emotion, suggesting that the BDAE model might not take full advantage of different modalities for the happy emotion. Comparing Figs. 3(c) and (d), DCCA largely improved the classification results on disgust and happy emotion recognition tasks compared with the BDAE model, implying that DCCA is more effective in fusing multiple modalities.
To analyze the coordinated hyperspace of DCCA, we utilized the t-SNE algorithm to visualize the space of the original features and the coordinated hyperspace of the transformed features and fused features. Figure 4 presents a visualization of the features from three participants. The first row shows the original features, the second row depicts the transformed features, and the last row presents the fused features. The different colors stand for different emotions, and the different markers are different modalities. We can make the following observations:
Different emotions are disentangled in the coordinated hyperspace. For the original features, there is more overlap among different emotions (points of different colors overlap substantially), which leads to poorer emotional representation. After the DCCA transformation, different emotions become relatively independent, and the overlapping areas are considerably reduced. This indicates that the transformed features have improved emotional representation capabilities compared with the original features. Finally, after multimodal fusion, different emotions (markers of different colors in the last row) are completely separated, and there is no overlapping area, indicating that the fused features also have good emotional representation ability.
Different modalities have homogeneous distributions in the coordinated hyperspace. To make this observation more obvious, we separate and plot the distributions of the EEG and eye movement features under the sad emotion in Fig. 5. From the perspectives of both inter-modality and intra-modality distributions, the original EEG features and eye movement features (plotted with different markers) are separated from each other. After the DCCA transformation, the EEG features and the eye movement features have more compact distributions, indicating that the coordinated hyperspace preserves shared emotion-related information and discards irrelevant information.
Figures 4 and 5 qualitatively show that DCCA maps the original EEG and eye movement features into a coordinated hyperspace where emotions are better represented, since only emotion-related information is preserved.
Furthermore, we calculated the mutual information of the original features and transformed features to support our claims quantitatively. Figure 6 presents the mutual information of three participants estimated by MINE. The green curves depict the mutual information of the original EEG and eye movement features, and the red curves are the estimated mutual information of the transformed features. The transformed features have more mutual information than the original features, indicating that EEG and eye movement features in the coordinated hyperspace provide more shared emotion-related information, which is consistent with observations from Figs. 4 and 5.
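For reference, MINE estimates mutual information by maximizing the Donsker-Varadhan lower bound over a family of neural networks. The notation below is an assumption made for illustration and may differ from that of Section III-D: $X$ and $Z$ stand for the two feature sets being compared, and $T_\theta$ is the statistics network with parameters $\theta$.

```latex
I(X; Z) \;\ge\; \sup_{\theta \in \Theta}\;
  \mathbb{E}_{P_{XZ}}\!\left[ T_\theta \right]
  - \log \mathbb{E}_{P_X \otimes P_Z}\!\left[ e^{T_\theta} \right]
```

Because the estimate is a lower bound that tightens during training, curves such as those in Fig. 6 are typically read at convergence.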
V-C Robustness Analysis on the SEED-V Dataset
EEG signals have a low signal-to-noise ratio (SNR) and are easily corrupted by external environmental noise. To compare the noise robustness of DCCA with that of the existing methods, we designed two experimental schemes on noisy datasets. 1) We added Gaussian noise of different variances to both the EEG and eye movement features. To highlight the influence of noise, we added the noise to the normalized features, since the directly extracted features are much larger than the generated noise (which is mostly less than 1). 2) Under certain extreme conditions, EEG signals may be overwhelmed by noise. To simulate this situation, we randomly replaced different proportions (10%, 30%, and 50%) of the EEG features with noise drawn from a normal distribution, a gamma distribution, or a uniform distribution. Specifically, for DCCA, we also examine the effect of different weight coefficients on the robustness of the model. In this paper, we compare the performance of three different combinations of coefficients, with EEG weights of 0.3 (DCCA-0.3), 0.5 (DCCA-0.5), and 0.7 (DCCA-0.7).
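The two noise schemes can be sketched with the Python standard library as follows. This is an illustration under stated assumptions: replacement is applied to individual entries of a feature vector, the distribution parameters in `SAMPLERS` are placeholders rather than the values used in the experiments, and all function names are hypothetical.

```python
import math
import random
from typing import Callable, List


def add_gaussian_noise(features: List[float], variance: float,
                       rng: random.Random) -> List[float]:
    """Scheme 1: add zero-mean Gaussian noise of the given variance."""
    std = math.sqrt(variance)
    return [x + rng.gauss(0.0, std) for x in features]


def replace_with_noise(features: List[float], proportion: float,
                       sampler: Callable[[random.Random], float],
                       rng: random.Random) -> List[float]:
    """Scheme 2: overwrite a random subset of entries with sampled noise."""
    n_replace = round(proportion * len(features))
    chosen = set(rng.sample(range(len(features)), n_replace))
    return [sampler(rng) if i in chosen else x
            for i, x in enumerate(features)]


# Placeholder distribution parameters (assumptions, not the paper's values).
SAMPLERS = {
    "normal": lambda r: r.gauss(0.0, 1.0),
    "gamma": lambda r: r.gammavariate(1.0, 1.0),
    "uniform": lambda r: r.uniform(0.0, 1.0),
}

rng = random.Random(0)
eeg = [100.0] * 10  # sentinel values so replaced entries are easy to spot
noisy = replace_with_noise(eeg, 0.3, SAMPLERS["uniform"], rng)
jittered = add_gaussian_noise(eeg, 0.25, rng)
print(sum(1 for x in noisy if x != 100.0), "of", len(noisy), "entries replaced")
```

Passing the sampler as a function keeps the replacement logic independent of the noise distribution, so the same routine covers all three cases.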
V-C1 Adding Gaussian noise
First, we investigate the robustness of different weight combinations in DCCA after adding Gaussian noise of different variances to both the EEG and eye movement features. Figure 7 depicts the results. Although the model achieves the highest classification accuracy when the EEG weight is set to 0.7, it is also more susceptible to noise. The robustness of the model decreases as the weight of the EEG features increases. Since a larger EEG weight leads to more EEG components in the fused features, we might conclude that EEG features are more sensitive to noise than are eye movement features.
Next, we compare the robustness of different models under Gaussian noise with different variances. Taking both classification performance and robustness into consideration, we use DCCA with an EEG weight set to 0.5. Figure 8 shows the performances of the various models. The performance decreases with increasing variance of the Gaussian noise. DCCA obtains the best performance at the lower noise levels, whereas the fuzzy integral fusion strategy exceeds DCCA at the higher noise levels (see Fig. 8). The BDAE model performs poorly under noisy conditions: even when minimal noise is added to the training samples, its performance is greatly reduced.
V-C2 Replacing EEG features with noise
Table VIII shows the detailed emotion recognition accuracies and standard deviations after replacing 10%, 30%, and 50% of the EEG features with different noise distributions. The recognition accuracies decrease with increasing noise proportions. In addition, the performances of the seven different settings under different noise distributions are very similar, indicating that the noise distribution has a limited influence on the recognition accuracy.
To better observe the changing tendency, we plot the average recognition accuracies under different noise distributions with the same noise ratio. Figure 9 shows the average accuracies for DCCA with different EEG weights. The performance clearly decreases with increasing noise percentages, and the model robustness decreases as the weight of the EEG modality increases. This behavior is expected: since we only replace EEG features with noise, larger EEG weights introduce more noise into the fused features, resulting in a decrease in model robustness.
Similar to Fig. 7, we also take DCCA-0.5 as a compromise between performance and robustness and compare it with the other multimodal fusion methods. Figure 10 depicts the trends of the accuracies of several models. DCCA clearly performs the best, concatenation fusion achieves a slightly better performance than the fuzzy integral fusion method, and the BDAE model again presents the worst performance.
As already discussed in previous sections, DCCA attempts to preserve emotion-related information and discard irrelevant information. This property prevents the model performance from rapidly deteriorating, because the negative information introduced by noise is largely neglected.
V-D DREAMER Dataset
For DCCA, we choose the best output dimensions and weight combinations with a grid search. We select the output dimension and the EEG weight from predefined candidate sets for the three binary classification tasks. Figures 11(a), (b), and (c) depict the heat maps of the grid search for the arousal, valence, and dominance classifications, respectively. According to Fig. 11, we choose the best-performing output dimension and EEG weight separately for the arousal, valence, and dominance classifications.
For BDAE, we select the best output dimension from a set of candidate values, and leave-one-out cross-validation is used to evaluate the BDAE model.
Table IX compares the results of the different methods. Katsigiannis and Ramzan released this dataset, and they achieved accuracy rates of 62.32%, 61.84%, and 61.84% on the arousal, valence, and dominance classification tasks, respectively. Song and colleagues conducted a series of experiments on this dataset with SVM, graphSLDA, GSCCA, and DGCNN. DGCNN achieved accuracy rates of 85.54% for arousal classification, 86.23% for valence classification, and 85.02% for dominance classification. From Table IX, we can see that the BDAE and DCCA models adopted in this paper outperform DGCNN. For BDAE, the recognition results for arousal, valence, and dominance are 88.57%, 86.64%, and 89.52%, respectively. DCCA achieves the best performance among all seven methods: 88.99%, 90.57%, and 90.67% for arousal, valence, and dominance level recognition, respectively.
|Fusion EEG & ECG ||62.32/-||61.84/-||61.84/-|
VI Conclusion
In this paper, we have introduced deep canonical correlation analysis (DCCA) to multimodal emotion recognition. We have systematically evaluated the performance of DCCA on five multimodal emotion datasets (the SEED, SEED-IV, SEED-V, DEAP and DREAMER datasets) and compared DCCA with the existing emotion recognition methods. Our experimental results demonstrate that DCCA is superior to the existing methods for multimodal emotion recognition.
We have analyzed the properties of the transformed features in the coordinated hyperspace. By applying the t-SNE method, we have found qualitatively that: 1) different emotions are better represented since they are disentangled in the coordinated hyperspace; and 2) different modalities have compact distributions from both inter-modality and intra-modality perspectives. We have applied the mutual information neural estimation (MINE) algorithm to compare the mutual information of the original features and the transformed features quantitatively. The experimental results show that the features transformed by DCCA have higher mutual information, indicating that the DCCA transformation preserves emotion-related information and discards irrelevant information.
We have investigated the robustness of DCCA on noisy datasets under two schemes. By adding Gaussian noise of different variances to both the EEG and eye movement features, we have demonstrated that DCCA performs best at the lower noise levels. After replacing 10%, 30%, and 50% of the EEG features with noise drawn from normal, gamma, and uniform distributions, we have shown that DCCA has the best performance for multimodal emotion recognition.
References
- (2013) Deep canonical correlation analysis. In International Conference on Machine Learning, pp. 1247–1255.
- (2017) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis & Machine Intelligence 41 (2), pp. 423–443.
- (2018) MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062.
- (2017) A three-stage decision framework for multi-subject emotion recognition using physiological signals. In IEEE International Conference on Bioinformatics & Biomedicine.
- (2015) A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys 47 (3), pp. 1–36.
- (2013) Differential entropy feature for EEG-based emotion classification. In 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 81–84.
- (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognition 44 (3), pp. 572–587.
- (2013) DeViSE: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129.
- (2000) Application of the Choquet integral in multicriteria decision making. Fuzzy Measures & Integrals, pp. 348–374.
- A hybrid fuzzy cognitive map/support vector machine approach for EEG-based emotion classification using compressed sensing. International Journal of Fuzzy Systems 21, pp. 263–273.
- (2011) Sparse canonical correlation analysis. Machine Learning 83 (3), pp. 331–353.
- (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16 (12), pp. 2639–2664.
- (2018) Self-attentive feature-level fusion for multimodal emotion detection. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 196–201.
- (1992) Relations between two sets of variates. In Breakthroughs in Statistics, pp. 162–190.
- (2018) Automatic ECG-based emotion recognition in music listening. IEEE Transactions on Affective Computing, pp. 1–16.
- (2017) DREAMER: a database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices. IEEE Journal of Biomedical and Health Informatics 22 (1), pp. 98–107.
- (2008) Emotion recognition based on physiological changes in music listening. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, pp. 2067–2083.
- Robust kernel density estimation. Journal of Machine Learning Research 13 (Sep), pp. 2529–2565.
- Tensor canonical correlation analysis for action classification. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
- (2008) Probabilistic approach to detecting dependencies between data sets. Neurocomputing 72 (1), pp. 39–46.
- (2013) Bayesian canonical correlation analysis. Journal of Machine Learning Research 14 (Apr), pp. 965–1003.
- (2018) A brief review of facial emotion recognition based on visual information. Sensors 18 (2), pp. 401.
- (2012) DEAP: a database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing 3 (1), pp. 18–31.
- (2015) Multimodal data fusion: an overview of methods, challenges, and prospects. Proceedings of the IEEE 103 (9), pp. 1449–1477.
- (2000) Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10 (05), pp. 365–377.
- (2012) Gender classification by combining clothing, hair and facial component classifiers. Neurocomputing 76 (1), pp. 18–27.
- (2019) Classification of five emotions from EEG and eye movement signals: discrimination ability and stability over time. In 9th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 607–610.
- Emotion recognition from multi-channel EEG data through convolutional recurrent neural network. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 352–359.
- (2011) Generalizations of the subject-independent feature set for music-induced emotion recognition. In 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6092–6095.
- (2016) Emotion recognition using multimodal deep learning. In International Conference on Neural Information Processing, pp. 521–529.
- Combining eye movements and EEG to enhance emotion recognition. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
- (2019) AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10 (1), pp. 18–31.
- (2012) Classification of affects using head movement, skin color features and physiological signals. In IEEE International Conference on Systems.
- (2014) Unsupervised alignment of natural language instructions with video segments. In Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 1558–1564.
- (2011) Multimodal deep learning. In International Conference on Machine Learning, pp. 689–696.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- (2001) Toward machine emotional intelligence: analysis of affective physiological state. IEEE Transactions on Pattern Analysis & Machine Intelligence (10), pp. 1175–1191.
- (2000) Affective computing. MIT Press.
- (2017) A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion 37, pp. 98–125.
- (2018) Multi-view emotion recognition using deep canonical correlation analysis. In International Conference on Neural Information Processing, pp. 221–231.
- (2014) Cluster canonical correlation analysis. In Artificial Intelligence and Statistics, pp. 823–831.
- (1960) Frequency analysis of the electrocardiogram. Circulation Research 8 (2), pp. 344–346.
- (2013) Differential entropy feature for EEG-based vigilance estimation. In 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 6627–6630.
- (2010) Off-line and on-line vigilance estimation based on linear dynamical system and manifold learning. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 6587–6590.
- (2015) ECG signals classification based on discrete wavelet transform, time domain and frequency domain features. In 2015 2nd International Conference on Biomedical Engineering (ICoBE), pp. 1–6.
- (2016) Analysis of EEG signals and facial expressions for continuous emotion detection. IEEE Transactions on Affective Computing 7 (1), pp. 17–28.
- (2012) Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing 3 (2), pp. 211–223.
- EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing.
- (2016) Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild. Journal on Multimodal User Interfaces 10 (2), pp. 125–137.
- (1991) A study on subjective evaluations of printed color images. International Journal of Approximate Reasoning 5 (5), pp. 213–222.
- (2017) Multimodal emotion recognition using deep neural networks. In International Conference on Neural Information Processing, pp. 811–819.
- (2015) Frequency content and characteristics of ventricular conduction. Journal of Electrocardiology 48 (6), pp. 933–937.
- (2008) The coupling of emotion and cognition in the eye: introducing the pupil old/new effect. Psychophysiology 45 (1), pp. 130–140.
- (2014) Emotional state classification from EEG data using machine learning approach. Neurocomputing 129, pp. 94–106.
- Current state of text sentiment analysis from opinion to emotion mining. ACM Computing Surveys (CSUR) 50 (2), pp. 25.
- (2018) EEG-based emotion recognition using hierarchical network with subnetwork nodes. IEEE Transactions on Cognitive and Developmental Systems 10 (2), pp. 408–419.
- Cross-subject EEG feature selection for emotion recognition using transfer recursive feature elimination. Frontiers in Neurorobotics 11, pp. 19.
- (2017) Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Computer Methods and Programs in Biomedicine 140, pp. 93–110.
- (2019) Classification of five emotions from EEG and eye movement signals: complementary representation properties. In 9th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 611–614.
- (2016) Emotion recognition using wireless signals. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pp. 95–108.
- (2015) Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development 7 (3), pp. 162–175.
- Identifying stable patterns over time for emotion recognition from EEG. IEEE Transactions on Affective Computing. doi: 10.1109/TAFFC.2017.2712143.
- (2019) EmotionMeter: a multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics 49 (3), pp. 1110–1122.