Deep Auto-Encoders with Sequential Learning for Multimodal Dimensional Emotion Recognition

by Dung Nguyen, et al.

Multimodal dimensional emotion recognition has drawn great attention from the affective computing community, and numerous schemes have been extensively investigated, making significant progress in this area. However, several questions remain unanswered for most existing approaches, including: (i) how to simultaneously learn compact yet representative features from multimodal data, (ii) how to effectively capture complementary features from multimodal streams, and (iii) how to perform all the tasks in an end-to-end manner. To address these challenges, in this paper we propose a novel deep neural network architecture consisting of a two-stream auto-encoder and a long short-term memory for effectively integrating visual and audio signal streams for emotion recognition. To validate the robustness of our proposed architecture, we carry out extensive experiments on the multimodal emotion-in-the-wild dataset RECOLA. Experimental results show that the proposed method achieves state-of-the-art recognition performance and surpasses existing schemes by a significant margin.




I Introduction

Emotion recognition has become a core research field at the intersection of human communication and artificial intelligence. This research problem is challenging because human emotions can be expressed in different forms such as visual, acoustic, and linguistic structures [14].

As shown in the literature, there are two main conceptualisations of emotions: categorical and dimensional. The categorical approach defines a small set of basic emotions (e.g., happiness, sadness, anger, surprise, fear, and disgust), relying on cross-cultural studies showing that humans perceive certain basic emotions in similar ways regardless of their culture [11]. Alternatively, the dimensional approach represents emotions in a multidimensional space where each dimension captures a fundamental property of the emotions, which is well suited to appraising human emotional states, behaviours, and reactions displayed in real-world settings. These fundamental properties can be captured using the continuous dimensions of the "Circumplex Model of Affects" (CMA) [25], namely valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the activation of an emotion). Fig. 1 illustrates the CMA. Moreover, this approach is more appropriate for representing subtle changes in emotions, which discrete categories may not always capture in real-world conditions [11].

Fig. 1: Two-dimensional valence and arousal space (from [25]).

Recently, deep neural networks have been proposed to effectively predict the continuous dimensions of emotions based on multimodal cues such as auditory and visual information [11, 25, 27, 10, 19, 5, 12]. These works combine convolutional and recurrent neural networks for feature integration, taking advantage of automatic feature learning in convolutional networks while encoding temporal dynamics via the sequential layers of recurrent networks. However, feature integration is performed simply by concatenating domain-dependent features extracted from the individual modalities. This scheme is straightforward and simple to implement, yet may not effectively learn compact and representative multimodal features. To address this issue, we propose a novel deep neural network for multimodal dimensional emotion recognition using a two-stream auto-encoder incorporated with a long short-term memory to jointly learn temporal and compact-representative features from multimodal data. Our architecture is end-to-end trainable, capable of learning multimodal representations, and achieves state-of-the-art performance on a benchmark dataset. Specifically, we make the following contributions:

  • A two-stream auto-encoder that effectively learns multimodal features from multimodal data for emotion recognition. The constituent auto-encoders ensure that compact yet representative features are learnt from the individual domains, while jointly training the auto-encoders captures complementary features across the domains.

  • A long short-term memory (LSTM) that enables sequential learning to encode the long-range contextual and temporal information of multimodal features from data streams.

  • Extensive experiments on the multimodal emotion-in-the-wild dataset RECOLA, including ablation studies on various aspects of the proposed architecture. In addition, important baselines, including dimensional facial emotion recognition and dimensional speech emotion recognition, as well as other existing methods, are thoroughly evaluated and analysed.

The remainder of this paper is organised as follows: Section II briefly reviews related work; Section III describes our proposed method and its related aspects such as data pre-processing, network architecture, training; Section IV presents experiments and results; and Section V concludes our paper with remarks.

II Related Work

Emotion recognition has been a well-studied research problem for several decades and numerous approaches have been proposed. In this section, we limit our review to recent multimodal dimensional emotion recognition methods using deep learning techniques such as convolutional neural networks and long short term memory due to their proven robustness and effectiveness in many applications.

II-A Multimodal emotion recognition

Inspired by the capability of automatic feature learning in deep learning frameworks, Zhang et al. [28] proposed a hybrid deep learning system constructed from a convolutional neural network (CNN), a three-dimensional CNN (3DCNN), and a deep belief network (DBN) to learn audio-visual features for emotion recognition. In this work, the CNN was pre-trained on the large-scale ImageNet database [23] and used to learn audio features from speech signals. To capture the temporal information in video data, the 3DCNN model of [24] was adapted and fine-tuned on contiguous video frames. The learnt audio and visual features were subsequently fused by the DBN to generate audio-visual features that were finally fed to a linear SVM for emotion recognition.

To classify spontaneous multimodal emotional expressions, Barros et al. [2] proposed a so-called cross-channel convolutional neural network (CC-CNN) to learn generic and specific features of emotions based on body movements and facial expressions. These features were further passed into cross-convolution channels to build cross-modal representations. Motivated by human perception of emotional expression, Barros and Wermter [3] developed a perception representation model capturing the correlation between different modalities. In this work, auditory and visual stimuli were first fused using the CC-CNN originally introduced in [2] to achieve a multimodal perception representation. A self-organising layer was then applied on top of the CC-CNN to further separate the perceived expression representations.

Fig. 2: Our proposed network architecture for multimodal dimensional emotion recognition.

Recently, Zheng et al. [30] proposed a multimodal framework including two Restricted Boltzmann Machines (RBMs) to capture eye movements and EEG signals. The RBMs were unfolded into a bimodal deep auto-encoder (BDAE) [18] to extract shared representations of the two modalities, which were finally fed to a linear SVM for emotion classification.

Conventionally, emotion recognition approaches either classify person-independent emotions directly from observed data or determine the relative decrease/increase in the intensity of person-dependent emotions by comparing video segments. Liang et al. [14] proposed to combine both approaches for emotion recognition from audio-visual data. In this work, the emotion recognition task was divided into three subtasks: local relative emotion intensity ranking, global relative emotion intensity ranking, and the incorporation of emotion predictions from observed multimodal expressions with relative emotion ranks from the local-global rankings.

II-B Sequential Learning

When temporal information is available, sequential learning can be applied to improve the accuracy of emotion recognition. Long Short-Term Memory (LSTM) is often used for sequential learning due to its capability of modelling human memory [9, 8, 1].

Technically, LSTMs are recurrent neural networks (RNNs) integrated with special gating architectures that control access to memory cells [8]. These gates can also prevent the cell contents from being modified. LSTMs are therefore able to encode much longer-range patterns and propagate errors better than the original recurrent neural networks [20]. Apart from controlling access to the memory-cell contents, the gates can also learn to focus on specific parts of input sequences and ignore others. This allows LSTMs to capture temporal information in sequential patterns.
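The gating mechanism described above can be illustrated with a minimal numpy sketch of a single LSTM step; the stacked weight layout and variable names here are our own illustrative choices, not taken from the paper or any specific library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the input, forget,
    output, and candidate ("cell input") transforms."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # pre-activations for all gates
    i = sigmoid(z[0*d:1*d])             # input gate: what to write
    f = sigmoid(z[1*d:2*d])             # forget gate: what to keep
    o = sigmoid(z[2*d:3*d])             # output gate: what to expose
    g = np.tanh(z[3*d:4*d])             # candidate cell content
    c = f * c_prev + i * g              # gated memory-cell update
    h = o * np.tanh(c)                  # hidden state / output
    return h, c

# One step on random data: a 3-d input, 4-d hidden state.
rng = np.random.default_rng(0)
d, dx = 4, 3
W = rng.standard_normal((4*d, dx))
U = rng.standard_normal((4*d, d))
b = np.zeros(4*d)
h, c = lstm_step(rng.standard_normal(dx), np.zeros(d), np.zeros(d), W, U, b)
```

The forget gate `f` is what lets errors propagate over long ranges: when `f` saturates near 1, the cell state `c` passes through time steps nearly unchanged.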

Inspired by the advantages of LSTMs, many LSTM-based techniques have been developed for human emotion understanding from streaming data. For instance, Chen and Jin [4] and Wöllmer et al. [26] proposed LSTM-based networks for emotion recognition from data streams. Pei et al. [21] introduced a deep bidirectional LSTM-RNN capable of modelling nonlinear relations in long-term history to handle both multimodal and unimodal emotion recognition tasks. In [25, 13], the authors developed an affective recognition system consisting of several deep neural networks to learn features at discrete times and an LSTM to model the temporal evolution of the features over time.

Despite recent promising achievements, current approaches lack the ability to learn compact and representative features in individual domains and to effectively learn multimodal features from multimodal data streams. To overcome these limitations, we propose to learn compact and representative features from individual domains using auto-encoders and to fuse these domain-dependent features into multimodal features, integrated with temporal information using an LSTM.

III Proposed Method

In this section, we present an end-to-end system for dimensional emotion recognition from multimodal data, including visual and audio data streams.

III-A Data Pre-processing

The input of our system is a pair of video and audio streams. For the video stream, we apply the Single Stage Headless (SSH) detector proposed in [17] to detect human faces in video frames. After that, we resize the cropped faces to a fixed resolution, and the colour intensities in the cropped images are normalised to a fixed range.

For the audio stream, we segment the raw waveform signals of the stream into 0.2s-long sequences after normalising the time-sequences to zero mean and unit variance. The normalisation accounts for the variation in loudness among different speakers. For a given input stream sampled at 16 kHz, each 0.2s-long sequence consists of 5 audio frames, each of which spans 0.04s and is represented by a 640-dimensional vector. Note that each audio frame corresponds to a video frame in the input video stream.
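The audio segmentation above can be sketched as follows (a numpy sketch; the function and constant names are ours):

```python
import numpy as np

SR = 16_000                 # sampling rate (Hz)
FRAME_LEN = int(0.04 * SR)  # 640 samples per 0.04 s audio frame
SEQ_FRAMES = 5              # frames per 0.2 s sequence

def preprocess_audio(waveform):
    """Normalise a raw waveform to zero mean / unit variance,
    then split it into 0.2 s sequences of 5 x 640-sample frames."""
    w = (waveform - waveform.mean()) / (waveform.std() + 1e-8)
    seq_len = FRAME_LEN * SEQ_FRAMES            # 3200 samples = 0.2 s
    n_seq = len(w) // seq_len                   # drop any ragged tail
    return w[: n_seq * seq_len].reshape(n_seq, SEQ_FRAMES, FRAME_LEN)

# Example: one second of audio at 16 kHz -> 5 sequences of 5 frames each.
sequences = preprocess_audio(np.random.default_rng(1).standard_normal(SR))
```

Each 640-sample frame then lines up one-to-one with a video frame, which is what allows the paired visual/audio processing described next.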

III-B Network Architecture

The proposed network architecture is illustrated in Fig. 2. Our architecture consists of two network branches, a 2D convolutional auto-encoder (2DConv-AE) and a 1D convolutional auto-encoder (1DConv-AE), together with a long short-term memory (LSTM). Each network branch takes its corresponding input, e.g., the 2DConv-AE receives an image as input while the 1DConv-AE receives a 0.04s audio frame. The latent layers of these branches are fused into a multimodal representation, which is then fed to the LSTM for prediction of the two dimensional-emotion scores: arousal and valence.

Given an input video stream (of a speaker) including an image sequence and an audio sequence, each image frame in the sequence is passed into the SSH detector [17] to detect the speaker's face. The face image is then fed to the 2DConv-AE to learn visual features. Simultaneously, the corresponding audio frame is passed to the 1DConv-AE to learn audio features. The features extracted from the latent layers of these auto-encoders are compact yet representative of their individual domains. These features are then combined via a fusion layer before being fed to the LSTM for sequential learning of the features (for every 0.2s-long sequence) from the input streams. The role of the LSTM is to model the temporal variation of the audio and visual features, which provides the contextual information of the input data.

The combination of auto-encoders and LSTM ensures that the learnt representations are compact, rich, and complementary, thus making the architecture optimal and robust for the recognition task on multimodal data streams.
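At a shape level, the data flow through the architecture can be sketched as below. The encoder and LSTM bodies are reduced to random stand-ins purely for illustration; only the dimensionalities (2,048 visual features, 1,280 audio features, a 3,328-d fused vector, a 2-output regression head) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
VIS_DIM, AUD_DIM = 2048, 1280        # latent sizes of the two encoders

def encode_visual(face_img):
    """Stand-in for the 2DConv-AE encoder: returns a 2,048-d feature."""
    return rng.standard_normal(VIS_DIM)

def encode_audio(audio_frame):
    """Stand-in for the 1DConv-AE encoder: returns a 1,280-d feature."""
    return rng.standard_normal(AUD_DIM)

def fuse(v_feat, a_feat):
    """Fusion layer: concatenation into a 3,328-d multimodal vector."""
    return np.concatenate([v_feat, a_feat])

def predict_emotion(fused_seq):
    """Stand-in for the 2-layer LSTM + FC head: maps a sequence of fused
    vectors to (arousal, valence). Mean-pooling replaces the recurrence
    here purely for illustration."""
    pooled = np.stack(fused_seq).mean(axis=0)
    W = rng.standard_normal((2, pooled.shape[0])) * 0.01
    return W @ pooled

# One 0.2 s window: 5 paired video/audio frames.
fused_seq = [fuse(encode_visual(None), encode_audio(None)) for _ in range(5)]
arousal, valence = predict_emotion(fused_seq)
```

The key design point visible even in this sketch is that fusion happens at the latent-feature level, per frame, before the temporal model sees anything.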

Fig. 3: Architecture of the 2DConv-AE.

Layer | Input size | Filter size, Stride, Out channels | Output size
Encoder: Conv2D | | |
Encoder: FC | 18432 | - | 2048
Decoder: DeConv2D | | |

TABLE I: Detailed description of the 2DConv-AE. Conv2D: 2D convolutional layer. DeConv2D: 2D deconvolutional layer. FC: fully-connected layer.

III-B1 2DConv-AE

The 2DConv-AE follows the common practice of 2D auto-encoders, e.g., [16]. Its encoder is composed of two residual blocks as in the ResNet architecture [7], followed by a fully-connected layer. These residual blocks play a central role in feature enrichment. The decoder of the 2DConv-AE includes two residual blocks, designed by stacking six 2D de-convolutional layers. A Leaky ReLU (LReLU) activation function is applied after each convolutional and de-convolutional layer in our design. Table I provides a detailed description of the 2DConv-AE branch, while Fig. 3 visualises its architecture.

Layer | Input size | Filter size, Stride, Out channels | Output size
Conv1D | | [ , 1, 40] |
Maxpooling | | [1, 2, 1] |
Conv1D | | [ , 1, 40] |
Maxpooling | | [1, 10, 1] |
FC | 1280 | - | 640
FC | 640 | - | 1280
DeConv1D | | [ , 1, 40] |
Upsampling | | [1, 2, 1] |
DeConv1D | | [ , 1, 1] |

TABLE II: Detailed description of the 1DConv-AE. Conv1D: 1D convolutional layer. DeConv1D: 1D deconvolutional layer. FC: fully-connected layer.
Fig. 4: Architecture of the 1DConv-AE.

III-B2 1DConv-AE

The 1DConv-AE realises a 1D auto-encoder applied to audio signal sequences. Its encoder is a 1D convolutional neural network; we adopt the network architecture proposed by Tzirakis et al. [25] in our design. Specifically, the encoder includes two 1D convolutional layers, each followed by a max-pooling layer, with two fully-connected layers subsequently attached. The decoder is formed by stacking one 1D de-convolutional layer, followed by an upsampling layer and then another 1D de-convolutional layer. We summarise the architecture of the 1DConv-AE in Table II and Fig. 4.
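Reading Table II forward, and assuming length-preserving ("same") convolutions (the padding scheme is not stated), the encoder's shapes work out as follows:

```python
# Shape walk-through of the 1DConv-AE encoder from Table II.
frame_len, channels = 640, 40   # one 0.04 s audio frame; 40 filters per Conv1D

length = frame_len              # input to the first Conv1D (length preserved)
length //= 2                    # after Maxpooling with stride 2  -> 320
length //= 10                   # after Maxpooling with stride 10 -> 32

flat = length * channels        # flattened feature map: 32 * 40 = 1280
latent = 640                    # FC 1280 -> 640 bottleneck (Table II)
```

This is consistent with the FC sizes listed in Table II (1280 to 640 in the encoder, 640 back to 1280 in the decoder).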

III-B3 LSTM

LSTMs have demonstrated a powerful capability for learning long-range contextual information in sequential patterns [29]. Motivated by this, we adopt a 2-layer LSTM with 512 cells per layer to model the temporal and contextual information in the multimodal features learnt by the 2DConv-AE and 1DConv-AE. Readers are referred to [6] for more detail on the LSTM implementation.

III-C Joint Learning

Our goal is to learn compact and representative features commonly shared by the visual and auditory domains for the prediction of dimensional emotion scores.

For ease of presentation, we first introduce the notation used in our method. We denote the encoder and decoder of the 2DConv-AE as $E_v$ and $D_v$ respectively. Similarly, the encoder and decoder of the 1DConv-AE are denoted as $E_a$ and $D_a$ respectively. Let $f$ denote the fusion layer, which concatenates the features learnt by $E_v$ and $E_a$. The LSTM is represented by $\phi$, receiving input from the fusion layer $f$. Let $I_t$ denote a facial image obtained by applying the SSH face detector to an input image frame at time step $t$, and let $S_t$ denote the corresponding sound segment of the facial image $I_t$. Given an input pair $(I_t, S_t)$, let $a_t$ and $v_t$ be the ground-truth arousal and valence respectively, and $\hat{a}_t$ and $\hat{v}_t$ be the predicted arousal and valence of the input $(I_t, S_t)$. The training procedure can be described as follows.

Given the input pair $(I_t, S_t)$, to learn compact and representative visual and audio features in the individual domains, the auto-encoders 2DConv-AE and 1DConv-AE make use of the 2D and 1D encoders $E_v$ and $E_a$, and the 2D and 1D decoders $D_v$ and $D_a$, resulting in an output pair of reconstructed image frame and speech frame $(\hat{I}_t, \hat{S}_t)$ where

$$\hat{I}_t = D_v(E_v(I_t)), \qquad (1)$$
$$\hat{S}_t = D_a(E_a(S_t)). \qquad (2)$$

The quality of an auto-encoder can be measured via the similarity between an original signal and its reconstructed version after being processed through the auto-encoder. Auto-encoders can thus ensure the representative quality of their encoded representations. In this work, we define the losses of our auto-encoders using the $\ell_2$-norm as

$$\mathcal{L}_{2D} = \frac{1}{N}\sum_{t=1}^{N} \| I_t - \hat{I}_t \|_2^2, \qquad (3)$$
$$\mathcal{L}_{1D} = \frac{1}{N}\sum_{t=1}^{N} \| S_t - \hat{S}_t \|_2^2, \qquad (4)$$

where $\hat{I}_t$ and $\hat{S}_t$ are defined in Eq. (1) and Eq. (2) respectively, and $N$ is the number of samples (i.e., image/speech frames) in the current training batch.
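The per-batch reconstruction loss described above can be sketched in numpy as the mean (over the batch) of the squared $\ell_2$ reconstruction error:

```python
import numpy as np

def recon_loss(originals, reconstructions):
    """Mean squared l2 reconstruction error over a training batch:
    (1/N) * sum_t ||x_t - x_hat_t||^2, with one row per sample."""
    originals = np.asarray(originals, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    n = originals.shape[0]                        # batch size N
    diff = (originals - reconstructions).reshape(n, -1)
    return float(np.sum(diff ** 2) / n)
```

The same function serves both branches: flattened face images for the 2DConv-AE loss and waveform frames for the 1DConv-AE loss.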

To learn multimodal features, the fusion layer $f$ is used to fuse the features extracted from the latent layers of the auto-encoders. Specifically, the multimodal feature vector for the input pair $(I_t, S_t)$ is denoted as $m_t$ and defined as

$$m_t = E_v(I_t) \oplus E_a(S_t), \qquad (5)$$

where $\oplus$ represents the concatenation operator.

The multimodal feature vector $m_t$ is then fed to the LSTM $\phi$, which estimates the arousal value $\hat{a}_t$ and valence value $\hat{v}_t$ of $(I_t, S_t)$ based on its precedent observations, i.e.,

$$(\hat{a}_t, \hat{v}_t) = \phi(m_{t-T}, \dots, m_{t-1}, m_t), \qquad (6)$$

where $T$ is the number of precedent observations used to determine the arousal and valence value of the observation at time step $t$.

To measure the quality of arousal and valence prediction, we adopt the Concordance Correlation Coefficient (CCC) proposed in [15]. CCC has been widely used in measuring the performance of dimensional emotion recognition systems [11]. It validates the agreement between two time series (e.g., predictions and their corresponding ground-truth annotations) by scaling their correlation coefficient with their mean square difference. In this way, predictions that are well correlated with the annotations but shifted in value are penalised proportionally to their deviations. CCC takes values in the range $[-1, 1]$, where $1$ denotes perfect concordance and $-1$ indicates perfect discordance. The higher the value of the CCC, the better the fit between predictions and ground-truth annotations; high values are therefore desired. Applying the CCC, we define the loss for emotion recognition as

$$\mathcal{L}_{er} = 1 - \frac{\rho_a + \rho_v}{2}, \qquad (7)$$

where $\rho_a$ and $\rho_v$ are the CCC of the arousal and valence respectively, calculated on the current training batch. In particular, we define

$$\rho_a = \frac{2\sigma_{\hat{a}a}}{\sigma_{\hat{a}}^2 + \sigma_a^2 + (\mu_{\hat{a}} - \mu_a)^2}, \qquad (8)$$
$$\rho_v = \frac{2\sigma_{\hat{v}v}}{\sigma_{\hat{v}}^2 + \sigma_v^2 + (\mu_{\hat{v}} - \mu_v)^2}, \qquad (9)$$

where, for instance, $\sigma_{\hat{a}a}$ is the covariance of the predictions $\hat{a}$ and ground-truth annotations $a$ of arousal, $\sigma_{\hat{a}}^2$ and $\sigma_a^2$ are the variances of $\hat{a}$ and $a$ respectively, and $\mu_{\hat{a}}$ and $\mu_a$ are the means of $\hat{a}$ and $a$ respectively. Note that these statistics are calculated on the current training batch. A similar interpretation applies to $\rho_v$.
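The CCC between a batch of predictions and annotations is a few lines of numpy:

```python
import numpy as np

def ccc(pred, gold):
    """Concordance Correlation Coefficient (Lin, 1989): scales the
    covariance by the variances plus the squared mean shift, so
    correlated-but-shifted predictions are penalised."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mu_p, mu_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    cov = ((pred - mu_p) * (gold - mu_g)).mean()   # population covariance
    return 2.0 * cov / (var_p + var_g + (mu_p - mu_g) ** 2)
```

Identical series give a CCC of 1; a series and its mirror image (same mean and variance, perfectly anti-correlated) give -1.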

Finally, the loss of the entire network is defined as

$$\mathcal{L} = \lambda_1 \mathcal{L}_{2D} + \lambda_2 \mathcal{L}_{1D} + \lambda_3 \mathcal{L}_{er}, \qquad (10)$$

where $\mathcal{L}_{2D}$, $\mathcal{L}_{1D}$, and $\mathcal{L}_{er}$ are presented in Eq. (3), Eq. (4), and Eq. (7) respectively, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weights used to control the influence of the sub-networks, set empirically.
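Putting the pieces together, the joint objective is a weighted sum (a sketch; the default weights of 1.0 are placeholders, since the paper sets the actual values empirically and they are not given here, and the recognition loss assumes the common form one-minus-mean-CCC):

```python
def emotion_loss(ccc_arousal, ccc_valence):
    """Recognition loss from the two per-batch CCC values:
    1 - (rho_a + rho_v) / 2, so perfect concordance gives 0."""
    return 1.0 - (ccc_arousal + ccc_valence) / 2.0

def joint_loss(l_2d, l_1d, l_er, w_2d=1.0, w_1d=1.0, w_er=1.0):
    """Weighted sum of the two reconstruction losses and the
    recognition loss; the weights are placeholders."""
    return w_2d * l_2d + w_1d * l_1d + w_er * l_er
```

Because all three terms are differentiable, minimising `joint_loss` trains both auto-encoders and the LSTM in a single backward pass, which is what makes the architecture end-to-end trainable.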

As shown in Eq. (10), the individual auto-encoders and the LSTM are jointly trained. This scheme makes the features learnt by the entire network compact, representative, and complementary across domains.

IV Experiments

Fig. 5: Arousal and valence annotations over a part of a video in the RECOLA dataset. Corresponding frames are also illustrated. This figure shows the in-the-wild nature of the emotion data in the RECOLA dataset (different emotional states, rapid emotional changes, occlusions).

IV-A Dataset

We conducted our experiments on the REmote COLlaborative and Affective (RECOLA) dataset introduced by Ringeval et al. [22]. This dataset consists of spontaneous and natural emotions represented by continuous values of arousal and valence. The dataset has four modalities: electro-cardiogram, electro-dermal activity, audio, and visual. There are 46 French-speaking subjects involved in 9.5 hours of recordings in total. The recordings are labelled for every 5 minutes by three male and three female French-speaking annotators. The dataset is balanced among various factors including mother tongue, age, and gender, and includes a training set with 16 subjects and a validation set with 15 subjects. Each subject is associated with one recording (including visual and audio signals), and each recording consists of 7,500 frames for each visual and audio channel. Fig. 5 illustrates an example from the RECOLA dataset.

IV-B Implementation Details

The input to our emotion recognition system is a multimodal data stream comprising an image channel and a speech channel. The data stream of each channel is segmented into frames, which are then passed to the proposed network architecture for processing.

Given a pair comprising an image frame and its corresponding speech frame at a time step $t$, the image frame is passed to the visual network branch (2DConv-AE) to extract 2,048 visual features via its encoder. Similarly, the speech frame is forwarded to the speech network branch (1DConv-AE) to extract 1,280 auditory features via its encoder. These output features are concatenated to form a 3,328-dimensional multimodal representation as defined in Eq. (5). This representation is fed to the 2-layer LSTM (with 512 cells per layer) to extract long-range contextual information from the data stream. The output of the LSTM is finally attached to a fully-connected layer to predict the arousal and valence for the input data at time step $t$.

To predict the arousal and valence for an input pair, the LSTM takes into account the last four time steps, i.e., the number of precedent observations in Eq. (6) is set to 4.

Our proposed architecture was trained end-to-end by optimising the joint loss defined in Eq. (10) on the training set of the RECOLA database, with the loss weights in Eq. (10) set empirically. The Adam optimiser (with default values) was adopted, the mini-batch size was set to 32, the learning rate to 0.0001, and the number of training steps to 50,000. All experiments in this paper were implemented in TensorFlow and conducted on 10 computing nodes with 3,780 64-bit Intel Xeon cores in total. Our proposed model was trained within 45 hours and required 10,485,760 KB of memory. Fig. 6 shows the learning curve of our model.

Fig. 6: Learning curve of our model.

IV-C Evaluation Protocol

As commonly used in the evaluation of dimensional emotion recognition [11], we measure the Root Mean Square Error (RMSE) of the predicted dimensional emotion scores against the ground-truth values as follows,

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left[(\hat{a}_t - a_t)^2 + (\hat{v}_t - v_t)^2\right]}, \qquad (11)$$

where $\hat{a}_t$ and $a_t$ are the predicted arousal and its ground-truth value respectively, $\hat{v}_t$ and $v_t$ are the predicted valence and its ground-truth value respectively, and $n$ is the total number of frames in an input sequence.

To further investigate the prediction performance in detail, we also calculate the RMSE on each individual emotion dimension as,

$$\text{RMSE}_a = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(\hat{a}_t - a_t)^2}, \qquad (12)$$
$$\text{RMSE}_v = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(\hat{v}_t - v_t)^2}. \qquad (13)$$

The RMSE gives a rough indication of how the derived emotion model is behaving, providing a simple comparative evaluation metric. Small values of RMSE are desired.
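Both metrics are straightforward in numpy. Note that the overall RMSE combines the per-dimension errors as the square root of the sum of the two mean squared errors, which is consistent with the numbers reported in the results tables (e.g., sqrt(0.474^2 + 0.187^2) ≈ 0.510):

```python
import numpy as np

def rmse(pred, gold):
    """Per-dimension RMSE over a sequence of n frames."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return float(np.sqrt(np.mean((pred - gold) ** 2)))

def rmse_joint(pred_a, gold_a, pred_v, gold_v):
    """Overall RMSE over both arousal and valence:
    sqrt(MSE_arousal + MSE_valence)."""
    mse_a = np.mean((np.asarray(pred_a, float) - np.asarray(gold_a, float)) ** 2)
    mse_v = np.mean((np.asarray(pred_v, float) - np.asarray(gold_v, float)) ** 2)
    return float(np.sqrt(mse_a + mse_v))
```

Equivalently, the overall score equals sqrt(RMSE_a^2 + RMSE_v^2), so it is dominated by the weaker of the two dimensions.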

IV-D Evaluations and Comparisons

Since there are several technical contributions proposed in this work, e.g., auto-encoders for learning multimodal features, LSTM for sequential learning of multimodal data streams, etc., we evaluate each of these contributions in our experiments. In addition, for each dimension in the evaluations, we also compare our method with existing related works.

IV-D1 Multimodal vs Unimodal

We first evaluate emotion recognition using multimodal and unimodal approaches. Our proposed multimodal architecture combines two unimodal branches, each of which focuses on a single domain (visual or audio). To make a comprehensive comparison between multimodal and unimodal approaches, we respectively disabled one of the two branches in our architecture to obtain unimodal architectures. For instance, the unimodal architecture for the visual part is denoted as "2DConv-AE-LSTM" and obtained by removing the speech network branch (see Fig. 2) while keeping the LSTM, which receives input from only the latent layer of the 2DConv-AE. The same procedure was applied to the audio part to create "1DConv-AE-LSTM."

Table III compares unimodal and multimodal approaches. As shown in the table, there is an inconsistency in the performance of the unimodal architectures. Specifically, the audio architecture (1DConv-AE-LSTM) outperforms the visual one (2DConv-AE-LSTM) in prediction of arousal while the visual architecture shows better performance than its counterpart in prediction of valence. Compared with these sub-models, our multimodal architecture (“2D1DConv-AE-LSTM”) combining both the unimodal architectures shows superior performance on prediction of both emotion dimensions.

We also compared our multimodal architecture with two other unimodal architectures proposed in [25]. Specifically, the authors in [25] investigated the combination of CNNs and LSTM on the visual and audio domains separately. As shown in Table III, our multimodal architecture outperforms all other unimodal ones on both arousal and valence prediction.

Model | RMSE (arousal) | RMSE (valence) | RMSE (overall)
2DConv-AE-LSTM | 0.538 | 0.214 | 0.579
1DConv-AE-LSTM | 0.493 | 0.237 | 0.547
2D model in [25] | 0.476 | 0.192 | 0.514
1D model in [25] | 0.517 | 0.251 | 0.574
Our 2D1DConv-AE-LSTM | 0.474 | 0.187 | 0.510
TABLE III: Comparison of multimodal and unimodal architectures. Best performances are highlighted.

IV-D2 Feature Learning with Auto-Encoders

We observed that the auto-encoders (2DConv-AE and 1DConv-AE) also boosted the prediction performance of our architecture. In particular, we compared our architecture with the one proposed in [25], where only CNNs were used to learn the multimodal features. Note that the same LSTM was used in both architectures.

We report the prediction performance of our model and the one in [25] in Table IV. As shown in our experimental results, compared with the network architecture proposed in [25], our model is slightly inferior in the prediction of valence but significantly better in the prediction of arousal, leading to better overall performance.

To further investigate the impact of the auto-encoders on individual domains, we re-designed the unimodal branches by disabling the decoders in those branches and measured their performance. We observed that the 1DConv-AE improved the accuracy of the audio branch on both arousal and valence prediction. Specifically, the improvement was 4.6% on arousal RMSE (from 0.517 to 0.493) and 5.6% on valence RMSE (from 0.251 to 0.237), leading to an overall improvement of 4.7% (from 0.574 to 0.547). In contrast, the 2DConv-AE incurred a decrease of 11.5% on arousal RMSE (from 0.476 to 0.538), 10.3% on valence RMSE (from 0.192 to 0.214), and 11.2% overall (from 0.514 to 0.579). However, as shown in our experimental results, the combination of the 2DConv-AE and 1DConv-AE compensated for the weaknesses of the individual components and achieved improved overall performance.

Model | RMSE (arousal) | RMSE (valence) | RMSE (overall)
2D1DConv-LSTM [25] | 0.488 | 0.184 | 0.522
Our 2D1DConv-AE-LSTM | 0.474 | 0.187 | 0.510
TABLE IV: Evaluation of auto-encoders. Best performances are highlighted.

IV-D3 Sequential Learning with LSTM

We propose the use of an LSTM for learning long-range contextual and temporal information from streaming data. LSTMs have also been widely used in dimensional emotion recognition from data streams. For instance, in [13], Kollias and Zafeiriou proposed to use two LSTMs, each handling one unimodal data stream (i.e., the visual or audio stream). We denote this approach as "2D1DConv-2SLSTM." Unlike [13], our architecture uses only one LSTM, receiving input from a fusion layer and producing predicted values for arousal and valence. Note that auto-encoders were not employed in [13]. Therefore, to better study the effect of one LSTM vs a two-stream LSTM, we applied the two-stream LSTM of [13] to our architecture, creating a variant called "2D1DConv-AE-2SLSTM."

Table V compares different approaches using LSTM for sequential learning in dimensional emotion recognition. As shown in the table, our architecture achieves the best performance among all models on both emotion dimensions.

Model | RMSE (arousal) | RMSE (valence) | RMSE (overall)
2D1DConv-2SLSTM [13] | 0.493 | 0.187 | 0.527
2D1DConv-AE-2SLSTM | 0.508 | 0.190 | 0.542
Our 2D1DConv-AE-LSTM | 0.474 | 0.187 | 0.510
TABLE V: Evaluation of LSTM. Best performances are highlighted.

IV-D4 Ablation Study

In this experiment, we study the effect of hidden nodes in the LSTM. Specifically, we investigate the prediction performance of our architecture with regard to various numbers of hidden nodes in each layer in the LSTM.

We report these results in Table VI. In general, there is a fluctuation in the prediction performance when varying the number of hidden nodes in the LSTM. Although the best configuration for prediction of valence is the LSTM with 256 nodes in each hidden layer, our current setting with 512 nodes in each hidden layer shows better performance in prediction of arousal and also achieves the best overall performance.

#Hidden nodes | RMSE (arousal) | RMSE (valence) | RMSE (overall)
32 | 0.509 | 0.190 | 0.543
64 | 0.540 | 0.195 | 0.574
128 | 0.502 | 0.205 | 0.542
256 | 0.496 | 0.184 | 0.529
512 (our architecture) | 0.474 | 0.187 | 0.510
TABLE VI: Prediction performance of our architecture when varying the number of hidden nodes used in each layer of the LSTM. Best performances are highlighted.

V Conclusion

This paper proposes a deep network architecture for end-to-end dimensional emotion recognition from multimodal data streams. The proposed architecture incorporates auto-encoders for learning multimodal features from the visual and audio domains, and an LSTM for learning contextual and temporal information from streaming data. Our architecture enables learning compact, representative, and complementary features from multimodal data sources. We implemented various baseline models and conducted extensive experiments on the benchmark RECOLA dataset. Experimental results confirmed the contributions of our work and the superiority of our proposed architecture over the state-of-the-art.


References

  • [1] A. G. G. Bahmani, M. Baktashmotlagh, S. Denman, S. Sridharan, D. N. Tien, and C. Fookes (2017) Deep discovery of facial motions using a shallow embedding layer. In IEEE International Conference on Image Processing (ICIP), pp. 1567–1571.
  • [2] P. Barros, C. Weber, and S. Wermter (2015) Emotional expression recognition with a cross-channel convolutional neural network for human-robot interaction. In IEEE-RAS International Conference on Humanoid Robots, pp. 582–587.
  • [3] P. Barros and S. Wermter (2016) Developing crossmodal expression recognition based on a deep neural model. Adaptive Behavior 24 (5), pp. 373–396.
  • [4] S. Chen and Q. Jin (2015) Multi-modal dimensional emotion recognition using recurrent neural networks. In International Workshop on Audio/Visual Emotion Challenge, pp. 49–56.
  • [5] D. Kollias and S. Zafeiriou (2018) Aff-wild2: extending the aff-wild database for affect recognition. arXiv preprint arXiv:1811.07770.
  • [6] A. Graves (2012) Supervised sequence labelling with recurrent neural networks. Studies in Computational Intelligence, Vol. 385, Springer.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE International Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [9] N. Kalchbrenner, I. Danihelka, and A. Graves (2015) Grid long short-term memory. CoRR abs/1507.01526.
  • [10] D. Kollias, M. A. Nicolaou, I. Kotsia, G. Zhao, and S. Zafeiriou (2017) Recognition of affect in the wild using deep neural networks. In IEEE International Computer Vision and Pattern Recognition Workshops, pp. 1972–1979.
  • [11] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou (2019) Deep affect prediction in-the-wild: aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision 127 (6), pp. 907–929.
  • [12] D. Kollias and S. Zafeiriou (2018) A multi-task learning & generation framework: valence-arousal, action units & primary expressions. arXiv preprint arXiv:1811.07771.
  • [13] D. Kollias and S. Zafeiriou (2019) Exploiting multi-cnn features in cnn-rnn based dimensional emotion recognition on the omg in-the-wild dataset. arXiv preprint arXiv:1910.01417.
  • [14] P. P. Liang, A. Zadeh, and L. Morency (2018) Multimodal local-global ranking fusion for emotion recognition. In ACM International Conference on Multimodal Interaction, pp. 472–476.
  • [15] L. I. Lin (1989) A concordance correlation coefficient to evaluate reproducibility. Biometrics 45 (1), pp. 255–268.
  • [16] X. Mao, C. Shen, and Y. Yang (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pp. 2802–2810.
  • [17] M. Najibi, P. Samangouei, R. Chellappa, and L. Davis (2017) SSH: single stage headless face detector. In IEEE International Conference on Computer Vision, pp. 4875–4884.
  • [18] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In International Conference on Machine Learning, pp. 689–696.
  • [19] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, and C. Fookes (2017) Deep spatio-temporal features for multimodal emotion recognition. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1215–1223.
  • [20] D. Nguyen, K. Nguyen, S. Sridharan, D. Dean, and C. Fookes (2018) Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Computer Vision and Image Understanding 174, pp. 33–42.
  • [21] E. Pei, L. Yang, D. Jiang, and H. Sahli (2015) Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks. In IEEE International Conference on Affective Computing and Intelligent Interaction, pp. 208–214.
  • [22] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne (2013) Introducing the recola multimodal corpus of remote collaborative and affective interactions. In IEEE International Conference on Automatic Face and Gesture Recognition, pp. 1–8.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision, pp. 4489–4497.
  • [25] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou (2017) End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1301–1309.
  • [26] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll (2013) LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing 31 (2), pp. 153–163.
  • [27] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia (2017) Aff-wild: valence and arousal 'in-the-wild' challenge. In IEEE International Computer Vision and Pattern Recognition Workshops, pp. 1980–1987.
  • [28] S. Zhang, S. Zhang, T. Huang, W. Gao, and Q. Tian (2018) Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology 8 (10), pp. 3030–3043.
  • [29] Z. Zhang, F. Ringeval, J. Han, J. Deng, E. Marchi, and B. Schuller (2016) Facing realism in spontaneous emotion recognition from speech: feature enhancement by autoencoder with lstm neural networks. In Interspeech, pp. 3593–3597.
  • [30] W. Zheng, W. Liu, Y. Lu, B. Lu, and A. Cichocki (2019) EmotionMeter: a multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics 49 (3), pp. 1110–1122.