Temporal aggregation of audio-visual modalities for emotion recognition

Emotion recognition plays a pivotal role in affective computing and in human-computer interaction. Current technological developments have increased the possibilities of collecting data about the emotional state of a person. In general, human perception of the emotion transmitted by a subject is based on vocal and visual information collected during the first seconds of interaction with the subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information is the preferred choice in most current approaches to emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition that combines audio-visual modalities from a temporal window, with different temporal offsets for each modality. We show that our proposed method outperforms other methods from the literature as well as the human accuracy rating. The experiments are conducted on the open-access multimodal dataset CREMA-D.




I Introduction

Automatic detection of human emotions has become an important area of research due to the technological developments in the human-computer interaction domain (e.g., social robots [5, 9], monitoring systems for car drivers’ condition [12]). In order to increase the accuracy of emotion recognition systems, most currently developed methods incorporate multimodal information (e.g., facial and speech features) [14, 18, 3, 7]. Facial expressions represent one of the most important modes of communication through which people express their emotions and intentions. In addition to facial expressions, people also express their feelings through speech; e.g., speech inflection and vocal intensity are characteristics that carry information about the emotional state of a subject.

Each person is unique and can express emotions in their own characteristic way, depending on their culture, age, gender or previous life experiences [17]. Nevertheless, there are common characteristics that can be exploited in order to obtain an accurate classification system. In general, most recognition systems consider only 6 types of emotions (e.g., anger, happiness, surprise, disgust, contempt, anxiety) [6]. According to the Facial Action Coding System (FACS), each human emotion can be described through a combination of several Facial Action Units (FAUs) [16]. More precisely, FACS refers to a combined set of facial muscle movements that correspond to a displayed emotion. The basic element in this coding system is the Action Unit (AU), and each AU is related to the contraction of one or more facial muscles.
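As an illustration of the AU-based description, prototype AU combinations for some basic emotions can be expressed as a simple lookup. The exact mappings vary across the FACS literature, so the entries below are an illustrative assumption, not an authoritative coding:

```python
# Illustrative (not authoritative) emotion -> Action Unit prototypes,
# loosely following combinations commonly cited in the FACS literature.
EMOTION_TO_AUS = {
    "happiness": [6, 12],        # cheek raiser + lip corner puller
    "sadness":   [1, 4, 15],     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  [1, 2, 5, 26],  # brow raisers + upper lid raiser + jaw drop
    "anger":     [4, 5, 7, 23],  # brow lowerer + lid tighteners + lip tightener
}

def describe(emotion: str) -> str:
    """Return a readable AU description for a given emotion label."""
    aus = EMOTION_TO_AUS.get(emotion.lower())
    if aus is None:
        return f"no AU prototype stored for '{emotion}'"
    return "AU" + "+AU".join(str(a) for a in aus)
```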

Emotion recognition systems using only visual information (i.e., video frames) can be mainly classified into static and dynamic methods, depending on the feature representations. In static-based methods, the features are encoded with spatial information from individual frames, without taking the temporal extent into consideration, whilst dynamic-based methods consider the temporal relation between consecutive frames of the input sequence. In the case of static-based methods, state-of-the-art deep neural network architectures (e.g., VGG [15], ResNet [10]) have been proposed for feature extraction, whilst the classification into emotion categories is performed using a Support Vector Machine (SVM) module [1].


Due to the increased interest in developing real-world scenario datasets, as well as the increased computer processing capabilities, recent approaches are based on deep learning techniques that are able to extract both facial and audio discriminant information. In a recent paper [14], we have shown that not very deep convolutional neural network (CNN) architectures are able to extract meaningful information regarding emotion categories from both video frames and spectrograms of audio signals. By combining the audio and video information, we achieved an increase of almost 7% compared to the case when only video data is considered.
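As a sketch of the input to such an audio branch, a magnitude spectrogram can be computed from a raw waveform with a short-time Fourier transform. The window length, hop size, and sampling rate below are illustrative assumptions, not the values used in [14]:

```python
import numpy as np

def spectrogram(signal: np.ndarray, win: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    # rfft keeps the non-redundant half of the spectrum: win // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1.28 s signal at an assumed 16 kHz sampling rate
spec = spectrogram(np.random.default_rng(0).standard_normal(int(1.28 * 16000)))
```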

Considering the behavioral differences between people and the diverse modes of communicating their feelings, methods addressing person-specific affective understanding have been also developed. In [2], Grow-When-Required Networks and personalized affective memories are used to learn individualized aspects of emotions. However, the complexity of the proposed model limits the real-time usage of the proposed solution.

In order to include temporal dynamic characteristics between video frames, Beard et al. proposed a recurrent multi-attention (RMA) mechanism with shared external memory that is updated over multiple iterations of analysis [3]. This approach allows relevant memories to persist over multiple hops. The method achieved a maximum accuracy of 65% on the CREMA-D dataset, comparable to the human rating accuracy reported for this dataset (i.e., 63.6% [4]).

In an attempt to exploit the complementary information brought by diverse modalities (i.e., audio and video), a Multimodal Emotion Recognition Metric Learning (MERML) scheme was defined in [8]. The learned metric was further used by an SVM with a Radial Basis Function (RBF) kernel.

In this paper, we propose a novel multimodal architecture that combines visual and audio features extracted from random selections of analysis windows within individual temporal segments of the input video. Thus, the temporal aggregation of audio and video allows for asynchronous inputs of the two considered modalities. We tested our solution on the CREMA-D [4], a widely used audio-visual dataset in the multimodal emotion recognition field.

The rest of the paper is organized as follows. Section II introduces the multimodal model architecture. Section III describes the dataset used for experiments, whereas section IV presents the experimental results. Finally, section V concludes the paper.

II Proposed Approach

Inspired by the solution for action recognition presented in [13] and by the way the human brain processes audio-visual information, we propose a temporal aggregation mechanism that combines modalities within a range of temporal offsets. In this mechanism, we explore the fusion between audio and visual inputs within a temporal window, which allows the model to be trained with asynchronous inputs from both modalities. Our proposed temporal aggregation mechanism is shown in Fig. 1. In the following, we distinguish between the sampling rate of the video sequence and that of the audio signal, which are, in general, different.

The input video sequence is divided into temporal segments of equal length. For each temporal segment, we randomly select a video frame and we randomly choose the center of the audio analysis window within a bounded temporal offset around the selected frame. The audio signal used in the analysis of the current temporal segment is then a fixed-length window around this center. Further, the video frame and the spectrogram of the audio signal are fed into an audio-visual network corresponding to the current temporal segment. It is worth noting that one independent audio-visual network per temporal segment is needed to build the entire emotion recognition architecture.
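The per-segment sampling can be sketched as follows. The audio window length (1.28 s) and offset bound (0.01 s) are taken from the experimental section; the segment count, frame rate, and variable names are our own illustrative choices:

```python
import numpy as np

def sample_segment_inputs(n_video_frames: int, fps: float, n_segments: int,
                          audio_len: float = 1.28, max_offset: float = 0.01,
                          rng=np.random.default_rng(0)):
    """For each equal-length temporal segment, pick a random video frame and
    an audio window whose center is jittered by at most max_offset seconds."""
    seg_len = n_video_frames // n_segments
    picks = []
    for s in range(n_segments):
        frame = rng.integers(s * seg_len, (s + 1) * seg_len)  # random frame in segment
        center = frame / fps + rng.uniform(-max_offset, max_offset)
        picks.append((int(frame), center - audio_len / 2, center + audio_len / 2))
    return picks

# e.g., a 3 s clip at 30 fps, aggregated over 10 temporal segments
picks = sample_segment_inputs(n_video_frames=90, fps=30.0, n_segments=10)
```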

Fig. 1:

Our proposed temporal aggregation mechanism. FC label is a 6-dimensional vector of class probabilities obtained after passing the audio-visual features through the Fully Connected (FC) layer.

Following a similar approach to the method recently proposed by the authors in [14], the core audio-visual network is composed of a sequence of convolutional blocks, which extract the audio and visual features. After the concatenation of the audio and video feature vectors, the resulting feature vector is fed into a Fully Connected (FC) layer, followed by a SoftMax activation function which yields the class probabilities for the considered emotion categories. The solution proposed in [14] achieved approximately the human rating performance accuracy at low computational costs. However, in order to accelerate the training process and to increase the stability of the core audio-visual network, we inserted a Batch Normalization layer after each convolutional layer [11]. The core audio-visual network architecture, which processes asynchronous multimodal information, is shown in Fig. 2.
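At the level of feature vectors, the fusion head (concatenation, FC layer, SoftMax) can be sketched as below. The feature dimensions and random weights are placeholders, and the convolutional extractors that would produce the two feature vectors are out of scope here:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
audio_feat = rng.standard_normal(128)   # placeholder output of the audio conv blocks
video_feat = rng.standard_normal(256)   # placeholder output of the video conv blocks

fused = np.concatenate([audio_feat, video_feat])        # early fusion by concatenation
W = rng.standard_normal((6, fused.size)) * 0.01         # FC weights for 6 emotion classes
b = np.zeros(6)
probs = softmax(W @ fused + b)                          # per-segment class probabilities
```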

After aggregating the emotion class probabilities over all the temporal segments composing the video, the class label assigned to the entire video is the one achieving the maximum score. As shown in Fig. 1, the final score is obtained by summing up the emotion class probabilities over all the temporal segments.
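This sum-then-argmax aggregation is straightforward; a minimal sketch over per-segment probability vectors (the values are illustrative):

```python
import numpy as np

# Per-segment class probabilities: one row per temporal segment,
# one column per emotion category (illustrative values).
segment_probs = np.array([
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.50, 0.10, 0.10, 0.05, 0.05],
    [0.15, 0.40, 0.20, 0.15, 0.05, 0.05],
])

scores = segment_probs.sum(axis=0)      # aggregate score per emotion class
predicted_class = int(scores.argmax())  # video-level label
```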

Fig. 2: Our proposed audio-visual network used to retrieve emotion class probabilities for each temporal segment of the analysed video.

We mention that the frames of the original videos are pre-processed using the MTCNN algorithm [19], whose aim is to perform face detection and to remove the information that is unnecessary (i.e., the background) with respect to the emotion recognition task.
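Given a detected bounding box (the detector itself is treated here as an assumed upstream step; MTCNN produces such boxes), the background removal reduces to slicing the frame. The helper below is our own illustration, not part of [19]:

```python
import numpy as np

def crop_face(frame: np.ndarray, box) -> np.ndarray:
    """Crop a frame to a face bounding box (x1, y1, x2, y2), clipped to the image."""
    x1, y1, x2, y2 = box
    h, w = frame.shape[:2]
    x1, x2 = max(0, int(x1)), min(w, int(x2))
    y1, y2 = max(0, int(y1)), min(h, int(y2))
    return frame[y1:y2, x1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder video frame
face = crop_face(frame, (200, 100, 400, 350))     # e.g., a box from a face detector
```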

III Database

Over the past years, several databases for the emotion recognition task have been proposed, and research in the affective computing domain has focused on mixing different sources of information to achieve better performance. The CREMA-D multimodal database was published in 2014 [4] and contains 7442 clips of 91 actors (48 male and 43 female) with different ethnic backgrounds. The actors were asked to convey particular emotions while producing, with different intonations, 12 particular sentences that evoke the target emotions. Six labels have been used to discriminate among different emotion categories (i.e., neutral, happy, anger, disgust, fear, sad), with four different intensity levels (i.e., low, medium, high, unspecified). The labels corresponding to each recording were collected using crowd-sourcing. More precisely, 2443 participants were asked to label the perceived emotion and its intensity. The human accuracy achieved for this task was, on average, 63.6% [4]. It is worth mentioning that the participants relied only on their previous life experience, having received no dedicated training.

IV Experiments

(a) Loss
(b) Accuracy
Fig. 3: Performance over the training and validation sets.
Fig. 4: Confusion matrix for best performance.
Fig. 5: Variation of accuracy values with respect to the number of segments considered in the temporal aggregation.

In this section, we provide the results achieved on the CREMA-D dataset, as well as the experimental setup used in our approach. For all our experiments, we use a user-independent 10-fold cross-validation technique to split the dataset into training and validation subsets. We divide the dataset into 10 different folds (i.e., none of the actors appears in more than one fold, for generalization reasons) and compute the mean accuracy over all the folds.
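A user-independent split assigns actors, not clips, to folds, so that no actor contributes to both training and validation. A minimal sketch (the helper name and the use of integer actor IDs are our own assumptions):

```python
import numpy as np

def actor_folds(actor_ids, n_folds=10, rng=np.random.default_rng(0)):
    """Partition unique actor IDs into disjoint folds so that no actor
    contributes clips to more than one fold."""
    actors = np.array(sorted(set(actor_ids)))
    rng.shuffle(actors)
    return [set(fold.tolist()) for fold in np.array_split(actors, n_folds)]

folds = actor_folds(range(91))  # CREMA-D has 91 actors
```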

Various evaluation metrics are used to assess the performance of the proposed solution for emotion recognition, namely the mean overall accuracy, the loss values and the confusion matrix. These performance measures are shown in Fig. 3 and Fig. 4 and are achieved for a temporal aggregation of 10 segments extracted from the video. As shown in Fig. 5, the overall accuracy increases with the number of segments considered. However, the time required for training a model and the inference time increase almost linearly with the number of segments, i.e., from 3 hours for a model with 3 segments to approximately 5.5 hours for a model with 10 segments. Moreover, the accuracy does not increase substantially beyond 8 segments.

Furthermore, the overall accuracy is compared with the performances achieved by other methods from the literature (e.g., CNN-based approach [14], RMA [3]). It is worth mentioning that the experiments using the approach proposed in [14] followed the same 10-fold cross validation technique.

The input frames were cropped to a fixed size around the detected faces [19], whereas the spectrograms were resized to a fixed resolution. The length of the audio signal was set to 1.28 seconds, whereas the offset was set to 0.01 seconds. In order to train the models, we used the cross-entropy loss and the stochastic gradient descent optimization method with a momentum of 0.9. The learning rate was initially set to 1e-3 and decayed by a factor of 10 every 50 training steps. The batch size was set to 16.
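The step decay described above (start at 1e-3, divide by 10 every 50 steps) can be written as a one-liner; the function name is ours:

```python
def step_decay_lr(step: int, base_lr: float = 1e-3,
                  drop: float = 10.0, every: int = 50) -> float:
    """Learning rate after `step` training steps under the schedule
    'divide by `drop` every `every` steps'."""
    return base_lr / (drop ** (step // every))
```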

Our solution outperformed the baseline approach proposed in [14] by 12.6%, the human accuracy rating by 4.8%, and also methods based on recurrent multi-attention [3] (see Table I). The proposed method of early feature combination and temporal aggregation of partial results leads to better performance, which suggests that combining information coming from different sources in an asynchronous manner is beneficial for the process of emotion understanding.

We mention that all the experiments were conducted on an Intel Xeon E5-1680v3 machine (8 cores @ 3.2 GHz), equipped with an NVIDIA Quadro M4000 GPU with 8 GB of memory.

Method Accuracy [%]
Human accuracy [4] 63.6
CNN-based approach [14] 55.8
RMA [3] 65.0
MERML [8] 66.5
Proposed method 68.4
TABLE I: Average accuracy rate on CREMA-D

V Conclusion

In this paper, we proposed a new method of incorporating multimodal information to discriminate among categories of emotions. The methodology benefits from combining the audio and visual information in an asynchronous manner, which allows a certain degree of flexibility between the analyses of the two modalities. Using a simple audio-visual neural network as the core architecture for predicting emotional states, the temporal aggregation of the multimodal information from various segments leads to a substantial increase in the accuracy of the recognition system and outperforms other approaches from the literature.


  • [1] S. Bargal, E. Barsoum, C. Ferrer, and C. Zhang (2016) Emotion recognition in the wild from videos using images. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016), pp. 433–436.
  • [2] P. V. A. Barros, G. I. Parisi, and S. Wermter (2019) A personalized affective memory neural model for improving emotion recognition. arXiv preprint arXiv:1904.12632.
  • [3] R. Beard, R. Das, R. W. M. Ng, P. G. K. Gopalakrishnan, L. Eerens, P. Swietojanski, and O. Miksik (2018) Multi-modal sequence fusion via recursive attention for emotion recognition. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 251–259.
  • [4] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma (2014) CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing 5 (4), pp. 377–390.
  • [5] F. Cavallo, F. Semeraro, L. Fiorini, G. Magyar, P. Sinčák, and P. Dario (2018) Emotion modelling for social robotics applications: a review. Journal of Bionic Engineering 15 (2), pp. 185–203.
  • [6] P. Ekman (1992) An argument for basic emotions. Cognition & Emotion 6 (3-4), pp. 169–200.
  • [7] E. Ghaleb, M. Popa, and S. Asteriadis (2019) Multimodal and temporal perception of audio-visual cues for emotion recognition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 552–558.
  • [8] E. Ghaleb, M. Popa, and S. Asteriadis (2019) Metric learning-based multimodal audio-visual emotion recognition. IEEE MultiMedia, pp. 1–1.
  • [9] L. Grama and C. Rusu (2019) Extending assisted audio capabilities of TIAGo service robot. In 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1–8.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [11] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • [12] C. D. Katsis, N. Katertsidis, G. Ganiatsas, and D. I. Fotiadis (2008) Toward emotion recognition in car-racing drivers: a biosignal processing approach. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 38 (3), pp. 502–512.
  • [13] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen (2019) EPIC-Fusion: audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5492–5501.
  • [14] N.-C. Ristea, L. C. Duţu, and A. Radoi (2019) Emotion recognition system from speech and visual information based on convolutional neural networks. In 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1–6.
  • [15] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [16] Y.-I. Tian, T. Kanade, and J. F. Cohn (2001) Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2), pp. 97–115.
  • [17] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou (2017) End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1301–1309.
  • [18] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic (2013) AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge (AVEC ’13), pp. 3–10.
  • [19] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503.