Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition

12/27/2020 · by Hengshun Zhou, et al.

Audio-video based emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features with both speech spectrograms and log Mel-spectrograms, and evaluate several facial features with different CNN models and different emotion pretraining strategies. For fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing attention mechanisms to highlight important emotion features, and exploring feature concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set.




1. Introduction

Emotion recognition (ER) has attracted increasing attention in academia and industry due to its wide range of applications such as human-computer interaction (Dix, 2009), clinical diagnosis (Mitchell et al., 2009), and cognitive science (Johnson-Laird, 1980). Although great progress has been made in face and video analysis (Deng et al., 2018; Wen et al., 2016; Tan et al., 2017; Wang et al., 2018; Wang et al., 2019; Yang et al., 2018), audio-video emotion recognition in the wild remains a challenging problem because expressions suffer from large pose variation, illumination changes, occlusion, motion blur, etc.

Audio-video emotion recognition can be summarized as the simple pipeline shown in Fig. 1, which includes four parts: video preprocessing, feature extraction, feature fusion, and classification. Specifically, video preprocessing refers to extracting the spectrogram of the audio and the faces or landmarks from the video. Feature extraction and feature fusion respectively extract emotion features from the audio or visual signal and fuse them into compact feature vectors, which are subsequently fed into a classifier for prediction.

Reviewing the methods of audio-video emotion recognition, we find that some methods emphasize feature extraction while others emphasize feature fusion. Yao et al. (Yao et al., 2016) construct HoloNet for discriminative feature extraction, combining a residual structure (He et al., 2016) and CReLU (Shang et al., 2016) to increase network depth while maintaining efficiency. The EmotiW 2017 winning team (Hu et al., 2017) achieves robust feature extraction with a Supervised Scoring Ensemble (SSE), which adds supervision to intermediate and shallow layers. Since SSE only uses high-level representations, Fan et al. (Fan et al., 2018) further improve it by utilizing middle feature maps to provide more discriminative features. These methods mainly use average pooling to obtain video-level representations from frame-level ones.

Many feature fusion strategies have been used in previous EmotiW challenges. (Fan et al., 2016; Vielzeuf et al., 2017; Lu et al., 2018) extract CNN-based frame features and use an LSTM (Gers et al., 1999) or BLSTM (Graves and Schmidhuber, 2005) to fuse them. (Bargal et al., 2016; Knyazev et al., 2018; Liu et al., 2018) use a statistical encoding module that aggregates frame features by computing the mean, variance, minimum, and maximum of the frame feature vectors. However, these methods ignore the relative importance of frames. Besides, all previous methods mainly apply score averaging or feature concatenation for audio-video fusion, which ignores the correlation between features from different modalities.

In this paper, we exploit three types of intra-modal fusion methods, namely self-attention, relation-attention, and transformer attention (Vaswani et al., 2017). They are used to learn weights for frame features so as to highlight important frames. For cross-modal fusion, we explore feature concatenation and factorized bilinear pooling (FBP) (Zhang et al., 2019). Besides, we evaluate different emotion features, including convolutional neural network (CNN) features for audio with both speech spectrograms and log Mel-spectrograms, and several facial features with different CNN models and different emotion pretraining strategies. Finally, we obtain 62.48% on the AFEW test set and rank second in the challenge.

Our contributions and findings can be summarized as follows.

  • We experimentally show that better face recognition CNN models and suitable emotion datasets for further pretraining the face CNN models are important.

  • We design three kinds of attention mechanisms for visual and audio feature fusion.

  • We apply factorized bilinear pooling (FBP) for cross-modal feature fusion.

2. The proposed method

We develop our ER system following the pipeline of video preprocessing, feature extraction, feature fusion, and classification.

2.1. Video preprocessing

2.1.1. Face detection and alignment.

We apply face detection and alignment with the Dlib toolbox (http://dlib.net/). We extend the face bounding box with a ratio of 30% and then resize the cropped faces to a fixed scale. We do not apply face detection and alignment to the AffectNet dataset, since face bounding boxes are already provided. For the AFEW dataset, if no face is detected in a frame, the entire frame is passed to the network.
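As a sketch of the cropping step above, the 30% box extension can be implemented as follows. The output resolution (224 here) and the nearest-neighbour resize are assumptions, since the paper does not state the exact crop size or interpolation method.

```python
import numpy as np

def extend_and_crop(frame, box, ratio=0.30, out_size=224):
    """Extend a face bounding box by `ratio`, crop, and resize.

    `frame` is an H x W x 3 image array; `box` is (left, top, right, bottom).
    The output resolution is a placeholder -- the paper does not state it.
    """
    h, w = frame.shape[:2]
    left, top, right, bottom = box
    dw = (right - left) * ratio / 2.0   # extend width by 30% in total
    dh = (bottom - top) * ratio / 2.0   # extend height by 30% in total
    # Clamp the extended box to the frame boundaries.
    l = max(int(left - dw), 0)
    t = max(int(top - dh), 0)
    r = min(int(right + dw), w)
    b = min(int(bottom + dh), h)
    crop = frame[t:b, l:r]
    # Nearest-neighbour resize (a real pipeline would use bilinear
    # interpolation from OpenCV or PIL).
    ys = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[ys][:, xs]
```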

2.1.2. Audio processing and Spectrogram calculation.

For each audio clip, the speech spectrogram and log Mel-spectrogram extraction processes follow (Zhang et al., 2019) and (Chen et al., 2018), respectively. For the speech spectrogram, we use a Hamming window with a 40 ms window size and a 10 ms shift. Finally, the 200-dimensional low-frequency part of the spectrogram is used as the input to the audio modality. For the log Mel-spectrogram, we also calculate its deltas and delta-deltas.
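The speech-spectrogram computation can be sketched as below, assuming a 16 kHz sampling rate (the paper does not state the rate; the window and shift lengths follow the text).

```python
import numpy as np

def speech_spectrogram(signal, sr=16000, win_ms=40, shift_ms=10, n_low=200):
    """Magnitude spectrogram with a 40 ms Hamming window and a 10 ms
    shift, keeping the n_low low-frequency bins as the audio input.
    The 16 kHz sampling rate is an assumption, not stated in the paper."""
    win = int(sr * win_ms / 1000)     # 640 samples at 16 kHz
    hop = int(sr * shift_ms / 1000)   # 160 samples
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win // 2 + 1)
    return spec[:, :n_low]                      # low-frequency part only
```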

2.2. Feature Extraction

2.2.1. Visual Features

We apply three CNN backbones to extract facial emotion features, namely VGGFace, ResNet18, and IR50 (Deng et al., 2018). The dimensions are 4096, 512, and 512, respectively.

2.2.2. Audio Feature

We extract the audio feature maps from the last pooling layer of AlexNet. The size of the 3-dimensional feature map is H x W x C, where H (W) is the height (width) of the feature map and C is the number of channels. The feature map is then split into n = H x W vectors, each of which is C-dimensional.
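The splitting of the pooled map into C-dimensional vectors is a simple reshape. The 6 x 6 x 256 shape below matches AlexNet's final pooling output for a standard image input, but should be treated as illustrative, since the paper does not state the spectrogram input size.

```python
import numpy as np

# The last pooling layer of AlexNet yields a 6 x 6 x 256 map for a
# standard 227 x 227 input; treat these sizes as illustrative here.
fmap = np.random.randn(6, 6, 256)     # H x W x C feature map
H, W, C = fmap.shape
vectors = fmap.reshape(H * W, C)      # n = H * W vectors, C-dim each
```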

2.3. Intra-modal Feature Fusion

We apply attention-based strategies for intra-modal feature fusion. They convert a variable number of emotion features (from the audio or visual modality) into a fixed-dimension feature. We explore three attention methods, namely self-attention, relation-attention, and transformer-attention. Formally, we denote a set of n emotion features as {f_1, f_2, ..., f_n}.

2.3.1. Self-attention

We apply a 1-dimensional fully-connected (FC) layer q^0 and a sigmoid function σ(·) to each emotion feature. The weight of the i-th feature f_i is defined by:

α_i = σ(f_i^T q^0). (1)

With these self-attention weights, we aggregate all the emotion features into a global representation f_v as follows:

f_v = (Σ_{i=1}^{n} α_i f_i) / (Σ_{i=1}^{n} α_i). (2)

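A minimal NumPy sketch of the self-attention weighting and aggregation above; the feature count and dimension in the example are arbitrary, and q0 stands in for the learned parameters of the 1-dimensional FC layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_attention(F, q0):
    """F: (n, d) stack of n emotion features; q0: (d,) weights of the
    1-dimensional FC layer. Returns the per-feature attention weights
    and the weighted global representation."""
    alpha = sigmoid(F @ q0)                              # one weight per feature
    f_v = (alpha[:, None] * F).sum(axis=0) / alpha.sum() # weighted average
    return alpha, f_v
```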
2.3.2. Relation-attention

This attention module is designed to learn weights from the relationship between features. After self-attention, the features are aggregated into a single vector f_v. Since f_v inherently contains a global representation of these features, we use the simple concatenation of individual features and the global representation, [f_i : f_v], to model the global-local relation. Similar to the self-attention module, we apply a 1-dimensional FC layer q^1 and a sigmoid function σ(·) to the individual emotion features. The relation-attention weight of the i-th feature is formulated as follows:

β_i = σ([f_i : f_v]^T q^1). (3)

With the self-attention and relation-attention weights, all the emotion features are converted into a new feature f_v' as follows:

f_v' = (Σ_{i=1}^{n} α_i β_i [f_i : f_v]) / (Σ_{i=1}^{n} α_i β_i). (4)

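The relation-attention step can be sketched on top of the self-attention computation; q0 and q1 are placeholders for the learned FC parameters, and q1 has dimension 2d because it scores the concatenated vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_attention(F, q0, q1):
    """F: (n, d) emotion features; q0: (d,) self-attention FC weights;
    q1: (2d,) relation-attention FC weights acting on the concatenation
    of each feature with the global representation f_v."""
    alpha = sigmoid(F @ q0)                              # self-attention weights
    f_v = (alpha[:, None] * F).sum(axis=0) / alpha.sum() # global representation
    concat = np.concatenate([F, np.tile(f_v, (len(F), 1))], axis=1)
    beta = sigmoid(concat @ q1)                          # relation weights
    w = alpha * beta
    return (w[:, None] * concat).sum(axis=0) / w.sum()   # 2d-dim fused feature
```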
2.3.3. Transformer-attention

Inspired by the works in (Zhang et al., 2019) and (Yang et al., 2016), we formulate the attention weight as follows:

e_i = tanh(W_1 f_i + b_1), (5)
γ_i = softmax(w_2^T e_i). (6)

To reduce the dimension of the feature f_i, we use an FC layer W_1 with a tanh activation in Eq. (5). The weight γ_i of the i-th feature is then obtained by processing e_i with a 1-dimensional FC layer w_2 and a softmax function over all features in Eq. (6).

With these transformer-attention weights, we aggregate all the emotion features into a single feature f_v as follows:

f_v = Σ_{i=1}^{n} γ_i f_i. (7)

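The transformer-style attention can be sketched as below; the parameter shapes are assumptions, since the paper omits the exact dimensions of the FC layers.

```python
import numpy as np

def transformer_attention(F, W1, b1, w2):
    """Sketch of the transformer-style attention: reduce each feature
    with a tanh FC layer, score it with a 1-dimensional FC layer, and
    normalise the scores with softmax. Parameter shapes (W1: (d, k),
    b1: (k,), w2: (k,)) are assumptions."""
    e = np.tanh(F @ W1 + b1)          # (n, k) reduced features
    s = e @ w2                        # (n,) scalar scores
    gamma = np.exp(s - s.max())
    gamma = gamma / gamma.sum()       # softmax over the n features
    return (gamma[:, None] * F).sum(axis=0)
```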
2.4. Cross-modal Feature Fusion

Figure 2. Our factorized bilinear pooling (FBP) module.

We apply factorized bilinear pooling (FBP) for cross-modal feature fusion. Given two features from different modalities, i.e., the audio feature vector x for a spectrogram and the visual feature y for a frame sequence, the simplest cross-modal bilinear model is defined as follows:

z_i = x^T W_i y, (8)

where W_i is a projection matrix and z_i is the i-th element of the output of the bilinear model. We use Eq. (9) to obtain the output feature z; the derivation from Eq. (8) to Eq. (9) is described in (Zhang et al., 2019):

z = SumPool(U^T x ∘ V^T y, k), (9)

where ∘ denotes element-wise multiplication. The implementation of Eq. (9) is illustrated in Fig. 2: U^T x and V^T y are implemented by feeding the features x and y to FC layers, respectively, and the SumPool(·, k) function applies sum pooling over non-overlapping windows of size k. Besides, dropout is adopted to prevent over-fitting. The ℓ2-normalization (z ← z/‖z‖) is used to normalize the energy of z, avoiding drastic variation of the output magnitude caused by the introduced element-wise multiplication.
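A minimal NumPy sketch of the FBP module; the projection shapes are placeholders, the signed square root follows the FBP of Zhang et al. (2019), and dropout is omitted for brevity.

```python
import numpy as np

def fbp(x, y, U, V, k):
    """Factorized bilinear pooling sketch: project both modal features
    with FC layers, fuse them element-wise, sum-pool over non-overlapping
    windows of size k, then normalise. U: (dx, o*k) and V: (dy, o*k) are
    placeholder projection shapes."""
    joint = (x @ U) * (y @ V)                 # element-wise fusion
    z = joint.reshape(-1, k).sum(axis=1)      # sum pooling -> o output dims
    z = np.sign(z) * np.sqrt(np.abs(z))       # power normalisation
    return z / (np.linalg.norm(z) + 1e-12)    # l2 normalisation
```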

3. Experiments

3.1. Dataset

In this work, we use four emotion datasets to train our models, i.e., AffectNet (Mollahosseini et al., 2017), RAF-DB (Li and Deng, 2019), FER+ (Barsoum et al., 2016), and AFEW (Dhall et al., 2019; Dhall et al., 2012).

The human-annotated part of the AffectNet dataset contains 287,651 training images and 4,000 test images, annotated with both emotion labels and valence-arousal values. Only the emotion labels are used in this task.

The RAF-DB dataset consists of 15,339 images labeled with 7-class basic emotion and 3,954 labeled with 12-class compound emotion. Only images labeled with basic emotion are used in this study.

The FER+ dataset contains 28,709 training, 3,589 validation and 3,589 test images. We combine its training data with validation data for the training split and evaluate the model performance on the test data.

The AFEW dataset contains 773 training, 383 validation, and 653 test samples, collected from movies and TV serials with spontaneous expressions, various poses, and illumination conditions.

3.2. Exploration of Emotion Features

We explore emotion features from two perspectives, namely CNN backbones and pretraining emotion datasets.

For the choice of CNN model, we compare IR50 (Deng et al., 2018), ResNet18 (He et al., 2016), and VGGFace (Parkhi et al., 2015) in Table 1, where the former two models are pretrained on the MS-Celeb-1M dataset and the last on the VGGFace dataset. We find that the larger CNN, IR50, is superior to the other two models.

We use the well-trained IR50 models to extract features and train only a softmax classifier on these features. The IR50 models pretrained on FER+, RAF-DB, and AffectNet achieve 50.13%, 51.436%, and 53.78%, respectively. Therefore, we choose the IR50 model pretrained on AffectNet to extract our visual features in the following fusion experiments.

Model FER+ RAF-DB AffectNet
VGGFace 88.84% 86.93% 51.425%
ResNet18 88.65% 86.696% 52.075%
IR50 89.257% 89.075% 53.925%
Table 1. Exploration of CNN models and pretrained emotion datasets.

3.3. Exploration of Fusion Strategies

We explore three intra-modal attention strategies with FBP cross-modal fusion. We use the speech spectrogram for the audio CNN, which obtains 38% on the AFEW validation set individually. In Table 2, we find that FBP improves performance for all the intra-modal fusion methods, and that transformer attention is the best intra-modal fusion method when combined with FBP.

Audio \ Visual Self Relation Transformer
Self 54.6% 56.9% 60.3%
Relation 54.0% 57.2% 60%
Transformer 54.8% 58% 61.1%
Table 2. Evaluation of intra-modal fusion methods.

We also use the log Mel-spectrogram for the audio CNN, which obtains slightly better performance on its own, but the final results after intra- and cross-modal fusion are very similar. Besides, the concatenation of audio and visual vectors achieves 58% accuracy on the AFEW validation set with transformer attention, 3% lower than FBP, which shows the effectiveness of FBP.

3.4. Feature Enhancement

In Table 3, Basic Feature means that we extract only one feature vector for each frame. In addition, we apply five kinds of feature enhancement strategies, as presented in Table 3. Specifically, for the feature F-Mean, we first obtain 18 transformed versions of each frame by using three rotations, three scales, and flipping. We then compute the features of these 18 transformed frames and average them to obtain F-Mean. For the feature F-MeanStd, we compute both the average and the standard deviation of these 18 features and concatenate them. For the feature F-NormFFT, we first compute the fast Fourier transform (FFT) of the Basic Feature, then normalize the result and concatenate its real and imaginary parts. For the feature F-AR-Mean, A means that features are extracted by the model pretrained on AffectNet and R by the model pretrained on RAF-DB; we concatenate the mean features of these two pretrained models to obtain F-AR-Mean.
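The F-Mean and F-MeanStd enhancements reduce to simple statistics over the 18 per-transform features (3 rotations x 3 scales x 2 flip states); the 512-dimensional features below are random placeholders.

```python
import numpy as np

# 18 per-transform features of one frame: 3 rotations x 3 scales x 2
# flip states. The 512-dim features here are random placeholders.
feats = np.random.randn(18, 512)

f_mean = feats.mean(axis=0)                      # F-Mean, 512-dim
f_meanstd = np.concatenate([feats.mean(axis=0),
                            feats.std(axis=0)])  # F-MeanStd, 1024-dim
```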

Visual Feature Augmentation details AFEW Val acc
Basic Feature —- 61.1%
Basic Feature_RAF-DB —- 58.5%
F-Mean default setting 62.14%
F-MeanStd default setting 63.7%
F-NormFFT Normalized FFT 61.35%
F-AR-Mean default setting 62.92%
FG-Net —- 59%
Table 3. Evaluation of five feature enhancement strategies. The default setting refers to the three rotations, three scales, and flipping described above.

Table 3 shows that these feature enhancement methods can further improve the performance of FBP, where the feature F-MeanStd achieves the best result on the validation set.

Sub Val Test Fusion detail
(1) —- 62.481% 4 FG-Net-1
(2) —- 59.112% 2 F-MeanStd-2 + 2 F-AR-Mean
(3) —- 54.518% 4 FG-Net-2
(4) 64.5% 61.41% 4 F-MeanStd
(5) 65.5% 62.328% F-Mean + F-MeanStd + F-NormFFT + F-MeanStd-2 + F-AR-Mean
Table 4. Submission results of different model combinations.

3.5. Results On EmotiW2019

Table 4 lists our submission results. The first three submitted models are trained on the training and validation sets of AFEW, and the last two are trained only on the training set. We find that it is difficult to choose and fuse models when the validation set is combined with the training set. We adopt class weighting in all submissions, i.e., we re-weight the predicted scores by weights derived from the square roots of the class sample numbers ([0.15, 0.097, 0.129, 0.185, 0.138, 0.082, 0.215]).
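The class re-weighting step amounts to an element-wise multiplication of the predicted scores by the fixed class weights; the uniform score vector below is only a placeholder.

```python
import numpy as np

# Re-weight predicted class scores with the fixed class weights from
# the text (derived from square roots of class sample counts).
class_weights = np.array([0.15, 0.097, 0.129, 0.185, 0.138, 0.082, 0.215])
scores = np.full(7, 1.0 / 7)          # placeholder softmax output
reweighted = scores * class_weights   # element-wise re-weighting
pred = int(np.argmax(reweighted))     # final predicted class index
```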

4. Conclusions

In this paper, we exploit three types of intra-modal fusion methods, namely self-attention, relation-attention, and transformer attention. They are mainly used to highlight important emotion features. For the fusion of audio and visual information, we explore feature concatenation and factorized bilinear pooling (FBP). Besides, we evaluate different emotion features, including audio features with both speech spectrograms and log Mel-spectrograms, and several facial features with different CNN models and different emotion pretraining strategies. With careful evaluation, we obtain 62.48% on the AFEW test set and rank second in the EmotiW 2019 challenge.

5. Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (U1613211), Shenzhen Basic Research Program (JCYJ20170818164704758), the Joint Lab of CAS-HK.


  • Bargal et al. (2016) Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, and Cha Zhang. 2016. Emotion recognition in the wild from videos using images. In ACM ICMI.
  • Barsoum et al. (2016) Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. 2016. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In ACM ICMI.
  • Chen et al. (2018) Mingyi Chen, Xuanji He, Jing Yang, and Han Zhang. 2018. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters 25, 10 (2018), 1440–1444.
  • Deng et al. (2018) Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. 2018. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698 (2018).
  • Dhall et al. (2019) Abhinav Dhall, Roland Goecke, Shreya Ghosh, and Tom Gedeon. 2019. EmotiW 2019: Automatic Emotion, Engagement and Cohesion PredictionTasks. In ACM International Conference on Mutimodal Interaction.
  • Dhall et al. (2012) Abhinav Dhall, Roland Goecke, Simon Lucey, Tom Gedeon, et al. 2012. Collecting large, richly annotated facial-expression databases from movies. IEEE multimedia (2012).
  • Dix (2009) Alan Dix. 2009. Human-computer interaction. In Encyclopedia of database systems. Springer, 1327–1331.
  • Fan et al. (2018) Yingruo Fan, Jacqueline CK Lam, and Victor OK Li. 2018. Video-based Emotion Recognition Using Deeply-Supervised Neural Networks. In ACM ICMI.
  • Fan et al. (2016) Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu. 2016. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In ACM ICMI.
  • Gers et al. (1999) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM. (1999).
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5-6 (2005), 602–610.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Hu et al. (2017) Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen. 2017. Learning supervised scoring ensemble for emotion recognition in the wild. In ACM ICMI.
  • Johnson-Laird (1980) Philip Nicholas Johnson-Laird. 1980. Mental models in cognitive science. Cognitive science 4, 1 (1980), 71–115.
  • Knyazev et al. (2018) Boris Knyazev, Roman Shvetsov, Natalia Efremova, and Artem Kuharenko. 2018. Leveraging large face recognition data for emotion classification. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 692–696.
  • Li and Deng (2019) Shan Li and Weihong Deng. 2019. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Unconstrained Facial Expression Recognition. IEEE TIP (2019).
  • Liu et al. (2018) Chuanhe Liu, Tianhao Tang, Kui Lv, and Minghao Wang. 2018. Multi-Feature Based Emotion Recognition for Video Clips. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 630–634.
  • Lu et al. (2018) Cheng Lu, Wenming Zheng, Chaolong Li, Chuangao Tang, Suyuan Liu, Simeng Yan, and Yuan Zong. 2018. Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 646–652.
  • Mitchell et al. (2009) Alex J Mitchell, Amol Vaze, and Sanjay Rao. 2009. Clinical diagnosis of depression in primary care: a meta-analysis. The Lancet 374, 9690 (2009), 609–619.
  • Mollahosseini et al. (2017) Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. 2017. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Transactions on Affective Computing (2017).
  • Parkhi et al. (2015) Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. 2015. Deep face recognition.. In BMVC, Vol. 1. 6.
  • Shang et al. (2016) Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. 2016. Understanding and improving convolutional neural networks via concatenated rectified linear units. In International Conference on Machine Learning.

  • Tan et al. (2017) Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. 2017. Group emotion recognition with individual facial emotion CNNs and global image based CNNs. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 549–552.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. (2017).
  • Vielzeuf et al. (2017) Valentin Vielzeuf, Stéphane Pateux, and Frédéric Jurie. 2017. Temporal multimodal fusion for video emotion classification in the wild. In ACM ICMI.
  • Wang et al. (2018) Kai Wang, Xiaoxing Zeng, Jianfei Yang, Debin Meng, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. 2018. Cascade Attention Networks For Group Emotion Recognition with Face, Body and Image Cues. In Proceedings of the ACM International Conference on Multimodal Interaction. ACM.
  • Wang et al. (2019) Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao. 2019. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. arXiv preprint arXiv:1905.04075 (2019).
  • Wen et al. (2016) Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision. Springer, 499–515.
  • Yang et al. (2018) Jianfei Yang, Kai Wang, Xiaojiang Peng, and Yu Qiao. 2018. Deep Recurrent Multi-instance Learning with Spatio-temporal Features for Engagement Intensity Prediction. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 594–598.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1480–1489.
  • Yao et al. (2016) Anbang Yao, Dongqi Cai, Ping Hu, Shandong Wang, Liang Sha, and Yurong Chen. 2016. HoloNet: towards robust emotion recognition in the wild. In ACM ICMI.
  • Zhang et al. (2019) Yuanyuan Zhang, Zi-Rui Wang, and Jun Du. 2019. Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition. arXiv preprint arXiv:1901.04889 (2019).