Extensive efforts have been devoted to emotion recognition because of its wide range of applications in human-computer interaction (HCI). Recognizing human emotions from facial displays is still a challenging problem due to head pose variations, illumination changes, and identity-related attributes, which, however, do not affect the information extracted from the audio channel. Human affective behavior is naturally multimodal. Researchers have invested significant effort in exploring how to build models that effectively exploit information from multiple modalities.
We are motivated by the fact that voice and facial expression are highly correlated when conveying emotions. For example, without looking at a face, we would know that a person is happy when hearing laughter. In this paper, we develop a feature-level fusion method and a model-level fusion method that exploit information from both audio and visual channels. For the feature-level fusion, the audio features and three different types of visual features are concatenated to construct a joint feature vector. The model-level fusion explicitly handles differences in time scale, time shift, and metrics from different signals. The audio information and visual information are extracted and employed for emotion recognition independently, and then these unimodal recognition results are combined via a probabilistic framework, i.e., a Bayesian network in this work. Fig. 1 illustrates the proposed audiovisual feature-level fusion and model-level fusion frameworks.
Experimental results on the EmotiW 2018 AFEW database have shown that both of the proposed fusion methods significantly outperform the baseline unimodal recognition methods and also achieve better or comparable performance compared with the state-of-the-art methods.
2 Related Work
Facial activity plays a significant role in emotion recognition. One of the major steps is to extract the most discriminative features that are capable of capturing appearance and geometric facial changes. These features can be roughly divided into two categories: human-designed and learned features. Most recently, CNNs [18, 34, 3, 5, 4, 47, 46, 30] have achieved promising recognition performance under real-world conditions, as demonstrated in the recent EmotiW2017 [27, 22, 43] and EmotiW2018 challenges.
Speech/voice is another important means for human communication. As elaborated in the survey paper, the most popular features for audio-based emotion recognition include prosodic features, e.g., pitch-related features, energy-related features, zero crossing, formant, Teager energy operator, fundamental frequency (F0), and speech rate, as well as spectral features. Studies show that pitch-related and energy-related features are the most important audio features for affect recognition. Most recently, some spectral features, such as Linear Prediction Coefficients, Linear Prediction Cepstrum Coefficients, and Mel-Frequency Cepstrum Coefficients with their first derivatives, have been employed for emotion recognition. RASTA-PLP is another popular audio feature, which combines the Relative Spectral Transform and Perceptual Linear Prediction.
Furthermore, audiovisual fusion for emotion recognition has received increasing interest. Previous approaches to audiovisual emotion recognition can be roughly divided into three categories: feature-level fusion, decision-level fusion, and model-level fusion.
Feature-level fusion directly combines audio and visual features into a joint feature vector [23, 12]. For example, Kahou et al. employed audio features, aggregated CNN features, and RNN features as input for a Multilayer Perceptron (MLP). Yao et al. utilized the AU-aware features and their latent relations and then fed the concatenated features into a group of SVMs to make predictions.
Decision-level fusion combines recognition results from the two modalities under the assumption that audio and visual signals are conditionally independent of each other [51, 50, 12, 14, 48], although there are strong semantic and dynamic relationships between the audio and visual channels. For example, Yao et al. [50, 49] performed decision-level fusion by averaging the prediction scores of individual classifiers. In [23, 12, 14, 48, 22], the scores of individual models were combined by a weighted sum instead of a plain average over all models.
Model-level fusion exploits the correlation between audio and visual channels and is usually performed in a probabilistic manner. For example, coupled, tripled, or multistream fused HMMs were developed by integrating multiple component HMMs, each of which corresponds to one modality, e.g., audio or visual. Fragopanagos et al. and Caridakis et al. used an ANN to perform fusion of different modalities. Sebe et al. employed a Bayesian network to recognize expressions from audio and facial activities. Chen et al. employed Multiple Kernel Learning (MKL) to find an optimal combination of the features from the two modalities.
3.1 Audiovisual Feature Extraction
In our work, information is extracted from both audio and visual channels. Specifically, for the audio channel, we utilized a set of low-level spectral or voicing-related feature descriptors and voiced/unvoiced durational features; for the visual channel, we employed both human-crafted features, i.e., LBP-TOP, and features learned by a CNN and a bi-directional LSTM. Finally, we combined audio and visual information for emotion recognition given a video clip.
3.1.1 Audio Features
In this work, the audio signal was extracted from each video at a sampling rate of 48 kHz and a bit rate of 160 kbps. Then, a set of 1582 audio features, which was used in the INTERSPEECH 2010 Paralinguistic Challenge, was extracted with OpenSMILE. The feature set comprises functionals of 34 spectral-related low-level descriptors (LLDs) and their delta coefficients, functionals of 4 voicing-related LLDs and their delta coefficients, and 2 voiced/unvoiced durational features. In addition, PCA was utilized to compress the features: the first 20 principal components were preserved.
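The PCA compression step can be illustrated with a minimal NumPy sketch. It assumes the projection is fitted on training clips only and then reused for other clips, which the text does not state; the function names `fit_pca` and `apply_pca` and the random input arrays are hypothetical stand-ins, not the actual openSMILE output.

```python
import numpy as np

def fit_pca(train_features, n_components):
    """Fit PCA on training features; return the mean and the top principal axes."""
    mean = train_features.mean(axis=0)
    centered = train_features - mean
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features onto the retained principal axes."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
# Random stand-ins for 1582-dimensional audio feature vectors (not real data).
train_audio = rng.normal(size=(60, 1582))
test_audio = rng.normal(size=(10, 1582))

mean, components = fit_pca(train_audio, n_components=20)
train_reduced = apply_pca(train_audio, mean, components)  # shape (60, 20)
test_reduced = apply_pca(test_audio, mean, components)    # shape (10, 20)
```

The same two functions would apply unchanged to the other PCA-compressed feature types by changing `n_components`.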
3.1.2 LBP-TOP-based Visual Features
Following the EmotiW 2018 baseline paper, we extracted LBP-TOP features from non-overlapping spatial blocks. The LBP-TOP features from all blocks are concatenated to create one feature vector. Then, we employed PCA for dimensionality reduction and preserved the first 150 principal components.
3.1.3 Visual Features by an Ensemble of CNNs
In this work, two CNN architectures, i.e., a VGG-Face network and a shallow CNN, were employed as our backbone CNNs. Starting from the pre-trained VGG-Face model, the VGG-Face CNN was first fine-tuned on the CK+ dataset, the MMI dataset, the Oulu-CASIA dataset, the RAF-DB dataset, and the ExpW dataset, and then fine-tuned on the AFEW training set. The shallow CNN was pre-trained on the aforementioned five datasets and also the FER-2013 dataset, and then fine-tuned on the AFEW training set.
To further improve performance, an island loss, which was designed to simultaneously reduce the intra-class variations and increase the inter-class differences, and the softmax loss were jointly used as the supervision signal to train the CNNs. The island loss, denoted as $\mathcal{L}_{IL}$, is defined as the summation of the center loss $\mathcal{L}_{C}$ and the pairwise distances between class centers in the feature space:
$$\mathcal{L}_{IL} = \mathcal{L}_{C} + \lambda_1 \sum_{c_j \in N} \sum_{\substack{c_k \in N \\ c_k \neq c_j}} \left( \frac{c_k \cdot c_j}{\|c_k\|_2 \, \|c_j\|_2} + 1 \right)$$
where $N$ is the set of emotion labels; $c_k$ and $c_j$ denote the $k$-th and $j$-th emotion center with $L_2$ norm $\|c_k\|_2$ and $\|c_j\|_2$, respectively; $(\cdot)$ represents the dot product. Specifically, the first term penalizes the distance between the sample and its corresponding center, and the second term penalizes the similarity between emotions. $\lambda_1$ is used to balance the two terms. By minimizing the island loss, samples of the same emotion will get closer to each other and those of different emotions will be pushed apart.
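As an illustration of the two terms described above, the following NumPy sketch evaluates an island-loss value for a batch of features. It assumes the center loss takes the usual half-sum-of-squared-distances form from the cited center loss work, and that the pairwise term is the cosine similarity between every ordered pair of distinct centers shifted by +1 so that each term is non-negative. The function names and toy inputs are hypothetical; a real training setup would compute this inside an autodiff framework and update the centers during training.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Half the summed squared distance between each sample and its class center."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2)

def island_loss(features, labels, centers, lam1):
    """Center loss plus lam1 times the summed (cosine similarity + 1) over center pairs."""
    norms = np.linalg.norm(centers, axis=1, keepdims=True)
    unit = centers / norms
    cos = unit @ unit.T                           # cosine similarity of all center pairs
    off_diag = ~np.eye(len(centers), dtype=bool)  # exclude each center paired with itself
    pairwise = np.sum(cos[off_diag] + 1.0)
    return center_loss(features, labels, centers) + lam1 * pairwise

labels = np.array([0, 1])

# Centers pointing in opposite directions: the pairwise term vanishes.
centers_apart = np.array([[1.0, 0.0], [-1.0, 0.0]])
loss_apart = island_loss(centers_apart.copy(), labels, centers_apart, lam1=0.5)

# Centers pointing in the same direction: the pairwise term is maximal.
centers_close = np.array([[1.0, 0.0], [2.0, 0.0]])
loss_close = island_loss(centers_close.copy(), labels, centers_close, lam1=0.5)
```

In both toy cases the samples sit exactly on their centers, so the difference between `loss_apart` and `loss_close` comes entirely from the term that pushes class centers apart.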
Inspired by ensemble learning techniques, we built an ensemble of 40 CNNs, half of which employed the VGG-Face structure and the other half the shallow CNN structure. In addition, for each network structure, half of the models used both the softmax loss and the island loss, while the others used only the softmax loss. Each frame was fed into the 40 trained CNNs. Then, the softmax layer outputs of the 40 CNNs were averaged and employed as the per-frame feature.
In order to capture temporal information, we employed k-average temporal pooling. The per-frame CNN features are averaged within $k$ bins, and the $k$ averaged vectors form the per-video-clip feature. In this work, $k$ was set to 7 empirically. As a result, with 7 emotion scores per bin, the per-video-clip CNN features have 49 dimensions. For videos with fewer than $k$ frames, the frames are locally repeated until the sequence length reaches $k$. In addition, when the number of frames cannot be evenly divided by $k$, several frames at the head and the tail of each video clip are discarded. Fig. 2 illustrates how the k-average temporal pooling works.
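The pooling procedure can be sketched in a few lines of NumPy. The sketch assumes 7 bins over 7-dimensional per-frame scores (which yields the 49 dimensions stated above), splits the discarded frames as evenly as possible between head and tail, and repeats short sequences by index stretching; these details, and the function name `k_average_pool`, are assumptions rather than the exact procedure used in the experiments.

```python
import numpy as np

def k_average_pool(frame_scores, k=7):
    """Average per-frame score vectors within k temporal bins and concatenate the bins."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    n = len(frame_scores)
    if n < k:
        # Locally repeat frames (each frame duplicated in place) to reach k frames.
        idx = (np.arange(k) * n) // k
        frame_scores = frame_scores[idx]
        n = k
    # Discard leftover frames, split between the head and the tail of the clip.
    remainder = n % k
    head = remainder // 2
    tail = remainder - head
    trimmed = frame_scores[head:n - tail]
    # Group consecutive frames into k equally sized bins and average each bin.
    bins = trimmed.reshape(k, -1, frame_scores.shape[1])
    return bins.mean(axis=1).reshape(-1)

# Toy clip: 16 frames, each with 7 scores equal to the frame index.
clip = np.tile(np.arange(16, dtype=float)[:, None], (1, 7))
pooled = k_average_pool(clip, k=7)          # 49-dimensional clip feature

# Toy clip shorter than k: 3 frames are stretched to 7 before pooling.
short_clip = np.tile(np.arange(3, dtype=float)[:, None], (1, 7))
pooled_short = k_average_pool(short_clip, k=7)
```

For the 16-frame toy clip, one frame is dropped at each end and the remaining 14 frames fall into 7 bins of 2 frames each.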
3.1.4 Visual Features Learned by a CNN-BLSTM
As shown in Fig. 3, a hybrid CNN-BLSTM network was designed for video-level prediction. Specifically, CNN features of the fine-tuned VGG-Face model were fed into a two-layer BLSTM. Next, features from both directions of the BLSTM were concatenated and then connected to a dense layer (512 dimensions). Finally, a softmax layer was employed to predict the emotion label of the video sequence. During training, a series of 20 images was randomly chosen from each video clip and their CNN features were used as the BLSTM inputs. For testing, a series of 20 evenly spaced images was chosen from each video clip. For videos with fewer than 20 frames, the frames were locally repeated until the sequence length reached 20.
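The frame selection for the BLSTM inputs can be sketched as an index-sampling helper. The sketch assumes that random training frames are drawn without replacement and kept in temporal order, and that "evenly spaced" means spanning the first to the last frame; neither detail is specified in the text, and the function name `sample_frame_indices` is hypothetical.

```python
import numpy as np

def sample_frame_indices(num_frames, length=20, training=False, rng=None):
    """Return `length` frame indices for one video clip."""
    if num_frames < length:
        # Short clip: locally repeat frames until the sequence reaches `length`.
        return (np.arange(length) * num_frames) // length
    if training:
        if rng is None:
            rng = np.random.default_rng()
        # Assumption: sample without replacement, then restore temporal order.
        return np.sort(rng.choice(num_frames, size=length, replace=False))
    # Testing: evenly spaced indices from the first to the last frame.
    return np.linspace(0, num_frames - 1, num=length).round().astype(int)

test_idx = sample_frame_indices(100)
train_idx = sample_frame_indices(100, training=True, rng=np.random.default_rng(0))
short_idx = sample_frame_indices(5)
```

Drawing a fresh random subset at each training step acts as a light form of temporal augmentation, while the deterministic spacing keeps test-time predictions reproducible.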
3.2 Audiovisual Feature-Level Fusion
In our work, the four types of features are extracted from the whole video clip and concatenated to form a joint feature vector. As shown in Fig. 4, the input features to the feature-level fusion are: (1) the first 20 principal components of the audio features, (2) the first 150 principal components of the LBP-TOP features, (3) the 49-dimensional ($k = 7$ bins) CNN features, and (4) the first 50 principal components of the BLSTM features. Hence, we obtain a joint feature vector consisting of 269 features for each video clip, which was fed to a linear SVM for emotion classification.
In order to deal with the difference in metrics among features extracted from the four models/channels, each of the 269 dimensions was first normalized to zero mean and unit variance across all samples. Then, each joint feature vector was normalized to zero mean and unit variance across its 269 dimensions.
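The concatenation and the two normalization passes can be sketched as follows. This is a minimal NumPy illustration with random stand-in features; the function name `fuse_and_normalize` and the small constant `eps` are hypothetical, and the per-dimension statistics are computed from the batch passed in, whereas in practice they would presumably come from the training set.

```python
import numpy as np

def fuse_and_normalize(audio, lbptop, cnn, blstm, eps=1e-8):
    """Concatenate the four feature types and apply the two normalization passes."""
    joint = np.hstack([audio, lbptop, cnn, blstm])   # 20 + 150 + 49 + 50 = 269 columns
    # Pass 1: standardize each dimension across samples.
    joint = (joint - joint.mean(axis=0)) / (joint.std(axis=0) + eps)
    # Pass 2: standardize each sample across its dimensions.
    row_mean = joint.mean(axis=1, keepdims=True)
    row_std = joint.std(axis=1, keepdims=True)
    return (joint - row_mean) / (row_std + eps)

rng = np.random.default_rng(0)
n_clips = 12
# Random stand-ins with deliberately different scales per feature type.
audio = rng.normal(scale=5.0, size=(n_clips, 20))
lbptop = rng.normal(scale=0.1, size=(n_clips, 150))
cnn = rng.uniform(size=(n_clips, 49))
blstm = rng.normal(scale=2.0, size=(n_clips, 50))

joint = fuse_and_normalize(audio, lbptop, cnn, blstm)  # shape (12, 269)
```

The resulting matrix is what a linear SVM would be trained on; the first pass removes scale differences between feature types, and the second makes individual clips comparable to each other.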
3.3 Audiovisual Model-Level Fusion
In this work, we employed a Bayesian Network (BN) for model-level fusion as illustrated in Fig. 5. A BN is a directed acyclic graph, where the shaded nodes are measurement nodes whose states are available during inference and the unshaded node is the hidden node whose state can be estimated by probabilistic inference over the BN.
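Particularly, the four measurement nodes correspond to the four emotion recognition methods employing the four types of features, i.e., audio features, LBP-TOP features, CNN features, and CNN-BLSTM features, respectively. SVMs were employed for classification using each type of features. The directed links between the hidden node “Emotion” and the measurement nodes represent the measurement uncertainty associated with each measurement node, i.e., the recognition accuracy using each method, which can be estimated on the validation set. During inference, the final decision is made by maximizing the posterior probability of the hidden node given the observations from all measurement nodes:
$$E^{*} = \arg\max_{e} P\left(E = e \mid M_{\text{audio}}, M_{\text{LBP-TOP}}, M_{\text{CNN}}, M_{\text{BLSTM}}\right)$$
where $E$ denotes the hidden “Emotion” node and $M_{\text{audio}}$, $M_{\text{LBP-TOP}}$, $M_{\text{CNN}}$, and $M_{\text{BLSTM}}$ denote the observed states of the four measurement nodes.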
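A minimal sketch of this inference step is given below. It assumes the BN has the structure described above (one hidden emotion node with the measurement nodes as its children), so the posterior is proportional to the prior times the product of the per-classifier conditional probabilities, and that each conditional probability table is a row-normalized confusion matrix estimated on validation predictions. The Laplace smoothing constant `alpha`, the function names, and the three-class toy tables are hypothetical choices for illustration.

```python
import numpy as np

def estimate_cpt(true_labels, predicted_labels, num_classes, alpha=1.0):
    """Estimate P(prediction | true emotion) from validation results, with smoothing."""
    counts = np.full((num_classes, num_classes), float(alpha))
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def bn_posterior(prior, cpts, observations):
    """Posterior over the hidden emotion given one prediction per measurement node."""
    log_post = np.log(prior)
    for cpt, obs in zip(cpts, observations):
        log_post = log_post + np.log(cpt[:, obs])
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Toy example with 3 emotions and 2 measurement nodes (hypothetical numbers).
cpt_reliable = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.1, 0.8]])
cpt_weak = np.array([[0.50, 0.25, 0.25],
                     [0.25, 0.50, 0.25],
                     [0.25, 0.25, 0.50]])
prior = np.full(3, 1.0 / 3.0)

# The reliable classifier predicts emotion 0, the weak one predicts emotion 1.
posterior = bn_posterior(prior, [cpt_reliable, cpt_weak], observations=[0, 1])
decision = int(np.argmax(posterior))

# Estimating a table from (true, predicted) validation pairs.
cpt_from_counts = estimate_cpt([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

When the two classifiers disagree in the toy example, the decision follows the one with the sharper conditional table, which is how the measurement uncertainty weights each method.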
4.1 Experimental Dataset
The proposed fusion methods were evaluated on the Acted Facial Expressions in the Wild (AFEW) dataset. The AFEW dataset is a dynamic facial expression dataset consisting of short video clips, which are collected from scenes in movies/TV shows with natural head pose movements, occlusions, and various illuminations. Each of the video clips has been labeled as one of seven expressions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The subjects are diverse in race, age, and gender. The dataset is divided into three partitions: a training set (773 clips), a validation set (383 clips), and a test set (653 clips), such that the data in the three sets come from mutually exclusive movies and actors. The test set labels are held by the challenge organizers and unknown to the public.
Face alignment was performed on each image based on the centers of the two eyes and the nose. The aligned facial images were then resized to separate input resolutions for the shallow CNN and for the VGG-Face CNN. For data augmentation, patches were randomly cropped from the resized images and then rotated by a random angle between -5 and 5 degrees. The rotated patches were then randomly flipped horizontally and used as the input of all CNNs.
4.3 Experimental Results
The two proposed fusion methods were evaluated and compared with the four baseline methods based on (1) audio, (2) LBP-TOP, (3) an ensemble of CNNs, and (4) CNN-BLSTM, respectively. As shown in Table 1, the deep learning based features (the ensemble CNN features and the CNN-BLSTM features) achieved better recognition performance than the human-crafted audio/visual features among the baseline methods. Both proposed fusion methods outperform the baseline unimodal methods employing a single type of features by a large margin.
Table 1. Recognition accuracy (%) on the AFEW validation and test sets.
|Method|Modality|Validation|Test|
|---|---|---|---|
|Hu et al.|audio, visual|59.01|60.34|
|Fan et al.|audio, visual|–|59.02|
|Vielzeuf et al.|audio, visual|–|58.81|
|Yao et al.|audio, visual|51.96|57.84|
|Ouyang et al.|audio, visual|–|57.20|
|Kim et al.|audio, visual|50.39|57.12|
|Yan et al.|audio, visual|–|56.66|
|Wu et al.|audio, visual|–|55.31|
|Kaya et al.|audio, visual|57.02|54.55|
|Ding et al.|audio, visual|51.20|53.96|
|Yao et al.|audio, visual|49.09|53.80|
|Kaya et al.|audio, visual|52.30|53.62|
|Kahou et al.|audio, visual|–|52.88|
|Sun et al.|audio, visual|–|51.43|
|Pini et al.|audio, visual|49.92|50.39|
|Li et al.|audio, visual|–|50.46|
|Gideon et al.|audio, visual|38.81|46.88|
|Bargal et al.|visual|59.42|56.66|
|Sun et al.|visual|50.67|50.14|
|Feature-level fusion|audio, visual|53.79|56.81|
|Model-level fusion|audio, visual|54.83|54.06|
4.4 Analysis of Fusion Methods
As shown in Table 1, the feature-level fusion method achieved the better result of the two proposed methods on the test set (56.81%), while the model-level fusion achieved a better result on the validation set (54.83% versus 53.79%). This may be because the measurement uncertainty estimated on the validation set differs from that on the test set.
Although the average recognition performance was improved significantly by using the fusion-based methods compared to all of the baseline methods, the improvement for recognizing disgust, fear, and surprise was marginal as compared to the best baseline methods, i.e., the two deep learning based methods, on the validation set. Furthermore, the feature-level fusion method failed to recognize disgust, fear, and surprise on the test set. This is likely because none of the four types of features could characterize these three expressions well: the recognition accuracies of the four baseline methods are all low for these expressions. In contrast, by considering the measurement uncertainty in a probabilistic manner, the model-level fusion yielded the best results on recognizing these difficult expressions on the validation set. These observations imply that the feature-level fusion may further boost the recognition performance of those expressions that can be well recognized by a single type of features, while the model-level fusion may be employed to improve the recognition performance of the difficult expressions.
We proposed two novel audiovisual fusion methods by exploiting audio features, LBP-TOP-based features, CNN-based features, and CNN-BLSTM features. Both the proposed fusion methods significantly outperform the baseline methods that employ a single type of feature in terms of the average recognition performance. In the future, more advanced techniques will be developed to improve the audio-based emotion recognition by exploring deep learning based approaches.
This work is supported by the National Science Foundation under CAREER Award IIS-1149787. The Titan Xp used for this research was donated by the NVIDIA Corporation.
- [1] B. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoustical Society of America, 55(6):1304–1312, 1974.
- [2] S. Bargal, E. Barsoum, C. C. F., and C. Zhang. Emotion recognition in the wild from videos using images. In ICMI, pages 433–436. ACM, 2016.
- [3] J. Cai, Z. Meng, A. Khan, Z. Li, J. O’Reilly, and Y. Tong. Probabilistic attribute tree in convolutional neural networks for facial expression recognition. arXiv preprint, 2018.
- [4] J. Cai, Z. Meng, A. Khan, Z. Li, J. O’Reilly, and Y. Tong. Identity-free facial expression recognition using conditional generative adversarial network. arXiv preprint, 2019.
- [5] J. Cai, Z. Meng, A. Khan, Z. Li, J. O’Reilly, and Y. Tong. Island loss for learning discriminative features in facial expression recognition. In FG, pages 302–309. IEEE, 2019.
- [6] G. Caridakis, L. Malatesta, L. Kessous, N. Amir, A. Raouzaiou, and K. Karpouzis. Modeling naturalistic affective states via facial and vocal expression recognition. In ICMI, 2006.
- [7] J. Chen, Z. Chen, Z. Chi, and H. Fu. Emotion recognition in the wild with feature fusion and multiple kernel learning. In ICMI, pages 508–513, 2014.
- [8] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE T-ASSP, 28(4):357–366, 1980.
- [9] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, et al. Collecting large, richly annotated facial-expression databases from movies. IEEE multimedia, 19(3):34–41, 2012.
- [10] A. Dhall, A. Kaur, R. Goecke, and T. Gedeon. Emotiw 2018: Audio-video, student engagement and group-level affect prediction. In ICMI. ACM, 2018.
- [11] W. Ding, M. Xu, D. Huang, W. Lin, M. Dong, X. Yu, and H. Li. Audio and face video emotion recognition in the wild using deep neural networks and small datasets. In ICMI, pages 506–513. ACM, 2016.
- [12] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal. Recurrent neural networks for emotion recognition in video. In ICMI, pages 467–474. ACM, 2015.
- [13] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in opensmile, the munich open-source multimedia feature extractor. In ACM MM, pages 835–838, 2013.
- [14] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In ICMI, pages 445–450, 2016.
- [15] N. Fragopanagos and J. G. Taylor. Emotion recognition in human-computer interaction. IEEE T-NN, 18(4):389–405, 2005.
- [16] J. Gideon, B. Zhang, Z. Aldeneh, Y. Kim, S. Khorram, D. Le, and E. Provost. Wild wild emotion: a multimodal ensemble approach. In ICMI, pages 501–505. ACM, 2016.
- [17] I. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D. Lee, et al. Challenges in representation learning: A report on three machine learning contests. In ICML, pages 117–124. Springer, 2013.
- [18] S. Han, Z. Meng, J. O’Reilly, J. Cai, X. Wang, and Y. Tong. Optimizing filter size in convolutional neural networks for facial action unit recognition. In CVPR, 2018.
- [19] H. Hermansky. Perceptual linear predictive (plp) analysis of speech. J. Acoustical Society of America, 87(4):1738–1752, 1990.
- [20] H. Hermansky and N. Morgan. Rasta processing of speech. IEEE Trans. on Speech and Audio Processing, 2(4):578–589, 1994.
- [21] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. Rasta-plp speech analysis. In ICASSP, volume 1, pages 121–124. IEEE, 1991.
- [22] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen. Learning supervised scoring ensemble for emotion recognition in the wild. In ICMI, pages 553–560. ACM, 2017.
- [23] H. Kaya, F. Gürpinar, S. Afshar, and A. A. Salah. Contrasting and combining least squares based learners for emotion recognition in the wild. In ICMI, pages 459–466, 2015.
- [24] H. Kaya, F. Gürpınar, and A. Salah. Video-based emotion recognition in the wild using deep transfer learning and score fusion. J. IVC, 65:66–75, 2017.
- [25] A. Khan, Z. Li, J. Cai, Z. Meng, J. O’Reilly, and Y. Tong. Group-level emotion recognition using deep models with a four-stream hybrid network. In ICMI, pages 623–629. ACM, 2018.
- [26] D. Kim, M. Lee, D. Choi, and B. Song. Multi-modal emotion recognition using semi-supervised learning and multiple neural networks in the wild. In ICMI, pages 529–535. ACM, 2017.
- [27] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko. Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video. arXiv preprint, 2017.
- [28] S. Li, W. Deng, and J. Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, July 2017.
- [29] W. Li, F. Abtahi, and Z. Zhu. A deep feature based multi-kernel learning approach for video emotion recognition. In ICMI, pages 483–490. ACM, 2015.
- [30] Z. Li, S. Han, A. Khan, J. Cai, Z. Meng, J. O’Reilly, and Y. Tong. Pooling map adaptation in convolutional neural networks for facial expression recognition. In ICME. IEEE, 2019.
- [31] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended cohn-kanade dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. In CVPR Workshops, pages 94–101, 2010.
- [32] J. Makhoul. Linear prediction: A tutorial review. Proc. of the IEEE, 63(4):561–580, 1975.
- [33] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic. Automatic analysis of facial actions: A survey. IEEE T-AC, 13(9):1–22, 2017.
- [34] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong. Identity-aware convolutional neural network for facial expression recognition. In FG, pages 558–565. IEEE, 2017.
- [35] X. Ouyang, S. Kawaai, E. Goh, S. Shen, W. Ding, H. Ming, and D. Huang. Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In ICMI, pages 577–582. ACM, 2017.
- [36] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In ICME, pages 5–pp. IEEE, 2005.
- [37] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, volume 1, page 6, 2015.
- [38] S. Pini, O. Ahmed, M. Cornia, L. Baraldi, R. Cucchiara, and B. Huet. Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild. In ICMI, pages 536–543. ACM, 2017.
- [39] B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic. AVEC 2011–the first international audio/visual emotion challenge. In ACII, pages 415–424. Springer, 2011.
- [40] N. Sebe, I. Cohen, T. Gevers, and T. Huang. Emotion recognition based on joint visual and audio cues. In ICPR, pages 1136–1139, 2006.
- [41] B. Sun, L. Li, G. Zhou, and J. He. Facial expression recognition in the wild based on multimodal texture features. J. Electronic Imaging, 25(6):061407, 2016.
- [42] B. Sun, Q. Wei, L. Li, Q. Xu, J. He, and L. Yu. LSTM for dynamic emotion and group emotion recognition in the wild. In ICMI, pages 451–457. ACM, 2016.
- [43] V. Vielzeuf, S. Pateux, and F. Jurie. Temporal multimodal fusion for video emotion classification in the wild. In ICMI, pages 569–576. ACM, 2017.
- [44] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515. Springer, 2016.
- [45] J. Wu, Z. Lin, and H. Zha. Multi-view common space learning for emotion recognition in the wild. In ICMI, pages 464–471. ACM, 2016.
- [46] J. Xue, H. Zhang, and K. Dana. Deep texture manifold for ground terrain recognition. In CVPR, 2018.
- [47] J. Xue, H. Zhang, K. Dana, and K. Nishino. Differential angular imaging for material recognition. In CVPR, volume 5, 2017.
- [48] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, Y. Zong, and N. Sun. Multi-clue fusion for emotion recognition in the wild. In ICMI, pages 458–463, 2016.
- [49] A. Yao, D. Cai, P. Hu, S. W., L. Sha, and Y. Chen. HoloNet: towards robust emotion recognition in the wild. In ICMI, pages 472–478. ACM, 2016.
- [50] A. Yao, J. Shao, N. Ma, and Y. Chen. Capturing AU-aware facial features and their latent relations for emotion recognition in the wild. In ICMI, pages 451–458, 2015.
- [51] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE T-PAMI, 31(1):39–58, Jan. 2009.
- [52] Z. Zeng, J. Tu, B. M. Pianfetti, and T. S. Huang. Audio-visual affective expression recognition through multistream fused HMM. IEEE T-Multimedia, 10(4):570–577, June 2008.
- [53] Z. Zhang, P. Luo, C. Loy, and X. Tang. From facial expression recognition to interpersonal relation prediction. IJCV, 126(5):550–569, 2018.
- [54] G. Zhao, X. Huang, M. Taini, S. Li, and M. Pietikäinen. Facial expression recognition from near-infrared videos. J. IVC, 29(9):607–619, 2011.
- [55] G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE T-PAMI, 29(6):915–928, June 2007.