An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

02/12/2020 ∙ by Sicheng Zhao, et al. ∙ Nankai University ∙ berkeley college ∙ Beijing Didi Infinity Technology and Development Co., Ltd.

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e. extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e. polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at:





Introduction

The convenience of mobile devices and social networks has enabled users to generate videos and upload them to the Internet in daily life to share their experiences and express personal opinions. As a result, an explosively growing volume of videos is being created, which results in an urgent demand for the analysis and management of these videos. Besides objective content recognition, such as recognizing objects and actions [zhu2018towards, choutas2018potion], understanding the emotional impact of videos plays an important role in human-centered computing. On the one hand, videos can, to a large extent, reflect the psychological states of the video generators. We can predict the generators’ possible extreme behaviors, such as depression and suicide, and take corresponding preventive actions. On the other hand, videos that evoke strong emotions can easily resonate with viewers and bring them immersive watching experiences. Appropriate emotional resonance is crucial in intelligent advertising and video recommendation. Further, emotion recognition in user-generated videos (UGVs) can help companies analyze how customers evaluate their products and assist governments in managing the Internet.

Figure 1: Illustration of the keyframes and discriminative regions for emotion recognition in user-generated videos. Although the story in a video may contain different stages, the emotion is mainly evoked by some keyframes (as shown by the temporal attentions in the color bar) and corresponding discriminative regions (as illustrated by the spatial attentions in the heat map).

Although remarkable progress has been made with the advent of deep learning on text sentiment classification [zhang2018deep], image emotion analysis [zhao2018affective, zhao2018predicting, yang2018weakly], and video semantic understanding [zhu2018towards, choutas2018potion], emotion recognition in UGVs remains an unsolved problem, due to the following challenges. (1) Large intra-class variation. Videos captured in quite different scenarios may evoke similar emotions. For example, visiting an amusement park, taking part in a sports competition, and playing video games may all make viewers feel “excited”. This results in an obvious “affective gap” between low-level features and high-level emotions. (2) Low structured consistency. Unlike professional and commercial videos, such as movies [wang2006affective] and GIFs [jou2014predicting, yang2019human], UGVs are usually taken with diverse structures, e.g. various resolutions and image blurring noise. (3) Sparse keyframe expression. Only a limited number of keyframes directly convey and determine emotions, as shown in Figure 1, while the rest are used to introduce the background and context.

Most existing approaches to emotion recognition in UGVs focus on the first challenge, i.e. employing advanced image representations to bridge the affective gap, such as (1) mid-level attribute features [jiang2014predicting, tu2019multi] like ObjectBank [li2010object] and SentiBank [borth2013sentibank], (2) high-level semantic features [chen2016emotion] like detected events [jiang2017exploiting, caba2015activitynet], objects [deng2009imagenet], and scenes [zhou2014learning], and (3) deep convolutional neural network (CNN) features [xu2018heterogeneous, zhang2018recognition]. The method of [zhang2018recognition] transformed frame-level spatial features to another kernelized feature space via discrete Fourier transform, which partially addresses the second challenge. For the third challenge, the videos are either uniformly downsampled to a fixed number of frames [zhang2018recognition], or represented by continuous frames from only one segment [tu2019multi].

The above methods have contributed to the development of emotion recognition in UGVs, but they still have some problems. (1) They mainly employ a two-stage shallow pipeline, i.e. extracting visual and/or audio features and training classifiers. (2) The visual CNN features of each frame are extracted separately, which ignores the temporal correlation of adjacent frames. (3) The fact that emotions may be determined by keyframes from several discrete segments is neglected. (4) Some methods require auxiliary data, which is not always available in real applications. For example, the extracted event, object, and scene features in [chen2016emotion] are trained on the FCVID [jiang2017exploiting] and ActivityNet [caba2015activitynet], ImageNet [deng2009imagenet], and Places205 [zhou2014learning] datasets, respectively. (5) They do not consider the correlations between different emotions, such as the polarity-emotion hierarchy constraint, i.e. two different emotions belonging to the same polarity are more closely related than two emotions from opposite polarities.

In this paper, we propose an end-to-end Visual-Audio Attention Network, termed VAANet, to address the above problems for recognizing the emotions in UGVs, without requiring any auxiliary data except the data for pre-training. First, we split each video into an equal number of segments. Second, for each segment, we randomly select some successive frames and feed them into a 3D CNN [hara2018can] with both spatial and channel-wise attentions to extract visual features. Meanwhile, we transform the corresponding audio waves into spectrograms and feed them into a 2D CNN [he2016deep] to extract audio features. Finally, the visual and audio features of different segments are weighted by temporal attentions to obtain the whole video’s feature representation, which is followed by a fully connected layer to obtain emotion predictions. Considering the polarity-emotion hierarchy constraint, we design a novel classification loss, i.e. polarity-consistent cross-entropy (PCCE) loss, to guide the attention generation.

In summary, the contributions of this paper are threefold:

  1. We are the first to study the emotion recognition task in user-generated videos in an end-to-end manner.

  2. We develop a novel network architecture, i.e. VAANet, that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN for video emotion recognition. We propose a novel PCCE loss, which enables VAANet to generate polarity-preserved attention maps.

  3. We conduct extensive experiments on the VideoEmotion-8 [jiang2014predicting] and Ekman-6 [xu2018heterogeneous] datasets, and the results demonstrate the superiority of the proposed VAANet method, as compared to the state-of-the-art approaches.

Figure 2: The framework of the proposed Visual and Audio Attention Network (VAANet). First, the MFCC descriptor from the soundtrack and the visual information are both divided into segments and fed into 2D ResNet-18 and 3D ResNet-101, respectively, to extract audio and visual representations. The response feature maps of the visual stream are then fed into the stacked spatial attention, channel-wise attention, and temporal attention sub-networks, and the response feature map of the audio stream is fed into a temporal attention module. Finally, the attended semantic vectors that carry visual and audio information are concatenated. Meanwhile, a novel polarity-consistent cross-entropy loss is optimized to guide the attention generation for video emotion recognition.

Related Work

Video Emotion Recognition: Psychologists usually employ two kinds of models to represent emotions: categorical emotion states (CES) and dimensional emotion space (DES). CES models classify emotions into several basic categories, such as Ekman’s 6 basic categories [ekman1992argument] and Plutchik’s wheel of emotions [plutchik1980emotion]. DES models usually employ a Cartesian space to represent emotions, such as valence-arousal-dominance [schlosberg1954three]. Since CES are easy for users to understand and label, here we adopt CES to represent emotions in videos.

Early research on video emotion recognition mainly focused on movies, which are well structured. The work in [kang2003affective] employed a Hidden Markov Model to detect affective events based on low-level features, including color, motion, and shot cut rate. Joint combination of visual and audio features with support vector machines [wang2006affective] and conditional random fields [xu2013hierarchical] achieved promising results. Some recent methods work on animated GIFs [jou2014predicting, chen2016predicting, yang2019human]. The work in [jou2014predicting] first proposed to recognize GIF emotions by using features of different types. The work in [chen2016predicting] improved the performance by adopting 3D ConvNets to extract spatiotemporal features. Human-centered GIF emotion recognition is conducted by considering human-related information and visual attention [yang2019human].

Because of their content diversity and low quality, recognizing emotions in UGVs is more challenging. The work in [jiang2014predicting] investigated a large set of low-level visual-audio features and mid-level attributes, e.g. ObjectBank [li2010object] and SentiBank [borth2013sentibank]. The work in [chen2016emotion] extracted various high-level semantic features based on existing detectors. Compared with hand-crafted features, deep features are more widely used [xu2018heterogeneous, zhang2018recognition]. By combining low-level visual-audio-textual features, [pang2015deep] showed that learned joint representations are complementary to hand-crafted features. Different from these methods, which employ a two-stage shallow pipeline, we propose the first end-to-end method to recognize emotions in UGVs by extracting attended visual and audio CNN features.

Please note that emotion recognition has also been widely studied in other modalities, such as text [zhang2018deep], images [zhao2017continuous, yang2018retrieving, zhao2018affective, yang2018weakly, zhao2018emotiongan, zhao2019cycleemotiongan, zhao2019pdanet, yao2019attention, zhan2019zero], speech [el2011survey], physiological signals [alarcao2017emotions, zhao2019personalized], and multi-modal data [soleymani2017survey, zhao2019affective].

Attention-Based Models: Since attention can be considered as a dynamic feature extraction mechanism that combines contextual fixations over time [mnih2014recurrent, chen2017sca], it has been seamlessly incorporated into deep learning architectures and has achieved outstanding performance in many vision-related tasks, such as image classification [woo2018cbam], image captioning [you2016image, chen2017sca, chen2018show], and action recognition [song2017end]. These attention methods can be roughly divided into four categories: spatial attention [song2017end, woo2018cbam], semantic attention [you2016image], channel-wise attention [chen2017sca, woo2018cbam], and temporal attention [song2017end].

There are also several methods that employ attention for emotion recognition in images [you2017visual, yang2018weakly, zhao2019pdanet] and speech [mirsamadi2017automatic]. The former methods mainly consider spatial attention except PDANet [zhao2019pdanet] which also employs channel-wise attention, while the latter one only uses temporal attention. To the best of our knowledge, attention has not been studied on emotion recognition in user-generated videos. In this paper, we systematically investigate the influence of different attentions in video emotion recognition, including the importance of local spatial context by spatial attention, the interdependency between different channels by channel-wise attention, and the importance of different segments by temporal attention.

Visual-Audio Attention Network

We propose a novel CNN architecture with spatial, channel-wise, and temporal attention mechanisms for emotion recognition in user generated videos. Figure 2 shows the overall framework of the proposed VAANet. Specifically, VAANet has two streams to respectively exploit the visual and audio information. The visual stream consists of three attention modules and the audio stream contains a temporal attention module. The spatial attention and the channel-wise attention sub-networks in the visual stream are designed to automatically focus on the regions and channels that carry discriminative information within each feature map. The temporal attention sub-networks in both the visual and audio streams are designed to assign weights to different segments of a video. The training of VAANet is performed by minimizing the newly designed polarity-consistent cross-entropy loss in an end-to-end manner.

Visual Representation Extraction

To extract visual representations from a long-term video, following [wang2016temporal], the visual stream of our model works on short snippets sparsely sampled from the entire video. Specifically, we divide each video into $t$ segments of equal duration, and then randomly sample a short snippet of $k$ successive frames from each segment. We use 3D ResNet-101 [hara2018can] as the backbone of the visual stream. It takes the $t$ snippets (each with $k$ successive frames) as input and independently processes each of them up to the last spatiotemporal convolutional layer conv5 into a super-frame. Suppose we are given $N$ training samples $\{(x_l^V, y_l)\}_{l=1}^{N}$, where $x_l^V$ is the visual information of video $l$, and $y_l$ is the corresponding emotion label. For sample $x_l^V$, suppose the feature map of conv5 in 3D ResNet-101 is $F_l^V \in \mathbb{R}^{t \times h \times w \times n}$ (we omit $l$ for simplicity in the following), where $h$ and $w$ are the spatial size (height and width) of the feature map, $n$ is the number of channels, and $t$ is the number of snippets. We reshape $F^V$ as

$$F^V = \{f_{ij}^V\} \in \mathbb{R}^{t \times m \times n}, \tag{1}$$

by flattening the height and width of the original $F^V$, where $f_{ij}^V \in \mathbb{R}^{n}$ and $m = h \times w$. Here we can consider $f_{ij}^V$ as the visual feature of the $j$-th location in the $i$-th super-frame. In the following, we omit the superscript V for simplicity.
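The sparse sampling step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `sample_snippets` is hypothetical, the defaults of 10 segments and 16 frames are the values reported in the implementation details, and clamping indices for very short videos is a choice made for this sketch rather than something the paper specifies.

```python
import random

def sample_snippets(num_frames, num_segments=10, snippet_len=16, rng=None):
    """Return one run of `snippet_len` successive frame indices per segment.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = rng or random.Random()
    snippets = []
    for s in range(num_segments):
        # Segment s covers the frame indices [start, end).
        start = s * num_frames // num_segments
        end = (s + 1) * num_frames // num_segments
        # Latest starting frame that keeps the snippet inside the segment.
        last_start = max(start, end - snippet_len)
        first = rng.randint(start, last_start)
        # Assumption of this sketch: clamp to the last frame for short videos.
        snippets.append([min(first + i, num_frames - 1) for i in range(snippet_len)])
    return snippets

snippets = sample_snippets(num_frames=3200, rng=random.Random(0))
print(len(snippets), len(snippets[0]))  # 10 16
```

Each list of indices would then be decoded into a clip and passed through the 3D CNN independently, giving one super-frame per segment.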

Visual Spatial Attention Estimation

We employ a spatial attention module to automatically explore the different contributions of the regions in super-frames to predicting the emotions. Following [chen2017sca], we employ a two-layer neural network, i.e. a convolutional layer followed by a fully-connected layer with a softmax function, to generate the spatial attention distributions over all the super-frame regions. That is, for each $F_i \in \mathbb{R}^{m \times n}$ ($i = 1, \dots, t$),

$$H_i^S = W^{S_1} (W^{S_2} F_i^\top)^\top, \quad A_i^S = \mathrm{Softmax}(H_i^S), \tag{2}$$

where $W^{S_1} \in \mathbb{R}^{m \times m}$ and $W^{S_2} \in \mathbb{R}^{1 \times n}$ are two learnable parameter matrices, $\top$ is the transpose of a matrix, and $A_i^S \in \mathbb{R}^{m \times 1}$.

We can then obtain a weighted feature map based on spatial attention as follows

$$F_i^S = A_i^S \otimes F_i, \tag{3}$$

where $\otimes$ is the multiplication of a matrix and a vector, which is performed by multiplying each value in the vector with the feature at the corresponding location of the matrix.
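To make the two-layer attention concrete, the following is a minimal sketch in plain Python for a single super-frame. The function name, the parameter shapes (a length-n channel projection `W2` followed by an m x m mixing matrix `W1`), and the toy values are assumptions chosen for illustration; a real implementation would use learned tensors in a deep learning framework.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(F_i, W1, W2):
    """F_i: m x n map (locations x channels). Returns (weights, weighted map)."""
    # First layer: project the n-dim feature at each location to one score.
    scores = [sum(w * f for w, f in zip(W2, row)) for row in F_i]
    # Second layer: fully-connected mixing across the m locations.
    H = [sum(W1[j][k] * scores[k] for k in range(len(scores))) for j in range(len(W1))]
    A = softmax(H)
    # Scale the feature at each location by its attention weight.
    weighted = [[a * f for f in row] for a, row in zip(A, F_i)]
    return A, weighted

F_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 locations, n = 2 channels
W1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]       # identity mixing, toy values
W2 = [1.0, 1.0]
A, weighted = spatial_attention(F_i, W1, W2)
print([round(a, 3) for a in A])  # [0.212, 0.212, 0.576]
```

The weights sum to one, so the location with the strongest response (the third one here) dominates the re-weighted map.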

Visual Channel-Wise Attention Estimation

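Assuming that each channel of a feature map in a CNN is a response activation of the corresponding convolutional layer, channel-wise attention can be viewed as a process of selecting semantic attributes [chen2017sca]. To generate the channel-wise attention, we first transpose $F^S$ to $G$

$$G = \{g_{ij}\} \in \mathbb{R}^{t \times n \times m}, \tag{4}$$

where $g_{ij} \in \mathbb{R}^{m}$ represents the $j$-th channel in the $i$-th super-frame of the feature map $G$. The channel-wise attention for $G_i \in \mathbb{R}^{n \times m}$ is defined as

$$H_i^C = W^{C_1} (W^{C_2} G_i^\top)^\top, \quad A_i^C = \mathrm{Softmax}(H_i^C), \tag{5}$$

where $W^{C_1} \in \mathbb{R}^{n \times n}$ and $W^{C_2} \in \mathbb{R}^{1 \times m}$ are two learnable parameter matrices, and $A_i^C \in \mathbb{R}^{n \times 1}$.

A weighted feature map based on channel-wise attention is then computed as follows

$$G_i^C = A_i^C \otimes G_i, \tag{6}$$

where $\otimes$ is the multiplication of a matrix and a vector.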

Visual Temporal Attention Estimation

For a video, the discriminability of each frame for recognizing emotions is obviously different. Only some keyframes contain discriminative information, while the others only provide background and context information [song2017end]. Based on such observations, we design a temporal attention sub-network to automatically focus on the important segments that contain keyframes. To generate the temporal attention, we first apply spatial average pooling to $G^C$ and reshape it to $P$

$$P = \{p_1, \dots, p_t\} \in \mathbb{R}^{t \times n}, \tag{7}$$

where $p_j \in \mathbb{R}^{n}$. Here we can consider $p_j$ as the visual feature of the $j$-th super-frame. The temporal attention is then defined as

$$H^T = W^{T_1} (W^{T_2} P^\top)^\top, \quad A^T = \mathrm{ReLU}(H^T), \tag{8}$$

where $W^{T_1} \in \mathbb{R}^{t \times t}$ and $W^{T_2} \in \mathbb{R}^{1 \times n}$ are two learnable parameter matrices, and $A^T \in \mathbb{R}^{t \times 1}$. Following [song2017end], we use ReLU (Rectified Linear Units) as the activation function here for its better convergence performance. The final visual embedding is the weighted sum of all the super-frames

$$E^V = \sum_{j=1}^{t} p_j \cdot A_j^T. \tag{9}$$
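The temporal weighting differs from the spatial case in two ways: the activation is ReLU rather than softmax, and the output is a single vector obtained by summing the weighted super-frame features. A minimal sketch, assuming a length-n projection `W2` followed by a t x t mixing matrix `W1` (names, shapes, and toy values are illustrative, not taken from the paper):

```python
def temporal_attention_embedding(P, W1, W2):
    """P: t x n matrix with one pooled feature vector per super-frame.

    Returns the ReLU attention weights and the weighted-sum embedding.
    """
    # Project each super-frame feature to a scalar score.
    scores = [sum(w * p for w, p in zip(W2, row)) for row in P]
    # Mix the scores across the t super-frames.
    H = [sum(W1[j][k] * scores[k] for k in range(len(scores))) for j in range(len(W1))]
    # ReLU keeps non-negative weights and zeroes out the rest.
    A = [max(0.0, h) for h in H]
    # Weighted sum over super-frames, computed channel by channel.
    E = [sum(A[j] * P[j][c] for j in range(len(P))) for c in range(len(P[0]))]
    return A, E

P = [[1.0, 2.0], [3.0, 4.0], [-5.0, -6.0]]   # t = 3 super-frames, n = 2 channels
W1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]       # identity mixing, toy values
W2 = [1.0, 1.0]
A, E = temporal_attention_embedding(P, W1, W2)
print(A)  # [3.0, 7.0, 0.0]
print(E)  # [24.0, 34.0]
```

Because ReLU does not normalize, the weights need not sum to one; a super-frame with a negative score (the third one here) contributes nothing to the embedding.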


Audio Representation Extraction

Audio features are complementary to visual features, because they contain information from another modality. In our problem, we choose to use the most well-known audio representation: the mel-frequency cepstral coefficients (MFCC). Suppose we are given $N$ audio training samples $\{(x_l^A, y_l)\}_{l=1}^{N}$, where $x_l^A$ is a descriptor from the entire soundtrack of video $l$ and $y_l$ is the corresponding emotion label. We center-crop $x_l^A$ to a fixed length of $q$ to get $x_l^{A'}$, padding it with itself when necessary. Similar to the method used for extracting the visual representation, we divide each descriptor into $t$ segments and use 2D ResNet-18 [he2016deep] as the backbone of the audio stream of our model, which processes the descriptor segments independently. For descriptor $x_l^{A'}$, suppose the feature map of conv5 in 2D ResNet-18 is $F_l^A \in \mathbb{R}^{t \times h' \times w' \times n'}$ (we omit $l$ for simplicity in the following), where $h'$ and $w'$ are the height and width of the feature map, $n'$ is the number of channels, and $t$ is the number of segments. We apply spatial average pooling to $F^A$ and obtain $F^{A'} \in \mathbb{R}^{t \times n'}$.
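The length normalization and segmentation of the audio descriptor can be sketched as follows. The helper names are hypothetical, and reading "pad itself" as repeating the descriptor until it reaches the target length is an assumption of this sketch; the paper does not spell out the padding rule.

```python
def fix_length(frames, q):
    """Center-crop a sequence of descriptor frames to length q.

    If the sequence is shorter than q, repeat it until it reaches q
    (an assumed interpretation of padding the descriptor with itself).
    """
    if not frames:
        raise ValueError("empty descriptor")
    if len(frames) >= q:
        start = (len(frames) - q) // 2
        return frames[start:start + q]
    reps = -(-q // len(frames))  # ceiling division
    return (frames * reps)[:q]

def split_segments(frames, t):
    """Split a sequence into t contiguous segments of nearly equal length."""
    n = len(frames)
    return [frames[s * n // t:(s + 1) * n // t] for s in range(t)]

print(fix_length(list(range(10)), 4))      # [3, 4, 5, 6]
print(fix_length([0, 1, 2], 7))            # [0, 1, 2, 0, 1, 2, 0]
print(split_segments(list(range(10)), 5))  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Here each list element stands in for one MFCC frame; each resulting segment would be fed to the 2D CNN independently.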

Audio Temporal Attention Estimation

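With a motivation similar to that for integrating the temporal attention sub-network into the visual stream, we introduce a temporal attention sub-network to explore the influence of audio information in different segments for recognizing emotions as

$$H^A = W^{A_1} (W^{A_2} (F^{A'})^\top)^\top, \quad A^A = \mathrm{ReLU}(H^A), \tag{10}$$

where $W^{A_1} \in \mathbb{R}^{t \times t}$ and $W^{A_2} \in \mathbb{R}^{1 \times n'}$ are two learnable parameter matrices, and $A^A \in \mathbb{R}^{t \times 1}$. The final audio embedding is the weighted sum of all the segments

$$E^A = \sum_{j=1}^{t} F_j^{A'} \cdot A_j^A. \tag{11}$$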


Polarity-Consistent Cross-Entropy Loss

We concatenate $E^V$ and $E^A$ to obtain an aggregated semantic vector $E = [E^V, E^A]$, which can be viewed as the final representation of a video and is fed into a fully connected layer to predict the emotion labels. The traditional cross-entropy loss is defined as

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}_{[c = y_i]} \log p_{i,c}, \tag{12}$$

where $C$ is the number of emotion classes ($C = 8$ for VideoEmotion-8 and $C = 6$ for Ekman-6 in this paper), $\mathbb{1}_{[c = y_i]}$ is a binary indicator, and $p_{i,c}$ is the predicted probability that video $i$ belongs to class $c$.

Directly optimizing the cross-entropy loss in Eq. (12) can lead some videos to be incorrectly classified into categories that have the opposite polarity. In this paper, we design a novel polarity-consistent cross-entropy (PCCE) loss to guide the attention generation. That is, the penalty of the predictions that have opposite polarity to the ground truth is increased. The PCCE loss is defined as

$$L_{PCCE} = -\frac{1}{N} \sum_{i=1}^{N} \big(1 + \lambda\, g(\hat{y}_i, y_i)\big) \sum_{c=1}^{C} \mathbb{1}_{[c = y_i]} \log p_{i,c}, \tag{13}$$

where $\hat{y}_i$ is the predicted emotion of video $i$ and $\lambda$ is a penalty coefficient that controls the penalty extent. Similar to the indicator function, $g(\cdot,\cdot)$ represents whether to add the penalty or not and is defined as

$$g(\hat{y}, y) = \begin{cases} 1, & \text{if } \mathrm{polarity}(\hat{y}) \neq \mathrm{polarity}(y), \\ 0, & \text{otherwise}, \end{cases} \tag{14}$$

where $\mathrm{polarity}(\cdot)$ is a function that maps an emotion category to its polarity (positive or negative). Since the derivatives with respect to all parameters can be computed, we can train the proposed VAANet effectively in an end-to-end manner using an off-the-shelf optimizer to minimize the loss function in Eq. (13).
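The polarity penalty can be illustrated with a small plain-Python sketch. It scales each video's cross-entropy term by `1 + lam * g`, which is one straightforward reading of "the penalty of opposite-polarity predictions is increased". The function name, the default `lam=0.5`, and the toy probabilities are assumptions for illustration; the polarity table follows the VideoEmotion-8 category grouping given in the dataset description.

```python
import math

# Polarity of each category, following the grouping listed for VideoEmotion-8.
POLARITY = {
    "anger": "negative", "disgust": "negative", "fear": "negative", "sadness": "negative",
    "anticipation": "positive", "joy": "positive", "surprise": "positive", "trust": "positive",
}

def pcce_loss(probs, labels, classes, lam=0.5):
    """Polarity-consistent cross-entropy over a batch (illustrative sketch).

    probs:   one list of class probabilities per video, ordered as `classes`
    labels:  ground-truth class name per video
    lam:     penalty coefficient (0.5 is an arbitrary illustrative value)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Predicted class = arg max of the probabilities.
        y_hat = classes[max(range(len(p)), key=p.__getitem__)]
        # g = 1 only when prediction and ground truth have opposite polarity.
        g = 1.0 if POLARITY[y_hat] != POLARITY[y] else 0.0
        cross_entropy = -math.log(p[classes.index(y)])
        total += (1.0 + lam * g) * cross_entropy
    return total / len(labels)

classes = ["anger", "fear", "joy"]
same_polarity = pcce_loss([[0.6, 0.3, 0.1]], ["fear"], classes)      # predicts anger
opposite_polarity = pcce_loss([[0.1, 0.3, 0.6]], ["fear"], classes)  # predicts joy
print(opposite_polarity > same_polarity)  # True
```

Both toy predictions assign the same probability to the true class, so plain cross-entropy would treat them identically; only the polarity term separates them. Setting `lam=0` recovers the ordinary cross-entropy.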



Experiments

In this section, we evaluate the proposed VAANet model on emotion recognition in user-generated videos. We first introduce the employed benchmarks, compared baselines, and implementation details. We then report and analyze the major results together with some empirical analysis.

Experimental Settings


Datasets

We evaluate the performance of the proposed method on two publicly available datasets that contain emotion labels for user-generated videos: VideoEmotion-8 [jiang2014predicting] and Ekman-6 [xu2018heterogeneous].

VideoEmotion-8 [jiang2014predicting] (VE-8) consists of 1,101 videos collected from YouTube and Flickr with an average duration of 107 seconds. Each video is labeled with one of Plutchik’s eight basic categories [plutchik1980emotion]: the negative emotions anger, disgust, fear, and sadness, and the positive emotions anticipation, joy, surprise, and trust. Each category contains at least 100 videos. Ekman-6 [xu2018heterogeneous] (E-6) contains 1,637 videos, also collected from YouTube and Flickr, with an average duration of 112 seconds. The videos are labeled with Ekman’s six emotion categories [ekman1992argument], i.e. the negative emotions anger, disgust, fear, and sadness, and the positive emotions joy and surprise.


Baselines

To compare VAANet with the state-of-the-art approaches for video emotion recognition, we select the following methods as baselines: (1) SentiBank [borth2013sentibank], (2) Enhanced Multimodal Deep Boltzmann Machine (E-MDBM) [pang2015deep], (3) Image Transfer Encoding (ITE) [xu2018heterogeneous], (4) Visual+Audio+Attribute (V.+Au.+At.) [jiang2014predicting], (5) Context Fusion Net (CFN) [chen2016emotion], (6) V.+Au.+At.+E-MDBM [pang2015deep], (7) Kernelized and Kernelized+SentiBank [zhang2018recognition].

Implementation Details

Following [jiang2014predicting, zhang2018recognition], the experiments on VE-8 are run 10 times. In each run, we randomly select 2/3 of the data from each category for training and use the rest for testing. We report the average classification accuracy over the 10 runs. For E-6, we employ the split provided by the dataset, i.e. 819 videos for training and 818 for testing, and evaluate the classification accuracy on the test set. Our model is based on two state-of-the-art CNN architectures: 2D ResNet-18 [he2016deep] and 3D ResNet-101 [hara2018can], which are initialized with the weights pre-trained on ImageNet [deng2009imagenet] and Kinetics [carreira2017quo], respectively. For the visual stream, we divide the input video into 10 segments and sample 16 successive frames from each of them. We resize each frame so that its short side equals 112 pixels, and then apply random horizontal flips and crop a random 112 x 112 patch as data augmentation to reduce overfitting. For training, Adam [kingma2014adam] is adopted to automatically adjust the learning rate during optimization, with the initial learning rate set to 0.0002, and the model is trained with a batch size of 32 for 150 epochs. Our model is implemented using PyTorch.
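The VE-8 evaluation protocol (10 random runs, each with a per-category 2/3 training split) can be sketched as follows. The function names, the rounding of the split point, and the `evaluate` callback are assumptions made for illustration; `evaluate` stands in for training a model on the training indices and returning its accuracy on the test indices.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=2 / 3, rng=None):
    """Randomly pick `train_fraction` of the indices of each category for training."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        indices = indices[:]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)  # rounding rule is an assumption
        train += indices[:cut]
        test += indices[cut:]
    return train, test

def average_accuracy(labels, evaluate, runs=10, seed=0):
    """Average the accuracy returned by `evaluate(train, test)` over several runs."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train, test = stratified_split(labels, rng=rng)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / runs

labels = ["joy"] * 9 + ["fear"] * 6
train, test = stratified_split(labels, rng=random.Random(1))
print(len(train), len(test))  # 10 5
```

Splitting per category keeps the class proportions of the training and test sets close to those of the full dataset in every run.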

Method Visual Audio Attribute Auxiliary End-to-end Accuracy
SentiBank [borth2013sentibank] 35.5
E-MDBM [pang2015deep] 40.4
ITE [xu2018heterogeneous] 44.7
V.+Au.+At. [jiang2014predicting] 46.1
CFN [chen2016emotion] 50.4
V.+Au.+At.+E-MDBM [pang2015deep] 51.1
Kernelized [zhang2018recognition] 49.7
Kernelized+SentiBank [zhang2018recognition] 52.5
VAANet (Ours) 54.5
Table 1: Comparison between the proposed VAANet and several state-of-the-art methods on the VE-8 dataset, measured by classification accuracy (%). ‘Visual’, ‘Audio’, and ‘Attribute’ indicate whether the corresponding features are used, ‘Auxiliary’ indicates whether the method avoids auxiliary data other than the commonly used ImageNet [deng2009imagenet] and Kinetics [kay2017kinetics] pre-training data, and ‘End-to-end’ indicates whether the corresponding algorithm is trained in an end-to-end manner. The best method is emphasized in bold. Our method achieves the best results, outperforming the state-of-the-art approaches.
Method Accuracy
ITE [xu2018heterogeneous] 51.2
CFN [chen2016emotion] 51.8
Kernelized [zhang2018recognition] 54.4
VAANet (Ours) 55.3
Table 2: Comparison between the proposed VAANet and several state-of-the-art methods on the E-6 dataset. The best method is emphasized in bold. Our method performs better than the state-of-the-art approaches.

Comparison with the State-of-the-art

The extracted features, training strategies, and average performance comparisons between the proposed VAANet and the state-of-the-art approaches on the VE-8 and E-6 datasets are shown in Tables 1 and 2, respectively. From the results, we have the following observations:

(1) All these methods consider visual features, which is reasonable since the visual content in videos is the most direct way to evoke emotions. Further, existing methods all employ the traditional shallow learning pipeline, which indicates that the corresponding algorithms are trained step by step instead of end-to-end.

(2) Most previous methods extract attribute features. It is demonstrated that attributes indeed contribute to the emotion recognition task [chen2016emotion]. However, this requires some auxiliary data to train attribute classifiers. For example, though highly related to emotions, the adjective noun pairs obtained by SentiBank are trained on the VSO dataset [borth2013sentibank]. Besides high computation cost, the auxiliary data to train such attribute classifiers are often not available in real applications.

(3) Without extracting attribute features or requiring auxiliary data, the proposed VAANet is the only end-to-end model and achieves the best emotion recognition accuracy. Compared with the reported state-of-the-art results, i.e. Kernelized+SentiBank [zhang2018recognition] on VE-8 and Kernelized [zhang2018recognition] on E-6, VAANet can respectively obtain 2% and 0.9% performance gains. The performance improvements benefit from the advantages of VAANet. First, the various attentions enable the network to focus on discriminative key segments, spatial context, and channel interdependency. Second, the novel PCCE loss considers the polarity-emotion hierarchy constraint, i.e. the emotion correlations, which can guide the detailed learning process. Third, the visual features extracted by 3D ResNet-101 can model the temporal correlation of the adjacent frames in a given segment.

Figure 3: Visualization of the learned visual spatial attention and visual temporal attention. In both learned color bar and attention maps, red regions indicate more attention. The proposed VAANet can focus on the salient and discriminative frames and regions for emotion recognition in user-generated videos. Note that all the shown examples are drawn from the test set of VE-8.
Attentions Anger Anticipation Disgust Fear Joy Sadness Surprise Trust Average
AT 11.3 3.0 19.1 46.1 46.1 42.4 74.7 10.7 41.4
VS 45.4 27.9 44.8 59.5 49.1 51.0 65.1 35.6 51.7
VS+VCW 46.2 25.9 42.8 63.2 50.9 40.2 67.5 45.2 52.6
VS+VCW+VT 55.6 30.8 37.5 60.4 57.7 50.0 65.2 34.6 53.6
VS+VCW+VT+AT 48.2 24.1 33.3 55.9 55.9 52.5 77.1 35.6 54.5
Table 3: Ablation study of different attentions in the proposed VAANet for video emotion recognition on the VE-8 dataset, where ‘VS’, ‘VCW’, ‘VT’, and ‘AT’ are short for visual spatial, visual channel-wise, visual temporal, and audio temporal attentions, respectively. All the attentions contribute to the emotion recognition task.
Attentions Anger Disgust Fear Joy Sadness Surprise Average
AT 32.4 14.1 38.5 28.1 63.9 46.5 37.2
VS 59.9 44.8 49.7 46.2 35.3 62.5 50.8
VS+VCW 58.4 52.6 49.2 46.3 44.0 65.1 53.4
VS+VCW+VT 57.1 53.1 48.0 57.0 38.7 65.0 54.5
VS+VCW+VT+AT 55.1 50.6 45.7 53.7 50.3 68.9 55.3
Table 4: Ablation study of different attentions in the proposed VAANet for video emotion recognition on the E-6 dataset.

Ablation Study

The proposed VAANet model contains two major components: a novel attention mechanism and a novel cross-entropy loss. We conduct an ablation study to further verify their effectiveness by changing one component and fixing the other. First, using the polarity-consistent cross-entropy loss, we investigate the influence of different attentions, including visual spatial (VS), visual channel-wise (VCW), visual temporal (VT), and audio temporal (AT) ones. The recognition accuracy for each emotion category and the average accuracy on the VE-8 and E-6 datasets are shown in Tables 3 and 4, respectively. From the results, we can observe that: (1) visual attention, even with spatial attention alone, significantly outperforms audio attention (by more than 10 percentage points on average), which is understandable because in many videos the audio does not change much; (2) adding each attention introduces a gain in average accuracy, which demonstrates that all these attentions contribute to the video emotion recognition task; (3) though the audio features do not perform well alone, combining them with visual features boosts the average accuracy by about 1 percentage point.

Attentions Loss VE-8 E-6
VS+VCW+VT CE 51.9 52.0
VS+VCW+VT PCCE 53.6 54.5
VS+VCW+VT+AT CE 53.9 54.6
VS+VCW+VT+AT PCCE 54.5 55.3
Table 5: Performance comparison between traditional cross-entropy loss (CE) and our polarity-consistent cross-entropy loss (PCCE) measured by average accuracy.

Second, we evaluate the effectiveness of the proposed polarity-consistent cross-entropy loss (PCCE) by comparing it with the traditional cross-entropy loss (CE). Table 5 shows the results when visual attentions (VS+VCW+VT) and visual+audio attentions (VS+VCW+VT+AT) are considered. From the results, it is clear that PCCE performs better in both settings. According to Table 5, the improvements of PCCE over CE for visual attentions and visual+audio attentions are 1.7 and 0.6 percentage points on VE-8, and 2.5 and 0.7 percentage points on E-6, respectively. This demonstrates the effectiveness of the emotion hierarchy as prior knowledge. This novel loss can also be easily extended to other machine learning tasks if some prior knowledge is available.
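The gains of PCCE over CE can be recomputed directly from the entries of Table 5. The short check below is only arithmetic on the reported accuracies; the dictionary layout is chosen for illustration, and its output reflects the table entries themselves.

```python
# Average accuracies (%) copied from Table 5.
table5 = {
    ("VS+VCW+VT", "CE"): {"VE-8": 51.9, "E-6": 52.0},
    ("VS+VCW+VT", "PCCE"): {"VE-8": 53.6, "E-6": 54.5},
    ("VS+VCW+VT+AT", "CE"): {"VE-8": 53.9, "E-6": 54.6},
    ("VS+VCW+VT+AT", "PCCE"): {"VE-8": 54.5, "E-6": 55.3},
}

gains = {}
for attentions in ("VS+VCW+VT", "VS+VCW+VT+AT"):
    for dataset in ("VE-8", "E-6"):
        pcce = table5[(attentions, "PCCE")][dataset]
        ce = table5[(attentions, "CE")][dataset]
        # Round to one decimal to hide floating-point noise.
        gains[(attentions, dataset)] = round(pcce - ce, 1)

for key, gain in gains.items():
    print(key, gain)
# ('VS+VCW+VT', 'VE-8') 1.7
# ('VS+VCW+VT', 'E-6') 2.5
# ('VS+VCW+VT+AT', 'VE-8') 0.6
# ('VS+VCW+VT+AT', 'E-6') 0.7
```

In both datasets the gain is larger when only visual attentions are used, which suggests the polarity constraint matters most when the model has less information to work with.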


Visualization

In order to show the interpretability of our model, we use the heat map generated by the Grad-CAM algorithm [selvaraju2017grad] to visualize the visual spatial attention obtained by the proposed VAANet. The visual temporal attention generated by our model is also illustrated through the color bar. As illustrated in Figure 3, the well-trained VAANet can successfully pay more attention not only to the discriminative frames, but also to the salient regions in the corresponding frames. For example, in the top left test case, the key object that makes people feel ‘disgust’ is a caterpillar, which a man is touching with his finger. The model assigns the highest temporal attention when the finger is removed and the caterpillar is completely exposed to the camera. In the bottom left case, our model focuses on the person and the dog during the whole video. Further, when the dog rushes out from the bottom right corner and evokes ‘anticipation’ in the audience, the temporal attention becomes larger. In the middle bottom case, our model pays more attention when the ‘surprise’ comes up.


Conclusion

In this paper, we have proposed an effective method for emotion recognition in user-generated videos based on visual and audio attentions. The developed VAANet model consists of a novel attention mechanism and a novel cross-entropy loss, and uses less auxiliary data. By considering various attentions, VAANet can better focus on the discriminative key segments and their key regions. The polarity-consistent cross-entropy loss can guide the attention generation. The extensive experiments conducted on the VideoEmotion-8 and Ekman-6 benchmarks demonstrate that VAANet achieves accuracy improvements of 2.0 and 0.9 percentage points, respectively, over the best state-of-the-art video emotion recognition approach. In future studies, we plan to extend the VAANet model to both fine-grained emotion classification and emotion regression tasks. We also aim to investigate attentions that can better concentrate on the key frames in each video segment.


Acknowledgments

This work is supported by Berkeley DeepDrive, the National Natural Science Foundation of China (Nos. 61701273, 61876094, U1933114), the Major Project for New Generation of AI Grant (No. 2018AAA010040003), Natural Science Foundation of Tianjin, China (Nos. 18JCYBJC15400, 18ZXZNGX00110), and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).