The convenience of mobile devices and social networks has enabled users to record videos in daily life and upload them to the Internet to share their experiences and express personal opinions. As a result, an explosively growing volume of videos is being created, which results in an urgent demand for the analysis and management of these videos. Besides objective content recognition, such as objects and actions [zhu2018towards, choutas2018potion], understanding the emotional impact of videos plays an important role in human-centered computing. On the one hand, the videos can, to a large extent, reflect the psychological states of the video creators. We can predict the creators' possible extreme behaviors, such as depression and suicide, and take corresponding preventive actions. On the other hand, videos that evoke strong emotions can easily resonate with viewers and bring them immersive watching experiences. Appropriate emotional resonance is crucial in intelligent advertising and video recommendation. Further, emotion recognition in user-generated videos (UGVs) can help companies analyze how customers evaluate their products and assist governments in managing the Internet.
Although, with the advent of deep learning, remarkable progress has been made on text sentiment classification [zhang2018deep], image emotion analysis [zhao2018affective, zhao2018predicting, yang2018weakly], and video semantic understanding [zhu2018towards, choutas2018potion], emotion recognition in UGVs remains an unsolved problem, due to the following challenges. (1) Large intra-class variation. Videos captured in quite different scenarios may evoke similar emotions. For example, visiting an amusement park, taking part in a sports competition, and playing video games may all make viewers feel "excited". This results in an obvious "affective gap" between low-level features and high-level emotions. (2) Low structural consistency. Unlike professional and commercial videos, such as movies [wang2006affective] and GIFs [jou2014predicting, yang2019human], UGVs are usually taken with diverse structures, e.g. varying resolutions and blurring noise. (3) Sparse keyframe expression. Only a limited number of keyframes directly convey and determine emotions, as shown in Figure 1, while the rest introduce the background and context.
Most existing approaches to emotion recognition in UGVs focus on the first challenge, i.e. employing advanced image representations to bridge the affective gap, such as (1) mid-level attribute features [jiang2014predicting, tu2019multi] like ObjectBank [li2010object] and SentiBank [borth2013sentibank], (2) high-level semantic features [chen2016emotion] like detected events [jiang2017exploiting, caba2015activitynet], objects [deng2009imagenet], and scenes [zhou2014learning], and (3) deep convolutional neural network (CNN) features [xu2018heterogeneous, zhang2018recognition]. zhang2018recognition transformed frame-level spatial features into another kernelized feature space via the discrete Fourier transform, which partially addresses the second challenge. For the third challenge, the videos are either uniformly downsampled to a fixed number of frames [zhang2018recognition] or represented by continuous frames from only one segment [tu2019multi].
The above methods have contributed to the development of emotion recognition in UGVs, but they still have some problems. (1) They mainly employ a two-stage shallow pipeline, i.e. extracting visual and/or audio features and then training classifiers. (2) The visual CNN features of each frame are extracted separately, which ignores the temporal correlation of adjacent frames. (3) The fact that emotions may be determined by keyframes from several discrete segments is neglected. (4) Some methods require auxiliary data, which is not always available in real applications. For example, the extracted event, object, and scene features in [chen2016emotion] are trained on the FCVID [jiang2017exploiting] and ActivityNet [caba2015activitynet], ImageNet [deng2009imagenet], and Places205 [zhou2014learning] datasets, respectively. (5) They do not consider the correlations of different emotions, such as the polarity-emotion hierarchy constraint, i.e. two emotions belonging to the same polarity are more closely related than two from opposite polarities.
In this paper, we propose an end-to-end Visual-Audio Attention Network, termed VAANet, to address the above problems for recognizing the emotions in UGVs, without requiring any auxiliary data except the data for pre-training. First, we split each video into a fixed number of equal-duration segments. Second, for each segment, we randomly select some successive frames and feed them into a 3D CNN [hara2018can] with both spatial and channel-wise attentions to extract visual features. Meanwhile, we transform the corresponding audio waves into spectrograms and feed them into a 2D CNN [he2016deep] to extract audio features. Finally, the visual and audio features of the different segments are weighted by temporal attentions to obtain the whole video's feature representation, which is followed by a fully connected layer to obtain emotion predictions. Considering the polarity-emotion hierarchy constraint, we design a novel classification loss, i.e. the polarity-consistent cross-entropy (PCCE) loss, to guide the attention generation.
In summary, the contributions of this paper are threefold:
We are the first to study the emotion recognition task in user-generated videos in an end-to-end manner.
We develop a novel network architecture, i.e. VAANet, that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attention into an audio 2D CNN for video emotion recognition. We propose a novel PCCE loss, which enables VAANet to generate polarity-preserved attention maps.
We conduct extensive experiments on the VideoEmotion-8 [jiang2014predicting] and Ekman-6 [xu2018heterogeneous] datasets, and the results demonstrate the superiority of the proposed VAANet method, as compared to the state-of-the-art approaches.
Video Emotion Recognition: Psychologists usually employ two kinds of models to represent emotions: categorical emotion states (CES) and dimensional emotion spaces (DES). CES classify emotions into several basic categories, such as Ekman's 6 basic categories [ekman1992argument] and Plutchik's wheel of emotions [plutchik1980emotion]. DES usually employ a Cartesian space to represent emotions, such as valence-arousal-dominance [schlosberg1954three]. Since CES are easy for users to understand and label, here we adopt CES to represent emotions in videos.
Early research on video emotion recognition mainly focused on movies, which are well structured. kang2003affective employed a Hidden Markov Model to detect affective events based on low-level features, including color, motion, and shot cut rate. Jointly combining visual and audio features with support vector machines [wang2006affective] and conditional random fields [xu2013hierarchical] achieved promising results. Some recent methods work on animated GIFs [jou2014predicting, chen2016predicting, yang2019human]. jou2014predicting first proposed to recognize GIF emotions by using features of different types. chen2016predicting improved the performance by adopting 3D ConvNets to extract spatiotemporal features. Human-centered GIF emotion recognition has been conducted by considering human-related information and visual attention [yang2019human].
Because of their content diversity and low quality, recognizing emotions in UGVs is more challenging. jiang2014predicting investigated a large set of low-level visual-audio features and mid-level attributes, e.g. ObjectBank [li2010object] and SentiBank [borth2013sentibank]. chen2016emotion extracted various high-level semantic features based on existing detectors. Compared with hand-crafted features, deep features are more widely used [xu2018heterogeneous, zhang2018recognition]. By combining low-level visual-audio-textual features, pang2015deep showed that learned joint representations are complementary to hand-crafted features. Different from these methods, which employ a two-stage shallow pipeline, we propose the first end-to-end method to recognize emotions in UGVs by extracting attended visual and audio CNN features.
Please note that emotion recognition has also been widely studied in other modalities, such as text [zhang2018deep], images [zhao2017continuous, yang2018retrieving, zhao2018affective, yang2018weakly, zhao2018emotiongan, zhao2019cycleemotiongan, zhao2019pdanet, yao2019attention, zhan2019zero], speech [el2011survey], physiological signals [alarcao2017emotions, zhao2019personalized], and multi-modal data [soleymani2017survey, zhao2019affective].
Attention Models: Since attention can be considered a dynamic feature extraction mechanism that combines contextual fixations over time [mnih2014recurrent, chen2017sca], it has been seamlessly incorporated into deep learning architectures and achieved outstanding performance in many vision-related tasks, such as image classification [woo2018cbam], image captioning [you2016image, chen2017sca, chen2018show], and action recognition [song2017end]. These attention methods can be roughly divided into four categories: spatial attention [song2017end, woo2018cbam], semantic attention [you2016image], channel-wise attention [chen2017sca, woo2018cbam], and temporal attention [song2017end].
There are also several methods that employ attention for emotion recognition in images [you2017visual, yang2018weakly, zhao2019pdanet] and speech [mirsamadi2017automatic]. The former methods mainly consider spatial attention except PDANet [zhao2019pdanet] which also employs channel-wise attention, while the latter one only uses temporal attention. To the best of our knowledge, attention has not been studied on emotion recognition in user-generated videos. In this paper, we systematically investigate the influence of different attentions in video emotion recognition, including the importance of local spatial context by spatial attention, the interdependency between different channels by channel-wise attention, and the importance of different segments by temporal attention.
Visual-Audio Attention Network
We propose a novel CNN architecture with spatial, channel-wise, and temporal attention mechanisms for emotion recognition in user-generated videos. Figure 2 shows the overall framework of the proposed VAANet. Specifically, VAANet has two streams to respectively exploit the visual and audio information. The visual stream consists of three attention modules, and the audio stream contains a temporal attention module. The spatial attention and channel-wise attention sub-networks in the visual stream are designed to automatically focus on the regions and channels that carry discriminative information within each feature map. The temporal attention sub-networks in both the visual and audio streams are designed to assign weights to different segments of a video. The training of VAANet is performed by minimizing the newly designed polarity-consistent cross-entropy loss in an end-to-end manner.
Visual Representation Extraction
To extract visual representations from a long-term video, following [wang2016temporal], the visual stream of our model works on short snippets sparsely sampled from the entire video. Specifically, we divide each video into $N$ segments of equal duration, and then randomly sample a short snippet of $T$ successive frames from each segment. We use 3D ResNet-101 [hara2018can] as the backbone of the visual stream. It takes the $N$ snippets (each with $T$ successive frames) as input and independently processes each of them up to the last spatiotemporal convolutional layer conv5 into a super-frame. Suppose we are given $M$ training samples $\{(x_i, y_i)\}_{i=1}^{M}$, where $x_i$ is the visual information of video $i$, and $y_i$ is the corresponding emotion label. For sample $x_i$, suppose the feature map of conv5 in 3D ResNet-101 is $\mathbf{F}^V \in \mathbb{R}^{H \times W \times C \times N}$ (we omit $i$ for simplicity in the following), where $H$ and $W$ are the spatial size (height and width) of the feature map, $C$ is the number of channels, and $N$ is the number of snippets. We reshape $\mathbf{F}^V$ as
$$\mathbf{F}^V = [\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_N], \quad \mathbf{F}_n = [\mathbf{f}_{1,n}, \mathbf{f}_{2,n}, \ldots, \mathbf{f}_{K,n}],$$
by flattening the height and width of the original $\mathbf{F}^V$, where $K = H \times W$ and $\mathbf{f}_{k,n} \in \mathbb{R}^{C}$. Here we can consider $\mathbf{f}_{k,n}$ as the visual feature of the $k$-th location in the $n$-th super-frame. In the following, we omit the superscript $V$ for simplicity.
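The segment-wise snippet sampling described above can be sketched as follows (a minimal NumPy sketch; the function and parameter names are our own, not from the paper):

```python
import numpy as np

def sample_snippets(num_frames, num_segments=10, snippet_len=16, seed=0):
    """Split a video into equal-duration segments and randomly pick one
    snippet of successive frame indices from each segment (illustrative)."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, num_frames, num_segments + 1).astype(int)
    snippets = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # latest valid start so the snippet stays inside the segment when possible
        start = rng.integers(lo, max(lo + 1, hi - snippet_len + 1))
        idx = np.minimum(np.arange(start, start + snippet_len), num_frames - 1)
        snippets.append(idx)
    return np.stack(snippets)  # shape: (num_segments, snippet_len)
```

Each row of the returned array indexes one snippet; short segments are handled by clipping indices to the last frame.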
Visual Spatial Attention Estimation
We employ a spatial attention module to automatically explore the different contributions of the regions in super-frames to predicting the emotions. Following [chen2017sca], we employ a two-layer neural network, i.e. a convolutional layer followed by a fully-connected layer with a softmax function, to generate the spatial attention distribution over all the super-frame regions. That is, for each super-frame $n$,
$$\boldsymbol{\alpha}_n = \mathrm{softmax}\big(\mathbf{w}_s^\top \tanh(\mathbf{W}_s \mathbf{F}_n)\big),$$
where $\mathbf{W}_s \in \mathbb{R}^{d \times C}$ and $\mathbf{w}_s \in \mathbb{R}^{d}$ are two learnable parameter matrices, $\top$ denotes the transpose of a matrix, and $\boldsymbol{\alpha}_n \in \mathbb{R}^{K}$.
We can then obtain a weighted feature map based on the spatial attention as follows:
$$\widetilde{\mathbf{F}}_n = \mathbf{F}_n \otimes \boldsymbol{\alpha}_n,$$
where $\otimes$ is the multiplication of a matrix and a vector, performed by multiplying each value in the vector with the corresponding column of the matrix.
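The spatial attention computation can be sketched numerically as follows (a minimal NumPy sketch under the notation above; the hidden dimension d and all names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(F, Ws, ws):
    """F: (C, K, N) reshaped conv5 features (K = H*W locations, N super-frames);
    Ws: (d, C) and ws: (d,) are the learnable parameters.
    Returns the attended map (C, K, N) and the weights (K, N)."""
    C, K, N = F.shape
    alpha = np.empty((K, N))
    F_att = np.empty_like(F)
    for n in range(N):
        scores = ws @ np.tanh(Ws @ F[:, :, n])     # one score per location
        alpha[:, n] = softmax(scores)
        F_att[:, :, n] = F[:, :, n] * alpha[:, n]  # scale each location column
    return F_att, alpha
```

The per-location scaling in the last line is exactly the matrix-vector multiplication described above.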
Visual Channel-Wise Attention Estimation
Assuming that each channel of a feature map in a CNN is a response activation of the corresponding convolutional filter, channel-wise attention can be viewed as a process of selecting semantic attributes [chen2017sca]. To generate the channel-wise attention, we first transpose $\widetilde{\mathbf{F}}$ to
$$\mathbf{G} = [\mathbf{G}_1, \mathbf{G}_2, \ldots, \mathbf{G}_N], \quad \mathbf{G}_n = [\mathbf{g}_{1,n}, \mathbf{g}_{2,n}, \ldots, \mathbf{g}_{C,n}],$$
where $\mathbf{g}_{c,n} \in \mathbb{R}^{K}$ represents the $c$-th channel in the $n$-th super-frame of the feature map $\mathbf{G}$. The channel-wise attention for $\mathbf{G}_n$ is defined as
$$\boldsymbol{\beta}_n = \mathrm{softmax}\big(\mathbf{w}_c^\top \tanh(\mathbf{W}_c \mathbf{G}_n)\big),$$
where $\mathbf{W}_c \in \mathbb{R}^{d \times K}$ and $\mathbf{w}_c \in \mathbb{R}^{d}$ are two learnable parameter matrices, and $\boldsymbol{\beta}_n \in \mathbb{R}^{C}$.
A weighted feature map based on the channel-wise attention is then computed as follows:
$$\widetilde{\mathbf{G}}_n = \mathbf{G}_n \otimes \boldsymbol{\beta}_n,$$
where $\otimes$ is the multiplication of a matrix and a vector, as above.
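Channel-wise attention is the same construction applied over channels instead of locations; a minimal NumPy sketch (names and the hidden dimension d are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_attention(G, Wc, wc):
    """G: (C, K, N) spatially-attended features viewed channel-wise;
    Wc: (d, K) and wc: (d,) are the learnable parameters.
    The softmax runs over the C channels.
    Returns the attended map (C, K, N) and the weights (C, N)."""
    C, K, N = G.shape
    beta = np.empty((C, N))
    G_att = np.empty_like(G)
    for n in range(N):
        # score each channel from its K-dimensional spatial response
        scores = wc @ np.tanh(Wc @ G[:, :, n].T)        # (C,)
        beta[:, n] = softmax(scores)
        G_att[:, :, n] = G[:, :, n] * beta[:, n, None]  # scale each channel row
    return G_att, beta
```

Compared with the spatial module, only the axis over which scores are computed and normalized changes.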
Visual Temporal Attention Estimation
For a video, the discriminability of each frame for recognizing emotions is obviously different. Only some keyframes contain discriminative information, while the others merely provide background and context information [song2017end]. Based on this observation, we design a temporal attention sub-network to automatically focus on the important segments that contain keyframes. To generate the temporal attention, we first apply spatial average pooling to $\widetilde{\mathbf{G}}$ and reshape it to
$$\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_N],$$
where $\mathbf{p}_n \in \mathbb{R}^{C}$. Here we can consider $\mathbf{p}_n$ as the visual feature of the $n$-th super-frame. The temporal attention is then defined as
$$\boldsymbol{\gamma} = \mathrm{softmax}\big(\mathbf{w}_t^\top \tanh(\mathbf{W}_t \mathbf{P})\big),$$
where $\mathbf{W}_t \in \mathbb{R}^{d \times C}$ and $\mathbf{w}_t \in \mathbb{R}^{d}$ are two learnable parameter matrices, and $\boldsymbol{\gamma} \in \mathbb{R}^{N}$. Following [song2017end], the final visual embedding is obtained as the attention-weighted sum of the super-frame features:
$$\mathbf{e}^V = \sum_{n=1}^{N} \gamma_n \mathbf{p}_n.$$
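The temporal pooling step can be sketched as follows (a minimal NumPy sketch consistent with the notation above; names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_pool(P, Wt, wt):
    """P: (C, N) pooled per-super-frame features; Wt: (d, C), wt: (d,).
    Returns the attention weights (N,) and the weighted video embedding (C,)."""
    gamma = softmax(wt @ np.tanh(Wt @ P))  # one weight per segment
    e_v = P @ gamma                        # weighted sum of segment features
    return gamma, e_v
```

The same routine serves the audio stream, with the pooled audio segment features substituted for P.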
Audio Representation Extraction
Audio features are complementary to visual features, because they carry information from another modality. In our problem, we choose the most widely used audio representation: the mel-frequency cepstral coefficients (MFCC). Suppose we are given $M$ audio training samples $\{(a_i, y_i)\}_{i=1}^{M}$, where $a_i$ is a descriptor computed from the entire soundtrack of video $i$ and $y_i$ is the corresponding emotion label. We center-crop $a_i$ to a fixed length $L$, and pad it by repeating itself when necessary. Similar to the visual representation extraction, we divide each descriptor into $N$ segments and use 2D ResNet-18 [he2016deep] as the backbone of the audio stream of our model, which processes the descriptor segments independently. For descriptor $a_i$, suppose the feature map of conv5 in 2D ResNet-18 is $\mathbf{F}^A \in \mathbb{R}^{H' \times W' \times C' \times N}$ (we omit $i$ for simplicity in the following), where $H'$ and $W'$ are the height and width of the feature map, $C'$ is the number of channels, and $N$ is the number of segments. We apply spatial average pooling to $\mathbf{F}^A$ and obtain $\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_N]$, where $\mathbf{q}_n \in \mathbb{R}^{C'}$.
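The center-crop-or-pad step can be sketched as follows (a minimal NumPy sketch; padding by tiling the descriptor with itself is one plausible reading of "pad itself", and the names are our own):

```python
import numpy as np

def crop_or_pad(mfcc, target_len):
    """mfcc: (n_coeffs, T) MFCC descriptor. Center-crop the time axis to
    target_len; when the clip is too short, pad by repeating the descriptor."""
    n_coeffs, T = mfcc.shape
    if T < target_len:
        reps = -(-target_len // T)  # ceiling division
        mfcc = np.tile(mfcc, (1, reps))
        T = mfcc.shape[1]
    start = (T - target_len) // 2
    return mfcc[:, start:start + target_len]
```

The fixed-length output is then split into the same number of segments as the visual stream.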
Audio Temporal Attention Estimation
With a motivation similar to that of the temporal attention sub-network in the visual stream, we introduce a temporal attention sub-network to explore the influence of the audio information in different segments for recognizing emotions:
$$\boldsymbol{\delta} = \mathrm{softmax}\big(\mathbf{w}_a^\top \tanh(\mathbf{W}_a \mathbf{Q})\big),$$
where $\mathbf{W}_a \in \mathbb{R}^{d \times C'}$ and $\mathbf{w}_a \in \mathbb{R}^{d}$ are two learnable parameter matrices, and $\boldsymbol{\delta} \in \mathbb{R}^{N}$. The final audio embedding is the weighted sum of all the segments:
$$\mathbf{e}^A = \sum_{n=1}^{N} \delta_n \mathbf{q}_n.$$
Polarity-Consistent Cross-Entropy Loss
We concatenate $\mathbf{e}^V$ and $\mathbf{e}^A$ to obtain an aggregated semantic vector $\mathbf{e} = [\mathbf{e}^V; \mathbf{e}^A]$, which can be viewed as the final representation of a video and is fed into a fully connected layer to predict the emotion labels. The traditional cross-entropy loss is defined as
$$\mathcal{L}_{CE} = -\sum_{i=1}^{M} \sum_{j=1}^{E} y_{i,j} \log p_{i,j},$$
where $E$ is the number of emotion classes ($E = 8$ for VideoEmotion-8 and $E = 6$ for Ekman-6 in this paper), $y_{i,j}$ is a binary indicator of whether video $i$ belongs to class $j$, and $p_{i,j}$ is the predicted probability that video $i$ belongs to class $j$.
Directly optimizing the cross-entropy loss in Eq. (12) can lead some videos to be incorrectly classified into categories of the opposite polarity. In this paper, we design a novel polarity-consistent cross-entropy (PCCE) loss to guide the attention generation. That is, we increase the penalty on predictions whose polarity is opposite to that of the ground truth. The PCCE loss is defined as
$$\mathcal{L}_{PCCE} = -\sum_{i=1}^{M} \big(1 + \lambda\, g(\hat{y}_i, y_i)\big) \sum_{j=1}^{E} y_{i,j} \log p_{i,j},$$
where $\lambda$ is a penalty coefficient that controls the penalty extent and $\hat{y}_i$ is the predicted emotion of video $i$. Similar to an indicator function, $g(\cdot,\cdot)$ represents whether to add the penalty or not and is defined as
$$g(\hat{y}, y) = \begin{cases} 1, & \mathrm{polarity}(\hat{y}) \neq \mathrm{polarity}(y), \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathrm{polarity}(\cdot)$ is a function that maps an emotion category to its polarity (positive or negative). Since the derivatives with respect to all parameters can be computed, we can train the proposed VAANet effectively in an end-to-end manner using an off-the-shelf optimizer to minimize the loss function in Eq. (13).
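A per-sample version of the PCCE loss can be sketched as follows (a minimal NumPy sketch under our reading of the loss; the polarity mapping shown is the VE-8 categorization and the names are illustrative):

```python
import numpy as np

# polarity mapping for the eight VE-8 categories (0 = negative, 1 = positive)
POLARITY = {"anger": 0, "disgust": 0, "fear": 0, "sadness": 0,
            "anticipation": 1, "joy": 1, "surprise": 1, "trust": 1}
CLASSES = list(POLARITY)

def pcce_loss(probs, true_idx, lam=0.5):
    """probs: (E,) predicted class probabilities for one video;
    true_idx: ground-truth class index; lam: polarity penalty coefficient.
    Returns the polarity-consistent cross-entropy for this sample."""
    pred_idx = int(np.argmax(probs))
    # add the extra penalty only when the predicted polarity is wrong
    penalty = POLARITY[CLASSES[pred_idx]] != POLARITY[CLASSES[true_idx]]
    return (1.0 + lam * penalty) * -np.log(probs[true_idx])
```

When the predicted and ground-truth polarities agree, the loss reduces to the plain cross-entropy term; otherwise it is scaled up by the factor (1 + lam).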
In this section, we evaluate the proposed VAANet model on emotion recognition in user-generated videos. We first introduce the employed benchmarks, compared baselines, and implementation details. We then report and analyze the major results together with some empirical analysis.
We evaluate the performances of the proposed method on two publicly available datasets that contain emotion labels in user-generated videos: VideoEmotion-8 [jiang2014predicting] and Ekman-6 [xu2018heterogeneous].
VideoEmotion-8 [jiang2014predicting] (VE-8) consists of 1,101 videos collected from YouTube and Flickr with an average duration of 107 seconds. The videos are labeled with one of Plutchik's eight basic categories [plutchik1980emotion]: the negative emotions anger, disgust, fear, and sadness, and the positive emotions anticipation, joy, surprise, and trust. Each category contains at least 100 videos. Ekman-6 [xu2018heterogeneous] (E-6) contains 1,637 videos, also collected from YouTube and Flickr, with an average duration of 112 seconds. The videos are labeled with Ekman's six emotion categories [ekman1992argument], i.e. the negative emotions anger, disgust, fear, and sadness, and the positive emotions joy and surprise.
To compare VAANet with the state-of-the-art approaches for video emotion recognition, we select the following methods as baselines: (1) SentiBank [borth2013sentibank], (2) Enhanced Multimodal Deep Boltzmann Machine (E-MDBM) [pang2015deep], (3) Image Transfer Encoding (ITE) [xu2018heterogeneous], (4) Visual+Audio+Attribute (V.+Au.+At.) [jiang2014predicting], (5) Context Fusion Net (CFN) [chen2016emotion], (6) V.+Au.+At.+E-MDBM [pang2015deep], (7) Kernelized and Kernelized+SentiBank [zhang2018recognition].
Following [jiang2014predicting, zhang2018recognition], the experiments on VE-8 are conducted in 10 runs. In each run, we randomly select 2/3 of the data from each category for training and the rest for testing, and we report the average classification accuracy over the 10 runs. For E-6, we employ the split provided by the dataset, i.e. 819 videos for training and 818 for testing, and evaluate the classification accuracy on the test set. Our model is based on two state-of-the-art CNN architectures: 2D ResNet-18 [he2016deep] and 3D ResNet-101 [hara2018can], which are initialized with weights pre-trained on ImageNet [deng2009imagenet] and Kinetics [carreira2017quo], respectively. In addition, for the visual stream, we divide the input video into 10 segments and sample 16 successive frames from each of them. We resize each frame so that its shorter side is 112 pixels, and then apply random horizontal flips and crop a random 112 x 112 patch as data augmentation to reduce overfitting. In training, Adam [kingma2014adam] is adopted to automatically adjust the learning rate during optimization, with the initial learning rate set to 0.0002, and the model is trained with batch size 32 for 150 epochs. Our model is implemented in PyTorch.
Comparison with the State-of-the-art
The extracted features, training strategies, and average performance comparisons between the proposed VAANet and the state-of-the-art approaches are shown in Tables 1 and 2 for the VE-8 and E-6 datasets, respectively. From the results, we have the following observations:
(1) All these methods consider visual features, which is reasonable since the visual content in videos is the most direct way to evoke emotions. Further, existing methods all employ the traditional shallow learning pipeline, which indicates that the corresponding algorithms are trained step by step instead of end-to-end.
(2) Most previous methods extract attribute features. It is demonstrated that attributes indeed contribute to the emotion recognition task [chen2016emotion]. However, this requires some auxiliary data to train attribute classifiers. For example, though highly related to emotions, the adjective noun pairs obtained by SentiBank are trained on the VSO dataset [borth2013sentibank]. Besides high computation cost, the auxiliary data to train such attribute classifiers are often not available in real applications.
(3) Without extracting attribute features or requiring auxiliary data, the proposed VAANet is the only end-to-end model and achieves the best emotion recognition accuracy. Compared with the reported state-of-the-art results, i.e. Kernelized+SentiBank [zhang2018recognition] on VE-8 and Kernelized [zhang2018recognition] on E-6, VAANet can respectively obtain 2% and 0.9% performance gains. The performance improvements benefit from the advantages of VAANet. First, the various attentions enable the network to focus on discriminative key segments, spatial context, and channel interdependency. Second, the novel PCCE loss considers the polarity-emotion hierarchy constraint, i.e. the emotion correlations, which can guide the detailed learning process. Third, the visual features extracted by 3D ResNet-101 can model the temporal correlation of the adjacent frames in a given segment.
The proposed VAANet model contains two major components: a novel attention mechanism and a novel cross-entropy loss. We conduct an ablation study to further verify their effectiveness by changing one component and fixing the other. First, using the polarity-consistent cross-entropy loss, we investigate the influence of different attentions, including visual spatial (VS), visual channel-wise (VCW), visual temporal (VT), and audio temporal (AT) ones. The emotion recognition accuracy of each emotion category and the average accuracy on the VE-8 and E-6 datasets are shown in Tables 3 and 4, respectively. From the results, we can observe that: (1) the visual attentions, even when only spatial attention is used, significantly outperform the audio attention (on average more than 10% improvement), which is understandable because in many videos the audio does not change much; (2) adding each attention introduces performance gains, which demonstrates that all these attentions contribute to the video emotion recognition task; (3) though not performing well alone, combining audio features with visual features boosts the performance by about 1% in accuracy.
Second, we evaluate the effectiveness of the proposed polarity-consistent cross-entropy loss (PCCE) by comparing with traditional cross-entropy loss (CE). Table 5
shows the results when visual attentions (VS+VCW+VT) and visual+audio attentions (VS+VCW+VT+AT) are considered. From the results, it is clear that for both settings, PCCE performs better. The performance improvements of PCCE over CE for visual attentions and visual+audio attentions are 1.5%, 0.6% and 2.5%, 0.7% on the VE-8 and E-6 datasets, respectively. This demonstrates the effectiveness of emotion hierarchy as prior knowledge. This novel loss can also be easily extended to other machine learning tasks if some prior knowledge is available.
In order to show the interpretability of our model, we use the heat maps generated by the Grad-CAM algorithm [selvaraju2017grad] to visualize the visual spatial attention obtained by the proposed VAANet. The visual temporal attention generated by our model is also illustrated through the color bar. As illustrated in Figure 3, the well-trained VAANet can successfully pay more attention not only to the discriminative frames, but also to the salient regions within those frames. For example, in the top left test case, the key object that makes people feel 'disgust' is a caterpillar, and a man is touching it with his finger. The model assigns the highest temporal attention when the finger is removed and the caterpillar is completely exposed to the camera. In the bottom left case, our model focuses on the person and the dog during the whole video. Further, when the dog rushes out from the bottom right corner and evokes 'anticipation' in the audience, the temporal attention becomes larger. In the bottom middle case, our model pays more attention when the 'surprise' comes up.
In this paper, we have proposed an effective emotion recognition method for user-generated videos based on visual and audio attentions. The developed VAANet model consists of a novel attention mechanism and a novel cross-entropy loss, without requiring auxiliary data beyond that used for pre-training. By considering various attentions, VAANet can better focus on the discriminative key segments and their key regions. The polarity-consistent cross-entropy loss further guides the attention generation. Extensive experiments conducted on the VideoEmotion-8 and Ekman-6 benchmarks demonstrate that VAANet achieves 2.0% and 0.9% performance improvements over the best state-of-the-art video emotion recognition approaches. In future studies, we plan to extend the VAANet model to both fine-grained emotion classification and emotion regression tasks. We also aim to investigate attentions that can better concentrate on the keyframes in each video segment.
This work is supported by Berkeley DeepDrive, the National Natural Science Foundation of China (Nos. 61701273, 61876094, U1933114), the Major Project for New Generation of AI Grant (No. 2018AAA010040003), Natural Science Foundation of Tianjin, China (Nos.18JCYBJC15400, 18ZXZNGX00110), and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).