Affective understanding aims to predict the viewer expressions evoked by videos, and can be applied to video creation and recommendation. With the rapid growth of video content on the Internet, this technique has become even more desirable and has attracted considerable interest recently[baveye2017affective, wang2015video]. However, due to the affective gap between video content and the emotional response, it remains a challenging topic.
The mainstream methods in this area are multi-modal models, which combine visual and auditory features for affective prediction[poria2015towards, sun2019gla]. However, for the recently proposed frame-level affective prediction task[sun2021eev], the expression at a specific frame may be determined by signals at various timescales, so these previous methods may be insufficient and imprecise. In this paper, we propose a Multi-Granularity Network with Modal Attention, named MGN-MA, which extends previous multi-modal features to multiple granularity levels for frame-level affective understanding. Specifically, the multi-granularity features can be divided into three levels with different timescales: frame-level features, clip-level features and video-level features. The frame-level feature captures the visually salient content of the frame and the audio style at that moment. The clip-level feature represents semantic-context information around the frame, e.g. human behaviors and semantic information from speech. The video-level feature consists of the main affective information of the whole video, which is related to the video's theme. Taking the above features into consideration, we employ a modal attention module with modal dropout to further emphasize the more affection-relevant modalities. Finally, the fused representation goes through an MOE model[jordan1994hierarchical] for expression prediction. Experiments on the Evoked Expressions from Videos (EEV) dataset verify the effectiveness of our proposed method.
The overall framework of our proposed Multi-Granularity Network with Modal Attention is illustrated in Fig. 2, and can be divided into Multi-Granularity Feature Construction, Modal Attention Fusion and Expression Classification. Taking frames, clips and the whole video as multi-granularity input, three sub-networks are employed to extract frame-level, clip-level and video-level features, which together constitute the Multi-Granularity Feature (MGF). The MGF is then fed into the Modal Attention Fusion (MAF) module to obtain the multi-modal feature. Finally, an MOE model is used to predict the evoked expressions of the frame.
2.1 Multi-Granularity Feature Construction
To provide sufficient information for frame-level expression prediction, we construct the Multi-Granularity Feature, which contains frame-level, clip-level and video-level features. The details of feature extraction are discussed below.
2.1.1 Frame-level Features
Image Feature. We employ the Inception-ResNet-v2[szegedy2017inception] architecture pretrained on ImageNet[deng2009imagenet] for image feature extraction, which represents the visual-semantic content appearing in the frame. We take the output of the last hidden layer as the feature, obtaining a 1536-D vector for each frame. Image features are extracted at 6 Hz.
Audio Feature. We employ a VGG-style model provided by AudioSet[gemmeke2017audio], trained on a preliminary version of YouTube-8M[abu2016youtube], for audio feature extraction. Following the method of [hershey2017cnn], the raw audio signal is first divided into 960 ms frames, then decomposed with a short-time Fourier transform applying 25 ms windows every 10 ms, which yields log-mel spectrogram patches of 96×64 bins. Each spectrogram patch is fed into the model to obtain a 128-D embedding.
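The framing described above can be sketched as follows. This is a minimal numpy illustration, not the AudioSet reference implementation: the 16 kHz sample rate, Hann window, 125–7500 Hz mel range and triangular filterbank construction are our assumptions about typical settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=64, n_fft=512, sr=16000, fmin=125.0, fmax=7500.0):
    # triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft // 2 + 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_patch(x, sr=16000, win=400, hop=160, n_fft=512, n_frames=96):
    # 25 ms windows (400 samples at 16 kHz) every 10 ms (160 samples),
    # giving 96 windows per 960 ms audio frame
    need = (n_frames - 1) * hop + win
    x = np.pad(x, (0, max(0, need - len(x))))
    frames = np.stack([x[i * hop:i * hop + win] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), n_fft))  # (96, 257)
    mel = spec @ mel_filterbank(n_fft=n_fft, sr=sr).T            # (96, 64)
    return np.log(mel + 1e-6)

patch = log_mel_patch(np.random.randn(15360))  # one 960 ms frame at 16 kHz
# patch.shape == (96, 64), matching the 96x64-bin patches described above
```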
2.1.2 Clip-level Features
Action Feature. An S3D[xie2018rethinking] CNN backbone is used to extract action features for human behaviors. The S3D backbone is trained with the MIL-NCE method[miech2020end], which exploits self-supervised correspondence between visual content and speech transcripts on the HowTo100M[miech2019howto100m] narrated video dataset. Around the target frame, we extract a context video clip of 32 frames sampled at 10 fps, and the 512-D output of the S3D linear layer is used as the embedding.
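The clip sampling can be sketched as below. The paper only states that the clip is taken "around the target frame"; centring the 32-frame window on the target, and the clamping at the video start, are our assumptions for illustration.

```python
import numpy as np

def clip_frame_indices(center_idx, video_fps, n_frames=32, clip_fps=10):
    """Indices of an n_frames context clip sampled at clip_fps from a video
    recorded at video_fps, roughly centred on frame center_idx."""
    step = video_fps / clip_fps                       # source frames per sample
    offsets = (np.arange(n_frames) - n_frames // 2) * step
    idx = np.round(center_idx + offsets).astype(int)
    return np.clip(idx, 0, None)                      # clamp at the video start

idx = clip_frame_indices(center_idx=100, video_fps=30)
# 32 indices spanning 3.2 s of video, with the target frame in the middle
```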
Subtitle Feature. Subtitles (or automatically generated subtitles) are downloaded for each video, and the speech text nearest to each frame is obtained. A BERT-base model pretrained on the BooksCorpus (800M words)[devlin2018bert] and English Wikipedia (2,500M words) is then applied to extract a 768-D embedding of the speech text.
2.1.3 Video-level Features
To extract the video-level feature, we train a video-level expression prediction network on the EEV dataset[sun2021eev] and use the 1024-D output of its last hidden layer. The video-level expression label is obtained by averaging the frame labels. For each video, we take 80 uniformly sampled video frames, the corresponding audio frames and the video title as input, and extract multi-modal features with the above Inception-ResNet-v2[szegedy2017inception], VGG-style[hershey2017cnn] and BERT[devlin2018bert] models. The video-frame and audio-frame features are each fed into their own NetVLAD sub-network for temporal pooling, and then merged with the title feature into the video-level feature. We propose a modal attention fusion module for feature fusion, which is discussed in the following section.
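NetVLAD temporal pooling aggregates the 80 per-frame features into a single fixed-length descriptor. The following is a minimal numpy sketch of the standard NetVLAD computation (soft-assign, residual aggregation, intra- and global L2 normalization); the parameter shapes are illustrative, not the paper's configuration.

```python
import numpy as np

def netvlad_pool(features, centers, W, b):
    """features: (T, D) per-frame features; centers: (K, D) learned cluster
    centres; W: (D, K) and b: (K,) soft-assignment parameters.
    Returns a (K*D,) video-level descriptor."""
    logits = features @ W + b                             # (T, K)
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)                  # soft cluster assignments
    resid = features[:, None, :] - centers[None, :, :]    # (T, K, D) residuals
    vlad = np.einsum('tk,tkd->kd', a, resid)              # aggregate per cluster
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                # global L2-norm
```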
2.2 Modal Attention Fusion
The modal attention fusion module is proposed to emphasize the more affection-relevant features among the above multi-granularity multi-modal features. Specifically, the weighting value of each feature is adaptively predicted by an attention module, which outputs normalized weights via softmax. The input of the modal attention fusion module is the concatenation of the multi-granularity multi-modal features. By multiplying each weight with its corresponding feature, the merged multi-modal feature is obtained:

$$F = \big\Vert_{i=1}^{M} w_i f_i, \qquad [w_1, \dots, w_M] = \mathrm{softmax}\big(g(f_1 \Vert f_2 \Vert \cdots \Vert f_M)\big)$$

where $\Vert$ is the concatenation operator, $w_i$ is the weight of feature $f_i$, $g(\cdot)$ is the attention module and $M$ is the total number of modalities. A modal-level dropout mechanism randomly replaces individual features with zero vectors during training, which improves the robustness of the model.
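The fusion step can be sketched as follows. The one-hidden-layer attention network (`W`, `v`) is our assumption; the paper does not specify the attention module's architecture.

```python
import numpy as np

def modal_attention_fusion(feats, W, v, drop_p=0.0, rng=None):
    """feats: list of M modal feature vectors (possibly different dims).
    W: (sum_dims, H), v: (H, M) parameters of an assumed small MLP that
    predicts one attention logit per modality from the concatenation."""
    if rng is not None and drop_p > 0.0:
        # modal-level dropout: randomly zero out whole modalities in training
        feats = [np.zeros_like(f) if rng.random() < drop_p else f for f in feats]
    x = np.concatenate(feats)
    logits = np.tanh(x @ W) @ v                  # (M,) one logit per modality
    w = np.exp(logits - logits.max())
    w = w / w.sum()                              # softmax-normalized weights
    # scale each modality by its weight, then concatenate into the fused feature
    return np.concatenate([wi * f for wi, f in zip(w, feats)])
```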
2.3 Expression Classification
A 3-expert MOE[jordan1994hierarchical] model is used for expression classification. To train all the parameters in our model, we minimize the classification loss between the predicted score and the ground-truth score. The predicted score $s_{ij}$ is first converted to a probabilistic score $p_{ij}$ by the Sigmoid function:

$$p_{ij} = \sigma(s_{ij}) = \frac{1}{1 + e^{-s_{ij}}}$$

Then the classification loss function is computed as:

$$\mathcal{L} = -\frac{1}{BC} \sum_{i=1}^{B} \sum_{j=1}^{C} \big[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \big]$$

where $B$ is the batch size, $C$ is the class number and $y_{ij}$ is the expression label of the $i$-th sample for the $j$-th class.
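The classifier and loss above can be sketched in a few lines. Using purely linear experts and a linear gate is a simplification for illustration; the paper does not detail the expert architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_scores(x, expert_W, gate_W):
    """x: (B, D) fused features; expert_W: (K, D, C) per-expert classifiers;
    gate_W: (D, K) gating weights. Returns (B, C) scores s."""
    gates = softmax(x @ gate_W)                          # (B, K) mixture weights
    experts = np.einsum('bd,kdc->bkc', x, expert_W)      # (B, K, C) expert outputs
    return np.einsum('bk,bkc->bc', gates, experts)       # gate-weighted sum

def bce_loss(scores, labels):
    p = 1.0 / (1.0 + np.exp(-scores))                    # sigmoid -> probabilities
    p = np.clip(p, 1e-7, 1.0 - 1e-7)                     # numerical safety
    # binary cross-entropy averaged over batch B and classes C
    return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
```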
3.1 Datasets and Protocols
EEV dataset. The dataset contains 8 million annotations of viewer facial reactions to 5,153 videos (370 hours). Each video is annotated at 6 Hz with 15 continuous evoked expression labels, corresponding to the facial expressions of viewers reacting to the video.
Protocols. Performance is evaluated using the Pearson correlation computed for each expression in each video, then averaged over expressions and videos. The correlation is computed with scipy:

$$\rho = \frac{\sum_{t}(x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t}(x_t - \bar{x})^2}\sqrt{\sum_{t}(y_t - \bar{y})^2}}$$

where $x$ denotes the predicted values (in $[0, 1]$) for each expression, $y$ denotes the ground-truth values (in $[0, 1]$) for each expression, $\bar{x}$ is the average of $x$ and $\bar{y}$ is the average of $y$. Note that the correlation is computed over each video.
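The evaluation protocol can be sketched as below (a numpy reimplementation of the Pearson correlation rather than a call into scipy, for self-containment; the dict-of-arrays data layout is our assumption).

```python
import numpy as np

def pearson_corr(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (np.sqrt((xc ** 2).sum() * (yc ** 2).sum()) + 1e-12)

def eev_score(preds, gts):
    """preds/gts: dict of video_id -> (T, C) arrays of per-frame scores for
    C = 15 expressions. Correlation is computed per expression per video,
    averaged over expressions, then averaged over videos."""
    per_video = []
    for vid, p in preds.items():
        g = gts[vid]
        per_video.append(np.mean([pearson_corr(p[:, c], g[:, c])
                                  for c in range(p.shape[1])]))
    return float(np.mean(per_video))
```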
3.2 Implementation Details
Our model is trained with the Adam optimizer for 30 epochs, starting with a learning rate of 0.0001 and a mini-batch size of 1536. The dimension of the multi-modal feature is set to 1024. For the mixture of experts, we apply batch normalization before the non-linear layer. To avoid over-fitting, we select the snapshot of the model that achieves the best correlation on the validation set. Since some videos have been deleted or made private, we use the available 3024 videos for training and 730 videos for validation. The models are implemented in TensorFlow.
To evaluate the importance of the different granularities of features, we start with frame-level features and gradually add the clip-level and video-level features; the experimental results on the test set of the EEV dataset are shown in Table 1. For the frame-level features, the image feature and audio feature are first evaluated separately, achieving correlation scores of 0.00349 and 0.00574 respectively, which implies that audio is more effective than visual content for affective prediction. Combining the image and audio features further increases the performance to 0.00739. After introducing the clip-level features, the correlation score increases markedly by 0.00589. This indicates the importance of the semantic-context information around the frame, which the image and audio features do not supply. Notably, the subtitle feature is more effective than the action feature, which suggests that more semantic information resides in the subtitles. Moreover, introducing the video-level feature further increases performance by 0.00201, confirming the importance of the affective information of the whole video. These results indicate that our proposed multi-granularity features are effective for affective prediction.
In addition to the proposed multi-granularity feature, we further evaluate the modal fusion module and the expression classification module. As shown in Table 2, four methods are compared with the same multi-granularity feature. Compared with a multilayer perceptron classifier, MOE improves the correlation by 0.00189, achieving the best performance when the number of experts is set to 3. When the Modal Attention Fusion (MAF) module replaces concatenation-based fusion, the correlation score improves from 0.01718 to 0.01849. We further ensemble five models, achieving a correlation score of 0.02292, which improves on the single model by 0.00443.
In this paper, the Multi-Granularity Network with Modal Attention (MGN-MA) is proposed for dense affective prediction. MGN-MA consists of a multi-granularity feature construction module, a modal attention fusion module and an expression classification module. The multi-granularity feature construction module learns richer semantic content and video-theme information. The modal attention fusion module further improves performance by emphasizing the more affection-relevant features among the multi-granularity multi-modal features. MGN-MA achieves a correlation score of 0.02292 in the EEV challenge.