🏆 The 2nd Place Submission to the CVPR2021-Evoked Emotion from Videos challenge.
Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze video performance and improve the future user experience. In this report, we present our method for the 2021 Evoked Expression from Videos Challenge. In particular, our model utilizes both audio and image modalities as inputs to predict the emotion changes of viewers. To model long-range emotion changes, we use a GRU-based model to predict one sparse signal at 1Hz. We observe that the emotion changes are smooth. Therefore, the final dense prediction is obtained by linearly interpolating the signal, which is robust to prediction fluctuations. Albeit simple, the proposed method achieved a Pearson's correlation score of 0.04430 on the final private test set.
Videos, as rich-media content, are capable of evoking a range of emotions in viewers [sun2021eev]. With the boom of streaming platforms, creating attractive videos has become a key demand for both platforms and creators. To address this demand, the 2021 Evoked Expression from Videos Challenge proposes the following objective: predicting continuous viewer responses to YouTube videos at a rate of 6Hz. Predicting the evoked affect before viewers see the video is helpful for future video creation as well as recommendation optimization. The organizers provide Internet videos of more than 20 themes covering music, films, games, and more. The average length of these videos is 4.3 minutes, which corresponds to approximately 1500 predictions per video. The labels are logits with corresponding timestamps. The predictions for one video are evaluated using Pearson's correlation coefficient for each emotion. The final score is averaged over all the emotions on the private test set.
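As an illustration of the evaluation described above, the following is a minimal sketch of a per-emotion Pearson's correlation averaged over emotions for a single video. The function names `pearson` and `video_score`, the nested-list data layout, and the choice to return 0 for a constant sequence are assumptions made for this example; the official evaluation script may differ in such details.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Assumption: a constant sequence is scored as zero correlation.
        return 0.0
    return cov / math.sqrt(vx * vy)


def video_score(pred, label):
    """Average the per-emotion correlation for one video.

    pred, label: lists of frames, each frame being a list of emotion scores.
    """
    num_emotions = len(label[0])
    per_emotion = [
        pearson([frame[k] for frame in pred], [frame[k] for frame in label])
        for k in range(num_emotions)
    ]
    return sum(per_emotion) / num_emotions
```

Because the metric is a correlation, it rewards predictions that follow the trend of each expression over time rather than predictions that match the absolute label values.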
The proposed model is shown in Figure 1. Given one video, we split it into frames and audio inputs. We design one model architecture that exploits the temporal relation within each modality and the cross-modality information by fusing the features together to generate viewer emotion predictions. As the competition demands, we are required to generate dense predictions of viewer responses at 6Hz. Every second we need to predict emotion changes 6 times, and the total number of predictions for each video is about 1500. We observe that the emotion changes are smooth. In the experiments, we verify that directly predicting the dense 6Hz signal is not stable, and that its Pearson's correlation is much lower than that of the smooth prediction obtained from a sparse signal. In particular, the final model predicts reactions at 1Hz, which reduces the total for each video to about 250 predictions. In the end, we use interpolation to compute the final dense 6Hz predictions.
In the following sections, we explain our method and several attempts on this task in detail. In Section 2, we first study how to encode one video, followed by the proposed reaction prediction model in Section 3. The experimental results are provided in Section 4. The summary and future work are discussed in Section 5.
We deploy the off-the-shelf Swin Transformer [liu2021swin], i.e., Swin-L, which is pre-trained on ImageNet-22K [krizhevsky2012imagenet], to extract image features. Each sampled frame is resized before being fed into the network. We extract the output at stage 4 as the visual feature vector. In practice, we can also use the features extracted by SE-ResNet [hu2018squeeze] and InceptionNet [szegedy2015rethinking]. The only requirement is to change the output feature dimension for the subsequent training.
The audio features are extracted using VGGish [vggish], which is pre-trained on YouTube-100M [vggish]. The audio track of each video is first transcoded into 16kHz mono audio, and the method from AudioSet [gemmeke2017audio] is then used to compute the log mel-spectrogram. The log mel-spectrogram is fed into VGGish, resulting in an audio representation of 128 dimensions.
Our model is based on the baseline model proposed by Sun et al. [sun2021eev], as shown in Figure 1. The pre-computed features of each modality are fed into 2-layer bidirectional gated recurrent units (GRUs) [cho2014learning]. The bidirectional GRU builds the correlation in the temporal space. Compared to a unidirectional GRU, it takes both past and future information into account, which enables a better representation at each time step for each modality. It is worth noting that we do not share weights between the two modalities, so the temporal information is only shared within each modality. We adopt the late fusion strategy [karpathy2014large]: the outputs of the GRUs are concatenated as the video representation. Context gating [miech2017learnable] is used to exploit the dependencies within the fused feature vector. In the end, another context gating layer is added along with a sigmoid activation layer to calculate the final prediction. The final emotion scores of every frame are within [0, 1], and we note that the scores of the emotion categories are not supposed to sum to 1.
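The fusion step described above can be sketched for a single time step as follows. This is a minimal illustration in plain Python, assuming the standard context gating form y = sigmoid(Wx + b) * x from [miech2017learnable]; the function names, the list-based data layout, and the use of a square weight matrix are assumptions for this example and not the trained model itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def late_fusion(visual_feat, audio_feat):
    """Late fusion: concatenate the per-modality GRU outputs."""
    return list(visual_feat) + list(audio_feat)


def context_gating(x, weight, bias):
    """Context gating: y = sigmoid(W x + b) * x (element-wise product).

    x: feature vector of length d.
    weight: d x d matrix given as a list of rows (hypothetical parameters).
    bias: vector of length d.
    """
    gates = [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(weight, bias)
    ]
    return [g * x_i for g, x_i in zip(gates, x)]
```

Each gate lies in (0, 1) and depends on the whole fused vector, so a feature from one modality can be suppressed or kept depending on what the other modality contains.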
Optimization Objective. We use an element-wise loss. The loss forces the model to match the labels at the frame level as well as to follow the trend of each expression along the temporal dimension.
The task requires us to generate dense 6Hz predictions for each video, but we notice that the sparse emotion label is more stable. In practice, we train a model based on 1Hz sparse sampling of each video. Each video is divided into 60-second clips and sampled at 1Hz for feature extraction. This is a trade-off between a moderate temporal perception field and the training difficulty of the GRUs. As our GRU module runs for 60 time steps at each run, sampling at 1Hz provides a perception field of 60 seconds. For the result submission, we use linear interpolation to generate the final dense 6Hz expression predictions as required. More discussion on the sparse sampling can be found in Section 4.2.
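The up-sampling from sparse 1Hz predictions to dense 6Hz predictions can be sketched as below. This is a minimal illustration for one emotion channel; the function name `upsample_linear` and the handling of the sequence end (the dense output stops at the last sparse sample) are assumptions, since the report does not specify how timestamps beyond the last sparse prediction are treated.

```python
def upsample_linear(sparse, factor=6):
    """Linearly interpolate a 1Hz sequence to `factor` Hz.

    sparse: list of scores for one emotion, one value per second.
    Returns factor * (len(sparse) - 1) + 1 values, so the first and last
    dense samples coincide with the first and last sparse samples.
    """
    if len(sparse) < 2:
        return list(sparse)
    dense = []
    for left, right in zip(sparse, sparse[1:]):
        for step in range(factor):
            t = step / factor  # fractional position between two sparse samples
            dense.append((1.0 - t) * left + t * right)
    dense.append(sparse[-1])
    return dense
```

Since every dense value is a convex combination of two neighbouring sparse values, the interpolated signal cannot fluctuate more than the sparse prediction it is derived from.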
We use the partial EEV dataset [sun2021eev] provided by the organizers to train and evaluate our model. The partial dataset consists of 3061 training videos, 755 validation videos, and 1377 test videos. During data preparation, we notice that some videos are missing because they are unavailable or private. Therefore, we actually obtain 3023 videos for training and 745 videos for validation. There are more than 20 themes for the videos in the dataset, with an average video length of 4.3 minutes. There are 15 annotated expressions in total. Several expressions may denote a similar basic emotion and differ only in degree, such as elation and amusement. This increases the difficulty of predicting the right expression, since such expressions are hard to differentiate by nature.
We train our model on 60-frame video clips with a 1Hz sample rate, which covers a 60-second time span. We consider three different strategies (see Table 1) to obtain the dense prediction at 6Hz for submission. First, we consider keeping the number of input frames, i.e., 60 frames, and changing the sampling rate from 1Hz to 6Hz. The time span is then reduced from 60 seconds to 10 seconds, which compromises the inference result. Second, we keep the perception range, i.e., 60s, and sample the frames at 6Hz, resulting in an input of 360 frames. In the experiments, we find that the dense inputs also harm the performance. Third, we use linear interpolation to up-sample the 1Hz predictions to 6Hz. This test setting is close to the training process and achieves competitive performance. For the final submission, we adopt the third strategy and find that sparse input sampling and dense prediction interpolation are both helpful.
|Sample rate||Clip||Frames||Best Val corr.|
|6Hz (w/o interp)||10s||60||0.0077|
|6Hz (w/o interp)||60s||360||0.0106|
|1Hz (w/ interp)||60s||60||0.0121|
Here we report one failed attempt. From our observation of the expression labels, we find that they contain high-frequency noise and sudden changes to 0 (caused by technical reasons when collecting the dataset, as explained in [sun2021eev]). These factors make it hard for the model to fit the data. Therefore, one straightforward idea is to use low-pass filters to remove the high-frequency noise and smooth out the sudden ramps. To verify this point, we calculate Pearson's correlation coefficient between the filtered label and the original label (see Table 2). If we train the model to fit the filtered label, this label correlation score can be considered an upper bound for the model trained on the filtered data. While the smoothed labels are easier to fit, we observe that the model learned on the filtered labels does not achieve better performance than the model learned on the original labels (see Table 2).
|Filters||Label corr.||Trained Model corr.|
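The label-correlation check described above can be sketched as follows. The report does not state which low-pass filters were used, so the centered moving average here is only an assumed stand-in, and the names `moving_average` and `pearson` as well as the window size are hypothetical choices for this example.

```python
import math


def moving_average(signal, window=5):
    """Centered moving-average low-pass filter, truncated at the edges.

    Assumption: a moving average stands in for the unspecified filters.
    """
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: constant sequences score zero
    return cov / math.sqrt(vx * vy)


# Toy label for one emotion: a slow ramp with alternating noise.
label = [0.1, 0.3, 0.2, 0.4, 0.3, 0.5, 0.4, 0.6]
filtered = moving_average(label, window=3)
label_corr = pearson(filtered, label)  # upper bound for a model fit to `filtered`
```

A model that fits the filtered label perfectly would score exactly `label_corr` against the original label, which is why this value acts as an upper bound; stronger filtering lowers it.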
In our model, we use a simple yet effective loss. One optional method is to use the KL divergence loss to minimize the distance between the predicted logit and the ground-truth label. However, we notice that the expression labels cannot be strictly considered a probability distribution, since the emotion scores in every frame do not sum to 1. Nevertheless, later testing (after the challenge) reveals some interesting results, as shown in Table 3: the KL divergence loss outperforms the loss used in our final submission on the validation set.
We have also tried an alternative correlation-based loss based on the concordance correlation coefficient (CCC) (see Equation 1). Existing work [atmaja2021evaluation] has shown that CCC performs better than error-based losses in terms of the average CCC score.
$$\mathrm{CCC} = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \quad (1)$$
Here $\mu_x$ and $\mu_y$ are the means of the two variables $x$ and $y$, and $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances. $\rho$ is Pearson's correlation coefficient between the two variables. This eliminates the square root part in Pearson's correlation and makes it easier to optimize. In practice, although CCC is more stable than our loss on the training set, it leads to a worse correlation score on the validation set (see Table 3).
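For reference, the CCC-based loss discussed above can be sketched as follows, using the standard definition of CCC in its covariance form, 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), which is equivalent to writing the numerator with Pearson's correlation times the two standard deviations. The function names and the convention of returning 1 when both sequences are identical constants are assumptions for this example.

```python
def ccc(x, y):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = vx + vy + (mx - my) ** 2
    if denom == 0:
        # Assumption: two identical constant sequences agree perfectly.
        return 1.0
    return 2.0 * cov / denom  # no square root is needed


def ccc_loss(pred, label):
    """Loss that is minimized when prediction and label agree perfectly."""
    return 1.0 - ccc(pred, label)
```

Unlike Pearson's correlation, CCC also penalizes a shift in the mean: a prediction that follows the label trend exactly but with a constant offset has a Pearson's correlation of 1 and a CCC below 1.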
|Losses||Best Validation corr.|
The final submitted result is obtained by ensembling the top models on the validation set. Among the eight models, we also include one model trained on both the training and validation sets. We achieved a Pearson's correlation score of 0.04430 on the final private test set of the EEV challenge (see Table 4).
|Teams||Final Test corr.|
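The report does not state how the model outputs are combined in the ensemble. The sketch below assumes the simplest option, an unweighted average of the per-frame, per-emotion scores; the function name `ensemble_average` and the nested-list layout are hypothetical.

```python
def ensemble_average(predictions):
    """Average the predictions of several models (assumed ensembling rule).

    predictions: one entry per model, each a list of frames, each frame a
    list of emotion scores. All models must cover the same frames.
    """
    num_models = len(predictions)
    return [
        # zip(*frames) groups the scores of one emotion across models.
        [sum(scores) / num_models for scores in zip(*frames)]
        # zip(*predictions) groups the same frame across models.
        for frames in zip(*predictions)
    ]
```

With sigmoid outputs, the averaged scores remain within [0, 1], so the ensemble output has the same range as each individual model.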
In this report, we present our approach to viewer reaction prediction. Our model takes advantage of both image and audio modalities to build the temporal correlation. During inference, we use linear interpolation to generate dense predictions from sparse predictions. Albeit simple, the proposed method achieved 2nd place on the EEV Challenge leaderboard. In the experiments, we not only illustrate our detailed solution but also report our failed studies on different loss terms and label smoothing strategies. We hope this can pave the way for future work on reducing the noise in the labels or on new loss functions to regularize model training. In the future, we will continue to study more discriminative cross-modality losses, such as instance loss [zheng2020dual] and CLIP loss [radford2021learning], and extend the proposed method to more real-world video understanding tasks, such as sign language recognition [li2020transferring].