Facial expression recognition has been studied for a long time, but several challenges remain, especially in in-the-wild scenarios. Current in-the-wild datasets contain limited human emotion annotations because annotation is costly and time-consuming. This limits the progress of multi-task methods and the application of emotion recognition in real life. Recently, to tackle these problems, Kollias et al. held the Affective Behavior Analysis in-the-wild (ABAW2) ICCV-2021 Competition and built the large-scale Aff-Wild2 dataset, which includes annotations of valence/arousal values, action units (AUs), and facial expressions for three different recognition tasks. Valence represents how positive the person is, while arousal describes how active the person is. AUs are the basic actions of individual muscles or groups of muscles used to portray emotions. Facial expressions are classified into seven categories: neutral, anger, disgust, fear, happiness, sadness, and surprise.
However, the valence and arousal annotations in the Aff-Wild2 dataset are heavily imbalanced, as shown in Fig. 1. Deng et al. address the data imbalance problem by importing the external AFEW-VA dataset, which contains 30,051 frames, and then rebalancing. However, the distribution of the resampled dataset is still not fully balanced.
All imbalanced learning methods, directly or indirectly, operate by compensating for the imbalance in the empirical label density distribution. This works well for class imbalance, but for continuous labels the empirical density does not accurately reflect the imbalance as seen by the neural network. Hence, compensating for data imbalance based on the empirical label density is inaccurate for a continuous label space. In this work, we propose a two-stream valence-arousal estimation network based on MIMAMO Net. Spatial and temporal learning are used to capture appearance and facial action information, which can further improve emotion recognition. Moreover, to counter the skewed label distribution, we use label distribution smoothing (LDS), which was proposed by Yang et al. for dealing with continuous targets, to approximate a uniform distribution.
II Related Work
In recent years, most research on facial expression recognition has focused on valence-arousal estimation, facial action unit detection, and expression classification. We introduce the latest related work on valence-arousal estimation.
Much of the available data was collected in laboratory settings. However, models that perform well under controlled conditions do not necessarily work well under uncontrolled ones. In-the-wild datasets were created to close this gap. Kossaifi et al. proposed a new dataset called AFEW-VA and found that geometric features performed well regardless of the setting. However, the dataset was not useful for dynamic architectures, since some of its clips were too short to exploit information between frames. Barros et al. proposed the OMG dataset, collected from YouTube in real-world settings; the main keyword used to select the videos was "monologue." Kollias et al. built the large-scale Aff-Wild dataset, collected from YouTube, and proposed a deep convolutional and recurrent neural architecture, AffWildNet, in which a CNN extracts features while an RNN captures temporal information. Furthermore, their work achieved high performance not only on dimensional estimation but also on expression classification.
Chang et al. proposed an integrated deep learning framework that uses information from facial action unit detection to estimate valence-arousal intensity. They showed that exploring the relationship between AUs and V-A is helpful for V-A research. Pan et al. proposed a two-stream network to exploit effective facial features. Each stream contained a CNN and an LSTM. In the temporal stream, the CNN extracted temporal features and the LSTM resolved the temporal relation between frames. Similarly, in the spatial stream, the CNN extracted spatial features and the LSTM analyzed the spatial association between frames. Kim et al. tackled the regression problem with adversarial learning, which enabled the model to better understand complex emotions and achieved person-independent facial expression recognition. They also proposed a contrastive loss function that improved performance effectively. This study demonstrated the potential of adversarial learning over conventional methods for emotion recognition.
III-A Overall Architecture
Specifically, for the input images, we first compute their corresponding phase difference images. The spatial stream uses a pre-trained ResNet50 model to extract features from the pool5 layer; the feature vector is then fed into an MLP module to obtain the final feature vector. The temporal stream uses the phase difference images to capture the relationship among frames. Different from MIMAMO Net, we use only one convolutional-layer block to reduce training effort. The output of the two-stream network is connected to the regression module, which combines information from the whole video to produce frame-level predictions of valence and arousal values.
The label distribution smoothing (LDS) module is modified from DIR. As validated by Yang et al., the empirical label distribution does not reflect the real label density distribution in the continuous case, because data samples at nearby labels are dependent. Fig. 3 illustrates LDS and how it smooths the label density distribution. It convolves a symmetric kernel with the empirical label density to estimate the effective label density distribution, which accounts for the continuity of labels. In this work, we re-weight the loss function by multiplying it by the inverse of the LDS-estimated label density for each target.
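The LDS re-weighting step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `lds_weights` and `weighted_mse`, the number of bins, the Gaussian kernel and its parameters, and the normalization of the weights to mean 1 are all assumptions made for the example.

```python
import math
from collections import Counter


def lds_weights(labels, lo=-1.0, hi=1.0, num_bins=20, kernel_size=5, sigma=2.0):
    """Return one loss weight per label via label distribution smoothing.

    All defaults (bin count, kernel size, sigma) are illustrative assumptions.
    """
    width = (hi - lo) / num_bins

    def bin_index(y):
        # Clamp so that values at the range edges fall into a valid bin.
        return max(0, min(int((y - lo) / width), num_bins - 1))

    # Empirical label density: number of samples per bin.
    counts = Counter(bin_index(y) for y in labels)
    empirical = [counts.get(b, 0) for b in range(num_bins)]

    # Symmetric Gaussian kernel, normalized to sum to 1.
    half = kernel_size // 2
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-half, half + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]

    # Convolve the kernel with the empirical density (zero padding at the edges)
    # to obtain the effective label density.
    effective = []
    for b in range(num_bins):
        acc = 0.0
        for offset, k in zip(range(-half, half + 1), kernel):
            j = b + offset
            if 0 <= j < num_bins:
                acc += k * empirical[j]
        effective.append(acc)

    # Inverse effective density per sample, rescaled so the weights average to 1.
    raw = [1.0 / effective[bin_index(y)] for y in labels]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mse(preds, labels, weights):
    """Mean squared error with one multiplicative weight per sample."""
    return sum(w * (p - y) ** 2 for p, y, w in zip(preds, labels, weights)) / len(labels)
```

With labels concentrated near 0 and only a few near 0.9, the rare labels receive larger weights, so their errors contribute more to the loss.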
IV-A1 Aff-Wild2 dataset
Aff-Wild2 is the largest in-the-wild database annotated for valence and arousal. It contains 558 videos with frame-level annotations for valence-arousal estimation, facial action unit detection, and expression classification. In this work, we focus only on estimating valence and arousal, whose values lie in [-1, 1]; a value of -5 indicates that no annotation is available. In the VA set, there are 422 subjects with 1,932,935 images in the training and validation sets and 139 subjects with 714,986 images in the test set. The cropped and aligned images were all provided by the ABAW2 ICCV-2021 Competition organizers.
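As a small illustration of the annotation convention, the sketch below keeps only frames whose valence and arousal are both annotated. The function name `keep_annotated`, the list-based data layout, and the choice to drop a frame when either value is missing are assumptions made for the example, not details taken from the dataset tooling.

```python
INVALID = -5.0  # sentinel used for frames without an annotation


def keep_annotated(frames, valence, arousal):
    """Return (frame, valence, arousal) triples whose labels are both valid.

    A label is treated as valid when it is not the sentinel and lies in [-1, 1].
    """
    def valid(value):
        return value != INVALID and -1.0 <= value <= 1.0

    return [
        (f, v, a)
        for f, v, a in zip(frames, valence, arousal)
        if valid(v) and valid(a)
    ]
```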
IV-A2 Aff-Wild dataset
The Aff-Wild dataset is an in-the-wild dataset containing 298 videos collected from YouTube using the keyword "reaction". It includes valence and arousal annotations ranging continuously in [-1, 1], with 252 videos and 1,008,652 frames in the training set and 46 videos and 215,441 frames in the test set.
IV-A3 AffectNet dataset
AffectNet contains more than 1M images collected from Google, Bing, and Yahoo, and it has valence and arousal, facial action unit, and expression annotations for 450,000 images. Notably, this dataset has no temporal information, so we apply it only to the model without the temporal stream.
IV-B Evaluation metric
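To measure the agreement between the model outputs and the ground truth, the Concordance Correlation Coefficient (CCC) is used, defined as follows:

$$\mathrm{CCC} = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$

where $x$ and $y$ are the predictions and annotations, $\mu_x$ and $\mu_y$ are their mean values, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\rho$ is the correlation coefficient.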
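For illustration, the CCC can be computed as in the sketch below. It uses the identity that the correlation coefficient times the two standard deviations equals the covariance, with population (divide-by-n) statistics. The function names `ccc` and `mean_ccc` are hypothetical and not part of any challenge toolkit.

```python
def ccc(x, y):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    # Covariance equals rho * sigma_x * sigma_y.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def mean_ccc(valence_pred, valence_true, arousal_pred, arousal_true):
    """Average of the valence CCC and the arousal CCC."""
    return 0.5 * (ccc(valence_pred, valence_true) + ccc(arousal_pred, arousal_true))
```

Unlike the plain correlation coefficient, the CCC penalizes a constant offset between predictions and annotations: shifting every prediction by the same amount leaves the correlation unchanged but lowers the CCC.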
The mean value of CCC for valence and arousal estimation will be adopted as the main evaluation criterion of the ABAW2 challenge.
IV-C Experimental settings
We implemented our network in PyTorch. The network is trained on an NVIDIA RTX 2080 Ti GPU with 11 GB of memory.
IV-C1 Data Pre-processing
We merge the training and validation sets and use cross-validation to obtain a more accurate estimate of model prediction performance. To use the Aff-Wild2 dataset, we first remove unannotated frames and match the remaining frames to their annotation values.
Notably, the test set may contain missing frames. We handle them in two ways, depending on the situation. If the missing frames are at the beginning of the video, we output -5 as the prediction. Otherwise, we use the predicted value of the previous frame as the estimate.
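The two rules for missing test frames can be sketched as follows. The function name `fill_missing` and the representation of a missing frame as `None` are assumptions made for the example.

```python
MISSING = -5.0  # value reported when no earlier prediction exists


def fill_missing(predictions):
    """Fill gaps in a per-frame list of (valence, arousal) predictions.

    A missing frame is represented by None. Leading gaps receive the
    sentinel value; later gaps repeat the most recent prediction.
    """
    filled = []
    last = None
    for pred in predictions:
        if pred is not None:
            last = pred
            filled.append(pred)
        elif last is None:
            # Missing frame at the beginning of the video.
            filled.append((MISSING, MISSING))
        else:
            # Missing frame later in the video: reuse the previous frame.
            filled.append(last)
    return filled
```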
IV-D Experimental Results
Our valence and arousal CCCs are 0.415 and 0.511 on the validation set. The performance on arousal is better than on valence. Since arousal describes how active the person is, it is likely more related to facial motion than valence is.
IV-D1 Performance Comparison
In this subsection, we compare different methods on the validation set of Aff-Wild2 (see Table I). Our spatial stream model achieves 0.591 and 0.617 on valence and arousal, respectively.
IV-D2 Ablation study
To validate the effectiveness of the LDS module, we conduct an ablation study as shown in Table II. Valence and arousal estimation can both benefit from the LDS module. This demonstrates that LDS captures the imbalance that affects valence and arousal regression problems.
|Method|Valence CCC|Arousal CCC|Mean CCC|
|---|---|---|---|
|Ours (spatial) without LDS|0.603|0.502|0.552|
|Ours (spatial) with LDS|0.604|0.515|0.560|
We have conducted valence-arousal estimation on the Aff-Wild2 dataset by introducing a two-stream learning network. Moreover, we apply label distribution smoothing (LDS) to tackle the data imbalance problem. Our proposed method achieves a Concordance Correlation Coefficient (CCC) of 0.591 and 0.617 for valence and arousal on the validation set of the Aff-Wild2 dataset. In the future, we will improve the label distribution re-weighting mechanism to achieve better performance.
-  (2018) The omg-emotion behavior dataset. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §II.
-  (2017) FATAUVA-net: an integrated deep learning framework for facial attribute recognition, action unit detection, and valence-arousal estimation. In , pp. 17–25. Cited by: §II.
-  (2020) Multitask emotion recognition with incomplete labels. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 592–599. Cited by: §I.
-  MIMAMO net: integrating micro-and macro-motion for video emotion recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 2621–2628. Cited by: §I, §III-A.
-  (2021) Contrastive adversarial learning for person-independent facial emotion recognition. Cited by: §II.
-  Analysing affective behavior in the first abaw 2020 competition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 794–800. Cited by: §I.
-  (2021) Analysing affective behavior in the second abaw2 competition. Cited by: §I.
-  (2021) Analysing affective behavior in the second abaw2 competition. arXiv preprint arXiv:2106.15318. Cited by: Fig. 1, TABLE I.
-  (2019) Face behavior a la carte: expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111. Cited by: §I.
-  (2021) Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790. Cited by: §I.
-  (2019) Deep affect prediction in-the-wild: aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, pp. 1–23. Cited by: §II, §IV-A2.
-  (2019) Expression, affect, action unit recognition: aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855. Cited by: §I.
-  (2021) Affect analysis in-the-wild: valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792. Cited by: §I.
-  (2017) AFEW-va database for valence and arousal estimation in-the-wild. Image and Vision Computing 65, pp. 23–36. Cited by: §I, §II.
-  (2017) Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10 (1), pp. 18–31. Cited by: §IV-A3.
-  (2019) A deep spatial and temporal aggregation framework for video-based facial expression recognition. IEEE Access 7, pp. 48807–48815. Cited by: §II.
-  (2021) Affective processes: stochastic modelling of temporal context for emotion and facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9074–9084. Cited by: TABLE I.
-  (2021) Delving into deep imbalanced regression. arXiv preprint arXiv:2102.09554. Cited by: §I, Fig. 3, §III-A.
-  (2017) Aff-wild: valence and arousal ‘in-the-wild’challenge. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1980–1987. Cited by: §I.