In this paper, we aim to tackle two challenges in video emotion recognition: data imbalance and partial labels.
Data imbalance is common in both single-label and multi-label classification datasets for emotion recognition. For example, the FER2013 dataset [2] consists of 35,887 facial images, each annotated with one of 7 basic emotions (angry, disgust, fear, happy, sad, surprise and neutral). However, the images labeled "happy" far outnumber the images labeled "disgust". Such a drastic difference between the numbers of samples in the majority class and the minority class is likely to cause overfitting on the majority class, which hinders the overall prediction performance. Our strategy for dealing with data imbalance combines two methods: introducing more data, and oversampling the minority classes (or undersampling the majority classes).
The problem of partial labels (missing labels) refers to incomplete labels in the training data for multitask learning. We address it because we seek a unified solution for FAU, facial expressions, and valence and arousal, yet very few emotion datasets carry the full set of these labels. An intuitive solution is Binary Relevance (BR), which trains one classifier per class regardless of whether labels for the other classes are present. However, BR fails to model the correlations between different labels, and training many classifiers becomes inefficient as the number of labels increases. To overcome these shortcomings, we propose to use a shared feature extractor for all tasks, with multiple heads on top of the feature extractor as classifiers. To better learn inter-task correlations with incomplete labels, we first train a teacher model with partial labels, and then use the outputs of the teacher model as supervision for the student models. Other studies have also tackled this problem [9, 7]. Our method differs from theirs in the way the partial labels are completed.
In this paper, we aim to perform three tasks: FAU prediction, facial expression prediction, and valence and arousal prediction. We index them as tasks $i \in \{1, 2, 3\}$, respectively. FAU prediction is a multi-label classification problem, where the model predicts, for each sample, the presence or absence of eight AUs (AU1, AU2, AU4, AU6, AU12, AU15, AU20, AU25) that are not mutually exclusive. Facial expression prediction is a multi-class classification problem, where the model predicts one of seven categories corresponding to the seven basic emotions. Valence and arousal prediction is a regression problem, where the model predicts two continuous scores between $-1$ and $1$.
We denote the training data by $(X, Y)$ and the total number of instances from all three tasks by $N$. We denote by $x^{(i)}$ an instance that belongs to the $i$-th task, $i \in \{1, 2, 3\}$, and by $y^{(i)}$ the corresponding label for that instance. Note that the labels for the first task (FAU prediction) are $Y^{(1)} \in \{0, 1\}^{N_1 \times 8}$, where $N_1$ is the number of instances of the first task. Similarly, $Y^{(2)} \in \{1, \dots, 7\}^{N_2}$, and $Y^{(3)} \in [-1, 1]^{N_3 \times 2}$. We denote our model as $f_\theta$, where $\theta$ are the model parameters. While each instance only has a label from one task, our model predicts the results for all three tasks for each instance regardless of which task it comes from, i.e., when the input to our model is instance $x^{(i)}$, the output is $f_\theta(x^{(i)}) = (\hat{y}^{(1)}, \hat{y}^{(2)}, \hat{y}^{(3)})$, where $\hat{y}^{(j)}$ is the estimate of $y^{(j)}$.
The main dataset that we use is the Aff-wild2 dataset [8], a large-scale in-the-wild video dataset developed from the Aff-wild dataset [12, 6, 4]. The Aff-wild2 dataset contains three sets, corresponding to the three emotion tasks. Each set contains several videos, in which every frame is annotated with the related labels. More details about this dataset can be found in [5].
We aim to alleviate the data imbalance problem in the Aff-wild2 dataset by importing external datasets. For FAU, we import the Denver Intensity of Spontaneous Facial Action (DISFA) dataset [11]. By merging the DISFA dataset with the Aff-wild2 dataset, we enlarge the data size and also increase the number of samples in the minority classes. However, simply merging the two datasets cannot solve the data imbalance problem. We therefore use the ML-ROS algorithm [1] to oversample instances with positive minority classes. The ML-ROS algorithm reduces the MeanIR (Mean Imbalance Ratio), which measures the average level of imbalance in a multilabel dataset. After applying ML-ROS, we find that the numbers of samples for the individual AUs become closer to one another. The FAU distributions of the Aff-wild2 dataset, the DISFA dataset and the merged dataset are shown in Figure 1.
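To make the imbalance measure concrete, the following is a minimal sketch of MeanIR and of an ML-ROS-style oversampling loop, written from the general description of the algorithm rather than from the authors' code. It operates on label vectors only (each row stands for one sample), and the function names, the `percent` budget and the `seed` argument are illustrative assumptions. It also assumes every label has at least one positive sample.

```python
import random

def imbalance_ratios(labels):
    """Per-label imbalance ratio: count of the most frequent label / count of this label.

    `labels` is a list of 0/1 lists, one row per sample.
    Assumes every label occurs at least once (otherwise this divides by zero).
    """
    counts = [sum(column) for column in zip(*labels)]
    top = max(counts)
    return [top / c for c in counts]

def mean_ir(labels):
    """MeanIR: the average of the per-label imbalance ratios."""
    ratios = imbalance_ratios(labels)
    return sum(ratios) / len(ratios)

def ml_ros(labels, percent=25, seed=0):
    """Clone random samples of minority labels until the budget is spent.

    A label is treated as a minority label when its imbalance ratio exceeds
    the MeanIR of the original data; it leaves the pool once its ratio drops
    to that threshold or below.
    """
    rng = random.Random(seed)
    data = [list(row) for row in labels]
    budget = int(len(data) * percent / 100)  # number of clones allowed
    threshold = mean_ir(data)
    ratios = imbalance_ratios(data)
    # One bag of original samples per minority label.
    bags = {j: [row for row in data if row[j] == 1]
            for j, r in enumerate(ratios) if r > threshold}
    while budget > 0 and bags:
        for j in list(bags):
            if budget == 0:
                break
            data.append(list(rng.choice(bags[j])))
            budget -= 1
            if imbalance_ratios(data)[j] <= threshold:
                del bags[j]
    return data

# Toy example: label 0 is frequent, label 1 is rare.
toy = [[1, 0]] * 8 + [[1, 1]] * 2
balanced = ml_ros(toy, percent=50)
print(mean_ir(toy), mean_ir(balanced))
```

In a real pipeline the cloned rows would carry the image or frame reference along with the label vector, so that the duplicated samples can be fed to the network.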
For facial expression, we import the Expression in-the-Wild (ExpW) dataset [13], which is a large-scale image dataset. The ExpW dataset contains 91,795 images annotated with seven emotion categories, while the number of images in the AU set of the Aff-wild2 dataset is 10 times larger than the ExpW dataset. We downsample the Aff-wild2 dataset to make the sizes of the two datasets comparable. After merging the downsampled Aff-wild2 dataset and the ExpW dataset, we oversample the minority classes and undersample the majority classes so that every class has the same probability of appearing in one epoch. The distributions of the seven basic facial expressions in the downsampled Aff-wild2 dataset, the ExpW dataset and the merged dataset are shown in Figure 2.
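One common way to give every class the same probability of appearing in an epoch is to sample instances with weights inversely proportional to their class frequency. The sketch below illustrates that idea; it is an assumption about how such a sampler could look, not the authors' implementation, and `balanced_sample_weights` and `draw_epoch` are hypothetical names.

```python
import random
from collections import Counter

def balanced_sample_weights(class_ids):
    """Weight each instance by 1 / (number of classes * size of its class).

    The weights sum to 1 and every class receives the same total mass,
    so minority classes are oversampled and majority classes undersampled.
    """
    counts = Counter(class_ids)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[c]) for c in class_ids]

def draw_epoch(class_ids, epoch_size, seed=0):
    """Draw instance indices (with replacement) for one class-balanced epoch."""
    rng = random.Random(seed)
    weights = balanced_sample_weights(class_ids)
    return rng.choices(range(len(class_ids)), weights=weights, k=epoch_size)

# Toy example: class 0 has 90 instances, class 1 has 10.
ids = [0] * 90 + [1] * 10
epoch = draw_epoch(ids, epoch_size=1000)
print(Counter(ids[i] for i in epoch))
```

Sampling with replacement means some majority-class instances are skipped in a given epoch while minority-class instances repeat, which matches the combined oversampling and undersampling described above.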
For valence and arousal prediction, we import the AFEW-VA dataset [10], which is a large-scale video dataset in the wild. In total there are 30,051 frames in the AFEW-VA dataset that are annotated with both valence and arousal between $-10$ and $10$. First, we rescale the labels of the AFEW-VA dataset to $[-1, 1]$. Second, we downsample the Aff-wild2 dataset by a factor of 5 to reduce the number of frames. Finally, we merge the AFEW-VA dataset with the downsampled Aff-wild2 dataset. In addition, we discretize the continuous valence/arousal scores in $[-1, 1]$ into 20 bins of equal width. We treat each bin as a category, and apply the same oversampling/undersampling strategy that we applied to the multilabel classification dataset. The distributions of the valence and arousal scores in the downsampled Aff-wild2 dataset, the AFEW-VA dataset and the merged dataset are shown in Figure 3.
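The rescaling and discretization steps can be sketched as follows. The source range of $-10$ to $10$ and the target range of $[-1, 1]$ are treated here as assumptions, and `rescale` and `to_bin` are hypothetical helper names.

```python
def rescale(value, src_min=-10.0, src_max=10.0, dst_min=-1.0, dst_max=1.0):
    """Linearly map a score from [src_min, src_max] to [dst_min, dst_max]."""
    return dst_min + (value - src_min) * (dst_max - dst_min) / (src_max - src_min)

def to_bin(score, n_bins=20, low=-1.0, high=1.0):
    """Index of the equal-width bin that contains `score`.

    Scores on the upper boundary are assigned to the last bin.
    """
    index = int((score - low) * n_bins / (high - low))
    return min(max(index, 0), n_bins - 1)

# A raw annotation of 5 maps to 0.5, which falls in bin 15 of 20.
score = rescale(5.0)
print(score, to_bin(score))
```

Once every frame has a bin index, the bins can be treated as class labels for the resampling step.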
Learning from partial labels
As each instance only has the label from one task, the intuitive way to train a unified model is to use only the label from that task for supervision. However, this training strategy may not capture inter-task correlations, as each instance is supervised by only one task. Thus, we propose a two-step semi-supervised method in which each instance is trained with information from all three tasks. In the first step, we train a unified teacher model with partial labels, where each instance is supervised by the ground truth label of its corresponding task. After training the teacher model, we fill in the missing labels of each instance with the estimates of the teacher model. We refer to these estimates as soft labels. In the second step, we use the ground truth labels and the soft labels to train the student model.
For ease of presentation, we refer to the loss that uses the ground truth as supervision as the supervision loss, and to the loss that uses soft labels as supervision as the distillation loss. We describe them formally below.
The supervision loss uses the ground truth for supervision. We choose different loss functions for different tasks. For FAU prediction, since this is a multilabel classification problem, we use the binary cross entropy loss for each AU, defined as follows:
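$$\mathcal{L}^{(1)}(y, \hat{y}) = -\frac{1}{K} \sum_{j=1}^{K} \left[ y_j \log \sigma(\hat{y}_j) + (1 - y_j) \log\left(1 - \sigma(\hat{y}_j)\right) \right] \tag{1}$$
where $y_j$ is the ground truth label for the $j$-th AU, $\hat{y}_j$ is the output of the network, and $K$ is the number of AUs, which equals 8. $\sigma$ is the sigmoid function.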
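As an illustration of the binary cross entropy described above, here is a minimal sketch that applies the sigmoid to each raw network output and averages the per-AU losses. Averaging (rather than summing) over the AUs and the clipping constant `eps` are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid (may overflow for extremely negative inputs)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(targets, logits, eps=1e-12):
    """Mean binary cross entropy over all AUs.

    `targets` are 0/1 labels, `logits` are raw network outputs.
    Probabilities are clipped to [eps, 1 - eps] to keep the logarithm finite.
    """
    total = 0.0
    for y, z in zip(targets, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)

# Eight AUs: confident correct outputs give a lower loss than uninformative ones.
au_labels = [1, 0, 0, 1, 1, 0, 0, 1]
print(bce_with_logits(au_labels, [0.0] * 8))
print(bce_with_logits(au_labels, [4.0 if y else -4.0 for y in au_labels]))
```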
For facial expression prediction, since it is a multiclass classification problem, we use the categorical cross entropy loss, defined as follows:
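$$\mathcal{L}^{(2)}(y, \hat{y}) = -\sum_{j=1}^{C} y_j \log \frac{\exp(\hat{y}_j)}{\sum_{k=1}^{C} \exp(\hat{y}_k)} \tag{2}$$
where $y_j$ is the ground truth label for the $j$-th facial expression category, $\hat{y}_j$ is the output of the network, and $C$ is the number of facial expressions, which equals 7.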
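A minimal sketch of the categorical cross entropy over seven expression classes follows, assuming the network outputs are raw scores that are normalized with a softmax. The function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def categorical_cross_entropy(one_hot, logits):
    """Cross entropy between a one-hot target and softmax(logits)."""
    return -sum(y * lp for y, lp in zip(one_hot, log_softmax(logits)))

# Seven expression classes, ground truth is class 3.
target = [0, 0, 0, 1, 0, 0, 0]
print(categorical_cross_entropy(target, [0.0] * 7))
print(categorical_cross_entropy(target, [0, 0, 0, 5.0, 0, 0, 0]))
```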
Valence and arousal prediction is a regression problem. We discretize the continuous labels into 20 categorical labels and then combine the classification loss with the regression loss as follows:
$$\mathcal{L}^{(3)}(y, \hat{y}) = \mathrm{CCE}(y_d, \hat{y}_d) - \mathrm{CCC}(y, \hat{y}), \qquad \mathrm{CCC}(y, \hat{y}) = \frac{2 \rho \sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2} \tag{3}$$
where the first term is the categorical cross-entropy loss and the second term is the negative Concordance Correlation Coefficient (CCC). The subscript $d$ indicates that the value is a discretized one-hot vector instead of a continuous value. In CCC, $\rho$ is the correlation coefficient between the ground truth and the (continuous) prediction, $\sigma_y$ and $\sigma_{\hat{y}}$ are the standard deviations of the ground truth and the prediction, respectively, and $\mu_y$ and $\mu_{\hat{y}}$ are the corresponding means. To transform the output probabilities into continuous values, we simply calculate the expectation under the predicted probability distribution.
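The two ingredients of the regression term can be sketched as follows: the CCC between two sequences, and the conversion from a 20-bin probability vector to a continuous score by taking an expectation. Using bin centers in $[-1, 1]$ as the values of the expectation is an assumption of this example, as are the function names.

```python
def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient of two equally long sequences.

    Uses 2 * covariance in the numerator, which equals 2 * rho * sigma_y * sigma_yhat.
    Undefined (division by zero) when both sequences are constant and equal.
    """
    n = len(y_true)
    mu_t = sum(y_true) / n
    mu_p = sum(y_pred) / n
    var_t = sum((a - mu_t) ** 2 for a in y_true) / n
    var_p = sum((b - mu_p) ** 2 for b in y_pred) / n
    cov = sum((a - mu_t) * (b - mu_p) for a, b in zip(y_true, y_pred)) / n
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

def expected_value(probs, low=-1.0, high=1.0):
    """Expectation of a discretized score, using the center of each bin."""
    width = (high - low) / len(probs)
    return sum(p * (low + (k + 0.5) * width) for k, p in enumerate(probs))

truth = [-0.5, 0.0, 0.5]
print(ccc(truth, truth))                       # perfect agreement
print(ccc(truth, [v + 0.5 for v in truth]))    # same shape, shifted mean
print(expected_value([0.05] * 20))             # uniform distribution
```

A constant offset between prediction and ground truth lowers the CCC even when the correlation is perfect, which is why it is a stricter target than the correlation coefficient alone.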
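We linearly combine the loss functions for the three tasks. The overall loss function for the teacher model is defined as follows:
$$\mathcal{L}_{\text{teacher}} = \mathcal{L}^{(1)} + \mathcal{L}^{(2)} + \mathcal{L}^{(3)} \tag{4}$$
where each task loss is evaluated only on the instances that carry a ground truth label for that task.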
Distillation loss. The distillation loss uses the soft labels for supervision. We use $t$ to denote the output of the teacher model, and $s$ to denote the output of the student model. For facial expression prediction, as well as valence and arousal prediction, we use the Kullback-Leibler (KL) divergence between the estimates of the teacher model and those of the student model as the distillation loss:
$$\mathcal{H}(t, s) = \sum_{j} p_j \log \frac{p_j}{q_j}, \qquad p = \mathrm{softmax}(t / T), \quad q = \mathrm{softmax}(s / T) \tag{5}$$
where the temperature $T$ is set to 1.5, and we omit the task index here. The KL divergence measures the difference between two probability distributions.
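The temperature-softened KL divergence can be sketched as below. This assumes that the teacher and student outputs are raw scores and that both are divided by the temperature before the softmax, which is the usual knowledge-distillation convention; the paper's exact formulation may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distillation(teacher_logits, student_logits, temperature=1.5):
    """KL(teacher || student) between temperature-softened distributions.

    Terms with zero teacher probability contribute nothing and are skipped.
    Assumes the student probabilities do not underflow to zero.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 0.5, -1.0]
print(kl_distillation(teacher, teacher))            # identical outputs
print(kl_distillation(teacher, [-1.0, 0.5, 2.0]))   # disagreeing student
```

A temperature above 1 flattens both distributions, so the student also receives a signal from the classes the teacher considers unlikely.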
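For AU prediction, the distillation loss we use is the binary cross entropy loss between the teacher model outputs and the student model outputs:
$$\mathcal{H}^{(1)}(t, s) = -\frac{1}{K} \sum_{j=1}^{K} \left[ \sigma(t_j) \log \sigma(s_j) + \left(1 - \sigma(t_j)\right) \log\left(1 - \sigma(s_j)\right) \right] \tag{6}$$
where $\sigma$ is the sigmoid function and $K = 8$ is the number of AUs.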
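In practice, we find that it is better to use both the soft labels and the ground truths to train the student model. When the input is $x^{(i)}$, it has a corresponding task $i$ and ground truth $y^{(i)}$. The soft labels for the input are denoted by $t = f_{\theta_t}(x^{(i)})$, where $\theta_t$ are the parameters of the teacher model. We denote the loss function for the student model as $\mathcal{F}$:
$$\mathcal{F}(x^{(i)}) = (1 - \lambda)\, \mathcal{L}^{(i)}(y^{(i)}, s^{(i)}) + \lambda\, \mathcal{H}^{(i)}(t^{(i)}, s^{(i)}) + \sum_{j \neq i} \mathcal{H}^{(j)}(t^{(j)}, s^{(j)}) \tag{7}$$
where $\mathcal{L}^{(i)}$ and $\mathcal{H}^{(i)}$ are the supervision loss and the distillation loss respectively for the $i$-th task, and $s^{(j)}$ and $t^{(j)}$ are the student and teacher outputs for task $j$. The parameter $\lambda$ determines how much we should trust the soft labels when the ground truths exist. In our experiments, we choose $\lambda$ so that the weight on the ground truth is larger than the weight on the soft labels.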
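The way the two kinds of loss are mixed for a single instance can be sketched as follows. The exact form is an assumption: the labeled task receives a convex combination of the supervision loss and the distillation loss, and the two unlabeled tasks receive the distillation loss only. The value `lam=0.4` is a placeholder chosen so that the ground truth is weighted more heavily, not the value used in the experiments.

```python
def student_loss(task, supervision, distillation, lam=0.4):
    """Combine per-task loss values for one instance.

    task:         index of the task that has a ground truth label (1, 2 or 3)
    supervision:  supervision loss value for that task
    distillation: dict mapping every task index to its distillation loss value
    lam:          trust placed in the soft label of the labeled task (hypothetical value)
    """
    labeled = (1.0 - lam) * supervision + lam * distillation[task]
    unlabeled = sum(v for t, v in distillation.items() if t != task)
    return labeled + unlabeled

# An instance labeled for task 1 (FAU), with made-up loss values.
print(student_loss(1, 2.0, {1: 1.0, 2: 0.5, 3: 0.25}, lam=0.25))  # 2.5
```

With `lam` below 0.5 the ground truth dominates whenever it exists, while the soft labels remain the only supervision for the other two tasks.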
The pseudocode for training the teacher model and the student models is given in Algorithm 1. Given the dataset, the initial parameters of the teacher and student models, and the numbers of training epochs for the teacher model and the student model, we first train the teacher model using the loss function in Eq. (4). Then, given the outputs of the teacher model, we train the student model with the loss function in Eq. (7). Algorithm 1 only shows the case where the number of student models equals 1. In practice, we can repeat the second procedure in Algorithm 1 to obtain multiple student models.
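The overall control flow of the two-step procedure can be sketched with plain callbacks. This is a schematic outline under stated assumptions, not Algorithm 1 itself: `teacher_step`, `teacher_predict` and `student_step` are hypothetical functions that a real implementation would back with a neural network, mini-batches and an optimizer.

```python
def train_teacher_then_students(dataset, teacher_step, teacher_predict,
                                student_step, teacher_epochs=8,
                                student_epochs=3, n_students=1):
    """Two-step training: teacher on partial labels, then students on all labels.

    dataset: list of (x, task, y) triples, each labeled for exactly one task.
    Returns the soft labels produced by the teacher, one entry per instance.
    """
    # Step 1: train the teacher with the ground truth of each instance's own task.
    for _ in range(teacher_epochs):
        for x, task, y in dataset:
            teacher_step(x, task, y)

    # Step 2: let the trained teacher predict all three tasks for every instance.
    soft_labels = [teacher_predict(x) for x, _, _ in dataset]

    # Step 3: train each student with the ground truth plus the soft labels.
    for k in range(n_students):
        for _ in range(student_epochs):
            for (x, task, y), t in zip(dataset, soft_labels):
                student_step(k, x, task, y, t)
    return soft_labels

# Dry run with stand-in callbacks that only record what they are given.
log = []
data = [("frame_a", 1, "label_a"), ("frame_b", 3, "label_b")]
soft = train_teacher_then_students(
    data,
    teacher_step=lambda x, task, y: None,
    teacher_predict=lambda x: "soft(" + x + ")",
    student_step=lambda k, x, task, y, t: log.append((k, task, t)),
    n_students=2,
)
print(soft, len(log))
```

The default epoch counts follow the training setup reported in the experiments; repeating step 3 with independently initialized students yields the multiple student models.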
To parameterize the model, we use two different architectures: a CNN architecture and a CNN-RNN architecture. The CNN architecture uses a ResNet50 model as the feature extractor. We stack three MLPs on top of the final convolutional layer of the ResNet50 model as classifiers. For the CNN-RNN architecture, the feature extractor is the same as in the CNN architecture. We use three bidirectional GRU layers to encode the temporal correlations over the frame features. Each GRU layer corresponds to one task. Details of the two architectures are shown in Figure 4. The dashed rectangle indicates that the parameters inside are shared between the two models.
To train the CNN architecture, we use the combination of the Aff-wild2 dataset and the external datasets. Our experiments can be divided into two parts: single-task training and multi-task training. In single-task training, we compare the performance of the teacher model before and after applying our data balancing techniques. In multi-task training, we follow the procedure in Algorithm 1 to obtain the teacher model and multiple student models.
To train the CNN-RNN model, we only use data from the Aff-wild2 dataset, because it is not always possible to obtain image sequences from our external datasets. We initialize the feature extractor of the CNN-RNN model with the feature extractor learned in the CNN architecture, and we keep it fixed while training the RNNs.
For both architectures, we use Adam as the only optimizer. The learning rate is initially 0.0001 and is decreased by a factor of 10 after every 3 epochs. The teacher model is trained for 8 epochs and the student model for 3 epochs. The input image size to both architectures is .
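The step decay of the learning rate can be written as a one-line schedule. The sketch assumes epochs are counted from 0 and that the decay is applied at the start of epochs 3, 6, and so on; `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, base_lr=1e-4, step=3, factor=0.1):
    """Learning rate at a given (0-indexed) epoch under step decay."""
    return base_lr * factor ** (epoch // step)

# Teacher training runs for 8 epochs, so the rate is decayed twice.
for epoch in range(8):
    print(epoch, learning_rate(epoch))
```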
We keep the evaluation metrics consistent with those used in [5]. For FAU prediction, the evaluation metric is $0.5 \cdot \mathrm{F1} + 0.5 \cdot \mathrm{Acc}$, where $\mathrm{F1}$ denotes the unweighted F1 score over all 8 AUs, and $\mathrm{Acc}$ denotes the total accuracy. For facial expression prediction, we use $0.67 \cdot \mathrm{F1} + 0.33 \cdot \mathrm{Acc}$ as the metric, where $\mathrm{F1}$ denotes the unweighted F1 score over the 7 classes, and $\mathrm{Acc}$ is the total accuracy. For valence and arousal prediction, we use CCC as the evaluation metric.
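For the classification tasks, the metric mixes an unweighted (macro) F1 score with the total accuracy. The sketch below shows the expression-task variant; the mixing weights of 0.67 and 0.33 are stated here as an assumption about the challenge convention, and the function names are illustrative.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores (F1 is 0 for an absent class)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def expression_score(y_true, y_pred, n_classes=7, w_f1=0.67, w_acc=0.33):
    """Weighted mix of macro F1 and accuracy (weights are an assumption)."""
    return w_f1 * macro_f1(y_true, y_pred, n_classes) + w_acc * accuracy(y_true, y_pred)

truth = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
print(macro_f1(truth, pred, n_classes=2), accuracy(truth, pred))
```

Because the F1 term is averaged over classes without weighting, a model that ignores a rare class is penalized even if its overall accuracy is high.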
We train single-task CNN models for all three tasks, on both the imbalanced dataset and the balanced dataset. The results are reported in Table 1. After applying our data balancing techniques, the performance of the single-task CNN improves by a large margin for FAU and facial expression prediction, whereas the improvement for the regression tasks (valence and arousal) is not obvious. Considering the benefit to the other tasks, we consistently apply our data balancing techniques in the rest of the experiments.
We train multitask CNN models using Algorithm 1. The performance of the teacher model, the student models, and the ensemble of student models is reported in Table 2. The number of students is set to 5. Comparing the results of the multitask teacher model with the results of the single-task models, we find that in some tasks, such as FAU and facial expression prediction, the multitask teacher model does not outperform the single-task models, while the arousal and valence predictions are improved by multitask learning. The teacher model is not trained with complete training labels, but the student models are. Our assumption is that the student model can perform better than the teacher model because of the supervision of the complete labels. In addition, the soft labels may provide the dark knowledge [3] that exists in the softened probabilities. From Table 2, we find that almost every student model outperforms the teacher model on all the tasks. Finally, we achieve better performance by merging the outputs of all five student models.
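The text does not specify how the outputs of the five students are merged. One simple possibility, shown here purely as an assumption, is to average the students' outputs element-wise for each instance.

```python
def ensemble_average(student_outputs):
    """Element-wise mean of several students' outputs for one instance.

    student_outputs: list of equally long lists, one per student
    (for example class probabilities or valence/arousal scores).
    """
    n = len(student_outputs)
    return [sum(values) / n for values in zip(*student_outputs)]

# Two students, two-class probabilities for a single frame.
print(ensemble_average([[0.2, 0.8], [0.4, 0.6]]))
```

Other merging rules, such as majority voting on the predicted classes, would fit the same interface.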
For the CNN-RNN architecture, we only train multitask models. We set the image sequence length to 32. Note that we use the parameters of the best CNN model, which in our experiments is the Student0 model in Table 2, to initialize the feature extractor of the CNN-RNN model. The performance of the multitask CNN-RNN models is reported in Table 3. The number of students is again set to 5. We find that some of the student models outperform the teacher model. The student ensemble performs best among all these models.
In this paper, we explored data balancing techniques and applied them to the multitask emotion recognition task. We proposed an algorithm that lets a multitask model learn from partial labels, consisting of teacher model training followed by student model training. Our results show that almost every student model outperforms the teacher model on all tasks, which might be due to the benefit of the full set of labels and the softened probabilities.
[1] (2015) Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163, pp. 3–16.
[2] (2015) Challenges in representation learning: a report on three machine learning contests. Neural Networks 64, pp. 59–63.
[3] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[4] (2017) Recognition of affect in the wild using deep neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1972–1979.
[5] (2020) Analysing affective behavior in the first ABAW 2020 competition.
[6] (2019) Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, pp. 1–23.
[7] (2018) A multi-component CNN-RNN approach for dimensional emotion recognition in-the-wild. arXiv preprint arXiv:1805.01452.
[8] (2018) Aff-wild2: extending the Aff-wild database for affect recognition. arXiv preprint arXiv:1811.07770.
[9] (2019) Expression, affect, action unit recognition: Aff-wild2, multi-task learning and ArcFace. arXiv preprint arXiv:1910.04855.
[10] (2017) AFEW-VA database for valence and arousal estimation in-the-wild. Image and Vision Computing 65, pp. 23–36.
[11] (2013) DISFA: a spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4 (2), pp. 151–160.
[12] (2017) Aff-wild: valence and arousal 'in-the-wild' challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–41.
[13] (2018) From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision 126 (5), pp. 550–569.