
Rethinking the Learning Paradigm for Facial Expression Recognition

by   Weijie Wang, et al.

Due to the subjective crowdsourcing annotations and the inherent inter-class similarity of facial expressions, the real-world Facial Expression Recognition (FER) datasets usually exhibit ambiguous annotation. To simplify the learning paradigm, most previous methods convert ambiguous annotation results into precise one-hot annotations and train FER models in an end-to-end supervised manner. In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with original ambiguous annotation.



1 Introduction

Aiming at analyzing and understanding human emotion, Facial Expression Recognition (FER) is an active field of research in computer vision. In view of its practical importance in various intelligent systems, such as sociable robotics, psychological treatments, and automatic driving, numerous Facial Expression Recognition studies have been conducted.

Figure 1: These samples are from the FERPlus dataset, with the corresponding labels in green, which are obtained by voting on the results of the crowdsourcing annotations. This label conversion strategy can introduce artificial noise.

In recent years, FER has achieved impressive performance on laboratory-controlled datasets, such as CK+ [lucey2010extended], JAFFE [lyons1998coding] and RaFD [langner2010presentation], which are collected under ideal conditions and annotated by experts. Recently, driven by the requirements of real-world FER applications, large-scale datasets collected in unconstrained environments, such as FERPlus [barsoum2016training], RAF-DB [li2017reliable] and AffectNet [mollahosseini2017AffectNet], continue to emerge. However, collecting large-scale datasets with fully precise annotations is a labor-intensive and challenging task. Consequently, crowdsourcing is commonly used for annotation. Due to the subjective perception of annotators and the inherent inter-class similarity of facial expressions, real-world FER datasets usually exhibit ambiguous annotation. To make the crowdsourcing annotation results easier to use, most previous methods convert them into precise one-hot annotations by simple voting [barsoum2016training] or thresholding [li2017reliable] and train FER models in an end-to-end supervised manner. However, these label conversion strategies do not always yield satisfactory results, as shown in Figure 1. The performance of models supervised with such converted labels is inevitably impeded.
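To make the effect of label conversion concrete, the following minimal sketch contrasts voting, thresholding and simply keeping every annotated class. The vote counts, the class order, the function names and the threshold values are hypothetical and chosen only for illustration; they are not taken from the FERPlus or RAF-DB tooling.

```python
# Hypothetical class order and vote counts for one image (10 annotators).
CLASSES = ["neutral", "happy", "surprise", "sad", "angry", "disgust", "fear", "contempt"]


def majority_vote(votes):
    """Collapse annotator votes into a single class index (one-hot style label)."""
    return max(range(len(votes)), key=lambda k: votes[k])


def threshold_label(votes, min_share):
    """Keep the top class only if its share of the votes reaches min_share.

    Returns None when no class is confident enough, i.e. the sample is dropped.
    """
    top = majority_vote(votes)
    return top if votes[top] / sum(votes) >= min_share else None


def candidate_set(votes):
    """Keep every class that received at least one vote (ambiguous annotation)."""
    return [k for k, v in enumerate(votes) if v > 0]


votes = [0, 0, 4, 0, 1, 0, 5, 0]  # surprise: 4, angry: 1, fear: 5
hard = majority_vote(votes)
strict = threshold_label(votes, min_share=0.6)
cands = candidate_set(votes)

print("voting      ->", CLASSES[hard])
print("thresholding->", None if strict is None else CLASSES[strict])
print("candidates  ->", [CLASSES[k] for k in cands])
```

In this toy case, voting commits to "fear" although four of ten annotators chose "surprise", a strict threshold discards the sample altogether, and the candidate set keeps all three annotated classes for later disambiguation.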

In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with the original ambiguous annotation. Specifically, we model FER as a Partial Label Learning (PLL) problem, where each training instance is associated with a set of candidate labels among which exactly one is true, reducing the overhead of finding the exact label among ambiguous candidates [DBLP:conf/icml/LvXF0GS20]. The major challenge of PLL lies in label disambiguation. Existing PLL methods [DBLP:journals/corr/abs-2201-08984, DBLP:conf/nips/LiuD12, DBLP:conf/kdd/ZhangZL16, DBLP:journals/tkde/LyuFWLL21] assume that examples close to each other in the feature space tend to share an identical label in the label space, and their label disambiguation is based on this assumption. This also indicates that label disambiguation relies on a good feature representation. However, there is a contradiction here: the direct use of ambiguous annotations inevitably affects the quality of representation learning, which in turn hinders label disambiguation. Errors can accumulate severely, especially with low-quality initial representations [DBLP:conf/icml/WuWZ22].

To mitigate the influence of ambiguous labels on feature learning, we propose to use unsupervised learning to obtain the initial feature representation. Specifically, we propose to use the recently emerged Masked Image Modeling (MIM) to pre-train the facial expression representation. MIM is mostly built upon the Vision Transformer (ViT) and learns self-supervised visual representations by masking parts of the input image while requiring the target model to recover the missing contents. Facial expressions are high-level semantic information obtained from the combination of action units in multiple regions of the human face. The Vision Transformer can model the relationship between action units with the global self-attention mechanism. MIM pre-training forces the ViT to learn the local facial action units and global facial structures in various expressions [He_2022_CVPR]. Then the pre-trained ViT is fine-tuned with an ambiguous label set in the PLL paradigm. Moreover, unlike other tasks, facial expression data exhibit high inter-class similarity, which makes label disambiguation more challenging. In this paper, we propose to treat the label disambiguation process as querying the confidence of the correlation between each category label and the features. Specifically, inspired by DETR [DBLP:conf/eccv/CarionMSUKZ20], we propose a decoder-based label disambiguation module. We leverage learnable label embeddings as queries to obtain the confidence of the correlation between the features and the corresponding label via the cross-attention module in the Transformer decoder. Finally, we integrate the encoder of the ViT backbone and the Transformer decoder-based label disambiguation module to build a fully transformer-based facial expression recognizer in the partial label learning paradigm.

Our contributions are summarized as follows:

  • We rethink the existing training paradigm of FER and propose to use weakly supervised strategies to train FER models with original ambiguous annotation. To the best of our knowledge, it is the first work that addresses the annotation ambiguity in FER with Partial Label Learning (PLL) paradigm.

  • We build a fully transformer-based facial expression recognizer, in which we explore the benefits of the facial expression representation based on MIM pre-training and of the Transformer decoder-based label confidence queries for label disambiguation.

  • Our method is extensively evaluated on large-scale real-world FER datasets. Experimental results show that our method consistently outperforms state-of-the-art FER methods. This convincingly shows the great potential of the Partial Label Learning (PLL) paradigm for real-world FER.

2 Related Work

2.1 Facial Expression Recognition

The purpose of Facial Expression Recognition (FER) [wang2020suppressing, ng2003sift, mollahosseini2017AffectNet, albanie2018emotion] is to help computers understand human behavior and even interact with humans. FER methods can be mainly divided into traditional and deep learning-based methods. The former mainly focus on hand-designed texture-based feature extraction for in-the-lab FER datasets, on which the recognition accuracy is very high.

Since then, most studies have tried to tackle the in-the-wild FER task. These large-scale datasets are collected in an unconstrained environment and annotated by crowdsourcing. Due to the subjective perception of annotators and the inherent inter-class similarity of facial expressions, real-world FER datasets usually exhibit ambiguous annotation. Previous methods [DBLP:conf/mm/WangWL19, DBLP:conf/cvpr/YangCY18, vo2020pyramid, farzaneh2020discriminant, gera2021landmark] convert the annotation results into precise one-hot annotations by simple voting [barsoum2016training] or thresholding [li2017reliable] and train FER models in an end-to-end supervised manner. However, these label conversion strategies do not always yield satisfactory results, as shown in Figure 1. The performance of models supervised with such converted labels is inevitably impeded. In recent years, several works have tried to address the ambiguity problem in FER. SCN [wang2020suppressing] and RUL [zhang2021relative] relabel noisy samples automatically by learning from clean samples to suppress the uncertainty from noisy labels. DMUE [she2021dive] proposes latent label distribution mining and an inter-sample uncertainty assessment, which aim to solve the problem of label ambiguity. However, these methods either require the noise rate in order to filter out noisy samples, or bring extra computation overhead and cannot generalize well to classification tasks with a large number of classes.

Recently, some researchers have moved from fully-supervised learning (SL) to semi-supervised learning (SSL). Ada-CM [li2022towards] proposes an SSL method that adaptively learns confidence margins to utilize all unlabeled samples during training, which represents a new idea for solving FER tasks by making use of additional unlabeled data. In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with the original ambiguous annotation. As Figure 1 shows, in-the-wild facial expressions often include compound or even mixed emotions. Thus, for these uncertain samples, it is natural to have multiple annotated candidates. Consequently, for the first time, a new Partial Label Learning (PLL) paradigm is proposed to address the annotation ambiguity in FER.

Figure 2: The overview of our framework. In the left part, we use the pre-trained ViT encoder as our backbone for feature representation of the 2D image. In the right part, we input the features obtained from the left part to the transformer decoder with learnable label embeddings. After that, we revise the confidence and update it to get the updated confidence. Finally, the loss is computed between the logits and the updated confidence.


2.2 Partial Label Learning

Partial Label Learning (PLL) is an important category of weakly supervised learning methods in which each object has multiple candidate labels, but only one of them is the ground-truth label. PLL methods are mainly divided into average-based and identification-based methods. The most intuitive solution is the average-based approach, which generally considers each candidate label to be equally important during training and makes predictions by averaging the outputs of all candidate labels [hullermeier2006learning, zhang2015solving]. Others [cour2011learning, zhang2016partial] utilize a parametric model to maximize the average score of the candidate labels minus that of the non-candidate labels. The identification-based methods normally maximize the output of the most likely candidate label in order to disambiguate the ground-truth label. Typical methods include the maximum margin criterion [yu2016maximum] and graph-based approaches [wang2021adaptive]. With the development of deep learning, more network structures [feng2019partial, lv2020progressive, wen2021leveraged] have emerged, which rely on the output of the model itself to disambiguate the candidate label sets. Some researchers argue [liu2012conditional, zhang2022semi, lyu2019gm] that data points closer in feature space are more likely to be classified as the same label, which relies on a good feature representation. PICO [wang2022pico] reconciles PLL-based label disambiguation and feature representation based on contrastive learning. However, it requires elaborately designed data augmentation due to its sensitivity to positive sample selection. In addition, the high inter-class similarity of facial expressions makes it hard to obtain robust feature representations, which is a drawback of contrastive learning in FER.

2.3 Masked Image Modeling

Inspired by Masked Language Modeling (MLM), Masked Image Modeling (MIM) has become a popular self-supervised method. iGPT [chen2020generative] predicts unknown pixels by operating on known pixel sequences. ViT [dosovitskiy2020image] explores masked patch prediction as a self-supervised approach. Moreover, iBOT [zhou2021ibot] achieves excellent performance through siamese networks. However, these methods assume image semantic consistency. The ability of BEiT [bao2021beit] to predict discrete tokens depends on the pre-trained VQVAE model [van2017neural] that it relies on. As an extension of the ViT work, the main aim of SimMIM [xie2022simmim] is to predict pixels. Unlike the previously mentioned approaches, MAE [he2022masked] proposes an encoder-decoder architecture that uses the decoder for the MIM task. In contrast, MaskFeat [wei2022masked] adopts the HOG descriptor as the prediction target rather than pixels.

3 Proposed Method

In this section, we introduce a fully transformer-based facial expression recognizer with a new partial label learning paradigm, which mainly consists of a process of robust representation learning and label disambiguation as shown in Figure 2. Specifically, we explore the benefits of the facial expression representation based on MIM pre-train and the Transformer decoder-based label confidence queries for label disambiguation, which is detailed in the following sections.

3.1 Problem Formulation

Generally, in an in-the-wild FER dataset, each sample is assumed to belong to exactly one of the expression categories, which we refer to as its deterministic category. This simplified form of labels is consistent with the supervised learning setting. However, as shown in Figure 1, most FER data are annotated by crowdsourcing. Therefore, it is not ideal for most methods to simplify the FER task to supervised learning (SL) and use the one-hot label on all labeled data while ignoring the noise inherent at the label level.

Figure 3: We adopt Masked Image Modeling (MIM) for pre-training, which predicts the Histogram of Oriented Gradients (HOG) descriptor from the randomly masked input. After obtaining the pre-trained model, we use it to fine-tune our framework for the FER task.

Partial-label learning (PLL) is a multi-class classification problem. Similar to SL, PLL aims to predict the true label of the input using a classification function. However, PLL can tolerate more uncertainty from the label space, and gradually eliminates it in a weakly supervised way during the training stage. We argue that modelling the FER task as partial label learning is more appropriate. PLL assumes that each input $x_i$ must have its true label $y_i$ in the corresponding candidate label set $S_i$, as in Eq.(1):

$$P(y_i \in S_i \mid x_i, S_i) = 1, \quad \forall S_i \in \mathcal{S}, \tag{1}$$

where $\mathcal{S}$ is the collection of all candidate label sets. In addition, different from the SL setup, each sample in the partially labeled dataset is denoted by $(x_i, S_i)$, where $S_i$ is the set of candidate labels corresponding to $x_i$.

Thus, the aim of partial label learning (PLL) is to recognize the (unseen) true label $y_i$ of each sample $x_i$ from its candidate label set $S_i$ by training a classifier $f$, which is used to compute the probability of each class in the candidate label set. We construct a candidate label set $S_i$ for each sample $x_i$ in the FER dataset as a set of binary codes, e.g. $s_{ij} = 1$ if the crowdsourcing result contains category $j$, else $s_{ij} = 0$. Before training, we normalize each label set $S_i$ to obtain a confidence vector $c_i$, which represents the probability of the sample belonging to each category and sums to 1. In the training stage, we train the classifier $f$ to update the confidence vector $c_i$. Ideally, $c_i$ will concentrate most of the probability density on the true label $y_i$ of the sample $x_i$. We use the categorical cross-entropy loss as the constraint, as in Eq.(2):

$$\mathcal{L}_{pll} = -\sum_{i=1}^{n} \sum_{j=1}^{C} c_{ij} \log f_j(x_i), \tag{2}$$

where $i$ denotes the index of the training sample and $j$ denotes the index into the confidence vector $c_i$, $n$ is the number of training samples and $C$ is the number of classes. $f_j(x_i)$ is the $j$-th output of the soft-max layer of the model for the input $x_i$. In the next section, we continue to describe the feature representation part of our proposed method.
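The construction of the confidence vector and the cross-entropy of Eq.(2) can be illustrated with a minimal, dependency-free sketch. The function names and the toy inputs are assumptions made for illustration, and averaging the loss over the batch is a choice of this sketch rather than something stated above.

```python
import math


def init_confidence(candidate_mask):
    """Normalize a binary candidate mask into a confidence vector that sums to 1."""
    total = sum(candidate_mask)
    return [m / total for m in candidate_mask]


def softmax(logits):
    """Numerically stable soft-max over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def pll_cross_entropy(confidences, batch_logits):
    """Confidence-weighted cross-entropy, averaged over the batch (assumption)."""
    loss = 0.0
    for conf, logits in zip(confidences, batch_logits):
        probs = softmax(logits)
        # Non-candidate classes have zero confidence and do not contribute.
        loss -= sum(c * math.log(p) for c, p in zip(conf, probs) if c > 0)
    return loss / len(batch_logits)


mask = [0, 1, 1, 0]                    # classes 1 and 2 are candidates
conf = init_confidence(mask)           # uniform over the candidate set
loss = pll_cross_entropy([conf], [[0.0, 0.0, 0.0, 0.0]])
print(conf, round(loss, 4))
```

With uniform logits over four classes every class has probability 0.25, so the loss equals log 4 regardless of how the confidence is split among the candidates; the loss only decreases once the model concentrates probability on the classes that carry confidence.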

3.2 PLL with Pre-trained Feature Representation

In this work, we use MIM to obtain a better initial feature representation and, based on that, use PLL to perform label disambiguation. An important assumption of PLL is that samples close to each other in sample space are more likely to share an identical label in label space, which indicates that feature robustness is important for label disambiguation. However, as shown in Figure 1, the uncertainty of the FER data leads to ambiguity at the label level, which hinders the feature representation. Based on these observations, as shown in Figure 3, we use the MIM pre-training method to optimize the feature representation. Furthermore, since the network is easily misled by incorrect labels, guiding feature learning by labels alone is difficult for the FER task, which is one of the bottlenecks of most current fully-supervised learning methods. Therefore, we argue that self-supervised pre-training can help in obtaining a better feature representation by avoiding being misled by incorrect labels.

Inspired by MaskFeat [wei2022masked], we adopt Masked Image Modeling (MIM) to pre-train the facial expression representation by predicting the HOG descriptor. Specifically, in the encoder part, we adopt the same structure as ViT [dosovitskiy2020image] for pre-training, which operates only on the visible, unmasked patches. The MIM pre-training forces the ViT to learn the local facial action units and global facial structures in various expressions. In FER, focusing on the variations in the action units of the facial expression is more meaningful than focusing on the information embedded at the pixel level. Thus, in the decoder part, we adopt the HOG descriptor as the prediction target of the decoder rather than pixel values. In addition, pixel prediction is easily influenced by redundant information (high-frequency details, lighting and contrast variations), which is less important in the FER task. In contrast, the HOG descriptor [dalal2005histograms] is dedicated to describing the distribution of edge directions within a local subregion. Finally, the output of the decoder can be reshaped into a reconstructed HOG prediction of the image. In the subsequent training stage, we keep only the encoder for feature extraction and discard the decoder.
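To illustrate why an orientation histogram is a compact target for edge structure, the sketch below computes a heavily simplified HOG for a single grayscale cell and draws a random patch mask. It is not the MaskFeat target itself: there is no bilinear bin interpolation or block normalization, and the 196-patch grid (a 224x224 input with 16x16 patches) is an assumption; only the mask ratio of 0.4 is taken from the experiment setup.

```python
import math
import random


def hog_cell(patch, bins=9):
    """L2-normalized histogram of unsigned gradient orientations for one cell."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * bins
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]   # central differences
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned, [0, 180)
            hist[int(angle / (180.0 / bins)) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def random_mask(num_patches, mask_ratio, seed=0):
    """Pick the indices of the patches that are hidden from the encoder."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), int(num_patches * mask_ratio)))


# A vertical edge: left half dark, right half bright.
patch = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
target = hog_cell(patch)
masked = random_mask(num_patches=196, mask_ratio=0.4)
print(target, len(masked))
```

For the vertical edge all gradient energy falls into the first orientation bin, and scaling the brightness of the patch leaves the normalized histogram unchanged, which reflects the insensitivity to lighting and contrast that motivates HOG as a prediction target.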

3.3 Label Disambiguation With Transformer Decoder

Given an image $x$, we feed it into the pre-trained encoder to obtain the visual tokens $V \in \mathbb{R}^{N \times d}$, where $N$ and $d$ denote the number of visual tokens and the feature dimension of the hidden state, respectively. We initialize the query embedding $Q \in \mathbb{R}^{C \times d}$, where $C$ denotes the number of classes. In the training stage, the transformer decoder updates $Q$ according to the visual tokens, so that it reflects the correlation of category-related features. Inspired by DETR [carion2020end], we link the category-related features to the learnable query embedding and predict the confidence vector of the sample by their correspondence. In detail, we feed the logits from the last layer of the transformer decoder into a sigmoid function to get the confidence vector.
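The idea of querying per-class confidence with label embeddings can be sketched as a single cross-attention step. This is only a schematic illustration under strong simplifications: one attention head, no learned query/key/value projections, no self-attention, layer normalization or feed-forward layers, and a hypothetical per-class linear head `heads` that maps each attended query to one logit.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    norm = sum(exps)
    return [e / norm for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def label_query_attention(queries, tokens):
    """Each label query attends over all visual tokens (scaled dot-product)."""
    scale = math.sqrt(len(tokens[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        attended.append([sum(w * t[d] for w, t in zip(weights, tokens))
                         for d in range(len(q))])
    return attended


def label_confidence(queries, tokens, heads):
    """Map each attended query to one logit and squash it with a sigmoid."""
    attended = label_query_attention(queries, tokens)
    return [sigmoid(dot(a, w)) for a, w in zip(attended, heads)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # N = 3 visual tokens, d = 2
queries = [[4.0, 0.0], [0.0, 4.0]]              # C = 2 label queries
heads = [[1.0, 0.0], [0.0, 1.0]]                # hypothetical per-class heads
conf = label_confidence(queries, tokens, heads)
print([round(c, 3) for c in conf])
```

Each query pools the visual tokens that match its own direction, so the resulting score can be read as how strongly the image supports that class. Because a sigmoid is applied per class, the scores are independent and need not sum to 1.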


3.3.1 Label embedding regularization

It is worth noting that, in addition to ambiguous annotation, these FER datasets also commonly exhibit an imbalanced class distribution. In the training stage, this biases the model towards the majority classes, resulting in an uneven distribution of the feature space and degrading model performance by classifying minority classes as head classes. To mitigate these negative effects, inspired by [wang2020understanding], we require the ideal query embedding to be evenly distributed over a hypersphere. Thus, we design a uniform loss for the query embedding to enlarge the distance between different classes on the hypersphere. Specifically, before training, we first compute the optimal target positions on the hypersphere for the different classes of the query embedding, and then optimize the query embedding as in Eq.(3), which involves a temperature hyperparameter and the number of classes in the FER dataset.
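Since Eq.(3) is not reproduced in this text, the following sketch only illustrates the general idea of a uniformity objective on the hypersphere. The specific form used here (the mean over classes of the log-sum-exp of temperature-scaled cosine similarities to all other classes) is an assumption chosen for illustration and may differ from the loss actually used.

```python
import math


def normalize(vec):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def uniform_loss(embeddings, temperature=0.1):
    """Assumed uniformity objective: lower when class embeddings are spread out."""
    unit = [normalize(e) for e in embeddings]
    total = 0.0
    for i, a in enumerate(unit):
        sims = [sum(x * y for x, y in zip(a, b)) / temperature
                for j, b in enumerate(unit) if j != i]
        top = max(sims)  # stable log-sum-exp
        total += top + math.log(sum(math.exp(s - top) for s in sims))
    return total / len(unit)


# Three classes in 2-D: evenly spread (120 degrees apart) versus crowded.
spread = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]
crowded = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
print(round(uniform_loss(spread), 3), round(uniform_loss(crowded), 3))
```

Minimizing such an objective pushes the class embeddings apart regardless of how many training samples each class has, which is the property the regularization relies on to counter class imbalance.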

3.3.2 Revision confidence

As mentioned before, a robust representation in the feature space promotes the ability of PLL to disambiguate in the label space. For the samples with incorrect labels in the dataset, we first initialize the pseudo targets for the candidate label set with a uniform distribution. Then, during the update, we continuously revise the confidence score for each batch based on Eq.(4), which uses the response of the sigmoid layer applied to the output of the transformer decoder. After that, we compute the metric for updating the confidence vector as in Eq.(5), where a threshold in [0,1) reflects the amount of uncertainty between the top-2 ranked labels in the candidate label set. Algorithm 1 gives the pseudo-code of our method.
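Because Eq.(4) and Eq.(5) are not reproduced in this text, the sketch below shows only one plausible reading of the revision step, and every detail of it is an assumption: the sigmoid scores are restricted to the candidate labels and renormalized, and the confidence is overwritten only when the margin between the top-2 candidates exceeds the threshold.

```python
def revise_confidence(old_conf, scores, candidate_mask, threshold):
    """Assumed revision rule; not the exact Eq.(4)/(5) of the method.

    old_conf:       current confidence vector of the sample
    scores:         per-class sigmoid scores from the decoder, each in (0, 1)
    candidate_mask: 1 for candidate labels, 0 otherwise
    threshold:      value in [0, 1); larger means more conservative updates
    """
    masked = [s * m for s, m in zip(scores, candidate_mask)]
    total = sum(masked)
    if total == 0.0:
        return list(old_conf)
    proposal = [v / total for v in masked]
    ranked = sorted(proposal, reverse=True)
    margin = ranked[0] - ranked[1]          # uncertainty between the top-2 labels
    return proposal if margin > threshold else list(old_conf)


old = [0.0, 0.5, 0.5, 0.0]
mask = [0, 1, 1, 0]
confident = revise_confidence(old, [0.9, 0.8, 0.2, 0.1], mask, threshold=0.3)
unsure = revise_confidence(old, [0.9, 0.5, 0.45, 0.1], mask, threshold=0.3)
print(confident, unsure)
```

In the first call the two candidates are clearly separated, so the confidence shifts towards class 1; in the second call the candidates are nearly tied and the previous confidence is kept. Note that the high score of the non-candidate class 0 is ignored in both cases.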

Input: pre-trained model; number of total epochs; number of total iterations; query embedding; partially labeled training set
Initialize the confidence vector
for each epoch do
    shuffle the training set
    for each iteration do
        fetch a mini-batch from the training set
        calculate the uniform loss for the query embedding by Eq.(3)
        revise the confidence vector by Eq.(4) and Eq.(5)
        calculate the PLL loss by Eq.(2)
        update the network parameters by minimizing the loss of Eq.(2)
    end for
end for
Algorithm 1 Pseudo-code of our method
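The control flow of Algorithm 1 can be mirrored in a small runnable sketch. To stay self-contained, it swaps in several simplifications that are not part of the method: a linear soft-max classifier replaces the ViT encoder and transformer decoder, the uniform loss on the query embedding is omitted, mini-batches have size one, and the confidence revision is a plain renormalization of the model output over the candidate set.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def train_pll(features, candidate_masks, num_classes, epochs=50, lr=0.5, seed=0):
    """Toy PLL loop: confidence-weighted cross-entropy plus confidence revision."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    # Initialize the confidence uniformly over each candidate set.
    conf = [[m / sum(mask) for m in mask] for mask in candidate_masks]
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, mask = features[i], candidate_masks[i]
            logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
            probs = softmax(logits)
            # Gradient of the weighted cross-entropy w.r.t. logit k is p_k - c_k.
            for k in range(num_classes):
                grad = probs[k] - conf[i][k]
                for d in range(dim):
                    weights[k][d] -= lr * grad * x[d]
            # Revise the confidence: renormalize the prediction over candidates.
            masked = [p * m for p, m in zip(probs, mask)]
            total = sum(masked)
            conf[i] = [v / total for v in masked]
    return weights, conf


# Samples 0 and 2 are unambiguous; samples 1 and 3 carry one distracting candidate.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
masks = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1]]
weights, conf = train_pll(features, masks, num_classes=3)
print([[round(c, 3) for c in row] for row in conf])
```

In this toy run the ambiguous samples share features with unambiguous ones, so their confidence gradually concentrates on the label supported by their neighbours in feature space, which is the assumption that label disambiguation builds on.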
| Method | Backbone | RAF-DB | FERPlus | AffectNet7 | AffectNet8 |
|---|---|---|---|---|---|
| PSR [vo2020pyramid] | VGG-16 | 88.98 | 89.75 | 63.77 | 60.68 |
| DDA [farzaneh2020discriminant] | ResNet-18 | 86.90 | - | 62.34 | - |
| SCN [wang2020suppressing] | ResNet-18 | 87.03 | 88.01 | 63.40 | 60.23 |
| SCAN [gera2021landmark] | ResNet-50 | 89.02 | 89.42 | 65.14 | 61.73 |
| DMUE [she2021dive] | ResNet-18 | 88.76 | 88.64 | - | 62.84 |
| RUL [zhang2021relative] | ResNet-18 | 88.98 | 88.75 | 61.43 | - |
| DACL [farzaneh2021facial] | ResNet-18 | 87.78 | 88.39 | 65.20 | 60.75 |
| MA-Net [zhao2021learning] | ResNet-18 | 88.40 | - | 64.53 | 60.96 |
| MVT [li2021mvt] | DeIT-S/16 | 88.62 | 89.22 | 64.57 | 61.40 |
| VTFF [ma2021facial] | ResNet-18+ViT-B/32 | 88.14 | 88.81 | 64.80 | 61.85 |
| EAC [zhang2022learn] | ResNet-18 | 89.99 | 89.64 | 65.32 | - |
| RC [feng2020provably] | ResNet-18 | 88.92 | 88.53 | 64.87 | 61.63 |
| LW [wen2021leveraged] | ResNet-18 | 88.70 | 88.10 | 64.53 | 61.46 |
| PICO [wang2022pico] | ResNet-18 | 88.53 | 87.93 | 64.35 | 61.38 |
| Baseline [dosovitskiy2020image] | ViT-B/16 | 88.36 | 88.18 | 64.61 | 61.34 |
| Ours | ViT-B/16 | 92.05 | 91.11 | 67.11 | 64.12 |

Table 1: Comparison with the state-of-the-art results on RAF-DB, FERPlus, AffectNet7 and AffectNet8. Accuracy (%) is reported on the test set.

4 Experiments

In this section, we first describe the FER datasets and the implementation details of the experiments. To evaluate our proposed method, we compare it with the baseline and other related methods, including fully-supervised learning methods: PSR [vo2020pyramid], DDA [farzaneh2020discriminant], SCN [wang2020suppressing], SCAN [gera2021landmark], DMUE [she2021dive], RUL [zhang2021relative], DACL [farzaneh2021facial], MA-Net [zhao2021learning], MVT [li2021mvt], VTFF [ma2021facial], EAC [zhang2022learn]; and partial label learning methods: RC [feng2020provably], LW [wen2021leveraged], PICO [wang2022pico].

4.1 Datasets

RAF-DB [li2019reliable] contains around 30K highly diverse facial images with single or compound expressions, labeled by 40 trained human annotators in a crowdsourcing way. In our experiments, we use 12271 images for training and 3068 images for testing.

FERPlus [BarsoumICMI2016] is an extension of the standard FER emotion dataset, which provides a new set of labels for each image from 10 human annotators. It consists of 28709 training images, 3589 validation images, and 3589 test images. Following previous work, we use only the training and test parts.

AffectNet [DBLP:journals/corr/abs-1708-03985] is one of the largest and most challenging FER datasets, with 440K manually labeled images. Following previous work, we use two different divisions, AffectNet7 and AffectNet8. Compared to AffectNet7, AffectNet8 contains one additional expression, contempt, with 3667 training images and 500 test images.

| Method | RAF-DB | FERPlus | AffectNet8 |
|---|---|---|---|
| DMUE | 88.76 | 88.64 | 62.84 |
| ResNet-18 + PLL | 89.67 | 89.80 | 62.89 |

Table 2: For fairness, we replace our backbone with ResNet-18, keep only the partial label part and remove the other modules. Accuracy (%) is reported on the test set.

4.2 Baseline and Experiment setup

We adopt ViT-B/16 [dosovitskiy2020image] pre-trained on ImageNet-1K without any modification as our baseline. For data preprocessing, we keep all images at the same fixed size and use MobileFaceNet [chen2018mobilefacenets] to obtain aligned face regions for RAF-DB and FERPlus. For AffectNet, alignment is obtained via the landmarks provided with the data. In the pre-training part, we use four Tesla V100 GPUs to pre-train the ViT for 600 epochs with batch size 192 and learning rate 2e-4. We set the mask ratio of MAE to 0.4. In the fine-tuning part, we use only two Tesla V100 GPUs. We use the AdamW optimizer, setting the batch size to 320 and the initial learning rate to 1e-4 with a weight decay of 5e-2. The pre-training and fine-tuning protocol that we use is based on [wei2022masked]. In the PLL setup, for FER datasets without crowdsourcing results, we construct the candidate set based on LW [wen2021leveraged].

4.3 Comparison with State-of-the-art FER Methods

We conduct extensive comparisons with the SOTA FER methods on RAF-DB, FERPlus, AffectNet7, and AffectNet8. Specifically, we compare with ResNet-18-based methods pre-trained on the face recognition dataset MS-Celeb-1M [guo2016ms], such as SCN [wang2020suppressing], DMUE [she2021dive], RUL [zhang2021relative], DACL [farzaneh2021facial], MA-Net [zhao2021learning], DDA [farzaneh2020discriminant] and EAC [zhang2022learn], and with the ResNet-50-based SCAN [gera2021landmark]. In addition, we compare with transformer-based methods such as MVT [li2021mvt] and VTFF [ma2021facial], and with PSR [vo2020pyramid], the only method using VGG-16. All comparison results are listed in Table 1. Our method achieves the highest accuracy on all datasets, outperforming all current SOTA methods for FER, including ResNet-based and Transformer-based methods. Specifically, we surpass the baseline [dosovitskiy2020image] by 3.69%, 2.93%, 2.50% and 2.78% accuracy on RAF-DB, FERPlus, AffectNet7 and AffectNet8, respectively.
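The reported margins over the baseline follow directly from the "Ours" and "Baseline" rows of Table 1, as the short check below shows. The dictionary names are illustrative; the accuracy values are copied from the table.

```python
# Accuracy (%) copied from the "Ours" and "Baseline" rows of Table 1.
ours = {"RAF-DB": 92.05, "FERPlus": 91.11, "AffectNet7": 67.11, "AffectNet8": 64.12}
baseline = {"RAF-DB": 88.36, "FERPlus": 88.18, "AffectNet7": 64.61, "AffectNet8": 61.34}

# Absolute gain in percentage points, rounded to two decimals.
gains = {name: round(ours[name] - baseline[name], 2) for name in ours}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The gains are absolute differences in percentage points rather than relative improvements.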

4.3.1 Comparison with SL methods

Compared with the latest SL methods, we surpass EAC [zhang2022learn] by 2.06% and 1.79% on RAF-DB and AffectNet7 respectively, PSR [vo2020pyramid] by 1.36% on FERPlus, and DMUE [she2021dive] by 1.28% on AffectNet8. These experimental results show the effectiveness of our proposed method in the FER task and demonstrate that converting ambiguous annotation results into one-hot labels for fully-supervised learning is not ideal. Compared to PLL, these CNN-based SL methods are misled by the incorrect labels in the label space, which hinders the performance of the model. Furthermore, the transformer-based SL methods suffer from the same problem. Although MVT [li2021mvt] uses a pre-trained ViT as the backbone, the pre-training is not unsupervised and can still be misled by incorrect labels in the label space, thus affecting the quality of feature learning. The same problem exists with VTFF [ma2021facial].

4.3.2 Comparison with PLL methods

We also compare with several representative PLL methods, namely RC [feng2020provably], LW [wen2021leveraged] and PICO [wang2022pico]. Their results in Table 1 are obtained from our implementation based on their open-source code. General PLL methods for label disambiguation, such as RC and LW, rely on the robustness of the features. The results show that, for the FER task, RC and LW are misled by incorrect labels in the label space, which affects disambiguation due to the lack of a robust facial expression feature representation. In addition, it is worth noting that the result of PICO suggests that contrastive-based partial label learning is not well suited to the FER task. As mentioned before, the contrastive-based approach is sensitive to data augmentation. Unlike generic objects, facial expressions have high inter-class similarity, since they all consist of five fixed facial action units. Typical data augmentation, however, usually alters the lighting and contrast of the image, which only adds redundant information in the FER task and sometimes even degrades performance.

4.4 Ablation Study

4.4.1 Effectiveness of PLL paradigm.

To verify the effectiveness of PLL, we select DMUE [she2021dive] for a fair comparison: we keep only the PLL part with ResNet-18 as the backbone and remove the other key modules. The results are reported in Table 2. We exceed DMUE across the board, improving the accuracy by 0.91% on RAF-DB, 1.16% on FERPlus, and 0.05% on AffectNet8. The results indicate that simplifying FER to a supervised learning task is not ideal. DMUE mines and describes the distribution of the label space corresponding to the samples. However, it is based on supervised learning, which easily suffers from incorrect labels in the label space. In addition, we adopt a pre-trained ResNet-18 as the backbone, which gives a less robust representation in FER compared to MAE. Nevertheless, we still obtain competitive results using PLL. For a more visual illustration, we show some PLL disambiguation results on the FERPlus database in Figure 4. The experimental results demonstrate the effectiveness of using the PLL paradigm.

4.4.2 Protocol: SL or MIM? Prediction: HOG or Pixel?

To verify the effectiveness of the MIM pre-training protocol and the HOG descriptor for the FER task, we use different pre-trained models to predict pixels and the HOG descriptor, respectively. In addition, we also test them in the PLL paradigm. The related results are reported in Table 3. On the one hand, the use of MIM pre-training avoids the negative effects of incorrect labels and obtains a better feature representation for FER compared to the SL pre-training protocol. On the other hand, MIM pre-training forces the ViT to learn the local facial action units and global facial structures in various expressions using the self-attention mechanism.

As for the prediction target of the decoder, HOG gives better results than pixels. In FER, we focus more on the different action units of facial expressions than on the redundant information (illumination, contrast variation) between pixels. Thus, the HOG descriptor is a more suitable prediction target.

| Method | Dataset | RAF-DB | AffectNet7 |
|---|---|---|---|
| Baseline | ImageNet-1k | 88.36 | 64.61 |
| MIM(pixel) | FER | 89.64 | 65.25 |
| MIM(Hog) | FER | 89.85 | 65.31 |
| MIM(pixel)+PLL | FER | 90.09 | 65.97 |
| MIM(Hog)+PLL | FER | 90.35 | 66.12 |

Table 3: Ablation of 'Pixel' and 'Hog' prediction targets, with and without PLL, and without the label disambiguation module. Accuracy (%) is reported on the test set.
Decoder Re-conf RAF-DB AffectNet7
91.29 66.67
91.64 66.90
91.89 66.93
92.05 67.11
Table 4: Evaluation of transformer decoder-based label disambiguation module with PLL. Accuracy (Acc.(%)) is reported on the test dataset of RAF-DB and AffectNet7. We denote transformer decoder as ’Decoder’, label embedding regularization as ’’ ,and revision confidence as ’Re-conf’, respectively.

4.4.3 Evaluation of transformer decoder and key modules

To evaluate the effectiveness of the transformer decoder and the other key modules, we conduct an ablation study on RAF-DB and AffectNet7, as shown in Table 4. Based on the PLL paradigm, the transformer decoder improves accuracy by 0.94% on RAF-DB and 0.55% on AffectNet7. The results show that the decoder uses the cross-attention module to obtain the correlation between features and category labels via the learnable label embedding, which in turn provides a more adequate foundation for PLL disambiguation. The label embedding regularization module alleviates the influence of imbalanced FER data classes by distributing the query embedding uniformly over the hypersphere. The revision confidence strategy further facilitates the ability of PLL to disambiguate in the label space, and enhances performance as an auxiliary disambiguation module. More details are provided in the appendix.

Figure 4: The grey part of the figure shows the GT labels corresponding to the samples in FERPlus; the blue part shows the candidate set we constructed for the samples, where the strikethroughs mark the results of PLL disambiguation for error removal. The predicted labels are consistent with our intuition.

5 Conclusion

In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with the original ambiguous annotation. To address the subjective crowdsourcing annotation and the inherent inter-class similarity of facial expressions, we model FER as a Partial Label Learning (PLL) problem, which allows each training example to be labeled with an ambiguous candidate set. We use the Masked Image Modeling (MIM) strategy to learn the feature representation of facial expressions in a self-supervised manner. Then a Transformer decoder module is proposed to query the probability of each ambiguous label candidate being the ground truth. Extensive experiments show the effectiveness of our method.