Automatically recognizing human emotions and expressions is a highly desirable ability for intelligent robots, as it promotes better communication and cooperation with humans. Current deep-learning-based algorithms achieve impressive performance in lab-controlled environments, but they often fail to recognize expressions accurately in uncontrolled, in-the-wild situations. Fortunately, facial action units (AUs) describe subtle facial behaviors, and they can help distinguish uncertain and ambiguous expressions. In this work, we explore the correlations among action units and facial expressions, and devise an AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework that learns AU representations without AU annotations and adaptively uses these representations to facilitate facial expression recognition. Specifically, it leverages AU-expression correlations to guide the learning of the AU classifiers, and thus it can obtain AU representations without requiring any AU annotations. Then, it introduces a knowledge-guided attention mechanism that mines useful AU representations under the constraint of AU-expression correlations. In this way, the framework can capture local discriminative and complementary features to enhance the facial representation for facial expression recognition. We conduct experiments on challenging uncontrolled datasets to demonstrate the superiority of the proposed framework over current state-of-the-art methods.
Facial expression recognition (FER) is essential for intelligent robotics because it helps robots understand human emotions and behaviors. This task aims to classify basic (e.g., happy, angry) or compound (e.g., happy & surprised, sad & angry) expressions based on face appearance in both in-the-lab [26, 34, 27] and in-the-wild environments [22, 10]. Recently, the most successful algorithms resort to deep neural networks [36, 15] to learn powerful feature representations that improve FER performance. Despite impressive progress in lab-controlled environments, the task remains challenging and unsolved due to complex variations in pose, illumination, and age, especially in the uncontrolled environments encountered during natural human-robot interaction.
According to the facial action coding system (FACS) [12, 38], facial action units (AUs) encode subtle facial appearances and changes, which have strong correlations with human expressions. For example as shown in Figure 1, if a face image is detected to have the AUs of “cheek raiser” and “lip corner puller”, it tends to be a “happy” face. In contrast, if the detected AUs are “inner brow raiser” and “lip corner depressor”, it is more likely to be a “sad” face. Thus, automatically detecting the AUs and modeling their relationships with the expressions is essential to promote FER performance, especially to help distinguish uncertain or ambiguous expressions.
In this work, we aim to mine the AU features to enhance face image representation for more robust and accurate expression recognition. To this end, two crucial challenges must be addressed. First, most current FER datasets (e.g., RAF-DB, SFEW2.0) do not have AU annotations, and it is very expensive and labor-intensive to annotate the AUs for these datasets. Thus, how to learn AU features without AU annotations is a key challenge. Second, there exist tens of AUs, and not all AUs are equally important for different expressions. How to adaptively select AU features to enhance the image representation for each expression is another vital problem.
To address these challenges, we exploit the correlations among expressions and AUs to learn AU features without AU supervision and adaptively select these features for feature enhancement by developing a novel AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework. Specifically, there exist strong correlations among expressions and AUs. We first design a knowledge-guided AU representation learning module that leverages these correlations to convert expression labels to pseudo AU labels and utilizes the pseudo labels to train the AU classifiers to obtain the AU features. As suggested in previous work, different AUs also have obvious co-occurrence dependencies, and these dependencies are important for selecting useful AUs. With the AU features, we further introduce a knowledge-guided attention module that learns to adaptively mine the most relevant AU features under the constraint of AU dependencies. In this way, the framework can automatically discover useful local facial behaviors to facilitate FER.
In summary, the contributions of this work are three-fold. First, we design a novel AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework that exploits prior knowledge of AU-expression correlations to automatically discover useful AU features, which enhance the image features used for facial expression recognition. Second, we propose to leverage the AU-expression correlations to guide the learning of AU features without AU annotations. Thus, the framework can be easily applied to current FER datasets that lack such annotations. Finally, we conduct extensive experiments on several large-scale in-the-wild datasets to demonstrate the effectiveness of the proposed framework, and carry out ablation studies to analyze the actual contribution of each key component.
In this section, we review the most related works about facial expression recognition and facial action unit detection.
Previous works on facial expression recognition mainly focused on the basic categories (e.g., happy, angry), for which the data were collected by asking volunteers to make specific expressions in constrained lab environments [39, 9]. Traditional methods primarily relied on hand-crafted features (e.g., Local Binary Pattern (LBP), Bag of Words (BoW), and Histogram of Oriented Gradients (HoG)). These methods can achieve satisfactory performance in such constrained environments. Recently, various large-scale datasets have emerged in which the data were captured in real-world scenarios [21, 32, 25]. Compared with previous in-the-lab settings, these datasets are even more challenging due to greater variance in pose, illumination, etc. Hand-crafted features can hardly capture such variance, and thus they perform quite poorly on these datasets. To address this issue, recent works resorted to deep convolutional networks [36, 15] to learn more powerful feature representations for expression recognition. For example, Liu et al. proposed a Boosted Deep Belief Network to learn features that characterize expression-related facial appearance/shape changes. Mollahosseini et al. designed deeper convolutional networks built on the inception module to extract more discriminative features for recognition. Liu et al. exploited 3D Convolutional Neural Networks trained with deformable action part constraints to learn discriminative part-based representations. Yu et al. ensembled multiple deep Convolutional Neural Networks to further improve recognition accuracy. Different from these works, our method introduces the relationships between AUs and expressions and the relationships among AUs to mine AU information that helps capture local facial behaviors and thus facilitates facial expression recognition.
The proposed method is also related to some recent works that exploit prior knowledge to facilitate visual reasoning [42, 6, 4, 7, 8, 41]. For example, Chen et al. introduce the co-occurrence correlations among different categories to help better recognize multiple semantic objects [7, 4]. Generally, these methods represent prior knowledge in the form of graphs and introduce graph neural networks [23, 18] for message propagation to learn contextualized feature representations. Different from these works, we introduce the relationships between AUs and expressions and among AUs to guide the generation of pseudo AU labels for AU representation learning and the selection of useful AUs for facial expression recognition.
Facial action units are defined to describe facial muscle movements according to the facial action coding system (FACS), and detecting action units is helpful for expression and emotion understanding. Previous works [35, 28] leveraged traditional shallow models (e.g., SVM and SVR) to solve this task. For example, Mahoor et al. projected high-dimensional facial images into a low-dimensional space via the spectral regression technique and adopted an SVM classifier to predict the AU intensity. Inspired by recent advances of deep neural networks on vision tasks, current works [2, 49, 40, 13] also designed deep models for action unit recognition. Gudi et al. designed a simple seven-layer network to estimate both the occurrences and intensities of the AUs. Furthermore, Walecki et al. adopted the conditional random field (CRF) to encode AU dependencies and combined it with deep neural networks to improve recognition.
The proposed AUE-CRL framework mines useful local AU information to enhance image representation learning under the guidance of the correlations among AUs and expressions. It mainly consists of three modules, i.e., a feature extractor, a knowledge-guided AU representation learning (KGAURL) module, and a knowledge-constrained AU selection (KCAUS) module. Taking an input image $I$, the feature extractor generates multi-layer feature maps, fuses these feature maps, and performs global average pooling to obtain a global expression feature vector $f_e$. The feature extractor also fuses the feature maps from multiple layers in the inverse direction to obtain a set of feature maps with a relatively large size, and leverages a crop network to extract a feature for each AU, i.e., $\{f_{au}^1, f_{au}^2, \dots, f_{au}^n\}$. Here, $f_{au}^i$ is the feature vector of the $i$-th AU and $n$ is the number of AUs. Then, the KGAURL module converts the expression labels to pseudo AU labels based on the AU-expression correlations. The pseudo labels are then used to supervise the training of the AU classifiers, and thus the module can learn AU features without any additional AU annotations. Finally, the KCAUS module exploits an attention mechanism to automatically discover useful AU features under the constraint of AU dependencies, and aggregates these features with the global expression feature vector for expression prediction. The overall framework is illustrated in Figure 2.
Current algorithms [21, 13] mainly resort to deep neural networks [15, 36, 5] to learn AU representations, but they require a large amount of ground-truth annotations to ensure their discriminative and generalization abilities. However, current datasets with AU annotations are mainly captured in constrained in-the-lab environments and cover very few subjects, e.g., 27 and 41 subjects in the DISFA and BP4D datasets, respectively. Features trained on these datasets can hardly generalize to other environments and subjects, especially to in-the-wild settings. On the other hand, current FER datasets lack AU annotations, and it is expensive and impractical to add AU annotations to these datasets. In this work, we design a knowledge-guided module that exploits the correlations between AUs and expressions to generate pseudo AU labels, and thus the proposed framework can learn AU representations without requiring additional annotations.
As suggested in the previous literature, each expression is relevant to several AUs, and the relevance can be further divided into primary and secondary levels. More concretely, if a face image is annotated with a specific expression, it tends to have the corresponding primary AUs with high probabilities, secondary AUs with medium probabilities, and other AUs with low probabilities. According to these relationships, we can build a correlation matrix $W \in \mathbb{R}^{m \times n}$, where $m$ and $n$ denote the numbers of expressions and AUs, respectively. The value $W_{ij}$ denotes the prior relevance probability between expression $i$ and AU $j$. It is assigned a large value if they are primarily relevant, a medium value if they are secondarily relevant, and a small value otherwise. Given a sample with an $m$-dimensional expression annotation $p_e$, it is intuitive to produce the pseudo AU labels by $p_{au} = p_e W$. However, the matrix $W$ merely defines three levels of correlation, which is very coarse. It is desirable to exploit finer-grained correlations so as to obtain more precise pseudo AU labels. In this work, we use a learnable correlation matrix $\hat{W}$ to generate the pseudo AU labels by

$$p_{au} = p_e \hat{W}. \tag{1}$$

We apply a simple linear function to map the AU features to the predicted AU labels

$$\hat{p}_{au}^{i} = w_i^{\top} f_{au}^{i}, \quad i = 1, \dots, n, \tag{2}$$

where $f_{au}^{i}$ is the feature vector of the $i$-th AU and $w_i$ is a learnable weight vector. During training, we define a mean square error loss between the pseudo and the predicted labels, and a regularization loss between the learned and prior correlation matrices, formulated as

$$\mathcal{L}_{au} = \frac{1}{n} \sum_{i=1}^{n} \left( p_{au}^{i} - \hat{p}_{au}^{i} \right)^2 + \lambda_1 \left\| \hat{W} - W \right\|_2^2. \tag{3}$$

In this way, we can learn finer-grained correlations and simultaneously exploit the prior correlations to generate more precise AU labels. $\lambda_1$ is a balance parameter, and it is set to 1.0 in the experiments.
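The pseudo-label step described above can be sketched in plain Python for a toy setting with 3 expressions and 4 AUs. The three prior levels (0.9, 0.5, 0.1), the primary/secondary assignments, the squared-difference form of the regularizer, and all function names are illustrative assumptions, not values or code from the paper.

```python
# Assumed prior levels; the text only says "large", "middle", and "small".
PRIMARY, SECONDARY, OTHER = 0.9, 0.5, 0.1

def build_prior(num_expr, num_au, primary, secondary):
    """Build the three-level prior correlation matrix (num_expr x num_au)."""
    W = [[OTHER] * num_au for _ in range(num_expr)]
    for e, aus in primary.items():
        for a in aus:
            W[e][a] = PRIMARY
    for e, aus in secondary.items():
        for a in aus:
            W[e][a] = SECONDARY
    return W

def pseudo_au_labels(expr_label, W):
    """Pseudo AU labels: expression label vector times correlation matrix."""
    return [sum(expr_label[e] * W[e][a] for e in range(len(W)))
            for a in range(len(W[0]))]

def au_loss(pseudo, predicted, W_learned, W_prior, lam=1.0):
    """Mean squared error between pseudo and predicted AU labels, plus a
    penalty keeping the learned matrix close to the prior (assumed form)."""
    mse = sum((p - q) ** 2 for p, q in zip(pseudo, predicted)) / len(pseudo)
    reg = sum((a - b) ** 2
              for row_l, row_p in zip(W_learned, W_prior)
              for a, b in zip(row_l, row_p))
    return mse + lam * reg

# Hypothetical toy correlations: expression 0 has primary AUs 0 and 1, etc.
W_prior = build_prior(3, 4, primary={0: [0, 1], 1: [2]},
                      secondary={0: [3], 2: [1]})
labels = pseudo_au_labels([1, 0, 0], W_prior)  # one-hot label, expression 0
print(labels)  # [0.9, 0.9, 0.1, 0.5]
```

With a one-hot expression label, the pseudo AU labels are simply the corresponding row of the correlation matrix; in training, the learnable matrix would start from the prior and be updated jointly with the AU classifiers.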
In this subsection, we introduce the knowledge-constrained attention mechanism that learns to adaptively select useful AU features and fuses these features to enhance image representation. It computes a correlation coefficient for each AU, performs weighted average to obtain a merged AU feature, and concatenates it with expression feature to obtain the final image representation.
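Specifically, we first fuse the expression feature $f_e$ with each AU feature $f_{au}^{i}$ using low-rank bilinear pooling to compute a coefficient

$$\tilde{a}_i = P^{\top} \tanh\!\left( \left( U^{\top} f_e \right) \odot \left( V^{\top} f_{au}^{i} \right) \right), \tag{4}$$

that denotes the importance of AU $i$ for expression recognition. Here, we use low-rank bilinear pooling as it is effective for feature fusion. In the equation, $\tanh(\cdot)$ is the hyperbolic tangent function and $\odot$ is the element-wise product operation. $U$, $V$, and $P$ are learnable parameter matrices. To make the coefficients easily comparable across different samples, we normalize the coefficients over all $n$ AUs using a softmax function

$$a_i = \frac{\exp(\tilde{a}_i)}{\sum_{j=1}^{n} \exp(\tilde{a}_j)}. \tag{5}$$

Then, we perform a weighted average over all AUs to obtain the merged AU feature

$$f_{au} = \sum_{i=1}^{n} a_i f_{au}^{i}. \tag{6}$$

Finally, we concatenate $f_{au}$ with $f_e$ for expression prediction

$$\hat{p}_e = \phi\!\left( [f_e, f_{au}] \right), \tag{7}$$

where $\phi$ denotes the expression classifier.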
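The attention-based fusion described in this subsection can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the tanh is applied after the element-wise product, the final projection is a single vector `p` so that each AU receives a scalar score, and the toy features and identity projections are made up for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_score(f_e, f_au, U, V, p):
    """Low-rank bilinear pooling score for one AU (assumed form):
    project both features, multiply element-wise, apply tanh, project."""
    joint = [math.tanh(a * b)
             for a, b in zip(matvec(U, f_e), matvec(V, f_au))]
    return sum(w * j for w, j in zip(p, joint))

def fuse(f_e, au_feats, U, V, p):
    """Return attention weights and the concatenated final representation."""
    weights = softmax([bilinear_score(f_e, f, U, V, p) for f in au_feats])
    dim = len(au_feats[0])
    merged = [sum(w * f[d] for w, f in zip(weights, au_feats))
              for d in range(dim)]
    return weights, f_e + merged  # list concatenation

# Toy inputs (hypothetical): 2-d features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
f_e = [1.0, 0.0]
au_feats = [[1.0, 0.0], [0.0, 1.0]]
weights, representation = fuse(f_e, au_feats, identity, identity, [1.0, 1.0])
```

In this toy case the first AU feature aligns with the expression feature, so it receives the larger attention weight, and the final representation has the dimension of the expression feature plus that of one AU feature.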
As suggested in the previous literature, there exist strong co-occurrence dependencies among different AUs. In other words, some AUs co-occur frequently while some AUs are mutually exclusive. For example, the AU inner brow raiser frequently co-occurs with outer brow raiser, as they are both controlled by the occipitofrontalis muscle group. In contrast, the AU lip corner puller hardly co-exists with lip corner depressor, as the corresponding muscle groups cannot co-activate. It is natural to expect that the learned attention coefficients should match such dependencies, and thus we introduce these prior dependencies as a regularization term during training.
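Inspired by previous work, we consider pair-wise dependencies that include positive and negative correlations to define the regularization term. Here, we consider that AU $i$ exists if its attention coefficient is higher than 0.5. We denote it as $z_i = 1$ if it exists and as $z_i = 0$ otherwise. For AUs $i$ and $j$ with a positive correlation, it is expected that

$$p(z_i = 1 \mid z_j = 1) > p(z_i = 1 \mid z_j = 0), \tag{8}$$

which can be transformed to the equivalent formulation

$$p(z_i = 1, z_j = 1) > p(z_i = 1)\, p(z_j = 1). \tag{9}$$

Accordingly, we can define the regularization term that constrains the positive correlation as

$$\mathcal{L}_{pos} = \sum_{(i, j) \in S_{pos}} \max\!\left( 0,\; p(z_i = 1)\, p(z_j = 1) - p(z_i = 1, z_j = 1) \right), \tag{10}$$

where $S_{pos}$ is the set of positive AU pairs. Similarly, the regularization term for the negative correlation constraint can be defined as

$$\mathcal{L}_{neg} = \sum_{(i, j) \in S_{neg}} \max\!\left( 0,\; p(z_i = 1, z_j = 1) - p(z_i = 1)\, p(z_j = 1) \right), \tag{11}$$

where $S_{neg}$ is the set of negative AU pairs.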
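To make the pair-wise constraint concrete, the sketch below estimates AU co-occurrence over a small batch and penalizes pairs that violate the expected dependency. It is an illustration under assumptions, not the paper's implementation: probabilities are estimated as batch frequencies after thresholding at 0.5, the penalty is a hinge on the gap between the joint probability and the product of marginals, and the coefficient values and pair sets are made up. A hard threshold is not differentiable, so a trainable version would need a soft surrogate.

```python
def au_exists(coeffs, threshold=0.5):
    """Binarize coefficients: an AU 'exists' if its coefficient > threshold."""
    return [1 if c > threshold else 0 for c in coeffs]

def batch_probs(states, i, j):
    """Batch-frequency estimates of p(i), p(j), and p(i, j)."""
    n = len(states)
    p_i = sum(s[i] for s in states) / n
    p_j = sum(s[j] for s in states) / n
    p_ij = sum(s[i] * s[j] for s in states) / n
    return p_i, p_j, p_ij

def dependency_penalty(batch_coeffs, positive_pairs, negative_pairs):
    """Hinge penalty (assumed form): positive pairs should co-occur more
    often than independence predicts, negative pairs less often."""
    states = [au_exists(c) for c in batch_coeffs]
    penalty = 0.0
    for i, j in positive_pairs:
        p_i, p_j, p_ij = batch_probs(states, i, j)
        penalty += max(0.0, p_i * p_j - p_ij)
    for i, j in negative_pairs:
        p_i, p_j, p_ij = batch_probs(states, i, j)
        penalty += max(0.0, p_ij - p_i * p_j)
    return penalty

# Hypothetical coefficients for 4 samples and 3 AUs.
batch = [[0.8, 0.7, 0.1], [0.9, 0.6, 0.2], [0.2, 0.1, 0.9], [0.1, 0.3, 0.8]]
consistent = dependency_penalty(batch, [(0, 1)], [(0, 2)])
violating = dependency_penalty(batch, [(0, 2)], [(0, 1)])
print(consistent, violating)  # 0.0 0.5
```

In the toy batch, AUs 0 and 1 always appear together and AU 2 never appears with AU 0, so treating (0, 1) as a positive pair and (0, 2) as a negative pair incurs no penalty, while swapping the two roles is penalized.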
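During training, we use the cross-entropy loss, denoted by $\mathcal{L}_{cls}$, to train the expression classifier, and thus the overall loss can be defined as

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_2 \left( \mathcal{L}_{pos} + \mathcal{L}_{neg} \right), \tag{12}$$

where $\mathcal{L}_{pos}$ and $\mathcal{L}_{neg}$ are the regularization terms for the positive and negative correlation constraints, and $\lambda_2$ is the balance parameter, which is set to 0.5 in the experiments.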
We select ResNet-101 as the backbone network for feature extraction, which consists of four block layers. Given an input image of size 224 × 224, we obtain feature maps of size 56 × 56 × 256 from the first layer, 28 × 28 × 512 from the second layer, 14 × 14 × 1024 from the third layer, and 7 × 7 × 2048 from the last layer. For the holistic expression feature, we downsample the feature maps from the first, second, and third layers of the backbone to the size of the feature maps from the last layer by max pooling, concatenate these four feature maps, and perform global average pooling to obtain a 3840-dimensional vector. For the AU features, we upsample the feature maps in the inverse direction by deconvolution, starting from the feature maps of the last layer and proceeding to the second layer: at each step, the upsampled feature maps are brought to the size of the previous layer's feature maps, concatenated with that layer's original feature maps, and upsampled again. In the end, we obtain a feature map of size 56 × 56 × 256, from which the crop net crops the feature at the location corresponding to each AU to obtain that AU's feature.
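The dimensions quoted above can be checked with a few lines of Python. The shapes are taken from the text; the variable names are illustrative, and the pooling factors assume that each max-pooling step reduces a map exactly to the 7 × 7 size of the last layer.

```python
# (height, width, channels) of the four backbone stages for a 224x224 input.
stage_shapes = [(56, 56, 256), (28, 28, 512), (14, 14, 1024), (7, 7, 2048)]

target_height = stage_shapes[-1][0]

# Spatial downsampling factor needed to bring each stage to 7x7.
pool_factors = [h // target_height for h, _, _ in stage_shapes]

# Channel-wise concatenation followed by global average pooling gives one
# value per channel, so the holistic feature length is the channel sum.
holistic_dim = sum(c for _, _, c in stage_shapes)

print(pool_factors)  # [8, 4, 2, 1]
print(holistic_dim)  # 3840
```

The channel sum 256 + 512 + 1024 + 2048 = 3840 matches the dimension of the holistic expression feature stated in the text.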
In the crop net, we first use MTCNN to obtain the facial landmarks of the input image and crop the corresponding region for each AU on the feature maps using the code of a previous work. Each cropped region is then passed through a convolutional layer and a fully connected layer, whose parameters are not shared across AUs, to obtain a 512-dimensional vector.
To obtain more stable experimental results, we adopt a three-stage training process. In the first stage, we train the backbone and the expression classifier with the cross-entropy loss using stochastic gradient descent (SGD) with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0. In the second stage, we fix the parameters of the backbone, and train the crop net and the AU classifier with the mean square error loss and the loss defined by formula (3) using SGD with an initial learning rate of 0.001, a momentum of 0.9, and a weight decay of 0. In the third stage, we fix the parameters of the backbone, the crop net, and the AU classifier, and train the expression classifier with the loss defined by formula (12) using SGD with an initial learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.
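The three-stage schedule can be summarized as a small configuration sketch. The hyperparameters come from the text above; the stage names, module names, and the `Stage` class itself are hypothetical and only serve to make explicit which parts are trained and which are frozen in each stage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    """One training stage: which modules are updated and with what SGD setup."""
    name: str
    trainable: tuple
    frozen: tuple
    lr: float
    momentum: float = 0.9
    weight_decay: float = 0.0

# Module names are illustrative labels, not identifiers from released code.
STAGES = (
    Stage("expression pretraining",
          trainable=("backbone", "expression_classifier"),
          frozen=(),
          lr=0.01),
    Stage("AU representation learning",
          trainable=("crop_net", "au_classifier"),
          frozen=("backbone",),
          lr=0.001),
    Stage("knowledge-constrained fusion",
          trainable=("expression_classifier",),
          frozen=("backbone", "crop_net", "au_classifier"),
          lr=0.0001),
)

for stage in STAGES:
    print(f"{stage.name}: train {stage.trainable}, lr={stage.lr}")
```

Each stage lowers the learning rate by a factor of ten and freezes everything learned in the earlier stages, so later stages cannot disturb the backbone or the AU features.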
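Table II (header row only; the data rows are missing from the extracted text):
|Methods|BaseDCNN|Center Loss|DLP-CNN|Ours|
Table IV (ablation results on RAF-DB, accuracy in %; seven per-expression columns followed by the average accuracy):
|Ours w/o KGAURL|88.3|63.5|64.2|92.7|85.7|79.9|87.3|80.2|
|Ours w/o Attention|88.0|67.6|60.1|94.1|84.0|82.5|85.0|80.2|
|Ours w/o AU-Independent|86.1|67.6|65.5|94.2|84.6|79.9|85.9|80.6|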
Existing facial expression datasets can be divided into two main categories according to the environment in which they were collected. For a long period, only datasets of limited size collected under lab-controlled conditions were available. Recently, some comparatively large-scale datasets that reflect real-world scenarios have been released to promote research. Meanwhile, the types of expressions have been expanded with compound expressions, which are constructed by combining basic expressions.
We choose two challenging in-the-wild datasets to evaluate the performance of our method: the Real-world Affective Face Database (RAF-DB) and the Static Facial Expressions in the Wild (SFEW2.0) dataset.
RAF-DB contains 29,672 highly diverse facial images from thousands of individuals, collected from the Internet. With manual crowd-sourced annotation and reliable estimation, seven basic and eleven compound emotion labels are provided for the samples. Specifically, 15,339 images from the basic emotion set are divided into two groups (12,271 training samples and 3,068 testing samples) for evaluation.
SFEW2.0 is an in-the-wild dataset collected from different films, with spontaneous expressions and varied head poses, age ranges, occlusions, and illuminations. This dataset is divided into training, validation, and test sets, with 958, 436, and 372 samples, respectively.
In this subsection, we present the performance comparisons with current state-of-the-art methods to evaluate the superiority of our proposed method.
RAF-DB is a challenging in-the-wild dataset that is widely used for evaluating facial expression recognition. In this part, we compare our method with current state-of-the-art competitors, including Deep Neural Network Augmentation (DCNN-DA), Weakly Supervised Local-Global Relation Network (WSLGRN), Covariance Pooling (CP), Compact Deep Learning Model (CompactDLM), Feature Selection Network (FSN), Deep Locality-Preserving Learning (DLP-CNN), and Multi-Region Ensemble CNN (MRE-CNN).
We first present the accuracy of each basic expression and the average accuracy over all expressions in Table I. As shown, our method achieves the best performance compared with existing competitors, improving the average accuracy from 79.4% to 81.0%. In addition, our method obtains better accuracy for most basic expressions, especially for those that are difficult to recognize. For example, the current best accuracy for the expression “Fear” is 63.5%. Our method improves the accuracy to 68.9%, a relative improvement of 8.50%. One possible reason is that integrating facial AU information captures local discriminative features well and helps distinguish uncertain and ambiguous expressions.
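The relative improvement quoted for “Fear” follows directly from the two accuracies. The short check below reproduces it; the function name is illustrative.

```python
def relative_improvement(old, new):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Accuracies for "Fear" on RAF-DB as reported in the text.
fear_gain = relative_improvement(63.5, 68.9)
print(f"{fear_gain:.2f}%")  # 8.50%

# Absolute gain in average accuracy, in percentage points.
avg_gain = 81.0 - 79.4
print(f"{avg_gain:.1f}")  # 1.6
```

Note the distinction between the two figures: the 8.50% is relative to the previous best accuracy, while the gain in average accuracy from 79.4% to 81.0% is 1.6 percentage points in absolute terms.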
Besides the basic expressions, RAF-DB contains another subset in which each face image is annotated with a compound expression. A compound expression usually combines two or more basic expressions. For example, a person may be happy and simultaneously surprised. Obviously, this is an even more difficult task, as it requires recognizing multiple expression patterns and thus depends more on mining local discriminative features. Here, we compare our method with BaseDCNN, Center Loss, and Deep Locality-Preserving Learning (DLP-CNN). As shown in Table II, our method outperforms current state-of-the-art competitors by a sizable margin, i.e., an improvement of 6.50% in average accuracy.
SFEW2.0 is an even more challenging dataset, and several works have also conducted experiments on it. Here, we compare our proposed method with the following works: Covariance Pooling (CP), Deep Locality-Preserving Learning (DLP-CNN), Identity-Aware Convolutional Neural Network (IA-CNN), and Island Loss (IL).
The accuracy of each basic expression and the average accuracy over all expressions are presented in Table III. Our method obtains an average accuracy of 52.8%, improving on the previous best method by 1.7%. Similar to the results on RAF-DB, existing methods perform extremely poorly on the expressions “Disgust” and “Fear”, e.g., with accuracies of 0.0% and 14.0% for these two expressions. Our method improves these accuracies to 17.4% and 25.5%, respectively. These comparisons again demonstrate the superiority of our proposed method in recognizing ambiguous expressions.
In this subsection, we conduct comprehensive ablation studies to discuss and analyze the contribution of each component and obtain a more thorough understanding of the framework.
The core contribution of the proposed framework is mining useful local AU information to enhance the image representation. To analyze this contribution, we compare the AUE-CRL framework with the ResNet-101 baseline. The experiment is conducted on the RAF-DB dataset and the results are presented in Table IV. As shown, the average accuracy drops from 81.0% to 79.9% when the baseline is used. It is worth noting that, compared to the AUE-CRL framework, the accuracy of the baseline drops significantly on the expression “Disgust”, which suggests that the AUE-CRL framework is effective at distinguishing uncertain and ambiguous expressions.
As discussed above, we leverage the relationship between AUs and expressions to guide the learning of AU representations, which removes the need for costly AU annotations and encourages domain-adaptive representations. To analyze the effect of this relationship, we remove the AU-expression regularization loss and use the AU annotations of the BP4D dataset to train the AU classifiers. We conduct the comparison on the RAF-DB dataset and present the results in Table IV. Although this variant uses ground-truth AU annotations, it performs worse than our knowledge-guided AU representation learning, decreasing the accuracy from 81.0% to 80.2%.
Indeed, the images of BP4D cover merely 41 subjects and they are captured in the constrained lab environment. Thus, representation learned on such a dataset can hardly generalize to other environments. In contrast, our proposed knowledge-guided AU representation learning enables training on the target dataset and tends to learn domain-adaptive AU representation, leading to better performance.
In this work, we introduce a knowledge-constrained attention mechanism to adaptively select useful AUs for enhancing the expression representation. To analyze its contribution, we remove this component, simply perform average pooling over all action units to obtain the AU representation, and concatenate it with the expression feature for expression recognition. We find that the average accuracy drops to 80.2% and that the accuracy drops significantly on the expression “Disgust”, which suggests that the attention mechanism helps mine useful AU information to facilitate expression recognition and plays an important role in distinguishing uncertain and ambiguous expressions.
To better select useful AUs, we introduce prior knowledge of the relationship among AUs as a constraint term during training. Here, we remove this constraint to analyze its contribution. As shown in Table IV, we find the average accuracy is 80.6% on the RAF-DB dataset, which is better than that without the attention mechanism but still worse than our method.
In this paper, we propose an AU-Expression Knowledge Constrained Representation Learning framework that exploits prior knowledge to help mine AU information and thereby improve facial expression recognition. It first leverages the relationships between AUs and expressions to guide the learning of domain-adaptive AU representations without any additional AU annotations. Then, it introduces an attention mechanism to adaptively select useful AU representations under the constraint of the dependencies among AUs. We conduct experiments on two in-the-wild datasets and show that our method outperforms current state-of-the-art competitors by a sizable margin.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8594–8601.
Deep reasoning with knowledge graph for social relationship understanding. In Proc. of International Joint Conference on Artificial Intelligence, pp. 2021–2028.
Joint face detection and alignment using multi-task cascaded convolutional networks. CoRR abs/1604.02878.
Classifier learning with prior probabilities for facial action unit recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5108–5116.
Pose-independent facial action unit intensity regression based on multi-task deep transfer learning. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 872–877.