
AU-Expression Knowledge Constrained Representation Learning for Facial Expression Recognition

12/29/2020
by   Tao Pu, et al.

Automatically recognizing human emotions and expressions is an expected ability for intelligent robots, as it can promote better communication and cooperation with humans. Current deep-learning-based algorithms may achieve impressive performance in some lab-controlled environments, but they often fail to accurately recognize expressions in uncontrolled, in-the-wild situations. Fortunately, facial action units (AUs) describe subtle facial behaviors, and they can help distinguish uncertain and ambiguous expressions. In this work, we explore the correlations among action units and facial expressions, and devise an AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework to learn AU representations without AU annotations and to adaptively use these representations to facilitate facial expression recognition. Specifically, it leverages AU-expression correlations to guide the learning of the AU classifiers, and thus it can obtain AU representations without requiring any AU annotations. Then, it introduces a knowledge-guided attention mechanism that mines useful AU representations under the constraint of AU-expression correlations. In this way, the framework can capture local discriminative and complementary features to enhance the facial representation for facial expression recognition. We conduct experiments on challenging uncontrolled datasets to demonstrate the superiority of the proposed framework over current state-of-the-art methods.


I Introduction

Facial expression recognition (FER) is essential for intelligent robotics because it can help robots understand human emotions and behaviors. Basically, this task aims to classify basic (e.g., happy, angry) or compound (e.g., happy & surprised, sad & angry) expressions based on face appearance for both in-the-lab [26, 34, 27] and in-the-wild environments [22, 10]. Recently, most successful algorithms resort to deep neural networks [36, 15] to learn powerful feature representations that promote FER performance. Despite the impressive progress in lab-controlled environments, the task remains challenging and unsolved due to complex variations in pose, illumination, and age, especially in the uncontrolled environments encountered during natural human-robot interaction.

Fig. 1: Two examples of the correlations between facial expression and action units.

According to the facial action coding system (FACS) [12, 38], facial action units (AUs) encode subtle facial appearances and changes, which have strong correlations with human expressions. For example, as shown in Figure 1, if a face image is detected to have the AUs “cheek raiser” and “lip corner puller”, it tends to be a “happy” face. In contrast, if the detected AUs are “inner brow raiser” and “lip corner depressor”, it is more likely to be a “sad” face. Thus, automatically detecting the AUs and modeling their relationships with expressions is essential to promote FER performance, especially to help distinguish uncertain or ambiguous expressions.

In this work, we aim to mine AU features to enhance face image representation for more robust and accurate expression recognition. To this end, two crucial challenges arise. First, most current FER datasets (e.g., RAF-DB [22], SFEW2.0 [10]) do not have AU annotations, and it is very expensive and labor-intensive to annotate the AUs for these datasets. Thus, how to learn AU features without AU annotations is a key challenge. Second, there exist tens of AUs, and not all AUs are equally important for different expressions. How to adaptively select AU features to enhance the image representation for each expression is another vital problem.

To address these challenges, we exploit the correlations among expressions and AUs to learn AU features in an unsupervised manner and to adaptively select these features for feature enhancement, by developing a novel AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework. Specifically, strong correlations exist between expressions and AUs. We first design a knowledge-guided AU representation learning module that leverages these correlations to convert expression labels into pseudo AU labels and utilizes the pseudo labels to train the AU classifiers and obtain the AU features. As suggested in previous work, different AUs also exhibit obvious co-occurrence dependencies, and these dependencies are important for selecting useful AUs. With the AU features, we further introduce a knowledge-guided attention module that learns to adaptively mine the most relevant AU features under the constraint of AU dependencies. In this way, the framework can automatically discover useful local facial behaviors to facilitate FER.

In summary, the contributions of this work are three-fold. First, we design a novel AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework that exploits prior knowledge of AU-expression correlations to automatically discover useful AU features for feature enhancement in facial expression recognition. Second, we propose to leverage the AU-expression correlations to guide learning AU features without AU annotations; thus, the framework can be easily generalized to current FER datasets. Finally, we conduct extensive experiments on several large-scale in-the-wild datasets to demonstrate the effectiveness of the proposed framework, and carry out ablation studies to analyze the actual contribution of each key component.

Fig. 2: An illustration of the proposed framework. We exploit the relationships between AUs and expressions to guide learning AU representations in an unsupervised manner, and the dependencies among AUs to constrain the attention mechanism so that it better selects useful AU representations for facial expression recognition. To keep the illustration concise, we only show part of the relationships and dependencies.

II Related Work

In this section, we review the works most related to ours on facial expression recognition and facial action unit detection.

II-A Facial Expression Recognition

Previous works on facial expression recognition mainly focused on the basic categories (e.g., happy, angry, etc.), in which the data were collected by asking volunteers to make specific expressions in constrained lab environments [39, 9]. Traditional methods primarily designed hand-crafted features (e.g., Local Binary Pattern (LBP) [39], Bag of Words (BoW) [16], and Histogram of Oriented Gradients (HoG) [9]). These methods can achieve satisfactory performance in such constrained environments. Recently, various large-scale datasets have emerged in which the data were captured in real-world scenarios [21, 32, 25]. Compared with the previous in-the-lab settings, these datasets are even more challenging due to larger variations in pose, illumination, etc. Previous hand-crafted features could hardly capture such variations and thus performed quite poorly on these datasets. To address this issue, recent works resorted to deep convolutional networks [36, 15] to learn more powerful feature representations for expression recognition. For example, Liu et al. [25] proposed a Boosted Deep Belief Network to learn features that characterize expression-related facial appearance/shape changes. Mollahosseini et al. [31] designed deeper convolutional networks built on the inception module [37] to extract more discriminative features for recognition. Liu et al. [24] exploited 3D Convolutional Neural Networks trained with deformable action part constraints to learn discriminative part-based representations. Yu et al. [43] ensembled multiple deep Convolutional Neural Networks to further improve recognition accuracy. Different from these works, our method introduces the relationships between AUs and expressions, as well as the dependencies among AUs, to mine AU information that helps capture local facial behaviors and facilitates facial expression recognition.

The proposed method is also related to recent works that exploit prior knowledge to facilitate visual reasoning [42, 6, 4, 7, 8, 41]. For example, Chen et al. introduce the co-occurrence correlations among different categories to help better recognize multiple semantic objects [7, 4]. Generally, these methods represent prior knowledge in the form of graphs and introduce graph neural networks [23, 18] for message propagation to learn contextualized feature representations. Different from these works, we introduce the relationships between AUs and expressions, and the dependencies among AUs, to guide generating pseudo AU labels for AU representation learning and to select useful AUs for facial expression recognition.

II-B Action Unit Recognition

Facial action units are defined to describe facial muscle movements according to [33], and detecting action units is helpful for expression and emotion understanding. Previous works [35, 28] leveraged traditional shallow models (e.g., SVM and SVR) to solve this task. For example, Mahoor et al. [28] projected high-dimensional facial images into a low-dimensional space via the spectral regression technique and adopted an SVM classifier to predict the AU intensity. Inspired by recent advances of deep neural networks on vision tasks, current works [2, 49, 40, 13] also design deep models for action unit recognition. Gudi et al. [13] designed a simple seven-layer network to estimate both occurrences and intensities of the AUs. Furthermore, Walecki et al. [40] adopted the conditional random field (CRF) to encode AU dependencies and combined it with deep neural networks to improve recognition.

III AUE-CRL Framework

III-A Overview

The proposed AUE-CRL framework mines useful local AU information to enhance image representation learning under the guidance of the correlations among AUs and expressions. It mainly consists of three modules, i.e., the feature extractor, the knowledge-guided AU representation learning (KGAURL) module, and the knowledge-constrained AU selection (KCAUS) module. Given an input image $I$, the feature extractor generates multi-layer feature maps, then fuses these feature maps and performs global average pooling to obtain a global expression feature vector $f^{e}$. The feature extractor also fuses the feature maps from multiple layers inversely to obtain feature maps of relatively large size and leverages a crop network to extract a feature for each AU, i.e., $\{f^{au}_{1}, f^{au}_{2}, \dots, f^{au}_{n}\}$. Here, $f^{au}_{i}$ is the feature vector of the $i$-th AU and $n$ is the AU number. Then, the KGAURL module converts the expression labels into pseudo AU labels based on the AU-expression correlations. The pseudo labels are then used to supervise AU classifier training, so the framework can learn AU features without any additional AU annotations. Finally, the KCAUS module exploits an attention mechanism to automatically discover useful AU features under the constraint of AU dependencies and aggregates these features with the global expression feature vector for expression prediction. The overall framework is illustrated in Figure 2.

III-B Knowledge-Guided AU Representation Learning

Current algorithms [21, 13] mainly resort to deep neural networks [15, 36, 5] to learn AU representations, but they require a large amount of ground truth annotations to ensure discriminative power and generalization ability. However, current datasets with AU annotations are mainly captured in constrained in-the-lab environments and cover very few subjects, e.g., 27 and 41 subjects in the DISFA [29] and BP4D [46] datasets, respectively. Features trained on these datasets can hardly generalize to other environments and subjects, especially to in-the-wild settings. On the other hand, current FER datasets lack AU annotations, and it is expensive and impractical to add AU annotations to these datasets. In this work, we design a knowledge-guided module that exploits the correlations between AUs and expressions to generate pseudo AU labels, and thus the proposed framework can learn AU representations without incurring additional annotations.

As suggested in the previous literature [33], each expression is relevant to several AUs, and the relevance can be further divided into primary and secondary relations. More concretely, if a face image is annotated with a specific expression, it tends to have the corresponding primary AUs with high probabilities, secondary AUs with middle probabilities, and other AUs with low probabilities. According to these relationships, we can build a correlation matrix $M \in \mathbb{R}^{m \times n}$, where $m$ and $n$ denote the expression and AU numbers, respectively. The value $M_{ij}$ denotes the prior relevance probability between expression $i$ and AU $j$. It is assigned a large value if they are primarily relevant, a middle value if they are secondarily relevant, and a small value otherwise. Given a sample with expression annotation $y^{e}$, it is intuitive to produce the pseudo AU labels by $y^{au} = y^{e} M$. However, the matrix $M$ merely defines three levels of correlation, which is rather coarse. It is desirable to exploit finer-grained correlations so as to obtain more precise pseudo AU labels. In this work, we use a learnable correlation matrix $\hat{M}$ to generate the pseudo AU labels by

$y^{au} = y^{e} \hat{M}.$   (1)

We apply a simple linear function to map the AU features to the predicted AU labels

$p^{au}_{i} = w_{i}^{\top} f^{au}_{i},$   (2)

where $w_{i}$ is a learnable weight vector. During training, we define a mean square error loss between the pseudo and the predicted labels, and a regularization loss between the learned and prior correlation matrices, formulated as

$\mathcal{L}_{au} = \sum_{i=1}^{n} \left( p^{au}_{i} - y^{au}_{i} \right)^{2} + \lambda_{1} \left\| \hat{M} - M \right\|_{F}^{2}.$   (3)

In this way, we can learn finer-grained correlations and simultaneously exploit the prior correlations to generate more precise pseudo AU labels. $\lambda_{1}$ is a balance parameter and is set to 1.0 in the experiments.
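As a concrete illustration, a minimal PyTorch-style sketch of this module is given below. It assumes one-hot (or multi-hot) expression labels and a prior matrix of shape [num_expressions, num_aus]; the class and variable names are our own, and the sketch follows the reconstructed Eq. (1)-(3) rather than the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KGAURL(nn.Module):
    """Knowledge-guided AU representation learning (illustrative sketch)."""
    def __init__(self, prior_M, au_dim, lambda1=1.0):
        super().__init__()
        self.register_buffer("prior_M", prior_M)            # [E, N] prior AU-expression correlations
        self.M_hat = nn.Parameter(prior_M.clone())           # learnable correlation matrix, Eq. (1)
        self.au_weights = nn.Parameter(torch.randn(prior_M.size(1), au_dim) * 0.01)  # w_i, Eq. (2)
        self.lambda1 = lambda1

    def forward(self, f_au, expr_onehot):
        # f_au: [B, N, au_dim] per-AU features, expr_onehot: [B, E] expression labels
        pseudo = expr_onehot @ self.M_hat                     # pseudo AU labels, Eq. (1)
        pred = (f_au * self.au_weights).sum(-1)               # predicted AU labels, Eq. (2)
        loss = F.mse_loss(pred, pseudo) \
             + self.lambda1 * (self.M_hat - self.prior_M).pow(2).sum()  # Eq. (3)
        return loss, pred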

III-C Knowledge-Constrained AU Selection

In this subsection, we introduce the knowledge-constrained attention mechanism that learns to adaptively select useful AU features and fuses them to enhance the image representation. It computes a correlation coefficient for each AU, performs a weighted average to obtain a merged AU feature, and concatenates it with the expression feature to obtain the final image representation.

Specifically, we first fuse the expression feature $f^{e}$ with each AU feature $f^{au}_{i}$ using low-rank bilinear pooling [17] to compute a coefficient

$\tilde{a}_{i} = P^{\top} \tanh\left( (U^{\top} f^{e}) \odot (V^{\top} f^{au}_{i}) \right)$   (4)

that denotes the importance of AU $i$ for expression recognition. Here, we use low-rank bilinear pooling [17] as it is effective for feature fusion. In the equation, $\tanh(\cdot)$ is the hyperbolic tangent function and $\odot$ is the element-wise product operation. $P$, $U$, and $V$ are learnable parameter matrices. To make the coefficients easily comparable across different samples, we normalize the coefficients over all AUs using a softmax function

$a_{i} = \frac{\exp(\tilde{a}_{i})}{\sum_{j=1}^{n} \exp(\tilde{a}_{j})}.$   (5)

Then, we perform a weighted average over all AUs to obtain the merged AU feature

$f^{au} = \sum_{i=1}^{n} a_{i} f^{au}_{i}.$   (6)

Finally, we concatenate $f^{au}$ with $f^{e}$ for expression prediction

$f = [f^{e}; f^{au}].$   (7)
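The attention step can be sketched as below, following the reconstructed Eq. (4)-(7); the hidden dimension and the class/variable names are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class KCAUSAttention(nn.Module):
    """Knowledge-constrained AU selection via low-rank bilinear pooling (illustrative sketch)."""
    def __init__(self, expr_dim, au_dim, hidden_dim=256):
        super().__init__()
        self.U = nn.Linear(expr_dim, hidden_dim, bias=False)   # plays the role of U in Eq. (4)
        self.V = nn.Linear(au_dim, hidden_dim, bias=False)     # plays the role of V in Eq. (4)
        self.P = nn.Linear(hidden_dim, 1, bias=False)          # plays the role of P in Eq. (4)

    def forward(self, f_e, f_au):
        # f_e: [B, expr_dim] global expression feature, f_au: [B, N, au_dim] per-AU features
        joint = torch.tanh(self.U(f_e).unsqueeze(1) * self.V(f_au))  # [B, N, hidden], Eq. (4)
        a_tilde = self.P(joint).squeeze(-1)                          # [B, N] raw coefficients
        a = torch.softmax(a_tilde, dim=-1)                           # Eq. (5)
        f_au_merged = (a.unsqueeze(-1) * f_au).sum(dim=1)            # [B, au_dim], Eq. (6)
        f = torch.cat([f_e, f_au_merged], dim=-1)                    # Eq. (7)
        return f, a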

III-D Knowledge-Regularized Training Loss

As suggested in the previous literature [12], there exist strong co-occurrence dependencies among different AUs. In other words, some AUs co-occur frequently while others are mutually exclusive. For example, the AU inner brow raiser always co-occurs with outer brow raiser, as they are both controlled by the occipitofrontalis muscle group. In contrast, the AU lip corner puller hardly co-exists with lip corner depressor, as the corresponding muscle groups cannot be activated together. It is natural to expect the learned attentional coefficients to match such dependencies, and thus we introduce these prior dependencies as a regularization term during training.

Inspired by previous work [47], we consider the pair-wise dependencies, which include positive and negative correlations, to define the regularization term. Here, we consider an AU to exist if its attention coefficient is higher than 0.5; we denote this as $z_{i} = 1$ if it exists and $z_{i} = 0$ otherwise. For AUs $i$ and $j$ with positive correlation, it is expected that

$z_{i} = z_{j},$   (8)

which can be transformed to the equivalent formulation

$(a_{i} - 0.5)(a_{j} - 0.5) \geq 0.$   (9)

Accordingly, we can define the regularization term that constrains the positive correlations as

$\mathcal{L}_{pos} = \sum_{(i,j) \in \mathcal{P}} \max\left(0, -(a_{i} - 0.5)(a_{j} - 0.5)\right),$   (10)

where $\mathcal{P}$ is the set of positive AU pairs. Similarly, the regularization term for the negative correlation constraint, which penalizes the case where two mutually exclusive AUs are both selected, can be defined as

$\mathcal{L}_{neg} = \sum_{(i,j) \in \mathcal{N}} \max\left(0, a_{i} - 0.5\right) \max\left(0, a_{j} - 0.5\right),$   (11)

where $\mathcal{N}$ is the set of negative AU pairs.

During training, we use the cross entropy loss, denoted by $\mathcal{L}_{cls}$, to train the expression classifier, and thus the overall loss can be defined as

$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{2} \left( \mathcal{L}_{pos} + \mathcal{L}_{neg} \right),$   (12)

where $\lambda_{2}$ is a balance parameter and is set to 0.5 in the experiments.
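A minimal sketch of these regularization terms is given below; it follows the hinge-style reading of Eq. (10)-(12) above, so the exact functional form may differ from the authors' code, and pos_pairs/neg_pairs are assumed lists of AU index pairs derived from the prior knowledge.

import torch

def knowledge_regularizer(a, pos_pairs, neg_pairs, thresh=0.5):
    # a: [B, N] attention coefficients over AUs; pairs are lists of (i, j) AU indices.
    l_pos = sum(torch.relu(-(a[:, i] - thresh) * (a[:, j] - thresh)).mean()
                for i, j in pos_pairs)             # Eq. (10): positively correlated AUs should agree
    l_neg = sum((torch.relu(a[:, i] - thresh) * torch.relu(a[:, j] - thresh)).mean()
                for i, j in neg_pairs)             # Eq. (11): mutually exclusive AUs should not both fire
    return l_pos, l_neg

def total_loss(ce_loss, l_pos, l_neg, lambda2=0.5):
    # Eq. (12): expression cross-entropy plus the weighted knowledge regularizer.
    return ce_loss + lambda2 * (l_pos + l_neg)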

III-E Implementation Details

III-E1 Network Architecture

We select ResNet-101 [14] as the backbone network for feature extraction, which consists of four block layers. Given an input image of size 224×224, we obtain feature maps of size 56×56×256 from the first layer, 28×28×512 from the second layer, 14×14×1024 from the third layer, and 7×7×2048 from the last layer. For the holistic expression feature, we downsample the feature maps from the first, second, and third layers of the backbone to the spatial size of the last-layer feature maps by max pooling, concatenate these four feature maps, and perform global average pooling to obtain a 3840-dimensional vector. For the AU features, we inversely upsample the feature maps to the spatial size of the previous layer by deconvolution, starting from the last layer down to the second layer, concatenating each upsampled feature map with the original feature maps of that layer and upsampling again by deconvolution. In the end, we obtain a feature map of size 56×56×256, from which the crop net crops the features at the corresponding locations to obtain the AU features.

In the crop net, we first use MTCNN [45] to obtain the facial landmarks of the input image and crop the corresponding region for each AU on the feature maps using the code of [21]; each region is then passed through a convolutional layer and a fully connected layer, whose parameters are not shared across AUs, to obtain a 512-dimensional vector.
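The multi-scale fusion that yields the 3840-dimensional global expression feature can be sketched as follows; adaptive max/average pooling is used here as a stand-in for the max-pool downsampling and global average pooling described above, and the function name and shapes are illustrative.

import torch
import torch.nn.functional as F

def global_expression_feature(c1, c2, c3, c4):
    """Fuse the four ResNet-101 stage outputs into a 3840-d vector (illustrative sketch).

    c1: [B, 256, 56, 56], c2: [B, 512, 28, 28], c3: [B, 1024, 14, 14], c4: [B, 2048, 7, 7].
    """
    target = c4.shape[-2:]                                   # 7x7, spatial size of the last stage
    pooled = [F.adaptive_max_pool2d(c, target) for c in (c1, c2, c3)] + [c4]
    fused = torch.cat(pooled, dim=1)                         # [B, 256+512+1024+2048=3840, 7, 7]
    return F.adaptive_avg_pool2d(fused, 1).flatten(1)        # [B, 3840] after global average pooling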

III-E2 Training Details

To obtain more stable experimental results, we adopt a three-stage training process. In the first stage, we train the backbone and the expression classifier with the cross-entropy loss using stochastic gradient descent (SGD) with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0. In the second stage, we fix the parameters of the backbone and train the crop net and the AU classifiers with the mean square error and regularization loss defined in formula (3), using SGD with an initial learning rate of 0.001, a momentum of 0.9, and a weight decay of 0. In the third stage, we fix the parameters of the backbone, crop net, and AU classifiers, and train the expression classifier with the loss defined in formula (12), using SGD with an initial learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.
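The stage-wise optimizer settings can be summarized with small helpers; this is only a sketch of the schedule described above, and make_sgd and freeze are hypothetical helper names.

import torch

def make_sgd(params, stage):
    # SGD settings for the three training stages described above.
    lr = {1: 0.01, 2: 0.001, 3: 0.0001}[stage]
    return torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=0.0)

def freeze(module):
    # Fix the parameters of already-trained parts between stages.
    for p in module.parameters():
        p.requires_grad = False

# Stage 1: backbone + expression classifier with cross-entropy.
# Stage 2: freeze the backbone; train crop net + AU classifiers with the Eq. (3) loss.
# Stage 3: freeze backbone, crop net, and AU classifiers; train the expression classifier with the Eq. (12) loss.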

Methods Angry Disgust Fear Happy Neutral Sad Surprised Ave. acc
DCNN-DA [19] 78.4 64.4 62.2 91.1 80.6 81.2 84.5 77.5
WSLGRN [44] 75.3 56.9 63.5 93.8 85.4 83.5 85.4 77.7
CP [1] 80.0 61.0 61.0 93.0 89.0 86.0 86.0 79.4
CompactDLM [20] 74.5 67.6 46.9 82.3 59.1 58.0 84.6 67.6
FSN [48] 72.8 46.9 56.8 90.5 76.9 81.6 81.8 72.5
DLP-CNN [22] 71.6 52.2 62.2 92.8 80.3 80.1 81.2 74.2
MRE-CNN [11] 84.0 57.5 60.8 88.8 80.2 79.9 86.0 76.7
Ours 80.5 67.6 68.9 94.1 85.8 83.6 86.4 81.0
TABLE I: Performance of our proposed method and current state-of-the-art competitors for recognizing the basic expressions on the RAF-DB dataset.
Methods BaseDCNN [22] Center Loss [22] DLP-CNN [22] Ours
Ave. acc 40.2 40.0 44.6 51.1
TABLE II: Performance of our proposed method and current state-of-the-art competitors for recognizing the compound expressions on the RAF-DB dataset.
Methods Angry Disgust Fear Happy Neutral Sad Surprised Ave. acc
CP [1] 66.0 0.0 14.0 90.0 86.0 66.0 29.0 50.1
DLP-CNN [22] - - - - - - - 51.1
IA-CNN [30] 70.7 0.0 8.9 70.4 60.3 58.8 28.9 42.6
IL [3] 61.0 0.0 6.4 89.0 66.2 48.0 33.3 43.4
Ours 75.3 17.4 25.5 86.3 72.1 50.7 42.1 52.8
TABLE III: Performance of our proposed method and current state-of-the-art competitors for recognizing the basic expressions on the SFEW2.0 dataset. “-” denotes that the corresponding result is not provided.
Methods Surprised Fear Disgust Happy Sad Angry Neutral Ave. acc
Baseline 85.5 68.9 60.1 94.1 85.3 81.8 83.3 79.9
Ours w/o KGAURL 88.3 63.5 64.2 92.7 85.7 79.9 87.3 80.2
Ours w/o Attention 88.0 67.6 60.1 94.1 84.0 82.5 85.0 80.2
Ours w/o AU-Independent 86.1 67.6 65.5 94.2 84.6 79.9 85.9 80.6
Ours 86.4 68.9 67.6 94.1 83.6 80.5 85.8 81.0
TABLE IV: Performance of our method (Ours), our method without knowledge-guided AU representation learning (Ours w/o KGAURL), our method without the attention mechanism for AU representation selection (Ours w/o Attention), our method without the AU dependency constraint (Ours w/o AU-Independent), and the ResNet-101 baseline (Baseline) on the RAF-DB dataset.

IV Experiments

IV-A Datasets

The existing facial expression datasets can be divided into two main categories according to their collection environments. For quite a long period, there only existed datasets of limited size collected under lab-controlled conditions. Recently, some comparatively large-scale datasets that reflect real-world scenarios have been released to promote the research. Meanwhile, the set of expression types has been expanded with compound expressions, which are constructed by combining basic expressions.

We chose two challenging in-the-wild datasets to evaluate the performance of our method: the Real-world Affective Face Database (RAF-DB) [22] and the Static Facial Expressions in the Wild (SFEW2.0) [10] datasets.

RAF-DB [22] contains 29,672 highly diverse facial images of thousands of individuals collected from the Internet. With manual crowd-sourced annotation and reliable estimation, seven basic and eleven compound emotion labels are provided for the samples. Specifically, 15,339 images from the basic emotion set are divided into two groups (12,271 training samples and 3,068 testing samples) for evaluation.

SFEW2.0 [10] is an in-the-wild dataset collected from different films with spontaneous expressions, various head poses, age ranges, occlusions and illuminations. This dataset is divided into training, validation, and test sets, with 958, 436, and 372 samples, respectively.

IV-B Comparison with State-of-the-Art Methods

In this subsection, we present the performance comparisons with current state-of-the-art methods to evaluate the superiority of our proposed method.

IV-B1 Performance on RAF-DB

RAF-DB is a challenging in-the-wild dataset that is widely used for evaluating facial expression recognition. In this part, we compare our method with current state-of-the-art competitors, including Deep Neural Network Augmentation (DCNN-DA) [19], Weakly Supervised Local-Global Relation Network (WSLGRN) [44], Covariance Pooling (CP) [1], Compact Deep Learning Model (CompactDLM) [20], Feature Selection Network (FSN) [48], Deep Locality-Preserving Learning (DLP-CNN) [22], and Multi-Region Ensemble CNN (MRE-CNN) [11].

We first present the accuracy of each basic expression and the average accuracy over all expressions in Table I. As shown, our method achieves the best performance compared with the existing competitors, improving the average accuracy from 79.4% to 81.0%. In addition, our method obtains better accuracy for most basic expressions, especially for those that are difficult to recognize. For example, the current best accuracy for the expression “Fear” is 63.5%; our method improves it to 68.9%, a relative improvement of 8.50%. One possible reason is that integrating facial AU information captures local discriminative features well and helps distinguish uncertain and ambiguous expressions.

Besides the basic expressions, RAF-DB contains another subset in which each face image is annotated with a compound expression. A compound expression usually contains two or more basic expressions; for example, a person may be happy and simultaneously surprised. Obviously, this is an even more difficult task, as it requires recognizing multiple expression patterns and thus depends more on local discriminative feature mining. Here, we compare our method with BaseDCNN [22], Center Loss [22], and Deep Locality-Preserving Learning (DLP-CNN) [22]. As shown in Table II, our method outperforms current state-of-the-art competitors by a sizable margin, i.e., an improvement of 6.50% in average accuracy.

IV-B2 Performance on SFEW2.0

SFEW2.0 is an even more challenging dataset, and several works have conducted experiments on it. Here, we compare our proposed method with the following works: Covariance Pooling (CP) [1], Deep Locality-Preserving Learning (DLP-CNN) [22], Identity-Aware Convolutional Neural Network (IA-CNN) [30], and Island Loss (IL) [3].

The accuracy of each basic expression and the average accuracy over all expressions are presented in Table III. Our method obtains an average accuracy of 52.8%, improving on the previous best method by 1.7%. Similar to the results on RAF-DB, existing methods perform extremely poorly for the expressions “Disgust” and “Fear”, with best accuracies of 0.0% and 14.0% for these two expressions. Our method improves these accuracies to 17.4% and 25.5%, respectively. These comparisons again demonstrate the superiority of our proposed method for ambiguous expression recognition.

IV-C Ablation Study

In this subsection, we conduct comprehensive ablation studies to discuss and analyze the contribution of each component and obtain a more thorough understanding of the framework.

IV-C1 Analysis of the AUE-CRL Framework

The core contribution of the proposed framework is mining useful local AU information to enhance the image representation. To analyze this contribution, we compare the AUE-CRL framework with the ResNet-101 baseline. The experiment is conducted on the RAF-DB dataset, and the results are presented in Table IV. As shown, without our framework the average accuracy drops from 81.0% to 79.9%. It is worth noting that, compared to the AUE-CRL framework, the accuracy of the baseline drops significantly on the expression “Disgust”, which shows that the AUE-CRL framework is effective in distinguishing uncertain and ambiguous expressions.

IV-C2 Analysis of Knowledge-Guided AU Representation Learning

As described above, we leverage the relationship between AUs and expressions to guide learning AU representations, which avoids heavy AU annotation and encourages domain-adaptive representations. To analyze the effect of this relationship, we remove the AU-expression regularization loss and instead use the AU annotations of the BP4D dataset [46] to train the AU classifiers. We conduct the comparison on the RAF-DB dataset and present the results in Table IV. Although this variant adopts ground truth AU annotations, it performs worse than our knowledge-guided AU representation learning, decreasing the accuracy from 81.0% to 80.2%.

Indeed, the images of BP4D cover merely 41 subjects and they are captured in the constrained lab environment. Thus, representation learned on such a dataset can hardly generalize to other environments. In contrast, our proposed knowledge-guided AU representation learning enables training on the target dataset and tends to learn domain-adaptive AU representation, leading to better performance.

IV-C3 Analysis of Knowledge-Constrained AU Selection

In this work, we introduce a knowledge-constrained attention mechanism to adaptively select useful AUs for expression representation enhancement. To analyze its contribution, we remove this component and simply perform average pooling over all action units to obtain the AU representation, which is concatenated with the expression feature for expression recognition. We find that the average accuracy drops to 80.2% and the accuracy drops significantly on the expression “Disgust”, which suggests that the attention mechanism helps mine useful AU information to facilitate expression recognition and plays a great role in distinguishing uncertain and ambiguous expressions.

To better select useful AUs, we introduce prior knowledge of the relationships among AUs as a constraint term during training. Here, we remove this constraint to analyze its contribution. As shown in Table IV, the average accuracy is 80.6% on the RAF-DB dataset, which is better than the variant without the attention mechanism but still worse than our full method.

V Conclusion

In this paper, we propose an AU-Expression Knowledge Constrained Representation Learning framework that exploits prior knowledge to help mine AU information and promote facial expression recognition. It first leverages the relationships between AUs and expressions to guide learning domain-adaptive AU representations without any additional AU annotations. Then, it introduces an attention mechanism to adaptively select useful AU representations under the constraint of the dependencies among AUs. We conduct experiments on two in-the-wild datasets and show that our method outperforms current state-of-the-art competitors by a sizable margin.

References

  • [1] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool (2018-06) Covariance pooling for facial expression recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Cited by: TABLE I, TABLE III, §IV-B1, §IV-B2.
  • [2] J. C. Batista, V. Albiero, O. R. Bellon, and L. Silva (2017) Aumpnet: simultaneous action units detection and intensity estimation on multipose facial images using a single convolutional neural network. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 866–871. Cited by: §II-B.
  • [3] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O’Reilly, and Y. Tong (2018) Island loss for learning discriminative features in facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 302–309. Cited by: TABLE III, §IV-B2.
  • [4] T. Chen, L. Lin, X. Hui, R. Chen, and H. Wu (2020) Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [5] T. Chen, L. Lin, X. Wu, N. Xiao, and X. Luo (2018) Learning to segment object candidates via recursive neural networks. IEEE Transactions on Image Processing 27 (12), pp. 5827–5839. Cited by: §III-B.
  • [6] T. Chen, T. Pu, Y. Xie, H. Wu, L. Liu, and L. Lin (2020) Cross-domain facial expression recognition: a unified evaluation benchmark and adversarial graph learning. Arxiv. Cited by: §II-A.
  • [7] T. Chen, M. Xu, X. Hui, H. Wu, and L. Lin (2019) Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 522–531. Cited by: §II-A.
  • [8] T. Chen, W. Yu, R. Chen, and L. Lin (2019) Knowledge-embedded routing network for scene graph generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6163–6171. Cited by: §II-A.
  • [9] M. Dahmane and J. Meunier (2011-03) Emotion recognition using dynamic grid-based hog features. In Face and Gesture 2011, Vol. , pp. 884–888. External Links: Document Cited by: §II-A.
  • [10] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon (2011) Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark. In IEEE International Conference on Computer Vision Workshops, Cited by: §I, §I, §IV-A, §IV-A.
  • [11] Y. Fan, J. C. Lam, and V. O. Li (2018) Multi-region ensemble convolutional neural network for facial expression recognition. In International Conference on Artificial Neural Networks, pp. 84–94. Cited by: TABLE I, §IV-B1.
  • [12] E. Friesen and P. Ekman (1978) Facial action coding system: a technique for the measurement of facial movement. Palo Alto 3. Cited by: §I, §III-D.
  • [13] A. Gudi, H. E. Tasli, T. M. Den Uyl, and A. Maroulis (2015) Deep learning based facs action unit occurrence and intensity estimation. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 6, pp. 1–5. Cited by: §II-B, §III-B.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 770–778. Cited by: §III-E1.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §II-A, §III-B.
  • [16] Y. Hu, Z. Zeng, L. Yin, X. Wei, X. Zhou, and T. S. Huang (2008-Sep.) Multi-view facial expression recognition. In 2008 8th IEEE International Conference on Automatic Face Gesture Recognition, Vol. , pp. 1–6. External Links: Document, ISSN Cited by: §II-A.
  • [17] J. H. Kim, K. W. On, J. Kim, J. W. Ha, and B. T. Zhang (2016) Hadamard product for low-rank bilinear pooling. Cited by: §III-C.
  • [18] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §II-A.
  • [19] D. Kollias, S. Cheng, E. Ververas, I. Kotsia, and S. Zafeiriou (2020-02) Deep neural network augmentation: generating faces for affect analysis. International Journal of Computer Vision, pp. . External Links: Document Cited by: TABLE I, §IV-B1.
  • [20] C. Kuo, S. Lai, and M. Sarkis (2018) A compact deep learning model for robust facial expression recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vol. , pp. 2202–22028. Cited by: TABLE I, §IV-B1.
  • [21] G. Li, X. Zhu, Y. Zeng, Q. Wang, and L. Lin (2019) Semantic relationships guided representation learning for facial action unit recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8594–8601. Cited by: §II-A, §III-B, §III-E1.
  • [22] S. Li, W. Deng, and J. Du (2017-07) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §I, TABLE I, TABLE II, TABLE III, §IV-A, §IV-A, §IV-B1, §IV-B1, §IV-B2.
  • [23] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel (2016) Gated graph sequence neural networks. In International Conference on Learning Representations, Cited by: §II-A.
  • [24] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen (2015) Deeply learning deformable facial action parts model for dynamic expression analysis. D. Cremers, I. Reid, H. Saito, and M. Yang (Eds.), pp. 143–157. Cited by: §II-A.
  • [25] P. Liu, S. Han, Z. Meng, and Y. Tong (2014) Facial expression recognition via a boosted deep belief network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1805–1812. Cited by: §II-A.
  • [26] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 94–101. Cited by: §I.
  • [27] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba (1998) Coding facial expressions with gabor wavelets. In Proceedings Third IEEE international conference on automatic face and gesture recognition, pp. 200–205. Cited by: §I.
  • [28] M. H. Mahoor, S. Cadavid, D. S. Messinger, and J. F. Cohn (2009) A framework for automated measurement of the intensity of non-posed facial action units. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 74–80. Cited by: §II-B.
  • [29] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn (2013) Disfa: a spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4 (2), pp. 151–160. Cited by: §III-B.
  • [30] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong (2017) Identity-aware convolutional neural network for facial expression recognition. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 558–565. Cited by: TABLE III, §IV-B2.
  • [31] A. Mollahosseini, D. Chan, and M. H. Mahoor (2016) Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE Winter conference on applications of computer vision (WACV), pp. 1–10. Cited by: §II-A.
  • [32] A. Mollahosseini, B. Hasani, and M. H. Mahoor (2017) Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing. Cited by: §II-A.
  • [33] P. Ekman and W. Friesen (1978) Facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press. Cited by: §II-B, §III-B.
  • [34] M. Pantic, M. Valstar, R. Rademaker, and L. Maat (2005) Web-based database for facial expression analysis. In 2005 IEEE international conference on multimedia and Expo, pp. 5–pp. Cited by: §I.
  • [35] A. Savran, B. Sankur, and M. T. Bilge (2012) Regression-based intensity estimation of facial action units. Image and Vision Computing 30 (10), pp. 774–784. Cited by: §II-B.
  • [36] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I, §II-A, §III-B.
  • [37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015-06) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [38] Y. Tian, T. Kanade, and J. F. Cohn (2001) Recognizing action units for facial expression analysis. IEEE Transactions on pattern analysis and machine intelligence 23 (2), pp. 97–115. Cited by: §I.
  • [39] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer (2012-08) Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42 (4), pp. 966–979. Cited by: §II-A.
  • [40] R. Walecki, V. Pavlovic, B. Schuller, M. Pantic, et al. (2017) Deep structured learning for facial action unit intensity estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3405–3414. Cited by: §II-B.
  • [41] Z. Wang, T. Chen, J. Ren, W. Yu, H. Cheng, and L. Lin (2018) Deep reasoning with knowledge graph for social relationship understanding. In Proc. of International Joint Conference on Artificial Intelligence, pp. 2021–2028. Cited by: §II-A.
  • [42] Y. Xie, T. Chen, T. Pu, H. Wu, and L. Lin (2020) Adversarial graph representation adaptation for cross-domain facial expression recognition. In Proceedings of the 28th ACM international conference on Multimedia, pp. 1255–1264. Cited by: §II-A.
  • [43] Z. Yu and C. Zhang (2015) Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI ’15, pp. 435–442. Cited by: §II-A.
  • [44] H. Zhang, W. Su, J. Yu, and Z. Wang (2020) Weakly supervised local-global relation network for facial expression recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 1040–1046. Cited by: TABLE I, §IV-B1.
  • [45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multi-task cascaded convolutional networks. CoRR abs/1604.02878. Cited by: §III-E1.
  • [46] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard (2014) Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing 32 (10), pp. 692–706. Cited by: §III-B, §IV-C2.
  • [47] Y. Zhang, W. Dong, B. Hu, and Q. Ji (2018) Classifier learning with prior probabilities for facial action unit recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5108–5116. Cited by: §III-D.
  • [48] S. Zhao, H. Cai, H. Liu, J. Zhang, and S. Chen (2018) Feature selection mechanism in cnns for facial expression recognition.. In BMVC, pp. 317. Cited by: TABLE I, §IV-B1.
  • [49] Y. Zhou, J. Pi, and B. E. Shi (2017) Pose-independent facial action unit intensity regression based on multi-task deep transfer learning. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 872–877. Cited by: §II-B.