In recent years, research on facial Action Units (AUs), a comprehensive description of facial movements, has attracted increasing attention in human-computer interaction and affective computing. Facial AU detection benefits facial expression recognition and analysis. Statistical analysis and facial anatomy both indicate that strong relationships exist among different AUs under different facial expressions; e.g., happiness might be the combination of AU12 (Lip Corner Puller) and AU13 (Cheek Puffer).
Most existing AU detection methods model AU relationships only implicitly. For example, probabilistic graphical models, including Bayesian Networks and Dynamic Bayesian Networks [23], have demonstrated their effectiveness in relation modeling for AU detection. However, these generative models are typically combined with manually extracted features, e.g., LBP, SIFT, and HoG, which limits their extension to state-of-the-art deep discriminative models.
With the recent development of deep graph networks, relation modeling with graph-based deep models has attracted increasing attention. In this paper, we use the graph convolutional network (GCN) [9] for AU relation modeling to strengthen facial AU detection. In particular, following EAC-Net [11], AU-related regions are extracted first, and these AU regions are then fed into AU-specific auto-encoders for deep representation extraction. Next, each latent representation is fed into the GCN as a node, and the connection mode of the GCN is determined by the relationships among AUs. Finally, the assembled features are concatenated for AU detection. The auto-encoders are trained first, and then the whole framework is trained jointly.
The contributions of this paper are twofold. (1) We propose a deep learning framework for AU detection with graph convolutional network for AU relation modeling. (2) Results of extensive experiments conducted on two benchmark datasets demonstrate that our proposed framework significantly outperforms the state-of-the-art.
2 Related Work
Our proposed framework is closely related to facial AU detection and graph convolutional network.
Facial AU detection: Previous research [27, 23, 20] has shown that AUs are correlated, which makes AU detection a problem different from standard expression recognition. To capture such correlations, a generative dynamic Bayesian network (DBN) was proposed to model the AU relationships and their temporal evolution. Rather than being learned, pairwise AU relations can also be explicitly inferred from statistics in the annotations and then injected into a multi-task learning framework to select important patches for each AU. In addition, a restricted Boltzmann machine (RBM) was developed to directly capture the dependencies between image features and AU relationships. Following this direction, image features and AU outputs were fused in a continuous latent space using a conditional latent variable model. Song et al. [20] studied the sparsity and co-occurrence of AUs. Although considering the relationships among AUs brings improvements, these approaches rely on manually extracted features such as SIFT, LBP, or Gabor, rather than deep features.
With the recent rise of deep learning, CNNs have been widely adopted to extract AU features. Zhao et al. [26] proposed a deep region and multi-label learning (DRML) network that divides the face image into 8 × 8 blocks and uses individual convolutional kernels to convolve each block. Although this approach treats each face as a group of individual parts, it divides blocks uniformly and does not consider FACS knowledge, which leads to poor performance. Wei Li et al. [11] proposed the Enhancing and Cropping Net (EAC-Net), which gives significant attention to individual AU centers; however, this approach does not consider AU relationship modeling, and, lacking RoI-level supervision, it can only give coarse guidance. All of these studies demonstrate the effectiveness of deep learning for feature extraction in AU detection. However, none of them considers AU relation modeling.
Graph Convolutional Network: There has been much work on graph convolution, and the construction of GCNs mainly follows two streams: the spatial perspective [4, 1, 19] and the spectral perspective [18, 2, 8, 7, 3]. Spatial methods apply convolution filters directly to the graph vertices and their neighbors. Atwood et al. [1] proposed diffusion-convolutional neural networks (DCNNs), in which transition matrices define the neighborhood of each node. Niepert et al. [19] extract and normalize a neighborhood of exactly k nodes for each node; the normalized neighborhood then serves as the receptive field for the convolutional operation. Unlike spatial methods, spectral methods utilize the eigenvalues and eigenvectors of graph Laplacian matrices. Bruna et al. [2] proposed the spectral network, in which the convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian.
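For reference, the spectral construction described above is commonly written as follows. The notation is the usual one from the spectral graph convolution literature rather than this paper's own: $A$ is the adjacency matrix, $D$ its degree matrix, $x$ a signal on the nodes, and $g_\theta(\Lambda)$ a diagonal filter acting on the Laplacian eigenvalues.

```latex
L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top},
\qquad
g_\theta \star x = U \, g_\theta(\Lambda) \, U^{\top} x
```

Here $U^{\top} x$ is the graph Fourier transform of $x$, so filtering amounts to rescaling each spectral component and transforming back.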
Recently, Li et al. [10] proposed the AU semantic relationship embedded representation learning (SRERL) framework, which combines facial AU detection with a Gated Graph Neural Network (GGNN) [12] and achieved good results. In contrast, our method adopts the Graph Convolutional Network (GCN), which is commonly used for classification tasks with relation modeling, whereas the GGNN adopted in [10] is inspired by the GRU and is mainly used for tasks such as Visual Question Answering and Semantic Segmentation. In addition, our method has only about 2.3 million parameters, whereas SRERL has more than 138 million.
In this paper, we apply the spectral perspective of GCNs [9] for AU relation modeling. Our GCN is built by stacking multiple graph convolution layers on the AU relation graph. The outputs of the GCN are updated features for each AU region node that encode their relationships and can be used for classification.
3 Proposed Method
The proposed AU-GCN framework, which detects AUs by modeling AU relations through a GCN, is shown in Figure 1. It consists of four modules: AU local region division, AU local region representation, the AU relation graph, and convolutions on the graph. Given a face image with facial landmark key points, AU-related local regions are first extracted, taking EAC-Net [11] as a reference. After that, each AU region is represented by the latent vector of an auto-encoder supervised by the reconstruction loss and the AU classification loss. Next, each latent vector is fed into the GCN as a node, and AU relationships are modeled through the edges of the GCN. Multiple layers of graph convolution are applied to the input data to generate higher-level feature maps on the graph. These features are then classified under a modified multi-label cross-entropy loss and a Dice loss. The components of the AU-GCN model are described below.
3.2 AU ROI Partition Rule
Most recent deep-learning-based image classification methods use CNNs for feature extraction, and the basic assumption of a standard CNN is that convolutional kernels are shared across the entire image. For an image with a relatively fixed structure, such as a human face, a standard CNN may fail to capture subtle appearance changes. To focus more on AU-specific regions, the AU local region partition rules are first defined, taking FACS [5] and EAC-Net [11] as references.
The first step is to use the facial landmark information to locate the AU centers. The landmark points provide rich information about the face, which helps us locate specific AU-related facial areas. Then, taking each AU-related landmark as the center, an n × n region is extracted as the AU local region. Figure 2(a) shows the AU region partition of the face, in which the face image is partitioned into 19 basic ROIs using AU-related landmarks; AU12, AU14 and AU15 share a ROI, and AU23, AU24, AU25 and AU26 share a ROI. Of these, 12 ROIs are shared by the benchmark datasets BP4D [24] and DISFA [16], with 6 additional AU ROIs for BP4D and one additional ROI for DISFA. Because all of these ROIs are local facial features and the global facial feature is ignored, another special ROI representing the whole face image is introduced. All of these AU-related ROIs are resized to n × n for further representation learning and relation modeling. In total, BP4D has 19 ROIs and DISFA has 14 ROIs.
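The landmark-centered cropping step can be sketched as below. This is a minimal illustration in plain Python, not the authors' implementation: the function name `crop_roi`, the landmark names and coordinates, and the zero padding at image borders are all assumptions made for the example.

```python
def crop_roi(image, center, n):
    """Crop an n x n patch centered on (row, col).

    Pixels that fall outside the image are filled with 0 (an assumed
    border policy; the text does not say how borders are handled).
    """
    rows, cols = len(image), len(image[0])
    r0 = center[0] - n // 2
    c0 = center[1] - n // 2
    patch = []
    for r in range(r0, r0 + n):
        row = []
        for c in range(c0, c0 + n):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(image[r][c] if inside else 0)
        patch.append(row)
    return patch


# Toy 6 x 6 "image" whose pixel value encodes its position (10 * row + col).
image = [[10 * r + c for c in range(6)] for r in range(6)]

# Hypothetical AU centers derived from landmarks.
landmarks = {"AU1_left": (1, 1), "AU1_right": (1, 4)}
rois = {name: crop_roi(image, center, 3) for name, center in landmarks.items()}
```

Each entry of `rois` is one AU local region; in the pipeline described above, every such patch would then be resized to a common n × n size before being passed to its auto-encoder.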
3.3 AU Deep Representation Extraction
Figure 2(b) shows the architecture of the network for AU deep representation extraction. The purpose of this step is to obtain fixed-dimensional deep representations rich in AU information for further AU relation modeling and AU detection.
The AU-specific ROIs obtained in the previous step are fed into AU-specific auto-encoders (AEs) [15] to reconstruct each AU ROI. To obtain latent vectors rich in AU information, two kinds of losses are introduced to constrain the extracted deep representations. The first loss is the pixel-wise L1 reconstruction loss, computed between each ground-truth AU ROI image and its reconstructed AU ROI image over all n × n pixels of the ROI.
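One plausible way to write this reconstruction loss is sketched below. Only $n$ (the ROI side length) comes from the text; the symbols $L_{R}$, $x$ (ground-truth ROI image) and $\hat{x}$ (reconstructed ROI image), as well as the normalization by $n^{2}$, are assumptions introduced for illustration.

```latex
L_{R} = \frac{1}{n^{2}} \sum_{u=1}^{n} \sum_{v=1}^{n}
        \left| x_{u,v} - \hat{x}_{u,v} \right|
```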
To ensure that the extracted AU deep representation contains as much AU information as possible, a second loss for ROI-level multi-label AU detection is introduced. Here, C is the number of AU classes and R is the number of ROIs obtained in the previous step, i.e., 19 ROIs for BP4D and 14 ROIs for DISFA, according to the provided AU labels. The ground-truth AU labels form a matrix Y, each element of which corresponds to one AU and one AU ROI: a value of 0 denotes that the AU is inactive in that AU ROI, and a value of 1 denotes that it is active. In addition, the ground truth must satisfy the constraint of the AU region partition rule: an element is 0 if its AU does not belong to its AU ROI. In particular, when an AU ROI covers multiple AUs, such as the ROI containing AU12, AU14 and AU15 in BP4D, its labels also follow the above rules. The ROI-level labels also help to improve AU detection performance through the spatial constraint and the supervised information of the ROIs. Finally, the overall loss function for AU deep representation extraction combines the reconstruction loss and the ROI-level detection loss, balanced by a trade-off parameter.
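A hedged sketch of how the ROI-level detection loss and the overall auto-encoder loss might be written is given below. The binary cross-entropy form, the averaging over $R \cdot C$ terms, and the symbols $L_{ROI}$, $L_{AE}$, $L_{R}$ (the reconstruction loss), $\hat{Y}$ (predicted probabilities) and $\lambda$ (the trade-off parameter) are assumptions; only $C$, $R$ and $Y$ appear in the text.

```latex
L_{ROI} = -\frac{1}{R\,C} \sum_{j=1}^{R} \sum_{i=1}^{C}
          \left[ Y_{ij} \log \hat{Y}_{ij}
               + (1 - Y_{ij}) \log\!\left(1 - \hat{Y}_{ij}\right) \right],
\qquad
L_{AE} = L_{R} + \lambda \, L_{ROI}
```

Under this reading, $Y_{ij} = 0$ whenever AU $i$ does not belong to ROI $j$, which is how the partition rule enters the supervision.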
3.4 AU Relation Graph
In this section, an AU relation graph is proposed to encode AU relations, in which AUs with high-confidence relations are connected. The relationships among AUs in the ROIs are analyzed to construct the AU relation graph, and pairs of related AU ROIs are connected in the graph. The resulting graph is shown in Section 4.
Formally, let C be the number of AUs and R the number of AU ROIs. Given all labels in the training set, the conditional probability that one AU equals 1 when another AU equals 1 is calculated for every pair of AUs, which yields a relation matrix of dimension C × C. To transform this matrix into a symmetric matrix, a symmetrization function is applied to its (i,j)-th elements. Then, a threshold is set to convert the symmetric matrix into a 0-1 matrix.
Next, a graph with R AU ROI nodes is built according to this 0-1 matrix. First, each node is connected to itself. Second, each node is connected with its symmetrical node, e.g., the AU1_left ROI with the AU1_right ROI, and AU23_up with AU23_down. Third, if the (i,j)-th element of the 0-1 matrix equals 1, AU i is strongly related to AU j, so the nodes belonging to AU i are connected to the nodes belonging to AU j. Finally, the last node, which represents the whole facial image, is connected to all nodes, which lets the global feature help the local features learn more AU information. By building the AU relation graph, we obtain richer AU relation information and enhance the ability of the classifiers in the subsequent inference process.
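The graph construction steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the toy labels, the threshold value, and in particular the symmetrization by averaging the two conditional probabilities are all hypothetical choices, since the text does not give the exact function.

```python
def conditional_prob(labels, num_aus):
    """P[i][j] estimates P(AU i = 1 | AU j = 1) from binary label rows."""
    P = [[0.0] * num_aus for _ in range(num_aus)]
    for j in range(num_aus):
        active_j = [row for row in labels if row[j] == 1]
        for i in range(num_aus):
            if active_j:
                P[i][j] = sum(row[i] for row in active_j) / len(active_j)
    return P


def relation_matrix(P, threshold):
    """Symmetrize by averaging (an assumed choice), then binarize."""
    n = len(P)
    return [[1 if (P[i][j] + P[j][i]) / 2 >= threshold else 0
             for j in range(n)] for i in range(n)]


def roi_adjacency(roi_aus, mirror_pairs, B):
    """Build the R x R ROI graph; the last ROI is the whole-face node."""
    R = len(roi_aus)
    A = [[0] * R for _ in range(R)]
    for a in range(R):
        A[a][a] = 1                      # self-connection
        A[a][R - 1] = A[R - 1][a] = 1    # whole-face node touches every node
    for a, b in mirror_pairs:            # left/right or up/down counterparts
        A[a][b] = A[b][a] = 1
    for a in range(R - 1):               # connect ROIs of strongly related AUs
        for b in range(R - 1):
            if any(B[i][j] == 1 for i in roi_aus[a] for j in roi_aus[b]):
                A[a][b] = 1
    return A


# Toy example: 3 AUs, 4 labeled frames, 5 ROIs (the last one is the whole face).
labels = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
B = relation_matrix(conditional_prob(labels, 3), threshold=0.5)
roi_aus = [{0}, {0}, {1}, {2}, set()]    # AU indices covered by each ROI
A = roi_adjacency(roi_aus, mirror_pairs=[(0, 1)], B=B)
```

In this toy data AU 0 and AU 1 co-occur often, so their ROIs end up connected, while the ROI of AU 2 is linked only to itself and to the whole-face node.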
3.5 Convolutions on Graph
To perform reasoning on the graph, we apply the Graph Convolutional Networks (GCNs) proposed in [9]. Unlike standard convolutions, which operate on a local regular grid, graph convolutions allow us to compute the response of a node based on its neighbors as defined by the graph relations. Performing graph convolutions is thus equivalent to performing message passing inside the graph. The outputs of the GCNs are updated features of each ROI node. Inspired by this, we design a GCN-based multi-label encoder for AU detection. One layer of graph convolution combines the R × R adjacency matrix of the graph introduced above, the input features of the layer, and the weight matrix of the layer. The input to the first layer consists of the latent vectors of the R ROI nodes, and each subsequent layer takes the output of the previous layer as its input. The weight matrices determine the feature dimension of each layer's output, and two graph convolution layers are stacked. After each graph convolution layer, we apply Dropout and then ReLU before the updated feature is forwarded to the next layer.
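A minimal sketch of two stacked graph convolution layers is shown below, assuming the common form in which the output of a layer is ReLU(A H W) for adjacency matrix A, input features H and weight matrix W. The names `matmul` and `gcn_layer` and the toy matrices are illustrative; the adjacency matrix is used without normalization because the text does not specify one, and Dropout is omitted because it acts as the identity at inference time.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def gcn_layer(A, H, W):
    """One graph convolution followed by ReLU: max(0, A H W)."""
    Z = matmul(matmul(A, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


# Toy graph with 3 ROI nodes (self-loops included) and 2-dim input features.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
H0 = [[1, 0], [0, 1], [1, 1]]   # one latent vector per node
W0 = [[1, -2], [0, 1]]          # first layer: 2 -> 2 features
W1 = [[1], [1]]                 # second layer: 2 -> 1 feature

H1 = gcn_layer(A, H0, W0)       # each node aggregates its neighbors
H2 = gcn_layer(A, H1, W1)       # final per-node features
```

Each row of `H2` is the updated feature of one node; flattening these rows gives the vector that a classifier would consume.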
3.6 Facial AU Detection
As illustrated in Figure 1, the updated feature after the graph convolutions is flattened. The flattened feature is then forwarded to a fully connected network (FCN) for AU detection. Finally, we obtain the C-dimensional detection results, one for each AU.
Facial AU detection can be regarded as a multi-label binary classification problem, which we train with a weighted multi-label softmax loss. For each AU, this loss compares the ground-truth probability of occurrence, which is 1 if the AU occurs and 0 otherwise, with the corresponding predicted occurrence probability. A per-AU trade-off weight is introduced to alleviate the data imbalance problem. For most facial AU detection benchmarks, the occurrence rates of AUs are imbalanced [13, 14]. Since AUs are not mutually independent, imbalanced training data has a bad influence on this multi-label learning task. In particular, the weight of each AU is computed from the occurrence rate of that AU in the training set.
In some cases, some AUs appear rarely in the training samples, for which the softmax loss often makes the network prediction strongly biased towards absence. To overcome this limitation, a weighted multi-label Dice coefficient loss [17] with a smooth term is further introduced. The Dice coefficient is also known as the F1-score, F1 = 2pr/(p + r), the most popular metric for facial AU detection, where p and r denote precision and recall, respectively. With the help of the weighted Dice coefficient loss, we also take into account the consistency between the learning process and the evaluation metric. Finally, the AU detection loss is defined as the combination of the weighted softmax loss and the weighted Dice loss, balanced by a trade-off parameter.
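The two detection losses can be sketched for a single sample as below. The exact formulas are not recoverable from the text, so this uses commonly seen forms as assumptions: a weighted binary cross-entropy standing in for the weighted multi-label softmax loss, and a weighted Dice loss with a smooth term. The function names, the clamping constant `tiny`, the default `smooth` and `trade_off` values, and the example weights are all hypothetical.

```python
import math


def weighted_cross_entropy(y_true, y_prob, weights, tiny=1e-7):
    """Weighted multi-label cross-entropy averaged over AUs (assumed form)."""
    total = 0.0
    for y, p, w in zip(y_true, y_prob, weights):
        p = min(max(p, tiny), 1.0 - tiny)   # keep log() finite
        total += w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return -total / len(y_true)


def weighted_dice(y_true, y_prob, weights, smooth=1.0):
    """Weighted multi-label Dice loss with a smooth term (assumed form)."""
    total = 0.0
    for y, p, w in zip(y_true, y_prob, weights):
        dice = (2.0 * y * p + smooth) / (y * y + p * p + smooth)
        total += w * (1.0 - dice)
    return total / len(y_true)


def detection_loss(y_true, y_prob, weights, trade_off=1.0):
    """Cross-entropy plus trade_off times Dice, as a simple weighted sum."""
    return (weighted_cross_entropy(y_true, y_prob, weights)
            + trade_off * weighted_dice(y_true, y_prob, weights))


# Toy sample with 3 AUs; the rare first AU gets a larger (hypothetical) weight.
y_true = [1, 0, 0]
weights = [2.0, 0.5, 0.5]
good = detection_loss(y_true, [0.9, 0.1, 0.2], weights)
bad = detection_loss(y_true, [0.1, 0.9, 0.8], weights)
```

Giving rare AUs a larger weight makes a missed occurrence cost more, which is the imbalance correction the paragraph above describes.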
Dataset: The effectiveness of our proposed AU-GCN is evaluated on two benchmark datasets: BP4D [24] and DISFA [16]. For both BP4D and DISFA, a 3-fold partition is adopted to ensure that subjects are mutually exclusive across the train/val/test sets, following previous related work [26, 11]. Frames with intensities equal to or greater than 2 are considered positive, while the others are treated as negative. BP4D contains 2D and 3D videos of 41 young adults during various emotion inductions while interacting with an experimenter. We used 328 videos (41 participants × 8 videos each) with 10 AUs coded, resulting in 140,000 valid face images. For each AU, we sampled 100 positive frames and 200 negative frames from each video. DISFA [16] contains 27 subjects watching video clips and provides 8 AU annotations with intensities, with 130,000 valid face images. We used the frames with AU intensities of 2 or higher as positive samples and the rest as negative ones. To be consistent with the 8-video setting of BP4D, we sampled 800 positive frames and 1600 negative frames from each video.
The AU detection performance was evaluated on two commonly used frame-based metrics: F1-score and area under curve (AUC). F1-score is the harmonic mean of precision and recall, and widely used in AU detection. AUC quantifies the relation between true and false positives. For each method, we computed average metrics over all AUs (denoted as Avg.).
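As a concrete illustration of the frame-based F1 metric and its average over AUs, a minimal plain-Python sketch follows. The function names and the toy labels are illustrative, and returning 0.0 when there are no true positives is a convention chosen here, not something stated in the text.

```python
def f1_score(y_true, y_pred):
    """Frame-based F1 for one AU: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                      # assumed convention for the empty case
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(labels, preds):
    """Average the per-AU F1 over all AUs (the 'Avg.' column)."""
    num_aus = len(labels[0])
    scores = [f1_score([frame[k] for frame in labels],
                       [frame[k] for frame in preds])
              for k in range(num_aus)]
    return sum(scores) / num_aus


# Four frames, two AUs: the first AU is half right, the second is perfect.
labels = [[1, 1], [1, 0], [0, 1], [0, 0]]
preds = [[1, 1], [0, 0], [1, 1], [0, 0]]
avg = average_f1(labels, preds)
```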
Implementation: For each face image, we perform a similarity transformation to obtain a 200 × 200 × 3 color face. This transformation is shape-preserving and brings no change to the expression. To enhance the diversity of the training data, the face images are flipped for data augmentation. Our AU-GCN is trained using PyTorch with stochastic gradient descent (SGD), a mini-batch size of 256, a momentum of 0.9 and a weight decay of 0.0005. We decay the learning rate by a factor of 0.1 after every 10 epochs. The structure parameters of AU-GCN are chosen as 150, 30 and 12 for the successive feature dimensions; the ROI size n is 25, and the number of AUs C and the number of ROIs R depend on the dataset, with R being 19 for BP4D and 14 for DISFA. The graph connection matrices for BP4D and DISFA are shown in Table 1. The two hyperparameters are obtained by cross-validation and are set to 3 and 4, respectively, in our experiments. AU-GCN is first trained with only the AEs optimized for 12 epochs. Next, we load the pretrained parameters up to the latent vectors and train all the modules jointly for 40 epochs.
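The step decay of the learning rate described above can be sketched as a small function. The base learning rate of 0.01 is a hypothetical value, since the text does not state one; the step size of 10 epochs and the factor 0.1 come from the text. In PyTorch, the same schedule is available as `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.1`.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate after decaying by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Hypothetical base learning rate; epochs are counted from 0.
base_lr = 0.01
schedule = [step_lr(base_lr, epoch) for epoch in range(40)]
```

With these settings the 40 joint-training epochs are split into four plateaus of 10 epochs each, every plateau ten times smaller than the previous one.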
4.2 Comparison with State-of-the-Art Methods
We compare our AU-GCN against state-of-the-art single-image-based AU detection methods under the same 3-fold cross-validation setting. These include both traditional methods, LSVM [6] and JPML [25], and deep learning methods, LCN [21], DRML [26] and EAC-Net [11]. Note that EAC-Net [11] is not compared on AUC because it reports accuracy instead of AUC.
Table 2 reports the F1-score and AUC results of different methods on BP4D. AU-GCN outperforms all of these previous works on the challenging BP4D dataset. It is superior to all the conventional methods, which demonstrates the strength of deep-learning-based methods. Compared to the state-of-the-art methods, AU-GCN brings relative increments of 6.9% and 31.3% in average F1-score and AUC, respectively, which verifies the effectiveness of AU relation modeling with GCN. In addition, our method obtains high accuracy without sacrificing F1-score, which we attribute to the integration of the softmax loss and the Dice coefficient loss.
Experimental results on the DISFA dataset are shown in Table 3, from which it can be observed that AU-GCN again outperforms all the state-of-the-art works with significant improvements. Specifically, AU-GCN increases the average F1-score and AUC by 6.5% and 22.3% in relative terms over the state-of-the-art methods, respectively. Due to the serious data imbalance issue in DISFA, the performance of different AUs fluctuates severely in most of the previous methods. For instance, the accuracy of AU 12 is far higher than that of other AUs for LSVM and APL. Although both AU-GCN and EAC-Net use AU local features, the GCN better expresses the AU relation information.
4.3 Ablation Study
To investigate the effectiveness of each component in our framework, Table 4 presents the average F1-score and AUC for different variants of AU-GCN on the BP4D benchmark, where “w/o” is the abbreviation of “without”. Each variant is composed of different components of our framework. AU-Net is the framework without GCN relation modeling (GCN), Dice loss (D) and global facial information (F).
|Method|Avg. F1-score|Avg. AUC|
|---|---|---|
|AU-GCN w/o F,D|59.1|83.9|
|AU-GCN w/o F|61.5|85.8|
|AU-GCN w/o D|61.9|86.8|
Contribution of the GCN: By integrating the graph convolutional network (GCN), AU-GCN w/o F,D achieves higher F1-score and AUC than AU-Net. In AU-Net, the node features are concatenated and fed into the FCN directly instead of passing through the GCN. This result shows that the GCN can capture the strong relationships between AUs and strengthen relation learning for AU detection.
Integration of whole facial information: By adding the resized whole facial image as a node to the GCN and connecting this node with all the other nodes, AU-GCN w/o D achieves better F1-score and AUC than AU-GCN w/o F,D. Since the other features are all local AU features, adding the whole facial image supplies a global feature to the GCN and improves performance, which demonstrates that global facial information is helpful for local AU detection.
Integration of Dice loss: After integrating the weighted softmax loss with the Dice loss, AU-GCN w/o F attains higher average F1-score and AUC than AU-GCN w/o F,D. The softmax loss focuses more on classification accuracy than on the balance between precision and recall, whereas the Dice loss optimizes the network from the perspective of F1-score; combining the two therefore improves both F1-score and AUC.
In this paper, to make full use of AU local features and their relationships, we have presented a facial AU detection method that integrates a graph convolutional network for explicit AU relation modeling. To the best of our knowledge, this is the first study that combines facial AU detection and GCN in one end-to-end framework. Extensive experiments on two benchmark AU datasets demonstrate that the proposed network outperforms state-of-the-art methods for AU detection, and the effectiveness of the proposed modules is validated through a series of ablation studies.
This work is supported by the National Natural Science Foundation of China under Grants of 41806116 and 61503277. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.
[1] (2016) Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001.
[2] (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
[3] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
[4] (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232.
[5] (1997) What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (FACS). Oxford University Press, USA.
[6] LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9 (Aug), pp. 1871–1874.
[7] (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150.
[8] (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
[9] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[10] (2019) Semantic relationships guided representation learning for facial action unit recognition. arXiv preprint arXiv:1904.09939.
[11] (2017) EAC-Net: a region-based deep enhancing and cropping approach for facial action unit detection. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 103–110.
[12] (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
[13] (2018) Conditional adversarial synthesis of 3D facial action units. arXiv preprint arXiv:1802.07421.
[14] (2017) Automatic analysis of facial actions: a survey. IEEE Transactions on Affective Computing.
[15] (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pp. 52–59.
[16] (2013) DISFA: a spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4 (2), pp. 151–160.
[17] (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
[18] Bayesian semi-supervised learning with graph Gaussian processes. In Advances in Neural Information Processing Systems, pp. 1683–1694.
[19] (2016) Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023.
[20] (2015) Exploiting sparsity and co-occurrence structure for action unit recognition. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 1, pp. 1–8.
[21] (2014) DeepFace: closing the gap to human-level performance in face verification. pp. 1701–1708.
[22] Facial action unit recognition and intensity estimation enhanced through label dependencies. IEEE Transactions on Image Processing 28 (3), pp. 1428–1442.
[23] (2014) Capturing global semantic relationships for facial action unit recognition. In IEEE International Conference on Computer Vision.
[24] (2013) A high-resolution spontaneous 3D dynamic facial expression database. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6.
[25] (2016) Joint patch and multi-label learning for facial action unit and holistic expression recognition. IEEE Transactions on Image Processing 25 (8), pp. 3931–3946.
[26] (2016) Deep region and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3391–3399.
[27] (2005) A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21 (1), pp. 71–79.