Facial action unit intensity estimation aims to estimate the intensity levels of facial muscle actions, known as Action Units (AUs). Progress in this area can facilitate applications such as chatbots [4, 3], empathic design (lin2019moel, lin2019moel), and facial expression recognition (FER) (fan2018multi, fan2018multi). In real-world scenarios, human facial expressions can be decomposed into varied combinations of AUs and their intensities. Therefore, estimating the intensity levels of AUs is an important step toward interpreting facial expressions. To provide a comprehensive encoding method, the Facial Action Coding System (FACS) (eckman1978facial, eckman1978facial) defines rules for scoring AU intensity on a six-point ordinal scale. However, large-scale acquisition of facial AU intensity data is difficult and time-consuming, since trained specialists are required to annotate the data. Furthermore, the facial appearance changes associated with different AU intensities are subtle, and individuals differ in expressiveness due to their own physical characteristics. The six-level AU intensity data are also highly imbalanced, as the highest level occurs rarely. Thus, compared to FER [1, 2] and AU recognition (li2019semantic, li2019semantic), discerning different AU intensities remains a far more challenging task.
The co-occurrences of different AU intensity levels may influence the overall estimation results. The facial appearance changes caused by a certain AU are also affected by the intensities of other AUs. As the example in Figure 1 shows, AU6 (Cheek Raiser) and AU12 (Lip Corner Puller) are typically activated together in the case of the happy expression. More specifically, a high intensity of AU12 can increase the probability of AU6 occurring, and vice versa. By contrast, a low intensity of AU12 might be present independently, without activating other AUs.
To infer the co-occurrences of AUs, prior studies undertaken by Walecki et al. (walecki2016copula) and Wang et al. (wang2018facial) are typically based on probabilistic graphical models that directly learn the AU dependencies. Intuitively, the feature level information should contain more comprehensive descriptions than the final outputs. Hence, we formulate the problem into a heatmap regression-based framework, where the feature maps preserve rich semantic information associated with AU intensities and locations. We infer that through activating various feature channels simultaneously, the framework would produce the final AU co-occurrence pattern accordingly. This conjecture naturally leads to the idea of using the semantic correspondence between feature channels for discovering the latent co-occurrence relationships of AU intensities. The key novelty of our method is that such relationships are automatically captured via dynamically computing the correspondences from feature maps layer by layer, rather than relying on AU probabilities as in previous approaches.
In this study, one motivation lies in the phenomenon that the specific AU with a certain intensity can cause visible facial appearance changes. This can be reflected by a heatmap, which is generally utilized to visualize the spatial distribution of a response. Heatmap regression has proven to be notably effective (sanchez2018joint, sanchez2018joint), and thus our framework inherits the advantage of the heatmap-based scheme for AU intensity estimation. In our framework, the intensity level of each AU is learned from the response at its corresponding location on the heatmap. For example, if there is a larger response, we can infer that this location may belong to an AU with higher intensity.
The other motivation comes from graph convolutional neural networks (GCNNs) (scarselli2008graph, scarselli2008graph), which are proposed to pass and aggregate information in the graph structure. Due to the advantages in extracting the discriminative feature representations, GCNNs have been widely used in image classification (wang2018zero, wang2018zero), semantic segmentation (liang2017interpretable, liang2017interpretable), relational reasoning (battaglia2016interaction, battaglia2016interaction), etc. More importantly, GCNN describes the intrinsic relationship between various vertex nodes of the graph by learning an adjacency matrix, thus providing a potential way for us to explore the relationships among multiple feature maps. In our method, we introduce a simple yet effective semantic correspondence convolution module, dubbed SCC, which automatically learns the semantic relationships among feature maps. We summarize the key contributions of our work as follows:
i) We propose to leverage the semantic correspondence for modeling the implicit co-occurrence relations of AU intensity levels in a heatmap regression framework, where the feature maps encode rich semantic descriptions and spatial distributions of AUs.
ii) We introduce a semantic correspondence convolution module to dynamically compute the correspondences among feature maps layer by layer. To our knowledge, this is the first work that brings the advantages of dynamic graph convolutions to AU intensity estimation.
iii) We evaluate our method on two benchmark datasets, showing that our approach performs favorably against related deep models. Code will be released at https://github.com/EvelynFan/FAU.
To date, most works have focused on facial AU detection. Due to the limited datasets available with AU intensity annotations, few works take into account AU relations for intensity estimation. We review both hand-crafted feature-based works and deep learning-based methods that leverage relationship modeling for AU intensity estimation.
Hand-Crafted Feature-Based Method
Traditional hand-crafted feature-based AU intensity estimation methods have sought to address this problem based on the probabilistic models, directly capturing the AU dependence. For instance, (sandbach2013markov, sandbach2013markov) combined Support Vector Regression (SVR) outputs with AU intensity combination priors in the Markov Random Field (MRF) structures to improve the AU intensity estimation. (kaltwang2015latent, kaltwang2015latent) formulated a generative latent tree (LT) model, which modeled the higher-order dependencies among the input features and target AU intensities. (walecki2016copula, walecki2016copula) proposed a Copula Ordinal Regression (COR) framework to model the co-occurring AU intensity levels with the power of copula functions and conditional random fields (CRFs). One issue of these shallow model-based methods is that they need to separate feature extraction from model training. To this end, our proposed approach learns the deep representations and semantic relationships jointly in an end-to-end learning system.
Deep Learning-Based Method
Researchers have just begun to investigate how to leverage deep models that consider AU relations for intensity estimation. (walecki2017deep, walecki2017deep) placed a CRF graph on top of the output layer of a CNN, in order to learn the relations between different AUs using pairwise connections. (zhang2018bilateral, zhang2018bilateral) exploited relationships among instances and incorporated domain knowledge to learn a frame-level intensity regressor. More recently, the work of (wang2018facial, wang2018facial) recognized AUs and estimated their intensities via hybrid Bayesian Networks (BNs), where the global dependencies among AUs were captured. Although these deep model-based works can learn better representations, some of them are highly dependent on the probabilistic modeling of AU dependencies. In contrast, our work investigates an alternative method to capture AU dependencies in a heatmap regression framework. With the proposed SCC module incorporated, more complete and high-order AU dependencies are explored.
The model introduced in (li2019semantic, li2019semantic) learned the semantic relationships with a GNN for detecting the presence or absence of AUs. Unlike their method, we consider semantic similarity relationships among feature channels rather than regions. The closest work (sanchez2018joint, sanchez2018joint) directly regressed AU locations and intensities through a single Hourglass (newell2016stacked, newell2016stacked). In our work, we combine heatmap regression and semantic relationship learning in a unified end-to-end framework.
The basic framework is implemented by adding several deconvolutional layers on ResNet (xiao2018simple, xiao2018simple), as Figure 3 shows. Furthermore, the SCC modules are incorporated for dynamically computing the correspondences from multi-resolution feature maps layer by layer. In particular, we predefine the central location of each AU based on the coordinates of facial landmarks in Figure 2.
We integrate both the spatial and intensity information of AUs into an end-to-end trained heatmap regression framework. Given a set of images, we formalize the problem as that of jointly predicting multiple AU heatmaps, which encode the intensity of each AU located at a certain position. As illustrated in Figure 3, each deconvolutional layer is followed by an SCC module that models the relationship among multiple feature maps at this specific resolution level. Finally, the last layer generates a set of $N$ heatmaps for all AUs, where $N$ is the total number of predefined AU locations. The ground-truth heatmap $G(z)$ over pixel positions $z$ ($z \in \mathbb{R}^2$) for a predefined AU location is generated by applying a Gaussian function centered on its corresponding coordinate $\bar{z}$:
$$G(z) = I \exp\left(-\frac{\lVert z - \bar{z} \rVert_2^2}{2\sigma^2}\right),$$
where $I$ is the labeled intensity of the specific AU, and $\sigma$ is the standard deviation. Thus, the value at $\bar{z}$ is the largest value in the generated heatmap. If a pixel is farther away from $\bar{z}$, its value smoothly decreases. In our case, we expect the centre coordinate $\bar{z}$ to reflect where the specific AU causes appearance changes, whereas $I$ captures the degree of the changes.
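The ground-truth heatmap described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Gaussian is taken to be unnormalized so that its peak equals the labeled intensity, and the function name `gaussian_heatmap`, the 64x64 resolution, the centre coordinate, and the value of the standard deviation are all hypothetical choices, not values from the paper.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, intensity, sigma):
    """Gaussian bump centred on an AU location.

    Assumption: the Gaussian is unnormalised, so the peak value equals
    the labelled intensity.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return intensity * np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical example: one AU location at (x=20, y=30) with intensity 3.
target = gaussian_heatmap(64, 64, center_xy=(20, 30), intensity=3.0, sigma=2.0)
```

The peak of `target` sits at the chosen centre and carries the intensity, while pixels farther from the centre receive smoothly smaller values.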
The predicted heatmap is parametrized by the network weights and biases. We train the network by minimizing the distance between the predicted heatmap and the ground-truth heatmap.
For the inference stage, the highest value of the predicted heatmap is taken as the predicted AU intensity, along with its corresponding location, given by the position of that maximum. If the AU location is not unique, we take the average of those highest values.
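The training objective and the inference rule can be sketched as follows. The specific distance is an assumption here (mean squared error is used purely for illustration), and the helper names `heatmap_loss`, `decode_heatmap`, and `decode_au` are hypothetical.

```python
import numpy as np

def heatmap_loss(pred, target):
    """Mean squared difference between heatmaps (assumed distance)."""
    return float(np.mean((pred - target) ** 2))

def decode_heatmap(heatmap):
    """Return the peak value and its (x, y) position for one 2-D heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(heatmap[row, col]), (int(col), int(row))

def decode_au(heatmaps):
    """Average the peak values when one AU has several locations."""
    return float(np.mean([decode_heatmap(h)[0] for h in heatmaps]))

# Toy example: an AU with two locations (e.g. left and right side of the face).
left = np.zeros((8, 8))
left[2, 3] = 4.0
right = np.zeros((8, 8))
right[2, 5] = 2.0

intensity, location = decode_heatmap(left)
au_intensity = decode_au([left, right])
```

In this toy case the left heatmap decodes to intensity 4 at position (3, 2), and averaging the two peaks gives an AU intensity of 3.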
SCC: Semantic Correspondence Convolution
Given the co-occurrences of different AU intensities, the semantic representations of feature maps are highly correlated in spatial distributions. We introduce the semantic correspondence convolution (SCC) operation, aiming to model the correlation among feature channels, where each channel encodes a specific visual pattern of AU. The idea behind the SCC layer is rather simple, mainly inspired by the dynamic graph convolutions in geometry modeling (wang2018dynamic, wang2018dynamic). Intuitively, the feature channels with similar semantic patterns would be activated simultaneously when a specific co-occurrence pattern of AU intensities emerges. In the SCC module, we first construct the k-NN graph by grouping sets of closest feature maps, thus learning to find different co-occurrence patterns. To further exploit the edge information of the graph, we then apply the convolution operations on the edges that connect feature maps sharing similar semantic patterns. Afterwards, the aggregation function, i.e., MAX, is applied to summarize the most discriminative features for improving AU intensity estimation.
In a space with a distance metric, neighboring feature maps with similar patterns form a meaningful subset. Consider a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the vertices and edges, respectively. The feature map set is denoted by $X = \{x_1, \ldots, x_C\}$, and the size of each feature map (channel) is given by $H \times W$. In our approach, we rearrange each feature map into a feature vector of length $HW$. Note that this rearrangement does not change the feature distribution. $\mathcal{G}$ is constructed as the k-nearest neighbor (k-NN) graph of $X$, and each node represents a specific feature map. To make use of the edge information, the edge feature is defined by $e_{ij} = h_{\Theta}(x_i, x_j)$, where $h_{\Theta}$ is a nonlinear function with trainable parameters $\Theta$. In the feature space, we regard $x_i$ as a central feature and $\{x_j : (i, j) \in \mathcal{E}\}$ as its neighborhood features. We combine the global information encoded by $x_i$ with its local neighborhood characteristics, captured by $x_j - x_i$. Thus, the edge feature function is formulated as
$$e_{ijm} = h\big(\theta_m \cdot (x_j - x_i) + \phi_m \cdot x_i\big),$$
where $\Theta = (\theta_1, \ldots, \theta_M, \phi_1, \ldots, \phi_M)$ ($M$ is the number of filters) parameterize $h_{\Theta}$, $\cdot$ represents the inner product, and the ReLU function is selected as $h$. This function can be implemented with the convolution operator. For each $x_i$, the k-NN graph is built by computing a pairwise distance matrix and then taking the $k$ closest feature maps. The pairwise distance matrix is calculated based on the Euclidean distance, which is typically used for measuring semantic similarity. In addition, we adopt a channel-wise aggregation function, i.e., MAX, to summarize the edge features, as it can capture the most salient features. The output of the SCC module at the $i$-th vertex is then produced by
$$x'_{im} = \max_{j : (i, j) \in \mathcal{E}} e_{ijm}.$$
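The three steps of the SCC operation (k-NN graph over flattened feature maps, edge features that mix a local difference term with a global term, and MAX aggregation) can be sketched in NumPy as below. This is an illustrative sketch rather than the released implementation: the function names, the array shapes, the random parameters, and the choice to exclude a feature map from its own neighbor list are all assumptions.

```python
import numpy as np

def knn_indices(x, k):
    """x: (C, D) flattened feature maps. Returns (C, k) neighbour indices.

    Assumption: a feature map is not counted as its own neighbour.
    """
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # (C, C)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def scc_layer(x, theta, phi, k):
    """Edge convolution over channels followed by MAX aggregation.

    x: (C, D), theta and phi: (M, D). Returns (C, M).
    """
    idx = knn_indices(x, k)                       # (C, k)
    local = x[idx] - x[:, None, :]                # (C, k, D): neighbour minus centre
    global_term = (x @ phi.T)[:, None, :]         # (C, 1, M): centre feature
    edge = np.maximum(local @ theta.T + global_term, 0.0)  # ReLU, (C, k, M)
    return edge.max(axis=1)                       # MAX over the k edges

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(16, 8, 8))        # 16 channels of size 8x8
x = feature_maps.reshape(16, -1)                  # flatten each map to length 64
theta = rng.normal(size=(32, 64)) * 0.1           # 32 hypothetical filters
phi = rng.normal(size=(32, 64)) * 0.1
out = scc_layer(x, theta, phi, k=4)
```

Here each of the 16 channels is treated as one graph node, and the output holds one aggregated response per node and filter.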
Dynamic Graph Update
In our approach, the dynamic graph convolutions are performed on both low- and high-resolution feature maps, aiming to capture the high-order AU interactions. Specifically, at the $l$-th layer, a different graph $\mathcal{G}^{(l)}$ is constructed by computing the $k$ closest feature maps for each single feature map. The adjacency matrix for $\mathcal{G}^{(l)}$ is a pairwise distance matrix in the feature space, which represents the relationship among feature maps. Each row stands for a feature map, and each entry in the matrix indicates the relationship between the two corresponding feature maps. The SCC module is flexible since it can take features of arbitrary size as input. It is necessary to keep the graph information updated when new combinations of AU intensity levels occur. Therefore, rather than using a fixed graph, our model learns how to construct the graph in each layer during the inference stage.
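The idea of rebuilding the graph at every layer can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: the deconvolutional layers that sit between SCC modules are omitted, the parameters are random, and the names `knn_graph`, `edge_conv`, and `dynamic_forward` are hypothetical.

```python
import numpy as np

def knn_graph(x, k):
    """k nearest rows of x (C, D) by Euclidean distance, excluding self."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def edge_conv(x, idx, theta, phi):
    """ReLU edge features on the given graph, aggregated with MAX."""
    local = x[idx] - x[:, None, :]
    edge = np.maximum(local @ theta.T + (x @ phi.T)[:, None, :], 0.0)
    return edge.max(axis=1)

def dynamic_forward(x, params, k):
    """Apply several layers, recomputing the k-NN graph before each one."""
    graphs = []
    for theta, phi in params:
        idx = knn_graph(x, k)        # graph built from the current features
        graphs.append(idx)
        x = edge_conv(x, idx, theta, phi)
    return x, graphs

rng = np.random.default_rng(1)
x0 = rng.normal(size=(12, 16))       # 12 feature maps, each flattened to 16
params = [
    (rng.normal(size=(24, 16)), rng.normal(size=(24, 16))),
    (rng.normal(size=(32, 24)), rng.normal(size=(32, 24))),
]
out, graphs = dynamic_forward(x0, params, k=3)
```

Because the neighbor indices are computed from the features entering each layer, the graph is a function of the input rather than a fixed structure.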
Overall, the SCC module can be integrated into multiple convolutional layers, and learn to semantically group similar feature channels that would be activated together for a specific co-occurrence pattern of AU intensities.
Correspondence with AU Heatmaps
Generally, our model benefits from the relationship modeling ability of the SCC module, as well as the spatial representation capability of heatmaps. Since AUs can be combined at different intensities to create a vast array of co-occurrences, we would like to explore which specific visual pattern each channel encodes. As Figure 3 shows, the last SCC module is followed by a convolutional layer, where the shape of each learned filter is determined by the number of feature channels and the number of output maps. Thus, for each feature channel, there are weights corresponding to each of the output maps. The predicted heatmap for a specific AU is then computed by applying the filter bank for that AU to the set of feature maps generated by the last SCC layer, using the tensor product.
Hence, the feature map set from the last SCC layer is better able to explain the AU co-occurrence patterns via the directed convolution, capturing the global and evident relations among AUs. Each feature map in this set is expected to represent a particular visual pattern, and the corresponding weight measures its association with the predicted heatmap of a given AU. The weight indicates the probability of the pattern being activated, and a larger weight leads to higher intensities of the AUs in the pattern. In this way, the final AU co-occurring pattern can be reflected by activating a set of different visual patterns. The additional visualization analysis in the following experiments provides a more intuitive understanding of the visual patterns.
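The relation between feature channels and output heatmaps can be illustrated with a small NumPy sketch. It assumes, for simplicity, that the final convolution has a 1x1 spatial extent, so each heatmap is a weighted sum of channels; the spatial size of the actual filters is not stated here, and all names and shapes below are hypothetical.

```python
import numpy as np

def channels_to_heatmaps(features, weights):
    """Weighted sum of channels per output map.

    features: (C, H, W) feature maps from the last SCC layer.
    weights:  (N, C), one weight per (output map, channel) pair.
    Returns:  (N, H, W) predicted heatmaps.
    Assumption: the final convolution is 1x1.
    """
    return np.einsum("nc,chw->nhw", weights, features)

rng = np.random.default_rng(0)
features = rng.random((6, 8, 8))     # 6 hypothetical feature channels
weights = rng.random((3, 6))         # 3 hypothetical output maps
heatmaps = channels_to_heatmaps(features, weights)

# Channel that contributes most strongly to the first output map.
dominant_channel = int(np.argmax(weights[0]))
```

Under this assumption, inspecting a row of `weights` shows which channels (visual patterns) are most strongly associated with a given output heatmap.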
In this study, we conducted experiments on the BP4D (zhang2014bp4d, zhang2014bp4d) and DISFA (mavadati2013disfa, mavadati2013disfa) datasets, which provide per-frame AU intensity annotations and have been widely used for AU intensity estimation tasks. BP4D contains 328 facial videos from 41 subjects who were involved in 8 emotion-related elicitation experiments. It provides intensity labels for five AUs: AU6, AU10, AU12, AU14, and AU17, which are quantified into six discrete levels from 0 (absence) to 5 (maximum). In our experiments, we adopted the official training/development sets for evaluation. The DISFA database consists of 27 videos, where each frame is coded with the AU intensity on a six-point discrete scale. There are 12 AUs (1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, and 26) annotated by the expert FACS coder. We evaluated the model using the 3-fold subject independent cross-validation protocol, i.e., 18 subjects for training and 9 subjects for testing. The distributions of the intensity levels are reflected in Figure 4. It is worth noting that they are extremely imbalanced in the six levels.
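A subject-independent 3-fold split of 27 subjects into 18 for training and 9 for testing can be sketched as below. The random assignment of subjects to folds and the function name `subject_folds` are assumptions for illustration; the actual fold assignment used in the experiments is not specified here.

```python
import numpy as np

def subject_folds(subject_ids, n_folds=3, seed=0):
    """Yield (train_subjects, test_subjects) with no subject in both sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(subject_ids)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, test

# 27 hypothetical subject identifiers, as in DISFA.
splits = list(subject_folds(np.arange(27)))
```

All frames of a given subject then go to the side of the split that the subject belongs to, so no identity is shared between training and testing.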
The training and testing processes were performed on NVIDIA GeForce GTX 1080Ti 11G GPUs using the open-source TensorFlow (abadi2016tensorflow, abadi2016tensorflow) framework. The backbone network was initialized from ResNet-50 (xiao2018simple, xiao2018simple). For the k-NN graph implemented in the SCC layer, the number of neighbors was set separately for the BP4D and DISFA databases, with both values determined by grid search. The dlib library (dlib.net) was utilized to locate the 68 facial landmarks for defining AU locations. The face images were then cropped and resized to a fixed size. During the training phase, we employed the Adam optimizer (kingma2014adam, kingma2014adam), with an initial learning rate of 5e-4 and a batch size of 32.
To compare the performance with the state-of-the-art approaches, we use the Intra-class Correlation (shrout1979intraclass, shrout1979intraclass) (ICC(3,1)) and the Mean Absolute Error (MAE) for evaluation. Statistically, ICC measures the agreement between $k$ judges who rate $n$ targets of data. In our case, $k=2$ means that one judge represents the AU intensity labels and the other stands for the predicted values, and $n$ denotes the total number of testing examples. We also report MAE, which is widely used for measuring regression performance.
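Both metrics can be computed with a few lines of NumPy. The sketch below uses the standard two-way ANOVA form of ICC(3,1), (BMS - EMS) / (BMS + (k - 1) EMS), with the labels and predictions treated as two judges; the function names and the toy data are hypothetical.

```python
import numpy as np

def icc_3_1(labels, preds):
    """ICC(3,1) with two judges: ground-truth labels and predictions."""
    ratings = np.stack([labels, preds], axis=1).astype(float)  # (n, k=2)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # between targets
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # between judges
    ss_err = ss_total - ss_rows - ss_cols                      # residual
    bms = ss_rows / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)

def mae(labels, preds):
    """Mean absolute error between labels and predictions."""
    return float(np.mean(np.abs(np.asarray(labels) - np.asarray(preds))))

labels = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
shifted = labels + 1.0               # predictions with a constant offset
icc_same = icc_3_1(labels, labels)
icc_shifted = icc_3_1(labels, shifted)
mae_shifted = mae(labels, shifted)
```

The toy data highlight why both metrics are reported: ICC(3,1) measures consistency and is insensitive to a constant offset between judges, whereas MAE penalizes that offset directly.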
Comparison with the State of the Art.
We compared our model, referred to as SCC-Heatmap, to deep learning-based methods that leverage relationship modeling (CCNN-IT (walecki2017deep, walecki2017deep), KBSS (zhang2018weakly, zhang2018weakly), BORMIR (zhang2018bilateral, zhang2018bilateral)) and the recent work KJRE (zhang2019joint, zhang2019joint).
Additionally, since our method is based on heatmap regression, we also directly applied two state-of-the-art regression models, Hourglass (newell2016stacked, newell2016stacked) and ResNet-Deconv (xiao2018simple, xiao2018simple), to AU intensity estimation. Table 1 shows the comparative results for the above-mentioned methods evaluated on the BP4D and DISFA datasets. The heatmap regression-based methods were evaluated in the same settings using the source code provided by the authors. The results of the other methods are taken from their corresponding papers. From Table 1, we observe that the proposed SCC-Heatmap achieves higher ICC and lower MAE in most cases. Specifically, the average ICC of our baseline version (ResNet-Deconv) is higher than that of Hourglass on the BP4D database, which shows the advantage of ResNet-Deconv in heatmap regression for AU intensity estimation. SCC-Heatmap further improves performance on both the BP4D and DISFA datasets, which suggests that relationship modeling is crucial. Moreover, compared to other approaches including CCNN-IT, KBSS, BORMIR, and KJRE, the proposed SCC-Heatmap, which considers channel-wise feature map relationships, shows an increase in average ICC on both the BP4D and DISFA databases. The better performance further supports the benefit of dynamic graph convolutions in capturing AU semantic relationships over other related approaches. To the best of our knowledge, the proposed model achieves the best performance with the highest average ICC, as well as the lowest average MAE, on the BP4D database.
Individual Subject Performance
We provide qualitative results in order to further validate the effectiveness of the proposed method. Specifically, given a testing image, the network generates a set of heatmaps, as illustrated in Figure 3. Two examples of the predicted heatmaps are visualized in Figure 5. We can see that the heatmaps are produced according to the predicted AU locations and their intensities. The intensity of each AU is then determined by the maximum value of its corresponding heatmap. Following this pipeline, we plot the estimated AU intensities of two subjects from the DISFA dataset, which provides per-frame labels, and compare them to the corresponding ground truth, with the ICC measure given in each plot title. It can be observed that the estimated curve is smooth, stable, and close to the ground truth.
To verify the effectiveness of the semantic correspondence convolutions, we compared the overall performance of the proposed SCC-Heatmap to a model that does not consider relationships, obtained by removing all the SCC layers. We also investigated the effect of the heatmap resolution by varying the number of deconvolutional layers, which determines the size of the final output heatmap. Ablation studies were conducted on the DISFA database in four different settings while keeping all other settings the same. As shown in Table 2, compared to setting (c), setting (a) achieves a higher average ICC and a lower average MAE. Thus, three deconvolutional layers were used. Settings (b) and (d) show that discarding the SCC module gives a decrease in both ICC and MAE, which suggests that relationship modeling is beneficial in improving the performance of AU intensity estimation.
We further analyzed the effectiveness of the edge features for constructing the relationship graph. In Equation 4, two groups of parameters are used to parameterize the edge features, one applied to the local neighborhood differences and one applied to the central feature map, so as to take into account the local characteristics of feature maps while keeping the global information. As illustrated in Table 3, setting (e) denotes that both groups are considered, whereas setting (f) and setting (g) discard the local and global information, respectively. As expected, setting (e) leads to a higher ICC value and a lower MAE value, which demonstrates that both the global structure and the local neighborhood information are important for modeling the semantic relationships among feature maps.
To search for an optimal value, we compared the results with different numbers of nearest neighbors $k$, as shown in Figure 6. Generally, given that both the ICC and MAE show only small fluctuations, the performance is not overly sensitive to $k$. Notably, a larger $k$ gives slightly worse performance. We infer that a larger $k$ might fail to group feature maps appropriately with the Euclidean distance.
As stated in the previous section, the AU co-occurring pattern is reflected by activating the corresponding feature channels, where each channel encodes a specific visual pattern of AU. To this end, we designed experiments to discover the hidden structure that governs the co-occurrence patterns of AU intensity levels. In Figure 3, through the directed convolution operation, the feature maps from the last SCC layer can better represent the patterns of AUs. For the model trained on the DISFA database, the weights of the first two feature channels in the last SCC layer are visualized in Figure 7. The first feature channel displays the pattern in which the face presents “Cheek raiser”, “Lip corner puller”, “Lips part” and “Jaw drop” simultaneously, providing an indication of a positive emotion, e.g., happiness. The second feature channel reflects the pattern in which “Inner brow raiser” and “Outer brow raiser” are likely to occur simultaneously, which suggests emotions such as anger or surprise. Then, we visualize the responses of the first two feature channels using two samples from the DISFA database. It can be observed that one face image activates the first feature channel, whereas the other one activates the second feature channel. Moreover, a larger weight leads to a higher response at the location of the specific AU, which indicates a higher probability of occurrence. Therefore, different feature channels represent different visual patterns, some of which might be activated together to form the final AU co-occurring pattern.
We also compared the predicted heatmaps of SCC-Heatmap and of the model with all the SCC modules removed in Figure 8. Given an input image from the BP4D dataset, both (a) and (b) predict a high intensity for AU12, while for AU6 and AU10, the refined heatmaps in (a) give much better predictions. We attribute this to the SCC module, where complementary features are obtained from different channels. For the model with semantic correspondences considered, we infer that the feature channel associated with AU12 might enhance the response of the feature channel associated with AU6, which supports the effectiveness of the semantic correspondence in modeling the co-occurrence relationships of AU intensities.
Conclusions and Future Work
In this work, we studied AU intensity estimation from a new perspective by employing the dynamic graph convolutions to capture correlations between neighboring feature maps, thereby modeling the semantic relationships of AU intensity levels. Particularly, the proposed framework keeps updating the relationship graph layer by layer, which can enrich the representation power of AU co-occurrence patterns. In the future, we would like to apply the framework for other tasks, especially those under unsupervised settings. Currently, we choose the k-NN graph to model the relationships. Although it is reasonable for grouping similar feature maps and works well in practice, we plan to investigate alternative approaches in constructing the relationship graph.
[1] (2018) Video-based emotion recognition using deeply-supervised neural networks. In Proceedings of the International Conference on Multimodal Interaction, pp. 584–588.
[2] (2020) Facial expression recognition with deeply-supervised attention network. IEEE Transactions on Affective Computing.
[3] (2020) XPersona: evaluating multilingual personalized chatbot. arXiv preprint arXiv:2003.07568.
[4] (2019) CAiRE: an end-to-end empathetic chatbot. arXiv preprint arXiv:1907.12108.