Facial expression analysis is of great interest to researchers in computer vision, computer graphics, and psychology. Automated facial expression analysis from a single image enables numerous applications such as human-robot interaction and behavioral and psychological research. Facial Action Unit (AU) intensity estimation is one of the main building blocks of single-image facial expression analysis; it aims to estimate anatomically meaningful parameters (i.e., muscle movements) rather than the PCA parameters of a 3D morphable model. The seminal work of Ekman and Friesen developed the Facial Action Coding System (FACS) for describing facial expressions, which are produced by the movements of facial muscles under the skin. Nearly any anatomically possible facial expression can be coded as a combination of AUs.
Existing AU intensity estimation methods are broadly divided into fully supervised methods [1, 26, 54, 3] and weakly supervised methods [53, 47, 50]. Training a fully supervised model requires a large set of labeled samples. However, AU annotation requires strong domain expertise, so constructing a large database is very costly in time and labor. Due to this demanding labeling process, datasets in the literature (e.g., CK+, MMI, AM-FED, DISFA, BP4D) are restricted in the number of coded AUs, samples, and subjects. Weakly supervised methods exploit various types of human knowledge to cope with limited annotations. Nevertheless, they still require thousands of AU annotations, and the human-defined knowledge restricts the space of anatomically plausible AUs. Moreover, automatic AU intensity estimation faces many challenges, so its performance is often unsatisfactory. These challenges include large variations in pose, identity, illumination, and occlusion in unconstrained environments. Among them, pose and identity are the essential challenges because they can dramatically increase the variance of face images across people even when they show the same expression.
Considering the above challenges, we propose a framework for unsupervised estimation of facial action unit intensity from a single image, which jointly estimates the facial parameters (i.e., head pose, AU parameters, and identity parameters) without annotated AU data. The core of our framework is differentiable optimization, which iteratively updates the facial parameters to match the input image. It is realized with a novel network architecture called GE-Net. GE-Net consists of two modules: a generator, which learns to "render" a face image from a set of facial parameters in a differentiable way, and a feature extractor, which extracts deep features for measuring the similarity between the rendered image and the input real image. After the two modules are trained, we integrate them into our framework, and the facial parameters are fitted by minimizing the loss between the features extracted from the rendered image and those of the input real one.
Specifically, our generator is trained on a 3D parametric face model called Justice Face from the online game Justice (https://n.163.com/). The model parameters have explicit physical meanings, including AU parameters and identity parameters. The AU parameters can express all the AUs from FACS, and the identity parameters describe identity-specific face shapes by controlling the attributes of facial components, including position, orientation, and scale, as introduced in prior work. We adopt a face semantic segmentation network as our feature extractor, whose intermediate feature maps focus more on facial shape content than on raw appearance. Hence it effectively reduces the large domain gap between the rendered image and the real one (see Sec. 4.3 for details).
Our method can be used in many applications, one of which is expression transfer: we can exchange the AU parameters of different individuals, as shown in Fig. 1. With image A or B as input, our framework can predict their facial parameters. After re-combining the head pose, AU parameters, and identity parameters, we can transfer the pose and expression from B to A while keeping A's identity.
Our main contributions are summarized as follows. To the best of our knowledge, we are the first to propose a deep learning framework for unsupervised estimation of facial AU intensity from a single image, without requiring annotated AU data. Moreover, the novel framework enables differentiable optimization: it consists of a generator performing differentiable rendering and a feature extractor measuring the similarity between the rendered image and the real image. Extensive experiments show that the proposed method achieves state-of-the-art results on AU intensity estimation. Meanwhile, our method can activate more AUs than existing AU-annotated databases cover, such as jaw left and upper lip raiser.
2 Related Work
2.1 Face Reconstruction Methods based on 3DMM
Over the past years, many methods [58, 56, 14, 55] have described a 3D facial shape using the parameters of the 3DMM, which is constructed with PCA. Early methods [57, 5] use a regressor to directly regress facial landmark locations from a single image for fitting 3DMM parameters. Recently, approaches that regress 3DMM parameters using CNNs and fit the 3DMM to 2D images have become popular. Jourabloo et al. use cascaded CNN regressors to regress the 3DMM parameters (identity, expression, and pose). Bindita et al. present a single end-to-end network to jointly predict bounding box locations and 3DMM parameters for multiple faces. Zhu et al. [56, 59] perform multiple iterations of a single CNN to fit the 3DMM parameters. Different from these 3DMM-based methods, our goal is to predict facial AU intensities with explicit anatomical meanings, whereas 3DMM parameters lack such physical meanings.
2.2 AU Intensity Estimation Methods
Early works use traditional machine learning frameworks to estimate AU intensity. Recently, some works apply CNNs to AU intensity estimation. Tadas et al. present a facial action unit intensity estimation system based on geometry features (shape parameters and landmark locations) and appearance (Histograms of Oriented Gradients); median-based feature normalization is used to account for a person-specific neutral expression. Li et al. [26, 25] design different CNNs to regress AU parameters, where the CNNs focus on different facial regions independently. Batista et al. and Zhou et al. apply DNNs to estimate AU intensity under multiple head poses. However, these supervised methods require a large number of training samples, while AU annotation needs strong domain expertise, so constructing such an extensive database takes great effort. Besides, limited by the number of annotated samples, the above methods usually suffer from over-fitting. Due to the lack of AU annotation datasets, several works [52, 47, 50] use a weakly supervised training paradigm to train deep models with incomplete or inaccurate annotations, where prior knowledge and domain knowledge are involved. However, these methods still require thousands of annotations, and the added prior knowledge restricts the space of anatomically plausible AUs. In contrast, our method can estimate more AUs than existing AU-annotated databases cover, without requiring any annotated data.
2.3 Neural Rendering
Recent works on differentiable rendering achieve differentiability in various ways. Kato et al. achieve differentiability with an approximate gradient for the rasterization operation. GQN demonstrates a powerful neural renderer, which learns to represent synthetic scenes from input images taken from different viewpoints, but the geometry of the objects in its scenes is simple. MOFA learns a decoder using an expert-designed generative model as the renderer; such an expert-designed model struggles to explicitly describe the complicated rendering process with its illumination conditions and complex shading. Nguyen-Phuoc et al. propose a CNN architecture to render 3D objects from a 3D voxel grid. Shi et al. learn a generator as a neural renderer with fixed camera parameters. Inspired by this, we extend the generator to imitate the face rendering process from meaningful facial parameters, including identity, expression, and head pose.
Our goal is to estimate the intensity of facial AUs as well as the face pose and identity from a single image via differentiable optimization, without requiring any labeled data.
The optimization framework GE-Net consists of a generator and a feature extractor.
The generator imitates the rendering process from the facial parameters (i.e., head pose, AU parameters, and identity parameters) to a rendered image, while the feature extractor extracts deep features from images, which are used to measure the similarity between the real input image and the rendered image for fitting the meaningful parameters.
The generator is trained with the help of a 3D parametric face model called Justice Face from the game Justice. The model parameters have explicit physical meanings, including AU parameters and identity parameters.
A complete process of our method is summarized as follows:
Stage 1. Train the generator and the feature extractor.
Stage 2. Fix the parameters of the generator and the feature extractor, and iteratively update the facial parameters according to the loss between the input image and the generated one. Empirically, the facial parameters closely describe the input image after about 30 iterations.
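The Stage-2 fitting can be illustrated with a toy example. Here the trained generator is replaced by a fixed linear map `W` and the feature extractor by the identity, so the gradient is available in closed form; these stand-ins are assumptions for illustration only, not the paper's networks.

```python
import numpy as np

# Toy illustration of Stage 2: the trained generator is replaced by a fixed
# linear map W and the feature extractor by the identity (assumed stand-ins).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))            # "renders" 16 pixels from 8 facial parameters

def generator(x):
    return W @ x

def features(img):
    return img                           # a real extractor would be a segmentation CNN

target_params = rng.uniform(size=8)      # unknown parameters of the "real image"
target_img = generator(target_params)

x = np.full(8, 0.5)                      # neutral initialization
lr = 0.005
for _ in range(30):                      # the paper reports convergence in ~30 iterations
    residual = features(generator(x)) - features(target_img)
    grad = 2.0 * W.T @ residual          # analytic gradient of the squared feature loss
    x = np.clip(x - lr * grad, 0.0, 1.0) # parameters are clipped to [0, 1]

final_loss = np.sum((generator(x) - target_img) ** 2)
```

In the actual framework the gradient is obtained by back-propagation through the trained networks rather than analytically.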
3.1 3D Face Model
The parametric face model Justice Face contains identity parameters and AU parameters. The identity parameters control the attributes of facial components, including position, orientation, and scale, to create a neutral face for a specific identity, as recommended in prior work. The AU parameters control the expression by driving facial muscle movements relative to the neutral face. Specifically, these AU parameters include eyes closed, upper lid raiser, cheek raiser, inner brow raiser, outer brow raiser, brow lowerer, jaw open, nose wrinkler, upper lip raiser, lower lip depressor, lip corner puller, mouth press, lip pucker, upper lip close, lower lip close, cheek puff, lip corner depressor, jaw left, and jaw right. The correspondence between AUs and our AU parameters is shown in Table 2 of the appendix. We can adjust the AU and identity parameters to generate an identity-specific expression, which is formulated below:
$$M = \bar{M} + \sum_{i} \beta_i B_i + \sum_{j} \alpha_j A_j,$$
where $M$ is a synthesized 3D face and $\bar{M}$ is the base face. $B_i$ are the facial component offsets to the base face in position, orientation, and scale, and the identity parameters $\beta_i$ are the weights of these offsets. $A_j$ are the offsets of facial muscle movements relative to the neutral face. The AU parameters $\alpha_j$ are the corresponding weights and represent the facial AU intensities.
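The linear combination described above can be sketched as follows; all dimensions and offset bases here are illustrative stand-ins, not the actual Justice Face assets.

```python
import numpy as np

# Minimal sketch of the parametric face model; shapes and offsets are
# illustrative stand-ins only.
n_verts = 100
rng = np.random.default_rng(1)

base_face = rng.normal(size=3 * n_verts)           # flattened base-face vertices
id_offsets = rng.normal(size=(208, 3 * n_verts))   # identity offsets (position/orientation/scale)
au_offsets = rng.normal(size=(19, 3 * n_verts))    # per-AU muscle-movement offsets

def synthesize(identity_params, au_params):
    """Base face plus weighted identity offsets plus weighted AU offsets."""
    return base_face + identity_params @ id_offsets + au_params @ au_offsets

neutral = synthesize(np.full(208, 0.5), np.zeros(19))   # zero AU weights -> neutral face
one_au = np.zeros(19)
one_au[10] = 0.8                                        # one AU at intensity 0.8
expressive = synthesize(np.full(208, 0.5), one_au)
```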
The generator learns to map a set of meaningful facial parameters to a rendered image based on Justice Face.
The input is a 269-dimensional vector comprising head pose, AU parameters, and identity parameters. The ground truth is the corresponding image rendered by Justice Face, and the prediction is the generated image. Our generator consists of eight transposed convolution layers, similar to the network of DCGAN. We use instance normalization instead of batch normalization in the generator to improve training stability. The generator is differentiable, so the facial parameters can be updated by gradient descent.
Fig. 3(a) shows the architecture of our generator. To make the generated images indistinguishable from the rendered images, we adopt a content loss and a perceptual loss between the rendered image $y$ and the predicted one $\hat{y}$. The content loss penalizes large changes in the raw pixel space, and we use the $\ell_1$ rather than the $\ell_2$ loss to encourage less blurring. The content loss is defined as:
$$\mathcal{L}_{content} = \mathbb{E}_{x \sim U}\big[\,\|\hat{y}(x) - y(x)\|_1\,\big],$$
where $U$ means the uniform distribution: the input parameters $x$ are sampled from a multidimensional joint uniform distribution.
The perceptual loss is added to explicitly constrain the generated result to have the same content information as the rendered one, and it is defined as:
$$\mathcal{L}_{perc} = \mathbb{E}_{x \sim U}\Big[\sum_{k}\big\|\phi_k(\hat{y}(x)) - \phi_k(y(x))\big\|_1\Big],$$
where $\phi_k$ denotes the relu2_2, relu3_3, and relu4_3 feature maps of VGG16, pre-trained on the image recognition task. In this way, we can train the generator by optimizing the following loss function:
$$\mathcal{L}_{G} = \mathcal{L}_{content} + \lambda\,\mathcal{L}_{perc},$$
where $\lambda$ balances the two objectives.
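The combined training objective can be sketched as follows. The VGG16 feature maps (relu2_2/relu3_3/relu4_3) are replaced by a stand-in average-pooling pyramid, and the balancing weight is an assumed value:

```python
import numpy as np

# Sketch of the generator's training loss on one sample; the VGG16 feature maps
# are replaced by a stand-in average-pooling pyramid (an assumption for
# illustration), and lambda is an assumed weight.
rng = np.random.default_rng(2)
rendered = rng.uniform(size=(64, 64))                    # ground-truth rendered image
generated = rendered + 0.05 * rng.normal(size=(64, 64))  # imperfect generator output

def stand_in_features(img):
    # three progressively downsampled "feature maps" via average pooling
    maps = []
    for k in (2, 4, 8):
        h, w = img.shape[0] // k, img.shape[1] // k
        maps.append(img[:h * k, :w * k].reshape(h, k, w, k).mean(axis=(1, 3)))
    return maps

content_loss = np.abs(generated - rendered).mean()       # l1 content loss
perceptual_loss = sum(np.abs(a - b).mean()
                      for a, b in zip(stand_in_features(generated),
                                      stand_in_features(rendered)))
lam = 1.0                                                # assumed balancing weight
total_loss = content_loss + lam * perceptual_loss
```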
During training, all the rendered images are aligned via five facial landmarks (left eye, right eye, nose, left mouth corner, and right mouth corner) using an affine transformation; the landmarks are obtained by OpenFace [2, 45]. The generator uses the same convolution kernel size after every transposed convolution layer, except for its output layers.
The generator is trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 64. All weights are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The learning rate is 0.001, with a decay of 1% every three epochs. After 500 epochs of training, the generator has learned to map each facial parameter to its corresponding component/muscle movement and to "render" a visually plausible image.
3.3 Feature Extractor
As there is a large domain gap between the generated image and the input real image, to measure their similarity effectively, a face segmentation network is trained as the feature extractor (Fig. 3(b)). The similarity is computed on the feature maps of the first four layers of the network. Specifically, an attention mechanism is deployed in the feature extractor. It allows the feature maps to attend to different parts of the face, so that different feature maps contribute differently to the loss. The output probability maps of the facial segmentation network serve as attention masks, which act as the pixel-wise weights in Eqn. (5). Specifically, the probability map of each facial local region is used as an attention mask for the corresponding feature map. For example, the feature map of the first layer is sensitive to the eyes, so the eyes' probability map is element-wise multiplied with the feature map of the first layer. Suppose $f_k(I)$ represents the $k$-th feature map of an input image $I$; the extracted features can be defined as:
$$\Phi(I) = \{\, w_k \odot f_k(I) \,\}_{k=1}^{K},$$
where $I$ is the input of the feature extractor, $K$ feature maps are extracted, and $w_k$ represents the weight of each feature map $f_k$. We set a different attention mask for each feature map using the probability maps of the facial segmentation result.
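The attention-weighted feature comparison can be sketched as follows, with random arrays standing in for the extractor's feature maps and the segmentation probability maps:

```python
import numpy as np

# Sketch of the attention-weighted feature comparison: each feature map is
# element-wise weighted by a segmentation probability map. All arrays here are
# random stand-ins for the extractor's outputs.
rng = np.random.default_rng(3)
feat_real = [rng.normal(size=(32, 32)) for _ in range(4)]   # 4 layers, real image
feat_gen = [rng.normal(size=(32, 32)) for _ in range(4)]    # 4 layers, generated image
masks = [rng.uniform(size=(32, 32)) for _ in range(4)]      # region probability maps

def weighted_feature_loss(fa, fb):
    return sum(np.abs(m * (a - b)).mean() for m, a, b in zip(masks, fa, fb))

loss = weighted_feature_loss(feat_real, feat_gen)
```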
The architecture of the feature extractor network is based on ResNet-50. We set strides of [2, 1, 1, 1] in the four sub-layers, remove the fully connected layers, and change the average pooling layer to a convolution layer. The output is 1/8 the size of the input; we further up-sample by a factor of 8 to obtain an output of the same size as the input. The network is pre-trained on ImageNet and fine-tuned on the Helen Dataset with the pixel-wise cross-entropy loss. The learning rate is set to 0.001.
After the generator and the feature extractor are trained, we fix their parameters and iteratively optimize the facial parameters (i.e., head pose, AU parameters, and identity parameters) to match the input real image. To fit the facial parameters to a real face image with expression, the generator and the feature extractor serve as a bridge: we optimize the facial parameters by measuring the similarity between the features extracted from the input real image and those from the generated image. The extracted features drive the generator to produce faces that match the real image closely in the feature space.
Fig. 2 shows the pipeline of our proposed method. The meaningful facial parameters can be updated using gradients because the renderer is differentiable. We update the facial parameters by back-propagating the loss on the extracted features, which explicitly constrains the "rendered" result to have similar facial shape content to the real image. The loss is defined as:
$$\mathcal{L}_{fit}(x) = \sum_{k} \big\| w_k \odot \big( f_k(I) - f_k(G(x)) \big) \big\|_1,$$
where $G$ denotes the generator, $f_k$ the feature maps of the extractor with attention weights $w_k$, and $I$ the input real image, aligned with the base face using five facial landmarks. The parameters are updated by gradient descent:
$$x \leftarrow x - \eta \,\nabla_x \mathcal{L}_{fit}(x),$$
where $\eta$ represents the learning rate. The range of $x$ is clipped to [0, 1].
However, the identity parameters span a large 3D face space, in which some points do not exist in the real world. To solve this problem, we apply a facial constraint. Instead of directly predicting identity parameters in the high-dimensional space, we perform a transform to obtain an intrinsic low-dimensional subspace of the identity parameters. We collect 20K expressionless face images based on the Basel Face Model and obtain a set of identity parameters by fitting them with our GE-Net with all AU parameters set to zero.
Suppose $\mu$ is the mean of the identity parameters and $P$ is the projection matrix, which can be obtained by performing singular value decomposition on the covariance matrix of the identity parameters. The projection is defined as:
$$z = P(\beta - \mu),$$
where $z$ denotes the identity parameters in the low-dimensional subspace. Each of the parameters can be normalized to [0, 1] according to the 20K face images in the subspace. $z$ is back-projected to the high-dimensional space by multiplying by the transpose of the projection matrix; the final predicted identity parameters are reconstructed as:
$$\hat{\beta} = \mu + P^{\top} z.$$
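The facial constraint amounts to a standard PCA projection and back-projection; a minimal sketch, in which the fitted identity vectors and the subspace dimension are assumed stand-ins (the paper collects 20K samples):

```python
import numpy as np

# Sketch of the facial constraint via PCA; the sample matrix and the subspace
# dimension k are assumed stand-ins.
rng = np.random.default_rng(4)
samples = rng.uniform(size=(2000, 208))          # stand-in fitted identity vectors

mean = samples.mean(axis=0)
# principal directions from the SVD of the centered data
_, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
P = vt[:50]                                      # projection matrix (assumed k = 50)

def constrain(identity_params):
    z = P @ (identity_params - mean)             # project to the low-dimensional subspace
    return mean + P.T @ z                        # back-project to the original space

x = rng.uniform(size=208)
x_constrained = constrain(x)
```

Since the rows of `P` are orthonormal, applying the constraint twice leaves the result unchanged.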
Fig. 4 shows faces randomly generated with and without the facial constraint. Before the input image is fed into the feature extractor, it is aligned to the "base face" via five facial landmarks. The "base face" is created by rendering a frontal, emotionless face image with all identity parameters set to 0.5.
In the fitting process, we set different learning rates for the facial parameters. For the identity parameters, we use a single fixed learning rate. For the head pose and AU parameters, each parameter has a different impact on the generated facial expression, so we set a separate learning rate for each. Specifically, starting from the "base face", to compute the learning rate for one parameter, we set its value to the maximum while keeping the other parameters at zero, and obtain the corresponding segmentation output. We count the pixels that differ from the segmentation output of the "base face" and set the learning rate to the inverse of the ratio of differing pixels in the whole image. The optimization converges well within 30 iterations in our experiments.
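The per-parameter learning-rate heuristic can be sketched as follows; the segmentation maps here are random stand-ins for the network's outputs:

```python
import numpy as np

# Sketch of the learning-rate heuristic: the learning rate for a pose/AU
# parameter is the inverse of the fraction of segmentation pixels it changes
# at its maximum value. The segmentation maps are stand-ins.
rng = np.random.default_rng(5)
base_seg = rng.integers(0, 5, size=(64, 64))     # segmentation of the "base face"

def learning_rate_for(changed_seg):
    diff_ratio = np.mean(changed_seg != base_seg)  # fraction of differing pixels
    return 1.0 / diff_ratio                        # inverse ratio as the learning rate

seg = base_seg.copy()
seg[48:56, 24:40] = 9          # pretend one parameter alters only the mouth region
lr = learning_rate_for(seg)    # 128 of 4096 pixels differ -> learning rate 32.0
```

A parameter with a subtle visual effect thus receives a proportionally larger step.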
Public Data. We adopt the Extended Cohn-Kanade database (CK+), the MMI database, FERA2015, DISFA, and FaceWarehouse to evaluate our method. The CK+ database contains 593 sequences from 123 subjects; approximately 15% of the sequences are coded at the peak frames by a certified FACS coder. The MMI database contains more than 2800 samples, including static images and image sequences in frontal and profile views. FERA2015 contains about 140,000 images from 41 subjects, with intensities annotated for 5 AUs (6, 10, 12, 14, 17). The DISFA database consists of 27 sequences from 27 subjects; around 130,000 frames are annotated with intensities for 12 AUs (1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26).
Rendered Data. We render 80,000 face images using Justice Face with random facial parameters. The facial parameter vector is 269-dimensional, comprising head pose, AU parameters, and identity parameters. Specifically, the head pose covers a range of pitch and yaw angles, which we normalize to [0, 1]. We consider head pose only in pitch and yaw because the images are aligned in roll, which is a simple 2D in-plane rotation. The initial head pose values should represent a frontal view, so they are set to 0.5. The identity parameters consist of 208-dimensional continuous parameters and 36-dimensional discrete parameters; the discrete parameters are encoded as one-hot vectors and concatenated with the continuous ones. All facial parameters range from 0 to 1. The face samples are generated by drawing all parameters from a uniform distribution. 90% of the samples are used for training the generator, and the rest for validating it.
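Sampling one rendered-data parameter vector can be sketched as follows; the AU dimensionality is assumed so that the components total 269, and a single one-hot group stands in for the 36-dimensional discrete identity encoding:

```python
import numpy as np

# Sketch of sampling one 269-dimensional facial parameter vector; the AU
# dimension and the one-hot grouping are assumptions for illustration.
rng = np.random.default_rng(6)

pose = rng.uniform(size=2)               # pitch and yaw, normalized to [0, 1]
au = rng.uniform(size=23)                # AU parameters (dimension assumed)
cont_id = rng.uniform(size=208)          # continuous identity parameters

disc_id = np.zeros(36)                   # discrete identity parameters, one-hot encoded
disc_id[rng.integers(0, 36)] = 1.0

params = np.concatenate([pose, au, cont_id, disc_id])
```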
4.2 Comparison with the state-of-the-art
Comparison with weakly and semi-supervised learning methods.
We compare the proposed method with several state-of-the-art weakly supervised learning methods (KBSS, KJRE, and CFLF) and semi-supervised learning methods (LBA and 2DC). LBA encourages cycle-consistent association chains from the embeddings of labeled samples to unlabeled samples, based on the assumption that samples with similar labels have similar latent features. 2DC combines a variational auto-encoder and a Gaussian Process for AU intensity estimation by jointly learning latent representations and classifying multiple ordinal outputs. KBSS exploits four types of domain knowledge on AUs to train a deep model. KJRE jointly learns the representation and the estimator with limited annotations by incorporating various types of human knowledge. CFLF uses a learnable task-related context and two types of attention mechanisms to estimate AU intensity. Note that KBSS, CFLF, and 2DC need labeled data to train their models. Unlike them, we propose a framework for facial AU intensity estimation without annotated AU data.
Intra-class correlation (ICC(3,1)) and mean absolute error (MAE) are adopted as evaluation metrics. The quantitative results are listed in Table 1, with more in the supplementary material. On the FERA2015 dataset, our method achieves superior average performance over the other methods on both metrics. On DISFA, our method achieves the best average performance on ICC and the second best on MAE. Note that ICC and MAE should be considered jointly when evaluating a method: LBA tends to predict an intensity of 0, the majority AU intensity, so it achieves good MAE performance, but its ICC is much worse than ours. Our method outperforms KBSS and CFLF on DISFA and FERA2015; during training, these two approaches need multiple frames as input, hence more than one image per face, which is not always available. Besides, the proposed GE-Net utilizes the attention mechanism in the feature extractor and achieves better results than GENet-O, which does not, demonstrating the effectiveness of the attention mechanism. We also notice that our method performs poorly on AU 12 in DISFA. This is because Justice Face restricts the expression of the lip corner puller, whose movement is smaller than on a real face so that the model can express elegant expressions. These results show the superior performance of the proposed method over existing weakly and semi-supervised learning methods.
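ICC(3,1) can be computed directly from the two-way ANOVA mean squares (Shrout and Fleiss); a minimal implementation, treating prediction and ground truth as two "raters" over the frames:

```python
import numpy as np

# Minimal ICC(3,1) (Shrout & Fleiss) for predicted vs. ground-truth intensities,
# treating the two as k = 2 raters over n frames.
def icc_3_1(pred, gt):
    x = np.stack([np.asarray(pred, float), np.asarray(gt, float)], axis=1)
    n, k = x.shape
    mean_t = x.mean(axis=1, keepdims=True)               # per-frame (target) means
    mean_r = x.mean(axis=0, keepdims=True)               # per-rater means
    grand = x.mean()
    bms = k * np.sum((mean_t - grand) ** 2) / (n - 1)    # between-target mean square
    ems = (np.sum((x - mean_t - mean_r + grand) ** 2)
           / ((n - 1) * (k - 1)))                        # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)

perfect = icc_3_1([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])      # exact agreement
```

Because ICC(3,1) measures consistency, a constant offset between prediction and ground truth does not lower the score.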
Comparison with supervised learning methods. We compare our method with several supervised AU intensity estimation methods, including OpenFace, ResNet18, CCNN-IT, and CNN. CNN is composed of three convolutional layers and a fully connected layer. CCNN-IT takes advantage of CNNs and data augmentation. ResNet18 is introduced in prior work. OpenFace is based on histograms of oriented gradients and landmark locations. These supervised methods require annotated AU intensities for each frame in the sequences, while ours does not. The results are shown in Table 2: our method achieves better performance on both metrics.
Qualitative AU intensity estimation results. We also show qualitative results, rendered using Justice Face with the estimated facial parameters. Fig. 5(a) provides several samples of the rendered images from different methods; our method achieves more accurate estimation than the state-of-the-art 3DMM-based methods. As shown in Fig. 5(b), the AU parameters of the inputs can be transferred to another 3D face model. Fig. 8 shows additional qualitative results on the aforementioned datasets and on internet images (more results are given in the supplementary material).
4.3 Ablation study
We perform an ablation study to verify the effectiveness of the feature extractor. As a baseline, we explore a loss in the raw pixel space. In addition, instead of the facial segmentation network, we use a pre-trained ResNet or VGG to extract image features. Fig. 6 shows the comparison. The results using the raw pixel loss or the pre-trained VGG16 are rather inferior, with many artifacts, and the pre-trained ResNet50 cannot distinguish input images with different expressions. In contrast, our facial segmentation network is aware of facial shape changes and produces expressions similar to those of the input images.
We further validate the effectiveness of the facial segmentation network by comparing it with the facial expression recognition model "ResNet" (https://github.com/WuJie1010/Facial-Expression-Recognition.Pytorch) and the face recognition model "EvoLVe" (https://github.com/ZhaoJ9014/face.evoLVe.PyTorch). A natural idea would be to use a facial expression recognition model to measure the expression similarity between the input photo and the neurally rendered one. However, due to the huge domain gap, existing pre-trained models fail to measure such similarity. The face recognition model also fails to distinguish different expressions, as illustrated in Fig. 6. As shown in Fig. 7, ten different expressions are selected to compute the losses on the FaceWarehouse dataset. We compare the expression recognition loss, the face recognition loss, and the face segmentation loss using the expression similarity matrix, which is normalized and shown with the JET colormap. A useful metric for expression similarity should have low values on the diagonal of the matrix, and only the segmentation loss meets this requirement well.
In this paper, we propose an unsupervised framework for estimating the head pose, the AU parameters, and the identity parameters from a real facial image without annotated AU data. We formulate the method as a differentiable optimization problem that minimizes the differences between the features extracted from the real facial image and those of the generated one. The loss can be back-propagated to the facial parameters because the generator is differentiable. Experimental results demonstrate that the proposed method fits accurate AU parameters and achieves state-of-the-art performance compared with existing methods. It also obtains accurate AU estimates on internet photos. In the future, we plan to collect a large number of face videos with various expressions from different views. This database can be labeled using our GE-Net so that we can develop faster methods for AU intensity estimation. Meanwhile, a more robust method for handling large changes in head pose is another direction.
-  (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In Automatic Face and Gesture Recognition, Vol. 6, pp. 1–6. Cited by: §1, §2.2, §4.2, Table 2.
-  (2013) Constrained local neural fields for robust facial landmark detection in the wild. In International Conference on Computer Vision Workshops, pp. 354–361. Cited by: §3.2.
-  Aumpnet: simultaneous action units detection and intensity estimation on multipose facial images using a single convolutional neural network. In Automatic Face & Gesture Recognition, pp. 866–871. Cited by: §1, §2.2.
-  (1999) A morphable model for the synthesis of 3d faces. In Computer Graphics and Interactive Techniques, pp. 187–194. Cited by: §1, §2.1.
-  (2014) Displaced dynamic expression regression for real-time facial tracking and animation. Transactions on Graphics 33 (4), pp. 43. Cited by: §2.1.
-  (2013) Facewarehouse: a 3d facial expression database for visual computing. Transactions on Visualization and Computer Graphics 20 (3), pp. 413–425. Cited by: §A.1, §4.1.
-  (2019) Deep, landmark-free fame: face alignment, modeling, and expression estimation. International Journal of Computer Vision 127 (6-7), pp. 930–956. Cited by: Figure 5.
-  Joint face detection and facial motion retargeting for multiple faces. In Computer Vision and Pattern Recognition, pp. 9719–9728. Cited by: §2.1.
-  (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.3.
-  (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §3.3.
-  Neural scene representation and rendering. Science 360 (6394), pp. 1204–1210. Cited by: §2.3.
-  (1978) Facial action coding system: a technique for the measurement of facial movement. Palo Alto 3. Cited by: §1.
-  (2015) Deep learning based facs action unit occurrence and intensity estimation. In Automatic Face and Gesture Recognition (FG), Vol. 6, pp. 1–5. Cited by: §4.2, Table 2.
-  (2018) Cnn-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. Pattern Analysis and Machine Intelligence 41 (6), pp. 1294–1307. Cited by: §2.1.
-  (2017) Learning by association–a versatile semi-supervised training method for neural networks. In Computer Vision and Pattern Recognition, pp. 89–98. Cited by: Table 1, §4.2.
-  (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.3, §4.2, Table 2.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.2.
-  Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Cited by: §A.1, §3.2.
-  (2016) Large-pose face alignment via cnn-based dense 3d model fitting. In Computer Vision and Pattern Recognition, pp. 4188–4196. Cited by: §2.1.
-  (2012) Continuous pain intensity estimation from facial expressions. In International Symposium on Visual Computing, pp. 368–377. Cited by: §2.2.
-  (2015) Doubly sparse relevance vector machine for continuous facial behavior estimation. Pattern Analysis and Machine Intelligence 38 (9), pp. 1748–1761. Cited by: §2.2.
-  (2015) Latent trees for estimating intensity of facial action units. In Computer Vision and Pattern Recognition, pp. 296–304. Cited by: §2.2.
-  (2018) Neural 3d mesh renderer. In Computer Vision and Pattern Recognition, pp. 3907–3916. Cited by: §2.3.
-  (2012) Interactive facial feature localization. In European Conference on Computer Vision, pp. 679–692. Cited by: §3.3.
-  (2017) Eac-net: a region-based deep enhancing and cropping approach for facial action unit detection. In International Conference on Automatic Face & Gesture Recognition, pp. 103–110. Cited by: §2.2.
-  (2017) Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In Computer Vision and Pattern Recognition, pp. 1841–1850. Cited by: §1, §2.2.
-  Deepcoder: semi-parametric variational autoencoders for automatic facial action coding. In International Conference on Computer Vision, pp. 3190–3199. Cited by: Table 1, §4.2.
-  (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition-Workshops, pp. 94–101. Cited by: §1, §4.1.
-  (2013) Disfa: a spontaneous facial action intensity database. Transactions on Affective Computing 4 (2), pp. 151–160. Cited by: §1, §4.1.
-  (2013) Affectiva-mit facial expression dataset (am-fed): naturalistic and spontaneous facial expressions collected. In Computer Vision and Pattern Recognition Workshops, pp. 881–888. Cited by: §1.
-  (2018) Rendernet: a deep convolutional network for differentiable rendering from 3d shapes. In Neural Information Processing Systems, pp. 7891–7901. Cited by: §2.3.
-  (2019) Local relationship learning with person-specific shape regularization for facial action unit detection. In Computer Vision and Pattern Recognition, pp. 11917–11926. Cited by: §2.2.
-  (2005) Web-based database for facial expression analysis. In International Conference on Multimedia and Expo, 5 pp. Cited by: §1, §4.1.
-  (2009) A 3d face model for pose and illumination invariant face recognition. In International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy. Cited by: §3.4.
-  (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §3.2.
-  (2014) Context-sensitive dynamic ordinal regression for intensity estimation of facial action units. Pattern Analysis and Machine Intelligence 37 (5), pp. 944–958. Cited by: §2.2.
-  (2019) Face-to-parameter translation for game character auto-creation. In International Conference on Computer Vision, pp. 161–170. Cited by: §1, §2.3, §3.1.
-  (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86 (2), pp. 420. Cited by: §4.2.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
-  (2017) Mofa: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In International Conference on Computer Vision, pp. 1274–1283. Cited by: §2.3.
-  (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: Figure 3, §3.2.
-  (2015) Fera 2015-second facial expression recognition and analysis challenge. In International Conference and Workshops on Automatic Face and Gesture Recognition, Vol. 6, pp. 1–8. Cited by: §4.1.
-  (2017) Deep structured learning for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 3405–3414. Cited by: §4.2, Table 2.
-  (2016) Deep constrained local models for facial landmark detection. arXiv preprint arXiv:1611.08657. Cited by: Figure 5.
-  (2017) Convolutional experts constrained local model for 3d facial landmark detection. In International Conference on Computer Vision, pp. 2519–2528. Cited by: §3.2.
-  (2013) A high-resolution spontaneous 3d dynamic facial expression database. In Automatic Face and Gesture Recognition, pp. 1–6. Cited by: §1.
-  (2018) Weakly-supervised deep convolutional neural network learning for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 2314–2323. Cited by: §1, §2.2, Table 1, §4.2.
-  (2019) Context-aware feature and label fusion for facial action unit intensity estimation with partially labeled data. In International Conference on Computer Vision, pp. 733–742. Cited by: Table 1, §4.2.
-  (2019) Joint representation and estimator learning for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 3457–3466. Cited by: Table 1, §4.2.
-  (2018) Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 7034–7043. Cited by: §1, §2.2.
-  (2018) Towards pose invariant face recognition in the wild. In Computer Vision and Pattern Recognition, pp. 2207–2216. Cited by: §4.3.
-  (2018) Learning facial action units from web images with scalable weakly supervised clustering. In Computer Vision and Pattern Recognition, pp. 2090–2099. Cited by: §2.2.
-  (2016) Facial expression intensity estimation using ordinal information. In Computer Vision and Pattern Recognition, pp. 3466–3474. Cited by: §1.
-  (2017) Pose-independent facial action unit intensity regression based on multi-task deep transfer learning. In Automatic Face & Gesture Recognition, pp. 872–877. Cited by: §1, §2.2.
-  (2019) Dense 3d face decoding over 2500fps: joint texture & shape convolutional mesh decoders. In Computer Vision and Pattern Recognition, pp. 1097–1106. Cited by: §2.1.
-  (2016) Face alignment across large poses: a 3d solution. In Computer Vision and Pattern Recognition, pp. 146–155. Cited by: §2.1, Figure 5.
-  (2015) High-fidelity pose and expression normalization for face recognition in the wild. In Computer Vision and Pattern Recognition, pp. 787–796. Cited by: §2.1.
-  (2017) Face alignment in full pose range: a 3d total solution. Pattern Analysis and Machine Intelligence, pp. 78–92. Cited by: §2.1.
Appendix A Appendix
a.1 More Experiments
Stability. We expect the predicted AU parameters to be stable across consecutive frames of a video. This stability is crucial for applications such as expression transfer, facial expression analysis, and human-robot interaction. To evaluate it, we assume that two consecutive frames of a video are almost identical, so stability can be measured by the frame-to-frame differences, computed as the standard deviation of each AU within fixed-size sliding windows of consecutive frames. We compute this standard deviation for window sizes from 2 to 10 frames on ten video sequences downloaded from the internet. Our method outperforms OpenFace for all window sizes; the results for 5-frame windows are shown in Table 3. We also show the results on one video at different frames in Fig. 9.
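The sliding-window measure described above can be sketched as follows; the function name and interface are illustrative, not the paper's actual code:

```python
import numpy as np

def stability(au_seq, window=5):
    """Mean per-AU standard deviation over fixed-size sliding windows
    of consecutive frames; lower values indicate more stable predictions.

    au_seq: (num_frames, num_aus) array of predicted AU parameters.
    """
    au_seq = np.asarray(au_seq, dtype=float)
    n = au_seq.shape[0]
    # Standard deviation of each AU inside every window of consecutive frames.
    stds = [au_seq[i:i + window].std(axis=0) for i in range(n - window + 1)]
    return np.mean(stds, axis=0)  # one score per AU
```

A perfectly constant sequence scores zero, and larger frame-to-frame jitter raises the score, matching the metric's intent.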
Robustness to Varying Conditions. Our network uses facial segmentation features as the input of the loss, which makes the results robust to changes in lighting, resolution, and style. To verify this, we conduct robustness experiments on the FaceWarehouse dataset, applying different lighting conditions, Gaussian blur, and style transfer to the images. Some results are shown in Fig. 10. We qualitatively demonstrate this robustness by varying the conditions for a single subject, and the outputs remain consistent. The first row shows the input images; the rendered images and the corresponding histograms of the facial AU parameters are shown in the second and third rows. We also note that our method performs poorly on the upper lid raiser (column 4 in the histograms). This is because Justice Face restricts the expression of the upper lid raiser, whose movement is smaller (please refer to Fig. 11).
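A minimal sketch of such a robustness check, assuming a hypothetical `predict_aus` function that maps an image to AU parameters; the perturbations here (brightness change, box blur) are cheap stand-ins for the lighting, blur, and style changes used in the experiment:

```python
import numpy as np

def perturb(img, rng):
    """Apply a random brightness (lighting) change followed by a 3x3
    box blur (resolution loss) to a grayscale image in [0, 1]."""
    bright = np.clip(img * rng.uniform(0.7, 1.3), 0.0, 1.0)
    p = np.pad(bright, 1, mode="edge")
    h, w = img.shape
    # Naive 'same' box blur via shifted windowed sums.
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def consistency(predict_aus, img, n_trials=8, seed=0):
    """Max absolute deviation of AU predictions across perturbed copies
    of one image; small values indicate robustness."""
    rng = np.random.default_rng(seed)
    base = predict_aus(img)
    devs = [np.abs(predict_aus(perturb(img, rng)) - base).max()
            for _ in range(n_trials)]
    return float(max(devs))
```

In practice `predict_aus` would wrap the trained network; the harness only formalizes "consistent output under varying conditions" as a number.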
In our paper, the action units we use and their corresponding names are listed in Table 4, ordered by row and then column. Each AU parameter is set to its maximum value, and the corresponding image rendered from the base face is shown in Fig. 11.

| Name | AU | Name | AU | Name | AU |
|---|---|---|---|---|---|
| Left Eye Blink | 45 | Right Eye Blink | 45-2 | Upper Lid Raiser | 5 |
| Cheek Raiser | 6 | Inner Brow Raiser | 1 | Left Outer Brow Raiser | 2 |
| Right Outer Brow Raiser | 3 | Brow Lowerer | 4 | Mouth Stretch | 27 |
| Nose Wrinkler | 9 | Upper Lip Raiser | 10 | Down Lip Down | 25 |
| Lip Corner Puller | 12 | Left Mouth Press | 14 | Right Mouth Press | 14-2 |
|  | 18 | Lip Stretcher | 20 | Lip Upper Close | 24 |
| Lip Lower Close | 17 | Cheek Puff | 34 | Lip Corner Depressor | 15 |
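For reference, the Table 4 mapping can be transcribed into a plain dictionary (names and indices copied from the table; the AU 18 entry, whose name is not given, is omitted):

```python
# AU name -> game parameter index, mirroring Table 4.
# Indices such as "45-2" denote the right-side variant of a bilateral AU.
AU_PARAMS = {
    "Left Eye Blink": "45", "Right Eye Blink": "45-2", "Upper Lid Raiser": "5",
    "Cheek Raiser": "6", "Inner Brow Raiser": "1", "Left Outer Brow Raiser": "2",
    "Right Outer Brow Raiser": "3", "Brow Lowerer": "4", "Mouth Stretch": "27",
    "Nose Wrinkler": "9", "Upper Lip Raiser": "10", "Down Lip Down": "25",
    "Lip Corner Puller": "12", "Left Mouth Press": "14", "Right Mouth Press": "14-2",
    "Lip Stretcher": "20", "Lip Upper Close": "24",
    "Lip Lower Close": "17", "Cheek Puff": "34", "Lip Corner Depressor": "15",
}
```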
Comparison with OpenFace. We provide a comparative experiment to investigate the effectiveness of our GE-Net. The results are shown in Table 5. GE-Net achieves better MAE and ICC on average and on many individual AUs, which demonstrates the effectiveness of the attention mechanism.
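Both metrics are standard in AU intensity estimation; a self-contained sketch of MAE and ICC(3,1) (in the Shrout and Fleiss sense), treating ground truth and prediction as two raters:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two intensity sequences."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def icc_3_1(y_true, y_pred):
    """ICC(3,1): two-way mixed, single-measure consistency between
    two 'raters' (ground truth and prediction)."""
    Y = np.stack([np.asarray(y_true, float), np.asarray(y_pred, float)], axis=1)
    n, k = Y.shape                      # n targets, k = 2 raters
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    bms = ss_rows / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return float((bms - ems) / (bms + (k - 1) * ems))
```

ICC(3,1) is 1 for perfect agreement and can go negative when predictions disagree with the ground-truth ordering, which is why it complements MAE in Table 5.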
Appendix B AU parameters expression on the 3d facial model
There are slight differences between the AU parameters of the 3d facial model and the AUs defined in the FACS system (e.g., we separate the dimpler into a left dimple and a right dimple), as shown in Fig. 11.
Appendix C Training samples of our generator
In the training process, we train our generator with randomly generated game faces, as shown in Fig. 12, which are determined by the head pose, AU parameters, and identity parameters.
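A sketch of how such random training samples could be drawn; the parameter ranges and dimensions below are illustrative assumptions, not the actual game-engine or paper values:

```python
import numpy as np

def sample_game_face(rng, n_aus=21, n_id=100):
    """Draw one random training sample for the generator: head pose
    (yaw, pitch, roll), AU parameters, and identity parameters.
    All ranges here are hypothetical placeholders."""
    return {
        "pose": rng.uniform(-30.0, 30.0, size=3),     # degrees
        "aus": rng.uniform(0.0, 1.0, size=n_aus),     # normalized intensities
        "identity": rng.normal(0.0, 1.0, size=n_id),  # e.g. 3DMM-style coeffs
    }

rng = np.random.default_rng(0)
batch = [sample_game_face(rng) for _ in range(4)]  # a small random batch
```

Each sampled parameter set would be fed to the game engine to render the corresponding face image, giving (parameters, image) pairs for generator training.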