Unsupervised Facial Action Unit Intensity Estimation via Differentiable Optimization

04/13/2020 ∙ by Xinhui Song, et al. ∙ NetEase, Inc. and Zhejiang University

The automatic intensity estimation of facial action units (AUs) from a single image plays a vital role in facial analysis systems. One big challenge for data-driven AU intensity estimation is the lack of sufficient AU label data. Because AU annotation requires strong domain expertise, it is expensive to construct an extensive database for learning deep models. The limited number of labeled AUs, as well as identity differences and pose variations, further increases the difficulty of estimation. Considering all these difficulties, we propose an unsupervised framework, GE-Net, for facial AU intensity estimation from a single image, without requiring any annotated AU data. Our framework performs differentiable optimization, which iteratively updates the facial parameters (i.e., head pose, AU parameters and identity parameters) to match the input image. GE-Net consists of two modules: a generator and a feature extractor. The generator learns to "render" a face image from a set of facial parameters in a differentiable way, and the feature extractor extracts deep features for measuring the similarity between the rendered image and the input real image. After the two modules are trained and fixed, the framework searches for optimal facial parameters by minimizing the differences of the extracted features between the rendered image and the input image. Experimental results demonstrate that our method achieves state-of-the-art results compared with existing methods.


1 Introduction

Facial expression analysis is of great interest to many researchers in computer vision, computer graphics, and psychology. Automated facial expression analysis from a single image enables numerous applications such as human-robot interaction and behavioral and psychological research. Facial Action Unit (AU) intensity estimation [12] is one of the main building blocks of single-image facial expression analysis; it aims to estimate anatomically meaningful parameters (i.e., muscle movements) instead of the PCA parameters of a 3D morphable model [4]. The seminal work of Ekman and Friesen [12] developed the Facial Action Coding System (FACS) for describing facial expressions, which are produced by the movements of facial muscles under the skin. Nearly any anatomically possible facial expression can be coded by a combination of AUs.

Existing AU intensity estimation methods are broadly divided into fully supervised methods [1, 26, 54, 3] and weakly supervised methods [53, 47, 50]. Training a fully supervised model requires a large set of labeled samples. However, AU annotation requires strong domain expertise, so constructing a large database is very costly in time and labor. Due to this demanding labeling process, datasets in the literature (e.g., CK+ [28], MMI [33], AM-FED [30], DISFA [29], BP4D [46]) restrict the number of coded AUs, samples, and subjects. Weakly supervised methods exploit various types of human knowledge to cope with limited annotations. Nevertheless, they still require thousands of AU annotations, and the human-defined knowledge restricts the space of anatomically plausible AUs. Moreover, automatic AU intensity estimation faces many challenges that often make its performance unsatisfactory, including large variations in pose, identity, illumination, and occlusion in unconstrained environments. Among them, pose and identity have always been essential challenges because they can dramatically increase the variance among face images of different people even when they share the same expression.

Considering the above challenges, we propose a framework for unsupervised estimation of facial action unit intensity from a single image, which can jointly estimate the facial parameters (i.e., head pose, AU parameters and identity parameters) without annotated AU data. The core of our framework is differentiable optimization, which iteratively updates the facial parameters to match the input image. It is achieved with the help of a novel network architecture called GE-Net. GE-Net consists of two modules: a generator which learns to “render” a face image from a set of facial parameters in a differentiable way, and a feature extractor which extracts deep features for measuring the similarity of the rendered image and input real image. After the two modules are trained, we integrate them into our framework, and the facial parameters can be fitted by minimizing the loss of the extracted features between the rendered image and the input real one.

Specifically, our generator is trained based on a 3D parametric face model called Justice Face from the online game Justice (https://n.163.com/). The model parameters have explicit physical meanings, including AU parameters and identity parameters. The AU parameters can express all the AUs from FACS, and the identity parameters describe identity-specific face shapes by controlling the attributes of facial components, including position, orientation, and scale, as introduced in [37]. We adopt a face semantic segmentation network as our feature extractor, whose intermediate feature maps focus more on the facial shape content than on the raw appearance. Hence it can effectively reduce the large domain gap between the rendered image and the real one (see Sec. 4.3 for details).

Our method can be used in many applications, one of which is expression transfer. We can exchange the AU parameters of different individuals, which is shown in Fig. 1. With image A or B as input, our framework can predict their facial parameters. After re-combining the head pose, AU parameters, and identity parameters, we can transfer the pose and expression from B to A while keeping A’s identity.
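As a hedged illustration of this recombination (the parameter-dictionary layout and the generator interface below are our own assumptions, not the paper's code), transferring B's pose and expression to A amounts to concatenating B's pose and AU parameters with A's identity parameters and re-rendering:

```python
import torch

def transfer_expression(params_a, params_b, generator):
    """Recombine fitted parameters: keep A's identity, take B's pose and AUs.

    params_a / params_b are dicts of 1-D tensors {'pose', 'au', 'id'}
    (a hypothetical layout); `generator` maps the concatenated 269-D
    parameter vector to an image, as GE-Net's generator does.
    """
    x = torch.cat([params_b['pose'], params_b['au'], params_a['id']])
    with torch.no_grad():
        # rendered image of A's identity with B's pose and expression
        return generator(x.unsqueeze(0))
```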

Our main contributions are summarized as follows. To the best of our knowledge, we are the first to propose a deep learning framework for unsupervised estimation of facial AU intensity from a single image, without requiring annotated AU data. Moreover, the framework enables differentiable optimization: it consists of a generator performing differentiable rendering and a feature extractor measuring the similarity between the rendered image and the real image. Extensive experiments show that the proposed method achieves state-of-the-art results on AU intensity estimation. Meanwhile, our method can estimate more AUs than are covered by existing AU-annotated databases, such as jaw left and upper lip raiser.

2 Related Work

2.1 Face Reconstruction Methods based on 3DMM

In past years, many methods [58, 56, 14, 55] have described a 3D facial shape using the parameters of the 3DMM [4], which are constructed with PCA. Early methods [57, 5] use a regressor to directly regress the facial landmark locations from a single image for fitting 3DMM parameters. Recently, approaches that regress 3DMM parameters using CNNs and fit the 3DMM to 2D images have become popular. Jourabloo et al. [19] use cascaded CNN regressors to regress the 3DMM parameters (identity, expression and pose parameters). Chaudhuri et al. [8] present a single end-to-end network to jointly predict the bounding box locations and 3DMM parameters for multiple faces. Zhu et al. [56, 59] perform multiple iterations of a single CNN to fit the 3DMM parameters. Different from these 3DMM-based methods, our goal is to predict the facial AU intensity with explicit anatomical meanings, whereas the 3DMM parameters lack such physical meanings.

Figure 2: The pipeline of the proposed method. A generator and a feature extractor are trained in the training phase. In the testing stage, we fix the parameters of the generator and the feature extractor, and the framework updates the facial parameters by minimizing the differences between the real image and the generated image. Empirically, the optimal facial parameters can be obtained after 30 iterations per image.

2.2 AU Intensity Estimation Methods

Similar to AU detection [32], some methods [20, 22, 21, 36] use traditional machine learning frameworks to estimate AU intensity. Recently, several works have applied CNNs to AU intensity estimation. Baltrušaitis et al. [1] present a facial action unit intensity estimation system based on geometry features (shape parameters and landmark locations) and appearance (Histograms of Oriented Gradients), using median-based feature normalization to account for a person-specific neutral expression. Li et al. [26, 25] design different CNNs to regress AU parameters, where the CNNs focus on different facial regions independently. Batista et al. [3] and Zhou et al. [54] apply DNNs to estimate AU intensity under multiple head poses. However, these supervised methods require a large number of training samples, while AU annotation needs strong domain expertise, and constructing such an extensive database requires great effort. Besides, limited by the number of annotated samples, the above methods usually suffer from over-fitting. Due to the lack of AU annotation datasets, several works [52, 47, 50] use the weakly supervised training paradigm to train deep models with incomplete or inaccurate annotations, where prior knowledge and domain knowledge are involved. However, these methods still require thousands of annotations, and the added prior knowledge restricts the space of anatomically plausible AUs. In contrast, our method can estimate more AUs than existing AU-annotated databases cover, without requiring any annotated data.

2.3 Neural Rendering

Recent works on differentiable rendering achieve differentiability in various ways. Kato et al. [23] achieve differentiability through an approximate gradient for the rasterization operation. GQN [11] is a powerful neural renderer that learns to represent synthetic scenes from input images taken from different viewpoints, but the geometry of the objects in the scene is simple. MOFA [40] learns a decoder on top of an expert-designed generative model used as a neural renderer; such a hand-crafted model has difficulty describing the complicated rendering process, with its illumination conditions and complex shading, explicitly. Nguyen-Phuoc et al. [31] propose a CNN architecture to render 3D objects from a 3D voxel grid. Shi et al. [37] learn a generator as a neural renderer with fixed camera parameters. Inspired by [37], we extend the generator to imitate the face rendering process from meaningful facial parameters including identity, expression and head pose.

3 Method

Our goal is to estimate the intensity of facial AUs, together with the face pose and identity, from a single image via differentiable optimization, without requiring any labeled data. The optimization framework GE-Net consists of a generator G and a feature extractor F. The generator imitates the rendering process from the facial parameters x (i.e., head pose p, AU parameters β and identity parameters α) to a rendered image, while the feature extractor extracts deep features from images, which are used to measure the similarity between the real input image and the rendered image for fitting the meaningful parameters. The generator is trained with the help of a 3D parametric face model called Justice Face from the game Justice. The model parameters have explicit physical meanings, including the AU parameters β and the identity parameters α. The complete process of our method is summarized as follows:
Stage 1. Train the generator G and the feature extractor F.
Stage 2. Fix the parameters of G and F, and iteratively update the facial parameters x according to the loss between the input image and the generated one. Empirically, the facial parameters can closely describe the input image after 30 iterations.
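In other words, using the notation introduced above (G the generator, F the feature extractor, I the input real image), Stage 2 solves the following optimization by gradient descent on x. This compact formulation is our summary of the procedure, not an equation from the paper:

```latex
x^{*} \;=\; \arg\min_{x \in [0,1]^{269}}
\big\| \mathcal{F}\!\left(G(x)\right) - \mathcal{F}\!\left(I\right) \big\|_{1},
\qquad x = (p, \beta, \alpha)
```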

Figure 3: The generator (a) and the feature extractor (b). The generator learns to map a set of meaningful facial parameters to a rendered image by combining a content loss and a perceptual loss. IN denotes instance normalization [41]. A face segmentation network is trained as the feature extractor. The similarity between the input real image and the generated image is measured via an ℓ1 loss on the feature maps, pixel-wise multiplied with the attention masks. The output probability map of every local region serves as the attention mask for one of the feature maps.

3.1 3D Face Model

The parametric face model Justice Face contains the identity parameters and AU parameters. The identity parameters control the attributes of facial components to create a neutral face for a specific identity, including position, orientation, and scale, as in [37]. The AU parameters control the expression by driving facial muscle movements relative to the neutral face. Specifically, these AU parameters include eyes closed, upper lid raiser, cheek raiser, inner brow raiser, outer brow raiser, brow lowerer, jaw open, nose wrinkler, upper lip raiser, down lip downer, lip corner puller, mouth press, lip pucker, upper lips close, lower lips close, cheek puff, lip corner depressor, jaw left and jaw right. The correspondence between AUs and our AU parameters is shown in Table 2 of the appendix. We can adjust the AU and identity parameters to generate an identity-specific expression, formulated as:

F = \bar{F} + \sum_{i} \alpha_i B_i^{\mathrm{id}} + \sum_{j} \beta_j B_j^{\mathrm{AU}}    (1)

where F is the synthesized 3D face and \bar{F} is the base face. B_i^{\mathrm{id}} are the facial component offsets to the base face in position, orientation, and scale, and the identity parameters \alpha_i are the weights of these offsets. B_j^{\mathrm{AU}} are the offsets of facial muscle movements relative to the neutral face, and the AU parameters \beta_j are the corresponding weights, representing the facial AU intensities.
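A minimal NumPy sketch of Eq. (1), assuming the offsets are stored per parameter as dense vertex-displacement arrays (this array layout is our assumption; the engine's actual representation of the position, orientation, and scale offsets is not public):

```python
import numpy as np

def synthesize_face(base_face, id_offsets, au_offsets, alpha, beta):
    """Eq. (1) as a sketch: base face plus weighted identity and AU offsets.

    base_face : (V, 3) array of base-mesh vertices
    id_offsets: (K_id, V, 3) per-parameter identity offsets (assumed dense layout)
    au_offsets: (K_au, V, 3) per-AU muscle-movement offsets relative to the neutral face
    alpha     : (K_id,) identity parameters in [0, 1]
    beta      : (K_au,) AU parameters (intensities) in [0, 1]
    """
    # sum over the parameter axis: each offset field scaled by its weight
    face = base_face \
         + np.tensordot(alpha, id_offsets, axes=1) \
         + np.tensordot(beta, au_offsets, axes=1)
    return face
```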

3.2 Generator

The generator G learns to map a set of meaningful facial parameters x to a rendered image based on Justice Face. The input x is a 269-dimensional vector consisting of the head pose p, the AU parameters β and the identity parameters α. The ground truth is the corresponding image rendered by Justice Face, denoted I_r, and the prediction of the network is the generated image I_g = G(x). Our generator consists of eight transposed convolution layers, similar to the network of DCGAN [35]. We use instance normalization [41] instead of batch normalization [17] in the generator to improve training stability. The generator is differentiable, so the facial parameters can be updated by gradient descent.

Fig. 3(a) shows the architecture of our generator. To make the generated images indistinguishable from the rendered images, we adopt a content loss and a perceptual loss [18] between the rendered image I_r and the predicted one I_g. The content loss penalizes large changes in the raw pixel space; we use the ℓ1 loss rather than ℓ2 to encourage less blurring. The content loss is defined as:

\mathcal{L}_{content} = \mathbb{E}_{x \sim u(x)} \big[ \| G(x) - I_r(x) \|_1 \big]    (2)

where u(x) denotes the multidimensional joint uniform distribution from which the input parameters x are sampled.

The perceptual loss is added to explicitly constrain the generated result to have the same content information as the rendered one, and it is defined as:

\mathcal{L}_{perc} = \mathbb{E}_{x \sim u(x)} \Big[ \sum_{j} \| \phi_j(G(x)) - \phi_j(I_r(x)) \|_1 \Big]    (3)

where \phi_j denotes the relu2_2, relu3_3 and relu4_3 feature maps of VGG16 [39] pre-trained on the image recognition task. In this way, we can train the generator by optimizing the following loss function:

\mathcal{L}_G = \mathcal{L}_{content} + \lambda \mathcal{L}_{perc}    (4)

where \lambda balances the two objectives.
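A minimal PyTorch sketch of the generator training loss (Eqs. 2-4), assuming a generator G that maps 269-D parameter vectors to images and using torchvision's pre-trained VGG16. The VGG layer indices, the ℓ1 form of the perceptual term, and the weight lam are our assumptions; input normalization to VGG statistics is omitted for brevity:

```python
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# indices of relu2_2, relu3_3, relu4_3 in torchvision's vgg16.features
PERC_LAYERS = {8, 15, 22}

def vgg_features(img):
    feats, h = [], img          # img is assumed already normalized for VGG
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in PERC_LAYERS:
            feats.append(h)
    return feats

def generator_loss(G, x, rendered, lam=1.0):   # lam is an assumed weight
    """L_G = L_content + lam * L_perceptual, both as l1 losses (Eqs. 2-4)."""
    generated = G(x)
    content = F.l1_loss(generated, rendered)
    perceptual = sum(F.l1_loss(a, b) for a, b in zip(vgg_features(generated),
                                                     vgg_features(rendered)))
    return content + lam * perceptual
```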

During training, all the rendered images are aligned via affine transformation using five facial landmarks (left eye, right eye, nose, left mouth corner and right mouth corner), which are obtained by OpenFace [2, 45]. All transposed convolution layers in the generator share the same kernel size and use a stride of 2. Instance normalization and ReLU activation follow every transposed convolution layer except the output layer. G is trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 64. All weights are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The learning rate is 0.001, with a decay of 1% every three epochs. After 500 epochs of training, the generator successfully learns how to map each facial parameter to its corresponding component/muscle movement and "render" a visually plausible image.
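A hedged sketch of such a DCGAN-style generator with instance normalization. The channel widths, the 4x4 kernel, the output resolution, and the Tanh output are our assumptions; the paper only states eight stride-2 transposed convolutions with IN + ReLU and an unnormalized output layer:

```python
import torch.nn as nn

class Generator(nn.Module):
    """Eight transposed-conv layers mapping a 269-D parameter vector to an image.

    Channel widths and the 4x4 kernel are illustrative assumptions; with a
    stride of 2 per layer, a 1x1 input grows to a 256x256 output image.
    """
    def __init__(self, in_dim=269, channels=(512, 512, 256, 256, 128, 64, 32), out_ch=3):
        super().__init__()
        layers, c_in = [], in_dim
        for c_out in channels:
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(c_out),
                       nn.ReLU(inplace=True)]
            c_in = c_out
        # output layer: no normalization / ReLU, as described in the text
        layers += [nn.ConvTranspose2d(c_in, out_ch, 4, stride=2, padding=1), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                          # x: (B, 269)
        return self.net(x.view(x.size(0), -1, 1, 1))
```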

3.3 Feature Extractor

As there is a large domain gap between the generated image and the input real image, a face segmentation network is trained as the feature extractor F to measure their similarity effectively (Fig. 3(b)). The similarity is computed on the feature maps of the first four layers of F. Specifically, an attention mechanism [9] is deployed in the feature extractor. It allows the feature maps to attend to different parts of the face, which means the feature maps have different influences on the loss. The output probability maps of the facial segmentation network serve as the attention masks, represented as the pixel-wise weights w_i in Eqn. (5); the probability map of every facial local region is used as the attention mask of one feature map. For example, the feature map of the first layer is sensitive to the eyes, so the eyes' probability map is element-wise multiplied with the feature map of the first layer. Suppose f_i represents the i-th feature map; the extracted features can then be defined as:

\mathcal{F}(I) = \{ w_i \odot f_i(I) \}_{i=1}^{N}    (5)

where I is the input of the feature extractor, N feature maps are extracted, w_i represents the weight (attention mask) of each feature map f_i, and ⊙ denotes element-wise multiplication. We set different attention masks for the feature maps using the probability maps of the facial segmentation result.
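A sketch of the attention-weighted feature extraction of Eq. (5). The interface of the segmentation network (a list of backbone stages plus a classifier producing per-region probability maps) and the fixed region-to-feature-map assignment are our own assumptions, not the paper's code:

```python
import torch.nn.functional as F

def attended_features(seg_net, image, region_ids):
    """Extract the first few feature maps of a segmentation network and weight
    each one with the probability map of an assigned facial region (Eq. 5).

    `seg_net.backbone_stages` and `seg_net.classifier` are a hypothetical
    interface; `region_ids` assigns one facial region index per feature map.
    """
    feats, h = [], image
    for stage in seg_net.backbone_stages[:4]:            # first four layers/stages
        h = stage(h)
        feats.append(h)
    probs = seg_net.classifier(h).softmax(dim=1)          # (B, R, h, w) region maps
    attended = []
    for f, r in zip(feats, region_ids):
        mask = F.interpolate(probs[:, r:r+1], size=f.shape[-2:],
                             mode='bilinear', align_corners=False)
        attended.append(f * mask)                         # pixel-wise weighting
    return attended
```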

The architecture of the feature extractor network is based on ResNet-50 [16]. We set the strides of the four residual stages to [2, 1, 1, 1], remove the fully connected layers, and replace the average pooling layer with a convolution layer, so that the output is 1/8 the size of the input. We further upsample the output by a factor of 8 to obtain a prediction of the same size as the input. The network is pre-trained on ImageNet [10] and fine-tuned on the Helen dataset [24] with a pixel-wise cross-entropy loss. The learning rate is set to 0.001.
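The sketch below shows one way to build such a 1/8-resolution segmentation network from a torchvision ResNet-50. The paper states a [2, 1, 1, 1] stride pattern over the four stages; for simplicity this sketch reaches the same 1/8 output resolution by keeping the standard stem and layer2 strides and setting layer3/layer4 to stride 1. The number of facial regions and the 1x1 convolutional head are assumptions:

```python
import torch.nn as nn
from torchvision import models

def build_segmentation_backbone(num_regions=11):
    """ResNet-50 turned into a 1/8-resolution face segmentation net (a sketch):
    FC removed, a conv head instead of average pooling, 8x upsampled output.
    """
    resnet = models.resnet50(pretrained=True)
    # set every stride-2 conv (including downsample shortcuts) in layer3/4 to stride 1
    for layer in (resnet.layer3, resnet.layer4):
        for m in layer.modules():
            if isinstance(m, nn.Conv2d) and m.stride == (2, 2):
                m.stride = (1, 1)
    backbone = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                             resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
    head = nn.Conv2d(2048, num_regions, kernel_size=1)    # per-region logits
    up = nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False)
    return nn.Sequential(backbone, head, up)
```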

Database          FERA2015                        DISFA
AU                06   10   12   14   17   Avg    01   02   04   05   06   09   12   15   17   20   25   26   Avg

ICC
KBSS [47]*        .76  .73  .84  .45  .45  .65    .23  .11  .48  .25  .50  .25  .71  .22  .25  .06  .83  .41  .36
KJRE [49]*        .71  .61  .87  .39  .42  .60    .27  .35  .25  .33  .51  .31  .67  .14  .17  .20  .74  .25  .35
LBA [15]*         .71  .64  .81  .23  .50  .58    .04  .06  .39  .01  .41  .12  .73  .13  .27  .10  .82  .43  .29
2DC [27]*         .76  .71  .85  .45  .53  .66    .70  .55  .69  .05  .59  .57  .88  .32  .10  .08  .90  .50  .50
CFLF [48]*        .77  .70  .83  .41  .60  .66    .26  .19  .46  .35  .52  .36  .71  .18  .34  .21  .81  .51  .41
GENet-O           .67  .80  .71  .61  .50  .66    .51  .63  .58  .65  .53  .62  .48  .55  .50  .37  .70  .63  .59
GE-Net            .69  .85  .73  .63  .53  .68    .66  .67  .73  .71  .60  .59  .60  .66  .58  .45  .80  .70  .64

MAE
KBSS [47]*        .74  .77  .69  .99  .90  .82    .48  .49  .57  .08  .26  .22  .33  .15  .44  .22  .43  .36  .336
KJRE [49]*        .82  .95  .64  1.08 .85  .87    1.0  .92  1.9  .70  .79  .87  .77  .60  .80  .72  .96  .94  .91
LBA [15]*         .64  .80  .56  1.10 .62  .74    .43  .29  .51  .10  .30  .19  .30  .11  .31  .14  .40  .38  .29
2DC [27]*         .87  .84  .92  .67  .73  .81    .57  .62  .73  .51  .66  .55  .50  .52  .78  .42  .61  .74  .61
CFLF [48]*        .62  .83  .62  1.00 .63  .74    .33  .28  .61  .13  .35  .28  .43  .18  .29  .16  .53  .40  .329
GENet-O           .49  .72  .45  .43  .34  .49    .39  .35  .37  .17  .34  .39  .55  .19  .43  .58  .22  .32  .36
GE-Net            .45  .60  .37  .40  .27  .42    .36  .33  .38  .14  .34  .36  .57  .18  .42  .26  .27  .32  .329

Table 1: The Intra-Class Correlation (ICC) and Mean Absolute Error (MAE) on FERA2015 and DISFA. Bold numbers indicate the best performance; underlined numbers indicate the second best. (*) indicates results taken from the corresponding reference.

3.4 Fitting

After the generator G and the feature extractor F are learned, we fix their parameters and iteratively optimize the facial parameters x (i.e., head pose, AU parameters and identity parameters) to match the input real image. To fit the facial parameters to a real face image with expression, the generator and the feature extractor serve as a bridge: we optimize the facial parameters by measuring the similarity of the features extracted by F from the input real image and from the image generated by G. The extracted features guide the generator to produce faces that closely match the real image in feature space.

Fig. 2 shows the pipeline of our proposed method. The meaningful facial parameters can be updated using gradients because the renderer is differentiable. We update the facial parameters by back-propagating the loss on the extracted features, which explicitly constrains the "rendered" result to have similar facial shape content to the real image. The loss is defined as:

\mathcal{L}_{fit}(x) = \sum_{i=1}^{N} \| w_i \odot f_i(G(x)) - w_i \odot f_i(I) \|_1    (6)

where I is the input real image, aligned with the base face using five facial landmarks. The parameters are then updated with gradient descent:

x \leftarrow x - \mu \frac{\partial \mathcal{L}_{fit}(x)}{\partial x}    (7)

where \mu is the learning rate. The range of x is clipped to [0, 1].
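A minimal sketch of the fitting loop (Eqs. 6-7). The generator G and the feature-extraction function `extract` (returning the attention-weighted feature maps of Eq. 5) are assumed to be already trained and fixed; 30 iterations and the [0, 1] clipping follow the paper, while the rest of the interface is ours:

```python
import torch

def fit_parameters(G, extract, real_image, x_init, lr, n_iters=30):
    """Differentiable optimization of the 269-D facial parameter vector.

    `lr` may be a per-parameter learning-rate tensor (see Sec. 3.4).
    """
    x = x_init.clone().requires_grad_(True)
    target_feats = [f.detach() for f in extract(real_image)]
    for _ in range(n_iters):
        gen_feats = extract(G(x))
        # l1 difference between attention-weighted feature maps (Eq. 6)
        loss = sum((gf - tf).abs().mean() for gf, tf in zip(gen_feats, target_feats))
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x -= lr * grad              # gradient descent step (Eq. 7)
            x.clamp_(0.0, 1.0)          # parameters are clipped to [0, 1]
    return x.detach()
```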

Figure 4: Randomly generated faces without and with the facial constraint. The facial constraint can restrict the space of the identity parameters and generate reasonable and neutral faces.

However, the identity parameters describe a large 3D space, in which some points do not correspond to real-world faces. To address this, we apply a facial constraint. Instead of directly predicting identity parameters in the high-dimensional space, a transform is performed to obtain an intrinsic low-dimensional subspace of the identity parameters. We collect 20K expression-free face images based on the Basel Face Model [34] and obtain a set of identity parameters using our GE-Net with all AU parameters set to zero.

Suppose \bar{\alpha} is the mean of these identity parameters and W is the projection matrix, which can be obtained by performing singular value decomposition on the covariance matrix of the identity parameters:

C = \frac{1}{M} \sum_{m=1}^{M} (\alpha_m - \bar{\alpha})(\alpha_m - \bar{\alpha})^{T}    (8)

\alpha' = W (\alpha - \bar{\alpha})    (9)

where \alpha' denotes the identity parameters in the low-dimensional subspace. Each of these parameters can be normalized to [0, 1] according to the 20K face images in the subspace. \alpha' is back-projected to the high-dimensional space by multiplying with the transpose of the projection matrix, and the final predicted identity parameters are reconstructed as:

\hat{\alpha} = W^{T} \alpha' + \bar{\alpha}    (10)
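A NumPy sketch of this facial constraint (Eqs. 8-10). The subspace dimension k is our assumption; the paper does not state it:

```python
import numpy as np

def fit_identity_subspace(alphas, k=50):
    """Build the low-dimensional identity subspace from fitted identity
    parameters of expression-free images. alphas: (M, D) matrix.
    """
    mean = alphas.mean(axis=0)
    cov = np.cov(alphas - mean, rowvar=False)     # (D, D) covariance, Eq. (8)
    U, S, _ = np.linalg.svd(cov)                  # SVD of the covariance matrix
    W = U[:, :k].T                                # (k, D) projection matrix
    return mean, W

def project(alpha, mean, W):
    return W @ (alpha - mean)                     # Eq. (9): into the subspace

def reconstruct(alpha_low, mean, W):
    return W.T @ alpha_low + mean                 # Eq. (10): back-projection
```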

Fig. 4 shows some randomly generated faces without and with the facial constraint. Before the input image is fed into the feature extractor, it is aligned to the “base face” via five facial landmarks. The “base face” is created by rendering a frontal emotionless face image, with the identity parameters set to 0.5.

In the fitting process, we set different learning rates for the components of x. For the identity parameters, a single learning rate is used. For the head pose and AU parameters, each parameter has a different impact on the generated facial expression, so we set a separate learning rate per parameter. Specifically, starting from the "base face", to compute the learning rate for one parameter, we set its value to the maximum while keeping the other parameters zero and obtain the corresponding segmentation output. We count the pixels that differ from the segmentation output of the "base face" and set the inverse of the ratio of differing pixels in the whole image as that parameter's learning rate. In our experiments, the optimization converges well after 30 iterations.
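A sketch of this per-parameter learning-rate computation; the 25 pose-plus-AU dimensions follow the histogram layout in Fig. 8 (columns 0-1 pose, 2-24 AUs), and the interfaces of G and the segmentation network `seg` are assumptions:

```python
import torch

def pose_au_learning_rates(G, seg, x_base, max_value=1.0):
    """For each pose/AU parameter: set it to its maximum, compare the rendered
    face's segmentation with the base face's, and use the inverse of the
    changed-pixel ratio as that parameter's learning rate.
    """
    with torch.no_grad():
        base_labels = seg(G(x_base)).argmax(dim=1)
        n_pixels = base_labels.numel()
        lrs = []
        for i in range(25):                      # pose (columns 0-1) + AUs (2-24)
            x = x_base.clone()
            x[:, i] = max_value
            labels = seg(G(x)).argmax(dim=1)
            ratio = (labels != base_labels).float().sum().item() / n_pixels
            lrs.append(1.0 / max(ratio, 1e-6))   # inverse of changed-pixel ratio
    return torch.tensor(lrs)
```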

4 Experiments

4.1 Dataset

Public Data. We adopt the Extended Cohn-Kanade database (CK+) [28], the MMI database [33], FERA2015 [42], DISFA [29] and FaceWarehouse [6] to evaluate our method. The CK+ database contains 593 sequences from 123 subjects; approximately 15% of the sequences are coded by a certified FACS coder at the peak frames. The MMI database contains more than 2800 samples, including static images and image sequences in frontal and profile views. FERA2015 contains about 140,000 images from 41 subjects, with intensities annotated for 5 AUs (6, 10, 12, 14, 17). The DISFA database consists of 27 sequences from 27 subjects, with around 130,000 frames annotated with AU intensity for 12 AUs (1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26).

Figure 5: Qualitative AU intensity estimation results (a) and transferred results (b). The results of DCLM [44], 3DDFA [56] and FEN [7] are taken from the original papers. (b) shows the AU parameters of the inputs transferred to another 3D model.

Rendered Data. We render 80,000 face images using Justice Face with random facial parameters. The facial parameters form a 269-dimensional vector consisting of the head pose p, AU parameters β and identity parameters α. The head pose spans a range of pitch and yaw angles, which we convert to lie within [0, 1]. We only consider pitch and yaw because the image is aligned in roll, which amounts to a simple 2D in-plane rotation. The initial head pose values represent a frontal view and are therefore set to 0.5. The identity parameters consist of 208 continuous parameters and 36 discrete parameters; the discrete parameters are encoded as a one-hot vector and concatenated with the continuous ones. The range of all facial parameters is from 0 to 1. The face samples are generated by sampling all parameters from a uniform distribution. 90% of the face samples are used for training the generator G, and the rest are used for validation.
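A sketch of the random parameter sampling used to create training data. The 2 + 23 split of the pose and AU dimensions follows the histogram layout described in Fig. 8, and treating the 36 discrete identity dimensions as a single one-hot block is our reading of the text; the exact grouping is not specified:

```python
import numpy as np

def sample_facial_parameters(rng=None):
    """Randomly sample one 269-D facial parameter vector (all values in [0, 1])."""
    rng = rng or np.random.default_rng()
    pose = rng.uniform(0.0, 1.0, size=2)        # pitch and yaw
    aus = rng.uniform(0.0, 1.0, size=23)        # AU intensities
    id_cont = rng.uniform(0.0, 1.0, size=208)   # continuous identity parameters
    id_disc = np.zeros(36)
    id_disc[rng.integers(0, 36)] = 1.0          # one-hot discrete identity choice
    return np.concatenate([pose, aus, id_cont, id_disc])   # shape (269,)
```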

4.2 Comparison with the state-of-the-art

Comparison with weakly and semi-supervised learning methods. We compare the proposed method with several state-of-the-art weakly-supervised learning methods (KBSS [47], KJRE [49] and CFLF [48]) and semi-supervised learning methods (LBA [15] and 2DC [27]). LBA [15] encourages cycle-consistent association chains from the embeddings of labeled samples to unlabeled samples, based on the assumption that samples with similar labels have similar latent features. 2DC [27] combines a variational auto-encoder and a Gaussian process for AU intensity estimation by jointly learning latent representations and classifying multiple ordinal outputs. KBSS [47] exploits four types of domain knowledge on AUs to train a deep model. KJRE [49] jointly learns the representation and the estimator with limited annotations by incorporating various types of human knowledge. CFLF [48] uses a learnable task-related context and two types of attention mechanisms to estimate AU intensity. Note that KBSS, CFLF and 2DC need labeled data to train their models. Unlike them, we propose a framework for facial AU intensity estimation without annotated AU data.

Figure 6: The results of different feature extractors. Using raw pixels or the features of a pre-trained VGG16 in the loss yields inferior results. The pre-trained ResNet50, the facial expression recognition ("exp-rec.") model "ResNet" and the face recognition ("face-rec.") model "EvoLVe" cannot distinguish input images with different expressions. Only the results of the facial segmentation network ("face-seg.") are similar to the real facial images.

Method            ICC    MAE
OpenFace [1]      .392   .421
CNN [13]*         .328   .423
ResNet18 [16]*    .270   .483
CCNN-IT [43]*     .377   .663
GE-Net            .641   .329

Table 2: Comparison with state-of-the-art supervised methods. (*) indicates results taken from the corresponding reference.

Intra-Class Correlation (ICC(3,1) [38]) and Mean Absolute Error (MAE) are adopted as the evaluation metrics. The quantitative results are listed in Table 1, and more quantitative results are given in the supplementary material. On the FERA2015 dataset, our method achieves superior average performance over the other methods on both metrics. On DISFA, our method achieves the best average performance on ICC and the second best on MAE. Note that ICC and MAE should be considered jointly when evaluating a method: LBA tends to predict the intensity as 0, which is the majority of AU intensities, so it achieves good MAE performance, but its ICC is much worse than ours. Our method outperforms KBSS and CFLF on DISFA and FERA2015; these two approaches need multiple frames as input during training, hence they require more than one image per face, which is not always available. Besides, the proposed GE-Net uses the attention mechanism in the feature extractor and achieves better results than GENet-O, the variant without the attention mechanism, which demonstrates the effectiveness of the attention mechanism. We also notice that our method performs poorly on AU 12 in DISFA. This is because Justice Face restricts the lip corner puller, whose movement is smaller than that of a real face so that the game characters keep elegant expressions. These results show the superior performance of the proposed method over existing weakly and semi-supervised learning methods.
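For reference, ICC(3,1) [38] between the ground-truth and predicted intensities of one AU can be computed from a two-way ANOVA decomposition with k = 2 raters. The sketch below is a generic implementation of the Shrout-Fleiss definition, not the authors' evaluation code:

```python
import numpy as np

def icc_3_1(y_true, y_pred):
    """ICC(3,1) between ground-truth and predicted AU intensities (k = 2 raters)."""
    data = np.stack([np.asarray(y_true, float), np.asarray(y_pred, float)], axis=1)
    n, k = data.shape
    mean_per_target = data.mean(axis=1)
    mean_per_rater = data.mean(axis=0)
    grand_mean = data.mean()
    bms = k * np.sum((mean_per_target - grand_mean) ** 2) / (n - 1)   # between targets
    rms = n * np.sum((mean_per_rater - grand_mean) ** 2) / (k - 1)    # between raters
    ems = (np.sum((data - grand_mean) ** 2)
           - (n - 1) * bms - (k - 1) * rms) / ((n - 1) * (k - 1))     # residual
    return (bms - ems) / (bms + (k - 1) * ems)
```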

Comparison with supervised learning methods. We compare our method with several supervised learning methods of AU intensity estimation, including OpenFace [1], ResNet18 [16], CCNN-IT [43] and CNN [13]. CNN [13] is composed of three convolutional layers and a fully connected layer. CCNN-IT [43] takes advantage of CNNs and data augmentation. ResNet18 is introduced in [16]. OpenFace [1] is based on histograms of oriented gradients and landmark locations. These supervised methods require annotating AU intensity of each frame in sequences while ours does not need the AU annotations. The results are shown in Table 2. Our method achieves better performance on both metrics.

Figure 7: Expression similarity matrices (plotted with the "JET" color map) for the losses between the real inputs and the rendered images. The x-axis and y-axis are the inputs and the rendered images, respectively. Ten different expressions are listed on the right. The smaller the value, the closer the distance. Only the segmentation loss is low on the diagonal of the matrix, so only it can be used to measure the differences.

Qualitative AU intensity estimation results. We also show the qualitative results which are rendered using Justice Face with the estimated facial parameters. Fig. 5(a) provides several samples of the rendered images of different methods. Our method can achieve more accurate estimation than the state-of-the-art methods based on 3DMM. As shown in Fig. 5(b), the AU parameters of the inputs are transferred to another 3D face model. Fig. 8 shows our additional qualitative results on the mentioned datasets and internet images. (More results are given in the supplementary material.)

4.3 Ablation study

We perform an ablation study to verify the effectiveness of the feature extractor. A baseline that computes the loss directly in raw pixel space is explored. In addition, instead of using the facial segmentation network, we use a pre-trained ResNet or VGG to extract image features. Fig. 6 shows the comparison results. The results using the raw pixel loss or the pre-trained VGG16 are rather inferior, with many artifacts, and the pre-trained ResNet50 cannot distinguish input images with different expressions. In contrast, our facial segmentation network is aware of facial shape changes and produces expressions similar to the input images.

Figure 8: Some visual results of GE-Net, where the input images, generated images, and the estimated AU parameters are demonstrated. In the parameter histogram, columns 0-1 correspond to the head pose and columns 2-24 correspond to AU parameters.

We further validate the effectiveness of the facial segmentation network by comparing it with the facial expression recognition model "ResNet" (https://github.com/WuJie1010/Facial-Expression-Recognition.Pytorch) and the face recognition model "EvoLVe" (https://github.com/ZhaoJ9014/face.evoLVe.PyTorch) [51]. A natural idea would be to use a facial expression recognition model to measure the expression similarity between the input photo and the neurally rendered one. However, due to the huge domain gap, existing pre-trained models fail to measure such similarity. The face recognition model also fails to distinguish different expressions, as illustrated in Fig. 6. As shown in Fig. 7, ten different expressions are selected to compute the losses on the FaceWarehouse dataset. We compare the expression recognition loss, the face recognition loss and the face segmentation loss using expression similarity matrices, which are normalized and shown in the JET style. A useful metric for expression similarity should have low values on the diagonal of the matrix, and only the segmentation loss meets this requirement well.

5 Conclusion

In this paper, we propose an unsupervised framework for estimating the head pose, AU parameters, and identity parameters from a real facial image without annotated AU data. We formulate the method as a differentiable optimization problem that minimizes the differences of the extracted features between the real facial image and the generated one. The loss can be back-propagated to the facial parameters because the generator is differentiable. Experimental results demonstrate that the proposed method fits accurate AU parameters and achieves state-of-the-art performance compared with existing methods. It also obtains accurate AU estimates on internet photos. In the future, we plan to collect a large number of face videos with various expressions from different views. Such a database can be labeled using our GE-Net, so that faster methods for AU intensity estimation can be explored. Meanwhile, developing a method more robust to large head pose changes is another direction.

References

  • [1] T. Baltrušaitis, M. Mahmoud, and P. Robinson (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In Automatic Face and Gesture Recognition, Vol. 6, pp. 1–6. Cited by: §1, §2.2, §4.2, Table 2.
  • [2] T. Baltrusaitis, P. Robinson, and L. Morency (2013) Constrained local neural fields for robust facial landmark detection in the wild. In International Conference on Computer Vision Workshops, pp. 354–361. Cited by: §3.2.
  • [3] J. C. Batista, V. Albiero, O. R. Bellon, and L. Silva (2017) AUMPNet: simultaneous action units detection and intensity estimation on multipose facial images using a single convolutional neural network. In Automatic Face & Gesture Recognition, pp. 866–871. Cited by: §1, §2.2.
  • [4] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Computer Graphics and Interactive Techniques, pp. 187–194. Cited by: §1, §2.1.
  • [5] C. Cao, Q. Hou, and K. Zhou (2014) Displaced dynamic expression regression for real-time facial tracking and animation. Transactions on Graphics 33 (4), pp. 43. Cited by: §2.1.
  • [6] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou (2013) Facewarehouse: a 3d facial expression database for visual computing. Transactions on Visualization and Computer Graphics 20 (3), pp. 413–425. Cited by: §A.1, §4.1.
  • [7] F. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni (2019) Deep, landmark-free fame: face alignment, modeling, and expression estimation. International Journal of Computer Vision 127 (6-7), pp. 930–956. Cited by: Figure 5.
  • [8] B. Chaudhuri, N. Vesdapunt, and B. Wang (2019) Joint face detection and facial motion retargeting for multiple faces. In Computer Vision and Pattern Recognition, pp. 9719–9728. Cited by: §2.1.
  • [9] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.3.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §3.3.
  • [11] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. (2018) Neural scene representation and rendering. Science 360 (6394), pp. 1204–1210. Cited by: §2.3.
  • [12] P. Ekman and W. V. Friesen (1978) Facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press, Palo Alto. Cited by: §1.
  • [13] A. Gudi, H. E. Tasli, T. M. Den Uyl, and A. Maroulis (2015) Deep learning based facs action unit occurrence and intensity estimation. In Automatic Face and Gesture Recognition (FG), Vol. 6, pp. 1–5. Cited by: §4.2, Table 2.
  • [14] Y. Guo, J. Cai, B. Jiang, J. Zheng, et al. (2018) Cnn-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. Pattern Analysis and Machine Intelligence 41 (6), pp. 1294–1307. Cited by: §2.1.
  • [15] P. Haeusser, A. Mordvintsev, and D. Cremers (2017) Learning by association–a versatile semi-supervised training method for neural networks. In Computer Vision and Pattern Recognition, pp. 89–98. Cited by: Table 1, §4.2.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognitionn, pp. 770–778. Cited by: §3.3, §4.2, Table 2.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.2.
  • [18] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Cited by: §A.1, §3.2.
  • [19] A. Jourabloo and X. Liu (2016) Large-pose face alignment via cnn-based dense 3d model fitting. In Computer Vision and Pattern Recognition, pp. 4188–4196. Cited by: §2.1.
  • [20] S. Kaltwang, O. Rudovic, and M. Pantic (2012) Continuous pain intensity estimation from facial expressions. In International Symposium on Visual Computing, pp. 368–377. Cited by: §2.2.
  • [21] S. Kaltwang, S. Todorovic, and M. Pantic (2015) Doubly sparse relevance vector machine for continuous facial behavior estimation. Pattern Analysis and Machine Intelligence 38 (9), pp. 1748–1761. Cited by: §2.2.
  • [22] S. Kaltwang, S. Todorovic, and M. Pantic (2015) Latent trees for estimating intensity of facial action units. In Computer Vision and Pattern Recognition, pp. 296–304. Cited by: §2.2.
  • [23] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In Computer Vision and Pattern Recognition, pp. 3907–3916. Cited by: §2.3.
  • [24] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang (2012) Interactive facial feature localization. In European Conference on Computer Vision, pp. 679–692. Cited by: §3.3.
  • [25] W. Li, F. Abtahi, Z. Zhu, and L. Yin (2017) Eac-net: a region-based deep enhancing and cropping approach for facial action unit detection. In International Conference on Automatic Face & Gesture Recognition, pp. 103–110. Cited by: §2.2.
  • [26] W. Li, F. Abtahi, and Z. Zhu (2017) Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In Computer Vision and Pattern Recognition, pp. 1841–1850. Cited by: §1, §2.2.
  • [27] D. Linh Tran, R. Walecki, S. Eleftheriadis, B. Schuller, M. Pantic, et al. (2017) DeepCoder: semi-parametric variational autoencoders for automatic facial action coding. In International Conference on Computer Vision, pp. 3190–3199. Cited by: Table 1, §4.2.
  • [28] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition-Workshops, pp. 94–101. Cited by: §1, §4.1.
  • [29] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn (2013) Disfa: a spontaneous facial action intensity database. Transactions on Affective Computing 4 (2), pp. 151–160. Cited by: §1, §4.1.
  • [30] D. McDuff, R. Kaliouby, T. Senechal, M. Amr, J. Cohn, and R. Picard (2013) Affectiva-mit facial expression dataset (am-fed): naturalistic and spontaneous facial expressions collected. In Computer Vision and Pattern Recognition Workshops, pp. 881–888. Cited by: §1.
  • [31] T. H. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang (2018) Rendernet: a deep convolutional network for differentiable rendering from 3d shapes. In Neural Information Processing Systems, pp. 7891–7901. Cited by: §2.3.
  • [32] X. Niu, H. Han, S. Yang, Y. Huang, and S. Shan (2019) Local relationship learning with person-specific shape regularization for facial action unit detection. In Computer Vision and Pattern Recognition, pp. 11917–11926. Cited by: §2.2.
  • [33] M. Pantic, M. Valstar, R. Rademaker, and L. Maat (2005) Web-based database for facial expression analysis. In International Conference on Multimedia and Expo, pp. 5–pp. Cited by: §1, §4.1.
  • [34] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3d face model for pose and illumination invariant face recognition. IEEE, Genova, Italy. Cited by: §3.4.
  • [35] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §3.2.
  • [36] O. Rudovic, V. Pavlovic, and M. Pantic (2014) Context-sensitive dynamic ordinal regression for intensity estimation of facial action units. Pattern Analysis and Machine Intelligence 37 (5), pp. 944–958. Cited by: §2.2.
  • [37] T. Shi, Y. Yuan, C. Fan, Z. Zou, Z. Shi, and Y. Liu (2019) Face-to-parameter translation for game character auto-creation. In International Conference on Computer Vision, pp. 161–170. Cited by: §1, §2.3, §3.1.
  • [38] P. E. Shrout and J. L. Fleiss (1979) Intraclass correlations: uses in assessing rater reliability.. Psychological bulletin 86 (2), pp. 420. Cited by: §4.2.
  • [39] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
  • [40] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt (2017) Mofa: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In International Conference on Computer Vision, pp. 1274–1283. Cited by: §2.3.
  • [41] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: Figure 3, §3.2.
  • [42] M. F. Valstar, T. Almaev, J. M. Girard, G. McKeown, M. Mehu, L. Yin, M. Pantic, and J. F. Cohn (2015) Fera 2015-second facial expression recognition and analysis challenge. In International Conference and Workshops on Automatic Face and Gesture Recognition, Vol. 6, pp. 1–8. Cited by: §4.1.
  • [43] R. Walecki, V. Pavlovic, B. Schuller, M. Pantic, et al. (2017) Deep structured learning for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 3405–3414. Cited by: §4.2, Table 2.
  • [44] A. Zadeh, T. Baltrušaitis, and L. Morency (2016) Deep constrained local models for facial landmark detection. arXiv preprint arXiv:1611.08657 3 (5), pp. 6. Cited by: Figure 5.
  • [45] A. Zadeh, Y. Chong Lim, T. Baltrusaitis, and L. Morency (2017) Convolutional experts constrained local model for 3d facial landmark detection. In International Conference on Computer Vision, pp. 2519–2528. Cited by: §3.2.
  • [46] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, and P. Liu (2013) A high-resolution spontaneous 3d dynamic facial expression database. In Automatic Face and Gesture Recognition, pp. 1–6. Cited by: §1.
  • [47] Y. Zhang, W. Dong, B. Hu, and Q. Ji (2018) Weakly-supervised deep convolutional neural network learning for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 2314–2323. Cited by: §1, §2.2, Table 1, §4.2.
  • [48] Y. Zhang, H. Jiang, B. Wu, Y. Fan, and Q. Ji (2019) Context-aware feature and label fusion for facial action unit intensity estimation with partially labeled data. In International Conference on Computer Vision, pp. 733–742. Cited by: Table 1, §4.2.
  • [49] Y. Zhang, B. Wu, W. Dong, Z. Li, W. Liu, B. Hu, and Q. Ji (2019) Joint representation and estimator learning for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 3457–3466. Cited by: Table 1, §4.2.
  • [50] Y. Zhang, R. Zhao, W. Dong, B. Hu, and Q. Ji (2018) Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation. In Computer Vision and Pattern Recognition, pp. 7034–7043. Cited by: §1, §2.2.
  • [51] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, et al. (2018) Towards pose invariant face recognition in the wild. In Computer Vision and Pattern Recognition, pp. 2207–2216. Cited by: §4.3.
  • [52] K. Zhao, W. Chu, and A. M. Martinez (2018) Learning facial action units from web images with scalable weakly supervised clustering. In Computer Vision and Pattern Recognition, pp. 2090–2099. Cited by: §2.2.
  • [53] R. Zhao, Q. Gan, S. Wang, and Q. Ji (2016) Facial expression intensity estimation using ordinal information. In Computer Vision and Pattern Recognition, pp. 3466–3474. Cited by: §1.
  • [54] Y. Zhou, J. Pi, and B. E. Shi (2017) Pose-independent facial action unit intensity regression based on multi-task deep transfer learning. In Automatic Face & Gesture Recognition, pp. 872–877. Cited by: §1, §2.2.
  • [55] Y. Zhou, J. Deng, I. Kotsia, and S. Zafeiriou (2019) Dense 3d face decoding over 2500fps: joint texture & shape convolutional mesh decoders. In Computer Vision and Pattern Recognition, pp. 1097–1106. Cited by: §2.1.
  • [56] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li (2016) Face alignment across large poses: a 3d solution. In Computer Vision and Pattern Recognition, pp. 146–155. Cited by: §2.1, Figure 5.
  • [57] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li (2015) High-fidelity pose and expression normalization for face recognition in the wild. In Computer Vision and Pattern Recognition, pp. 787–796. Cited by: §2.1.
  • [58] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li (2015) High-fidelity pose and expression normalization for face recognition in the wild. In Computer Vision and Pattern Recognition, pp. 787–796. Cited by: §2.1.
  • [59] X. Zhu, X. Liu, Z. Lei, and S. Z. Li (2017) Face alignment in full pose range: a 3d total solution. Pattern Analysis and Machine Intelligence, pp. 78–92. Cited by: §2.1.

Appendix A Appendix

A.1 More Experiments

Stability. We expect the predicted AU parameters to be stable across consecutive frames of a video. Such stability is crucial for applications like expression transfer, facial expression analysis, and human-robot interaction. To evaluate it, we assume that two consecutive frames of a video are almost identical, so stability can be measured by the differences between consecutive frames, computed as the standard deviation of each AU within fixed-size sliding windows of consecutive frames. We compute this standard deviation on ten video sequences with window sizes from 2 to 10 frames; the ten videos are downloaded from the internet. Our method outperforms OpenFace for all sliding-window sizes; the results for 5-frame windows are shown in Table 3. We also show the results for several frames of one video in Fig. 9.
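A minimal sketch of this stability metric (the averaging over windows is our interpretation of "the standard deviation ... in the fixed-size sliding windows"):

```python
import numpy as np

def sliding_window_std(au_sequence, window=5):
    """Standard deviation of each predicted AU within fixed-size sliding windows
    of consecutive frames, averaged over the windows.

    au_sequence: (T, num_aus) array of per-frame AU predictions.
    """
    au_sequence = np.asarray(au_sequence, dtype=float)
    T = au_sequence.shape[0]
    stds = [au_sequence[t:t + window].std(axis=0) for t in range(T - window + 1)]
    return np.mean(stds, axis=0)        # one stability value per AU
```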

AU        01 02 03 04 05 06 07 09 10 12 14 15 15-2
OpenFace  0.247 0.261 0.114 0.267 0.139 0.248 0.027 0.178 0.144 0.218 0.055
Ours      0.032 0.013 0.034 0.045 0.096 0.059 0.055 0.034 0.038 0.047 0.044 0.064 0.034
AU        17 18 20 23 25 27 28 29 34 45 45-2 47 48
OpenFace  0.017 0.162 0.077 0.226 0.221
Ours      0.042 0.054 0.091 0.073 0.067 0.071 0.053 0.050 0.102 0.124 0.024 0.060

Table 3: The standard deviation on the sliding windows. Bold numbers indicate the best performance.
Figure 9: A video expression transfer example. The upper row shows several frames of one video, and the bottom row shows the generated images.

Robustness to Varying Conditions. Our network uses facial segmentation features as the input of the loss, yielding results robust to changes in lighting, resolution, and style. To verify this, we add robustness experiments on the FaceWarehouse dataset [6], applying different lighting, Gaussian blur, and style transfer [18] to the images. Some results are shown in Fig. 10. We qualitatively demonstrate this robustness by varying the conditions for a single subject, and the outputs remain consistent. The first row shows the input images; the rendered images and the corresponding histograms of the facial AU parameters are shown in the second and third rows. We also notice that our method performs poorly on the upper lid raiser (column 4 in the histogram), because Justice Face restricts the upper lid raiser, whose movement is smaller than that of a real face (please refer to Fig. 11).

Figure 10: Robustness test. The network is robust to changes in lighting, resolution, and style. From top to bottom: the five input images, the rendered images, and the corresponding AU parameter histograms.

The action units used in our paper and their corresponding names are listed in Table 4 (read row by row). Each AU parameter is set to its maximum value, and the rendered image corresponding to each one, relative to the base face, is shown in Fig. 11.

Name                      AU     Name                      AU     Name                      AU
Left Eye Blink            45     Right Eye Blink           45-2   Upper Lid Raiser          5
Cheek Raiser              6      Inner Brow Raiser         1      Left Outer Brow Raiser    2
Right Outer Brow Raiser   3      Brow Lowerer              4      Mouth Stretch             27
Nose Wrinkler             9      Upper Lip Raiser          10     Down Lip Down             25
Lip Corner Puller         12     Left Mouth Press          14     Right Mouth Press         14-2
Lip Pucker                18     Lip Stretcher             20     Lip Upper Close           24
Lip Lower Close           17     Cheek Puff                34     Lip Corner Depressor      15
Jaw Left                  47     Jaw Right                 48

Table 4: The AUs and their corresponding names.

Comparison with OpenFace. We provide a comparative experiment to investigate the effectiveness of our GE-Net. The results are shown in Table 5. GE-Net achieves better average performance in both MAE and ICC and is better on many individual AUs, which demonstrates the effectiveness of the proposed framework.

MAE
CK+   AU        01   02   04   05   06   07   09   10   12   17   23   24   25   Avg
      OpenFace  .45  .41  .74  .33  .65  1.34 .58  .43  .46  .94  .56  .60  .63  .624
      GE-Net    .40  .27  .66  .27  .30  .30  .51  .14  .68  .77  .48  -    .39  .443
MMI   AU        01   02   04   05   06   07   09   10   12   17   23   25   26   Avg
      OpenFace  1.09 .86  .23  .68  .12  .21  .16  .27  .20  .60  .27  .46  .32  .421
      GE-Net    .12  .18  .23  .09  .11  .11  .17  .33  .56  .24  .324 .32  .10  .231

ICC
CK+   AU        01   02   04   05   06   07   09   10   12   17   23   24   25   Avg
      OpenFace  .36  .48  .24  .51  .29  .35  .51  .34  .58  .53  .48  .31  .61  .430
      GE-Net    .58  .53  .46  .78  .70  .71  .80  .36  .42  .75  .61  -    .63  .611
MMI   AU        01   02   04   05   06   07   09   10   12   17   23   25   26   Avg
      OpenFace  .50  .31  .24  .44  .43  .47  .31  .45  .38  .27  .48  .46  .32  .389
      GE-Net    .53  .48  .38  .63  .47  .53  .57  .49  .36  .48  .41  .60  .51  .495

Table 5: The MAE and ICC of OpenFace and GE-Net on CK+ and MMI. Bold numbers indicate the best performance.

Appendix B AU parameter expressions on the 3D facial model

There are slight differences between the AU parameters of the 3D facial model and the AUs defined in FACS (e.g., we separate the dimpler into left and right dimpler), as shown in Fig. 11.

Figure 11: The generated images with the corresponding AU parameters.

Appendix C Training samples of our generator

In the training process, we train our generator with randomly generated game faces, as shown in Fig. 12, which are determined by the head pose, AU parameters and identity parameters.

Figure 12: Training samples of our generator and the corresponding parameters.

Appendix D More examples of our method

More examples of our method are shown in Fig. 13 and Fig. 14. Since the FaceWarehouse dataset contains richer expressions, we show more examples from it.

Figure 13: More results of our method on MMI.
Figure 14: More results of our method on FaceWarehouse.