Explicit Facial Expression Transfer via Fine-Grained Semantic Representations

09/06/2019 ∙ Zhiwen Shao, et al. ∙ Shanghai Jiao Tong University, Deakin University

Facial expression transfer between two unpaired images is a challenging problem, as fine-grained expressions are typically entangled with other facial attributes such as identity and pose. Most existing methods treat expression transfer as an application of expression manipulation, and use predicted facial expressions, landmarks or action units (AUs) of a source image to guide the expression editing of a target image. However, the prediction of expressions, landmarks and especially AUs may be inaccurate, which limits the accuracy of transferring fine-grained expressions. Instead of using an intermediate estimated guidance, we propose to explicitly transfer expressions by directly mapping two unpaired images to two synthesized images with swapped expressions. Since each AU semantically describes local expression details, we can synthesize new images with preserved identities and swapped expressions by combining AU-free features with swapped AU-related features. To disentangle the images into AU-related features and AU-free features, we propose a novel adversarial training method which can solve the adversarial learning of multi-class classification problems. Moreover, to obtain reliable expression transfer results for the unpaired input, we introduce a swap consistency loss that makes the synthesized images and self-reconstructed images indistinguishable. Extensive experiments on the RaFD, MMI and CFD datasets show that our approach can generate photo-realistic expression transfer results between unpaired images that differ in gender, age, race and pose.


Introduction

Facial expression transfer aims at transferring the expression from a source image to a target image, such that the transformed target image has the source expression while preserving other attributes like identity, pose and texture. It has recently gained remarkable attention in the computer vision community [Ding, Sricharan, and Chellappa2018, Song et al.2018, Qiao et al.2018, Pumarola et al.2018]. However, preserving fine-grained details during expression transfer has remained a challenging problem in the literature, since fine-grained expressions are typically entangled with other facial attributes.

Recently, several facial expression manipulation methods have been proposed that can also be applied to expression transfer. Choi et al. [Choi et al.2018] treated each discrete expression as a separate domain, while Ding et al. [Ding, Sricharan, and Chellappa2018] modeled expression intensities to generate a wider range of expressions. An expression category only describes the overall facial emotion, which has a limited capacity to capture fine details. Song et al. [Song et al.2018] and Qiao et al. [Qiao et al.2018] exploited facial landmarks to guide the expression synthesis more finely. However, transforming the landmarks of a source image to adapt to a target image with a significantly different facial shape is difficult and may cause artifacts in the synthesized image.

Considering that each facial action unit (AU) [Ekman and Rosenberg1997] represents local muscle actions and can semantically describe fine-grained expression details, Pumarola et al. [Pumarola et al.2018] took AUs with continuously varying intensities as guidance to synthesize expressions. Nevertheless, only global AU features are learned, which limits the performance of editing local expressions. These methods all treat expression transfer as an application of expression manipulation, and require predicted expressions, landmarks or AUs of the source image to guide the expression editing of the target image. This offline prediction process is unnecessary and may degrade the performance of expression transfer due to inaccurate predictions.

To tackle the above limitations, we propose to explicitly transfer fine-grained expressions by directly mapping two unpaired input images to two synthesized images with swapped expressions. The AU intensity estimation process is integrated into our framework by supervising the learning of AU-related features and the inheritance of AU information from the source image. Figure 1 shows the overview of our framework. In particular, the input images are first disentangled into two semantic representations (AU-related and AU-free features) by a novel adversarial training method. Compared to the conventional adversarial loss [Goodfellow et al.2014] for binary classification problems only, our proposed adversarial training method can solve the adversarial learning of multi-class classification problems.

To capture fine expression details in each local region, we adopt an independent branch to extract a related local feature for each AU and then combine these features into the AU-related feature. After the feature disentanglement, the AU-related features of the two images are swapped and combined with the AU-free features to generate two new images with swapped expressions. To ensure reliable expression transfer for unpaired input, we introduce a swap consistency loss to make the generated images and self-reconstructed images indistinguishable. Another disentanglement-swap-generation process is further applied to the generated images to complete the crossed cycle. At test time, taking two unpaired images as input, our method automatically outputs two synthesized images with swapped expressions.

We refer to our framework, explicit expression transfer, as EET. The main contributions of this paper are summarized as follows. First, we propose a novel explicit facial expression transfer framework to transfer fine-grained expressions between two unpaired images. Second, we propose an adversarial training method to disentangle the AU-related feature and the AU-free feature, which can be applied to the adversarial learning of multi-class classification problems. Third, we introduce a swap consistency loss to ensure the reliability of expression transfer for unpaired input. Finally, extensive experiments on benchmark datasets demonstrate that our approach can generate photo-realistic expression transfer results between unpaired images with various expression appearance differences.

Related Work

We review previous techniques that are closely related to our work, in terms of facial expression manipulation and feature disentanglement.

Facial Expression Manipulation. Many facial expression manipulation methods resort to computer graphics techniques, including 2D or 3D image warping [Garrido et al.2014], flow mapping [Yang et al.2011] and image rendering [Yang et al.2012]. Although these approaches can often generate realistic images with high resolution, their elaborate yet complex pipelines are computationally expensive. Recently, some works exploited the prevailing generative adversarial networks (GANs) [Goodfellow et al.2014] to edit facial attributes including expressions.

Choi et al. [Choi et al.2018] proposed StarGAN, which can perform image-to-image translation for multiple domains using only a single model. It shows superiority in facial attribute transfer and expression synthesis, but only eight discrete emotional expressions were synthesized. Another work [Ding, Sricharan, and Chellappa2018] designed an Expression Generative Adversarial Network (ExprGAN) for expression editing with controllable expression intensities. To control finer details, Pumarola et al. [Pumarola et al.2018] utilized AUs as the guidance to synthesize expressions in a continuous domain. This approach allows controlling the intensity of each AU and combining several of them to synthesize an expression. However, only global AU features are learned for expression synthesis, which limits the performance of editing local details. Considering the geometry characteristics of expressions, Song et al. [Song et al.2018] and Qiao et al. [Qiao et al.2018] proposed geometry-guided GANs to generate expressions with the geometry formed by facial landmarks. Nevertheless, transforming source-image landmarks to match a target image with a significantly different facial shape is difficult and usually causes artifacts in the generated image. These methods all require the predictions of expressions, AUs or landmarks and cannot estimate them automatically.

Feature Disentanglement. Similar to previous works [Tran, Yin, and Liu2017, Lee et al.2018, Shu et al.2018], our method also uses feature disentanglement to factorize an image into different representations with GANs. Each disentangled representation is distinct and can be specialized for a certain task. Tran et al. [Tran, Yin, and Liu2017] proposed a disentangled representation learning GAN for pose-invariant face recognition. Lee et al. [Lee et al.2018] utilized the disentangled features to produce diverse and realistic images, and also employed a cyclic structure [Zhu et al.2017] to deal with unpaired training data. Shu et al. [Shu et al.2018] proposed a generative model which disentangles shape from appearance in an unsupervised manner; shape is represented as a deformation, and appearance is modeled in a deformation-invariant way. These methods enforce the disentangled feature to be close to a prior distribution, or exploit an opposite feature containing specific information to implicitly encourage the disentangled feature to discard that information. As a result, they have limited applicability. In contrast, our approach can disentangle representations for any multi-class classification problem by a novel adversarial training method.

Explicit Facial Expression Transfer

Overview

Given two unpaired input images $x_a$ and $x_b$, our main goal is to generate two new images with swapped facial expressions while preserving other original attributes like identity, pose and texture. AU intensity labels and identity labels of the two input images are provided, and pose labels are also available if the training set contains images with different poses. Taking $x_a$ as an example, $p_a = (p_a^{(1)}, \ldots, p_a^{(m)})$ denotes the intensities of its $m$ AUs, where $p_a^{(i)} \in [0, L]$, $i = 1, \ldots, m$, and $L$ represents the maximum intensity level. Its identity and pose labels are $y_a \in \{1, \ldots, n_{id}\}$ and $z_a \in \{1, \ldots, n_{pose}\}$, where $n_{id}$ and $n_{pose}$ are the numbers of identity classes and pose classes, respectively.

Figure 1 illustrates the overall architecture of our EET framework. During training, for two unpaired input images $x_a$ and $x_b$, we first utilize a feature encoder $F$ to extract their facial features $f_a$ and $f_b$, which contain rich information such as expression, identity, pose and texture. As shown in Figure 1(b), the framework then consists of a feature disentanglement process with an AU-related encoder $E^{au}$ and an AU-free encoder $E^{af}$, and an image generation process with a generator $G$. Specifically, $E^{au}$ and $E^{af}$ disentangle $f_a$ and $f_b$ into AU-related features $f_a^{au}, f_b^{au}$ and AU-free features $f_a^{af}, f_b^{af}$, respectively. $G$ further combines the AU-free features with the swapped AU-related features to generate two new images $\hat{x}_a$ and $\hat{x}_b$. Then, another disentanglement-swap-generation process is applied to generate the cross-cyclically reconstructed images $\bar{x}_a$ and $\bar{x}_b$. The key to our proposed explicit expression transfer lies in combining AU-free features with swapped AU-related features to generate photo-realistic images.
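To make the disentangle-swap-generate pass concrete, the following minimal PyTorch-style sketch swaps AU-related features between two images and concatenates them channel-wise before generation, as described above; the function and variable names are ours, not those of the released implementation.

```python
import torch

def transfer_expressions(F, E_au, E_af, G, x_a, x_b):
    """One disentangle-swap-generate pass for an unpaired image pair.

    F, E_au, E_af and G are assumed to be torch.nn.Module instances for the
    feature encoder, AU-related encoder, AU-free encoder and generator.
    """
    f_a, f_b = F(x_a), F(x_b)                 # shared facial features
    au_a, au_b = E_au(f_a), E_au(f_b)         # AU-related (expression) features
    af_a, af_b = E_af(f_a), E_af(f_b)         # AU-free (identity/pose/texture) features
    # Swap the AU-related features and concatenate along the channel dimension.
    x_a_hat = G(torch.cat([af_a, au_b], dim=1))   # x_a with the expression of x_b
    x_b_hat = G(torch.cat([af_b, au_a], dim=1))   # x_b with the expression of x_a
    return x_a_hat, x_b_hat
```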

Figure 2: The procedure of our proposed adversarial training. $D^{au}$ and $D^{id}$ are drawn with dotted lines to denote that their parameters are fixed in the second stage.

Expression Transfer

Explicit expression transfer requires the disentanglement of two semantic representations: the AU-related feature and the AU-free feature. To remove AU information from the AU-free feature, an alternative solution is to use the typical GAN with a two-player minimax game [Goodfellow et al.2014] to adversarially train the AU-free encoder $E^{af}$ and an AU discriminator $D^{au}$, so that $D^{au}$ cannot discriminate the AU attributes from the output of $E^{af}$. Since AU intensity estimation is a multi-label regression problem, we regard it as a multi-label multi-class classification problem by discretizing the AU intensity labels:

$\tilde{p}_a^{(i)} = \big[\, p_a^{(i)} \,\big], \quad i = 1, \ldots, m,$   (1)

where $[\cdot]$ denotes the operation of rounding a number to the nearest integer. However, the two-player minimax game is designed for binary classification problems, and cannot work for multi-class classification problems.
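The discretization of Eq. (1) amounts to a single rounding step; a minimal sketch is given below, where the maximum intensity level used as a default is our assumption rather than a value taken from the paper.

```python
import torch

def discretize_au_intensities(p, L=5):
    """Round continuous AU intensities in [0, L] to integer levels as in Eq. (1).

    p: tensor of shape (batch, num_aus) with continuous intensities;
    L = 5 is an assumed maximum intensity level, not taken from the paper.
    """
    return torch.round(p).clamp_(0, L).long()
```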

To solve this issue, we propose an adversarial training method with two stages, as illustrated in Figure 2. In the first training stage, our goal is to jointly train $E^{af}$, $E^{au}$, $D^{au}$ and $D^{id}$ so that $D^{au}$ and $D^{id}$ have certain abilities of classifying AU intensities and identities, respectively. Given $x_a$, we first obtain the AU-free feature $f_a^{af}$ and the AU-related feature $f_a^{au}$, which are further input to $D^{au}$ and $D^{id}$, respectively. $D^{au}$ outputs an $m(L+1)$-dimensional vector with an AU intensity discrimination loss $\mathcal{L}_{adis}$:

$\mathcal{L}_{adis} = \sum_{i=1}^{m} \sum_{j=0}^{L} \Big( D^{au}\big(f_a^{af}\big)_{ij} - \mathbb{1}\big[\tilde{p}_a^{(i)} = j\big] \Big)^2,$   (2)

where $D^{au}(f_a^{af})_{ij}$ denotes the $j$-th value of the $i$-th AU output by $D^{au}$, and $\mathbb{1}[\cdot]$ denotes the indicator function. $D^{au}$ is encouraged to output $1$ for the ground-truth intensity index of each AU while outputting $0$ for the remaining indexes, as visualized in Figure 2(a).
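A least-squares discrimination term against a one-hot target can be sketched as follows; the tensor shapes and the averaging over the batch are our assumptions.

```python
import torch.nn.functional as nnF

def au_discrimination_loss(d_out, p_discrete, L=5):
    """Least-squares AU intensity discrimination loss, a sketch of Eq. (2).

    d_out: (batch, m, L + 1) outputs of the AU discriminator on the AU-free
    feature; p_discrete: (batch, m) integer intensity labels from Eq. (1).
    """
    # Target is 1 at the ground-truth intensity index of each AU, 0 elsewhere.
    target = nnF.one_hot(p_discrete, num_classes=L + 1).float()
    return ((d_out - target) ** 2).mean()
```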

In the second training stage, we train $E^{af}$ and $E^{au}$ with the parameters of $D^{au}$ and $D^{id}$ fixed. To enforce $E^{af}$ to discard AU information, we define an AU intensity confusion loss as

$\mathcal{L}_{acon} = \sum_{i=1}^{m} \sum_{j=0}^{L} \Big( D^{au}\big(f_a^{af}\big)_{ij} - \frac{1}{L+1} \Big)^2,$   (3)

where $E^{af}$ is trained to make $D^{au}$ output the average probability $\frac{1}{L+1}$ for each intensity index. The least-squares loss employed in Eqs. (2) and (3) is beneficial for stable adversarial training [Mao et al.2017]. Since $f_a^{af}$ cannot be classified with AU intensities by $D^{au}$, it is AU-free. Meanwhile, we want $f_a^{au}$ to be free of identity information so that the identity will not be transferred from $x_a$ to $x_b$. Similarly, we can define an identity discrimination loss $\mathcal{L}_{idis}$ and an identity confusion loss $\mathcal{L}_{icon}$, which are used in the first and second training stages, respectively. The adversarial training between $E^{au}$ and $D^{id}$ encourages $f^{au}$ to be identity-free.
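The corresponding confusion term, under the same shape assumptions as the previous sketch, simply pushes every intensity bin of the frozen discriminator toward the uniform value; in the second stage only the encoder parameters would receive gradients.

```python
def au_confusion_loss(d_out, L=5):
    """Least-squares AU intensity confusion loss, a sketch of Eq. (3).

    The AU-free encoder is updated so that every intensity bin of the (frozen)
    AU discriminator receives the uniform value 1 / (L + 1).
    """
    uniform = 1.0 / (L + 1)
    return ((d_out - uniform) ** 2).mean()
```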

Moreover, to ensure $f^{au}$ contains AU-related information, an AU intensity estimation loss $\mathcal{L}_{est}$ is applied to $E^{au}$. To capture fine-grained expression details in each local region, $E^{au}$ uses an independent branch to extract a related local feature $f^{au}_{(i)}$ for each AU separately. The top of each branch in $E^{au}$ is a convolutional layer followed by a global average pooling [Lin, Chen, and Yan2014] layer and a one-dimensional fully-connected layer, in which the convolutional layer outputs the local feature $f^{au}_{(i)}$ and a sigmoid function is applied to the fully-connected layer to predict the normalized intensity $\hat{p}^{(i)} \in [0, 1]$. Following the loss weighting strategy [Shao et al.2018] to suppress the data imbalance issue, we introduce a weighted L2 loss for $\mathcal{L}_{est}$:

$\mathcal{L}_{est} = \sum_{i=1}^{m} w_i \Big( \hat{p}^{(i)} - \frac{p^{(i)}}{L} \Big)^2,$   (4)

where $w_i$ is the weight parameter determined by $o_i$, the occurrence rate of the $i$-th AU in the training set, in which an AU with intensity greater than a threshold is treated as occurring and non-occurring otherwise. By integrating the local features of all the AUs, we can obtain a fine-grained semantic representation $f^{au}$:

$f^{au} = f^{au}_{(1)} \oplus f^{au}_{(2)} \oplus \cdots \oplus f^{au}_{(m)},$   (5)

where $\oplus$ denotes element-wise sum.
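A sketch of the per-AU branch head and the weighted estimation loss is shown below; the channel size, kernel size and weight values are hypothetical, and only the branch structure (conv, global average pooling, 1-D FC with sigmoid) and the element-wise sum follow the description above.

```python
import torch
import torch.nn as nn

class AUBranchHead(nn.Module):
    """Per-AU branches (conv -> global average pooling -> 1-D FC with sigmoid).

    A sketch of the multi-branch head described above; channel and kernel
    sizes are assumptions, not the paper's configuration.
    """
    def __init__(self, in_ch=256, num_aus=12):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv2d(in_ch, in_ch, 3, padding=1) for _ in range(num_aus)])
        self.fcs = nn.ModuleList([nn.Linear(in_ch, 1) for _ in range(num_aus)])

    def forward(self, f):
        feats, preds = [], []
        for conv, fc in zip(self.convs, self.fcs):
            local = conv(f)                          # AU-specific local feature
            pooled = local.mean(dim=(2, 3))          # global average pooling
            preds.append(torch.sigmoid(fc(pooled)))  # normalized intensity in [0, 1]
            feats.append(local)
        au_feature = torch.stack(feats).sum(dim=0)   # element-wise sum, Eq. (5)
        return au_feature, torch.cat(preds, dim=1)   # (B, C, H, W), (B, num_aus)

def weighted_l2_au_loss(pred, target, w):
    """Weighted L2 AU intensity estimation loss, a sketch of Eq. (4);
    `target` holds intensities normalized to [0, 1] and `w` per-AU weights."""
    return (w * (pred - target) ** 2).sum(dim=1).mean()
```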

As shown in Figure 1, two new images $\hat{x}_a$ and $\hat{x}_b$ with swapped expressions are generated through $G$: $\hat{x}_a = G(f_a^{af}, f_b^{au})$ and $\hat{x}_b = G(f_b^{af}, f_a^{au})$, in which the channels of the AU-free feature and the AU-related feature are concatenated as input to $G$. Since there are no ground-truth expression transfer results for two unpaired images, we introduce a swap consistency loss $\mathcal{L}_{sc}$ to ensure the reliability of the generated images:

$\mathcal{L}_{sc} = \big( D^{sc}(x_b^{rec}) - 1 \big)^2 + \big( D^{sc}(\hat{x}_b) \big)^2,$   (6)

where $\mathcal{L}_{sc}$ encourages $\hat{x}_b$ and the self-reconstructed image $x_b^{rec} = G(f_b^{af}, f_b^{au})$ to be indistinguishable with a swap consistency discriminator $D^{sc}$. To constrain $\hat{x}_b$ to have the transferred expression from $x_a$ while preserving other attributes, we apply $\mathcal{L}_{est}$ in Eq. (4) to $\hat{x}_b$ with label $p_a$. Besides, the facial feature of $\hat{x}_b$ is input to an attribute constraint module $C$ which outputs an $n_{id}$-dimensional vector $c$ and an $n_{pose}$-dimensional vector $c'$. We formulate the attribute constraint loss as

$\mathcal{L}_{attr} = -\sum_{j=1}^{n_{id}} \mathbb{1}[y_b = j]\, \log \sigma(c)_j \;-\; \lambda_{pose} \sum_{k=1}^{n_{pose}} \mathbb{1}[z_b = k]\, \log \sigma(c')_k,$   (7)

where $\lambda_{pose}$ is a hyper-parameter for the trade-off between the first term of identity classification and the second term of pose classification, $c_j$ and $c'_k$ denote the $j$-th identity value and the $k$-th pose value respectively, and $\sigma$ denotes a softmax function.
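The adversarial pairing behind the swap consistency loss in Eq. (6) can be sketched as two least-squares terms; splitting it into discriminator-side and generator-side losses is our assumption about how the objective would be optimized in practice.

```python
def swap_consistency_losses(D_sc, x_self_rec, x_swapped):
    """Least-squares swap consistency terms, a sketch around Eq. (6).

    D_sc tries to separate self-reconstructions from swapped-expression
    outputs, while the generation branch is trained to make them
    indistinguishable; the d_loss / g_loss split is our assumption.
    """
    d_loss = ((D_sc(x_self_rec) - 1) ** 2).mean() + (D_sc(x_swapped.detach()) ** 2).mean()
    g_loss = ((D_sc(x_swapped) - 1) ** 2).mean()
    return d_loss, g_loss
```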

Full Objective

To facilitate the disentanglement-swap-generation process, a reconstruction loss [Lee et al.2018] is employed:

$\mathcal{L}_{rec} = \| x_a^{rec} - x_a \|_1 + \| \bar{x}_a - x_a \|_1,$   (8)

where the first and second terms constrain the self-reconstruction and the cross-cycle reconstruction, respectively. To make the synthesized images look real and indistinguishable from the original images, we impose an image adversarial loss $\mathcal{L}_{adv}$ with an image discriminator $D^{im}$:

$\mathcal{L}_{adv} = \big( D^{im}(x_a) - 1 \big)^2 + \big( D^{im}(\hat{x}_a) \big)^2.$   (9)

For stable adversarial training, we use the least-squares loss [Mao et al.2017] to train $D^{sc}$ and $D^{im}$.
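A minimal sketch of the L1 reconstruction terms and the least-squares image adversarial terms, under the same generator/discriminator naming assumptions as above:

```python
import torch.nn.functional as nnF

def reconstruction_loss(x, x_self_rec, x_cross_cycle):
    """L1 self-reconstruction plus cross-cycle reconstruction, a sketch of Eq. (8)."""
    return nnF.l1_loss(x_self_rec, x) + nnF.l1_loss(x_cross_cycle, x)

def lsgan_image_losses(D_im, x_real, x_fake):
    """Least-squares image adversarial terms, a sketch of Eq. (9)."""
    d_loss = ((D_im(x_real) - 1) ** 2).mean() + (D_im(x_fake.detach()) ** 2).mean()
    g_loss = ((D_im(x_fake) - 1) ** 2).mean()
    return d_loss, g_loss
```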

In our EET framework, the losses introduced above are applied to both input images so that expressions can be mutually transferred. During the first training stage, we jointly train $F$, $E^{au}$, $E^{af}$, $D^{au}$, $D^{id}$ and $C$. The full objective function is formulated as

$\mathcal{L}_{(1)} = \mathcal{L}_{est} + \lambda_{adis}\, \mathcal{L}_{adis} + \lambda_{idis}\, \mathcal{L}_{idis} + \lambda_{attr}\, \mathcal{L}_{attr},$   (10)

where $\mathcal{L}_{attr}$ is applied to the input images to train the attribute constraint module $C$. During the second training stage, we train the overall framework shown in Figure 1 by fixing the parameters of $D^{au}$, $D^{id}$ and $C$. The full objective function is formulated as

$\mathcal{L}_{(2)} = \mathcal{L}_{adv} + \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{sc}\, \mathcal{L}_{sc} + \lambda_{est}\, \mathcal{L}_{est} + \lambda_{acon}\, \mathcal{L}_{acon} + \lambda_{icon}\, \mathcal{L}_{icon} + \lambda_{attr}\, \mathcal{L}_{attr},$   (11)

where $\mathcal{L}_{est}$ is imposed on $f^{au}$ and on the generated images for supervising the learning of the AU-related feature and the inheritance of expression from the source image respectively, and the hyper-parameters $\lambda$ weigh the importance of each loss term. At inference time, the two input images are simply disentangled and swapped to synthesize new images with swapped expressions.
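The second-stage objective can be assembled as a weighted sum of the individual terms; the set of terms and the weight names below follow our reconstruction of Eq. (11), not released settings.

```python
def stage2_objective(losses, lam):
    """Weighted sum of the second-stage loss terms, a sketch of Eq. (11).

    `losses` maps loss names to scalar tensors; the weights `lam` are
    hyper-parameters whose values are not given here.
    """
    return (losses["adv"] + lam["rec"] * losses["rec"] + lam["sc"] * losses["sc"]
            + lam["est"] * losses["est"] + lam["acon"] * losses["acon"]
            + lam["icon"] * losses["icon"] + lam["attr"] * losses["attr"])
```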

Figure 3: Expression transfer results of our method EET and its variants for two pairs of RaFD images with different expressions. $\hat{x}_a$ and $\hat{x}_b$ denote the generated images with swapped expressions.

Experiments

Datasets and Settings

Datasets. We evaluate our framework on three facial expression benchmark datasets: RaFD [Langner et al.2010], MMI [Pantic et al.2005, Valstar and Pantic2010] and CFD [Ma, Correll, and Wittenbrink2015]. RaFD and CFD do not contain AU labels, and MMI only annotates a few AUs for a small set of images. To enable training and testing on these datasets, we employ a powerful AU recognition library, OpenFace [Baltrušaitis, Mahmoud, and Robinson2015], to annotate continuous intensities of 12 AUs (AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25 and AU26) for each image of these datasets.

  • RaFD consists of 67 subjects displaying 8 facial expressions with 3 gaze directions and 5 camera angles, which exhibits diverse expression variations. We select the samples recorded by three of the camera angles to evaluate expression transfer between images with pose differences. These images are randomly partitioned by subject into a training set and a test set.

  • MMI contains both videos and images of multiple subjects. We remove the videos recorded under undesirable conditions such as severe side views and very low quality, and keep the remaining images. These images are randomly partitioned by subject into a training set and a test set.

  • CFD includes images of a large number of subjects, in which a subset of subjects is represented with neutral, happy, angry and fearful expressions, and the remaining subjects only have a neutral expression. We randomly divide these images by subject into a training set and a test set.

Implementation Details.

We utilize PyTorch to implement our EET framework, which comprises the feature encoder $F$, the AU-related encoder $E^{au}$, the AU-free encoder $E^{af}$, the generator $G$, the AU discriminator $D^{au}$, the identity discriminator $D^{id}$, the attribute constraint module $C$, the swap consistency discriminator $D^{sc}$ and the image discriminator $D^{im}$. $G$ contains up-sampling residual blocks [Zheng, Cham, and Cai2019], and $D^{sc}$ and $D^{im}$ are based on the structure of PatchGAN [Isola et al.2017]. Other modules are mainly composed of stacked convolutional layers as proposed by VGGNet [Simonyan and Zisserman2015]. To obtain stable adversarial training, we apply Spectral Normalization [Miyato et al.2018] to each convolutional layer and deconvolutional layer in the encoders, the generator and the discriminators.
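Applying Spectral Normalization to every convolutional and deconvolutional layer can be done by recursively wrapping the relevant modules, as in the sketch below; which sub-networks to wrap follows the description above.

```python
import torch.nn as nn

def add_spectral_norm(module):
    """Recursively wrap every conv / deconv layer with Spectral Normalization
    [Miyato et al. 2018]."""
    for name, child in module.named_children():
        if isinstance(child, (nn.Conv2d, nn.ConvTranspose2d)):
            setattr(module, name, nn.utils.spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module
```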

In our experiments, each image is cropped and further randomly mirrored for data augmentation. The number of AUs $m$ is 12, and the maximum intensity level $L$ is 5, matching the OpenFace intensity scale. The hyper-parameters of the different loss terms are set empirically; if the training set involves only one pose, the pose weight $\lambda_{pose}$ is set to 0. We employ the Adam solver [Kingma and Ba2015] with stage-specific momentum parameters $\beta_1$, $\beta_2$ and initial learning rates for the first and second training stages. For each stage, the learning rate is kept unchanged during the first half of the training epochs and linearly decayed at each epoch during the remaining half. We use all the RaFD, MMI and CFD training images in the first stage, and then use each dataset to train an individual model, with the number of training epochs set separately for RaFD, MMI and CFD and kept the same for the first and second stages.
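The learning-rate schedule (constant for the first half of the epochs, linear decay to zero over the second half) can be sketched with a standard PyTorch scheduler; the function name is ours.

```python
from torch.optim.lr_scheduler import LambdaLR

def half_constant_half_linear(optimizer, num_epochs):
    """Keep the learning rate fixed for the first half of the epochs, then
    decay it linearly to zero over the remaining half (stepped per epoch)."""
    half = num_epochs // 2
    def factor(epoch):
        if epoch < half:
            return 1.0
        return max(0.0, 1.0 - (epoch - half + 1) / float(num_epochs - half))
    return LambdaLR(optimizer, lr_lambda=factor)
```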

Evaluation Metrics. To quantitatively evaluate our method for expression transfer, we use Pearson's correlation coefficient (PCC) and mean square error (MSE) to measure the correlation and difference between the AU intensities of source images and generated target images, respectively. The average results of PCC and MSE over all AUs (Avg) are also shown. Besides, we evaluate identity preservation by conducting face verification between real images and generated images, with accuracy and true accept rate at a fixed false accept rate (TAR@FAR) reported.
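The PCC and MSE metrics over AU intensities can be computed per AU and averaged, as in the following sketch (the array layout is an assumption).

```python
import numpy as np

def au_transfer_metrics(src_int, gen_int):
    """Per-AU PCC and MSE between source AU intensities and those of the
    generated target images (rows: images, columns: AUs), plus their averages."""
    pcc = np.array([np.corrcoef(src_int[:, i], gen_int[:, i])[0, 1]
                    for i in range(src_int.shape[1])])
    mse = ((src_int - gen_int) ** 2).mean(axis=0)
    return pcc, mse, pcc.mean(), mse.mean()
```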

AU 1 2 4 5 6 9 12 15 17 20 25 26 Avg
PCC StarGAN 0.27 0.23 0.21 0.38 0.15 0.35 0.40 0.09 0.12 0.09 0.23 0.17 0.22
HomoInterpGAN 0.33 0.35 0.11 0.38 0.16 0.29 0.40 0.05 0.13 0.07 0.30 0.19 0.23
EET 0.37 0.40 0.25 0.47 0.28 0.39 0.37 0.23 0.31 0.11 0.37 0.18 0.31
MSE StarGAN 1.23 0.75 1.35 1.20 0.81 0.71 1.04 0.40 0.42 0.68 1.68 0.66 0.91
HomoInterpGAN 0.92 0.62 1.43 1.24 0.74 0.75 1.02 0.46 0.40 0.75 1.66 0.64 0.89
EET 1.07 0.61 1.35 1.13 0.75 0.69 1.15 0.52 0.34 0.70 1.48 0.67 0.87
Table 1: Quantitative results of expression transfer for StarGAN, HomoInterpGAN and our EET, respectively. We compute the PCC (higher is better) and MSE (lower is better) results for AUs between source images and generated target images.

Ablation Study

In this section, we evaluate the main loss terms in our framework. We design a baseline method B-Net using the same architecture and two training stages as EET but with only the AU intensity estimation loss $\mathcal{L}_{est}$, the reconstruction loss $\mathcal{L}_{rec}$ and the image adversarial loss $\mathcal{L}_{adv}$. Other variants of EET are built on B-Net by further adding the attribute constraint loss $\mathcal{L}_{attr}$, and additionally all of $\mathcal{L}_{adis}$, $\mathcal{L}_{acon}$, $\mathcal{L}_{idis}$ and $\mathcal{L}_{icon}$, denoted as BC-Net and BCA-Net, respectively. We also implement a variant named EET-Single that uses a single branch for all the AUs in the structure of $E^{au}$, in which the single branch outputs a shared feature map followed by a global average pooling layer and an $m$-dimensional fully-connected layer. Their results for two example pairs of RaFD images are illustrated in Figure 3.

We can see that B-Net transfers the entire face from one input image to the other rather than only the expression, and BC-Net still fails to preserve the identity during expression transfer even after applying the attribute constraint loss $\mathcal{L}_{attr}$. Although B-Net and BC-Net employ the commonly used reconstruction loss to obtain good self-reconstruction results, there are many artifacts in their expression transfer results. When using our proposed adversarial training method to disentangle the AU-related feature and the AU-free feature, BCA-Net significantly improves the performance of expression transfer, but expressions are not completely swapped and a few artifacts appear around the mouth regions, especially for images with large pose differences. By employing the swap consistency loss to make expression transfer and self-reconstruction indistinguishable, our EET generates photo-realistic expression transfer results while preserving the original identities for unpaired images with different genders and poses.

When using a single branch for $E^{au}$, we can observe that the image generated by EET-Single for the first pair still looks a little happy rather than angry, and the closed lips look unnatural. In addition, EET generates more realistic mouth regions than EET-Single for the second pair. This demonstrates that a single branch extracting global AU features has a limited capacity to capture fine-grained expression details. We also show the expression transfer results of our EET on example images of MMI and CFD in Figure 4. It can be seen that our method is capable of transferring fine details such as glances and local muscle actions. Although MMI images are a little blurry and may be partially occluded by eyeglasses, our approach can successfully transfer expressions and synthesize high-quality images. The results on CFD show that our EET is robust to unpaired images with different ages and races.

Figure 4: Expression transfer results using our method on MMI and CFD. Some MMI images are a little blurry and partially occluded, and the CFD subjects differ in race.
Figure 5: Comparison of expression transfer results using StarGAN, HomoInterpGAN and our EET for six pairs of RaFD images, three pairs each in (a) and (b), covering different expression combinations.

Comparison with State-of-the-Art Methods

To validate our framework, we compare it against two state-of-the-art facial expression manipulation methods, StarGAN [Choi et al.2018] and HomoInterpGAN [Chen et al.2019], by running their released code on the RaFD dataset. Since the combination of AUs with certain intensities describes an expression, multiple AUs with continuously varying intensities can represent an infinite number of expressions. Thus, StarGAN and HomoInterpGAN, which are designed for discrete expressions, cannot process continuously varying AUs. In our experiments, StarGAN and HomoInterpGAN are trained using the expression labels, in which HomoInterpGAN directly transfers the expression from a source image to a target image and StarGAN generates a new target image given the expression label of a source image. Note that a few other works with released code, such as ExprGAN [Ding, Sricharan, and Chellappa2018] and GANimation [Pumarola et al.2018], are not compared, because their code hardly generates valid results when trained on RaFD.

Qualitative Results. Figure 5 presents the expression transfer results of these methods between unpaired RaFD images with both identical poses and different poses. It can be observed that the results of StarGAN for some pairs are a little blurry with shadows, especially around the lips. The expressions are only partially transferred; for example, the generated target image of the first pair in Figure 5(a) shows mixed expressions of fear and disgust. Although HomoInterpGAN generates images with higher quality than StarGAN, the identities of the target images are significantly changed. For example, the gender of the transformed target image of the third pair in Figure 5(a) appears to change from male to female. By contrast, our method EET is able to transfer expressions and generate realistic images while preserving the original identity information. Moreover, EET has a stronger ability to transfer fine expression details than StarGAN and HomoInterpGAN, such as lifting one side of the lip corners, as illustrated by the first pair in Figure 5(b). For unpaired images with large pose differences, our approach can generate high-quality images by disentangling fine-grained expressions from other attributes including poses.

Quantitative Results. Besides the above visual comparisons, we also quantitatively evaluate our method in terms of both expression transfer and identity preservation on RaFD. For each image in the test set, we randomly select images from other subjects to form unpaired input pairs. StarGAN, HomoInterpGAN and our EET are each employed to generate two new images with swapped expressions for every input pair. To perform a reliable and fair evaluation of expression transfer, we use OpenFace to annotate continuous AU intensities for the generated images. Table 1 shows the PCC and MSE between the AU intensities of the original source images and the generated target images. It can be observed that our EET outperforms the other methods, especially in terms of PCC, which demonstrates the effectiveness of our method for transferring fine-grained AU details.

To evaluate the preservation of identity, we randomly select pairs of RaFD test images from the same subjects as well as pairs from different subjects. We replace one real image of each pair with a synthesized image with a changed expression for every method, in which the synthesized image should preserve the identity information of the real image. We utilize the state-of-the-art face recognition model in the public library dlib [King2009] to perform face verification by determining whether two images belong to the same subject. Table 2 presents the accuracy and TAR@FAR of face verification for these pairs of images. "Real" denotes that each pair includes two real images, which gives the upper-bound face verification results. We can see that our method significantly outperforms StarGAN and HomoInterpGAN, especially in terms of TAR@FAR. Our method achieves an accuracy (96.63%) comparable to Real (98.23%), which indicates that our method can preserve identity well during the process of expression transfer.

Method Accuracy TAR@FAR
StarGAN 95.63 87.16
HomoInterpGAN 70.82 17.02
EET 96.63 90.90
Real 98.23 97.86
Table 2: Quantitative results (%) of identity preservation.
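The face verification protocol above can be approximated with the following sketch, which assumes the `face_recognition` Python wrapper around dlib rather than the exact model configuration used in the paper; the distance threshold is the wrapper's common default, not a value reported here.

```python
import face_recognition  # community Python wrapper around dlib's face recognition model

def same_subject(path_a, path_b, threshold=0.6):
    """Decide whether two face images belong to the same subject by thresholding
    the distance between dlib's 128-D face embeddings."""
    enc_a = face_recognition.face_encodings(face_recognition.load_image_file(path_a))[0]
    enc_b = face_recognition.face_encodings(face_recognition.load_image_file(path_b))[0]
    return face_recognition.face_distance([enc_a], enc_b)[0] < threshold
```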

Conclusion

In this paper, we have proposed an EET framework to explicitly transfer fine-grained expressions by directly mapping the unpaired input images to two synthesized images with swapped expressions. We have also proposed an adversarial training method to disentangle AU-related features and AU-free features, which can solve the adversarial learning of multi-class classification problems. Moreover, we have introduced a swap consistency loss to ensure the reliability of expression transfer. Extensive experiments show that our approach can generate photo-realistic expression transfer results while preserving identity information, even in the presence of challenging expression appearance differences.

References

  • [Baltrušaitis, Mahmoud, and Robinson2015] Baltrušaitis, T.; Mahmoud, M.; and Robinson, P. 2015. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, volume 6, 1–6. IEEE.
  • [Chen et al.2019] Chen, Y.-C.; Xu, X.; Tian, Z.; and Jia, J. 2019. Homomorphic latent space interpolation for unpaired image-to-image translation. In IEEE Conference on Computer Vision and Pattern Recognition, 2408–2416. IEEE.
  • [Choi et al.2018] Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; and Choo, J. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In IEEE Conference on Computer Vision and Pattern Recognition, 8789–8797. IEEE.
  • [Ding, Sricharan, and Chellappa2018] Ding, H.; Sricharan, K.; and Chellappa, R. 2018. Exprgan: Facial expression editing with controllable expression intensity. In AAAI Conference on Artificial Intelligence, 6781–6788.
  • [Ekman and Rosenberg1997] Ekman, P., and Rosenberg, E. L. 1997. What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA.
  • [Garrido et al.2014] Garrido, P.; Valgaerts, L.; Rehmsen, O.; Thormahlen, T.; Perez, P.; and Theobalt, C. 2014. Automatic face reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 4217–4224. IEEE.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
  • [Isola et al.2017] Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134. IEEE.
  • [King2009] King, D. E. 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10:1755–1758.
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • [Langner et al.2010] Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D. H.; Hawk, S. T.; and Van Knippenberg, A. 2010. Presentation and validation of the radboud faces database. Cognition and emotion 24(8):1377–1388.
  • [Lee et al.2018] Lee, H.-Y.; Tseng, H.-Y.; Huang, J.-B.; Singh, M.; and Yang, M.-H. 2018. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, 36–52. Springer.
  • [Lin, Chen, and Yan2014] Lin, M.; Chen, Q.; and Yan, S. 2014. Network in network. In International Conference on Learning Representations.
  • [Ma, Correll, and Wittenbrink2015] Ma, D. S.; Correll, J.; and Wittenbrink, B. 2015. The chicago face database: A free stimulus set of faces and norming data. Behavior research methods 47(4):1122–1135.
  • [Mao et al.2017] Mao, X.; Li, Q.; Xie, H.; Lau, R. Y.; Wang, Z.; and Smolley, S. P. 2017. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, 2813–2821. IEEE.
  • [Miyato et al.2018] Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
  • [Pantic et al.2005] Pantic, M.; Valstar, M.; Rademaker, R.; and Maat, L. 2005. Web-based database for facial expression analysis. In IEEE International Conference on Multimedia and Expo, 5 pp. IEEE.
  • [Pumarola et al.2018] Pumarola, A.; Agudo, A.; Martinez, A. M.; Sanfeliu, A.; and Moreno-Noguer, F. 2018. Ganimation: Anatomically-aware facial animation from a single image. In European Conference on Computer Vision, 818–833. Springer.
  • [Qiao et al.2018] Qiao, F.; Yao, N.; Jiao, Z.; Li, Z.; Chen, H.; and Wang, H. 2018. Emotional facial expression transfer from a single image via generative adversarial nets. Computer Animation and Virtual Worlds 29(3-4):e1819.
  • [Shao et al.2018] Shao, Z.; Liu, Z.; Cai, J.; and Ma, L. 2018. Deep adaptive attention for joint facial action unit detection and face alignment. In European Conference on Computer Vision, 725–740. Springer.
  • [Shu et al.2018] Shu, Z.; Sahasrabudhe, M.; Alp Guler, R.; Samaras, D.; Paragios, N.; and Kokkinos, I. 2018. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In European Conference on Computer Vision, 650–665. Springer.
  • [Simonyan and Zisserman2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • [Song et al.2018] Song, L.; Lu, Z.; He, R.; Sun, Z.; and Tan, T. 2018. Geometry guided adversarial facial expression synthesis. In ACM international conference on Multimedia, 627–635. ACM.
  • [Tran, Yin, and Liu2017] Tran, L.; Yin, X.; and Liu, X. 2017. Disentangled representation learning gan for pose-invariant face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 1283–1292. IEEE.
  • [Valstar and Pantic2010] Valstar, M., and Pantic, M. 2010. Induced disgust, happiness and surprise: an addition to the mmi facial expression database. In International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect,  65. Paris, France.
  • [Yang et al.2011] Yang, F.; Wang, J.; Shechtman, E.; Bourdev, L.; and Metaxas, D. 2011. Expression flow for 3d-aware face component transfer. ACM Transactions on Graphics 30(4):60.
  • [Yang et al.2012] Yang, F.; Bourdev, L.; Shechtman, E.; Wang, J.; and Metaxas, D. 2012. Facial expression editing in video using a temporally-smooth factorization. In IEEE Conference on Computer Vision and Pattern Recognition, 861–868. IEEE.
  • [Zheng, Cham, and Cai2019] Zheng, C.; Cham, T.-J.; and Cai, J. 2019. Pluralistic image completion. In IEEE Conference on Computer Vision and Pattern Recognition, 1438–1447. IEEE.
  • [Zhu et al.2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2223–2232. IEEE.