Over the years, facial expression synthesis has drawn considerable attention in both computer vision and computer graphics. However, synthesizing fine-grained facial images with a desired expression in an easy-to-use manner remains challenging because of the complexity of the task. Recently, generative adversarial networks [Goodfellow2014Generative, Mirza2014Conditional] have shed light on image synthesis, bringing significant advances through well-known architectures such as [Choi_2018_CVPR_stargan, Zhang2018sagenerative, liu2019stgan, he2019attgan]. However, these works struggle with fine-grained expression editing because they either rely on a few binary emotion labels (e.g., smiling, mouth open) to synthesize target expressions, or suffer from limited image naturalness and low quality.
As one of the most successful generative models, GANimation [Pumarola_ijcv2019] pushes the limits of facial expression manipulation by building a conditional GAN that relies on an attention-based generator and discrete facial action unit activations (action units (AUs) [friesen1978facial] are a kind of embedding that indicates facial muscle movement). As a novel expression editing method, GANimation is able to edit images in a continuous manner and outperforms other popular multi-domain image-to-image translation methods [li2016diat, zhu2017cyclegan, perarnau2016invertible, Choi_2018_CVPR_stargan]. Despite its novelty and generality, GANimation suffers from two drawbacks.
First, by taking absolute AUs as the input condition, the generator needs to estimate the current state of the facial muscles before it can apply a desired expression change to the input image. Moreover, from the user's perspective, using the entire set of AUs as the condition restricts fine-grained expression editing, because the user always needs to acquire the accurate underlying value of each AU in the input image, even for facial regions they do not intend the generator to modify. Second, the attention mechanism, which is introduced to learn the desired change from the input expression to the target expression, effectively applies a learned weighted sum between the input image and the generated one. This kind of operation, as pointed out in [Pumarola_ijcv2019], brings about transparent artifacts around face deformation regions. Furthermore, spatial attention networks for attribute-specific region editing [Zhang2018sagenerative] are effective only for local attributes and are not designed for arbitrary attribute editing [liu2019stgan].
To address these limitations, this work investigates arbitrary facial expression editing with relative AUs and proposes a novel method. With relative AUs, defined as the difference between target AUs and source AUs, our model is capable of (i) considering only the facial components to be modified while keeping the remaining parts unchanged, and (ii) freely strengthening or suppressing the intensity of specified AUs or arbitrary emotions through user-input real numbers. This brings several benefits. First, by using relative AUs, the generator is not required to compare the current AUs with the desired AUs before applying the image transformation. Second, the values of the relative AUs directly indicate the desired change to the facial muscles. In particular, non-zero values correspond to AUs of interest and zero values correspond to unchanged AUs. Hence, our generator can learn to manipulate a single AU with a scaled one-hot vector, eliminating the demand for the intensities of all other AUs.
For higher image quality and better expression manipulation ability, we start from a U-Net-based generator and analyze its limitations. Note that in the U-Net structure the features from the encoder are directly concatenated with the decoder features. In this work, we instead learn the model by simultaneously fusing and transforming image features at different spatial resolutions. In particular, we introduce a multi-resolution feature fusion mechanism and embed several multi-scale feature fusion (MSF) modules in the basic U-Net architecture for image transformation. Taking relative AUs as conditional input, our MSF module adaptively fuses and modifies the features from the encoder and from all lower resolutions, and outputs fused features with a multi-resolution representation. The fused features are then concatenated with the decoder features for image decoding. Experimental results in Table 1 and Fig. 9 show the better attribute manipulation ability and higher image quality brought by the MSF mechanism. An overview of our approach is provided in Fig. 2.
The contributions of this work can be summarized as follows:
We introduce relative AUs as a condition to enable fine-grained expression editing in a continuous manner, avoiding the need to know the explicit AUs of an input image.
We develop a sub-module called multi-scale feature fusion (MSF) module to learn image transformation at multiple feature resolutions, and embed such modules into a U-Net based generator for improving image quality and expression editing performance.
Comprehensive experimental results on both quantitative and qualitative evaluation demonstrate the efficacy of our proposed approach compared with state-of-the-art methods.
2 Related Work
Generative Adversarial Networks. As one of the most promising unsupervised deep generative models, GANs [Goodfellow2014Generative] have achieved a series of impressive results. DCGAN [radford2015dcgan] extends the performance of GANs by leveraging deep convolutional neural networks. Conditional GAN [Mirza2014Conditional] generates images with desired properties under the constraint of extra conditional variables. WGAN [arjovsky2017wassersteinGAN] stabilizes GAN training and alleviates mode collapse by introducing the Wasserstein distance. WGAN-GP [gulrajani2017wgangp] improves WGAN training by enforcing a gradient penalty. Up to now, GANs have become one of the most prominent generative models in image synthesis [zhang2017stackgan, Karras_2019_CVPR_stylegan, peng2018variational, wang2018esrgan, chu2018tecoGAN-videosr] and image-to-image translation [he2019attgan, park2019GauGAN].
Image-to-image translation can be treated as a cGAN that conditions on an image, aiming to learn an image mapping from one domain to another in supervised or unsupervised settings. Liu et al. [liu2017unsupervised] introduce a shared-latent space assumption and an unsupervised image-to-image translation framework based on Coupled GANs [liu2016coupled]. Pix2Pix [isola2017image-to-image] and [park2019GauGAN] are supervised cGAN-based approaches that rely on an abundance of paired images. However, the absence of adequate paired data limits the performance of conditional GANs. To alleviate the dependency on paired images, Zhu et al. [zhu2017cyclegan] propose a cycle-consistent framework for unpaired image-to-image translation. GANimation [Pumarola_ijcv2019] utilizes an encoder-decoder network that takes images and the entire set of action units as input to generate animated images, but suffers from undesired artifacts in the generated images.
Facial Expression Manipulation. Facial expression manipulation is an interesting image-to-image translation problem that has drawn wide attention recently. Some popular works tackle this task through multiple facial attribute editing [Choi_2018_CVPR_stargan, he2019attgan, liu2019stgan, powei_2019relgan], modifying attribute categories such as smiling, mouth open, mouth closed, adding a beard, swapping gender and changing hair color. However, these methods cannot simply generalize to arbitrary human facial expression synthesis due to the limitations of discrete emotion categories (e.g., happy, neutral, surprised, contempt, anger, disgust and sad). Several studies manipulate human facial expression from facial geometric representations [song2018geometry-guided, qiao2018geometry-contrastive], conditioning on facial fiducial points to synthesize animated faces, but they struggle with fine-grained details. Geng et al. [Geng_2019_CVPR_3DGuided] propose a 3D parametric face guided model to manipulate the geometry of facial components, but it requires real target face images rather than a simple vector. Nonetheless, a robust and easy-to-use approach for fine-grained expression manipulation remains an open challenge.
3 Proposed Method
In this section, we present the components of our approach for fine-grained facial expression editing. We consider an input RGB image with an arbitrary facial expression. The expression is characterized by a one-dimensional source AU vector, in which each entry is a normalized value between 0 and 1 that indicates the intensity of the corresponding action unit. With the goal of translating the input image into a photo-realistic image, our generator takes relative AUs as the condition to render an image with the target expression. In the following parts, relative action units, the MSF module, the network structure and the loss functions are presented.
3.1 Relative Action Units
Previous methods [Pumarola_ijcv2019] take both the absolute target action unit vector and the source image as input to the generator. However, this input setting is flawed in that the generator needs to estimate the real AUs of the input image to determine whether to edit it. From an application perspective, even if we do not want to change a particular AU, we still need to provide a target value that is strictly equal to the corresponding AU in the source image. Otherwise, the generator will probably introduce unintended modifications to the editing results.
Compared with absolute AUs, relative AUs describe the desired change on selected action units. This is in accordance with the definition of action units [friesen1978facial], which indicate the activation state of facial muscles. Denote the source AUs and target AUs as $\mathbf{v}_{src}$ and $\mathbf{v}_{tgt}$. The relative AUs, i.e., the difference between target and source action units, can then be defined as:
$$\mathbf{v}_{rel} = \mathbf{v}_{tgt} - \mathbf{v}_{src} \tag{1}$$
Introducing relative action units as input brings several benefits. First, relative AUs, represented by the difference between the AUs of the source and target images, are intuitive and user friendly. For example, if we only intend to suppress AU10 (Upper Lip Raiser), we can assign an arbitrary negative real value to the AU10 entry of the relative AU vector while setting the other entries to zero. Second, in comparison to the entire set of target action units, the values of the relative AUs are zero-centered, provide more expressive information for guiding expression editing, and stabilize the training process. Moreover, with relative AUs, the generator learns to edit the facial parts associated with non-zero values and to reconstruct those associated with zero values, which alleviates the cost of preserving action units. In our experiments, a relative AU vector with all zero values hardly introduces artifacts or errors.
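The two ways of forming a relative condition described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the 4-entry vectors and the index used for the single-AU edit are hypothetical, and real AU vectors are longer.

```python
def relative_aus(source, target):
    """Element-wise difference target - source, i.e. the relative AUs."""
    if len(source) != len(target):
        raise ValueError("source and target AU vectors must have the same length")
    return [t - s for s, t in zip(source, target)]


def single_au_edit(num_aus, index, delta):
    """Scaled one-hot relative vector: change one AU, leave all others at zero."""
    rel = [0.0] * num_aus
    rel[index] = delta
    return rel


# Hypothetical 4-AU example.
source = [0.2, 0.0, 0.7, 0.1]
target = [0.2, 0.5, 0.3, 0.1]
rel = relative_aus(source, target)      # zeros mark AUs that stay unchanged
suppress = single_au_edit(4, 2, -0.4)   # weaken only the AU at index 2
```

Note that the second form needs no knowledge of the source AUs at all, which is the practical advantage argued for in this section.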
Additionally, we propose to edit interpolated expressions between two different expressions. The interpolated AUs are given by Equation 2.
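A minimal sketch of how such an interpolated condition could be formed is given below. It rests on two assumptions that are not confirmed by the text: that the interpolation between the two expression vectors is linear, and that the source AUs are subtracted so that the result is again a relative condition. The function name and the blending weight `alpha` are illustrative.

```python
def interpolated_relative_aus(v1, v2, source, alpha):
    """Relative AUs for an expression blended between v1 (alpha=0) and v2 (alpha=1).

    Assumed form: v1 + alpha * (v2 - v1) - source, applied per AU.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [a + alpha * (b - a) - s for a, b, s in zip(v1, v2, source)]


# Hypothetical 2-AU example: sweep alpha to obtain a smooth sequence of conditions.
v1, v2, src = [0.0, 0.5], [1.0, 0.5], [0.25, 0.5]
sequence = [interpolated_relative_aus(v1, v2, src, a) for a in (0.0, 0.5, 1.0)]
```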
3.2 Multi-Scale Feature Fusion
According to [liu2019stgan], a plain encoder-decoder architecture is insufficient for high-quality image manipulation, whereas U-Net-based architectures improve generation quality. With this in mind, we propose to modify the image features at different spatial resolutions simultaneously. For this purpose, we adapt the structure in [sun2019deep_high_low] and build a learnable sub-network, namely our multi-scale feature fusion (MSF) module, to manipulate features at multiple scales. Fig. 3(a) shows the overall architecture of the multi-scale feature fusion module.
In our approach, the MSF module takes the encoder features, the features from other MSF modules and the relative AUs as input, and learns to manipulate image features at different spatial sizes. Without loss of generality, we take the MSF module at a given layer as an example. Its inputs are the encoder features from the corresponding encoder layer and the fusion features from the higher-level MSF module. First, the encoder features are concatenated with the relative AUs in a depth-wise fashion. Then a convolutional unit and a down-sampling layer are applied to obtain two feature maps of different spatial sizes. The down-sampled features are then concatenated with the higher-level features from the higher-level MSF module. One more parallel feature fusion unit is applied across the high- and low-resolution representations to produce the output fusion features. These fusion features are fed to the decoder and to the adjacent MSF module. In this way, our generator learns and transforms the image features collaboratively in a multi-scale manner.
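The depth-wise concatenation step can be illustrated without any deep learning library. The sketch below assumes the usual conditioning scheme in which each AU value is broadcast to a constant map with the same spatial size as the features and appended as an extra channel; the function name and the nested-list layout (channels, then rows, then columns) are hypothetical choices for illustration.

```python
def concat_aus_depthwise(features, rel_aus):
    """Append one constant H x W map per AU value to a C x H x W feature stack.

    features: list of C channels, each a list of H rows with W values.
    rel_aus:  list of N relative AU values.
    Returns a new list with C + N channels; the input is not modified.
    """
    height = len(features[0])
    width = len(features[0][0])
    au_maps = [[[value] * width for _ in range(height)] for value in rel_aus]
    return features + au_maps


# Hypothetical example: 2 feature channels of size 2 x 3, conditioned on 3 AUs.
feats = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
         [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
conditioned = concat_aus_depthwise(feats, [0.5, -1.0, 0.0])
```

After this step the convolutional unit sees the desired change at every spatial location, which is what lets it decide locally whether a region should be edited or left alone.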
3.3 Network Structure
As presented in Fig. 3(b), our generator is built on the image-to-image translation architecture proposed by Pumarola et al. [johnson2016perceptual]. In our generator, we add several skip connections that are modulated by our MSF modules in both high- and low-resolution representations. The encoder consists of four convolutional layers with stride 2 for down-sampling, while the decoder is composed of four transposed convolutional layers with stride 2 for up-sampling. Furthermore, the MSF module is applied as a skip unit to fuse features from both higher and lower resolutions in our generator. The down-sampling and up-sampling layers all share one kernel size, while the remaining convolutional layers use another.
The discriminator is trained to evaluate the generated images in terms of both realism and fulfillment of the desired expression. The two branches of the discriminator, an adversarial critic and a conditional critic, share a fully convolutional sub-network comprising six convolutional layers with kernel size 4 and stride 2. On top of the shared sub-network, we add a convolutional layer with kernel size 3, padding 1 and stride 1, resulting in an output of 2x2 spatial resolution. For the conditional critic, we add an auxiliary regression head to predict target AUs.
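The 2x2 output resolution can be checked with the standard convolution output-size formula. The sketch below makes two assumptions that are not stated in the text: the stride-2 layers use padding 1, and the input image is 128x128 (the size for which six halvings give 2x2). The function names are illustrative.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_output_size(input_size, padding=1):
    """Trace one spatial dimension through the discriminator described above."""
    size = input_size
    for _ in range(6):  # six shared layers, kernel size 4, stride 2
        size = conv_out_size(size, kernel=4, stride=2, padding=padding)
    # final layer: kernel size 3, padding 1, stride 1 (keeps the size unchanged)
    return conv_out_size(size, kernel=3, stride=1, padding=1)


out = discriminator_output_size(128)  # 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 2
```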
3.4 Loss Functions
We refer to the output of the generator, given the input image and the relative AUs as inputs, as the conditional generated image. In the following, we introduce the loss functions employed in our framework.
Adversarial Loss. To synthesize photo-realistic images with GANs, we replace the divergence criterion of the standard GAN [Goodfellow2014Generative] with the improved one proposed by WGAN-GP [gulrajani2017wgangp]. The adversarial loss consists of the critic scores on real and generated images together with a gradient penalty, which is weighted by a penalty coefficient and evaluated at images randomly interpolated between the input image and the generated image. The discriminator is unsupervised and aims to distinguish between real images and generated fake images. The generator tries to generate images that look as realistic as the real ones.
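For reference, the standard WGAN-GP objective takes the following shape when written for this setting. The notation is an assumption made for illustration: $\mathbf{I}_{src}$ is the input image, $\mathbf{v}_{rel}$ the relative AUs, $G$ the generator, $D_{adv}$ the adversarial branch of the discriminator, $\lambda_{gp}$ the penalty coefficient, and $\tilde{\mathbf{I}}$ a random interpolation between the real and the generated image. Under this sign convention the discriminator maximizes the expression and the generator minimizes it.

```latex
\mathcal{L}_{adv} =
  \mathbb{E}_{\mathbf{I}_{src}}\big[ D_{adv}(\mathbf{I}_{src}) \big]
  - \mathbb{E}_{\mathbf{I}_{src}}\big[ D_{adv}\big(G(\mathbf{I}_{src}, \mathbf{v}_{rel})\big) \big]
  - \lambda_{gp}\,
    \mathbb{E}_{\tilde{\mathbf{I}}}\Big[ \big( \lVert \nabla_{\tilde{\mathbf{I}}} D_{adv}(\tilde{\mathbf{I}}) \rVert_2 - 1 \big)^2 \Big]
```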
Conditional Fulfillment. We require that the image synthesized by our model not only look realistic but also possess the desired AUs. To this end, we adopt the core idea of conditional GANs [Mirza2014Conditional] and employ an action unit regressor that shares convolutional weights with the adversarial critic, and we define a manipulation loss for training both the discriminator and the generator. The AU regression loss on real images is used to optimize the regressor, so that the generator can in turn learn to generate images that minimize the AU regression loss.
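One plausible way to write this pair of losses is sketched below. It is an assumption rather than the exact formulation: the squared $\ell_2$ regression error, the name $D_{cond}$ for the AU regressor, and the symbols $\mathbf{v}_{src}$, $\mathbf{v}_{tgt}$, $\mathbf{v}_{rel}$ for source, target and relative AUs are illustrative choices. The first term trains the regressor on real images with their annotated AUs; the second trains the generator to produce images whose predicted AUs match the target.

```latex
\mathcal{L}_{cond}^{D} =
  \mathbb{E}_{\mathbf{I}_{src}}\Big[ \lVert D_{cond}(\mathbf{I}_{src}) - \mathbf{v}_{src} \rVert_2^2 \Big],
\qquad
\mathcal{L}_{cond}^{G} =
  \mathbb{E}_{\mathbf{I}_{src}}\Big[ \lVert D_{cond}\big(G(\mathbf{I}_{src}, \mathbf{v}_{rel})\big) - \mathbf{v}_{tgt} \rVert_2^2 \Big]
```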
Reconstruction Regularization. Our generator is trained to generate an output image that not only looks realistic but also possesses the desired facial action units. However, the dataset provides no ground-truth supervision for modifying facial components while preserving identity information. To this end, we add extra constraints to ensure that the faces in the input and output images appear to belong to the same person.
On the one hand, we utilize a self-reconstruction loss to force the generator to change nothing when fed with zero-valued relative AUs. On the other hand, we adopt the concept of cycle consistency [zhu2017cyclegan] and formulate a cycle-reconstruction loss that penalizes the difference between the input source image and the image obtained by mapping the generated image back. Both reconstruction losses are computed against the input source image, and the zero-valued condition is an all-zero vector with the same shape as the relative AU vector.
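The two reconstruction terms could be written as follows. This is a sketch under stated assumptions: the $\ell_1$ norm and the use of the negated relative AUs for the backward mapping are common choices for cycle consistency but are not confirmed by the text, and the symbols $\mathbf{I}_{src}$, $\mathbf{v}_{rel}$, $G$ and $\mathbf{0}$ are illustrative notation for the source image, the relative AUs, the generator and the all-zero condition.

```latex
\mathcal{L}_{self} =
  \mathbb{E}_{\mathbf{I}_{src}}\Big[ \lVert G(\mathbf{I}_{src}, \mathbf{0}) - \mathbf{I}_{src} \rVert_1 \Big],
\qquad
\mathcal{L}_{cyc} =
  \mathbb{E}_{\mathbf{I}_{src}}\Big[ \lVert G\big(G(\mathbf{I}_{src}, \mathbf{v}_{rel}), -\mathbf{v}_{rel}\big) - \mathbf{I}_{src} \rVert_1 \Big]
```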
Total Variation Regularization. To ensure smooth spatial transformation and naturalness of the output images in RGB color space, we follow prior work [johnson2016perceptual, Pumarola_ijcv2019] and apply a total variation regularization to the synthesized fake samples.
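A total variation penalty can be sketched in plain Python for a single image channel. The squared-difference form used here is one common variant and is an assumption; the function name and the nested-list image layout are illustrative.

```python
def total_variation(channel):
    """Sum of squared differences between vertically and horizontally adjacent pixels.

    channel: one image channel as a list of H rows with W values each.
    """
    height, width = len(channel), len(channel[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:  # vertical neighbour
                tv += (channel[i + 1][j] - channel[i][j]) ** 2
            if j + 1 < width:   # horizontal neighbour
                tv += (channel[i][j + 1] - channel[i][j]) ** 2
    return tv


flat = total_variation([[0.5, 0.5], [0.5, 0.5]])   # constant image, no variation
edges = total_variation([[0.0, 1.0], [0.0, 1.0]])  # one vertical edge per row
```

A constant image incurs no penalty, while abrupt pixel-to-pixel changes are penalized quadratically, which discourages high-frequency artifacts in the generated output.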
Model Objective. Taking the above losses into account, we build the total loss functions for the discriminator and the generator, respectively, by combining all of the previous partial losses, with tradeoff parameters that control the impact of each loss.
4.1 Implementation Details
Dataset and Preprocessing. We randomly choose a subset of 200,000 samples from the AffectNet [mollahosseini2017affectnet] dataset. In addition, we remove repeated images and cartoon faces from the validation set and take 3234 images as our testing samples to assess the training process. The images are center-cropped and resized to a fixed resolution by bicubic interpolation. All continuous AU annotations are extracted by [baltrusaitis2018openface].
Baseline. GANimation [Pumarola_ijcv2019], the current state-of-the-art method, outperforms many representative facial expression synthesis models [Choi_2018_CVPR_stargan, zhu2017cyclegan, li2016diat, perarnau2016invertible] and is taken as our baseline model. For a fair comparison, we use the code released by the authors (https://github.com/albertpumarola/GANimation) and train their model on the same dataset with default hyper-parameters.
Experiment Settings. We train the model with the Adam [kingma2014adam] optimizer for 30 epochs at the initial learning rate, and then linearly decay the rate for fine-tuning. We perform one optimization step of the generator for every four optimization steps of the discriminator. The weight coefficients for Eqn. 8 and 9 are set to fixed values. All experiments are conducted in the PyTorch [paszke2017automatic] environment on a PC equipped with an Intel(R) Xeon(R) E5-1660 v3 CPU at 3.00GHz and 2 NVIDIA TITAN XP GPUs.
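The training schedule can be sketched as two small helpers. All numeric values below are hypothetical placeholders, since the actual learning rates and the length of the decay phase are not given here; only the 30 constant epochs and the four-to-one ratio of discriminator to generator steps come from the text. The sketch also assumes that the discriminator is updated at every iteration and the generator at every fourth one.

```python
def linear_decay_lr(epoch, base_lr, final_lr, constant_epochs, decay_epochs):
    """Keep base_lr for constant_epochs, then move linearly to final_lr."""
    if epoch < constant_epochs:
        return base_lr
    progress = min(1.0, (epoch - constant_epochs + 1) / decay_epochs)
    return base_lr + progress * (final_lr - base_lr)


def is_generator_step(iteration, n_critic=4):
    """True on every n_critic-th iteration (1-based counting)."""
    return iteration % n_critic == 0


# Hypothetical values: base rate 1.0, final rate 0.0, 10 decay epochs.
lrs = [linear_decay_lr(e, 1.0, 0.0, constant_epochs=30, decay_epochs=10)
       for e in range(40)]
generator_updates = sum(is_generator_step(i) for i in range(1, 13))
```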
4.2 Evaluation Metrics
Evaluating a GAN model with respect to one criterion does not reliably reveal good performance. In this work, we conduct model evaluation from two perspectives, which are network-based and human-based evaluation. Both methods measure the performance in three aspects, namely expression fulfillment, relative realism and identity preserving ability.
For network-based evaluation metrics, we evaluate 3234 images from AffectNet testing-set, each of which is transformed to 7 randomly selected expressions. We get 22638 generated images and then perform quantitative evaluation.
Inception Score (IS). Following the metrics employed in our baseline model [Pumarola_ijcv2019], we calculate the Inception Score (IS) of the images synthesized by our approach and by the baseline. IS [salimans2016IS] utilizes an Inception network to extract image representations and calculates the KL divergence between the conditional distribution and the marginal distribution. Although previous work [barratt2018note] has revealed the limitations of IS on intra-class images, it is still widely used to evaluate image quality [Pumarola_ijcv2019, brock2018biggan].
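The computation behind IS can be illustrated with a short sketch that starts from per-image class probabilities. In practice these probabilities come from an Inception network; here they are hypothetical inputs, and the function name is illustrative. The score is the exponential of the mean KL divergence between each conditional distribution and the marginal over all images.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over a list of class-probability vectors."""
    num_images = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(p[c] for p in probs) / num_images for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # terms with zero probability contribute nothing to the KL divergence
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / num_images)


uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])  # every prediction equals the marginal
confident = inception_score([[1.0, 0.0], [0.0, 1.0]])      # sharp and diverse predictions
```

The score is lowest (1) when every prediction equals the marginal, and grows when predictions are individually confident yet diverse across images.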
Average Content Distance (ACD). ACD [Tulyakov2018MoCoGAN] measures the distance between embedded facial features of the input and generated images. We employ a well-known facial recognition network (https://github.com/ageitgey/face_recognition), as GANimation did in [Pumarola_ijcv2019], to extract a face code for each individual and calculate the content distance for each expression editing result. A lower ACD indicates better identity similarity between the images before and after editing.
Expression Distance (ED). To consistently evaluate the ability of our model in expression editing, we reuse OpenFace 2.0 [baltrusaitis2018openface] to acquire the AUs of the edited images, and calculate the distance between the generated and target AUs (the lower, the better). Performing such an objective evaluation is not trivial, as a categorized expression is often associated with different AU intensities [friesen1978facial].
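Both ACD and ED reduce to an average distance over pairs of vectors: face embeddings of input and edited images for ACD, and target versus measured AUs for ED. The sketch below assumes the Euclidean distance, which is an assumption about the norm; the function names and the toy vectors are illustrative, and real inputs would come from the face recognition network and from OpenFace respectively.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance(pairs):
    """Average distance over (reference, result) vector pairs; lower is better."""
    return sum(l2_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy pairs standing in for embeddings or AU vectors.
score = mean_distance([([0.0, 0.0], [3.0, 4.0]),
                       ([1.0, 1.0], [1.0, 1.0])])
```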
Human-based Metrics. For each metric in human-based evaluation, twenty users who engage in our test are asked to evaluate 100 pairs of images which are generated by baseline and our method. During the test, we randomly display the images and ensure that the users do not know which image is edited by our model.
Relative realism. In each comparison, we randomly select two images generated by GANimation and by our model, respectively. The user is asked to pick the image they consider more realistic.
Identity preserving. One more user study for identity similarity metric is conducted to verify if humans agree that the given two images are from the same person. The display order of synthesized images from GANimation or our model is random.
Expression fulfillment. Due to the complex distribution of human facial expressions, it is not very reasonable to classify an expression into a specific category. We alleviate this limitation by asking the users to rate the similarity of two facial expressions instead of reporting emotion labels. In every trial of the human preference study, we randomly select two images (one with the target expression, and the other edited by GANimation or by our method). The users have to examine and score the similarity of the facial expressions in the two images. If the user considers the expressions in the two images to be different, 0 points are given. If the user thinks the two images show exactly the same expression, 2 points are given. If the user is not sure about the similarity of the two expressions, or the expressions are partly the same (e.g., the same action units for the mouth but different ones for the eyebrows), 1 point is given.
4.3 Qualitative Evaluation
We first qualitatively compare our model with GANimation in the editing of single or multiple AUs. Fig. 4 shows two typical examples, AU2 (Outer Brow Raiser) and AU15 (Lip Corner Depressor). From the sample results in Fig. 4(a), it can be observed that GANimation fails to focus on the outer brow and wrinkles the mouth, yielding less satisfying results than ours. In the sample results in Fig. 4(b), our model produces more plausible and better manipulated results, especially in the regions around the lip corners.
Fig. 5 shows more results in single/multiple AUs editing. By adopting relative action units as conditional input, our model convincingly learns to edit a single or multiple AUs instead of entire action units of input image.
We proceed to compare our model against GANimation. From Fig. 6, we can see that our model successfully transforms the source image in accordance with the desired AUs, with fewer artifacts and manipulation cues, whereas the baseline model is less likely to generate high-quality details or to preserve the facial regions corresponding to unchanged AUs, especially for the eyes and mouth.
We next evaluate our network and discuss the model performance in extreme situations, which include but are not limited to image occlusions, portraits, drawings and non-human faces. In Fig. 7, for instance, the first image shows an occlusion created by a finger. To edit the expression of this kind of image, GANimation requires the entire set of AUs, including the activation status of the lip corner and chin, which imposes an extra burden on the user and brings an undesirable increase in visual artifacts. In contrast, our method is able to edit the expression without the need for source AUs. In the third and fourth rows of Fig. 7, we present face editing examples from paintings and drawings, respectively. GANimation either fails to manipulate the input image toward the target expression (third row, left and fourth row, right) or introduces unnatural artifacts and deformation (third row, right and fourth row, left). Although GANimation achieves plausible results on some of these images, the improvements of our method over GANimation are easy to see.
To better understand the benefits of continuous editing, we exploit AUs interpolation between different expressions and present results in Fig. 8. The plausible results verify the continuity in the action units space and demonstrate the generalization performance of our model.
4.4 Quantitative Evaluation
Here we conduct quantitative evaluations to verify the qualitative comparisons above. As described in Sec. 4.2, we resort to three alternative measures for the quantitative evaluation of our method. First, we calculate the IS, ACD and ED metrics for both GANimation and the proposed approach. The comparison results are given in Table 1. It can be observed that our approach consistently achieves competitive results against GANimation for IS and ED. Our generator without the MSF module attains the lowest ACD but the highest ED. This is reasonable because the accuracy of a facial recognition network inevitably suffers from expression variation, so a model that edits the expression less leaves the face embedding closer to that of the input.
Table 4.4, as a supplement to the ED metric, offers a human-based evaluation of expression editing ability. Benefiting from the MSF modules, which serve as skip connections from the encoder to the decoder, our approach outperforms GANimation by a large margin. Nearly a quarter of the test samples transformed by GANimation are considered failures. According to the human preference results in Table 4.4, the proposed model is slightly better than the baseline in terms of identity preservation and performs better in image realism.
4.5 Ablation Study
In this section, we explore the importance of each component of the proposed method. To begin with, we investigate the improvement brought by relative AUs. We compare our model with the baseline model in terms of action unit preservation from a reconstruction perspective. To perform facial image reconstruction, we apply GANimation with the source AUs as the absolute condition, and apply our model with a zero-valued vector as the relative condition. We present the L1 norm, PSNR and SSIM [wang2004image] between the input and generated images in Table 4. From the second and third rows, it can be seen that GANimation trained with relative AUs is slightly better than our approach without relative AUs. When trained with our full approach (fourth row), we achieve the best reconstruction results.
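Two of the reconstruction metrics are simple enough to sketch directly. The following assumes single-channel images stored as nested lists with pixel values in the range 0 to 255; the function names are illustrative, and SSIM is omitted because it requires local windowed statistics.

```python
import math


def l1_error(img_a, img_b):
    """Mean absolute pixel difference between two equally sized images."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(img_a, img_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    squared = [(a - b) ** 2
               for row_a, row_b in zip(img_a, img_b)
               for a, b in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


black = [[0.0, 0.0], [0.0, 0.0]]
white = [[255.0, 255.0], [255.0, 255.0]]
worst_l1 = l1_error(black, white)
worst_psnr = psnr(black, white)
perfect_psnr = psnr(black, black)
```

A lower L1 error and a higher PSNR both indicate that the reconstruction is closer to the input, which is how the rows of the reconstruction table should be read.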
We next examine the importance of the MSF module based on the IS/ACD/ED metrics. Since our model is built on U-Net, we replace the skip connections with our MSF modules step by step and train each of the resulting generators separately. Quantitative comparison results are shown in Table 1. The first case is our model without the MSF module (fourth row), which reduces to the U-Net architecture. The U-Net-based model obtains the best ACD result and the worst expression distance, which implies weak expression editing performance. A conclusion can be drawn from the comparison that a model has greater potential to attain a lower ACD if its ED gets higher. One plausible explanation is that stronger expression editing inevitably changes the face features extracted by the facial recognition network. Fig. 9 shows the loss optimization process in our experiments. As can be seen, the trends of the loss curves are almost the same while training the discriminator (left figure). From the right figure, we can see that the generator with two MSF modules converges faster than those with fewer MSF modules, which indicates a clear improvement brought by our MSF mechanism.
In this study, we propose a novel approach that incorporates a multi-scale fusion mechanism into a U-Net-based architecture for arbitrary facial expression editing. As a simple but competitive technique, the relative condition setting is shown to improve model performance by a large margin, especially for action unit preservation, reconstruction quality and identity preservation. We achieve better experimental results in visual quality, manipulation ability and human preference compared with state-of-the-art methods.