With the rapid development of generative models in computer vision, especially generative adversarial networks (GANs), there has been an increasing focus on challenging tasks, such as realistic photograph generation, image-to-image translation, text-to-image translation, and super resolution. Face reenactment is one of these challenging tasks; it requires 3D modeling of the geometry and movement of faces. It has many applications in image editing/enhancement and interactive systems (e.g., animating an on-screen agent with natural human poses/expressions) [29]. Producing photo-realistic face reenactment requires a large number of images from the same identity for appearance modeling. In this paper, we focus on a more challenging task, one-shot face reenactment, which requires only one image of a given identity to produce a photo-realistic face. More specifically, we propose a deep learning model that takes one image from a random source identity alongside target expression landmarks. The model then outputs a face image that has the same appearance information as the source identity, but with the target expression. This requires the model to transform a source face shape (e.g., facial expression and pose) to a target face shape, while simultaneously preserving the appearance and the identity of the source face, and even the background.
Figure 1 shows the reenacted faces produced by the proposed method. Given an input face image of a source identity, the proposed one-shot face reenactment model, FaR-GAN, is able to transform the expression from the input image to any target expression. The reenacted faces have the same expression captured by the target landmarks, while also retaining the same identity, background, and even clothes as the input image. Therefore, the proposed one-shot face reenactment model requires no assumption about the source identity, facial expression, head pose, and image background.
The main contributions of this paper are summarized as follows:
- We develop a GAN-based method that addresses the task of one-shot face reenactment.
- The proposed FaR-GAN is able to compose appearance and expression information for effective face modeling.
- The reenacted images produced by the proposed method achieve higher image quality than the compared methods.
2 Related Work
Face Reenactment by 3D Modeling. Modeling faces in 3D helps in accurately capturing their geometry and movement, which in turn improves the photorealism of the reenacted faces. Thies et al. [27] propose a real-time face reenactment approach based on the 3D morphable face model (3DMM) [1] of the source and target faces. The transfer is done by fitting a 3DMM to both faces and then applying the expression components of one face onto the other [21]. To achieve face synthesis based on imperfect 3D model information, they further improve their method by introducing a learnable feature map (i.e., neural texture) alongside the UV map from the coarse 3D model as input to the rendering system [26]. For 2D rendering, they also design a learnable neural rendering system based on U-Net [23] to output the 2D reenacted image. The entire rendering pipeline is end-to-end trainable.
Face Reenactment by GANs. Generative adversarial networks have been successfully used in this area due to their ability to generate photo-realistic images. They are able to achieve high-quality and high-resolution unconditional face generation [12, 11, 13]. ReenactGAN, proposed by Wu et al. [30], first maps the face that contains the target expression into an intermediate boundary latent space that contains the information of facial expressions but no identity-related information. The boundary information is then passed to an identity-specific decoder network to produce the reenacted face of that specific identity. Therefore, their model cannot be used for the reenactment of unknown identities.
To solve this issue, few-shot or even one-shot face reenactment methods have been developed in recent work [29, 31, 33]. Wiles et al. [29] propose a model, namely X2Face, that is able to use facial landmarks or audio to drive the input source image to a target expression. Instead of directly learning the transformation of expressions, their model first learns the frontalization of the source identity. “Frontalization” is the process of synthesizing frontal facing views of faces appearing in single unconstrained photos [5]. Then it produces an intermediate interpolation map given the target expression to be used for transferring the frontalized face. Zakharov et al. [31] present a few-shot learning approach that achieves face reenactment given a few, or even one, source images. Unlike the X2Face model, their method is able to directly transfer the expression without the intermediate boundary latent space [30] or interpolation map [29]. Zhang et al. [33] propose a one-shot face reenactment model that only requires one source image for training and inference. They use an auto-encoder-based structure to learn the latent representation of faces, and then inject these features using the SPADE module [22] for the face reenactment task. The SPADE module in our proposed method is inspired by their work. However, instead of using the multi-scale landmark masks used by [33], we use learnable features from convolution layers as the input to the SPADE module.
3 Proposed Method
3.1 Model Architecture
Figure 2 shows the generator architecture of the proposed FaR-GAN model. The model consists of two parts: an embedder and a transformer. The embedder model aims to learn the feature representation of facial expressions given a set of facial landmarks. In this work, we adopt a color encoding method similar to the one proposed in [31] to represent the facial landmarks. More specifically, we use distinct colors for the eyes, eyebrows, nose, outer mouth contour, inner mouth contour, and face contour. We also tried a binary mask to represent the landmark information (i.e., 1 for the facial region and 0 for the background), but it did not give a better result. We compare the results obtained with the different landmark representations in Section 4. The transformer model aims to use the landmark features from the embedder model to reenact the input source identity with the target landmarks. The transformer architecture is based on the U-Net model [23]. The U-Net model is a fully convolutional network for image segmentation. Besides its encoder-decoder structure for local information extraction, it also utilizes skip connections (the gray arrows in Figure 2) to retain global information.
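The color-coded landmark representation described above can be sketched as follows. This is only an illustration: the index ranges follow the common 68-point landmark layout used by Dlib, while the RGB values, the function names, and the choice to draw open polylines are assumptions, since the text only states that the facial parts use distinct colors.

```python
import numpy as np

# Index ranges of the 68-point landmark layout, each paired with an RGB color.
# The color values are hypothetical: only "distinct colors per part" is specified.
LANDMARK_GROUPS = {
    "face_contour":  (range(0, 17),  (255, 255, 255)),
    "right_eyebrow": (range(17, 22), (255, 255, 0)),
    "left_eyebrow":  (range(22, 27), (255, 255, 0)),
    "nose":          (range(27, 36), (0, 0, 255)),
    "right_eye":     (range(36, 42), (0, 255, 0)),
    "left_eye":      (range(42, 48), (0, 255, 0)),
    "outer_mouth":   (range(48, 60), (255, 0, 0)),
    "inner_mouth":   (range(60, 68), (255, 0, 255)),
}


def draw_segment(mask, p0, p1, color):
    """Rasterize the straight segment p0 -> p1 ((x, y) points) by dense sampling."""
    steps = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    xs = np.rint(np.linspace(p0[0], p1[0], steps)).astype(int)
    ys = np.rint(np.linspace(p0[1], p1[1], steps)).astype(int)
    # Drop samples that fall outside the image.
    inside = (xs >= 0) & (xs < mask.shape[1]) & (ys >= 0) & (ys < mask.shape[0])
    mask[ys[inside], xs[inside]] = color


def landmarks_to_color_mask(landmarks, height, width):
    """Connect consecutive landmarks of each group with a polyline in the group's color."""
    landmarks = np.asarray(landmarks, dtype=float)  # shape (68, 2), (x, y) per row
    mask = np.zeros((height, width, 3), dtype=np.uint8)
    for indices, color in LANDMARK_GROUPS.values():
        points = landmarks[list(indices)]
        for p0, p1 in zip(points[:-1], points[1:]):
            draw_segment(mask, p0, p1, color)
    return mask
```

The resulting `(height, width, 3)` array plays the role of the landmark mask that is fed to the embedder. A binary mask, by contrast, would discard the per-part color information.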
A similar generator architecture can be found in [31], but with several differences. First, instead of using the embedder to encode appearance information of the source identity, we use it to extract the target landmark information. The embedder model is a fully convolutional network that continuously downsamples the feature resolution with max-pooling or average-pooling layers. Therefore, the spatial information of the input image is lost during the downsampling process. To encode the appearance information of the source identity, the output features are required to represent a large amount of information, including the identifiable information, hair style, body parts (neck and shoulders), and even background. Therefore, it is challenging for the embedder model to learn precise appearance information once the spatial information is lost. In our approach, we use the embedder model to encode the facial landmarks, which contain much less information than the aforementioned appearance features. Moreover, instead of outputting a single 1D embedding vector, we use the embedder features from all resolutions obtained during the downsampling process. With this, we can ensure that the embedder features contain the spatial information required for expression transformation.
The adaptive instance normalization (AdaIN) module has been successfully used for face generation in previous work [11, 13, 31]. In [31], the AdaIN modules inject the appearance information into the generator model to produce the reenacted face by assigning a new bias and scale to the convolution features based on the embedder features. However, since we need to inject landmark information, which comes from a sparse landmark mask, we cannot simply adopt the AdaIN module in our method. This is because instance normalization (e.g., AdaIN) tends to wash away semantic information when applied to uniform or flat segmentation masks [22], such as our input landmark masks. Instead, we propose using the spatially-adaptive normalization (SPADE) [22] module to inject the landmark information. As the name indicates, the SPADE module is a feature normalization approach that uses the learnable spatial information from the input features. Similar to batch normalization [8], the input convolution features are first normalized in a channel-wise manner, and then modulated with a learned scale and bias, as shown in Figure 3. The output of the SPADE module can be formulated as shown in Equation 1.
where $m$ is the input landmark mask or intermediate convolution features from the embedder, $h_{n,c,y,x}$ is the input convolution feature from mini-batch $n$, channel $c$, dimension $y$, and dimension $x$, $\gamma_{c,y,x}(m)$ is the new scale, and $\beta_{c,y,x}(m)$ is the new bias. The mean $\mu_c$ and standard deviation $\sigma_c$ of the activation in channel $c$ are defined in Equations 2 and 3.
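For reference, the modulation and the channel statistics described here can be written out as below. This follows the standard spatially-adaptive normalization formulation, with $N$ the batch size and $H \times W$ the spatial size of the feature map. The symbol names are chosen to match that formulation and should be read as a reconstruction rather than the exact typesetting of Equations 1 to 3.

```latex
% Equation 1: normalized activation, modulated by a spatially varying scale and bias
\gamma_{c,y,x}(m)\,\frac{h_{n,c,y,x}-\mu_c}{\sigma_c}+\beta_{c,y,x}(m)

% Equation 2: per-channel mean over the mini-batch and spatial positions
\mu_c=\frac{1}{NHW}\sum_{n,y,x}h_{n,c,y,x}

% Equation 3: per-channel standard deviation
\sigma_c=\sqrt{\frac{1}{NHW}\sum_{n,y,x}\left(h_{n,c,y,x}\right)^2-\mu_c^2}
```

Unlike batch normalization, where the scale and bias are one learned scalar per channel, here both depend on the spatial position through $m$.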
This SPADE module has been successfully used for the face reenactment task in [33]. As shown in Figure 2, in our method, the input to the SPADE block is the convolution features from the embedder network. In [33], a group of multi-scale landmark masks is used as the input to the SPADE blocks, instead of the deep features used in our proposed method. However, in our experiments, if we use these multi-scale masks instead of deep features as input to the SPADE blocks, the output reenacted faces contain artifacts from the input landmark contours, as shown in Figure 4. Similar to [16], we use the features from the embedder network to inject the landmark information into the transformer model.
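A minimal NumPy sketch of a SPADE-style block driven by embedder features is given below. It is a simplification under stated assumptions: the scale and bias are predicted with 1x1 convolutions (written as `einsum` contractions), whereas real SPADE layers typically use small convolutional stacks, and all names and shapes are hypothetical.

```python
import numpy as np


def spade(h, embed_feat, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """SPADE-style modulation of transformer features by embedder features.

    h:          (N, C, H, W) features inside the transformer.
    embed_feat: (N, K, H, W) embedder features at the same spatial resolution.
    w_gamma, w_beta: (C, K) weights of 1x1 convolutions (assumed layer shape).
    b_gamma, b_beta: (C,) biases of those 1x1 convolutions.
    """
    # Channel-wise statistics over the batch and spatial dimensions.
    mu = h.mean(axis=(0, 2, 3), keepdims=True)
    sigma = h.std(axis=(0, 2, 3), keepdims=True)
    h_norm = (h - mu) / (sigma + eps)

    # Spatially varying scale and bias predicted from the embedder features.
    gamma = np.einsum("ck,nkyx->ncyx", w_gamma, embed_feat) + b_gamma[None, :, None, None]
    beta = np.einsum("ck,nkyx->ncyx", w_beta, embed_feat) + b_beta[None, :, None, None]
    return gamma * h_norm + beta
```

Because `gamma` and `beta` keep the `(H, W)` layout of the embedder features, the landmark layout survives the normalization step instead of being averaged away.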
There are many aspects in human portraits that can be regarded as stochastic, such as the exact placement of hairs, stubble, freckles, or skin pores [11]. Inspired by StyleGAN [11], we introduce stochastic variation into our transformer model by injecting noise. The noise injection is executed at each resolution of the decoder part of the transformer model. More specifically, we first sample an independent and identically distributed standard Gaussian noise map of size $1 \times H \times W$, where $H$ and $W$ are the spatial resolution of the input feature. Then a noise block with $C$ channels, matching the input feature, is obtained by scaling the noise map with a set of learnable scaling factors, one for each channel. We inject the noise block by adding it element-wise to the input features.
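The noise injection step can be sketched in a few lines of NumPy. The function and argument names are hypothetical, and `channel_scale` stands in for the learnable per-channel scaling factors.

```python
import numpy as np


def inject_noise(features, channel_scale, rng):
    """Add per-channel scaled Gaussian noise to a feature tensor.

    features:      (N, C, H, W) decoder features.
    channel_scale: (C,) scaling factors (learnable in a real model).
    rng:           a numpy.random.Generator used to sample the noise.
    """
    n, c, h, w = features.shape
    # One single-channel i.i.d. standard Gaussian noise map per sample.
    noise_map = rng.standard_normal((n, 1, h, w))
    # Broadcast to C channels, scaling each channel by its own factor.
    noise_block = noise_map * channel_scale.reshape(1, c, 1, 1)
    return features + noise_block
```

All channels share the same noise map and differ only in how strongly it is applied, so a channel whose scale is zero is left unchanged.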
We adopt the design of [18, 9] for our discriminator. More specifically, the input to our discriminator is the reenacted face concatenated with the target landmark mask, or the ground truth face image with its corresponding landmark mask. Therefore, the discriminator aims to guide the generator to produce a realistic face and also faces with the correct target landmarks. In Section 4, we will provide an ablation study to show the importance of the discriminator.
3.2 Loss Function
The proposed model, including both the embedder and the transformer, is trained end-to-end. Assume we have a set of videos that contain the moving face/head of multiple identities. We denote $x_i(t)$ as the $t$-th frame of the $i$-th video. Assume $x_i(t)$ and $x_i(s)$ are two random frames from a video. The two frames $x_i(t)$ and $x_i(s)$ therefore contain the same identity but different facial expressions and head poses. We formulate our generator function as follows:
$$\hat{x}_i(t) = G\big(x_i(s),\, y_i(t)\big)$$
where $\hat{x}_i(t)$ is the reenacted frame and $y_i(t)$ is the landmark mask of $x_i(t)$. The generator loss function is defined in Equation 5.
$\mathcal{L}_{GAN}$ is the generator adversarial loss, which is based on LSGAN [17]. We compared the results from the vanilla GAN [3], LSGAN [17], and WGAN-GP [4] and chose LSGAN based on the visual quality of the reenacted images. $\mathcal{L}_{L1}$ is the pixel-wise L1 loss, which minimizes the pixel difference between the generated image and the ground truth image. $\mathcal{L}_{perc}$ is the perceptual loss for minimizing the semantic difference, which was originally proposed by [10]. $\Phi$ is a collection of convolution layers from the perceptual network and $\phi_l$ is the activation from the $l$-th layer. In our work, the perceptual network is a VGG-19 model [25] pretrained on the ImageNet dataset [24]. To enforce the reenacted face to have the same identifiable information as the input source identity, we add an identity loss $\mathcal{L}_{id}$, which is similar to the perceptual loss, but with a VGGFace model [2] pretrained for face verification.
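The pixel-wise and perceptual terms can be sketched as below. This is an illustration only: the `layers` callables stand in for activations of a pretrained network such as VGG-19, the two toy layers are hypothetical placeholders, and the use of a mean absolute difference between activations is an assumption.

```python
import numpy as np


def l1_loss(pred, target):
    """Mean absolute pixel difference between generated and ground truth images."""
    return float(np.mean(np.abs(pred - target)))


def perceptual_loss(pred, target, layers):
    """Sum over layers of the mean absolute difference between activations.

    `layers` is a list of callables, each mapping an image to one activation
    map. In practice these would be layers of a pretrained network.
    """
    return float(sum(np.mean(np.abs(phi(pred) - phi(target))) for phi in layers))


# Two toy "layers" so the sketch runs without a pretrained model (hypothetical).
def avg_pool2(x):
    h, w = x.shape
    return x[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def horizontal_gradient(x):
    return x[:, 1:] - x[:, :-1]


toy_layers = [avg_pool2, horizontal_gradient]
```

An identity loss of the kind described above would reuse `perceptual_loss` with layers taken from a face verification network instead of an ImageNet classifier.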
The discriminator loss function is based on the LSGAN loss function, which is defined as follows:
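As a point of reference, the least-squares objectives from the LSGAN formulation can be sketched as follows, using the common 0/1 target coding (fake = 0, real = 1). The 0.5 factors and the function names are assumptions. Here `d_real` and `d_fake` are discriminator outputs for ground truth and reenacted images, each already paired with its landmark mask.

```python
import numpy as np


def lsgan_discriminator_loss(d_real, d_fake):
    """Push discriminator outputs toward 1 on real pairs and 0 on generated pairs."""
    return 0.5 * float(np.mean((d_real - 1.0) ** 2)) + 0.5 * float(np.mean(d_fake ** 2))


def lsgan_generator_loss(d_fake):
    """Push discriminator outputs on generated pairs toward the real label 1."""
    return 0.5 * float(np.mean((d_fake - 1.0) ** 2))
```

Compared with the cross-entropy loss of the vanilla GAN, the squared error still penalizes samples that are classified correctly but lie far from the target value.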
3.3 Implementation Details
layers to stabilize the training process. The transformer network consists of input/output convolution layers, downsampling convolution layers, upsampling convolution layers, and SPADE convolution blocks. The input/output convolution layers contain only convolution layers, so the feature resolutions do not change. The downsampling convolution layers consist of an average-pooling layer, a convolution layer, and a spectral normalization layer. The upsampling convolution layers consist of a de-convolution layer followed by a spectral normalization layer to upsample the feature resolution by a factor of 2. The SPADE convolution block contains the noise injection layer followed by the SPADE module. For the discriminator, we use the same structure proposed by , with the two downsampling convolution layers.
Previously, the self-attention mechanism has been successfully used for GANs that generate high-quality synthetic images [32]. To ensure that the generator learns from long-range information across the entire input image, we adopt the self-attention module in both the generator and discriminator. More specifically, for the generator, we place the self-attention module after the upsampling convolution layers at two feature resolutions, which is similar to the implementation in [31]. For the discriminator, we place the self-attention module after the second downsampling convolution layer.
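A compact NumPy sketch of a self-attention block over a feature map, in the spirit of the self-attention GAN formulation, is shown below. The 1x1 convolutions are written as plain matrices acting on flattened positions. All names and shapes are hypothetical, and the output projection used in some implementations is omitted for brevity.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v, gamma):
    """Self-attention over all spatial positions of one feature map.

    x:        (C, H, W) input features.
    w_q, w_k: (D, C) query/key projections (1x1 convolutions).
    w_v:      (C, C) value projection (1x1 convolution).
    gamma:    scalar weight of the attention branch (learnable in a real model).
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)               # (C, P) with P = H * W positions
    q, k, v = w_q @ flat, w_k @ flat, w_v @ flat

    scores = q.T @ k                          # (P, P): scores[i, j] = q_i . k_j
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)   # softmax over positions j

    out = v @ attn.T                          # (C, P): out[:, i] = sum_j attn[i, j] * v[:, j]
    return (gamma * out + flat).reshape(c, h, w)
```

Every output position mixes values from all other positions, which is what lets the network relate distant regions such as the two sides of the face. With `gamma` set to zero the block reduces to the identity.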
During training, in order to balance the magnitude of each term in the loss function, we choose the weights for the L1, perceptual, and identity loss terms as 20, 2, and 0.2, respectively. These weights could be different when using different datasets or different perceptual networks. We use the Adam optimizer [15] for both the generator and discriminator with the initial learning rate as
. The learning rate decays linearly and decreases to 0 after 100 epochs.
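The loss weighting and the learning rate schedule can be sketched as below. Several points are assumptions: the mapping of the three weights to the L1, perceptual, and identity terms (with the adversarial term left unweighted), the decay starting at the first epoch, and the `base_lr` argument, which is a placeholder rather than the value used in the experiments.

```python
# Assumed assignment of the weights 20, 2, and 0.2 to the three non-adversarial terms.
LOSS_WEIGHTS = {"l1": 20.0, "perceptual": 2.0, "identity": 0.2}


def generator_loss(adversarial, l1, perceptual, identity, weights=LOSS_WEIGHTS):
    """Weighted sum of the generator loss terms."""
    return (
        adversarial
        + weights["l1"] * l1
        + weights["perceptual"] * perceptual
        + weights["identity"] * identity
    )


def linear_decay_lr(base_lr, epoch, total_epochs=100):
    """Learning rate that falls linearly from base_lr to 0 at total_epochs."""
    return base_lr * max(0.0, 1.0 - epoch / total_epochs)
```

For example, `linear_decay_lr(base_lr, 50)` returns half of `base_lr`, and any epoch at or beyond 100 returns 0.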
4 Experiments
4.1 Dataset
In this paper, we use the VoxCeleb1 dataset [20] for training and testing. It contains 24,997 videos from 1,251 different identities. The dataset provides cropped face images extracted at 1 frame per second, and we resize these images to a fixed resolution. The Dlib package [14] is used for extracting 68-point facial landmarks. We split the identities into disjoint training and testing sets to ensure that our model generalizes to new identities.
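An identity-disjoint split of this kind can be sketched as follows. The function name, the seed, and the `test_fraction` argument are hypothetical, and the fraction used in the example is a placeholder rather than the ratio used in the experiments.

```python
import random


def split_identities(identity_ids, test_fraction, seed=0):
    """Split identities (not frames) so that no identity appears in both sets."""
    ids = sorted(set(identity_ids))
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Example with 1,251 identities and a placeholder test fraction.
train_ids, test_ids = split_identities(range(1251), test_fraction=0.2)
```

Splitting by identity rather than by video or frame is what makes the test results reflect performance on faces the model has never seen.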
4.2 Experimental Results
We compare the proposed method against two methods, the X2Face model [29] and the few-shot talking face generation model (Few-Shot) [31]. X2Face contains two parts: an embedder network and a driver network. Instead of directly mapping the input source image to the reenacted image, their embedder learns to frontalize the input source image and the driver network produces an interpolation map given the target expression to transform the frontalized image. To compare with the X2Face model, we use their model with pretrained weights provided by the authors and evaluate on the VoxCeleb1 dataset. The Few-Shot model also contains two parts: an embedder network and a generator network. As described in Section 3, their embedder learns to encode the appearance information of the source image, while the generator learns to generate the reenacted image given the appearance information and target landmark mask. For the Few-Shot model, since the authors only provide the testing results, we directly use these results for comparison. Both the X2Face and Few-Shot methods require two stages of training. The first stage uses two frames from the same video, while the second stage requires frames from two different videos. By doing so, they can ease the training process at the beginning by using frames that contain the same identity and similar background information. Then, for the second stage, they use frames from two different videos to ensure that the reenacted face contains the same identifiable information as the input source identity. As mentioned in Section 3, the proposed method requires only the first stage of training.
In this section, we provide both qualitative and quantitative comparisons. For the quantitative analysis, we use the structural similarity index (SSIM) [28] and the Fréchet inception distance (FID) [7] to measure the quality of the generated images. SSIM measures low-level similarity between the ground truth images and reenacted images [31]. The higher the SSIM is, the better the quality of the generated images is. FID measures perceptual realism based on an InceptionV3 network that was pretrained on the ImageNet dataset for image classification (the weights are fixed during the FID evaluation). It has been used for image quality evaluation in many works [12, 11]. In our work, the FID score is computed using the default setting of the implementation at https://github.com/mseitzer/pytorch-fid, so we use the final average pooling features from the InceptionV3 network. The lower the FID is, the better the quality of the generated images is.
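The distance underlying FID can be sketched in NumPy as below, given two sets of feature vectors (in practice, pooled InceptionV3 features of real and generated images). This is a simplified illustration rather than the reference implementation: the trace of the matrix square root is obtained from the eigenvalues of the covariance product, and the function name is hypothetical.

```python
import numpy as np


def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_fake: (num_samples, dim) arrays with dim >= 2.
    Computes ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^(1/2)).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)

    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(s1 @ s2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1) + np.trace(s2) - 2.0 * trace_sqrt)
```

Because the distance compares the mean and covariance of whole feature distributions, it is computed over a collection of images rather than per image pair, unlike SSIM.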
Table 1 shows the results of the proposed and compared methods. The SSIM and FID scores of the compared methods are obtained from the original paper [31]. Although the SSIM results are similar for all three methods, the proposed method outperforms the compared methods in terms of FID. Figure 5 shows the qualitative comparison on the testing set. The results from X2Face contain wrinkle artifacts, because it uses the interpolation mask to transfer the source image, instead of directly learning the mapping function from the source image to the reenacted image. Although the X2Face result in the first row shows its effectiveness when the change of head pose is relatively small, the results in the second and third rows show that the wrinkle artifacts become more visible when the background becomes complex and the change of head pose is larger. Both the Few-Shot method and the proposed method produce results with good visual quality, transferring the target expression accurately while preserving the background information. Because the proposed method injects noise into the transformer network, the reenacted faces contain more high-frequency information than those of the Few-Shot model, especially in the woman’s hair in the third testing case. Because the FID computes the statistical difference between a collection of synthetic images and a collection of real images, it measures both high-frequency and low-frequency components. Therefore, the proposed method achieves a much lower FID than the two compared methods.
4.3 Ablation Study
Figure 6 shows our results with and without the discriminator. The result with discriminator contains more details, like hair, teeth, and background, compared to the result without discriminator. Therefore, the discriminator does guide the generator (both embedder and transformer) in producing better synthetic images.
Table 2 shows the SSIM and FID results of the model using the contour-based mask and the binary mask for landmark representation. An example of the landmark binary mask is shown in Figure 7. Although the SSIM scores are similar, the FID score of the binary mask is much higher than that of the contour-based representation. Because it uses different colors for different facial components, the contour-based mask provides additional information that lets the embedder treat different parts of the face separately. Thus, it can achieve a better understanding of facial pose and expression.
|Model|SSIM|FID|
|---|---|---|
|FaR-GAN (w/o attention and w/o noise)|0.67|63.9|
|FaR-GAN (w/ attention and w/o noise)|0.66|35.3|
|FaR-GAN (w/ attention and w/ noise)|0.68|27.1|
Table 3 shows the ablation study of different model components, including self-attention and noise injection. Although the SSIM scores are similar across the three experiments, the FID scores improve when these components are used. Adding the self-attention module reduces the FID from 63.9 to 35.3, and with the noise injection module, the FID drops to 27.1. Therefore, the two components indeed help improve the model performance. We also show the visual comparison of these experiments in Figure 8. The results without self-attention and noise injection contain blob-like artifacts that are also mentioned in [13]. In general, the results both with and without noise injection achieve good visual quality. However, the results without noise injection have some artifacts, as seen in the ear region in the first example and the right shoulder region in the second example. As shown in Figure 9, noise injection can improve the reenacted image quality by adding high-frequency details in the hair region. Therefore, the model with both self-attention and noise injection modules achieves the best image quality.
5 Conclusion and Future Work
In this paper, we propose a one-shot face reenactment model, FaR-GAN, that is able to transform a face image to the target expression. The proposed method takes only one image from any identity as well as target facial landmarks and it is able to produce a high quality reenacted face image of the same identity but with the target expression. Therefore, it makes no assumption about the identity, facial pose, and expression of the input face image and target landmarks. We evaluate our method using the VoxCeleb1 dataset and show that the proposed model is able to generate face images with better visual quality than the compared methods.
Although the results from our method achieve a high visual quality, in some cases, when the identity that provides the target landmarks has a large appearance difference from the source identity, such as a different gender or face size, there is still a visible identity gap between the input source identity and the reenacted face. In future work, we will continue improving our model to bridge this identity gap, for example by using an additional finetuning step to explicitly direct the model to reduce the identity changes, as proposed in [29, 31]. Furthermore, in the current model setting, we do not consider pupil movement in our landmark representation. As proposed by [33], we can add gaze information to the landmark mask to make the reenacted face contain more realistic facial movement. Although the proposed method achieves good performance in terms of FID, our generated images are still qualitatively poorer than those of the unconditional face generation methods (ProgressiveGAN [12] and StyleGAN [11]). To further improve our method, we can adopt the progressive training approach from the aforementioned methods. We would first train a small portion of the model to produce a good quality image at a small resolution, and then gradually add the rest of the model to produce higher resolution images. By doing so, we can stabilize the training process and produce images with better visual quality at higher resolution.
- [1] (1999-08) A morphable model for the synthesis of 3D faces. Conference on Computer Graphics and Interactive Techniques, pp. 187–194. Los Angeles, CA.
- [2] (2018-05) VGGFace2: a dataset for recognising faces across pose and age. Proceedings of the International Conference on Automatic Face and Gesture Recognition. Xi’an, China.
- [3] (2014-12) Generative adversarial nets. Proceedings of Advances in Neural Information Processing Systems, pp. 2672–2680. Montréal, Canada.
- [4] (2017-12) Improved training of Wasserstein GANs. International Conference on Neural Information Processing Systems, pp. 5769–5779.
- [5] (2015-06) Effective face frontalization in unconstrained images. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4295–4304. Boston, MA.
- [6] (2016) Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Las Vegas, NV.
- [7] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, pp. 6629–6640. Long Beach, CA.
- [8] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, pp. 448–456.
- [9] (2017) Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5967–5976. Honolulu, HI.
- [10] (2016) Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision, pp. 694–711. Amsterdam, Netherlands.
- [11] (2019-06) A style-based generator architecture for generative adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4396–4405.
- [12] (2018-04) Progressive growing of GANs for improved quality, stability, and variation. International Conference on Learning Representations.
- [13] (2019-12) Analyzing and improving the image quality of StyleGAN. arXiv:1912.04958v1.
- [14] (2009-12) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10, pp. 1755–1758.
- [15] (2015-05) Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations.
- [16] (2019-12) Learning to predict layout-to-image conditional convolutions for semantic image synthesis. Proceedings of Advances in Neural Information Processing Systems, pp. 570–580. Vancouver, Canada.
- [17] (2017-10) Least squares generative adversarial networks. IEEE International Conference on Computer Vision, pp. 2813–2821.
- [18] (2014-11) Conditional generative adversarial nets. arXiv:1411.1784v1.
- [19] (2018-04) Spectral normalization for generative adversarial networks. International Conference on Learning Representations.
- [20] (2017-09) VoxCeleb: a large-scale speaker identification dataset. Conference of the International Speech Communication Association.
- [21] (2019-10) FSGAN: subject agnostic face swapping and reenactment. IEEE International Conference on Computer Vision, pp. 7183–7192. Seoul, Korea.
- [22] (2019-06) Semantic image synthesis with spatially-adaptive normalization. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2332–2341.
- [23] (2015-11) U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention 9351, pp. 234–241.
- [24] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- [25] (2015-05) Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations. San Diego, CA.
- [26] (2019) Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics 38 (4).
- [27] (2018-12) Face2Face: real-time face capture and reenactment of RGB videos. Communications of the ACM 62 (1), pp. 96–104.
- [28] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- [29] (2018-09) X2Face: a network for controlling face generation by using images, audio, and pose codes. European Conference on Computer Vision.
- [30] (2018-09) ReenactGAN: learning to reenact faces via boundary transfer. European Conference on Computer Vision. Munich, Germany.
- [31] (2019-09) Few-shot adversarial learning of realistic neural talking head models. arXiv:1905.08233v2.
- [32] (2019-06) Self-attention generative adversarial networks. Proceedings of the International Conference on Machine Learning 97, pp. 7354–7363.
- [33] (2019-09) One-shot face reenactment. British Machine Vision Conference.