Synthesizing facial animation from portrait images has a wide range of creative applications, including visual effects, multimedia messaging apps, and visual dubbing. Using someone’s facial expressions and head pose in videos to drive the face of another person in a single image (known as portrait reenactment) is particularly popular due to its intuitive control and accessibility.
Traditionally, producing a realistically animated face is achieved by rendering a carefully digitized 3D head model with texture maps. Although many digital humans are still being produced this way for high-end visual effects and video games, this approach involves a tedious production effort including large teams of digital artists and often relies on complex 3D capture equipment.
More recently, deep learning-based methods have gained significant attention due to their success in producing realistic face reenactment. In particular, “DeepFakes” (e.g. ) have become a widely used approach for end-to-end video-based face-swapping. However, they do not generalize well to unseen subjects, requiring thousands of internet photos and often days of training for each subject.
Currently, some advanced deep learning techniques for face image manipulation combine conditional GANs with facial geometry information, such as 3D facial models [40, 35, 34] or 2D landmarks [64, 42], to provide both better control and better generalization to arbitrary identities. The 3D model-based methods only work under specific conditions: they mostly rely on statistical face models, are often limited to certain face regions and linear shape variations, and lose accuracy for non-frontal portraits. In contrast, our goal is to synthesize highly complex head poses, facial expressions, and facial appearances (facial hair, stylized content, complex lighting conditions), as well as the image regions surrounding the face, such as hair and background. The 2D landmark-based methods, on the other hand, cannot accurately preserve identity and complex facial expressions because they lack appropriate landmark adaptation.
In this work, we wish to achieve a one-shot portrait reenactment of novel subjects (with no subject-specific training) for the cross-subject setting (meaning the ability to accommodate any driver). In particular, we aim to address the problem of portrait reenactment from a single image of someone (the “target”) using a sequence of 2D facial landmarks from a video of another person (the “source”).
Our goal is to improve the preservation of identities within a cross-subject setting. Several challenges need to be addressed for identity-preserving face reenactment, particularly when a target and a source are different subjects. First, it is a non-trivial task to properly extract facial expressions/poses from person-specific facial features encoded using 2D facial landmark coordinates. For example, how can we determine whether a person has narrow eyes or is squinting and thus properly transfer identity-invariant motion to the target? The second challenge lies in synthesizing photorealistic and recognizable results with arbitrary expression and novel views. The third challenge is to achieve the above from a single reference image of the target subject without relying on any person-specific training or fine-tuning.
We achieve this by introducing two novel sub-networks. The first, Landmark Disentanglement Network (LD-Net), learns to disentangle the identity from the head poses and expressions, predicting facial landmarks that preserve the identity of the target while combining expressions and poses from another driving subject. The second, Feature Dictionary-based Generative Adversarial Network (FD-GAN), learns to transform the landmark positions into a personalized video portrait of the subject depicted in a single target image, which allows subject-agnostic reenactment of a portrait that preserves the target’s identity and can be applied to unseen subjects without subject-specific training. To summarize, we make the following technical contributions: (1) We introduce a novel one-shot learning method that enables portrait reenactment using the identity from a single image and expressions and poses from videos of another subject. (2) We present a novel network that disentangles the identity from 2D face landmarks for cross-subject portrait reenactment. (3) We also propose a feature dictionary-based generative network to synthesize a high-fidelity face image, which is applicable to new subjects. We evaluate each sub-network as well as the full method extensively via quantitative measurements and qualitative comparisons with the state-of-the-art methods, and demonstrate our method’s ability to preserve the target subject’s identity and to generalize to unseen subjects for cross-subject face reenactment.
2 Related Work
3D Graphics-based Methods.
Research in 3D facial animation and rendering dates back several decades. The seminal work of  showed that a morphable principal component model is effective in modeling human facial geometry and appearance. Over the years, a number of technical advances  have been made in capturing high-resolution textures and geometric details, as well as subject-specific expression deformations that are necessary to create realistic renderings of human face animations. To capture facial textures, recent work uses deep learning-based analysis to infer photorealistic skin albedo maps from a single image [48, 20]. Olszewski et al.  showed that realistic face puppeteering is possible from a single picture by rendering a sequence of dynamic textures that are synthesized using generative adversarial networks (GANs). For face geometry, other recent approaches employ a non-linear morphable model  to improve the fidelity, use a local regressor to enhance high-resolution details , or use a pixel-to-pixel translation network to learn mesoscopic geometry  or comprehensive skin reflectance [63, 49].
The reconstructed 3D mesh can be used to re-render animated faces using single-view face tracking for video dubbing [14, 18], realistic facial reenactment [54, 56], facial replacement , or lip-syncing . Previous work also explored the techniques for animating a photorealistic avatar by building person-specific skin deformation and texture models from RGB-D  or RGB scans . In 3D-based methods, 3D face modeling and retargeting from a single-view input [6, 8, 19, 27, 28, 38, 47, 61, 37] can be performed to properly decompose person-specific facial shapes from expressions and pose. Alternatively, 2D facial landmarks and image-space detail transfer can be used to animate still portraits .
Deep Learning-based Methods.
For still portrait synthesis, GANs have been extensively studied for synthesizing a high-resolution human face, including person-specific details such as pores and facial hair [32, 33]. A conditional GAN has been introduced for pixel-to-pixel translation applications to manipulate a human face image from edge drawings or image sequences. For facial expression editing, Choi et al. extended previous work to multi-domain pixel translations, enabling facial expression editing using discrete expression labels. Pumarola et al. showed that a conditional GAN with a cycle consistency loss can be used for unsupervised learning of continuous facial animation editing from a single image. However, the controls provided are too coarse to capture the fine-scale nature of human facial expressions. For face swapping applications, “DeepFake” frameworks (e.g. ) employ an encoder and decoder architecture to achieve video-based face swapping of a pair of subjects. However, the framework cannot handle arbitrary pairs of subjects without additional subject-specific training. For many-to-many subject face swapping, Bao et al. proposed a GAN-based training framework to decompose facial identity from other attributes such as expression, pose and illumination, allowing end-to-end face swapping for unseen subjects.
An alternative approach for decomposing facial identity from other attributes is to explicitly model it as facial geometry such as 2D landmarks or 3D face models. For 2D geometry-guided methods, Natsume et al.  proposed a framework to achieve single-image face swapping between unseen identities conditioned on 2D landmarks. Nirkin et al.  proposed a recursive approach for improved identity preservation for a subject-agnostic face image synthesis. Siarohin et al.  introduced a first order motion model which can animate an image of a variety of categories via keypoints and local affine transformations including a human face portrait. However, it is challenging to properly separate person-specific identity, facial expressions and pose from coarse 2D landmarks (typically 68 fiducial points), and thus the previous work can still suffer from noticeable artifacts and identity mismatches. Zakharov et al.  relaxed the requirement for one-shot learning, thereby showing that few-shot learning could be employed to improve the identity preservation for portrait reenactment. However, it cannot adapt the landmarks when the source and target subjects are different and does not address cross-subject face reenactment. Wang et al.  proposed a few-shot framework for general video-to-video translation applications and applied it to animating portraits. Unlike any of the above methods, our method only requires a single image of the target, it does not require subject-specific training, and it can handle cross-subject reenactment of unseen subjects (i.e., it is subject-agnostic).
For 3D geometry-based methods, Kim et al. proposed a hybrid approach combining 3D morphable models and an image translation network to translate 3D renderings of the target face into a synthetic video for video portrait reenactment or visual dubbing between pairs of subjects. Nagano et al. proposed a generalized solution that can synthesize arbitrary expressions of an unseen identity from a single picture, but it only operates in the face region. Previous work [21, 22] addressed identity-agnostic face image synthesis using 3D face fitting and deep neural networks, and additionally handled full portrait manipulation, including the background, using background warping or blending.
Our goal is to transfer the head pose and facial expression from a source video of one person to a target image of another subject while preserving the target’s identity. Based on the observation that 2D facial landmarks contain information about the pose and expression as well as person-specific identity features (e.g. the size, shape, proportion, and layout of the facial features), we propose to disentangle the identity and pose/expression from the landmarks and use them for landmark synthesis. As shown in Fig. 2, our method consists of two sub-networks. The Landmark Disentanglement Network (LD-Net) first synthesizes new landmarks with the target’s identity and the source’s pose/expression. Then the Feature Dictionary-based Generative Adversarial Network (FD-GAN) takes both target and synthetic landmarks as input and translates them into a new face image.
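The two-stage data flow described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `reenact`, `ld_net`, `fd_gan`, and the two stand-in functions are hypothetical names, and the networks are passed in as plain callables so that the sketch does not depend on any deep learning framework.

```python
from typing import Callable, List, Sequence, Tuple

Landmarks = List[Tuple[float, float]]  # one (x, y) pair per facial landmark


def reenact(
    target_image: object,
    target_landmarks: Landmarks,
    source_landmark_sequence: Sequence[Landmarks],
    ld_net: Callable[[Landmarks, Landmarks], Landmarks],
    fd_gan: Callable[[object, Landmarks, Landmarks], object],
) -> List[object]:
    """Drive a single target image with a sequence of source landmarks."""
    frames = []
    for source_landmarks in source_landmark_sequence:
        # Stage 1: target identity + source pose/expression -> new landmarks.
        synthetic_landmarks = ld_net(target_landmarks, source_landmarks)
        # Stage 2: target appearance + both landmark sets -> output frame.
        frames.append(fd_gan(target_image, target_landmarks, synthetic_landmarks))
    return frames


# Stand-in callables (assumptions for this demo only; real networks are trained models).
def dummy_ld_net(target: Landmarks, source: Landmarks) -> Landmarks:
    # Placeholder arithmetic, NOT how LD-Net combines identity and pose.
    return [((tx + sx) / 2.0, (ty + sy) / 2.0) for (tx, ty), (sx, sy) in zip(target, source)]


def dummy_fd_gan(image: object, target: Landmarks, synthetic: Landmarks) -> object:
    # Placeholder "image": just bundles the inputs that a real generator would use.
    return (image, synthetic)


frames = reenact(
    "portrait.png",
    [(0.0, 0.0)],
    [[(2.0, 2.0)], [(4.0, 0.0)]],
    dummy_ld_net,
    dummy_fd_gan,
)
```

The point of the sketch is the interface: the target image and target landmarks are fixed for the whole sequence, and only the source landmarks change from frame to frame.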
3.1 Landmark Disentanglement Network (LD-Net)
Disentangling landmarks into identity and pose/expression is difficult due to the lack of accurate numerical labeling for pose/expression. Inspired by , which can disentangle two complementary factors of variations with only one of them labeled, we propose a landmark disentanglement network (LD-Net) to disentangle identity and pose/expression using data with only the subject’s identity labeled. More importantly, our network generalizes well to novel identities (i.e., those unseen during training), unlike previous works (e.g. ).
Given 2D facial landmarks from a pair of face images, LD-Net first disentangles the landmarks into a pose/expression latent code and an identity latent code, then combines the target’s identity code with the source’s pose/expression code to synthesize new landmarks. As shown in Fig. 3, the training procedure of LD-Net is divided into two stages. Stage 1 aims to train a stable pose/expression encoder and Stage 2 generalizes to predict an identity code from landmarks instead of using identity labels so as to handle unseen identities.
In Stage 1, similar to , the network consists of four modules: (1) a pose/expression encoder that computes a code from the input landmarks that encodes only pose/expression without information about identity; (2) a one-layer network that maps the input one-hot identity label to an identity code; (3) a generator that combines a pose/expression code and an identity code to reconstruct landmarks; and (4) a classifier that tries to classify the generated landmarks based on their identity.
As shown in Fig. 3, Stage 1 is trained with two branches in each iteration, which share the same generator but have their own inputs and outputs. In the first branch, the input identity label and the input landmark locations come from the same subject, so the reconstructed landmarks should be the same as the input landmarks, which defines a reconstruction loss. In the second branch, the identity label and the landmarks come from different subjects, and the reconstructed landmarks should contain no information about the identity of the subject who supplied the input landmarks.
To achieve this, the classifier tries to classify the generated landmarks as belonging to the subject who supplied the input landmarks, while the pose/expression encoder, generator, and identity encoder try to prevent the classifier from doing so. The classifier is trained with the corresponding classification loss.
Meanwhile, the pose/expression encoder, generator, and identity encoder jointly optimize the reconstruction loss minus the identity classification loss, as in Eq. 2, with a fixed weight on each term. The reconstruction loss is defined as the per-point squared Euclidean distance between landmark coordinates.
All expectations are taken over landmarks and identity labels drawn from their respective training distributions, where a landmark sample and an identity label may come from different subjects. Unlike the network architecture that inspired LD-Net, all convolutional networks are replaced with multi-layer perceptrons (MLPs), since we operate on landmark coordinates instead of images.
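The Stage 1 objectives described above can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $x$ denotes input landmarks with $N$ points $x_i$, $a_x$ the identity label of $x$, $a'$ an identity label that may belong to a different subject, $E_p$ the pose/expression encoder, $E_{id}$ the one-layer identity network, $G$ the generator, $C$ the classifier, and $\lambda_{rec}$, $\lambda_{cls}$ unspecified positive weights. The cross-entropy form of the classification loss is likewise an assumption.

```latex
% Classification loss minimized by the classifier C (assumed cross-entropy form)
\mathcal{L}_{cls} = \mathbb{E}_{x,\,a'}\left[ -\log C\big(a_x \mid G(E_p(x), E_{id}(a'))\big) \right]

% Reconstruction loss: per-point squared Euclidean distance over N landmark points
\mathcal{L}_{rec} = \mathbb{E}_{x}\left[ \sum_{i=1}^{N} \big\| G(E_p(x), E_{id}(a_x))_i - x_i \big\|_2^2 \right]

% Objective minimized jointly by E_p, G and E_id: reconstruction minus classification
\min_{E_p,\,G,\,E_{id}} \; \lambda_{rec}\,\mathcal{L}_{rec} - \lambda_{cls}\,\mathcal{L}_{cls}
```

The minus sign is what makes the setup adversarial: the classifier lowers $\mathcal{L}_{cls}$ by finding identity cues left in the generated landmarks, while the encoders and generator raise it by removing those cues from the pose/expression code.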
To generalize to novel subjects, we introduce an identity encoder to Stage 2. A notable deficiency of  is that it does not include any encoder for the labeled data (i.e., identity in our task) and thus it is limited to generating new samples only for the labeled classes in the training data. In Stage 2, we replace the one-layer network with a full-fledged identity encoder that accepts landmarks as input and encodes them into an identity code.
The full training network of Stage 2 is shown in Fig. 3. It also involves two branches, similar to Stage 1: the first branch takes a pair of landmark inputs from the same subject, while the second takes a pair from different subjects.
For the input pair in the second branch, no ground truth is available, so we train a discriminator and a classifier to constrain the reconstructed landmarks. The discriminator is trained with a least-squares loss, and the classifier is trained with an adversarial loss on both input landmarks and generated landmarks, with expectations taken over the training landmarks and their identity labels. In addition, a content consistency loss is defined between the generated pose/expression code and its ground truth.
For the input pair in the first branch, the reconstructed landmarks should have the same pose/expression and identity as the landmarks that supply the pose/expression code. Thus, a reconstruction loss is defined to discourage the identity encoder from encoding pose/expression, with the expectation taken over pairs of landmarks from the same subject.
The identity encoder and generator are then jointly optimized to minimize a weighted sum of the above losses, with a fixed weight on each of the four terms.
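One way to write the Stage 2 losses is sketched below. All symbols are assumed for illustration and are not taken from the paper: $x_1$ and $x_2$ are landmarks of the same subject, $x_3$ landmarks of a different subject, $E_p$ the pose/expression encoder, $E_{id}$ the landmark-based identity encoder, $G$ the generator, and $D$ the discriminator. The least-squares form follows the text; the targets 1 and 0 are the common convention, the squared L2 norms are assumptions, and $\mathcal{L}_{cls}$ stands for the adversarial classifier term as seen by the generator, whose exact form and sign cannot be pinned down from the text.

```latex
% Cross-subject output of the second branch
\hat{x} = G\big(E_p(x_1),\, E_{id}(x_3)\big)

% Least-squares discriminator loss (targets 1 for real, 0 for generated: assumed convention)
\mathcal{L}_{D} = \mathbb{E}_{x}\big[(D(x) - 1)^2\big] + \mathbb{E}_{x_1, x_3}\big[D(\hat{x})^2\big]

% Content consistency between the re-encoded and the original pose/expression code
\mathcal{L}_{cont} = \mathbb{E}_{x_1, x_3}\big[\| E_p(\hat{x}) - E_p(x_1) \|_2^2\big]

% Same-subject reconstruction (x_1 and x_2 share an identity)
\mathcal{L}_{rec} = \mathbb{E}_{x_1, x_2}\big[\| G(E_p(x_1), E_{id}(x_2)) - x_1 \|_2^2\big]

% Weighted sum minimized by E_id and G (weights unspecified here)
\min_{E_{id},\,G} \; \lambda_{adv}\,\mathbb{E}_{x_1, x_3}\big[(D(\hat{x}) - 1)^2\big]
  + \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{cont}\,\mathcal{L}_{cont} + \lambda_{rec}\,\mathcal{L}_{rec}
```

In the reconstruction term the identity code comes from $x_2$ while the target is $x_1$; if the identity encoder leaked the pose of $x_2$ into its code, this term would penalize it.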
3.2 Feature Dictionary-based Generative Adversarial Network (FD-GAN)
With the predicted landmarks rasterized into a landmark image, our next goal is to translate it to a photorealistic face image. We can think of the translation procedure as follows: a local patch around each location in the landmark image indicates “which facial part should be here”, and for each location, we want to translate this into “how it should appear”. Thus, we propose a novel feature dictionary-based generative adversarial network (FD-GAN) to achieve these intuitive objectives.
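As a minimal sketch of the rasterization step, the function below marks each landmark position in an otherwise empty image. The function name, the binary single-channel format, and the point-only drawing are assumptions made for illustration; the text does not specify how the landmark image is drawn (for example, whether points are connected into contours).

```python
from typing import List, Sequence, Tuple


def rasterize_landmarks(
    points: Sequence[Tuple[float, float]], height: int, width: int
) -> List[List[int]]:
    """Return a height x width binary image with 1 at each landmark pixel.

    `points` holds (x, y) coordinates in pixel units; points that fall
    outside the image are skipped.
    """
    image = [[0] * width for _ in range(height)]
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            image[row][col] = 1
    return image


# Two landmarks inside a 4 x 5 image and one outside it.
landmark_image = rasterize_landmarks(
    [(1.2, 0.8), (3.0, 2.0), (9.0, 9.0)], height=4, width=5
)
```

A convolutional translator can then read a local patch of such an image to infer which facial part lies at each location.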
The architecture of FD-GAN is illustrated in Fig. 4; it consists of an extractor and a translator. Given a target image and its corresponding landmark image, we train an extractor that constructs a “feature dictionary” in the module D, which is essentially a mapping from an annotation in the landmark image to its appearance in the target image. Concurrently, given another landmark image and the feature dictionary, we train a translator that retrieves relevant facial features from the dictionary based on the landmarks and composes a face image.
The dictionary mapping is realized with a mechanism similar to the memory bank in the Neural Turing Machine (NTM): the feature dictionary is a memory matrix, with each memory row conceptually corresponding to some facial component and the value stored in that row corresponding to how that component should appear on a specific subject’s face. The construction of such a feature dictionary corresponds to the write operation in NTM, and the lookup step during translation corresponds to the read operation in NTM. Precisely, a feature dictionary consists of a fixed number of rows. Each row is associated with a write tag and a read tag, both of which are vectors of the same length and learnable parameters of the network, and with a stored value, which is a vector computed by the network.
During the writing phase in the extractor, the stored values are computed as follows. In the form of two convolutional feature maps, the extractor generates a write key and a write value for each spatial location. The stored value of each row in the feature dictionary is then a weighted sum of the write values over all spatial locations, where the weight of a location is a softmax, normalized over locations, of the matching score between the row’s write tag and that location’s write key.
Similarly, during the reading phase, the translator generates, as a convolutional feature map, a read key for each spatial location. The value returned by the lookup at each location is a weighted sum of the stored values of all rows in the feature dictionary, where the weight of a row is a softmax, normalized over rows, of the matching score between that row’s read tag and the location’s read key. The translator then continues its network operations on this returned convolutional feature map.
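The write and read operations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the matching score between a tag and a key is taken to be their dot product, spatial locations are flattened into a list, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Sequence[float], vectors: Sequence[Vector]) -> List[float]:
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def write_dictionary(write_tags, write_keys, write_values) -> List[List[float]]:
    """One stored value per row: softmax over locations of tag-key scores (assumed dot product)."""
    stored = []
    for tag in write_tags:
        weights = softmax([dot(tag, key) for key in write_keys])
        stored.append(weighted_sum(weights, write_values))
    return stored


def read_dictionary(read_tags, stored_values, read_keys) -> List[List[float]]:
    """One returned value per location: softmax over rows of tag-key scores (assumed dot product)."""
    output = []
    for key in read_keys:
        weights = softmax([dot(tag, key) for tag in read_tags])
        output.append(weighted_sum(weights, stored_values))
    return output


# Toy example: 2 dictionary rows, 2 write locations, 1 read location.
write_tags = [[10.0, 0.0], [0.0, 10.0]]
read_tags = [[10.0, 0.0], [0.0, 10.0]]
write_keys = [[1.0, 0.0], [0.0, 1.0]]
write_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

stored_values = write_dictionary(write_tags, write_keys, write_values)
read_out = read_dictionary(read_tags, stored_values, [[0.0, 1.0]])
```

In this toy setup the read key matches the second row almost exclusively, so the returned vector is close to the write value of the second location. In the network, the tags are learned, and the keys and values are produced by convolutional layers.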
The extractor and translator are both fully convolutional, with U-Net skip connections as shown in Fig. 4. To train the joint extractor-translator, we employ a combination of a reconstruction loss, a GAN loss, and an adversarial classifier loss. The discriminator and classifier are both patch-based, with their losses averaged across spatial locations. To avoid excessive notation, we treat the extractor-translator as a single network and reuse the discriminator and classifier notation of Sec. 3.1. For simplicity, we omit the range over which the expectations are taken: each training sample consists of two frames from the same video clip, their respective landmark images, and the identity label shared by the two frames.
The discriminator and the classifier each minimize their own loss, while the extractor-translator in FD-GAN minimizes a weighted sum of the adversarial discriminator loss, the adversarial classifier loss, and the reconstruction loss, with a fixed weight on each term.
We first conduct an evaluation and ablation study in Sec. 4.1 on the performance of LD-Net and FD-GAN independently, followed by comparisons of our full method with the state-of-the-art methods on cross-subject face reenactment in Sec. 4.2. For more results tested on unconstrained portrait images, please refer to the supplemental material.
For FD-GAN, the extractor and translator are based on U-Nets, with both networks joined together by dictionary writer/reader modules inserted into the up-convolution modules. The discriminator and classifier for FD-GAN are patch-based and have the same structure as the down-convolution part of the U-Nets. Please refer to the supplemental material for more details concerning the network structures and training strategies.
Our method takes approximately 0.08s for FD-GAN to generate one image and 0.02s for LD-Net to perform landmark disentanglement on a single NVIDIA TITAN X GPU.
We use three datasets to evaluate our method:
LMTest: a landmark dataset of 200,000 landmark sets (100 subjects × 2,000 frames of varying poses and expressions) with ground truth labels for both identity and pose/expression. Using a video of one person performing and the first 100 neutral-expression photos from the Compound Facial Expressions Database, we used single-view 3D face fitting to retarget facial expressions and poses from the video subject to each subject’s 3D face model, and projected the 3D vertex positions to obtain ground truth 2D landmarks. This dataset is used to evaluate the effect of LD-Net.
SelfTest: a video dataset of 8,000 frames for 80 subjects from the VoxCeleb testing data (100 frames per video at 25 fps). It is only used for the ablation study when testing self-reenactment, where the ground truth is known.
CrossTest: a video dataset of 8,000 frames for 80 pairs of subjects (100 frames per video at 25 fps) randomly sampled from the VoxCeleb testing data, used to compare our method with the baselines in one-shot cross-subject face reenactment.
Metrics. We use the following metrics for quantitative evaluation of generated images.
Expression Distance (ED): the L2 distance between the intensities of corresponding facial action units, detected by OpenFace, in the generated images and the driving images.
Fréchet Inception Distance (FID): measures the distance between the distributions of real data and generated data to quantify the fidelity of the results.
Structural Similarity (SSIM): measures low-level similarity to ground truth images in the self-reenactment setting.
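Two of these metrics are simple enough to sketch directly. The code below is illustrative only: the function names are hypothetical, the action-unit intensities are assumed to have been extracted beforehand (for example with OpenFace) into plain lists, and `global_ssim` is a simplified single-window variant that uses whole-image statistics instead of the usual sliding window.

```python
import math
from typing import Sequence


def expression_distance(generated_aus: Sequence[float], driving_aus: Sequence[float]) -> float:
    """L2 distance between action-unit intensity vectors of one frame pair."""
    if len(generated_aus) != len(driving_aus):
        raise ValueError("both frames must report the same action units")
    return math.sqrt(sum((g - d) ** 2 for g, d in zip(generated_aus, driving_aus)))


def global_ssim(image_a: Sequence[float], image_b: Sequence[float], dynamic_range: float = 255.0) -> float:
    """SSIM computed from global statistics of two flattened grayscale images."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")
    n = len(image_a)
    mean_a = sum(image_a) / n
    mean_b = sum(image_b) / n
    var_a = sum((a - mean_a) ** 2 for a in image_a) / n
    var_b = sum((b - mean_b) ** 2 for b in image_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(image_a, image_b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


ed = expression_distance([1.0, 2.0, 2.0], [0.0, 0.0, 0.0])
ssim_same = global_ssim([0.0, 50.0, 100.0, 150.0], [0.0, 50.0, 100.0, 150.0])
ssim_flipped = global_ssim([0.0, 50.0, 100.0, 150.0], [150.0, 100.0, 50.0, 0.0])
```

Whether ED is averaged over frames or over videos is not stated in the text, so the sketch covers a single frame pair only.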
Evaluation of LD-Net.
To validate the accuracy in disentangling identity and pose/expression, we test LD-Net in isolation using the LMTest dataset. From the 200,000 landmarks, we sample pairs of landmarks in three different patterns: the same identity but different pose/expression, the same pose/expression but different identity, and both identity and pose/expression differing. In each case, we randomly sample one million pairs of landmarks and compute their distances in the latent spaces of the identity encoder and the pose/expression encoder, respectively. We first use PCA to reduce the dimensionality of both encoders’ latent codes to 8 before computing the mean Euclidean distance over the pairs. In the identity encoder’s latent space, pairs of landmarks from the same subject should give a smaller mean distance than pairs from different identities, and similarly for landmarks with the same pose/expression in the latent space of the pose/expression encoder. Table 1 gives the mean distances for each case, which shows that the identity code and the pose/expression code do control their respective aspects of the generated landmarks, regardless of the input provided.
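Table 1: Mean distance between sampled landmark pairs, by sampling pattern (rows), in the latent space of each of the two encoders (columns).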
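The pair-sampling protocol can be sketched as follows. This is a simplified illustration: the `Sample` type, the function names, and the toy encoders are hypothetical, pairs are drawn by rejection sampling, and the PCA step that reduces the codes to 8 dimensions is omitted for brevity.

```python
import math
import random
from typing import Callable, List, NamedTuple, Sequence


class Sample(NamedTuple):
    identity: int           # ground truth subject label
    pose: int               # ground truth pose/expression frame index
    landmarks: List[float]  # flattened landmark coordinates


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pair_distance(
    samples: List[Sample],
    encode: Callable[[List[float]], Sequence[float]],
    same_identity: bool,
    same_pose: bool,
    num_pairs: int,
    seed: int = 0,
) -> float:
    """Mean latent distance over random pairs that match the requested pattern."""
    rng = random.Random(seed)
    total, accepted, attempts = 0.0, 0, 0
    while accepted < num_pairs:
        attempts += 1
        if attempts > 1000 * num_pairs:
            raise RuntimeError("too few eligible pairs for this pattern")
        a, b = rng.sample(samples, 2)
        if (a.identity == b.identity) != same_identity:
            continue
        if (a.pose == b.pose) != same_pose:
            continue
        total += euclidean(encode(a.landmarks), encode(b.landmarks))
        accepted += 1
    return total / num_pairs


# Toy data: landmarks are [identity, pose], so perfect "encoders" are trivial.
samples = [Sample(i, p, [float(i), float(p)]) for i in range(3) for p in range(4)]
identity_code = lambda lm: [lm[0]]  # stand-in for the identity encoder
pose_code = lambda lm: [lm[1]]      # stand-in for the pose/expression encoder

same_id_dist = mean_pair_distance(samples, identity_code, True, False, 200)
diff_id_dist = mean_pair_distance(samples, identity_code, False, True, 200)
same_pose_dist = mean_pair_distance(samples, pose_code, False, True, 200)
```

With well-disentangled encoders, the pattern that holds a factor fixed should yield a clearly smaller mean distance in that factor's latent space, which is the behavior Table 1 is meant to report.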
Ablation analysis of LD-Net.
In Table 2, we show the effect of LD-Net on the generated images in terms of identity, expression, and pose preservation in two settings: self-reenactment and cross-subject reenactment. We use three metrics, ISIM, PSIM, and ED, to measure the matching accuracy of identity, pose, and expression, respectively. In self-reenactment, we compare results generated using ground truth landmarks with results generated using landmarks synthesized by LD-Net. In cross-subject reenactment, we make a similar comparison between results using the source subject’s landmarks as-is and results using LD-Net landmarks. Since PSIM and ED compare poses/expressions against the source images, the results using the source landmarks (ground truth) always lead to better matching accuracy. However, in self-reenactment the landmarks generated by LD-Net are very close to the ground truth landmarks. Moreover, the cross-subject setting shows the importance of predicting personalized landmarks, which yield higher identity accuracy in Table 2 and better visual quality in Fig. 5.
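Table 2: Ablation of LD-Net in the self-reenactment and cross-subject settings. Columns for each setting: landmark source (LM from), ISIM, PSIM, ED.
Figure 5: Columns, left to right: target image, source image, source landmarks and result, LD-Net landmarks and result.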
Ablation analysis of FD-GAN.
The first baseline, Pix2PixHD, is an advanced image-to-image translation network that can synthesize photorealistic images from landmark images. In FD-GAN-1, we reduce the number of rows in the feature dictionary to 1, with its value being the mean of the convolutional features. The final baseline, AdaIN, also uses a one-row dictionary but uses the written value to generate parameters for adaptive instance normalization (AdaIN).
We compare FD-GAN with the three baselines in both the self-reenactment and cross-subject settings. To evaluate FD-GAN alone, we use two image quality metrics, SSIM (unavailable in cross-subject reenactment) and FID, in addition to ISIM. As shown in Table 3, the quantitative comparison in both settings demonstrates that our FD-GAN best preserves low-level image features, image fidelity, and identity information. Since our generative network learns a local mapping from the target image and landmarks to the final image, it is flexible enough to generalize to unseen subjects. In contrast, existing image-to-image approaches such as Pix2PixHD lack the domain generalization capability needed to synthesize unseen subjects without subject-specific learning.
Comparison with one-shot methods.
We first show quantitative comparisons with two state-of-the-art one-shot face reenactment baselines, X2Face and First-order-model, using their models pre-trained on the VoxCeleb training dataset. We evaluate the models in the same setting, without any fine-tuning, on the CrossTest dataset. Both X2Face and First-order-model are warping-based methods that generalize well to unseen subjects in the one-shot setting. In X2Face, the generated frame inherits the object proportions of the driving source video, and the quality of its results is very sensitive to the cropping region and face alignment, as shown in Fig. 6 (we also test the algorithm with a different crop size in the supplemental material). From the quantitative comparison in Table 4 and the qualitative comparison in Fig. 6, we can see that the results from the First-order-model demonstrate the best image fidelity, since it uses a warping formulation to generate the deformed faces. However, its warping formulation, which is based on keypoints and local affine transformations, cannot provide local control as accurate as our synthesized landmarks, which better preserve the source pose/expression, especially when handling very different head poses and facial expressions. Therefore, compared to these methods, our model can generalize to unseen subjects with better identity preservation and more consistent quality under a large variety of poses/expressions.
Qualitative comparison with 3D-based methods.
Fig. 7 shows qualitative comparisons on the FaceForensics++ test dataset with two state-of-the-art 3D-based methods (Face2Face and NeuralTexture). Compared to these methods, which require 3D face fitting to maintain the target identity and cannot change head poses, our method can synthesize personalized faces with arbitrary head poses using only 2D landmarks.
5 Discussion and Future Work
We have demonstrated a technique for portrait reenactment that only requires a single target picture and 2D landmarks of the target and the driver. The resulting portrait is not only photorealistic but also preserves recognizable facial features of the target. Our comparison shows significantly improved results compared to state-of-the-art single-image portrait manipulation methods. Our extensive evaluations confirm that identity disentanglement of 2D landmarks is effective in preserving the identity when synthesizing a reenacted face. We have shown that our method can handle a wide variety of challenging facial expressions and poses of unseen identities without subject-specific training. This is made possible thanks to our generator, which uses a feature dictionary to translate landmark features into a photorealistic portrait.
A limitation of our method is that the resulting portrait has a resolution of only 256×256, and it is still difficult to capture high-resolution person-specific details such as stubble. It can also suffer from artifacts in non-facial parts and the background region, since we rely on the landmarks to transfer facial appearance but the landmarks contain no structural information about the hair or background. We believe this limitation could be addressed by incorporating dense pixel-wise conditioning and segmentation. While our method can produce reasonably stable portrait reenactment results from a single frame of the target and 2D landmarks, the temporal consistency could be further improved by taking into account temporal information from the entire video.
We would like to thank Qingguo Xu for his help on processing the VoxCeleb video dataset and Kyle Olszewski for proofreading this manuscript. This research was conducted at USC and was funded in part by the ONR YIP grant N00014-17-S-FO14, the CONIX Research Center, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005, Adobe, and Sony. This project was not funded by Pinscreen, nor has it been conducted at Pinscreen or by anyone else affiliated with Pinscreen. Koki Nagano is affiliated with Pinscreen but worked on this project through his affiliation at USC/ICT. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
-  (2020) Disentangling style and content in anime illustrations. In Submitted to International Conference on Learning Representations, Note: under review. https://openreview.net/forum?id=BJe4V1HFPr Cited by: §3.1, §3.1, §3.1, §3.1.
-  (2017) Bringing portraits to life. ACM Trans. Graph. 36 (4), pp. to appear. Cited by: §2.
-  (2018) Openface 2.0: facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66. Cited by: item , item .
-  (2018-06) Towards open-set identity preserving face synthesis. In , Cited by: §2.
-  (1999) A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’99, pp. 187–194. External Links: Cited by: §2.
-  (2013-07) Online modeling for realtime facial animation. ACM Trans. Graph. 32 (4), pp. 40:1–40:10. External Links: Cited by: §2.
-  (2015) Real-time high-fidelity facial performance capture. ACM Trans. Graph. 34 (4), pp. 46. Cited by: §2.
-  (2014) Facewarehouse: a 3d facial expression database for visual computing. IEEE TVCG 20 (3), pp. 413–425. Cited by: §2.
-  (2016) Real-time facial animation with image-based dynamic avatars. ACM Trans. Graph. 35 (4), pp. 126. Cited by: §2.
-  (2018) Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74. Cited by: item .
-  (2016) Rapid photorealistic blendshape modeling from rgb-d sensors. In Proceedings of the 29th International Conference on Computer Animation and Social Agents, pp. 121–129. Cited by: §2.
-  (2018-06) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In IEEE CVPR, Cited by: §2.
-  (2018) Voxceleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §4.
-  (2011-12) Video face replacement. ACM Trans. Graph. 30 (6), pp. 130:1–130:10. External Links: Cited by: §2.
-  (2019) Faceswap. Note: https://github.com/deepfakes/faceswap Cited by: §1, §2.
-  (2014) Compound facial expressions of emotion. Proceedings of the National Academy of Sciences 111 (15), pp. E1454–E1462. External Links: Cited by: item .
-  (2019-09) 3D Morphable Face Models – Past, Present and Future. arXiv e-prints. External Links: Cited by: §2.
-  (2015) VDub: modifying face video of actors for plausible visual alignment to a dubbed audio track. 34 (2), pp. 193–204. Cited by: §2.
-  (2016) Reconstruction of personalized 3d face rigs from monocular video. ACM Trans. Graph. 35 (3), pp. 28. Cited by: §2.
-  (2019-06) GANFIT: generative adversarial network fitting for high fidelity 3d face reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018-12) Warp-guided gans for single-photo facial animation. ACM Trans. Graph. 37 (6), pp. 231:1–231:12. External Links: Cited by: §2.
-  (2019) 3D guided fine-grained face manipulation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 9821–9830. Cited by: §2.
-  (2011) Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph. 30 (6), pp. 129:1–129:10. External Links: Cited by: §1.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
-  (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §3.2.
-  (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: item .
-  (2015) Unconstrained realtime facial performance capture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1675–1683. Cited by: §2.
-  (2017) Avatar digitization from a single image for real-time rendering. ACM Trans. Graph. 36 (6). Cited by: §2.
-  (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §4.1, §9.1.
-  Mesoscopic facial geometry inference using deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  Image-to-image translation with conditional adversarial networks. In IEEE CVPR, pp. 5967–5976. Cited by: §1, §2.
-  (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.
-  (2018) A style-based generator architecture for generative adversarial networks. CoRR abs/1812.04948. External Links: Cited by: §2.
-  (2019) Neural style-preserving visual dubbing. ACM Transactions on Graphics (TOG) (). Cited by: §1, §2.
-  (2018-07) Deep video portraits. ACM Trans. Graph. 37 (4), pp. 163:1–163:14. External Links: Cited by: §1, §2.
-  Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10, pp. 1755–1758. Cited by: §4.
-  (2010-07) Example-based facial rigging. ACM Trans. Graph. 29 (3). Cited by: §2.
-  (2013-07) Realtime facial animation with on-the-fly correctives. ACM Trans. Graph. 32 (4). Cited by: §2.
-  (2017) Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2813–2821. Cited by: §3.1.
-  (2018-12) PaGAN: real-time avatars using dynamic textures. ACM Trans. Graph. 37 (6), pp. 258:1–258:12. External Links: Cited by: §1, §2, §5.
-  (2018-12) FSNet: an identity-aware generative model for image-based face swapping. In Proc. of Asian Conference on Computer Vision (ACCV), Cited by: §2.
-  (2019) FSGAN: subject agnostic face swapping and reenactment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7184–7193. Cited by: §1, §2.
-  (2017-04) On face segmentation, face swapping, and face perception. arXiv preprint arXiv:1704.06729. Cited by: §2.
-  Realistic dynamic facial textures from a single image using gans. Cited by: §2.
-  (2019) GANimation: one-shot anatomically consistent facial animation. Cited by: §2.
-  (2019) FaceForensics++: learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV), Cited by: §4.2, Figure 12, Figure 13, §9.2, §9.3.
-  (2016) Real-time facial segmentation and performance capture from rgb input. In ECCV, Cited by: §2.
-  (2017) Photorealistic facial texture inference using deep neural networks. In IEEE CVPR, Cited by: §2.
-  (2018) SfSNet: learning shape, reflectance and illuminance of faces in the wild. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2017) Meet mike: epic avatars. In ACM SIGGRAPH 2017 VR Village, SIGGRAPH ’17, New York, NY, USA, pp. 12:1–12:2. External Links: Cited by: §1.
-  (2019-12) First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: Figure 6, Table 4.
-  (2019) First order motion model for image animation. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 7137–7147. External Links: Cited by: §2, §4.2, Figure 11, Figure 12, §9.2.
-  (2017) Synthesizing obama: learning lip sync from audio. ACM Trans. Graph. 36 (4), pp. 95. Cited by: §2.
-  (2015) Real-time expression transfer for facial reenactment. ACM Trans. Graph. 34 (6). Cited by: §2.
-  (2019) Deferred neural rendering: image synthesis using neural textures. arXiv preprint arXiv:1904.12356. Cited by: Figure 7, §4.2, Figure 13, §9.3.
-  (2016) Face2face: real-time face capture and reenactment of rgb videos. In IEEE CVPR, pp. 2387–2395. Cited by: §2, Figure 7, item , §4.2, Figure 13, §9.3.
-  (2019-06) Towards high-fidelity nonlinear 3d face morphable model. In Proceedings of IEEE Computer Vision and Pattern Recognition, Long Beach, CA. Cited by: §2.
-  (2019) Few-shot video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE CVPR, Cited by: §2, Figure 6, §4.1, §4.1, Table 3, Table 4, Figure 10, §9.1.
-  (2011-07) Realtime performance-based facial animation. ACM Trans. Graph. 30 (4). Cited by: §2.
-  (2018-09) X2Face: a network for controlling face generation using images, audio, and pose codes. In The European Conference on Computer Vision (ECCV), Cited by: Figure 6, §4.2, Table 4, Figure 11, Figure 12, §9.2.
-  (2018-07) High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Trans. Graph. 37 (4), pp. 162:1–162:14. External Links: Cited by: §2.
-  (2019) Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233. Cited by: §1, §1, §2.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593. Cited by: §2.
In this supplementary material, we first explain details of the implementation, training strategy and performance of our method. We then provide additional results for evaluations and qualitative comparisons between our method and other one-shot face reenactment baselines on different datasets. Finally, we demonstrate the strong capability of our method by testing on in-the-wild portrait images from the Internet. More video results can be found in the supplementary video.
6 Implementation Details
All networks in LD-Net are MLPs with 10 hidden layers of 512 features each. The pose/expression code has length 64 and the identity code has length 128.
In FD-GAN, the extractor and translator are both based on U-Nets. The two networks are joined by dictionary writer/reader modules inserted in their up-convolution modules, as shown in Fig. 3 of the paper.
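To make the stated LD-Net sizes concrete, the following minimal sketch lists the linear-layer shapes of one such MLP and counts its parameters. The input dimension (68 two-dimensional landmarks flattened to 136 values) and the helper names are assumptions made for illustration; only the hidden width, the depth, and the code lengths come from the text.

```python
def mlp_layer_shapes(in_dim, out_dim, hidden_dim=512, num_hidden=10):
    """Return the (fan_in, fan_out) shape of every linear layer in the MLP."""
    dims = [in_dim] + [hidden_dim] * num_hidden + [out_dim]
    return list(zip(dims[:-1], dims[1:]))


def count_parameters(shapes):
    """Weights plus one bias per output unit, summed over all layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


# Assumption: 68 two-dimensional landmarks, flattened to 136 inputs.
NUM_INPUTS = 68 * 2
POSE_EXPRESSION_CODE = 64  # from the text
IDENTITY_CODE = 128        # from the text

pose_encoder = mlp_layer_shapes(NUM_INPUTS, POSE_EXPRESSION_CODE)
identity_encoder = mlp_layer_shapes(NUM_INPUTS, IDENTITY_CODE)

print(len(pose_encoder), count_parameters(pose_encoder))
print(len(identity_encoder), count_parameters(identity_encoder))
```

Ten hidden layers give eleven linear layers per network, and almost all parameters sit in the nine 512-to-512 layers, so the choice of input and code length has little effect on model size.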
In the down-convolution module, each level consists of a stride-2 convolution with a 4×4 kernel, followed by a flat convolution with a 3×3 kernel. In the up-convolution module of the extractor, each level consists of a dictionary writer module, followed by a flat convolution with a 3×3 kernel and then a stride-2 convolution with a 4×4 kernel. The up-convolution module of the translator is similar, with dictionary writers replaced by readers.
In the writer modules, write keys and write values are each computed from the input with a 1×1 convolution. In the reader modules, read keys are computed from the input with a 1×1 convolution, and the values read from the dictionary are added back to the input feature, as in a residual block.
The discriminator and classifier for the generation part are patch-based and have the same structure as the down-convolution module of the U-Nets. For all networks, from the lowest level to the highest, the number of convolution features, as well as the length of the rows in the feature dictionary, is (32, 64, 128, 256). The number of rows in the dictionary is (512, 256, 128, 64), and the length of the read/write tags is 32 for all dictionaries.
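As a rough illustration of the down-convolution path, the sketch below tracks the spatial resolution through each level using the standard convolution output-size formula. The padding of 1 for both convolutions, the four levels, and the function names are assumptions; the text specifies only kernel sizes and strides.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def down_path_sizes(input_size=256, num_levels=4):
    """Resolution after each level: stride-2 4x4 conv, then flat 3x3 conv."""
    sizes = []
    size = input_size
    for _ in range(num_levels):
        # Assumed padding of 1: the stride-2 4x4 convolution halves the size.
        size = conv_out_size(size, kernel=4, stride=2, padding=1)
        # Assumed padding of 1: the flat 3x3 convolution keeps the size.
        size = conv_out_size(size, kernel=3, stride=1, padding=1)
        sizes.append(size)
    return sizes


print(down_path_sizes())
```

Under these assumptions each level halves the resolution, so a 256×256 input is reduced to 16×16 after four levels.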
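The writer and reader modules can be pictured as a key-value memory. The NumPy sketch below shows one plausible soft-addressing scheme: per-row tags, softmax weights over pixels when writing, and softmax weights over rows when reading. This addressing scheme, the tag matrix `row_tags`, and all function names are assumptions for illustration and not the paper's exact formulation; the 1×1 convolutions and the residual addition on read follow the text.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def conv1x1(features, weight):
    """A 1x1 convolution is a per-pixel linear map: (C_in,H,W) -> (C_out,H,W)."""
    return np.einsum("oc,chw->ohw", weight, features)


def write_dictionary(features, w_key, w_value, row_tags):
    """Fill an (R, C) dictionary from a (C, H, W) feature map."""
    keys = conv1x1(features, w_key).reshape(w_key.shape[0], -1)        # (T, HW)
    values = conv1x1(features, w_value).reshape(w_value.shape[0], -1)  # (C, HW)
    # Assumed addressing: every dictionary row attends over all pixels.
    weights = softmax(row_tags @ keys, axis=1)                         # (R, HW)
    return weights @ values.T                                          # (R, C)


def read_dictionary(features, w_key, row_tags, dictionary):
    """Read from the dictionary and add the result back, as in a residual block."""
    c, h, w = features.shape
    keys = conv1x1(features, w_key).reshape(w_key.shape[0], -1)        # (T, HW)
    # Assumed addressing: every pixel attends over all dictionary rows.
    weights = softmax(row_tags @ keys, axis=0)                         # (R, HW)
    read = (dictionary.T @ weights).reshape(c, h, w)                   # (C, H, W)
    return features + read


rng = np.random.default_rng(0)
C, T, R, H, W = 32, 32, 512, 8, 8  # lowest-level sizes from the text; H, W arbitrary
row_tags = rng.normal(size=(R, T))
w_write_key = rng.normal(size=(T, C)) * 0.1
w_write_value = rng.normal(size=(C, C)) * 0.1
w_read_key = rng.normal(size=(T, C)) * 0.1

target_features = rng.normal(size=(C, H, W))
source_features = rng.normal(size=(C, H, W))

dictionary = write_dictionary(target_features, w_write_key, w_write_value, row_tags)
output = read_dictionary(source_features, w_read_key, row_tags, dictionary)
print(dictionary.shape, output.shape)
```

Because the read is residual, an empty (all-zero) dictionary leaves the translator features unchanged, which is one reason a residual formulation is convenient here.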
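A small worked calculation makes the dictionary sizes easier to compare across levels. It assumes the two tuples above are aligned element by element (the first row count pairs with the first row length, and so on); the variable names are illustrative only.

```python
# Values quoted in the text, assumed to be aligned level by level.
ROW_LENGTH_PER_LEVEL = (32, 64, 128, 256)
ROWS_PER_LEVEL = (512, 256, 128, 64)

value_entries = [
    rows * row_length
    for rows, row_length in zip(ROWS_PER_LEVEL, ROW_LENGTH_PER_LEVEL)
]

for rows, row_length, entries in zip(
    ROWS_PER_LEVEL, ROW_LENGTH_PER_LEVEL, value_entries
):
    print(f"{rows:4d} rows x {row_length:3d} features = {entries} entries")
```

Under this alignment every level stores the same number of values: rows halve while row length doubles, so the storage per dictionary stays constant.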
7 Training Strategy
Although our FD-GAN implementation operates on 256×256 images, the LD-Net part should in principle be independent of image size. For LD-Net, we normalize the landmark coordinates such that the square bounding box of all points spans the range [-1, 1]. For FD-GAN, pixel values are normalized to [-1, 1].
The training configuration is given in Table 5. Training time is given in number of iterations.
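The two normalizations can be sketched as follows. The sketch assumes that the square bounding box is centered on the tight bounding box of the landmarks and that input pixels are 8-bit values in [0, 255]; neither detail is stated in the text, and the function names are hypothetical.

```python
def normalize_landmarks(points):
    """Map 2D landmarks so their square bounding box spans [-1, 1]."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Assumption: the square box shares its center with the tight box.
    center_x = (min(xs) + max(xs)) / 2.0
    center_y = (min(ys) + max(ys)) / 2.0
    half_side = max(max(xs) - min(xs), max(ys) - min(ys)) / 2.0
    if half_side == 0:
        raise ValueError("landmarks must not all coincide")
    return [((x - center_x) / half_side, (y - center_y) / half_side)
            for x, y in points]


def normalize_pixel(value):
    """Map an 8-bit pixel value (assumed range [0, 255]) to [-1, 1]."""
    return value / 127.5 - 1.0


print(normalize_landmarks([(0, 0), (4, 2), (2, 1)]))
print(normalize_pixel(0), normalize_pixel(255))
```

Because the landmarks are expressed relative to their own bounding box, the same LD-Net input is obtained regardless of the resolution of the image the landmarks were detected in.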
|LD-Net Stage 1||Adam||32|
|LD-Net Stage 2||Adam||32|
8 Performance
For one target image and its corresponding landmarks, the identity code in LD-Net and the feature dictionary in FD-GAN can be reused for multiple source images. For each target, we measure the running time using all 100 frames of the corresponding test video. Landmark detection is performed separately in advance and is not included in the running time. On a single NVIDIA TITAN X GPU, FD-GAN takes approximately 0.08s to generate one image and LD-Net takes approximately 0.02s to disentangle the landmarks.
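The reported timings translate into a rough throughput estimate. The calculation below assumes that LD-Net and FD-GAN run one after the other for every frame and that no other per-frame cost applies (landmark detection is excluded, as in the text); the constant and function names are illustrative.

```python
LD_NET_SECONDS_PER_FRAME = 0.02  # from the text
FD_GAN_SECONDS_PER_FRAME = 0.08  # from the text


def seconds_per_frame():
    # Assumption: the two stages run sequentially, without overlap.
    return LD_NET_SECONDS_PER_FRAME + FD_GAN_SECONDS_PER_FRAME


def frames_per_second():
    return 1.0 / seconds_per_frame()


def video_seconds(num_frames=100):
    """Estimated time to reenact one test video of num_frames frames."""
    return num_frames * seconds_per_frame()


print(f"{seconds_per_frame():.2f} s per frame")
print(f"{frames_per_second():.1f} frames per second")
print(f"{video_seconds():.1f} s for a 100-frame video")
```

Under these assumptions the pipeline runs at roughly 10 frames per second once landmarks are available.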
9 Additional Qualitative Results
9.1 Ablation study
Comparison with and without LD-Net.
In Fig. 8 and Fig. 9, we show additional qualitative evaluations for self-reenactment and cross-subject face reenactment using the SelfTest and CrossTest datasets, respectively. In Fig. 8, we compare results generated using ground truth landmarks (from the source video) with results generated using landmarks predicted by LD-Net. As can be seen in the figure, both the landmarks predicted by our method and the images synthesized from them are close to the ground truth. For the cross-subject evaluation in Fig. 9, our results using LD-Net landmarks not only preserve identity better but also show more precise poses/expressions (e.g. in the first row).
[Figure 8 column labels: Source, Target, Ground Truth LMs, Result1, LD-Net LMs, Result2]
[Figure 9 column labels: Source, Target, Source LMs, Result1, LD-Net LMs, Result2]
Comparison of FD-GAN with baselines.
Fig. 10 shows additional qualitative comparisons for cross-subject face reenactment on the CrossTest dataset with three baselines: the advanced image-to-image translation network pix2pixHD and two variants of FD-GAN (AdaIN and FD-GAN-1). AdaIN (column 4) and FD-GAN-1 (column 5) use the full feature dictionary with AdaIN or a one-line feature dictionary, respectively.
9.2 Comparison with one-shot methods
Additional results for one-shot cross-subject face reenactment on the CrossTest dataset and the FaceForensics++ dataset are shown in Fig. 11 and Fig. 12, respectively, with comparisons between our method and X2Face (column 3), X2Face-aligned (column 4), and First-order-model (column 5). Note that in X2Face, the generated frames inherit the object proportions of the driving source video because absolute coordinates are transferred, which makes the method very sensitive to face alignment. In addition to testing X2Face with exactly the same input configuration as the other methods, we also feed it a smaller face region with a tighter bounding box to minimize the misalignment between source and target images; these results are reported as X2Face-aligned. We also compute the metrics for X2Face-aligned on the CrossTest dataset: ISIM=0.7855, PSIM=0.7207, ED=0.344 and FID=62.14. Although these results are better than those obtained with non-aligned face images, our results remain superior both quantitatively and qualitatively.
9.3 Comparison with 3D-based methods
Additional results for cross-subject face reenactment on the FaceForensics++ dataset are shown in Fig. 13, with comparisons to two state-of-the-art 3D-based face reenactment methods, Face2Face and NeuralTexture.
9.4 More Results
To demonstrate the capacity and generalization of our method, we test it on in-the-wild face images with diverse appearances and challenging poses/expressions, including 2D paintings, historical photographs, and celebrity portraits, as shown in Fig. 14. More examples are provided in the accompanying video.