Learned Spatial Representations for Few-shot Talking-Head Synthesis

04/29/2021 ∙ by Moustafa Meshry, et al.

We propose a novel approach for few-shot talking-head synthesis. While recent works in neural talking heads have produced promising results, they can still produce images that do not preserve the identity of the subject in source images. We posit this is a result of the entangled representation of each subject in a single latent code that models 3D shape information, identity cues, colors, lighting and even background details. In contrast, we propose to factorize the representation of a subject into its spatial and style components. Our method generates a target frame in two steps. First, it predicts a dense spatial layout for the target image. Second, an image generator utilizes the predicted layout for spatial denormalization and synthesizes the target frame. We experimentally show that this disentangled representation leads to a significant improvement over previous methods, both quantitatively and qualitatively.

1 Introduction

We study the task of learning personalized head avatars in a low-shot setting, also known as “neural talking heads”. Given one or a few images of a source subject, and a driving sequence of facial landmarks, possibly derived from a different subject, the goal is to synthesize a photo-realistic video of the source subject under the poses and expressions of the driving sequence. This task has a wide range of applications, including AR/VR, video conferencing, gaming, animated movie production, and video compression in telecommunication.

Traditional graphics-based approaches to this task rely on 3D face geometry and produce very high-quality synthesis. However, they tend to model only the face area, excluding the hair, and they learn subject-specific models that cannot generalize to new subjects. In contrast, recent 2D-based approaches [47, 33, 5, 46] learn a subject-agnostic model that can animate unseen subjects given as little as a single image. Furthermore, since these works learn an implicit model and do not require an explicit geometric representation, they can synthesize the full head, including the hair, mouth interior, and even wearable accessories like glasses and earrings. This remarkable generalization ability, however, comes at the cost of lower quality and poorer identity preservation when compared to their 3D-based subject-specific counterparts. Bridging the quality gap between 2D-based subject-agnostic and 3D-based subject-specific approaches remains an open problem.

Recent efforts in 2D-based approaches can be divided into two classes: warping-based and direct synthesis. As the name suggests, warping-based methods (e.g., [33]) learn to warp the input image, or a recovered canonical pose, based on the motion of the driving sequence. While these methods achieve high realism, especially for static and rigid parts of the image, they tend to work well only for a limited range of motion, head rotation and disocclusion.

Figure 1: Our framework disentangles spatial and style information for image synthesis. It predicts a latent spatial layout for the target image, which is used to produce per-pixel style modulation parameters for the final synthesis.
Figure 2: Overview of our training pipeline. The cross-entropy loss against the oracle segmentation is used while pre-training the layout predictor, and is then turned off during full-pipeline training.

On the other hand, direct synthesis approaches (e.g., [47, 5, 46]) encode the source subject into a compressed latent code, and a generator decodes the latent code to synthesize the target pose. These approaches learn a prior over the compressed latent space, and can generate realistic results for a wider range of poses and head motion. However, they exhibit a noticeable identity gap between their output and the source subject.

We posit that the identity gap is caused by the entangled representation of the source subject in a single latent code. This compressed 1D latent encodes multi-view shape information, identity cues, as well as color information, lighting and background details. In order to synthesize a target view from a latent code, the generator needs to devise a complex function to decode the uni-dimensional latent into its corresponding 2D spatial information. We argue this not only consumes a large portion of the network capacity, but also limits the amount of information that can be encoded in the latent code.

To address this problem, we propose a two-step framework that decomposes the synthesis of a talking head into its spatial and style components. Our framework animates a source subject in two steps. First, it predicts a novel spatial layout of the subject under the target pose and expression. Then, it synthesizes the target frame conditioned on the predicted layout. This factorized representation yields the following key performance advantages.

Better subject-agnostic model performance.

Our subject-agnostic (also called meta-learned) model not only performs better than the previous subject-agnostic state of the art, but is also on par with the subject-finetuned performance of previous works when only a few source images are available (e.g., fewer than 10 images).

Better fine-tuned performance with less data.

Fine-tuning our model for a specific subject requires significantly less data and fewer iterations than previous works, and yet achieves better performance. For example, we show that fine-tuning our model using 4-shot inputs outperforms previous state-of-the-art models fine-tuned using 32-shot inputs.

Robustness to pose variations.

We show that our model is more robust against a wider range of poses and facial expressions, while still producing both realistic and identity-preserving results.

Improved identity preservation.

Shape difference between the source and driving identities poses a challenge for identity preservation in reenacted results. The intermediate novel spatial representation learned by our model reduces the sensitivity towards such differences and better preserves the identity.

In summary, we make the following contributions:

  • A novel approach that disentangles the spatial and style components for talking-head synthesis.

  • A novel latent spatial representation that proves effective for few-shot novel view synthesis.

  • State-of-the-art performance in both the single-shot and multi-shot settings, as well as in both the meta-learned and subject-finetuned modes.

2 Related work

Existing approaches for realistic talking-head synthesis can be categorized into 3D-based and 2D-based.

3D-based methods.

Such methods [36, 38, 35] utilize 3D geometric representations as a proxy to animate a target subject. Common geometric representations, such as 3D morphable models (3DMM) [2], only model the face area, and do not include challenging regions like the hair, eyes and mouth interior. Obtaining a detailed geometry of these regions is an expensive and challenging task. Therefore, such methods either cannot synthesize those regions or perform poorly on them. Recent works [20, 37, 9] combine the traditional graphics pipeline with machine learning to better model the eye movement and mouth interior, or to learn a better appearance model. However, they learn subject-specific models that do not generalize to new subjects. Other works [27, 8] take first steps towards generalizing to multiple subjects, but they do not perform well on hair and other regions outside the face.

2D-based methods.

These methods [6, 28, 31, 11, 12, 46, 5, 33, 41, 47] learn an implicit model of the head and do not require a proxy geometry. Therefore, they can synthesize the full head, including dynamic regions like the hair, eyes, and mouth interior. They can also model different wearable accessories such as hats, glasses, and earrings. Early works build on top of CycleGAN [50] and learn subject-specific models [1, 45]. More recent works [47, 41, 33, 5, 12, 46] learn subject-agnostic models that can animate unseen subjects given only a single image or a few shots. However, these methods fall short in quality and identity preservation compared to the 3D-based subject-specific models. To bridge this performance gap, hybrid models [47, 5, 46] utilize a meta-learning phase that trains a subject-agnostic model on a large corpus of data, followed by an optional subject-specific fine-tuning phase that improves realism and restores the source identity. In this work, we improve the meta-learned performance to achieve state-of-the-art results without any subject-specific fine-tuning. While our model can still benefit from the optional fine-tuning phase to further refine the results, it requires significantly fewer data samples compared to previous works.

On another axis, 2D-based approaches can be categorized by their synthesis technique into warping-based (e.g., [44, 33, 12, 42]) and direct synthesis (e.g., [47, 5, 46]). Warping-based approaches warp an input image [33, 12] or a recovered canonical pose [44] to synthesize novel poses. Warping results, however, tend to break when the target pose is far from that of the source image. Direct synthesis approaches utilize advances in Generative Adversarial Networks (GANs) [10] and Image-to-Image (I2I) translation [16] to generate novel poses. Compared to warping-based approaches, direct synthesis methods can realistically handle a wider range of poses and expressions.

Multi-modal Image-to-Image (I2I) translation.

Several multi-modal I2I translation works feed a style latent code, either directly to the generator [51] or through adaptive instance normalization (AdaIN)  [14, 15]. Recent state-of-the-art architectures [29, 23, 52] showed a significant improvement over traditional UNet [32] and encoder-decoder architectures, by generating per-pixel spatial denormalization (SPADE) parameters [29]. However, such architectures depend on the existence of accurate semantic segmentations or other dense spatial representations of the target image, hence limiting their usage in tasks where such dense representations do not exist. In this work, we learn to predict a latent dense layout to provide the spatial input to SPADE.

3 Method

Figure 3: Layout pre-training predicts meaningful segmentation maps despite the noisy oracle segmentations. Our latent spatial representation encodes more information than traditional segmentations.

Our approach factorizes the representation of a head avatar into spatial and style components. It breaks down novel-view head synthesis of a subject into two steps. First, a layout prediction network translates facial landmarks for a target view into a dense spatial layout of the subject. Then, an image generator synthesizes the final image conditioned on the predicted layout. We first give an overview of our pipeline in Section 3.1. Then, we explain how to pre-train the layout prediction network to predict semantic segmentations of novel views in Section 3.2, followed by the full pipeline training in Section 3.3. Section 3.4 explains how the layout prediction network transitions from predicting semantic maps to learning a more powerful latent spatial representation. Finally, we discuss how to learn a personalized head avatar through an optional subject-specific fine-tuning stage in Section 3.5.

3.1 Overview

Given $K$-shot inputs $\{x_i\}_{i=1}^{K}$ of a source subject, a two-headed encoder $E$ processes the inputs and generates layout latents $z^{\ell}_i$ and style latents $z^{s}_i$ for $i = 1, \dots, K$. The latents are then averaged to get an aggregated layout latent $\bar{z}^{\ell}$ and style latent $\bar{z}^{s}$. Averaging the latents cancels out view-specific information and transient occluders, and maintains implicit 3D information like the head and hair shape in the layout latent, and color and lighting information in the style latent. We have two generators: a layout predictor network $P$ and an image generator $G$. The layout predictor takes as input the facial landmarks $l_t$ for a target view and the layout latent $\bar{z}^{\ell}$, and generates a spatial layout $\hat{m}_t$, such as a semantic map, for the target view. The image generator processes the style latent $\bar{z}^{s}$ and utilizes spatial denormalization layers (SPADE [29]), conditioned on the predicted layout $\hat{m}_t$, to synthesize the final image $\hat{x}_t$. An overview of our framework is shown in Figure 2.
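
The following is a minimal PyTorch-style sketch of this two-step forward pass. It only illustrates the data flow of Figure 2; the module interfaces (encoder, layout_predictor, image_generator), tensor shapes, and the concatenation of landmarks with the input images are our own assumptions for illustration, not the exact implementation.

```python
import torch

def synthesize_target(encoder, layout_predictor, image_generator,
                      source_images, source_landmarks, target_landmarks):
    """Few-shot forward pass: encode the K sources, average the latents,
    predict a dense layout for the target pose, then synthesize the frame."""
    # Two-headed encoder: one layout latent and one style latent per input shot.
    inputs = torch.cat([source_images, source_landmarks], dim=1)   # (K, 2C, H, W)
    layout_latents, style_latents = encoder(inputs)                # (K, D), (K, D)

    # Averaging cancels view-specific details and keeps shared shape/appearance cues.
    layout_code = layout_latents.mean(dim=0, keepdim=True)         # (1, D)
    style_code = style_latents.mean(dim=0, keepdim=True)           # (1, D)

    # Step 1: predict a dense spatial layout for the target pose.
    layout = layout_predictor(target_landmarks, layout_code)       # (1, L, H, W)

    # Step 2: SPADE-based generator conditioned on the layout and the style code.
    return image_generator(layout, style_code)                     # (1, 3, H, W)
```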

3.2 Layout prediction pre-training

Training the above pipeline end-to-end without any supervision or constraints on the predicted layouts results in a degenerate solution, where the spatial layouts and their corresponding spatial denormalization are completely ignored. All spatial and style information is then encoded into and decoded from the style latent $\bar{z}^{s}$, which results in poor performance. Therefore, we opted to pre-train the layout prediction network $P$ to predict a plausible semantic segmentation of a target view, given the input facial landmarks $l_t$ and the layout latent $\bar{z}^{\ell}$. To supervise this training, we use an off-the-shelf face segmentation network [22] as an oracle to segment the target image $x_t$ into a semantic map $m_t$, and we apply a cross-entropy loss (X-ent) between the oracle segmentation $m_t$ and our predicted segmentation $\hat{m}_t$. We observe that the obtained oracle segmentations are very noisy and of poor quality (e.g., Figure 3). This is caused by the domain gap, in terms of image resolution and the distribution of head poses, between the datasets used to train the oracle segmentation network [22] and in-the-wild videos of talking heads. Thus, to regularize the segmentation prediction training, we use a multi-task pre-training strategy where the layout prediction network predicts an extra RGB reconstruction $\hat{r}_t$ of the target image, which is used as a secondary supervisory signal. Specifically, we have

$(\hat{m}_t, \hat{r}_t) = P(l_t, \bar{z}^{\ell})$   (1)

and the objective for the pre-training is

$\mathcal{L}_{pre} = \mathcal{L}_{\text{X-ent}}(m_t, \hat{m}_t) + \lambda_{r} \, \mathcal{L}_{r}(x_t, \hat{r}_t)$   (2)

where $\mathcal{L}_{r}$ is a perceptual reconstruction loss, and $\lambda_{r}$ is a relative weighting term which is set to a low value.
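
Assuming a layout predictor that returns both segmentation logits and the auxiliary RGB image, and a generic perceptual-loss module, a minimal sketch of this pre-training objective (Eqn. 2) could look as follows; the function names and the weight value are illustrative only.

```python
import torch
import torch.nn.functional as F

def layout_pretrain_loss(layout_predictor, perceptual_loss,
                         target_landmarks, layout_code,
                         oracle_segmentation, target_image, lambda_rec=0.1):
    """Cross-entropy against the (noisy) oracle segmentation plus a lightly
    weighted auxiliary RGB reconstruction term."""
    seg_logits, rgb_pred = layout_predictor(target_landmarks, layout_code)
    # oracle_segmentation: (B, H, W) integer class map from the face parser.
    xent = F.cross_entropy(seg_logits, oracle_segmentation)
    rec = perceptual_loss(rgb_pred, target_image)
    return xent + lambda_rec * rec
```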

3.3 Full pipeline training

Once the layout predictor $P$ has been pre-trained to predict semantic segmentations, we plug it into our full pipeline. The predicted segmentation $\hat{m}_t$ is fed as the spatial input to a SPADE image generator $G$ that synthesizes the final image as

$\hat{x}_t = G(\hat{m}_t, \bar{z}^{s})$   (3)

We observe that the SPADE generator quickly utilizes the input spatial segmentations to resolve spatial ambiguities, and we no longer fall into a degenerate solution where the spatial input is ignored.

Our full pipeline, comprising the two-headed encoder $E$, the layout predictor $P$ and the image generator $G$, is optimized to minimize three losses: a reconstruction loss $\mathcal{L}_{rec}$, an adversarial loss $\mathcal{L}_{adv}$, and a latent regularization loss $\mathcal{L}_{z}$.

For the reconstruction loss $\mathcal{L}_{rec}$, we employ a perceptual loss [17] based on both the VGG19 [34] and VGGFace [30] networks, as well as a pixel-wise loss. While the VGG19-based perceptual loss is a standard reconstruction loss, we follow Zakharov et al. [47] and utilize a VGGFace-based perceptual loss to promote identity preservation. The pixel-wise loss helps the synthesized image better match the colors of the ground truth.

The adversarial loss $\mathcal{L}_{adv}$ encourages the output to be photo-realistic. To achieve this, a discriminator network $D$ is trained to discriminate between real and fake images, while the generator aims to fool the discriminator by bringing its output closer to the manifold of real images. We borrow the architecture of the discriminator network from [19] and use a non-saturating logistic loss with gradient penalty [25]. Finally, we impose a regularization term $\mathcal{L}_{z}$ on the learned latent codes to encourage compactness of the latent space. The full training objective is given by

$\mathcal{L} = \lambda_{rec} \, \mathcal{L}_{rec} + \lambda_{adv} \, \mathcal{L}_{adv} + \lambda_{z} \, \mathcal{L}_{z}$   (4)

where $\lambda_{rec}$, $\lambda_{adv}$ and $\lambda_{z}$ determine the relative weights between the loss terms.
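
A sketch of the generator-side objective in Eqn. 4 is shown below, assuming precomputed discriminator logits for the generated image and generic perceptual-loss modules for VGG19 and VGGFace. The loss weights, the use of an L1 pixel term, and the function names are our own placeholders, not the paper's exact settings; the gradient-penalty term applies to the discriminator update and is omitted here.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, fake_logits, perceptual_vgg, perceptual_vggface,
                   layout_code, style_code,
                   w_rec=1.0, w_adv=1.0, w_reg=0.001):
    """Reconstruction + adversarial + latent regularization (generator side)."""
    # Reconstruction: VGG19 and VGGFace perceptual terms plus a pixel-wise term.
    l_rec = (perceptual_vgg(fake, real) + perceptual_vggface(fake, real)
             + F.l1_loss(fake, real))
    # Non-saturating logistic loss on the discriminator's logits for the fake image.
    l_adv = F.softplus(-fake_logits).mean()
    # Encourage compact latent codes.
    l_reg = layout_code.pow(2).mean() + style_code.pow(2).mean()
    return w_rec * l_rec + w_adv * l_adv + w_reg * l_reg
```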

3.4 Learning a latent spatial representation

Spatial denormalization (SPADE) generates per-pixel denormalization parameters by feeding a dense spatial input through a small convolutional subnetwork. While SPADE [29] originally uses semantic maps as input, we explore learning a latent spatial representation that better suits the image synthesis task at hand. To do this, we turn off the cross-entropy loss so as to give the layout predictor the freedom to diverge from predicting traditional semantic segmentations and learn other latent representations that better optimize the few-shot novel view synthesis objective. The layout predictor is thus supervised only by the training objective of Eqn. 4. Figure 3 shows examples of the learned latent layouts. Although they might look less interpretable than traditional semantic maps, they seem to encode more information and capture accurate details.
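
For reference, a minimal SPADE-style block is sketched below: the dense layout (here, our learned latent layout) is passed through a small convolutional subnetwork that produces per-pixel scale and shift maps used to modulate the normalized features. This follows the general recipe of [29]; the layer widths and the use of InstanceNorm are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDenorm(nn.Module):
    """SPADE-style block: per-pixel modulation parameters from a dense layout map."""
    def __init__(self, num_features: int, layout_channels: int, hidden: int = 64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(layout_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        # Resize the layout to the current feature resolution.
        layout = F.interpolate(layout, size=x.shape[-2:], mode="nearest")
        h = self.shared(layout)
        gamma, beta = self.to_gamma(h), self.to_beta(h)   # (B, C, H, W) each
        return (1 + gamma) * self.norm(x) + beta
```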

Figure 4:

Qualitative comparison in the single-shot setting. We show three sets of examples representing low, medium and high variance between the source and target poses. Our method is more robust to pose variations than the baselines.

3.5 Subject fine-tuning

Training our full pipeline yields a powerful subject-agnostic model that produces high-quality and identity-preserving synthesis. Optionally, we can learn a personalized head avatar to further refine the results for a given subject. To do this, we follow [47, 5, 46] and fine-tune the subject-agnostic model (also called the meta-learned model) using the few-shot inputs of the source identity. Specifically, we compute the layout and style embeddings $\bar{z}^{\ell}$ and $\bar{z}^{s}$, and fine-tune the weights of the layout predictor $P$ and the image generator $G$, as well as the discriminator $D$, by reconstructing the set of few-shot inputs and optimizing the same training objective of Eqn. 4. We observe that subject fine-tuning restores high-frequency components and improves background reconstruction when compared to the meta-learned outputs.
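
A rough sketch of this fine-tuning stage is shown below. The latent codes are computed once from the K-shot inputs and kept fixed, while the generators and the discriminator are updated on those same inputs. The step count, learning rate, loss callables, and the (image, landmarks) data format are assumptions for illustration.

```python
import torch

def finetune_subject(layout_predictor, image_generator, discriminator,
                     few_shots, layout_code, style_code,
                     gen_loss_fn, disc_loss_fn, steps=200, lr=1e-4):
    """Subject-specific fine-tuning by reconstructing the K-shot inputs (Eqn. 4)."""
    g_params = (list(layout_predictor.parameters())
                + list(image_generator.parameters()))
    opt_g = torch.optim.Adam(g_params, lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)

    for _ in range(steps):
        for real_image, landmarks in few_shots:
            # Generator update: reconstruct the input frame.
            layout = layout_predictor(landmarks, layout_code)
            fake = image_generator(layout, style_code)
            g_loss = gen_loss_fn(fake, real_image, discriminator)
            opt_g.zero_grad()
            g_loss.backward()
            opt_g.step()

            # Discriminator update (non-saturating loss with gradient penalty).
            d_loss = disc_loss_fn(discriminator, real_image, fake.detach())
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()
```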

4 Experimental evaluation

Method PSNR SSIM LPIPS ID-SIM NMKE FID
X2Face [44] 15.50 0.466 0.346 0.691 0.333 98.58
Bi-layer [46] – – – 0.721 0.236 130.58
FSTH [47] 16.92 0.597 0.263 0.836 0.049 53.07
LPD [5] – – – 0.837 0.070 48.48
FOMM [33] 18.20 0.635 0.236 0.869 0.061 56.10
Ours 17.37 0.605 0.232 0.886 0.041 45.69
Table 1: Quantitative comparison in the single-shot setting. Dashes mark reconstruction metrics that are not computed for baselines whose outputs do not align with the ground truth (see Baselines).
Implementation details.

Please refer to the supplementary material for the network architectures, hyper-parameters and training details. Our code will be publicly released.

Dataset.

We perform our evaluation using the VoxCeleb2 [7] dataset, a large-scale in-the-wild video dataset. The train set contains over a million clips from 145,569 videos of 5,994 different identities. The test set contains new identities that are not part of the training set. We use the test subset released by Zakharov et al. [47], which contains a total of 1,600 frames from videos of 50 subjects. For self-reenactment scenarios, the input few-shots and the driving sequence do not overlap. We obtain the facial landmarks for the sampled frames using an off-the-shelf facial landmarks detector [4].

Baselines.

We compare our method to the following baselines: X2Face [44], FSTH [47], FOMM [33], Latent Pose Descriptor (LPD) [5], and Bi-layer [46]. We use the released pre-trained models provided by the authors for all baselines, except for FSTH [47] where we use the authors’ provided outputs, as their code and models were not released. Since some baselines only accept single-shot inputs (e.g., FOMM and Bi-layer), we divide our evaluation into a single-shot setting, where we compare to all the baselines, and a multi-shot setting, where we only compare against the few-shot baselines. Since the LPD [5] and Bi-layer [46] baselines do not predict the background and re-crop the input/output frames, we subtract the background and compare with their corresponding cropped ground truths for quantitative analysis. We also exclude those two baselines from frame reconstruction evaluation since their output does not align with the rest of the methods.

Metrics.

We evaluate all models along five axes.

  • Reconstruction fidelity using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [43] metrics.

  • Perceptual similarity between the output and the ground truth using the AlexNet-based LPIPS metric [48].

  • Identity preservation (ID-SIM) using the cosine similarity between face embeddings from a face recognition network [30] (see the sketch after this list).

  • Normalized Mean Keypoint Error (NMKE), which measures the pose error between the synthesized and ground truth images as computed in [5, 46].

  • Perceptual quality of the output using the Fréchet Inception Distance (FID) metric [13].
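
As an illustration, the ID-SIM and NMKE metrics can be computed roughly as sketched below, assuming a pretrained face-embedding network and a facial-landmark detector wrapped behind the (hypothetical) interfaces shown. The NMKE normalization used here (keypoint bounding-box diagonal) is one common convention and may differ from the exact definition in [5, 46].

```python
import torch
import torch.nn.functional as F

def id_sim(face_embedder, generated, ground_truth):
    """ID-SIM: cosine similarity between face-recognition embeddings."""
    e_gen = face_embedder(generated)       # (B, D) embeddings
    e_gt = face_embedder(ground_truth)
    return F.cosine_similarity(e_gen, e_gt, dim=1).mean()

def nmke(landmark_detector, generated, ground_truth):
    """Normalized mean keypoint error between detected facial landmarks."""
    k_gen = landmark_detector(generated)   # (B, 68, 2) keypoints
    k_gt = landmark_detector(ground_truth)
    # Normalize by the diagonal of the ground-truth keypoint bounding box.
    span = (k_gt.amax(dim=1) - k_gt.amin(dim=1)).norm(dim=-1, keepdim=True)  # (B, 1)
    err = (k_gen - k_gt).norm(dim=-1) / span                                 # (B, 68)
    return err.mean()
```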

4.1 Single-shot comparative evaluation

Table 1 shows a quantitative comparison with the baselines in the single-shot setting. Our method outperforms all baselines in perceptual reconstruction (LPIPS), identity preservation (ID-SIM), pose matching (NMKE) and visual quality (FID). However, FOMM scores better in the standard reconstruction metrics (PSNR and SSIM). We argue this is intrinsic to their method due to its warping-based nature, which accurately captures the background and other static regions, and thus gives low reconstruction error even in the presence of clear artifacts. Furthermore, while FOMM cannot utilize more input frames to its advantage, our method’s performance improves with multi-shot inputs to significantly surpass FOMM in all metrics (see supp. material for the quantitative numbers).

Figure 4 shows qualitative results from three groups representing low, medium and high variance between the input and target poses. We observe that all methods perform well when the target pose is similar to that of the input shot. LPD produces sharp results within the low-medium pose variation, but shows blurry artifacts within the face and eyes in the case of high pose variance. FSTH shows a clear identity gap. FOMM accurately matches the background and shows highly realistic results when the pose variance is low, but shows a clear identity gap and visible artifacts when the target pose is far from the source image. Our method is more robust against pose variation, yielding realistic results while preserving the source identity.

Figure 5: A qualitative comparison showing the effect of increasing the K-shot inputs and applying subject fine-tuning.
Figure 6: Quantitative comparison with the few-shot baselines, showing the effect of both increasing the K-shot inputs and subject-specific fine-tuning. Dotted and solid lines represent the meta-learned and fine-tuned models respectively.
Figure 7: Cross-subject reenactment with different driving identities. Results are shown for our meta-learned model without any fine-tuning, and using 32-shot inputs.
Figure 8: Examples from the ablation study. Results shown are for the meta-learned models with a single-shot input (source).

4.2 Multi-shot comparative evaluation

Here, we focus on the effect of increasing the number of K-shot inputs, and the effect of subject-specific fine-tuning using those K-shot inputs. Figure 6 plots the ID-SIM, NMKE and FID metrics as we increase the number of shots K. We observe that the pose reconstruction performance (NMKE) is mainly dictated by the approach itself, rather than by the number of shots or by whether the models are fine-tuned. For example, the meta-learned performance of FSTH is better than that of the fine-tuned LPD model. Similarly, the single-shot meta-learned performance of our method is on par with or better than that of the fine-tuned baselines.

For the ID-SIM and FID metrics, the meta-learned performance of our model is not only superior to that of the baselines, but is also on par with the fine-tuned baselines at small K (e.g., fewer than 10 input shots). However, as K is increased to 32, the fine-tuned baselines eventually outperform our meta-learned model. Another very important advantage of our approach is that it achieves better performance with significantly less data. For example, fine-tuning our model with just K=4 outperforms the fine-tuned baselines at K=32. Since fine-tuning on more data requires more training iterations and thus more time, our method spends much less time fine-tuning on fewer data samples, and yet achieves similar or better results. We observe similar behavior for the other metrics (PSNR, SSIM and LPIPS). Please refer to the supp. material for the full results.

Figure 5 visualizes the effect of both increasing K and applying subject fine-tuning. Our method preserves the source identity without any fine-tuning, even with a single-shot input. On the other hand, the baselines only restore the source identity after subject-specific fine-tuning. Our method also shows the largest improvement, in terms of realism and identity match, when increasing the number of K-shot inputs. For example, our method successfully filters out the subject's hand occluding the face in the single-shot input.

4.3 Cross-subject reenactment

Cross-subject reenactment poses a challenge, especially for landmark-driven approaches. The shape difference between the facial landmarks of the source and driver identities can lead to a noticeable identity gap in the reenacted results. The intermediate spatial representation learned by our method helps reduce this problem and leads to good identity preservation of the source subject regardless of the driver identity. Figure 7 shows sample reenactment results using different driver identities. To demonstrate the effectiveness of our disentangled representation, we avoid any subject fine-tuning and show the results of our meta-learned model with 32-shot inputs. The source identity is well preserved across challenging facial expressions and different views covering both the left and right sides of the face.

Method PSNR SSIM LPIPS ID-SIM NMKE FID
Baseline 17.00 0.574 0.274 0.837 0.044 67.19
+ SPADE 16.94 0.575 0.268 0.834 0.043 56.00
+ learned seg. maps 16.94 0.578 0.265 0.828 0.042 62.78
+ latent layout (ours) 17.22 0.592 0.247 0.860 0.042 54.40
Upper bound 18.21 0.629 0.219 0.867 0.039 48.06
Table 2: Ablation study of our approach. +SPADE replaces the UNet generator with SPADE. +Learned seg. maps conditions the generator on learned segmentations. +Latent layout learns a latent spatial representation. The upper bound gets to cheat and uses the ground truth segmentations.

4.4 Ablation study

We evaluate the contribution of the different components of our proposed approach. All ablation experiments are trained with the same hyper-parameters and for the same number of epochs, and are evaluated in the single-shot setting with no fine-tuning. We report the results in Table 2. The baseline model has the same setup as FSTH [47], where a UNet generator with AdaIN layers [14] translates the input landmarks into the target image. Next, we replaced the UNet architecture with a SPADE generator [29] conditioned on the facial landmarks (+SPADE). This improved the FID, but the other metrics remained around the same. We hypothesize this is due to using sparse landmarks as the spatial input, whereas SPADE needs dense spatial inputs to generate its per-pixel denormalization parameters. To validate our hypothesis, we conducted an upper-bound experiment, where we get to cheat and segment the ground truth target image using an off-the-shelf face segmentation network [22] (the oracle), and use these oracle segmentations as the spatial input to SPADE. Even though the oracle segmentations are noisy (e.g., Figure 3), this still resulted in a significant boost in all metrics, showing that the SPADE generator can benefit from dense spatial inputs. We therefore trained a layout prediction network to predict a plausible semantic segmentation for the target pose (+Learned seg. maps). This surprisingly produced mixed results and even caused a drop in the ID-SIM and FID scores. We posit this is because the noisy oracle segmentations do not provide a consistent supervisory signal, which causes the learned segmentations to miss important shape cues (e.g., the correct face shape) and to overfit to common errors in the oracle segmentations as training progresses. Finally, removing the supervision on the predicted layouts and learning a latent spatial representation (+Latent layout) resulted in a consistent performance improvement across all metrics. We also show a qualitative comparison for the ablation study in Figure 8. The qualitative results of the upper-bound experiment (using the oracle segmentations of the ground truth) exhibit artifacts caused by errors in the oracle segmentation, whereas the results of our method with the learned latent layouts look qualitatively better, with no clear artifacts, despite having worse quantitative metrics than the upper bound.

5 Conclusion

We proposed a novel approach for talking-head synthesis. Our model learns a novel latent spatial representation that proves effective for our task. We improve the performance of both subject-agnostic and subject-finetuned models, while requiring significantly fewer data samples. The learned latent spatial representation provides robustness against a wide range of poses and expressions, and results in better identity preservation, especially in cross-subject reenactment scenarios.

Acknowledgements. This project was partially funded by the DARPA SemaFor (HR001119S0085) and DARPA SAIL-ON (W911NF2020009) programs.

References

  • [1] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh (2018) Recycle-gan: unsupervised video retargeting. In Eur. Conf. Comput. Vis., pp. 119–135. Cited by: §2.
  • [2] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, pp. 187–194. Cited by: §2.
  • [3] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. Int. Conf. Learn. Represent.. Cited by: §A.1, §A.1.
  • [4] A. Bulat and G. Tzimiropoulos (2017) How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In Int. Conf. Comput. Vis., pp. 1021–1030. Cited by: §A.1, §4.
  • [5] E. Burkov, I. Pasechnik, A. Grigorev, and V. Lempitsky (2020) Neural head reenactment with latent pose descriptors. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 13786–13795. Cited by: §A.4.2, §A.7, §1, §1, §2, §2, §3.5, 4th item, §4, Table 1.
  • [6] L. Chen, R. K. Maddox, Z. Duan, and C. Xu (2019) Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7832–7841. Cited by: §2.
  • [7] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH, Cited by: §A.1, §A.7, §4.
  • [8] O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala (2019) Text-based editing of talking-head video. ACM Trans. Graph. 38 (4), pp. 1–14. Cited by: §2.
  • [9] G. Gafni, J. Thies, M. Zollhöfer, and M. Nießner (2020) Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. arXiv preprint arXiv:2012.03065. Cited by: §2.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Adv. Neural Inform. Process. Syst., pp. 2672–2680. Cited by: §2.
  • [11] K. Gu, Y. Zhou, and T. Huang (2020) FLNet: landmark driven fetching and learning network for faithful talking facial animation synthesis. In AAAI, Vol. 34, pp. 10861–10868. Cited by: §2.
  • [12] S. Ha, M. Kersner, B. Kim, S. Seo, and D. Kim (2020) Marionette: few-shot face reenactment preserving identity of unseen targets. In AAAI, Vol. 34, pp. 10893–10900. Cited by: §2, §2.
  • [13] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Adv. Neural Inform. Process. Syst., Cited by: 5th item.
  • [14] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Int. Conf. Comput. Vis., pp. 1501–1510. Cited by: §A.1, §2, §4.4.
  • [15] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Eur. Conf. Comput. Vis., Cited by: §2.
  • [16] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §2.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In Eur. Conf. Comput. Vis., pp. 694–711. Cited by: §3.3.
  • [18] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In Int. Conf. Learn. Represent., Cited by: §A.1.
  • [19] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 8110–8119. Cited by: §A.1, §3.3.
  • [20] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt (2018) Deep video portraits. In Proc. SIGGRAPH, Cited by: §2.
  • [21] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Int. Conf. Learn. Represent.. Cited by: §A.1.
  • [22] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) MaskGAN: towards diverse and interactive facial image manipulation. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §A.1, §3.2, §4.4.
  • [23] X. Liu, G. Yin, J. Shao, X. Wang, and H. Li (2019) Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In NeurIPS, Cited by: §2.
  • [24] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva (2018) Detection of gan-generated fake images over social networks. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 384–389. Cited by: §A.8.
  • [25] L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. In International conference on machine learning, pp. 3481–3490. Cited by: §A.1, §3.3.
  • [26] Y. Mirsky and W. Lee (2021) The creation and detection of deepfakes: a survey. ACM Computing Surveys (CSUR) 54 (1), pp. 1–41. Cited by: §A.8.
  • [27] K. Nagano, J. Seo, J. Xing, L. Wei, Z. Li, S. Saito, A. Agarwal, J. Fursund, and H. Li (2018) PaGAN: real-time avatars using dynamic textures. ACM Trans. Graph. 37 (6), pp. 1–12. Cited by: §2.
  • [28] Y. Nirkin, Y. Keller, and T. Hassner (2019) Fsgan: subject agnostic face swapping and reenactment. In Int. Conf. Comput. Vis., pp. 7184–7193. Cited by: §2.
  • [29] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §A.1, §2, §3.1, §3.4, §4.4.
  • [30] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. Brit. Mach. Vis. Conf.. Cited by: §3.3, 3rd item.
  • [31] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer (2018) Ganimation: anatomically-aware facial animation from a single image. In Eur. Conf. Comput. Vis., pp. 818–833. Cited by: §2.
  • [32] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §A.1, §2.
  • [33] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe (2019-12) First order motion model for image animation. In NeurIPS, Cited by: §A.2, §A.4.1, Table 3, §1, §1, §2, §2, §4, Table 1.
  • [34] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. Int. Conf. Learn. Represent.. Cited by: §3.3.
  • [35] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017) Synthesizing obama: learning lip sync from audio. ACM Trans. Graph. 36 (4), pp. 1–13. Cited by: §2.
  • [36] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt (2015) Real-time expression transfer for facial reenactment.. ACM Trans. Graph. 34 (6), pp. 183–1. Cited by: §2.
  • [37] J. Thies, M. Zollhöfer, and M. Nießner (2019) Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph.. Cited by: §2.
  • [38] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2387–2395. Cited by: §2.
  • [39] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, and J. Ortega-Garcia (2020) Deepfakes and beyond: a survey of face manipulation and fake detection. Information Fusion 64, pp. 131–148. Cited by: §A.8.
  • [40] S. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros (2020) CNN-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8695–8704. Cited by: §A.8.
  • [41] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In NeurIPS, Cited by: §A.7, §2.
  • [42] T. Wang, A. Mallya, and M. Liu (2020) One-shot free-view neural talking-head synthesis for video conferencing. arXiv preprint arXiv:2011.15126. Cited by: §2.
  • [43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: 1st item.
  • [44] O. Wiles, A. Koepke, and A. Zisserman (2018) X2face: a network for controlling face generation using images, audio, and pose codes. In Eur. Conf. Comput. Vis., pp. 670–686. Cited by: §2, §4, Table 1.
  • [45] W. Wu, Y. Zhang, C. Li, C. Qian, and C. C. Loy (2018) Reenactgan: learning to reenact faces via boundary transfer. In Eur. Conf. Comput. Vis., pp. 603–619. Cited by: §2.
  • [46] E. Zakharov, A. Ivakhnenko, A. Shysheya, and V. Lempitsky (2020-08) Fast bi-layer neural synthesis of one-shot realistic head avatars. In Eur. Conf. Comput. Vis., Cited by: §A.7, §1, §1, §2, §2, §3.5, 4th item, §4, Table 1.
  • [47] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky (2019) Few-shot adversarial learning of realistic neural talking head models. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 9459–9468. Cited by: §A.1, §A.1, §A.3, §A.4.2, §A.7, §1, §1, §2, §2, §3.3, §3.5, §4, §4, §4.4, Table 1.
  • [48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 586–595. Cited by: 2nd item.
  • [49] X. Zhang, S. Karaman, and S. Chang (2019) Detecting and simulating artifacts in gan fake images. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6. Cited by: §A.8.
  • [50] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Int. Conf. Comput. Vis., pp. 2223–2232. Cited by: §2.
  • [51] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Adv. Neural Inform. Process. Syst., Cited by: §2.
  • [52] P. Zhu, R. Abdal, Y. Qin, and P. Wonka (2020) Sean: image synthesis with semantic region-adaptive normalization. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 5104–5113. Cited by: §2.

Appendix A Appendix

a.1 Implementation details

Dataset pre-processing.

The released VoxCeleb2 dataset [7] provides videos that are pre-processed to a center crop around the face. We uniformly sample 10 frames from each video and obtain the facial landmarks using an off-the-shelf facial landmarks detector [4]. Once the landmarks are obtained, we use the same procedure as [47] to connect the facial landmarks into contours for the different face parts (e.g., eyes, nose, lips, etc.). We observe that the facial landmark extraction fails for a small fraction of videos, which we opted to ignore. We also segment each frame using the face parsing tool provided by [22] to obtain the oracle segmentation maps for pre-training the layout prediction network. The face parsing network performs poorly on VoxCeleb2 frames due to the domain gap, in terms of image resolution and the distribution of head poses, between the datasets used to train the face parsing network [22] and the cropped VoxCeleb2 videos. We observe that the face segmentation network [22] captures different details better at different resolutions. For example, segmenting at the original VoxCeleb2 resolution better captures larger regions like the hair, neck and clothes. On the other hand, upsampling the frame to the resolution used for training the segmentation network [22] gives better segmentation results for the finer and smaller regions like the nose, eyes, mouth, and ears. So, to improve the oracle segmentations, we segment each frame twice, at 256x256 and 512x512 resolutions, and merge the coarse and fine semantic classes from the two results, as sketched below.
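
A sketch of this two-resolution merging heuristic is shown below. The face-parser wrapper parse_fn and the split of class ids into coarse and fine sets are hypothetical; the actual label set depends on the face parsing tool [22].

```python
import torch
import torch.nn.functional as F

# Hypothetical ids for the fine facial classes (e.g., eyes, brows, nose, lips, ears).
FINE_CLASSES = {2, 3, 4, 5, 6, 7}

def merge_segmentations(parse_fn, frame_256):
    """Segment at the native 256x256 resolution for coarse regions (hair, neck,
    clothes) and at an upsampled 512x512 resolution for fine facial regions,
    then combine the two label maps."""
    coarse = parse_fn(frame_256)                               # (256, 256) int labels
    frame_512 = F.interpolate(frame_256.unsqueeze(0), size=(512, 512),
                              mode="bilinear", align_corners=False)
    fine = parse_fn(frame_512.squeeze(0))                      # (512, 512) int labels
    # Bring the fine prediction back to 256x256 (nearest keeps labels discrete).
    fine = F.interpolate(fine[None, None].float(), size=coarse.shape,
                         mode="nearest").long()[0, 0]
    merged = coarse.clone()
    for c in FINE_CLASSES:
        merged[fine == c] = c          # overwrite fine regions from the 512 pass
    return merged
```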

Encoder networks.

We use a ResNet encoder for both the layout and style branches of the two-headed encoder. The encoder architecture has 5 downsampling blocks, followed by a fully connected layer that generates a 512-dimensional latent code. The architecture of the residual blocks is borrowed from [3], with average pooling replaced by blur pooling. We use 32 feature maps at the first encoder layer and double this number after each downsampling block, up to a maximum of 512 feature maps. We follow [47] and concatenate the facial landmarks to the few-shot RGB images before feeding them to the encoder. A simplified sketch of this encoder is given below.
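
The sketch below shows one possible shape of such an encoder: a shared convolutional backbone with 5 downsampling residual blocks (widths 32 to 512) and two fully connected heads producing the 512-d layout and style latents. It simplifies several details, e.g., the residual block design of [3] and blur-pooling (approximated here by average pooling); the 6 input channels assume RGB plus a 3-channel landmark rasterization, and whether the two branches share a backbone is an implementation detail.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Residual downsampling block (the paper uses blur-pool; avg-pool shown here)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.ReLU(), nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.ReLU(), nn.Conv2d(c_out, c_out, 3, padding=1))
        self.skip = nn.Conv2d(c_in, c_out, 1)
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        return self.pool(self.conv(x) + self.skip(x))

class TwoHeadedEncoder(nn.Module):
    """5 downsampling blocks (32 -> 512 feature maps), then two 512-d latent heads."""
    def __init__(self, in_channels=6, latent_dim=512):
        super().__init__()
        widths = [32, 64, 128, 256, 512, 512]
        self.stem = nn.Conv2d(in_channels, widths[0], 3, padding=1)
        self.blocks = nn.Sequential(*[DownBlock(widths[i], widths[i + 1])
                                      for i in range(5)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.layout_head = nn.Linear(widths[-1], latent_dim)
        self.style_head = nn.Linear(widths[-1], latent_dim)

    def forward(self, x):
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.layout_head(h), self.style_head(h)
```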

Layout generator.

We use a traditional UNet architecture [32] with residual blocks. The residual blocks are borrowed from [3], with BatchNorm replaced by InstanceNorm and with adaptive instance normalization (AdaIN) [14] applied. The smallest and largest numbers of feature maps are 32 and 512 respectively, and we use blur-pooling and bilinear upsampling in the downsampling and upsampling blocks respectively. A minimal AdaIN sketch is given below.
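
As a reference for the AdaIN modulation used in this generator, the following is a minimal sketch: a latent code is mapped to per-channel scale and shift parameters that modulate instance-normalized features. This is the generic formulation of [14]; the module and variable names are our own.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: per-channel scale/shift from a latent code."""
    def __init__(self, num_features: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        # Maps the latent code to a (scale, shift) pair per channel.
        self.affine = nn.Linear(style_dim, 2 * num_features)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.affine(style).chunk(2, dim=1)     # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)            # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

# Usage: modulate a 64-channel feature map with a 512-d latent code.
features = torch.randn(2, 64, 32, 32)
latent = torch.randn(2, 512)
out = AdaIN(64, 512)(features, latent)
```

In our pipeline, the SPADE blocks of the image generator play a similar role but produce per-pixel, rather than per-channel, modulation parameters.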

Image generator.

We use a SPADE generator architecture [29], with BatchNorm replaced by InstanceNorm. We use 32 feature maps at the last generator layer and 64 feature maps in each SPADE block, compared to 64 and 128 feature maps respectively in the original architecture [29]. The input to each SPADE block is the concatenation of the predicted layout map and the facial landmarks.

Discriminator.

We borrow the architecture of the discriminator network from [19], reducing the smallest number of feature maps from 64 to 32. We also use a non-saturating logistic loss with gradient penalty [25].

Figure 9: Identity similarity metric (ID-SIM) for the single-shot setting across three test subsets representing low, medium and high variance between the source and target poses. The performance gap widens in our favor as the pose variance increases.
Training.

We follow [18] and use an equalized learning rate in all of our networks (see the sketch below). We pre-train the layout prediction network for 2 epochs, followed by training the full pipeline for 8 epochs. Our best model was left to train for an extra 5 epochs, which mainly improves the FID score, while slightly improving the other quantitative metrics as well. We use the Adam optimizer [21] for all networks and linearly decay the learning rate during the last epoch. For more implementation and training details, we will release the code and training scripts at http://www.cs.umd.edu/~mmeshry/projects/lsr/.
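
Equalized learning rate [18] initializes weights from a unit normal distribution and applies the He initialization constant at runtime, so that all layers see updates of comparable scale under the same optimizer settings. A minimal linear-layer version is sketched below; this is our simplified illustration rather than the exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EqualizedLinear(nn.Module):
    """Linear layer with runtime weight scaling (equalized learning rate [18])."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # Weights start from N(0, 1); the per-layer He constant is applied in the
        # forward pass instead of at initialization time.
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.scale = 1.0 / math.sqrt(in_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight * self.scale, self.bias)
```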

a.2 Robustness to pose variation

Here we investigate the robustness of different methods to the pose variation between the source and target images. First, we cluster the test set into low, medium and high pose variance based on the normalized mean keypoint difference (NMKE) between the source and target ground truth images. Then, we compute the identity similarity metric (ID-SIM) for each cluster in the single-shot setting and report the results in Figure 9. The performance gap between our method and the baselines widens as the pose variance increases, indicating that our method is more robust to pose variation. Note that we report results only for the single-shot setting, where the performance gap with the FOMM baseline [33] is smallest. However, our method significantly outperforms FOMM in the multi-shot setting, as we show in Section A.4.1.

Figure 10: Averaging latent codes from multi-shot inputs successfully filters out transient occluders and maintains only the desired information for novel view head synthesis.

a.3 Effect of latent averaging

Given K-shot inputs, we follow [47] and obtain a single layout latent and a single style latent by averaging the K layout and style latents computed from the individual inputs. We observe that averaging the K latents cancels out view-specific information and transient occluders, and successfully maintains the implicit 3D information needed for novel-view head synthesis. Figure 10 shows some examples that highlight this effect. The single-shot source images contain transient occluders, such as the subject's hand or a news ticker, which in turn corrupt our single-shot output. However, increasing the inputs to four frames successfully filters out the transient occluders and results in clean outputs.

a.4 More comparative results

Method PSNR SSIM LPIPS ID-SIM NMKE FID
FOMM [33] 18.20 0.635 0.236 0.869 0.061 56.10
Ours-meta (K=1) 17.27 0.598 0.241 0.869 0.041 48.11
Ours-ft (K=1) 17.37 0.605 0.232 0.886 0.041 45.69
Ours-meta (K=4) 18.90 0.638 0.192 0.909 0.039 43.19
Ours-ft (K=4) 19.33 0.661 0.171 0.930 0.037 34.31
Table 3: Comparison with the FOMM baseline [33]. While FOMM cannot benefit from multiple input frames, our method shows a significant improvement over FOMM with as few as 4-shot inputs.

a.4.1 Comparison with FOMM

The FOMM baseline [33] accurately reconstructs the background and other static regions due to its warping-based nature. Therefore, it achieves lower reconstruction error (PSNR and SSIM) than our approach in the single-shot setting, even when its output contains clear artifacts in the face area. However, one limitation of FOMM is that it cannot utilize more input frames to its advantage. On the other hand, Table 3 shows that our approach benefits from as few as four input frames to outperform FOMM, even in the meta-learned mode. Subject fine-tuning further improves our performance to outperform FOMM by a wide margin in all metrics.

No subject fine-tuning:
K  Method     PSNR   SSIM   LPIPS  ID-SIM  NMKE   FID
1  FSTH [47]  16.80  0.570  0.259  0.801   0.048  51.12
1  LPD [5]    –      –      –      0.732   0.072  80.20
1  Ours       17.27  0.598  0.241  0.869   0.041  48.11
4  FSTH [47]  –      –      –      –       –      –
4  LPD [5]    –      –      –      0.755   0.069  79.67
4  Ours       18.90  0.638  0.192  0.909   0.039  43.19
8  FSTH [47]  17.86  0.600  0.225  0.836   0.046  46.38
8  LPD [5]    –      –      –      0.760   0.068  77.00
8  Ours       19.18  0.645  0.186  0.917   0.039  44.23
32 FSTH [47]  18.66  0.613  0.207  0.843   0.044  44.85
32 LPD [5]    –      –      –      0.769   0.066  63.47
32 Ours       19.35  0.650  0.182  0.921   0.037  43.32

Subject fine-tuned:
K  Method     PSNR   SSIM   LPIPS  ID-SIM  NMKE   FID
1  FSTH [47]  16.92  0.597  0.263  0.836   0.049  53.07
1  LPD [5]    –      –      –      0.837   0.070  48.48
1  Ours       17.37  0.605  0.232  0.886   0.041  45.69
4  FSTH [47]  –      –      –      –       –      –
4  LPD [5]    –      –      –      0.909   0.058  38.81
4  Ours       19.33  0.661  0.171  0.930   0.037  34.31
8  FSTH [47]  18.35  0.647  0.218  0.899   0.044  45.15
8  LPD [5]    –      –      –      0.922   0.056  35.87
8  Ours       19.65  0.675  0.160  0.940   0.036  32.54
32 FSTH [47]  19.69  0.686  0.171  0.927   0.041  33.69
32 LPD [5]    –      –      –      0.935   0.054  33.96
32 Ours       19.98  0.690  0.146  0.948   0.038  28.26

Table 4: Detailed quantitative comparison with the few-shot baselines, showing the effect of both increasing the K-shot inputs and subject-specific fine-tuning. Dashes mark entries that are not reported: FSTH outputs are not available for K=4, and the reconstruction metrics of LPD are omitted since its outputs do not align with the ground truth.
Figure 11: Extending Figure 5 of the main paper. More results comparing our performance to the few-shot baselines with respect to increasing the K-shot inputs and applying subject fine-tuning.

a.4.2 More comparative evaluation

We report the detailed quantitative results for the effect of increasing the number of K-shot inputs, as well as the effect of subject fine-tuning, in Table 4. We observe similar conclusions to those obtained from Figure 6 in the main paper. LPD [5] performs very poorly in the meta-learned setting, and only outperforms the FSTH baseline [47] in the subject fine-tuning setting. On the other hand, our method consistently outperforms the baselines in all metrics across the different settings. Furthermore, the performance of our method at K=4 is on par with or outperforms the baselines evaluated at K=32 across all metrics. Since the LPD [5] baseline does not predict the background and re-crops the input/output frames, we subtract the background and compare with the corresponding cropped ground truths for quantitative analysis. We also exclude LPD from the frame reconstruction evaluation since its outputs do not align with the rest of the methods. Also, the authors of FSTH [47] only provide their outputs for K ∈ {1, 8, 32} and did not release their code; therefore, we do not report their performance for K=4.

Figure 12: Extending Figure 4 of the main paper, showing more qualitative comparisons in the single-shot setting. We show three sets of examples representing low, medium and high variance between the source and target poses. Our method is more robust to pose variations than the baselines.

a.4.3 More qualitative comparisons

Here, we expand on the qualitative comparisons of the main paper in both the multi-shot and single-shot settings. Figure 11 extends Figure 5 of the main paper: it shows more examples comparing the effect of increasing the K-shot inputs and applying subject fine-tuning between our method and the baselines. Figure 12 shows more comparisons in the single-shot setting, analogous to Figure 4 in the main paper.

Figure 13: Qualitative results of our method showing the gains of increasing the number of K-shot inputs and applying subject fine-tuning.
Figure 14: Expanding Figure 7 of the main paper by showing the same reenactment results in the single and 4-shot settings. Our model extrapolates well to challenging poses and expressions even with a single-shot input (shown in source), while preserving the source identity.
Figure 15: More cross-subject reenactment results with different driving identities. Results are shown for our meta-learned model without any fine-tuning, and using 32-shot inputs. We also show the corresponding latent spatial layout maps.

a.5 More qualitative results

We show more qualitative results of our method in Figure 13, illustrating the effect of increasing the K-shot inputs and of applying subject fine-tuning. We observe a noticeable improvement when increasing K from 1 to 4. The visual gains from increasing K further start to saturate, although the quantitative metrics generally keep improving (e.g., Table 4). While increasing K beyond 4 still leads to better visual results in general, the improvements are mostly concentrated in the background and clothing reconstruction, with slight improvements in sharpness and color matching as well. Subject fine-tuning further improves sharpness and better reconstructs the background details.

Figure 16: Qualitative results on subjects not belonging to VoxCeleb2.

a.6 More reenactment results

We first expand Figure 7 of the main paper in Figure 14 by showing the same reenactment results in the single-shot meta-learned setting, as well as with 4-shot inputs in both the meta-learned and subject fine-tuned settings. The results show that, even in the single-shot meta-learned setting, our model extrapolates the input image (source) to challenging poses and expressions quite well, while preserving the source identity. Increasing the input shots to 4 leads to a noticeable visual improvement, and fine-tuning leads to further slight improvements, most notable in the female source (middle example). These results show that our method does not require many frames to produce realistic and identity-preserving results. For video comparisons with the baselines, please refer to the supplementary video.

We also show more cross-subject reenactment results in Figure 15, further extending Figure 7 of the main paper, together with the predicted spatial layouts corresponding to the outputs. The predicted spatial layouts may be less interpretable than traditional semantic segmentations, but they seem to encode more information and capture accurate details about the face shape.

Additionally, we perform out-of-domain reenactment using source subjects not present in the VoxCeleb2 dataset. Some qualitative results are shown in Figure 16. Our approach can synthesize realistic novel views given only a single-shot input, although in some cases it shows a bit of an identity gap.

Figure 17: Example limitations. Top: male-to-female reenactment sometimes causes low identity preservation and other visible artifacts. Bottom: our approach cannot faithfully reconstruct the background details.

a.7 Limitations and failure cases

Here we discuss some of the limitations of our approach.

Temporal consistency.

Similar to previous direct synthesis approaches [47, 5, 46], our training does not enforce temporal consistency. Therefore, the output videos often contain some flickering. Incorporating a temporal aspect during training (e.g., similarly to [41]) could mitigate this problem, but at the expense of a higher training cost.

Failure modes for cross-subject reenactment.

We observe that most of the failure cases in cross-subject reenactment are caused either by source subjects with complex backgrounds, or by using male drivers to animate female sources (e.g., Figure 17). The fact that complex backgrounds can lead to artifacts in our results signifies that the background information is still entangled with the face and identity information; learning a better disentangled representation could alleviate this problem. On the other hand, the difficulty with male-to-female reenactment implies that our approach still has some sensitivity to the driver landmarks. While our approach reduces this sensitivity significantly compared to previous baselines, there is still room for improvement.

Background reconstruction.

Direct synthesis approaches, including our method, synthesize the target frame from a compressed latent code. This compressed bottleneck leads to the loss of some information, especially for the background details. Figure 17 shows some examples in the single-shot setting. Our method cannot transfer static parts (e.g., the closed captions or the background) from the source image to the synthesized view. Borrowing elements from the warping-based approaches is one direction to better reconstruct static details.

Dataset-induced limitations.

The VoxCeleb2 dataset [7] contains low-resolution videos and is processed with zoomed-in center crops that often cut off the top of the head. Such dataset biases are inevitably inherited by the trained models. Therefore, generating outputs for out-of-domain inputs requires pre-processing them to have similar properties to the VoxCeleb2 data.

a.8 Ethical concerns

While the task of synthesizing realistic talking heads has a wide range of applications, it also raises ethical concerns regarding potential misuse of this technology. A prime example is the growing misuse of DeepFakes [39, 26]. Several state-of-the-art methods can easily swap identities, expressions and face attributes, and generate photo-realistic samples. Additionally, as access to face reenactment models becomes easier, more and more people can misuse such models through widely available applications. It is therefore important to also have the ability to detect fake content. In this direction, recent works such as [24, 40, 49] tackle the problem of detecting fake images. Especially interesting is the work by Wang et al. [40], which shows that models for fake image detection can be made to generalize well to unseen scenarios. While this is a temporary respite, it is important to continue research in the field of fake image detection to keep pace with the ever-improving field of image synthesis, as not only do the models improve, but the ease of access to them also grows rapidly.