StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3

Realistic generative face video synthesis has long been a pursuit in both computer vision and graphics community. However, existing face video generation methods tend to produce low-quality frames with drifted facial identities and unnatural movements. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity identity-preserving face videos with vivid movements. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity facial image generation, but the latent space is intrinsically entangled. By carefully examining its latent properties, we propose our decomposition and recomposition designs which allow for the disentangled combination of facial appearance and movements. Moreover, a temporal-dependent model is built upon the decomposed latent features, and samples reasonable sequences of motions that are capable of generating realistic and temporally coherent face videos. Particularly, our pipeline is trained with a joint training strategy on both static images and high-quality video data, which is of higher data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation results both qualitatively and quantitatively. Notably, StyleFaceV is capable of generating realistic 1024×1024 face videos even without high-resolution training videos.


Introduction

Face generation has been a long-standing research topic in vision and graphics. As humans perceive the world through continuous visual streams, exploring face video generation is of great importance. Despite the great progress Goodfellow et al. (2014); Karras et al. (2019, 2020b, 2020a) in generating static face images, face video generation is less explored due to the extra complexity introduced by the temporal dimension. Compared to a face image, we expect a generated face video to possess three desired properties: 1) High fidelity: the face in each frame should be photo-realistic. 2) Identity preservation: the same face identity should be kept across frames in the generated video. 3) Vivid dynamics: a natural face video should contain a reasonable sequence of motions.

However, current face video generation methods still struggle to generate visually satisfying videos that meet all the desired properties above. Most previous works design specific end-to-end generative models Tian et al. (2021); Tulyakov et al. (2018); Clark et al. (2019); Skorokhodov et al. (2021); Yu et al. (2022) that target video generation directly. Such designs are expected to achieve both spatial quality and temporal coherence in a single framework trained from scratch, imposing a great burden on the network. Alternatively, some recent works Fox et al. (2021); Tian et al. (2021) synthesize videos by employing a pre-trained StyleGAN and finding the corresponding sequence of latent codes. However, these methods manipulate entangled latent codes, so the identity easily changes with head movements.

To address these challenges, we propose a new framework, named StyleFaceV, with the three desired properties: photo-realism, identity preservation, and vivid dynamics. In Fig. 1, we show some videos generated by our framework; the two videos in each row share the same face identity but show different facial movements. Our framework uses StyleGAN3 Karras et al. (2021) to synthesize a sequence of face frames that form the final generated video. Given the powerful image generator, the key to generating a realistic video is: how can we generate a temporally coherent video with the same identity by traversing the latent space of StyleGAN3? Our core insight is to decompose the latent space of StyleGAN3 into appearance and pose information via a decomposition and recomposition pipeline.

The decomposition module extracts pose and appearance information from the images synthesized by StyleGAN3, using two extraction networks for pose features and appearance features, respectively. With the decomposed features, the recomposition module fuses them back into latent codes that serve as inputs to StyleGAN3, and the fused latent codes are then used to generate the facial images. With this decomposition and recomposition pipeline, we can easily obtain a well-disentangled sequence of latent codes with the same identity by sampling the pose information while keeping the appearance information fixed. The sampling of pose sequences is achieved by an additional LSTM Hochreiter and Schmidhuber (1997) module.

To train the decomposition and recomposition pipeline, we need faces with the same identity but different poses. Such paired data can easily be obtained from face videos; however, large-scale, high-quality video datasets are lacking. To alleviate this problem, we propose a joint training strategy that utilizes both the rich image priors captured by the well-pretrained StyleGAN3 model and the limited video data. Specifically, we perform self-augmentation by synthesizing a new frame with a randomly translated and rotated face through the affine layer embedded in StyleGAN3, simulating paired frames of face videos. The use of synthetic data eases the requirements on the video dataset. Notably, this joint training design enables our model to generate realistic face videos even without accessing high-resolution training videos.

In summary, our main contributions include:

  • We propose a novel framework StyleFaceV which decomposes and recomposes pretrained StyleGAN3 to generate new face videos via sampling decomposed pose sequences.

  • We design a joint training strategy which utilizes information from both the image domain and video domain.

  • Our work exhibits that latent space of StyleGAN3 contains a sequence of latent codes which is able to generate a temporally coherent face video with vivid dynamics.

Figure 2: Overview of Our Proposed StyleFaceV. (a) StyleFaceV decomposes the image into an appearance representation and a pose representation, then recomposes both representations into the intermediate embedding used as the input of the synthesis network. (b) Besides video training data, the joint training strategy also samples generated images for training, utilizing the rich image priors captured by the well-pretrained StyleGAN3 model through sampling of the initial content noise. To simulate paired frames with different poses but the same appearance, self-augmentation randomly transforms the generated image through the affine function embedded in StyleGAN3. (c) To generate videos from scratch, an LSTM-based motion sampling module is trained to sample a sequence of pose representations from motion noise, which is then recomposed with a randomly sampled appearance representation to generate a sequence of frames.

Related Work

StyleGAN Models.

With the success of Generative Adversarial Networks (GANs) Goodfellow et al. (2014), generative models have become a popular technique for generating natural images. Various GAN architectures Karras et al. (2017); Miyato et al. (2018); Brock et al. (2018) were subsequently proposed to stabilize the training process, greatly improving the quality of generated images. After the StyleGAN series Karras et al. (2019, 2020b, 2020a) achieved high-resolution generation and style control, Karras et al. (2021) proposed StyleGAN3 to solve the "texture sticking" problem, paving the way for generative models better suited to video and animation. Besides generating images from scratch, these pre-trained image generators are also useful for image manipulation and restoration tasks that exploit the captured natural image priors, such as image colorization Pan et al. (2021); Wu et al. (2021), super-resolution Chan et al. (2021); Wang et al. (2021); Menon et al. (2020), and facial image editing Shen et al. (2020); Patashnik et al. (2021); Jiang et al. (2021). In our proposed StyleFaceV framework, we also utilize a pre-trained StyleGAN3 to render photo-realistic face videos.

Video Generation. Different from image generation, which only requires sampling at the spatial level, video generation involves additional sampling at the temporal level. Many video synthesis works have achieved impressive results, e.g., video-to-video translation with given segmentation masks Wang et al. (2018, 2019) or human poses Chan et al. (2019), and face reenactment given both human identities and driving motions Siarohin et al. (2019); Zhou et al. (2021). However, in the unconditional setting, generating videos from scratch remains unresolved. Early works Tulyakov et al. (2018); Saito et al. (2017); Hyun et al. (2021); Aich et al. (2020); Munoz et al. (2021); Saito et al. (2020) use a single framework to synthesize low-resolution videos by sampling from noise. Tulyakov et al. (2018) decompose noise into a content domain and a motion domain: the content noise controls the content of the synthesized video, while motion noises handle the temporal consistency. Recently, MoCoGAN-HD Tian et al. (2021) and StyleVideoGAN Fox et al. (2021) scaled up the resolution of synthesized videos by employing a pre-trained StyleGAN: they sample a sequence of latent codes, which are then fed into the pre-trained StyleGAN to synthesize a sequence of images, but pose and identity information remain mixed in a single latent code. Different from these works, our method explicitly decomposes the input to the generator into two branches, i.e., pose and appearance, via a decomposition loss. DI-GAN Yu et al. (2022) introduces an INR-based video generator that improves motion dynamics by manipulating space and time coordinates separately; however, it has a high computational cost and suffers from the sticking phenomenon. StyleGAN-V Skorokhodov et al. (2021) modifies the generator and discriminator architectures of StyleGAN by introducing a pose branch to the inputs. Both DI-GAN and StyleGAN-V require retraining generators on video datasets, restricting the faces in synthesized videos to the domain of the training set. By contrast, our method utilizes image priors captured in the pre-trained StyleGAN3 and synthesizes face videos with diverse identities.

Our Approach

Our framework is built on top of the powerful pre-trained image generator StyleGAN3 Karras et al. (2021), which provides high-quality face frames. As shown in Fig. 2, a decomposition and recomposition pipeline is used to find sequences of latent codes in the latent space of StyleGAN3 that generate a temporally coherent sequence of face images with the same identity. Finally, the motion sampler models the distribution of natural movements and samples a sequence of poses to drive the facial movements.

Pre-Trained Image Generator

The recently proposed StyleGAN3 Karras et al. (2021) does not suffer from the "texture sticking" problem and thus paves the way for generating videos with pre-trained image generators. Among the variants of StyleGAN3, only StyleGAN3-R is equivariant to rotation transformations. Since face rotations are commonly observed in talking-face videos, we adopt the pre-trained StyleGAN3-R as our image generator. It contains a mapping network that maps the content noise into an intermediate latent space, and a synthesis network that synthesizes an image from the intermediate embedding:

(1)

We slightly fine-tune the generator on the video dataset so that it can generate images from the distribution of the video dataset.

In the unconditional setting, the face video generation task aims to generate a realistic face video sequence from scratch. Regarding each generated image as a single frame, a sequence of generated images forms the final synthesized video. The remaining problem is therefore how to find a sequence of latent codes in the latent space of StyleGAN3 that generates a temporally coherent sequence of face images with the same identity.
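As a concrete toy picture of this two-stage structure, the sketch below replaces StyleGAN3's mapping and synthesis networks with fixed random linear maps; all names and dimensions here are illustrative stand-ins, not the actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for StyleGAN3's mapping network M and synthesis network S.
# The real networks are deep; fixed random linear maps suffice to make the
# two-stage structure (z -> w -> image) explicit.
Z_DIM, W_DIM, IMG_DIM = 64, 32, 16 * 16

M = rng.standard_normal((W_DIM, Z_DIM)) / np.sqrt(Z_DIM)    # mapping network
S = rng.standard_normal((IMG_DIM, W_DIM)) / np.sqrt(W_DIM)  # synthesis network

def generate_frame(z):
    """Content noise z -> intermediate embedding w -> flattened 'image'."""
    w = M @ z     # mapping network: content noise to latent embedding
    img = S @ w   # synthesis network: embedding to image
    return w, img

def generate_video(zs):
    """A video is just the sequence of frames for a sequence of latents."""
    return [generate_frame(z)[1] for z in zs]

zs = rng.standard_normal((8, Z_DIM))   # 8 latent codes -> 8 frames
video = generate_video(zs)
```

A video is then exactly the frame sequence produced by a sequence of latent codes, which is the search problem described above.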

Decomposition and Recomposition

To find a path in the latent space of StyleGAN3 that changes facial actions and movements without changing the identity, we propose a decomposition and recomposition pipeline, as shown in Fig. 2(a). In this pipeline, we decompose a face image into an appearance representation and a pose representation. We can then simply recompose the two representations through the recomposition network to obtain the intermediate embedding used as the input of the synthesis network:

(2)

This pipeline allows us to drive the face by changing the pose representation while the identity is kept by freezing the appearance representation.
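The recomposition step can be sketched as follows, with a toy linear network standing in for the learned recomposition network; the concatenation design and all dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
APP_DIM, POSE_DIM, W_DIM = 24, 8, 32

# Toy recomposition network R: concatenates appearance and pose features
# and projects them into the W-space of a (stand-in) synthesis network.
R = rng.standard_normal((W_DIM, APP_DIM + POSE_DIM)) / np.sqrt(APP_DIM + POSE_DIM)

def recompose(f_app, f_pose):
    """Fuse appearance and pose features into one intermediate embedding."""
    return R @ np.concatenate([f_app, f_pose])

# Freeze one appearance feature, vary the pose feature: every resulting
# latent shares the identity component by construction.
f_app = rng.standard_normal(APP_DIM)
poses = rng.standard_normal((5, POSE_DIM))
latents = np.stack([recompose(f_app, p) for p in poses])
```

Because the appearance feature is shared across all five latents, the identity-relevant component is held constant by construction while the pose component varies.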

Figure 3: Pose Extraction. We visualize the pose representations for two persons and draw the traced points on their corresponding images.

Pose Extraction. Pose representations in face videos mainly consist of two levels of information: 1) overall head movement, and 2) local facial movement. Both kinds of movement are reflected in landmarks. Therefore, we use a pre-trained landmark detection model Wang et al. (2019) to supervise the pose estimation. However, predicted landmarks also contain identity information, such as face shape. To purify the pose information without retaining identity information, our pose extractor is trained to preserve only the information of selected key face points:

(3)
(4)

where the key-face-point predictor outputs the target key face points selected from the landmarks predicted by the pre-trained landmark detection model. As shown in Fig. 3, these points only trace the positions of facial attributes and their status (e.g., open/closed for the mouth). The number of key face points is significantly smaller than the number of landmarks. In addition, more detailed pose information that is not captured by landmarks is supervised during the reconstruction stage.
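A minimal sketch of this key-point supervision, assuming a hypothetical index set over a standard 68-point landmark layout (the paper's actual point selection is not specified here, so these indices are illustrative only):

```python
import numpy as np

# Hypothetical indices of the key face points kept from a 68-point landmark
# set (eye corners, mouth corners, mouth opening, ...); illustrative only.
KEY_POINT_IDX = [36, 39, 42, 45, 48, 54, 62, 66]

def pose_loss(pred_points, landmarks):
    """MSE between predicted key points and the selected detector landmarks."""
    target = landmarks[KEY_POINT_IDX]   # (K, 2) selected (x, y) points
    return float(np.mean((pred_points - target) ** 2))

# Fake detector output: 68 landmarks with (x, y) coordinates.
landmarks = np.random.default_rng(2).random((68, 2))
perfect = landmarks[KEY_POINT_IDX]   # a predictor that matches exactly
```

Because only a small subset of points is supervised, identity cues carried by the full landmark set (e.g., face contour shape) are discarded.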

Appearance Extraction. Different from the pose representation, it is hard to directly define an appearance representation. In the face video generation task, we assume the appearance does not change within a video. Based on this assumption, if we sample two frames from the same video, we have:

(5)

In theory, the appearance extractor can be learned by making the reconstructed frame close to the target frame. However, we find that the networks can hardly converge if we train the whole framework together from scratch. This is mainly because the appearance and pose representations must be disentangled from each other, and randomly initialized networks cannot provide effective appearance/pose information. Therefore, we first pretrain the pose extractor with the single loss in Eq. 4, and then train the whole framework together.

Objective Functions. To use the video dataset, each time we sample frames with a fixed interval and predict each frame through Eq. 5. The interval is set so that the sampled frames show obvious pose differences. By training to reconstruct frames with the same appearance, the appearance extractor is guided to ignore the pose information, while the pose extractor learns to extract the differences among frames, which serve as the pose representations in this work. The overall objective for training with the video dataset is:

(6)

where pixel-domain and perceptual constraints guarantee the reconstruction quality in both the pixel domain and the perceptual feature domain extracted by VGG16 Simonyan and Zisserman (2014), an adversarial loss Goodfellow et al. (2014) enhances local details, and the pose training loss is introduced in Eq. 4; each term has its own weight when using video data. Once the pipeline can roughly reconstruct images with decomposition, the weights for training with sampled images are decreased so that pose representations provide finer control, which is further analyzed in the ablation study.
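The weighted objective can be sketched as below; the function name and weight values are illustrative stand-ins, and the perceptual term is represented by precomputed feature vectors rather than an actual VGG16:

```python
import numpy as np

def reconstruction_objective(frame, recon, feat, feat_recon,
                             adv_term, pose_term, weights):
    """Weighted sum of pixel, perceptual, adversarial, and pose losses.
    `feat`/`feat_recon` stand in for VGG16 features of the two images."""
    l_pix = np.abs(frame - recon).mean()       # pixel-domain L1
    l_per = np.abs(feat - feat_recon).mean()   # perceptual distance
    return (weights["pix"] * l_pix + weights["per"] * l_per
            + weights["adv"] * adv_term + weights["pose"] * pose_term)

rng = np.random.default_rng(3)
frame, recon = rng.random((16, 16)), rng.random((16, 16))
feat, feat_recon = rng.random(32), rng.random(32)
loss = reconstruction_objective(frame, recon, feat, feat_recon,
                                adv_term=0.1, pose_term=0.05,
                                weights={"pix": 1.0, "per": 1.0,
                                         "adv": 0.1, "pose": 1.0})
```

Lowering the weights for sampled-image batches later in training (the re-balance discussed in the ablation) is just a change to the `weights` dictionary for those batches.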

Joint Training Strategy. A well-trained StyleGAN3 model contains rich image priors captured from large-scale, high-quality face image datasets. Utilizing this property, we can also sample identities from the distribution of the original image dataset. However, there is no shared appearance among sampled images, and each image is an individual frame. To prevent the networks from lazily performing reconstruction without decomposition, we design the joint training strategy shown in Fig. 2(b). We first perform self-augmentation through the affine layer embedded in StyleGAN3-R to produce a new frame with a randomly translated and rotated face, while keeping the facial appearance and a natural background. We then use the original version to provide the pose information and the transformed variants to provide the appearance information.

Specifically, instead of sampling two frames from the same video as in Eq. 5, we use the equation below to obtain the frame pair from StyleGAN3-R:

(7)

where the rotation angle and the translation parameters are randomly sampled.
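The self-augmentation can be illustrated with an explicit 2D rigid transform; StyleGAN3-R applies the equivalent rotation and translation through its input affine layer, and the parameter ranges below are arbitrary choices for the sketch:

```python
import numpy as np

def random_rigid_transform(points, rng):
    """Rotate and translate 2D points, mimicking the rotation/translation
    that the input affine layer applies during self-augmentation."""
    theta = rng.uniform(-0.3, 0.3)        # random rotation angle
    t = rng.uniform(-0.1, 0.1, size=2)    # random translation
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T + t

rng = np.random.default_rng(4)
pts = rng.random((10, 2))
moved = random_rigid_transform(pts, rng)

# Rigid transforms preserve pairwise distances (an appearance/identity cue),
# while absolute positions (a pose cue) change.
d0 = np.linalg.norm(pts[0] - pts[1])
d1 = np.linalg.norm(moved[0] - moved[1])
```

This is exactly why the transformed variant can serve as an "appearance" reference: identity-relevant geometry is preserved while the pose differs.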

In addition, we add a self-supervised embedding loss function to help the convergence of training:

(8)

The other loss functions are the same as those in Eq. 6.

Figure 4: Qualitative Comparison. We compare our proposed StyleFaceV with representative methods on the video generation task. Results of MCHD-pre and MCHD-re are from the FFHQ domain, while results of SG-V and DI-GAN are from the RAVDESS domain. Our StyleFaceV can generate videos by sampling from both domains, and we compare them respectively. The qualitative comparison shows that our StyleFaceV outperforms all other methods in both quality and identity preservation.

Motion Sampling

In order to generate videos from random motion noise, we design a motion sampling module to sample meaningful sequences of pose representations. We choose an LSTM Hochreiter and Schmidhuber (1997) to generate subsequent poses given the initial pose and sampled motion noises:

(9)
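A toy version of this autoregressive rollout, with a single tanh recurrence standing in for the LSTM; all dimensions, the weight scaling, and the residual pose update are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
POSE_DIM, NOISE_DIM, HID = 8, 4, 16

# Toy recurrent motion sampler: one tanh recurrence plays the role of the
# LSTM, rolling out a pose sequence from an initial pose and per-step noise.
W_h = rng.standard_normal((HID, HID)) * 0.1
W_p = rng.standard_normal((HID, POSE_DIM)) * 0.1
W_n = rng.standard_normal((HID, NOISE_DIM)) * 0.1
W_o = rng.standard_normal((POSE_DIM, HID)) * 0.1

def sample_motion(p0, noises):
    """Roll out poses autoregressively from initial pose p0 and motion noise."""
    h = np.zeros(HID)
    poses = [p0]
    for eps in noises:   # one motion noise per newly generated frame
        h = np.tanh(W_h @ h + W_p @ poses[-1] + W_n @ eps)
        poses.append(poses[-1] + W_o @ h)   # residual update of the pose
    return np.stack(poses)

p0 = rng.standard_normal(POSE_DIM)
seq = sample_motion(p0, rng.standard_normal((15, NOISE_DIM)))
```

Each recomposed pose in `seq` would then be fused with a fixed appearance representation to render one frame of the output video.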

Because we have no ground truth for the generated results, we use a 2D adversarial loss to make each generated pose representation fit the distribution of real videos, and a 3D adversarial loss, whose inputs are consecutive pose representations, to enforce temporal consistency:

(10)

where the video discriminator predicts whether a video is real or fake from the given sequence.

Additionally, we add a diversity loss to make the generated pose representations vary with the given motion noises:

(11)

where the hidden feature is projected by a fully connected layer.
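One simple form such a diversity loss can take is the ratio of noise distance to pose distance, minimized when different noises yield different pose rollouts; this exact formulation is an assumption, since the paper's Eq. 11 is not reproduced here:

```python
import numpy as np

def diversity_loss(pose_a, pose_b, noise_a, noise_b, eps=1e-8):
    """Penalize motion samplers that ignore their noise input: the further
    apart two noises are, the further apart their pose rollouts should be."""
    pose_dist = np.linalg.norm(pose_a - pose_b)
    noise_dist = np.linalg.norm(noise_a - noise_b)
    return noise_dist / (pose_dist + eps)   # small when poses diverge

rng = np.random.default_rng(6)
na, nb = rng.standard_normal(4), rng.standard_normal(4)
collapsed = diversity_loss(np.ones(8), np.ones(8), na, nb)    # identical poses
diverse = diversity_loss(np.zeros(8), np.ones(8) * 5, na, nb) # distinct poses
```

A mode-collapsed sampler (identical outputs for different noises) receives a huge penalty, while a diverse sampler receives a small one.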

For the motion sampler at the lower resolution, we generate the final videos with the sampled pose sequence and apply supervision on image/video representations, because we find this brings more diverse motions. At the higher resolution, all losses are applied on pose representations only, which are much smaller than the image space, saving a large amount of computation. This design allows the training process to supervise a longer sequence.

Experiments

Joint Unaligned Dataset

Previous face video generation methods Tulyakov et al. (2018); Saito et al. (2020); Skorokhodov et al. (2021) usually use FaceForensics Rössler et al. (2019) as the video dataset and can generate good videos whose identities come from FaceForensics. However, StyleGAN-V Skorokhodov et al. (2021) points out the limitation that only identities from FaceForensics can be sampled. To overcome this problem, we propose the joint training strategy. We choose unaligned FFHQ Karras et al. (2019) as the image dataset, which provides tens of thousands of identities for sampling, and RAVDESS Livingstone and Russo (2018) as the video dataset, which provides high-quality, vivid face motion sequences. Compared to FaceForensics, RAVDESS has more varied motions and better video quality, and thus a smaller gap with the image dataset FFHQ, helping the convergence of joint training.

In addition, many previous video generation methods Tulyakov et al. (2018); Saito et al. (2020); Skorokhodov et al. (2021) fix the face at the center by aligning the face bounding box when preprocessing face video data. This erases natural head movement and thus reduces the difficulty of face generation, but it also introduces obvious face distortions and shaking among frames. In this paper, we follow the setting of Siarohin et al. (2019) and allow the face to move freely within a fixed window, which is more consistent with real-world face videos, where the camera is usually fixed.

Evaluation Metrics. For evaluation, we report Frechet Inception Distance (FID) Heusel et al. (2017) and Frechet Video Distance (FVD) Unterthiner et al. (2018). We use two versions of FID: FID-RAVDESS, which is computed against the RAVDESS dataset, and FID-Mixture, which is computed against a mixture dataset (RAVDESS and FFHQ). We also conduct a user study to evaluate the quality of generated videos. We mix our generated videos with those generated by baselines, because paired comparison is not supported in unconditional generation. Users were asked to give two scores from 1 to 5 (higher is better) for the naturalism of movements and the identity preservation throughout the movements.
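For reference, FID and FVD both reduce to the Frechet distance between two Gaussians fitted to feature statistics (Inception features for FID, video-network features for FVD). A self-contained numpy version of that distance, using an eigendecomposition-based matrix square root for symmetric PSD covariances:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians, the quantity behind FID/FVD:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1^{1/2} C2 C1^{1/2})^{1/2})."""
    def sqrtm_psd(m):   # matrix square root via eigendecomposition (PSD input)
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    s1 = sqrtm_psd(cov1)
    covmean = sqrtm_psd(s1 @ cov2 @ s1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2 * covmean))

mu = np.zeros(3)
cov = np.eye(3)
zero = frechet_distance(mu, cov, mu, cov)          # identical distributions
shifted = frechet_distance(mu, cov, mu + 1.0, cov) # mean shifted by (1,1,1)
```

Identical distributions give distance 0; with identical unit covariances, the distance is just the squared mean difference.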

Methods      FID-RAVDESS (↓)  FID-Mixture (↓)  FVD (↓)
MCHD-pre     174.44           53.53            1425.77
MCHD-re      146.08           73.73            1350.67
StyleGAN-V   17.93            95.35            171.60
DI-GAN       12.02            77.01            142.08
StyleFaceV   15.42            25.31            118.72
Table 1: Quantitative Comparisons on FID and FVD Score. We compute the FID-RAVDESS between the generated frames and real frames from RAVDESS. FID-Mixture is calculated against a mixture dataset. FVD is computed between synthesized videos and videos in RAVDESS dataset.

Experimental Settings

We compare our proposed StyleFaceV with the following representative methods on the face video generation task. Compared to the previous aligned setting, unaligned face generation is more challenging. Considering the huge computational cost of DI-GAN, all results are fixed to the same number of frames for a fair comparison.

MoCoGAN-HD Tian et al. (2021) uses a pre-trained image generator and synthesizes videos by modifying the embedding in the latent space. We compare with two versions of MoCoGAN-HD: 1) MCHD-pre uses the released pretrained model, which supports sampling identities from the FFHQ domain. 2) MCHD-re retrains the model using StyleGAN3-R Karras et al. (2021) with the video dataset RAVDESS Livingstone and Russo (2018) for a fair comparison, as the original version is built upon StyleGAN2 Karras et al. (2020b), which does not support various unaligned face poses.

DI-GAN Yu et al. (2022) uses an INR-based video generator to improve motion dynamics by manipulating space and time coordinates separately. Because it only supports video datasets, we retrain it on the RAVDESS dataset. The number of training frames is set to match the testing phase to avoid collapse.

StyleGAN-V (SG-V) Skorokhodov et al. (2021) builds the model on top of StyleGAN2, and redesigns its generator and discriminator networks for video synthesis. We also retrain it on RAVDESS dataset.

Qualitative Comparison

As presented in Fig. 4, MCHD-pre always generates faces located at the center of the image because it is based on StyleGAN2. In addition, its generated results collapse over time, as the model is pre-trained on a fixed-length frame setting. Our retrained MCHD-re alleviates these problems by using StyleGAN3, and the video dataset RAVDESS enables MCHD-re to learn face motion. However, the man in the second video clip of MCHD-re gradually becomes a woman as the pose moves. The reason is that MCHD-re still moves along entangled latent codes, causing the identity to change with other actions. Our method well decomposes the appearance and pose representations, preserving the identity throughout the pose movement.

Another state-of-the-art method, DI-GAN, does produce videos without identity changes. However, some frames are blurred or even collapse. The most recent work, StyleGAN-V, does not suffer from the collapse problem, but its generated faces tend to deform and distort. The poor stability is mainly because both DI-GAN and StyleGAN-V use a single end-to-end trained framework to handle both the quality and the coherence of all synthesized frames, placing a huge burden on the network. As a result, DI-GAN and StyleGAN-V cannot model unaligned faces with movements. In contrast, our framework uses a pre-trained StyleGAN3-R that focuses only on the quality of each single frame and delegates temporal coherence to other modules. The results in Fig. 4 show that our StyleFaceV outperforms all other methods in both quality and identity preservation.

Figure 5: User Study. We asked users to give scores from 1 to 5 (higher is better) for videos generated by different methods in terms of movement naturalism and identity preservation. Our method achieves the highest scores.

Quantitative Comparison

As shown in Table 1, our proposed StyleFaceV achieves an FID-RAVDESS score comparable to that of DI-GAN. Since our method adopts a joint training strategy that utilizes information from both the image domain and the video domain, it can generate images from the FFHQ distribution in addition to the RAVDESS dataset; in other words, the frames synthesized by our method are more diverse than the RAVDESS dataset, so it has a slightly higher FID-RAVDESS than DI-GAN. To further evaluate whether the synthesized data truly cover both the FFHQ and RAVDESS distributions, we set up a mixture testing dataset, uniformly mixed from the FFHQ and RAVDESS datasets, and compute FID-Mixture against it. Our method achieves the lowest FID-Mixture score, verifying that it is capable of generating frames similar to both domains. FVD is adopted as the metric for evaluating the quality of synthesized videos. Compared to the state-of-the-art methods, our method has the lowest FVD score, indicating that videos synthesized by our method are more realistic. Fig. 5 shows the results of the user study, in which we compare our proposed StyleFaceV with MCHD-pre, MCHD-re, and DI-GAN and report the scores for each method. Our approach achieves the highest scores for both the naturalism of movement and identity preservation, outperforming the baseline methods by a large margin.

Figure 6: Effectiveness of the Self-Supervised Embedding Loss. The images in the first column are generated by StyleGAN3. The second and third columns show reconstruction results without and with the self-supervised embedding loss, respectively. The model without the self-supervised embedding loss always produces severely blurred results.
Figure 7: Case without Weight Re-Balance. We reconstruct images using the leftmost appearance reference and the motion references. The model without weight re-balance cannot reconstruct fine details from pose representations.

Ablation Study

Settings                 ArcFace Cosine Similarity (↑)
Without embedding loss   0.6676
With embedding loss      0.7266
Table 2: ArcFace Cosine Similarity Influenced by the Self-Supervised Embedding Loss. The ArcFace similarity is averaged over the reconstructed images. The model without the self-supervised embedding loss has an ArcFace similarity of 0.6676, while our final method achieves 0.7266 on this metric.
Methods                  FID-RAVDESS (↓)  FID-Mixture (↓)  FVD (↓)
Without Reweight         119.90           21.09            722.84
Without Embedding Loss   102.97           67.91            461.17
StyleFaceV               15.42            25.31            118.72
Table 3: Quantitative Ablations on FID and FVD Score. We compute FID-RAVDESS between the generated frames and real frames from RAVDESS. FID-Mixture is calculated against a mixture dataset. FVD is computed between synthesized videos and videos in the RAVDESS dataset.

Self-Supervised Embedding Loss. Unaligned images generated by StyleGAN3-R make the models hard to converge; the model trained without the self-supervised embedding loss therefore generates blurry results. To help the model converge, we add the self-supervised embedding loss on the latent embeddings. Since this loss is computed directly on the latent codes, the gradients are better preserved and propagated to the extractors compared to training with only an image-level reconstruction loss. As shown in Fig. 6, without the help of the self-supervised embedding loss, the model cannot recover the details of the original sampled image and thus produces severely blurred results with poor FID and FVD scores (Table 3). We also report the ArcFace cosine similarity between the reconstructed images and the original images, averaged over the evaluation images. As shown in Table 2, the model without the self-supervised embedding loss has an ArcFace similarity of 0.6676, while our final method achieves 0.7266, further indicating the effectiveness of the self-supervised embedding loss.

Weight Re-Balance. At the beginning of training our decomposition and recomposition pipeline, the biggest challenge is the reconstruction of unaligned images with various poses. After that, the model is supposed to trace finer details from pose representations, especially the actions of the mouth. However, changes in mouth actions are only present in the video data. To make the networks focus on them, we re-balance the weights and make all the loss weights for training with sampled images significantly smaller. Fig. 7 shows that the model with weight re-balance successfully traces fine details from pose representations. As shown in Table 3, the setting without weight re-balance focuses on FFHQ reconstruction (FID-Mixture = 21.09) but performs poorly on RAVDESS reconstruction (FID-RAVDESS = 119.90) and motion sampling (FVD = 722.84).

Implementation Details

Network structure. The pose extractor, appearance extractor, and recomposition network are formed by residual blocks. For the 2D discriminator, we use NLayerDiscriminator Isola et al. (2017); the 3D discriminator is similar but uses conv3d instead.

Loss weights. The loss weights are initialized to the same values for both image data and video data. After the disentanglement pipeline roughly converges and is able to reconstruct identity, we re-balance the weights so that the pose extractor mainly focuses on fine motion capture, significantly lowering the weights used for the sampled image data.

Figure 8: Results of StyleFaceV in High Resolution. StyleFaceV is able to generate 1024×1024 videos.

Results in High Resolution

We also demonstrate the capability of StyleFaceV to generate 1024×1024 videos in Fig. 8. It should be noted that the models are trained without accessing any high-resolution training videos.

Limitation

Our main limitation is that generated videos whose identities come from the video domain have more vivid motions than those from the image domain. The reason is that we have direct temporal training objectives on the video domain, while there is no temporal constraint for the image domain. This is a common issue for all baselines, and our proposed method already demonstrates plausible results.

Conclusions

In this work, we propose a novel framework named StyleFaceV to synthesize videos by leveraging the pre-trained image generator StyleGAN3. At the core of our framework, we decompose the input to StyleGAN3 into appearance and pose information. In this way, we can generate videos from noise by sampling a sequence of pose representations while keeping the appearance information unchanged. We also propose a joint training strategy that exploits information from both the image domain and the video domain, enabling our method to generate high-quality videos without large-scale training video datasets. Extensive experiments demonstrate that our proposed StyleFaceV generates more visually appealing videos than state-of-the-art methods.

References

  • A. Aich, A. Gupta, R. Panda, R. Hyder, M. S. Asif, and A. K. Roy-Chowdhury (2020) Non-adversarial video synthesis with learned priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6090–6099. Cited by: Related Work.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: Related Work.
  • C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5933–5942. Cited by: Related Work.
  • K. C.K. Chan, X. Wang, X. Xu, J. Gu, and C. C. Loy (2021) GLEAN: generative latent bank for large-factor image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Related Work.
  • A. Clark, J. Donahue, and K. Simonyan (2019) Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571. Cited by: Introduction.
  • G. Fox, A. Tewari, M. Elgharib, and C. Theobalt (2021) Stylevideogan: a temporal generative model using a pretrained stylegan. arXiv preprint arXiv:2107.07224. Cited by: Introduction, Related Work.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: Introduction, Related Work, Decomposition and Recomposition.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30. Cited by: Joint Unaligned Dataset.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: Introduction, Motion Sampling.
  • S. Hyun, J. Kim, and J. Heo (2021) Self-supervised video gans: learning for appearance consistency and motion coherency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10826–10835. Cited by: Related Work.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, Cited by: Implementation Details.
  • Y. Jiang, Z. Huang, X. Pan, C. C. Loy, and Z. Liu (2021) Talk-to-edit: fine-grained facial editing via dialog. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13799–13808. Cited by: Related Work.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: Related Work.
  • T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020a) Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems 33, pp. 12104–12114. Cited by: Introduction, Related Work.
  • T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila (2021) Alias-free generative adversarial networks. Advances in Neural Information Processing Systems 34. Cited by: Introduction, Related Work, Pre-Trained Image Generator, Our Approach, Experimental Settings.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: Introduction, Related Work, Joint Unaligned Dataset.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020b) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8110–8119. Cited by: Introduction, Related Work, Experimental Settings.
  • S. R. Livingstone and F. A. Russo (2018) The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american english. PloS one 13 (5), pp. e0196391. Cited by: Joint Unaligned Dataset, Experimental Settings.
  • S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin (2020) Pulse: self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 2437–2445. Cited by: Related Work.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: Related Work.
  • A. Munoz, M. Zolfaghari, M. Argus, and T. Brox (2021) Temporal shift gan for large scale video generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3179–3188. Cited by: Related Work.
  • X. Pan, X. Zhan, B. Dai, D. Lin, C. C. Loy, and P. Luo (2021) Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Related Work.
  • O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski (2021) Styleclip: text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2085–2094. Cited by: Related Work.
  • A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) FaceForensics++: learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV), Cited by: Joint Unaligned Dataset.
  • M. Saito, E. Matsumoto, and S. Saito (2017) Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision, pp. 2830–2839. Cited by: Related Work.
  • M. Saito, S. Saito, M. Koyama, and S. Kobayashi (2020) Train sparsely, generate densely: memory-efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision 128 (10), pp. 2586–2606. Cited by: Related Work, Joint Unaligned Dataset, Joint Unaligned Dataset.
  • Y. Shen, C. Yang, X. Tang, and B. Zhou (2020) Interfacegan: interpreting the disentangled face representation learned by gans. IEEE transactions on pattern analysis and machine intelligence. Cited by: Related Work.
  • A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe (2019) First order motion model for image animation. Advances in Neural Information Processing Systems 32. Cited by: Related Work, Joint Unaligned Dataset.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Decomposition and Recomposition.
  • I. Skorokhodov, S. Tulyakov, and M. Elhoseiny (2021) StyleGAN-v: a continuous video generator with the price, image quality and perks of stylegan2. arXiv preprint arXiv:2112.14683. Cited by: Introduction, Related Work, Joint Unaligned Dataset, Joint Unaligned Dataset, Experimental Settings.
  • Y. Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov (2021) A good image generator is what you need for high-resolution video synthesis. arXiv preprint arXiv:2104.15069. Cited by: Introduction, Related Work, Experimental Settings.
  • S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) Mocogan: decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1526–1535. Cited by: Introduction, Related Work, Joint Unaligned Dataset, Joint Unaligned Dataset.
  • T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly (2018) Towards accurate generative models of video: a new metric & challenges. arXiv preprint arXiv:1812.01717. Cited by: Joint Unaligned Dataset.
  • T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. arXiv preprint arXiv:1910.12713. Cited by: Related Work.
  • T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. arXiv preprint arXiv:1808.06601. Cited by: Related Work.
  • X. Wang, Y. Li, H. Zhang, and Y. Shan (2021) Towards real-world blind face restoration with generative facial prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Related Work.
  • X. Wang, L. Bo, and L. Fuxin (2019) Adaptive wing loss for robust face alignment via heatmap regression. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Decomposition and Recomposition.
  • Y. Wu, X. Wang, Y. Li, H. Zhang, X. Zhao, and Y. Shan (2021) Towards vivid and diverse image colorization with generative color prior. In International Conference on Computer Vision (ICCV), Cited by: Related Work.
  • S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J. Ha, and J. Shin (2022) Generating videos with dynamics-aware implicit generative adversarial networks. arXiv preprint arXiv:2202.10571. Cited by: Introduction, Related Work, Experimental Settings.
  • H. Zhou, Y. Sun, W. Wu, C. C. Loy, X. Wang, and Z. Liu (2021) Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4176–4186. Cited by: Related Work.