StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3

by Haonan Qiu, et al.

Realistic generative face video synthesis has long been a pursuit in both the computer vision and graphics communities. However, existing face video generation methods tend to produce low-quality frames with drifting facial identities and unnatural movements. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity, identity-preserving face videos with vivid movements. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity facial image generation, but its latent space is intrinsically entangled. By carefully examining its latent properties, we propose decomposition and recomposition designs that allow for the disentangled combination of facial appearance and movements. Moreover, a temporal-dependent model built upon the decomposed latent features samples reasonable motion sequences, enabling the generation of realistic and temporally coherent face videos. Notably, our pipeline is trained with a joint strategy on both static images and high-quality video data, which improves data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation results both qualitatively and quantitatively. In particular, StyleFaceV generates realistic 1024×1024 face videos even without high-resolution training videos.
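The decompose-and-recompose idea above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the module names, layer choices, and dimensions (e.g. a 64-d pose code, a GRU as the temporal model) are hypothetical assumptions, standing in for the paper's actual extractor and motion-sampler architectures.

```python
import torch
import torch.nn as nn

class DecomposeRecompose(nn.Module):
    """Sketch: split a StyleGAN3-style latent into appearance and pose codes,
    then recompose a new latent from any (appearance, pose) pair.
    All submodules here are hypothetical placeholders."""

    def __init__(self, latent_dim: int = 512, pose_dim: int = 64):
        super().__init__()
        self.appearance_net = nn.Linear(latent_dim, latent_dim)  # identity/appearance branch
        self.pose_net = nn.Linear(latent_dim, pose_dim)          # compact pose branch
        self.recomposer = nn.Linear(latent_dim + pose_dim, latent_dim)
        # Hypothetical temporal-dependent model: rolls pose codes forward in time.
        self.motion_rnn = nn.GRU(pose_dim, pose_dim, batch_first=True)

    def recompose(self, w_appearance: torch.Tensor, w_pose: torch.Tensor) -> torch.Tensor:
        app = self.appearance_net(w_appearance)   # appearance from a reference latent
        pose = self.pose_net(w_pose)              # pose from a driving latent
        # Recomposed latent would be fed to the frozen StyleGAN3 synthesis network.
        return self.recomposer(torch.cat([app, pose], dim=-1))

    def sample_motion(self, w_start: torch.Tensor, num_frames: int) -> torch.Tensor:
        pose = self.pose_net(w_start).unsqueeze(1)    # (B, 1, pose_dim)
        frames, hidden = [], None
        for _ in range(num_frames):                   # autoregressive pose rollout
            pose, hidden = self.motion_rnn(pose, hidden)
            frames.append(pose)
        return torch.cat(frames, dim=1)               # (B, num_frames, pose_dim)

model = DecomposeRecompose()
w_ref, w_drive = torch.randn(2, 512), torch.randn(2, 512)
w_new = model.recompose(w_ref, w_drive)          # one recomposed frame latent
poses = model.sample_motion(w_ref, num_frames=8)  # a short pose trajectory
print(w_new.shape, poses.shape)
```

Because appearance comes from one latent and pose from another, identity stays fixed across a generated clip while only the pose trajectory evolves, which is the disentanglement the framework relies on.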



Encode-in-Style: Latent-based Video Encoding using StyleGAN2

We propose an end-to-end facial video encoding approach that facilitates...

CPNet: Exploiting CLIP-based Attention Condenser and Probability Map Guidance for High-fidelity Talking Face Generation

Recently, talking face generation has drawn ever-increasing attention fr...

One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2

While recent research has progressively overcome the low-resolution cons...

Text2Performer: Text-Driven Human Video Generation

Text-driven content creation has evolved to be a transformative techniqu...

High-resolution Face Swapping via Latent Semantics Disentanglement

We present a novel high-resolution face swapping method using the inhere...

High-Fidelity 3D Digital Human Creation from RGB-D Selfies

We present a fully automatic system that can produce high-fidelity, phot...

Scaling Autoregressive Video Models

Due to the statistical complexity of video, the high degree of inherent ...
