
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation

by Xian Liu, et al.
SenseTime Corporation
Monash University

Animating high-fidelity video portraits with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. To capture the inconsistent motions and the semantic difference between the human head and torso, some works model them with two separate sets of NeRF, which leads to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model handles the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders more realistic video portraits than previous methods. Project page:
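
For readers unfamiliar with the volume rendering step the abstract refers to, the following is a minimal sketch of the standard NeRF compositing rule along a single ray. It is not SSP-NeRF's pipeline: the audio conditioning, the parsing branch, and the Torso Deformation module are not shown, and the function name `composite_ray` and its arguments are hypothetical, chosen only for illustration.

```python
import math


def composite_ray(densities, colors, deltas):
    """Composite samples along one ray with the standard NeRF quadrature.

    densities: non-negative volume densities sigma_i, one per sample
    colors:    (r, g, b) tuples c_i, one per sample
    deltas:    distances delta_i between adjacent samples
    Returns (rgb, weights), where weights[i] = T_i * alpha_i.
    """
    transmittance = 1.0  # T_i: fraction of light that reaches sample i
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity contributed by this sample's segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            rgb[k] += weight * color[k]
        # Light that survives this sample continues to the next one.
        transmittance *= 1.0 - alpha
    return rgb, weights


# Hypothetical usage: an empty sample followed by an opaque green one.
pixel, w = composite_ray([0.0, 1e9], [(1, 0, 0), (0, 1, 0)], [1.0, 1.0])
```

In an audio-driven setting, the densities and colors would come from a network queried with the sample position, view direction, and an audio feature; here they are passed in directly to keep the sketch self-contained.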




GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

Generating photo-realistic video portrait with arbitrary speech audio is...

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

Generating high-fidelity talking head video by fitting with the input au...

Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

While dynamic Neural Radiance Fields (NeRF) have shown success in high-f...

HRTF Field: Unifying Measured HRTF Magnitude Representation with Neural Fields

Head-related transfer functions (HRTFs) are a set of functions describin...

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

To the best of our knowledge, we first present a live system that genera...

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

Although significant progress has been made to audio-driven talking face...

Reconstructing Personalized Semantic Facial NeRF Models From Monocular Video

We present a novel semantic model for human head defined with neural rad...