Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

06/06/2023
by   Zhenhui Ye, et al.

We are interested in a novel task: low-resource text-to-talking-avatar synthesis. Given only a few minutes of video of a talking person, with its audio track, as training data and arbitrary text as the driving input, we aim to synthesize high-quality talking portrait videos that correspond to the input text. This task has broad application prospects in the digital human industry but has not yet been achieved technically, owing to two challenges: (1) it is difficult for a traditional multi-speaker Text-to-Speech (TTS) system to mimic the timbre of out-of-domain audio; (2) it is hard to render high-fidelity, lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that effectively disentangles text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes both challenges and generates identity-preserving speech and realistic talking person video. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.

