Robust One Shot Audio to Video Generation

12/14/2020
by Neeraj Kumar, et al.

Audio-to-video generation is an interesting problem with numerous applications across industry verticals, including film making, multimedia, marketing, and education. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps for generative adversarial networks. Further, enabling one-shot learning from a single unseen image increases the complexity of the problem while simultaneously making it more applicable to practical scenarios. In this paper, we propose a novel approach, OneShotA2V, to synthesize a talking-person video of arbitrary length using as input an audio signal and a single unseen image of a person. OneShotA2V leverages curriculum learning to learn the movements of expressive facial components and hence generates a high-quality talking-head video of the given person. Further, it feeds the features generated from the audio input directly into a generative adversarial network, and it adapts to any given unseen selfie by applying few-shot learning with only a few output update epochs. OneShotA2V uses an architecture based on a spatially adaptive normalization multi-level generator and multiple multi-level discriminators. The input audio clip is not restricted to any specific language, which gives the method multilingual applicability. Experimental evaluation demonstrates the superior performance of OneShotA2V compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [43], Speech2Vid [8], and other approaches on multiple quantitative metrics, including SSIM (structural similarity index), PSNR (peak signal-to-noise ratio), and CPBD (image sharpness). Further, qualitative evaluation and online Turing tests demonstrate the efficacy of our approach.
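The generator described above relies on spatially adaptive normalization to inject conditioning features at multiple levels. As a rough illustration only (not the authors' implementation), a SPADE-style normalization layer can be sketched in NumPy: the feature map is instance-normalized, then modulated per pixel by a scale and shift predicted from a conditioning tensor, which here stands in for the projected audio/identity features; the function name and shapes are assumptions for the sketch.

```python
import numpy as np

def spatially_adaptive_norm(x, cond, eps=1e-5):
    """Simplified SPADE-style spatially adaptive normalization (sketch).

    x    : (C, H, W) generator feature map.
    cond : (2*C, H, W) conditioning tensor (e.g. projected audio/identity
           features), split into a per-pixel scale (gamma) and shift (beta).
    """
    # Instance-normalize each channel of x over its spatial dimensions.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)

    # Spatially varying modulation parameters from the conditioning input.
    gamma, beta = np.split(cond, 2, axis=0)
    return (1 + gamma) * x_norm + beta

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))    # toy feature map
cond = rng.normal(size=(16, 4, 4))   # toy conditioning tensor
out = spatially_adaptive_norm(feat, cond)
print(out.shape)  # (8, 4, 4)
```

Unlike plain batch or instance normalization, the modulation parameters here vary per spatial location, which lets the conditioning signal steer local facial regions rather than applying one global scale and shift.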

