High-Speed and High-Quality Text-to-Lip Generation

07/14/2021
by   Jinglin Liu, et al.
0

As a key component of talking face generation, lip movements generation determines the naturalness and coherence of the generated talking face video. Prior literature mainly focuses on speech-to-lip generation while there is a paucity in text-to-lip (T2L) generation. T2L is a challenging task and existing end-to-end works depend on the attention mechanism and autoregressive (AR) decoding manner. However, the AR decoding manner generates current lip frame conditioned on frames generated previously, which inherently hinders the inference speed, and also has a detrimental effect on the quality of generated lip frames due to error propagation. This encourages the research of parallel T2L generation. In this work, we propose a novel parallel decoding model for high-speed and high-quality text-to-lip generation (HH-T2L). Specifically, we predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features with their duration in a non-autoregressive manner. Furthermore, we incorporate the structural similarity index loss and adversarial learning to improve perceptual quality of generated lip frames and alleviate the blurry prediction problem. Extensive experiments conducted on GRID and TCD-TIMIT datasets show that 1) HH-T2L generates lip movements with competitive quality compared with the state-of-the-art AR T2L model DualLip and exceeds the baseline AR model TransformerT2L by a notable margin benefiting from the mitigation of the error propagation problem; and 2) exhibits distinct superiority in inference speed (an average speedup of 19× than DualLip on TCD-TIMIT).

READ FULL TEXT

page 4

page 6

page 8

research
10/25/2020

Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

Fast inference speed is an important goal towards real-world deployment ...
research
09/09/2021

Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

This article describes an efficient end-to-end speech translation (E2E-S...
research
08/06/2020

FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire

Lipreading is an impressive technique and there has been a definite impr...
research
05/18/2020

MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

To speed up the inference of neural speech synthesis, non-autoregressive...
research
10/26/2020

TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis

In this paper, we propose a text-to-speech (TTS)-driven data augmentatio...
research
07/20/2022

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Generating sound effects that humans want is an important topic. However...
research
09/15/2023

Stack-and-Delay: a new codebook pattern for music generation

In language modeling based music generation, a generated waveform is rep...

Please sign up or login with your details

Forgot password? Click here to reset