Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

08/29/2023
by Ji-Hoon Kim, et al.

The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge for lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to further address this problem, we employ a flow-based post-net that captures and refines the details of the generated speech. We perform extensive experiments and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in speech naturalness and intelligibility by a large margin. Synthesised samples are available at the anonymous demo page: https://mm.kaist.ac.kr/projects/LTBS.

