On the Audio-visual Synchronization for Lip-to-Speech Synthesis

03/01/2023
by Zhe Niu, et al.

Most lip-to-speech (LTS) synthesis models are trained and evaluated under the assumption that the audio-video pairs in the dataset are perfectly synchronized. In this work, we show that commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can contain asynchronous audio-video pairs (data asynchrony). Training an LTS model on such datasets may in turn cause model asynchrony: the generated speech and the input video are out of sync. To address both issues, we propose a synchronized lip-to-speech (SLTS) model with an automatic synchronization mechanism (ASM) that corrects data asynchrony and penalizes model asynchrony. We further demonstrate the limitations of commonly adopted LTS evaluation metrics when the test data are asynchronous, and we introduce an audio alignment frontend that is applied before any metric sensitive to time alignment, for better evaluation. We compare our method with state-of-the-art approaches on both conventional and time-aligned metrics to show the benefits of synchronization training.
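The idea of aligning audio before computing a time-sensitive metric can be illustrated with a minimal sketch. The abstract does not describe how the alignment frontend works, so everything below is an assumption made for illustration: it supposes a single global offset between the reference and generated signals, estimates it by brute-force cross-correlation, and uses mean squared error as a stand-in for a time-sensitive metric. The function names `estimate_offset`, `align`, and `mse` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def estimate_offset(reference: Sequence[float],
                    generated: Sequence[float],
                    max_shift: int) -> int:
    """Return the shift s (in samples) that maximizes
    sum_t reference[t] * generated[t + s], searched over [-max_shift, max_shift].

    Assumption: the two signals differ by one global time offset.
    A positive result means `generated` lags behind `reference`.
    """
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = 0.0
        for t, r in enumerate(reference):
            j = t + shift
            if 0 <= j < len(generated):
                score += r * generated[j]
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


def align(reference: Sequence[float],
          generated: Sequence[float],
          max_shift: int) -> Tuple[List[float], List[float]]:
    """Trim both signals so that they overlap at the estimated offset."""
    shift = estimate_offset(reference, generated, max_shift)
    if shift >= 0:
        ref, gen = list(reference), list(generated[shift:])
    else:
        ref, gen = list(reference[-shift:]), list(generated)
    n = min(len(ref), len(gen))
    return ref[:n], gen[:n]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error over the common length: a simple time-sensitive metric."""
    n = min(len(a), len(b))
    return sum((x - y) ** 2 for x, y in zip(a[:n], b[:n])) / n


# Toy signals: `generated` is `reference` delayed by two samples.
reference = [0.0, 0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0, 0.0]
generated = [0.0, 0.0] + reference[:-2]

raw_error = mse(reference, generated)
aligned_ref, aligned_gen = align(reference, generated, max_shift=4)
aligned_error = mse(aligned_ref, aligned_gen)
```

In this toy case the unaligned error is nonzero even though the two signals have identical content, while the aligned error drops to zero. This is the kind of penalty that a pure time offset can impose on a time-sensitive metric. Real speech would call for a more efficient correlation (for example FFT-based) and a real speech-quality metric in place of `mse`.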
