High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

06/29/2023
by   Junchen Lu, et al.
0

The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have shown impressive results. However, the current AVO learning objective of acoustic feature reconstruction brings in indirect supervision for inter-modal alignment learning, thus limiting the synchronization performance and synthetic speech quality. To this end, we propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction, which not only provides more direct supervision for the alignment learning, but also alleviates the mismatch between the text-video context and acoustic features. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality by outperforming baselines in both objective and subjective evaluations. Code and speech samples are publicly available.

READ FULL TEXT
research
10/07/2021

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

In this paper, we formulate a novel task to synthesize speech in sync wi...
research
08/29/2023

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

The goal of this work is to reconstruct high quality speech from lip mot...
research
04/06/2022

Prosodic Alignment for off-screen automatic dubbing

The goal of automatic dubbing is to perform speech-to-speech translation...
research
03/24/2022

SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

We present a self-supervised speech restoration method without paired sp...
research
04/02/2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

The mainstream neural text-to-speech(TTS) pipeline is a cascade system, ...
research
08/12/2021

Mispronunciation Detection and Correction via Discrete Acoustic Units

Computer-Assisted Pronunciation Training (CAPT) plays an important role ...
research
08/02/2023

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis

While FastSpeech2 aims to integrate aspects of speech such as pitch, ene...

Please sign up or login with your details

Forgot password? Click here to reset