Neural Dubber: Dubbing for Videos According to Scripts

10/15/2021
by Chenxu Hu, et al.

Dubbing is a post-production process of re-recording actors' dialogue, and it is used extensively in filmmaking and video production. It is usually performed manually by professional voice actors, who read their lines with proper prosody and in synchronization with the pre-recorded video. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech from text so that it is synchronized with a given video. Neural Dubber is a multi-modal text-to-speech (TTS) model that uses the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre based on the speaker's face. Experiments on a single-speaker chemistry lecture dataset and the multi-speaker LRS2 dataset show that Neural Dubber can generate speech on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of the synthesized speech through the video and can generate high-fidelity speech that is temporally synchronized with the video.
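
The abstract does not spell out how the video controls prosody, so the following is only a toy sketch of one plausible mechanism, not the authors' implementation. It assumes that per-frame lip features act as attention queries over phoneme-level text states, and that the result is stretched by a fixed ratio of mel-spectrogram frames per video frame. The names `attend`, `upsample`, and `MEL_PER_VIDEO_FRAME`, and all feature values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(video_feats, text_feats):
    """Scaled dot-product attention (assumed mechanism, not from the source).

    Each video frame is a query; the text states serve as keys and values.
    The output has one vector per video frame, so the text content is
    laid out on the video's timeline.
    """
    dim = len(text_feats[0])
    aligned = []
    for query in video_feats:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_feats
        ]
        weights = softmax(scores)
        aligned.append(
            [sum(w * val[j] for w, val in zip(weights, text_feats))
             for j in range(dim)]
        )
    return aligned


def upsample(frames, factor):
    """Repeat each video-rate vector `factor` times to reach mel-frame rate."""
    return [list(frame) for frame in frames for _ in range(factor)]


# Hypothetical toy inputs: 3 phoneme states and 4 video frames, 2-dim each.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_feats = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [2.0, 2.0]]
MEL_PER_VIDEO_FRAME = 4  # hypothetical ratio of mel frames to video frames

aligned = attend(video_feats, text_feats)
mel_inputs = upsample(aligned, MEL_PER_VIDEO_FRAME)
print(len(aligned), len(mel_inputs))  # 4 16
```

In this sketch the output length depends only on the number of video frames, never on the text length. That property is what would let a decoder produce speech whose duration matches the video.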


