Dubbing is a post-production process of re-recording actors’ dialogues in a controlled environment (i.e., a sound studio), which is extensively used in filmmaking and video production. There are two common application scenarios for dubbing. The first is replacing previously recorded dialogue, because speech recorded on a noisy location often has poor sound quality, or because the scene itself is too challenging for recording high-quality audio. The second is replacing the actors’ voices in foreign-language films with those of other performers speaking the audience’s language. For example, an English video needs to be dubbed into Chinese if it is shown in China.
In this paper, we mainly focus on the first application scenario, also known as “automated dialogue replacement (ADR)”, in which a professional voice actor watches the original performance in the pre-recorded video and re-records each line to match the lip movement with proper prosody such as stress, intonation and rhythm, so that the speech is synchronized with the pre-recorded video. In this scenario, the lip motion (viseme) in the video is consistent with the given scripts (phoneme), and the pre-recorded high-definition video cannot be modified during the ADR process.
While dubbing is an impressive ability of professional voice actors, we aim to achieve this ability computationally. We name this novel task automatic video dubbing (AVD): synthesizing human speech that is temporally synchronized with the given video according to the corresponding text. The main challenges of the task are two-fold: (1) temporal synchronization between synthesized speech and video, i.e., the synthesized speech should be synchronized with the lip movement of the speaker in the given video; (2) the content of the speech should be consistent with the input text.
Text to speech (TTS) is a task closely related to dubbing, which aims at converting given texts into natural and intelligible speech. However, several limitations prevent TTS from being applied to the dubbing problem: 1) TTS is a one-to-many mapping problem (i.e., multiple speech variations can be spoken from the same text) Ren et al. (2021), so it is hard to control the variations (e.g., prosody, pitch and duration) in synthesized speech during generation; 2) with only text as input, TTS cannot utilize the visual information from the video to control speech synthesis, which greatly limits its applications in dubbing scenarios where the synthesized speech is required to be synchronized with the video.
We introduce Neural Dubber, the first model to solve the AVD task. Neural Dubber is a multi-modal speech synthesis model, which generates high-quality and lip-synced speech from the given text and video. In order to control the duration and prosody of synthesized speech, Neural Dubber works in a non-autoregressive way following Ren et al. (2021). The problem of length mismatch between phoneme sequence and mel-spectrograms sequence in non-autoregressive TTS is usually solved by up-sampling the phoneme sequence according to the predicted phoneme duration. Meanwhile, a phoneme duration predictor is needed, where the duration ground truth is usually obtained from another model Ren et al. (2019, 2021) or itself during training Kim et al. (2020). However, due to the natural correspondence between lip movement and text Chung et al. (2017b), we do not need to get phoneme duration target in advance like previous methods Ren et al. (2019, 2021); Kim et al. (2020). Instead, we use the text-video aligner which adopts an attention module between the video frames and phonemes, and then upsample the text-video context sequence according to the length ratio of mel-spectrograms sequence and video frame sequence. The text-video aligner not only solves the problem of length mismatch, but also allows the lip movement in the video to control the prosody of the generated speech explicitly by the attention between video frames and phonemes.
In the real dubbing scenario, voice actors need to alter the timbre and tone according to different performers in the video. In order to better simulate the real case in the AVD task, we propose the image-based speaker embedding (ISE) module, which aims to synthesize speech with different timbres conditioning on the speakers’ face in the multi-speaker setting. To the best of our knowledge, this is the first attempt to predict a speaker embedding from a face image with the goal of generating speech with a reasonable timbre that is consistent with the speaker’s facial features (e.g., gender and age). This is achieved by taking advantage of the natural co-occurrence of faces and speech in videos without the supervision of speaker identity. With ISE, Neural Dubber can synthesize speech with a reasonable timbre according to the speaker’s face. In other words, Neural Dubber can use different face images to control the timbre of the synthesized speech.
We conduct experiments on the chemistry lecture dataset from Lip2Wav Prajwal et al. (2020b) for the single-speaker AVD, and the LRS2 Afouras et al. (2018a) dataset for the multi-speaker AVD. The results of extensive quantitative and qualitative evaluations show that in terms of speech quality, Neural Dubber is on par with state-of-the-art TTS models Wang et al. (2017); Shen et al. (2018); Ren et al. (2021). Furthermore, Neural Dubber can synthesize speech temporally synchronized with the lip movement in video. In the multi-speaker setting, we demonstrate that the ISE enables Neural Dubber to generate speech with reasonable timbre based on the speaker’s face, resulting in Neural Dubber outperforming FastSpeech 2 by a big margin in terms of audio quality. We attach some audio files and video clips generated by our model on the project page.
2 Related Work
Text to Speech.
Text to Speech (TTS) Arık et al. (2017); Shen et al. (2018); Wang et al. (2017); Ren et al. (2019), which aims to synthesize intelligible, natural and high-quality speech from the input text, has seen tremendous progress in recent years. Specifically, the prevalent methods have shifted from concatenative synthesis Hunt and Black (1996) and parametric synthesis Wu et al. (2016) to end-to-end neural network-based synthesis Ping et al. (2019); Shen et al. (2018); Wang et al. (2017), where the quality of the synthesized speech is improved by a large margin and is close to that of the human counterpart. The general paradigm of end-to-end neural network-based methods is to first generate the acoustic feature (e.g., mel-spectrogram) from text using an encoder-decoder architecture, either autoregressively Wang et al. (2017); Shen et al. (2018); Ping et al. (2018); Li et al. (2019) or non-autoregressively Ren et al. (2019, 2021); Kim et al. (2020), and then reconstruct the waveform signal using a vocoder Griffin and Lim (1984); Oord et al. (2016); Prenger et al. (2019); Kumar et al. (2019a); Yamamoto et al. (2020). In recent multi-speaker TTS systems Jia et al. (2018), the speaker embedding is often extracted using a speaker verification system and is fed to the decoder of the TTS system in order to encourage the model to obtain a timbre inclination for the speaker of interest. Different from TTS, Neural Dubber is conditioned not only on texts but also on videos, intending to synthesize natural speech given both of them.
Talking Face Generation.
Talking face generation has a long history in computer vision, ranging from viseme-based models Edwards et al. (2016); Zhou et al. (2018) to neural synthesis of 2D Suwajanakorn et al. (2017); Thies et al. (2020); Wiles et al. (2018) or 3D Karras et al. (2017); Richard et al. (2021); Taylor et al. (2017) faces. Recently, neural synthesis approaches have been proposed to generate realistic 2D videos of talking heads. Concretely, Chung et al. (2017a) first generate lower face animation using cropped frontal images. Zhou et al. (2019) then further disentangle identity from speech using generative adversarial networks (GANs). Wav2Lip Prajwal et al. (2020a) explores the problem of visual dubbing, i.e., lip-syncing a talking head video of an arbitrary person to match a target speech segment. From our perspective, however, such methods cannot generate high-fidelity faces and lips given speech, so the results are of low resolution and sometimes look uncanny. Besides, the audio in most talking face pipelines needs to be prepared in advance; strictly speaking, this is therefore not dubbing (re-recording, see https://en.wikipedia.org/wiki/Dubbing_(filmmaking)), but face synchronization to given audio. In contrast to the aforementioned works, Neural Dubber does not require audio to be prepared beforehand and does not modify the lip motion, but generates speech audio synchronized with the video from scripts.
Lip to Speech Synthesis.
Given a video, the lip to speech task aims at synthesizing the corresponding speech audio directly from the lip motion. While the conventional method Kello and Plaut (2004) exploits the visual features extracted from active appearance models, recent end-to-end methods have also shed some light on it. In particular, Vid2Speech Ephrat and Peleg (2017) and Lipper Kumar et al. (2019b) generate low-dimensional linear predictive coding features to synthesize speech in constrained scenes. Vougioukas et al. (2020) use a GAN-based method for quality gains. Lip2Wav Prajwal et al. (2020b) has achieved promising results in real-life speaker-dependent scenarios, but it is still somewhat incongruous and prone to collapse in the multi-speaker setting. This is possibly because the word error rate in the lip reading task Afouras et al. (2018b); Assael et al. (2016); Chung et al. (2017b); Chung and Zisserman (2017) is still high, let alone in lip to speech synthesis. In Neural Dubber, the textual information is provided, allowing us to concentrate more on the alignment between the phonemes and the lip motion in the video, instead of decoding speech from lip motion directly.
In this section, we first introduce the novel automatic video dubbing (AVD) task; we then describe the overall architecture of our proposed Neural Dubber; finally we detail the main components in Neural Dubber.
3.1 Automatic Video Dubbing
Given a sentence and a corresponding video clip (without audio), the goal of automatic video dubbing (AVD) is to synthesize natural and intelligible speech whose content is consistent with the sentence and whose prosody is synchronized with the lip movement of the active speaker in the video. Compared to the traditional speech synthesis task, which only generates natural and intelligible speech given the sentence, the AVD task is more difficult due to the synchronization requirement.
3.2 Neural Dubber
3.2.1 Design Overview
Our Neural Dubber aims to solve the AVD task. Concretely, we formulate the problem as follows: given a phoneme sequence $S_p$ of length $T_p$ and a video frame sequence $S_v$ of length $T_v$, we need to predict a target mel-spectrograms sequence $S_m$ of length $T_m$.
The overall model architecture of Neural Dubber is shown in Figure 2. First, we apply a phoneme encoder and a video encoder to process the phonemes and images respectively. Note that the images we feed to the video encoder only contain the mouth region of the speaker, following Chung et al. (2017b); Petridis et al. (2018); Stafylakis and Tzimiropoulos (2017). We use $S_v^m$ to represent these images. After the encoding, raw phonemes turn into a sequence of hidden representations $\mathcal{H}_{pho}$, while images of the mouth region turn into a sequence of hidden representations $\mathcal{H}_{vid}$. Then we feed $\mathcal{H}_{pho}$ and $\mathcal{H}_{vid}$ into the text-video aligner (which will be described in Section 3.2.3) and get the expanded sequence $\mathcal{H}_{mel}$ with the same length as the target mel-spectrograms sequence. Meanwhile, a face image randomly selected from the video frames is input into the image-based speaker embedding (ISE) module (which will be described in Section 3.2.4) to generate an image-based speaker embedding (only used in the multi-speaker setting). We add $\mathcal{H}_{mel}$ and the ISE together and feed them into the variance adaptor to add some variance information (e.g., pitch and energy). Finally, we use the mel-spectrogram decoder to convert the adapted hidden sequence into the mel-spectrograms sequence, following Ren et al. (2019). Different from FastSpeech 2 Ren et al. (2021), our variance adaptor consists of pitch and energy predictors without a duration predictor, because we solve the problem of length mismatch between the phoneme and mel-spectrograms sequences in the text-video aligner, and the input of the variance adaptor is as long as the mel-spectrograms sequence.
3.2.2 Phoneme and Video Encoders
The phoneme encoder and video encoder are shown in Figure 1(a), which are enclosed in a dashed box. The function of the phoneme encoder and video encoder is to transform the original phoneme and image sequences into hidden representation sequences which contain high-level semantics. The phoneme encoder we use is similar to that in FastSpeech Ren et al. (2019), which consists of an embedding layer and N Feed-Forward Transformer (FFT) blocks. The video encoder consists of a feature extractor and K FFT blocks. The feature extractor is a CNN backbone that generates feature representation for every input mouth image. And then we use the FFT blocks to capture the dynamics of the mouth region because FFT is based on self-attention Vaswani et al. (2017) and 1D convolution where self-attention and 1D convolution are suitable for capturing long-term and short-term dynamics respectively.
3.2.3 Text-Video Aligner
The most challenging aspect of the AVD task is alignment: (1) the content of the generated speech should come from the input phonemes; (2) the prosody of the generated speech should be aligned with the video in time. So it does not make sense to produce speech solely from phonemes or solely from video. In our design, the text-video aligner (Figure 1(b)) aims to find the correspondence between text and lip movement first, so that synchronized speech can be generated in the later stage.
In the text-video aligner, an attention-based module learns the alignment between the phoneme sequence and the video frame sequence, and produces the text-video context sequence. Then an upsampling operation is performed to change the length of the text-video context sequence from the number of video frames $T_v$ to the number of mel-spectrogram frames $T_m$.
In practice, we adopt the popular Scaled Dot-Product Attention Vaswani et al. (2017) as the attention module, where the sequence of video hidden representations $\mathcal{H}_{vid}$ is used as the query, and the sequence of phoneme hidden representations $\mathcal{H}_{pho}$ is used as both the key and the value:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Attention}(\mathcal{H}_{vid}, \mathcal{H}_{pho}, \mathcal{H}_{pho}) = \mathrm{softmax}\left(\frac{\mathcal{H}_{vid}\,\mathcal{H}_{pho}^{\top}}{\sqrt{d}}\right)\mathcal{H}_{pho} = A\,\mathcal{H}_{pho},$$

where $d$ is the hidden dimension and $A$ is the matrix of attention weights. After the attention module, we get the text-video context sequence $\mathcal{H}_{con} = A\,\mathcal{H}_{pho}$, i.e., the expanded sequence of phoneme hidden representations obtained by linear combination. We use a residual connection He et al. (2016) to add $\mathcal{H}_{vid}$ for efficient training. However, we use a dropout layer with a large dropout rate on this path to prevent mel-spectrograms from being generated directly from visual information. The attention weight obtained after softmax is the main determinant of the speed and prosody of the synthesized speech, like the attention weight between spectrograms and phonemes in Wang et al. (2017); Shen et al. (2018); Li et al. (2019). The sequence of video hidden representations is used as the query, so the attention weight is controlled by the video explicitly, and the temporal alignment between video frames and phonemes is achieved. The obtained monotonic alignment between video frames and phonemes contributes to the synchronization between the synthesized speech and the video at a fine-grained (phoneme) level.
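As a rough illustration of the attention step described above, the following sketch computes scaled dot-product attention with video frames as queries and phonemes as keys and values. It is a minimal pure-Python sketch rather than the actual implementation: the function names `softmax` and `text_video_attention`, the list-of-lists representation and the toy dimensions are assumptions made for illustration, and the residual connection and dropout are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def text_video_attention(h_vid, h_pho):
    """Scaled dot-product attention with video frames as queries.

    h_vid: T_v x d list of video hidden vectors (queries).
    h_pho: T_p x d list of phoneme hidden vectors (keys and values).
    Returns (context, weights): context is T_v x d, weights is T_v x T_p.
    """
    d = len(h_pho[0])
    context, weights = [], []
    for query in h_vid:
        # Similarity of this video frame to every phoneme, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in h_pho
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Context vector: attention-weighted combination of phoneme vectors.
        context.append(
            [sum(a * value[c] for a, value in zip(attn, h_pho)) for c in range(d)]
        )
    return context, weights
```

Each row of the returned weight matrix belongs to one video frame and sums to 1, so the output contains exactly one context vector per video frame.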
There is a natural temporal correspondence between the speech audio and the video. In other words, once the alignment between video frames and phonemes is achieved, the alignment between mel-spectrogram frames and phonemes can be obtained. In practice, the length of a mel-spectrograms sequence is $n$ times that of a video frame sequence. We denote $n$ as

$$n = \frac{T_m}{T_v} = \frac{sr / hs}{FPS}, \tag{4}$$

where $T_m$ and $T_v$ are the numbers of mel-spectrogram frames and video frames, $sr$ denotes the sampling rate of the audio, $hs$ denotes the hop size set when transforming the raw waveform into mel-spectrograms, and $FPS$ denotes the video frame rate. We upsample the text-video context sequence to the expanded sequence $\mathcal{H}_{mel}$ with scale factor $n$. In practice, we use the upsampling method with nearest mode.
After that, the length of the text-video context sequence is expanded to that of the mel-spectrograms sequence. Thus, the problem of length mismatch between the phoneme and mel-spectrograms sequences is solved without the supervision of fine-grained alignment between phonemes and mel-spectrograms. Because of the attention between video frames and phonemes, the speed and part of the prosody of the synthesized speech are controlled by the input video explicitly, which makes the synthesized speech well synchronized with the input video.
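The length expansion can be sketched as follows, assuming the ratio between mel-spectrogram frames and video frames is an integer, so that nearest-mode upsampling reduces to repeating each context vector. The helper names `upsample_ratio` and `upsample_nearest` are hypothetical, and the values in the usage note are the sampling rate, hop size and frame rate reported in the experimental setup.

```python
def upsample_ratio(sample_rate, hop_size, fps):
    """Number of mel-spectrogram frames per video frame.

    Mel frames per second are sample_rate / hop_size; dividing by the
    video frame rate gives the ratio. This sketch only supports the
    case where the ratio is an integer.
    """
    if sample_rate % (hop_size * fps) != 0:
        raise ValueError("this sketch assumes an integer upsampling ratio")
    return sample_rate // (hop_size * fps)


def upsample_nearest(context, n):
    """Repeat every context vector n times (nearest mode, integer scale)."""
    return [list(frame) for frame in context for _ in range(n)]
```

With a 16 kHz sampling rate, a hop size of 160 samples and 25 FPS video, `upsample_ratio(16000, 160, 25)` gives 4, so a context sequence with one vector per video frame becomes four times as long.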
Monotonic Alignment Constraint
In the text to speech (TTS) task, the monotonic and diagonal alignments in the attention weights between text and speech are important to ensure the quality of synthesized speech Wang et al. (2017); Shen et al. (2018); Tachibana et al. (2018); Chen et al. (2020). In Neural Dubber, a multi-modal TTS model, the monotonic and diagonal alignments between video frames and phonemes are also critical. So we adopt a diagonal constraint on the attention weights to guide the text-video attention module to learn the right alignments, following Chen et al. (2020). We formulate the diagonal attention rate as

$$r = \frac{\sum_{t=1}^{T_p} \sum_{s=kt-b}^{kt+b} A_{s,t}}{T_v},$$

where $A \in \mathbb{R}^{T_v \times T_p}$ is the attention weight matrix between the $T_v$ video frames and the $T_p$ phonemes, $k = T_v / T_p$ is the slope of the diagonal, and $b$ is a hyperparameter for the bandwidth of the diagonal area. We add the diagonal constraint loss, which is defined as $\mathcal{L}_{dc} = -r$, to our final loss for better alignments.
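To make the diagonal constraint concrete, the sketch below computes a diagonal attention rate and the corresponding loss. It assumes the rate is the share of attention weight that falls within a band of half-width `bandwidth` (in video frames) around the diagonal with slope T_v / T_p, normalized by the number of video frames, and that the loss is the negated rate, in the spirit of Chen et al. (2020). The function names, the 1-based band indexing and the row layout (one row per video frame) are assumptions of this sketch.

```python
def diagonal_attention_rate(attn, bandwidth):
    """Share of attention mass inside a band around the diagonal.

    attn: T_v x T_p matrix; row s holds the weights of video frame s
          over all phonemes and is assumed to sum to 1.
    bandwidth: half-width of the diagonal band, in video frames.
    """
    t_v = len(attn)
    t_p = len(attn[0])
    slope = t_v / t_p  # video frames per phoneme along the diagonal
    inside = 0.0
    for t in range(t_p):
        center = slope * (t + 1)  # 1-based position, as in the formula
        for s in range(t_v):
            if center - bandwidth <= s + 1 <= center + bandwidth:
                inside += attn[s][t]
    # All rows together sum to T_v, so this normalizes the rate to [0, 1].
    return inside / t_v


def diagonal_constraint_loss(attn, bandwidth):
    """Negated rate: minimizing it pushes attention mass onto the diagonal."""
    return -diagonal_attention_rate(attn, bandwidth)
```

A perfectly diagonal alignment yields a rate of 1 and a loss of -1, while an alignment whose mass lies entirely outside the band yields a rate of 0.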
3.2.4 Image-based Speaker Embedding Module
How much can we infer about the way people speak from their appearances? In the real dubbing scenario, voice actors need to alter the timbre according to different performers. In order to better simulate the real case in the AVD task, we aim to synthesize speech with different timbres conditioned on the speakers’ faces in the multi-speaker setting. There have been many recent works Nagrani et al. (2018); Kim et al. (2018); Chung et al. (2020) researching the correlation between voices and speakers’ faces, but none of them learns joint speaker-face embeddings to solve the multi-speaker text to speech task. In this work, we propose the image-based speaker embedding (ISE) module (Figure 1(c)), a new multi-modal speaker embedding module that generates an embedding encapsulating the characteristics of the speaker’s voice from an image of his/her face. The ISE module is trained with the other components of Neural Dubber from scratch in a self-supervised manner, utilizing the natural co-occurrence of faces and speech audio in videos, but without the supervision of speaker identity. We randomly select a face image from the video frames, and obtain a high-level face feature by feeding the selected face image into a pre-trained and fixed face recognition network Parkhi et al. (2015); Cao et al. (2018). Then we feed the face feature to a trainable MLP and obtain the ISE. The predicted ISE is directly broadcast and added to the expanded sequence $\mathcal{H}_{mel}$ so as to control the timbre of the synthesized speech. Our model learns face-voice correlations which allow it to generate speech that coincides with various voice attributes of the speakers (e.g., gender and age) inferred from their faces.
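The data flow of the ISE module can be sketched as below: a face feature goes through a small MLP, and the resulting embedding is added to every frame of the expanded sequence. This is only an illustrative sketch. The two-layer MLP with a ReLU, the helper names and all dimensions are hypothetical, since the text only specifies that a trainable MLP maps the face feature to the ISE.

```python
def linear(x, weight, bias):
    """Dense layer: weight is out_dim x in_dim, bias has out_dim entries."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def image_speaker_embedding(face_feature, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping a face feature to an embedding."""
    return linear(relu(linear(face_feature, w1, b1)), w2, b2)


def add_speaker_embedding(h_mel, ise):
    """Broadcast-add the same embedding to every frame of the sequence."""
    return [[h + e for h, e in zip(frame, ise)] for frame in h_mel]
```

Because the same embedding is added to every frame, swapping the face image changes a global offset of the hidden sequence while leaving its temporal structure, and hence the alignment with the video, untouched.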
4 Experiments and Results
In the single-speaker setting, we conduct experiments on the chemistry lecture dataset from Lip2Wav Prajwal et al. (2020b). With a large vocabulary size and a lot of head movements, the dataset was originally used for unconstrained single-speaker lip to speech synthesis. To make it fit the AVD task, we collect the official transcripts from YouTube. We need corresponding sentence-level text and audio clips for training, so we segment the long videos into sentence-level clips according to the start and end timestamps of each sentence in the transcripts. Some segmented sentence-level video clips contain frames that only capture the PowerPoint slides but not the lecturer’s face, and cannot be used for training, so we conduct data cleaning to remove them. Finally, the dataset contains 6,640 samples, with a total video length of approximately 9 hours. We randomly split the dataset into 3 sets: 6,240 samples for training, 200 samples for validation, and 200 samples for testing. In the following subsections, we refer to this dataset as chem for short.
In the multi-speaker setting, we conduct experiments on the LRS2 Afouras et al. (2018a) dataset, which consists of thousands of sentences spoken by various speakers on BBC channels. This dataset suits the AVD task well, because each sample includes both the text and the video. Note that we only train on the training set of the LRS2 dataset, which contains only approximately 29 hours of data. Compared to other multi-speaker speech synthesis datasets Zen et al. (2019), this dataset is quite small for multi-speaker speech generation and does not provide the speaker identity for each sample. The ISE module aids Neural Dubber in solving these problems.
4.2 Data Pre-processing
The video frames are sampled at 25 FPS. We detect and crop the face from the video frames using the face detector of Zhang et al. (2017), following Prajwal et al. (2020b). The images input to the video encoder are resized to a fixed size and only cover the mouth region of the face, as shown in Figure 1(a). The face image input to the ISE module has a fixed size and covers the whole face of the speaker. In order to alleviate the mispronunciation problem, we convert the text sequences into phoneme sequences Arık et al. (2017); Li et al. (2019); Ren et al. (2021) with an open-source grapheme-to-phoneme tool. For the speech audio, we transform the raw waveform into mel-spectrograms following Shen et al. (2018). The frame size and hop size are set to 640 samples (40 ms) and 160 samples (10 ms) with respect to the sample rate of 16 kHz.
4.3 Model Configuration
Our Neural Dubber consists of 4 feed-forward Transformer (FFT) blocks Ren et al. (2019) in the phoneme encoder and the mel-spectrogram decoder, and 2 FFT blocks in the video encoder. The feature extractor in the video encoder is a ResNet18 He et al. (2016), except that the first 2D convolution layer is replaced by 3D convolutions Petridis et al. (2018). The variance adaptor contains a pitch predictor and an energy predictor. The configurations of the FFT block, the mel-spectrogram decoder, the pitch predictor and the energy predictor are the same as those in FastSpeech 2 Ren et al. (2021). In the text-video aligner, the hidden size of the scaled dot-product attention is set to 256, and the scale factor of the upsampling operation is set to 4 according to Equation (4). In the ISE module, the face feature extractor we use is a pre-trained and fixed ResNet50 trained on the VGGFace2 Cao et al. (2018) dataset. The face feature is a 4096-D feature that is extracted from the penultimate layer (i.e., one layer prior to the classification layer) of the network.
Since automatic video dubbing is a new task that we propose, no previous work has focused on solving it. So we propose a baseline model based on the Tacotron Wang et al. (2017) system, with some modifications that make it fit the new AVD task. We call this baseline model Video-based Tacotron. In order to make use of the information in the video, we concatenate the spectrogram frames with the corresponding hidden representations of video frames, and use the result as the decoder input. The hidden representation of video frames is obtained in the same way as in Neural Dubber (Section 3.2.1), and the ratio between spectrogram frames and video frames is the same as that in Equation (4). The Video-based Tacotron implementation is based on an open-source Tacotron repository (https://github.com/fatchord/WaveRNN), where the attention is replaced with the location-sensitive attention Chorowski et al. (2015) according to Shen et al. (2018) for better results. We set the reduction factor to 2 and change the vocoder to Parallel WaveGAN Yamamoto et al. (2020) for a fair comparison.
4.4 Training and Inference
We train Neural Dubber on 1 NVIDIA V100 GPU. We use the Adam optimizer Kingma and Ba (2014) and follow the same learning rate schedule as in Vaswani et al. (2017). Our model is optimized with a loss similar to that in Ren et al. (2021). We set the batch size to 18 and 24 on the chem dataset and the LRS2 dataset respectively. Training takes 200k/300k steps until convergence on the chem/LRS2 dataset. In this work, we use Parallel WaveGAN Yamamoto et al. (2020) as the vocoder to transform the generated mel-spectrograms into audio samples. We train two Parallel WaveGAN vocoders on the training sets of the chem dataset and the LRS2 dataset respectively, following an open-source implementation (https://github.com/kan-bayashi/ParallelWaveGAN). Each Parallel WaveGAN vocoder is trained on 1 NVIDIA V100 GPU for 1000k steps. In the inference process, the output mel-spectrograms of Neural Dubber are transformed into audio samples using the pre-trained Parallel WaveGAN.
Since the AVD task aims to synthesize human speech synchronized with the video from text, the audio quality and the audio-visual synchronization (av sync) are the important evaluation criteria.
We conduct the mean opinion score (MOS) Chu and Peng (2006) evaluation on the test set to measure the audio quality and the av sync. We randomly select 30 video clips from the test set, where each video clip is scored by at least 20 raters, who are all native English speakers. We overlay the synthesized speech on the original video before showing it to the rater, following Prajwal et al. (2020b). The text and the video are consistent among different systems, so that all raters only examine the audio quality and the av sync without other interference factors. For each video clip, the raters are asked to rate scores of 1-5 from bad to excellent (higher score indicates better quality) on the audio quality and the av sync, respectively. We perform the MOS evaluation on Amazon Mechanical Turk (MTurk).
In order to measure the synchronization between the generated speech and the video quantitatively, we use the pre-trained SyncNet Chung and Zisserman (2016), which is publicly available (https://github.com/joonson/syncnet_python), following Prajwal et al. (2020b). The method can explicitly test for synchronization between speech audio and lip movements in unconstrained videos in the wild Chung and Zisserman (2016); Prajwal et al. (2020a). We adopt two metrics: Lip Sync Error - Distance (LSE-D) and Lip Sync Error - Confidence (LSE-C) from Wav2Lip Prajwal et al. (2020a). The two metrics can be automatically calculated by the pre-trained SyncNet model. LSE-D denotes the minimal distance between the audio and the video features over different offset values. A lower LSE-D means the speech audio and video are more synchronized. LSE-C denotes the confidence that the audio and the video are synchronized at a certain time offset. A lower LSE-C means that some parts of the video are completely out of sync, where the audio and the video are uncorrelated, so a higher LSE-C is better.
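For intuition, the two metrics can be sketched from a table of average audio-video feature distances, one per candidate time offset. The sketch assumes that the confidence is the gap between the median and the minimum of these distances, which is how the public SyncNet code appears to compute it. The function name `lip_sync_scores` and the dictionary input format are hypothetical.

```python
import statistics


def lip_sync_scores(mean_dist_per_offset):
    """Derive LSE-D and LSE-C from per-offset mean feature distances.

    mean_dist_per_offset: dict mapping a candidate audio-video offset
    (in frames) to the mean distance between audio and video features.
    Returns (lse_d, lse_c, best_offset).
    """
    best_offset = min(mean_dist_per_offset, key=mean_dist_per_offset.get)
    lse_d = mean_dist_per_offset[best_offset]  # minimal distance (lower is better)
    # Assumed definition: how far the best offset stands out from a typical one.
    lse_c = statistics.median(mean_dist_per_offset.values()) - lse_d
    return lse_d, lse_c, best_offset
```

When no offset stands out, the minimum is close to the median and the confidence approaches zero, which matches the reading that a low LSE-C indicates uncorrelated audio and video.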
4.5.2 Single-speaker AVD
We first conduct MOS evaluation on the chem single-speaker dataset, to compare the audio quality and the av sync of the video clips generated by Neural Dubber with other systems, including 1) GT, the ground-truth video clips; 2) GT (Mel + PWG), where we first convert the ground-truth audio into mel-spectrograms, and then convert it back to audio using Parallel WaveGAN Yamamoto et al. (2020) (PWG); 3) FastSpeech 2 Ren et al. (2021) (Mel + PWG); 4) Video-based Tacotron (Mel + PWG). Note that the systems in 2), 3), 4) and Neural Dubber use the same pre-trained Parallel WaveGAN for a fair comparison. In addition, we compare Neural Dubber with those systems on the test set using the LSE-D and LSE-C metrics. The results for single-speaker AVD are shown in Table 1. It can be seen that Neural Dubber can surpass the Video-based Tacotron baseline and is on par with FastSpeech 2 in terms of audio quality, which demonstrates that Neural Dubber can synthesize high-quality speech. Furthermore, in terms of the av sync, Neural Dubber outperforms FastSpeech 2 and Video-based Tacotron by a big margin and matches GT (Mel + PWG) system in both qualitative and quantitative evaluations, which shows that Neural Dubber can control the prosody of speech and generate speech synchronized with the video. For FastSpeech 2 and Video-based Tacotron, the LSE-D is high and the LSE-C is low, indicating that they can not generate speech synchronized with the video.
| Method | Audio Quality | AV Sync | LSE-D | LSE-C |
| --- | --- | --- | --- | --- |
| GT | 3.93 ± 0.08 | 4.13 ± 0.07 | 6.926 | 7.711 |
| GT (Mel + PWG) | 3.83 ± 0.09 | 4.05 ± 0.07 | 7.384 | 6.806 |
| FastSpeech 2 Ren et al. (2021) (Mel + PWG) | 3.71 ± 0.08 | 3.29 ± 0.09 | 11.86 | 2.805 |
| Video-based Tacotron (Mel + PWG) | 3.55 ± 0.09 | 3.03 ± 0.10 | 11.79 | 2.231 |
| Neural Dubber (Mel + PWG) | 3.74 ± 0.08 | 3.91 ± 0.07 | 7.212 | 7.037 |

Table 1: The evaluation results for the single-speaker AVD. The subjective metrics for audio quality and av sync are reported with 95% confidence intervals.
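The subjective scores are reported with 95% confidence intervals. The exact procedure is not specified here, so the following is only a sketch of one common choice: the normal approximation, where the half-width of the interval is 1.96 times the standard error of the mean. The function name `mos_with_ci` and the example ratings are hypothetical.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Mean opinion score and half-width of its confidence interval.

    scores: individual ratings on the 1-5 scale (at least two of them).
    z: critical value; 1.96 corresponds to 95% under a normal approximation.
    """
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * std_err
```

For example, `mos_with_ci([3, 4, 5, 4, 4])` returns a mean of 4 together with the half-width that would be printed after the ± sign.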
We also show a qualitative comparison in Figure 4, which contains mel-spectrograms of the audio generated by the above systems. It shows that the prosody of the audio generated by Neural Dubber is close to that of the ground truth recording, i.e., well synchronized with the video.
In addition, we compare our method with another baseline Halperin et al. (2019) which automatically stretches and compresses the audio signal to match the lip movement given an unaligned face sequence and speech audio. We use the speech generated by FastSpeech 2 (Mel + PWG) system, and then align the pre-generated speech with the lip movement in video according to Halperin et al. (2019). However, the quality and naturalness of its synthesized speech is much worse than the pre-generated speech due to challenging alignments. So this baseline is not comparable to our Neural Dubber.
4.5.3 Multi-speaker AVD
Similar to Section 4.5.2, we conduct human evaluation and quantitative evaluation on the LRS2 multi-speaker dataset to compare Neural Dubber with other systems in the multi-speaker setting. Due to the failure of Video-based Tacotron in single-speaker AVD, we no longer compare our model with it. Note that we cannot add a trivial speaker embedding module to FastSpeech 2, because the LRS2 dataset does not contain the speaker identity for each video. So we directly train FastSpeech 2 on the LRS2 dataset without modifications. The results are shown in Table 2. We can see that Neural Dubber outperforms FastSpeech 2 by a significant margin in terms of audio quality, exhibiting the effectiveness of ISE in multi-speaker AVD. The qualitative and quantitative evaluations show that the speech synthesized by Neural Dubber is much better than that of FastSpeech 2 and is on par with the ground truth recordings in terms of synchronization. These results show that Neural Dubber can address multi-speaker AVD, which is more challenging than single-speaker AVD.
| Method | Audio Quality | AV Sync | LSE-D | LSE-C |
| --- | --- | --- | --- | --- |
| GT | 3.97 ± 0.09 | 3.81 ± 0.10 | 7.214 | 6.755 |
| GT (Mel + PWG) | 3.92 ± 0.09 | 3.69 ± 0.11 | 7.317 | 6.603 |
| FastSpeech 2 Ren et al. (2021) (Mel + PWG) | 3.15 ± 0.14 | 3.33 ± 0.10 | 10.17 | 3.714 |
| Neural Dubber (Mel + PWG) | 3.58 ± 0.13 | 3.62 ± 0.09 | 7.201 | 6.861 |
In order to demonstrate that ISE enables Neural Dubber to control the timbre through the input face image, we generate audio clips with Neural Dubber using the same phoneme sequence and mouth image sequence but different speaker face images as input. We select 12 males and 12 females from the test set of the LRS2 dataset for this evaluation. For each person, we choose 10 face images with different head postures, illumination, facial makeup, etc.
We visualize the speaker embeddings of these audio clips in Figure 4 by using a pre-trained speaker encoder Wan et al. (2018) from an open-source repository (https://github.com/resemble-ai/Resemblyzer). We first use the speaker (voice) encoder to derive a high-level representation, i.e., a 256-D embedding, from each audio clip, which summarizes the characteristics of the voice in the audio. Then we use t-SNE Van der Maaten and Hinton (2008) to visualize the generated embeddings. It can be seen that the utterances generated from the images of the same speaker form a tight cluster, and that the clusters representing different speakers are separated from each other. In addition, there is a distinctive discrepancy between the speech synthesized from the face images of different genders. We conclude that Neural Dubber can use the face image to alter the timbre of the generated speech.
4.5.4 Comparing with the Lip-motion Based Speech Generation Method
Recently, some works have demonstrated the impressive ability to generate speech directly from lip motion. However, the quality and intelligibility of the generated speech are relatively poor, and the word error rate (WER) is very high. In this section, we compare with Lip2Wav Prajwal et al. (2020b), a SOTA lip-motion based speech generation system. Because Lip2Wav can only generate word-level speech in the multi-speaker setting, we compare Neural Dubber with Lip2Wav only in the single-speaker setting, again on the chem dataset. We use the official GitHub repository to train Lip2Wav on our version of the chemistry lecture dataset. As mentioned in Section 4.1, this dataset differs from the original one used in Lip2Wav. It contains only approximately 9 hours of data, which is much less than the original one (approximately 24 hours). In this experiment, the training and testing sets of Neural Dubber and Lip2Wav are identical, so the results can be compared directly. Following the Lip2Wav paper Prajwal et al. (2020b), we use STOI and ESTOI to estimate intelligibility and PESQ to measure quality. In addition, we evaluate the word error rate (WER) of the generated speech using an out-of-the-box ASR system. To account for the errors of the ASR system itself, we also measure the WER of the ground truth speech audio. All these metrics are computed on the test set.
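For reference, WER is the word-level edit distance between the ASR transcript and the reference script, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name is hypothetical, and the text normalization (lowercasing and whitespace tokenization) is an assumption rather than the exact procedure used in our experiments.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.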
|Method||STOI||ESTOI||PESQ||WER|
|Neural Dubber (ours)||0.467||0.308||1.250||18.01%|
As the comparison results in Table 3 show, Neural Dubber surpasses Lip2Wav by a large margin in terms of speech quality and intelligibility. Note that the STOI, ESTOI, and PESQ scores of Lip2Wav are lower than those reported in Prajwal et al. (2020b), because we use much less training data. Most importantly, the WER of Neural Dubber (18.01%) is far lower than that of Lip2Wav (72.70%), which shows that Neural Dubber significantly outperforms Lip2Wav in pronunciation accuracy. A WER of 72.70% indicates that Lip2Wav mispronounces much of the content, which is unacceptable in the AVD task, just as it is unacceptable for an actor to constantly mispronounce the lines. Note that the WER we obtain for Lip2Wav is consistent with the results in Prajwal et al. (2020b) (see Table 5 therein). In summary, Neural Dubber far outperforms Lip2Wav in terms of speech intelligibility, quality, and pronunciation accuracy (WER), and is much more suitable for the AVD task.
5 Limitations and Societal Impact
When the script differs from what the speaker actually says in the video, our method can only handle modifications of a couple of words. In addition, the lip movement implied by the modified text should be similar to the original lip movement in the video. Due to dataset bias, the facial appearance may lead to timbre ambiguity, which might be offensive. On the positive side, our method can dub videos automatically, which may be useful for filmmaking and video production.
6 Conclusion
In this work, we introduce a novel task, automatic video dubbing (AVD), which aims to synthesize human speech synchronized with the given video from text. To solve the AVD task, we propose Neural Dubber, a multi-modal TTS model that generates lip-synced mel-spectrograms in parallel. We design several key components for Neural Dubber, including the video encoder, the text-video aligner, and the ISE module, to better solve the task. Our experimental results show that, in terms of speech quality, Neural Dubber is on par with FastSpeech 2 on the chem dataset and even outperforms FastSpeech 2 on the LRS2 dataset thanks to ISE. More importantly, Neural Dubber can synthesize speech that is temporally synchronized with the video.
References
- Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §4.1.
- Deep lip reading: a comparison of models and an online application. In Interspeech, Cited by: §2.
- Deep voice: real-time neural text-to-speech. In International Conference on Machine Learning, pp. 195–204. Cited by: §2, §4.2.
- Lipnet: end-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599. Cited by: §2.
- Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pp. 67–74. Cited by: §3.2.4, §4.3.
- Multispeech: multi-speaker text to speech with transformer. arXiv preprint arXiv:2006.04664. Cited by: §3.2.3.
- Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503. Cited by: §4.3.
- Objective measure for estimating mean opinion score of synthesized speech. Google Patents. Note: US Patent 7,024,362 Cited by: §4.5.1.
- You said that?. In British Machine Vision Conference, Cited by: §2.
- Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453. Cited by: §1, §2, §3.2.1.
- Out of time: automated lip sync in the wild. In Asian conference on computer vision, pp. 251–263. Cited by: §4.5.1.
- Lip reading in profile. In British Machine Vision Conference, Cited by: §2.
- Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision. arXiv preprint arXiv:2004.14326. Cited by: §3.2.4.
- JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG) 35 (4), pp. 1–11. Cited by: §2.
- Vid2speech: speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5095–5099. Cited by: §2.
- Signal estimation from modified short-time fourier transform. IEEE Transactions on acoustics, speech, and signal processing 32 (2), pp. 236–243. Cited by: §2.
- Dynamic temporal alignment of speech to lips. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3980–3984. Cited by: §4.5.2.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.3, §4.3.
- Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, pp. 373–376. Cited by: §2.
- Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, Cited by: §2.
- Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–12. Cited by: §2.
- A neural network model of the articulatory-acoustic forward mapping trained on recordings of articulatory parameters. The Journal of the Acoustical Society of America 116 (4), pp. 2354–2364. Cited by: §2.
- On learning associations of faces and voices. In Asian Conference on Computer Vision, pp. 276–292. Cited by: §3.2.4.
- Glow-tts: a generative flow for text-to-speech via monotonic alignment search. arXiv preprint arXiv:2005.11129. Cited by: §1, §2.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.4.
- Melgan: generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711. Cited by: §2.
- Lipper: synthesizing thy speech using multi-view lipreading. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2588–2595. Cited by: §2.
- Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6706–6713. Cited by: §2, §3.2.3, §4.2.
- Learnable pins: cross-modal embeddings for person identity. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 71–88. Cited by: §3.2.4.
- Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §2.
- Deep face recognition. Cited by: §3.2.4.
- End-to-end audiovisual speech recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6548–6552. Cited by: §3.2.1, §4.3.
- Clarinet: parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations, Cited by: §2.
- Deep voice 3: 2000-speaker neural text-to-speech. Proc. ICLR, pp. 214–217. Cited by: §2.
- A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 484–492. Cited by: §2, §4.5.1.
- Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13796–13805. Cited by: §1, §2, §4.1, §4.2, §4.5.1, §4.5.1, §4.5.4, §4.5.4.
- Waveglow: a flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. Cited by: §2.
- Fastspeech 2: fast and high-quality end-to-end text to speech. In International Conference on Learning Representations, Cited by: §1, §1, §1, §2, §3.2.1, §4.2, §4.3, §4.4, §4.5.2, Table 1, Table 2.
- Fastspeech: fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §3.2.1, §3.2.2, §4.3.
- MeshTalk: 3d face animation from speech using cross-modality disentanglement. arXiv preprint arXiv:2104.08223. Cited by: §2.
- Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §1, §2, §3.2.3, §3.2.3, §4.2, §4.3.
- Combining residual networks with lstms for lipreading. arXiv preprint arXiv:1703.04105. Cited by: §3.2.1.
- Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–13. Cited by: §2.
- Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4784–4788. Cited by: §3.2.3.
- A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–11. Cited by: §2.
- Neural voice puppetry: audio-driven facial reenactment. In Proceedings of the European conference on computer vision (ECCV), pp. 716–731. Cited by: §2.
- Visualizing data using t-sne. Journal of machine learning research 9 (11). Cited by: §4.5.3.
- Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §3.2.2, §3.2.3, §4.4.
- Realistic speech-driven facial animation with gans. International Journal of Computer Vision 128 (5), pp. 1398–1413. Cited by: §2.
- Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883. Cited by: §4.5.3.
- Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135. Cited by: §1, §2, §3.2.3, §3.2.3, §4.3.
- X2face: a network for controlling face generation using images, audio, and pose codes. In Proceedings of the European conference on computer vision (ECCV), pp. 670–686. Cited by: §2.
- Merlin: an open source neural network speech synthesis system. In SSW, pp. 202–207. Cited by: §2.
- Parallel wavegan: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. Cited by: §2, §4.3, §4.4, §4.5.2.
- LibriTTS: a corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882. Cited by: §4.1.
- S3fd: single shot scale-invariant face detector. In Proceedings of the IEEE international conference on computer vision, pp. 192–201. Cited by: §4.2.
- Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9299–9306. Cited by: §2.
- Visemenet: audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–10. Cited by: §2.