V2C: Visual Voice Cloning

Existing Voice Cloning (VC) tasks aim to convert a paragraph of text to speech with a desired voice specified by a reference audio. This has significantly boosted the development of artificial speech applications. However, there are many scenarios that these VC tasks cannot reflect well, such as movie dubbing, which requires the generated speech to carry emotions consistent with the movie plot. To fill this gap, in this work we propose a new task named Visual Voice Cloning (V2C), which seeks to convert a paragraph of text to speech with both the desired voice specified by a reference audio and the desired emotion specified by a reference video. To facilitate research in this field, we construct a dataset, V2C-Animation, and propose a strong baseline based on existing state-of-the-art (SoTA) VC techniques. Our dataset contains 10,217 animated movie clips covering a large variety of genres (e.g., Comedy, Fantasy) and emotions (e.g., happy, sad). We further design an evaluation metric, named MCD-DTW-SL, which helps evaluate the similarity between ground-truth speeches and the synthesised ones. Extensive experimental results show that even SoTA VC methods cannot generate satisfactory speech for our V2C task. We hope the proposed new task, together with the constructed dataset and evaluation metric, will facilitate research in the field of voice cloning and the broader vision-and-language community.

1 Introduction

Figure 1: (a) Voice Cloning (VC) vs. (b) Visual Voice Cloning (V2C). Given an input triplet (i.e., subtitle/text, reference audio, and target video), our V2C task seeks to convert the text into a speech, which should carry the voice of the reference audio and the emotion derived from the reference video. Note that the reference audio only provides the expected voice; its content is irrelevant.

Voice Cloning (VC) [Arik2018NeuralVC, Chen2019SampleEA, Jia2018TransferLF, nachmani2018fitting] aims to convert a paragraph of text to speech with the desired voice from a reference audio. However, many real-world applications require the generated speech not only to use a template voice but also to carry rich emotions (e.g., angry, happy, and sad); movie dubbing is one such application. This is beyond the scope of conventional VC tasks (Figure 1(a)), as no extra guiding information is available to generate the desired tones and rhythms. Considering that humans accomplish movie dubbing mainly by referring to visual observations (e.g., watching the movie to grasp the emotion of the characters), we propose an extension of VC, namely Visual Voice Cloning (V2C).

An example of the proposed V2C task is shown in Figure 1(b). Unlike the conventional VC task, which converts text to speech aided only by a reference audio, our V2C task takes a triplet (text/subtitle, reference audio, reference video) as input and expects a resulting speech with the same voice as the reference audio but with the emotion derived from the reference video. The text/subtitle is the content that the generated speech needs to cover. The reference audio includes a pre-recorded voice of the target speaker from a different clip. We aim to generate a speech that uses the voice of the reference audio, conveys the character's visual emotion from the reference video, and speaks the content of the given text.

The new task poses several novel challenges. First, conventional Voice Cloning (VC) methods [Arik2018NeuralVC, blaauw2019data, Chen2019SampleEA, Jia2018TransferLF, nachmani2018fitting] cannot solve the V2C task well, as they only focus on how to convert the input text to speech with the voice/tone exhibited in the reference audio, without considering the emotion and context of the new speech. However, in our V2C task (e.g., movie dubbing), the voice emotion is crucial for generating human-like speech. Second, in our V2C task, the voice emotion should be derived from the reference video rather than from the reference audio, which comes from an irrelevant clip. Taking movie dubbing as an example, it requires humans to grasp the emotions of characters by watching the corresponding movie clips and observing their performances (e.g., facial expressions or actions). Although several improved VC methods [skerry2018towards, wang2018style] also try to inject voice emotion into their generated speech, they capture both emotion and voice from the reference audio, which cannot satisfy the requirements of the V2C task. In our V2C task, an ideal method should be able to disentangle voice and emotion from the reference audios and the reference videos, respectively.

As there is no off-the-shelf dataset suitable for the V2C task, we collect the first V2C-Animation dataset to facilitate research in this field. The V2C-Animation dataset comprises 10,217 video clips with audios and subtitles, covering 26 animated movies with 153 characters (i.e., speakers) in total. Our V2C dataset covers three modalities (i.e., text, audio and video), unlike existing text-to-speech datasets [ljspeech17, panayotov2015librispeech, yamagishi2019cstr, zen2019libritts] or movie description datasets [rohrbach2015dataset, tapaswi2016movieqa], which only cover text and audio, or text and video. Besides, we also provide an emotion annotation (e.g., happy or sad) for each audio and video clip, following the emotion categories of [goodfellow2013challenges]. To alleviate the impact of the background music, we only extract the sound channel of the centre speaker, which mainly carries the sound of the speaking character. In this way, we ensure that the audio clips only contain the sound of the speaking characters.

To address the above challenges of V2C task, based on the widely used Text-to-Speech (TTS) framework (i.e., FastSpeech2 [ren2020fastspeech]), we propose a new method called Visual Voice Cloning Network (V2C-Net), considering the emotion information derived from the reference video frames. Moreover, based on MCD [kubichek1993mel], we design an evaluation metric, called MCD-DTW weighted by Speech Length (MCD-DTW-SL), seeking to evaluate the generated speech effectively and automatically.

In summary, our contributions include:

  • We propose a new task, namely Visual Voice Cloning (V2C). Given a triplet (i.e., text/subtitle, reference audio and reference video), the task seeks to convert the text into a speech with voice and emotion derived from reference audio and reference video, respectively.

  • We collect the first V2C-Animation dataset, which consists of 26 animated movies, 153 characters, 10,217 video clips with the aligned audios and subtitles, covering three modalities (i.e., text, audio, video) and speakers’ emotion.

  • We design a new method, called Visual Voice Cloning Network (V2C-Net). Besides, to evaluate the generated speech automatically, we provide an advanced automatic evaluation metric, named MCD-DTW-SL.

2 Related Work

As V2C is a new task, we briefly review several closely related works in the fields of Text to Speech, Voice Cloning, and Prosody Transfer.

Text to Speech. Many text-to-speech (TTS) synthesis methods [arik2017deep, kalchbrenner2018efficient, Wang2017TacotronTE, li2019neural, chen2021adaspeech, yan2021adaspeech] have been proposed to generate natural speech from text. For example, based on WaveNet, Deep Voice [arik2017deep] divides a TTS model into several modules, which are optimised independently. Wang et al. [Wang2017TacotronTE] propose a new framework, Tacotron, which integrates all the necessary stages of text-to-speech synthesis and enables the speech synthesis model to be optimised in an end-to-end manner. Recently, TransformerTTS [li2019neural] introduces the transformer structure [vaswani2017attention] into the TTS task, while Ren et al. [ren2019fastspeech] propose a more efficient transformer (i.e., FastSpeech) by using a non-autoregressive generation method. Based on FastSpeech, they further design an improved FastSpeech2 [ren2020fastspeech], which seeks to control the generated speech by adjusting pitch and energy. However, the TTS task mainly focuses on how to convert natural language text to speech with correct pronunciation. Instead, our V2C task additionally requires the generated speech to carry a suitable voice emotion and tone.

Voice Cloning. Unlike TTS methods, which synthesise speech with only a single voice, the voice cloning (VC) task [Gibiansky2017DeepV2, Ping2018DeepV3, Taigman2018VoiceLoopVF] seeks to generate speeches with different voices. Based on Deep Voice [arik2017deep] and Tacotron [Wang2017TacotronTE], Deep Voice 2 [Gibiansky2017DeepV2] maps the voices of different speakers into a common space and uses the low-dimensional embedding from this space as a condition to aid the generation process. Jia et al. [Jia2018TransferLF] propose a multi-speaker TTS framework consisting of three sub-modules (i.e., an encoder, a synthesizer and a vocoder), which is able to synthesise high-quality speech from a given text. More recent extensions [Arik2018NeuralVC, blaauw2019data, Chen2019SampleEA, nachmani2018fitting] focus on synthesising the voice of an unseen person using only a few samples. Specifically, to synthesise a person's voice from only a few audio samples, Arik et al. [Arik2018NeuralVC] study two approaches: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a trained multi-speaker model for an unseen speaker using a few audio-text pairs, while speaker encoding directly estimates the speaker embedding from audios of an unseen speaker. Chen et al. [Chen2019SampleEA] propose an adaptive TTS system using a meta-learning approach. Different from the VC task, our V2C additionally requires the prosody/tone of the generated speech to match the reference video.

Prosody Transfer. To produce realistic speech, prosody transfer (PT) [bian2019multi, Hsu2019HierarchicalGM, skerry2018towards, stanton2018predicting, Valle2020MellotronME, wang2018style, Whitehill2020MultiReferenceNT] seeks to grasp prosody from reference audios. Specifically, extending Tacotron [Wang2017TacotronTE], Skerry-Ryan et al. [skerry2018towards] propose an encoder architecture that learns a representation of prosody from reference spectrogram slices derived from the reference audio. Global Style Tokens (GSTs) [wang2018style] model the styles of different speakers using an interpretable embedding, which can be used as a condition when transferring different speaking styles. Based on the variational autoencoder (VAE) framework [Kingma2014AutoEncodingVB], Hsu et al. [Hsu2019HierarchicalGM] design a neural sequence-to-sequence TTS model, which categorises speaking styles into several latent attributes and hence controls the speaking style by adapting these attributes. To transfer a speaking style that is under-represented in the dataset, Whitehill et al. [Whitehill2020MultiReferenceNT] propose an adversarial cycle-consistent training procedure for a multi-reference neural TTS system. Overall, the goal of prosody transfer is to capture both the voice and emotion from the reference audio, and the task is therefore defined without using information from the visual side. By contrast, the V2C task is proposed to infer the voice emotion from a reference video, which has many real-world applications such as movie dubbing.

3 V2C Task and V2C-Animation Dataset

3.1 Problem Definition for V2C Task

Given a triplet (i.e., text, reference audio and reference video), our Visual Voice Cloning (V2C) task aims to generate a speech (i.e., a waveform in the time domain) from the text, which should use the voice of the reference audio and carry the emotion derived from the reference video. In Figure 1, we take movie dubbing as an example. Given a movie clip (i.e., the reference video), a subtitle (i.e., the text) and a reference audio, we seek to synthesise a speech from the subtitle according to both the character's emotion derived from the movie clip and the voice from the reference audio.

3.2 Dataset Construction for V2C Task

The dataset for the V2C task should cover all three modalities, and the samples from different modalities need to be aligned with each other. As there is no off-the-shelf dataset suitable for this new task, we collect the first V2C dataset, called V2C-Animation.

Data Collection. We search for Blu-ray animated movies with corresponding subtitles and select a set of 26 movies of diverse genres. Specifically, we first cut the movies into a series of video clips according to the subtitle files. Here, we use SRT-format subtitle files. In addition to the subtitles/texts, an SRT file contains starting and ending time-stamps, which ensure that the subtitles match the video and audio, and a sequential number for each subtitle (e.g., No. 1340 in Figure 2), which indicates the index of each video clip. Based on the SRT file, we cut the movie into a series of video clips using the FFmpeg toolkit [tomar2006converting] (an automatic audio and video processing toolkit) and then extract the audio from each video clip with FFmpeg as well. Note that, to alleviate the impact of the background music, we only extract the sound channel of the centre speaker, which mainly carries the sound of the speaking character.
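This preprocessing can be scripted directly from the SRT files. Below is a minimal sketch of such a pipeline, assuming hypothetical file names; it parses the SRT time-stamps with a regular expression and calls FFmpeg once to cut each video clip and once to keep only the front-centre (FC) audio channel.

```python
import re
import subprocess
from pathlib import Path

SRT_BLOCK = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(srt_path):
    """Yield (index, start, end, text) for every subtitle block in an SRT file."""
    raw = Path(srt_path).read_text(encoding="utf-8", errors="ignore")
    for idx, start, end, text in SRT_BLOCK.findall(raw):
        yield int(idx), start.replace(",", "."), end.replace(",", "."), " ".join(text.split())

def cut_clip(movie, idx, start, end, out_dir):
    """Cut one video clip and extract the centre (FC) audio channel with FFmpeg."""
    out_dir = Path(out_dir)
    clip = out_dir / f"{idx:05d}.mp4"
    wav = out_dir / f"{idx:05d}.wav"
    # Video clip between the SRT time-stamps (re-encoding keeps the cut frame-accurate).
    subprocess.run(["ffmpeg", "-y", "-i", movie, "-ss", start, "-to", end, str(clip)], check=True)
    # Keep only the front-centre channel, which mainly carries the speaking character.
    subprocess.run(["ffmpeg", "-y", "-i", movie, "-ss", start, "-to", end,
                    "-vn", "-af", "pan=mono|c0=FC", "-ar", "22050", str(wav)], check=True)
    return clip, wav

# Hypothetical usage on one movie and its subtitle file.
for idx, start, end, text in parse_srt("movie_0001.srt"):
    cut_clip("movie_0001.mkv", idx, start, end, "clips/")
```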

Figure 2: An example of how to cut a movie into a series of video clips according to SRT subtitle files. Note that the SRT files contain both starting and ending time-stamps for each video clip. No. 1340 refers to the sequential number of the current utterance.
Dataset | Text | Audio | Video | Identity | Emotion | #Movies | #Video Clips | #Audio Clips | #Speakers | Avg. S | Avg. A/V (s)
LJ Speech [ljspeech17] | ✓ | ✓ | ✗ | ✗ | ✗ | - | - | 13100 | 1 | 17.23 | 6.57
LibriSpeech [panayotov2015librispeech] | ✓ | ✓ | ✗ | ✓ | ✗ | - | - | 250698 | 2484 | 32.55 | 14.10
VCTK [yamagishi2019cstr] | ✓ | ✓ | ✗ | ✓ | ✗ | - | - | 44070 | 108 | 7.41 | 3.59
LibriTTS [zen2019libritts] | ✓ | ✓ | ✗ | ✓ | ✗ | - | - | 375086 | 2456 | 16.86 | 5.62
MPII-MD [rohrbach2015dataset] | ✓ | ✗ | ✓ | ✗ | ✗ | 94 | 68337 | - | - | - | 3.88
MovieQA [tapaswi2016movieqa] | ✓ | ✗ | ✓ | ✗ | ✗ | 140 | 6771 | - | - | 6.20 | 202.67
LRS2-main [chung2017lip] | ✓ | ✓ | ✓ | ✗ | ✗ | - | 48164 | 48164 | - | 7.13 | 12
LRS2-pretrain [chung2017lip] | ✓ | ✓ | ✓ | ✗ | ✗ | - | 96318 | 96318 | - | 21.43 | 10
V2C-Animation (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | 26 | 10217 | 10217 | 153 | 6.51 | 2.40
Table 1: We compare our V2C-Animation dataset with several existing multi-modal datasets. “Identity” and “Emotion” indicate whether the datasets contain annotations corresponding to the speaker’s identity and emotion. The notations “#Movies”, “#Video Clips”, “#Audio Clips” and “#Speakers” refer to the number of movies, videos, audios and speakers/characters, respectively. “Avg. S” indicates the average length of subtitle while “Avg. A/V” is the average duration of audio/video.

Data Annotation and Organisation. Inspired by the organisation of the LibriSpeech dataset [panayotov2015librispeech], we categorise the obtained video clips, audios and subtitles into their corresponding characters (i.e., speakers) via a crowd-sourced service. To ensure that the characters appearing in the video clips are the same as the speaking ones, we manually remove data examples that do not satisfy this requirement. Then, following the categories of FER-2013 [goodfellow2013challenges] (a dataset for human facial expression recognition), we divide the collected video/audio clips into 8 emotion types, including angry, happy, sad, etc. In this way, we collect a dataset of 10,217 video clips with paired audios and subtitles in total. All of the annotations, the time-stamps of the mined movie clips and a tool to extract the triplet data will be released. We randomly split the samples into training, validation and testing sets.
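A lightweight way to organise the resulting clips is a manifest that ties each video/audio pair to its subtitle, speaker and emotion label. The sketch below assumes hypothetical field names and directory layout coming from the crowd-sourced annotation; it illustrates the organisation rather than the exact annotation tooling.

```python
import json
from pathlib import Path

# FER-2013-style emotion categories used for the clip-level annotation.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise", "others"]

def build_manifest(ann_rows, out_path="v2c_manifest.json"):
    """ann_rows: iterable of dicts from the annotation, e.g. (hypothetical fields)
    {"movie": "Frozen", "clip": "01340", "speaker": "Elsa",
     "emotion": "happy", "text": "Let it go."}"""
    manifest = []
    for row in ann_rows:
        assert row["emotion"] in EMOTIONS
        clip_dir = Path("clips") / row["movie"]
        manifest.append({
            "video": str(clip_dir / f"{row['clip']}.mp4"),
            "audio": str(clip_dir / f"{row['clip']}.wav"),
            "text": row["text"],
            "speaker": row["speaker"],
            "emotion": row["emotion"],
        })
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```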

3.3 V2C-Animation Dataset vs. Related Datasets

We compare our V2C-Animation dataset against datasets used for the VC, TTS, and PT tasks. In addition, we consider several Movie Description (MD) datasets and Lip Reading Sentence (LRS) datasets, which contain both video and text like ours. Specifically, the VC/TTS/PT datasets include LJ Speech [ljspeech17], LibriSpeech [panayotov2015librispeech], VCTK [yamagishi2019cstr] and LibriTTS [zen2019libritts]; the MD datasets involve MPII-MD [rohrbach2015dataset] and MovieQA [tapaswi2016movieqa]; and the LRS datasets contain LRS2 [chung2017lip] (the LRS2 dataset has two subsets, LRS2-main and LRS2-pretrain; in LRS2-pretrain the utterance of each video may contain multiple sentences, whereas each video corresponds to a single sentence in LRS2-main, and there is some overlap between the two subsets). The statistics in Table 1 show that our V2C-Animation dataset is unique in covering all three modalities (i.e., text, audio and video) with both identity and emotion annotations, while most of the others only cover two modalities and all of them lack emotion annotations. More dataset statistics can be found in the supplementary.

To further compare our V2C-Animation dataset with related datasets, following [skerry2018towards], we visualise the pitch tracks of samples from our dataset and others. Specifically, we randomly select an audio sample from LJ Speech, LibriSpeech and LibriTTS, respectively. Due to the varying lengths of the audios, for a fair comparison, we crop two seconds of audio from each compared sample. As shown in Figure 3, the audio pitches from the existing datasets are smoother and their frequency (Hz) ranges are narrower than ours. Moreover, we provide the average and variance values of the pitch tracks. Table 2 shows that the variance on our V2C-Animation is the largest, which further demonstrates that our proposed dataset has a wider frequency (Hz) range. Both the visual and statistical results demonstrate that our V2C-Animation dataset is more challenging due to its varied prosody.

Figure 3: Examples from the existing Text-to-Speech (TTS) datasets (i.e., LJ Speech, LibriTTS and LibriSpeech) and our V2C-Animation dataset. A pitch of 0 Hz refers to an unvoiced segment.
Dataset | LJ Speech | LibriSpeech | LibriTTS | V2C-Animation (Ours)
Avg. P (Hz) | 127.27 (var. 11800.96) | 88.15 (var. 7313.39) | 93.97 (var. 9295.67) | 117.99 (var. 16910.77)
Table 2: We compare the average and variance of pitch from our V2C-Animation dataset and the related datasets. “Avg. P” refers to the average values of pitches with the corresponding variances.
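The pitch statistics above can be reproduced approximately with an off-the-shelf F0 tracker. The sketch below uses librosa's pYIN implementation and hypothetical file names; unvoiced frames are discarded before computing the mean and variance.

```python
import numpy as np
import librosa

def pitch_stats(wav_path, fmin=65.0, fmax=2093.0, sr=22050):
    """Return (mean, variance) of the voiced F0 values of one audio clip."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    voiced = f0[voiced_flag & ~np.isnan(f0)]   # NaN / 0 Hz frames are unvoiced
    return float(np.mean(voiced)), float(np.var(voiced))

# Hypothetical comparison over a handful of clips from each dataset.
for name, wavs in {"LJSpeech": ["lj_001.wav"], "V2C-Animation": ["frozen_01340.wav"]}.items():
    stats = [pitch_stats(w) for w in wavs]
    print(name, np.mean([m for m, _ in stats]), np.mean([v for _, v in stats]))
```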

4 Visual Voice Cloning Network (V2C-Net)

Figure 4: Overview of V2C-Net. It consists of three main components: a multi-modal encoder, a synthesizer and a vocoder. A triplet (i.e., text, reference audio and reference video) is fed into the multi-modal encoder (Sec. 4.1) and it outputs three types of embeddings. Based on these embeddings, the synthesizer (Sec. 4.2) generates a mel-spectrogram. Finally, the mel-spectrogram is converted into the waveform (i.e., speech) by the vocoder (Sec. 4.3).

For the V2C task, we propose a baseline model, called Visual Voice Cloning Network (V2C-Net), which is based on the widely used TTS framework FastSpeech2 [ren2020fastspeech]. As shown in Figure 4, our model contains three main components: a multi-modal encoder, a synthesizer and a vocoder. We feed a triplet (i.e., text, reference audio, and reference video) into the encoder, which outputs three types of features (i.e., phoneme, speaker, and emotion embeddings). Based on these features, we use the synthesizer to generate a mel-spectrogram (see Figure 4, right side), which is a time-frequency representation of the audio signal. Last, we convert the generated mel-spectrogram into a waveform (i.e., speech, see Figure 4, bottom right) by the vocoder.

4.1 Multi-modal Encoder for Feature Extraction

Given a triplet $(x_t, x_a, x_v)$, the output feature/embedding from the multi-modal encoder is

$\mathcal{O} = \{o_i\}_{i=1}^{n} = \Phi(x_t, x_a, x_v),$   (1)

where $n$ is the length of the input sentence (i.e., the number of phonemes) and $\Phi$ is the multi-modal encoder, mainly containing three sub-modules: a text encoder $\phi_t$, a speaker encoder $\phi_a$, and an emotion encoder $\phi_v$. Here, $x_t$, $x_a$ and $x_v$ indicate the input text, reference audio and reference video, respectively. We obtain the $i$-th output feature $o_i = e_i \oplus u \oplus v$, where $e_i$ is the embedding of the $i$-th phoneme derived from the text $x_t$. The embeddings $u$ and $v$ are the outputs of $\phi_a$ and $\phi_v$, respectively. The notation $\oplus$ indicates element-wise addition.

Text encoder. Following the structure of FastSpeech2, we take 4 Feed-Forward Transformer (FFT) blocks [ren2019fastspeech] as our text encoder $\phi_t$. Based on this text encoder, we produce a series of phoneme embeddings $\mathcal{E} = \{e_i\}_{i=1}^{n}$ from an input text $x_t$. Mathematically, the process can be defined as $\mathcal{E} = \phi_t(x_t)$.

Speaker encoder. To explore the voice characteristics of different speakers, we adopt a speaker encoder $\phi_a$, which has the same architecture as [wan2018generalized], comprising 3 LSTM layers and a linear layer. The speaker encoder first converts a sequence of mel-spectrogram frames, derived from the reference audio, into a series of hidden embeddings by the LSTMs, and then maps the last hidden embedding to a fixed-dimensional vector via the linear layer. For convenience, we define the process as $u = \phi_a(\mathrm{Mel}(x_a))$, where $u$ refers to the speaker embedding and $\mathrm{Mel}(\cdot)$ is a mapping function that converts the audio from a waveform to a mel-spectrogram.

Emotion encoder. To exploit the emotion from the video, we design an emotion encoder $\phi_v$, which captures the embedding of the whole video clip $x_v$. To be specific, we use the I3D model [carreira2017quo] as our emotion encoder and calculate the emotion embedding by $v = \phi_v(x_v)$.
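The sketch below illustrates how the three embeddings of Eq. (1) can be combined; the placeholder sub-encoders and the projection layers that match the embedding dimensions are assumptions for illustration, not the exact V2C-Net modules.

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Sketch of Eq. (1): o_i = e_i (+) u (+) v, with (+) an element-wise add.
    phi_t, phi_a, phi_v stand in for the text (FFT blocks), speaker (LSTM/GE2E-style)
    and emotion (I3D-style) encoders; here they are placeholder modules."""
    def __init__(self, phi_t, phi_a, phi_v, d_model=256):
        super().__init__()
        self.phi_t, self.phi_a, self.phi_v = phi_t, phi_a, phi_v
        # Project speaker / emotion embeddings to the phoneme dimension (an assumption).
        self.proj_a = nn.LazyLinear(d_model)
        self.proj_v = nn.LazyLinear(d_model)

    def forward(self, text, ref_mel, ref_video):
        e = self.phi_t(text)                         # [B, n, d]  phoneme embeddings
        u = self.proj_a(self.phi_a(ref_mel))         # [B, d]     speaker embedding
        v = self.proj_v(self.phi_v(ref_video))       # [B, d]     emotion embedding
        # Broadcast the utterance-level embeddings over the n phonemes and add.
        return e + u.unsqueeze(1) + v.unsqueeze(1)   # [B, n, d]
```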

4.2 Synthesizer for Mel-Spectrogram Generation

To generate the mel-spectrogram from the conditional phoneme embeddings $\mathcal{O}$, we introduce a synthesizer inspired by FastSpeech2 [ren2020fastspeech] and obtain the predicted mel-spectrogram frames $\hat{\mathcal{M}} = \{\hat{m}_j\}_{j=1}^{m}$, where $m$ is the number of mel-spectrogram frames. Here, the synthesizer mainly contains four parts: a duration predictor, a pitch predictor, an energy predictor, and a mel-spectrogram decoder. The loss function of the synthesizer is

$\mathcal{L} = \mathcal{L}_{mel} + \lambda_{d}\mathcal{L}_{d} + \lambda_{p}\mathcal{L}_{p} + \lambda_{e}\mathcal{L}_{e},$   (2)

where $\mathcal{L}_{mel}$, $\mathcal{L}_{d}$, $\mathcal{L}_{p}$ and $\mathcal{L}_{e}$ refer to the losses of the mel-spectrogram, duration predictor, pitch predictor and energy predictor, respectively. The $\lambda_{d}$, $\lambda_{p}$ and $\lambda_{e}$ are hyper-parameters, which we set empirically in practice. The details are depicted in the following.
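A compact sketch of Eq. (2) is given below; the use of MAE for the mel-spectrogram term and MSE for the variance terms follows the FastSpeech2 convention and is an assumption here, as are the default weights of 1.0.

```python
import torch.nn.functional as F

def synthesizer_loss(mel_hat, mel, dur_hat, dur, pitch_hat, pitch, energy_hat, energy,
                     lambda_d=1.0, lambda_p=1.0, lambda_e=1.0):
    """Eq. (2): L = L_mel + lambda_d * L_d + lambda_p * L_p + lambda_e * L_e.
    The weights default to 1.0 here; the actual values are a training choice."""
    l_mel = F.l1_loss(mel_hat, mel)          # Eq. (8), MAE over mel frames (assumed)
    l_d = F.mse_loss(dur_hat, dur)           # Eq. (4)
    l_p = F.mse_loss(pitch_hat, pitch)       # Eq. (5)
    l_e = F.mse_loss(energy_hat, energy)     # Eq. (6)
    return l_mel + lambda_d * l_d + lambda_p * l_p + lambda_e * l_e
```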

Duration Predictor. To alleviate the problem of length mismatch between the input embeddings and the mel-spectrogram frames (i.e., $n \neq m$), we introduce a duration predictor, which takes the embeddings $\mathcal{O}$ as input and predicts the duration of each phoneme embedding. The $i$-th phoneme duration $d_i$ indicates the number of copies of the $i$-th phoneme embedding $o_i$. Then, we use a Length Regulator ($LR$):

$\mathcal{H} = LR(\mathcal{O}, \mathcal{D}),$   (3)

where $\mathcal{H}$ is the extended phoneme embedding sequence and $\mathcal{D} = \{d_i\}_{i=1}^{n}$ (for example, if $\mathcal{D} = \{2, 1, 3\}$, $\mathcal{H}$ would be $\{o_1, o_1, o_2, o_3, o_3, o_3\}$). For simplicity, we redefine $\mathcal{H}$ as $\{h_j\}_{j=1}^{m}$ with length $m$. To optimise the duration predictor, we use the Montreal forced alignment (MFA) [mcauliffe2017montreal] tool to obtain the ground-truth phoneme duration sequence, and then calculate a mean square error (MSE) loss between the ground-truth $d_i$ and the predicted $\hat{d}_i$. Formally, the loss can be defined as

$\mathcal{L}_{d} = \frac{1}{n}\sum_{i=1}^{n}\left(d_i - \hat{d}_i\right)^{2}.$   (4)
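The length regulator of Eq. (3) amounts to repeating each phoneme embedding according to its duration, as in the following sketch (the example durations are illustrative).

```python
import torch

def length_regulator(phoneme_emb, durations):
    """Eq. (3): repeat each phoneme embedding o_i by its (integer) duration d_i.
    phoneme_emb: [n, d] tensor, durations: [n] integer tensor of frame counts.
    E.g. durations = [2, 1, 3] turns {o1, o2, o3} into {o1, o1, o2, o3, o3, o3}."""
    return torch.repeat_interleave(phoneme_emb, durations, dim=0)

o = torch.randn(3, 4)                       # three phoneme embeddings
d = torch.tensor([2, 1, 3])                 # predicted / ground-truth durations
h = length_regulator(o, d)                  # extended embedding, length m = 6
assert h.shape == (6, 4)
```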

Pitch and Energy Predictors. To affect the prosody and volume of the speech, following [ren2020fastspeech], we employ a pitch predictor and an energy predictor, respectively. Specifically, to predict the pitch contour, we use the continuous wavelet transform (CWT) to convert the continuous pitch series into a pitch spectrogram [suni2013wavelets, hirose2015speech], and take it as the ground truth to optimise the pitch predictor by an MSE loss:

$\mathcal{L}_{p} = \frac{1}{m}\sum_{j=1}^{m}\left(p_j - \hat{p}_j\right)^{2},$   (5)

where $p_j$ and $\hat{p}_j$ denote the $j$-th ground-truth and predicted pitch value, respectively. For energy, we follow the operation in [ren2020fastspeech] that calculates the L2-norm of the amplitude of each short-time Fourier transform (STFT) frame and takes it as the energy. The corresponding loss function is

$\mathcal{L}_{e} = \frac{1}{m}\sum_{j=1}^{m}\left(\epsilon_j - \hat{\epsilon}_j\right)^{2},$   (6)

where $\epsilon_j$ and $\hat{\epsilon}_j$ are the $j$-th ground-truth and predicted energy value, respectively. Last, we encode each pitch and energy value into the corresponding embedding by embedding layers $E_p$ and $E_e$, respectively, and then add the pitch and energy embeddings to the extended phoneme embedding $\mathcal{H}$. Mathematically, the mel-spectrograms can be generated by

$\hat{\mathcal{M}} = \Psi_{mel}\big(\mathcal{H} \oplus E_{p}(\hat{\mathcal{P}}) \oplus E_{e}(\hat{\mathcal{E}})\big),$   (7)

where $\Psi_{mel}$ refers to a mel-spectrogram decoder consisting of 6 FFT blocks [ren2019fastspeech], and $\hat{\mathcal{P}}$ and $\hat{\mathcal{E}}$ denote the predicted pitch and energy sequences. To optimise the predicted mel-spectrogram, we use the loss function

$\mathcal{L}_{mel} = \frac{1}{m}\sum_{j=1}^{m}\left\lVert m_j - \hat{m}_j \right\rVert_{1},$   (8)

where $m_j$ denotes the $j$-th frame of the ground-truth mel-spectrogram while $\hat{m}_j$ is the predicted one.
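The sketch below illustrates Eq. (7): pitch and energy values are bucketised, looked up in embedding tables and added to the extended phoneme embedding before decoding. The bucketisation ranges and bin count are illustrative assumptions in the style of FastSpeech2, not the exact V2C-Net configuration.

```python
import torch
import torch.nn as nn

class VarianceAdaptorSketch(nn.Module):
    """Sketch of Eq. (7): M_hat = Dec(H + E_p(pitch) + E_e(energy)), where the
    pitch/energy values are bucketised before the embedding lookup."""
    def __init__(self, decoder, d_model=256, n_bins=256,
                 p_range=(-4.0, 4.0), e_range=(0.0, 8.0)):
        super().__init__()
        self.decoder = decoder                                  # e.g. a stack of FFT blocks
        self.register_buffer("p_bins", torch.linspace(p_range[0], p_range[1], n_bins - 1))
        self.register_buffer("e_bins", torch.linspace(e_range[0], e_range[1], n_bins - 1))
        self.p_emb = nn.Embedding(n_bins, d_model)
        self.e_emb = nn.Embedding(n_bins, d_model)

    def forward(self, h, pitch, energy):
        # h: [B, m, d]; pitch, energy: [B, m] (ground truth at training, predicted at inference)
        h = h + self.p_emb(torch.bucketize(pitch, self.p_bins))
        h = h + self.e_emb(torch.bucketize(energy, self.e_bins))
        return self.decoder(h)                                  # predicted mel frames [B, m, n_mels]
```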

4.3 Vocoder for Speech Synthesis

In Figure 4, to convert the generated mel-spectrogram into a time-domain waveform, we use HiFi-GAN [kong2020hifi] as our vocoder, which focuses on raw waveform generation from mel-spectrograms via GANs [goodfellow2014generative]. The generator of HiFi-GAN can be divided into two major modules: a transposed convolution (ConvTranspose) network and a multi-receptive field fusion (MRF) module. Specifically, we first upsample the mel-spectrogram by ConvTranspose, which aligns the length of the output features with the temporal resolution of the raw waveform. Then, we feed the upsampled features into the MRF module, which consists of multiple residual blocks [he2016deep], and take the sum of the outputs from these blocks as the predicted waveform. Here, we follow the settings of [kong2020hifi] and use residual blocks with different kernel sizes and dilation rates to ensure different receptive fields. We optimise the vocoder via an objective function that contains an LSGAN-based loss [mao2017least], a mel-spectrogram loss [isola2017image], and a feature matching loss [kumar2019melgan]. In practice, we use a vocoder (i.e., HiFi-GAN) pretrained on the LibriSpeech dataset [panayotov2015librispeech].
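The following is a compact structural sketch of such a generator (transposed-convolution upsampling followed by MRF blocks); the channel widths, kernel sizes, dilations and upsampling rates are illustrative defaults rather than the official HiFi-GAN configuration.

```python
import torch
import torch.nn as nn

class MRF(nn.Module):
    """Multi-receptive field fusion: sum of conv blocks with different kernel sizes
    and dilations, used as a residual refinement after each upsampling stage."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(*[
                nn.Sequential(nn.LeakyReLU(0.1),
                              nn.Conv1d(channels, channels, k, dilation=d,
                                        padding=(k - 1) * d // 2))
                for d in dilations])
            for k in kernel_sizes])

    def forward(self, x):
        return x + sum(block(x) for block in self.blocks) / len(self.blocks)

class GeneratorSketch(nn.Module):
    """ConvTranspose upsampling + MRF: mel [B, n_mels, T] -> waveform [B, 1, T * 256]."""
    def __init__(self, n_mels=80, channels=512, upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, channels, 7, padding=3)
        ups, c = [], channels
        for r in upsample_rates:
            ups += [nn.ConvTranspose1d(c, c // 2, 2 * r, stride=r, padding=r // 2), MRF(c // 2)]
            c //= 2
        self.ups = nn.Sequential(*ups)
        self.post = nn.Conv1d(c, 1, 7, padding=3)

    def forward(self, mel):
        return torch.tanh(self.post(self.ups(self.pre(mel))))
```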

5 Experiments

Methods | MCD ↓ | MCD-DTW ↓ | MCD-DTW-SL ↓ | Id. Acc. (%) ↑ | Emo. Acc. (%) ↑ | MOS-naturalness ↑ | MOS-similarity ↑
Ground Truth | 00.00 | 00.00 | 00.00 | 90.62 | 84.38 | 4.61 ± 0.15 | 4.74 ± 0.12
SV2TTS [Jia2018TransferLF] | 21.08 | 12.87 | 49.56 | 33.62 | 37.19 | 2.03 ± 0.22 | 1.92 ± 0.15
SV2TTS* [Jia2018TransferLF] | 17.41 | 11.16 | 15.92 | 38.21 | 41.24 | 3.20 ± 0.20 | 3.09 ± 0.33
FastSpeech2 [ren2020fastspeech] | 12.08 | 10.29 | 10.31 | 59.38 | 53.13 | 3.86 ± 0.07 | 3.75 ± 0.06
V2C-Net (Ours) | 11.79 | 10.09 | 10.05 | 62.50 | 56.25 | 3.97 ± 0.06 | 3.90 ± 0.06
Table 3: Comparison with the state-of-the-art methods. We provide the results of both objective (i.e., MCD, MCD-DTW and MCD-DTW-SL) and subjective evaluation metrics (i.e., MOS-naturalness and MOS-similarity). "Id. Acc." and "Emo. Acc." are the identity and emotion accuracy of the generated speech, respectively. The method with "*" refers to a variant taking the video (emotion) embedding as an additional input. "Ground Truth" denotes the results on ground-truth samples. "↑" ("↓") means that a higher (lower) value is better.

We evaluate the quality of generated speech in terms of three aspects: 1) objective evaluation, 2) subjective evaluation, and 3) identity and emotion accuracy. The objective and subjective evaluation metrics aim to assess the quality of generated speeches by comparing with ground-truth ones. By contrast, the identity accuracy and emotion accuracy focus on whether the generated speeches involve the desired voice (i.e., identity) and emotion, respectively. We provide both quantitative and qualitative results on the V2C-Animation dataset. More details are in the following.

5.1 Evaluation Metrics

Objective Evaluation Metric. To assess the quality of the generated speech, we use the Mel Cepstral Distortion (MCD) [kubichek1993mel] metric, which compares the Mel Frequency Cepstral Coefficient (MFCC) vectors derived from the generated speech and the ground truth, respectively. We sum the Euclidean distance over the first $B$ MFCC values:

$\mathrm{MCD} = \frac{10}{\ln 10}\,\frac{1}{T}\sum_{t=1}^{T}\sqrt{2\sum_{b=1}^{B}\left(c_{t,b} - c^{*}_{t,b}\right)^{2}},$   (9)

where $T$ refers to the number of speech/audio frames. $c_{t,b}$ and $c^{*}_{t,b}$ denote the $b$-th MFCC value of the $t$-th speech frame from the generated and ground-truth speeches, respectively, while $1 \le b \le B$ and $1 \le t \le T$.
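A minimal sketch of Eq. (9) is shown below; the number of MFCC coefficients (here 13) and the frame-truncation strategy are assumptions for illustration.

```python
import numpy as np
import librosa

def mcd(wav_gen, wav_gt, sr=22050, n_mfcc=13):
    """Eq. (9)-style Mel Cepstral Distortion between a generated and a ground-truth
    waveform; both are assumed to contain the same number of frames (T = T*)."""
    c_gen = librosa.feature.mfcc(y=wav_gen, sr=sr, n_mfcc=n_mfcc).T   # [T, B]
    c_gt = librosa.feature.mfcc(y=wav_gt, sr=sr, n_mfcc=n_mfcc).T
    T = min(len(c_gen), len(c_gt))            # or zero-pad the shorter one, as in prior work
    diff = c_gen[:T] - c_gt[:T]
    return (10.0 / np.log(10.0)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```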

Note that the MCD metric requires the lengths of the two input speeches to be the same (i.e., $T = T^{*}$). When $T \neq T^{*}$, existing voice cloning methods like [skerry2018towards] simply extend the shorter speech to the length of the longer one by padding zeros in the time-domain waveform. In this way, the MCD value may become extremely large if the mismatch occurs at the beginning of the two speeches. To avoid this issue, Battenberg et al. [battenberg2020location] use an improved MCD metric, called MCD-DTW, which adopts the Dynamic Time Warping (DTW) [muller2007dynamic] algorithm to find the minimum MCD between two speeches. However, MCD-DTW can achieve a good value as long as the two speeches can be matched, regardless of their lengths. This is not reasonable, as a better generated speech should also have a length similar to the ground truth.

To alleviate the above issues, we propose an MCD-DTW weighted by Speech Length (MCD-DTW-SL), which evaluates both the length and the quality of the alignment between two speeches. In MCD-DTW-SL, to evaluate whether the two speeches are aligned, we still use the DTW algorithm to calculate the minimum distance between them. Specifically, we compute the cumulative distance matrix $\mathbf{D}$, where $D_{i,j}$ is the minimum cumulative distance from index $(1,1)$ to $(i,j)$. Then, we obtain the objective minimum distance by accumulating $K$ distances $\{d_k\}_{k=1}^{K}$ along the optimal warping path. Besides, considering the influence of the speech lengths, we design a simple but effective length coefficient $\omega$. Formally, we calculate the metric

$\mathrm{MCD\text{-}DTW\text{-}SL} = \frac{\omega}{K}\sum_{k=1}^{K} d_k.$   (10)
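The sketch below illustrates the idea behind MCD-DTW-SL: a DTW alignment over frame-wise MCD terms, scaled by a length coefficient. The specific coefficient used here (the ratio of the longer to the shorter duration) and the path-length proxy are illustrative assumptions rather than the exact definition of the coefficient.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mcd_dtw_sl(c_gen, c_gt):
    """DTW-aligned MCD weighted by a speech-length coefficient.
    c_gen: [T1, B] and c_gt: [T2, B] MFCC matrices. The coefficient omega below
    (ratio of the longer to the shorter length) is an illustrative choice."""
    T1, T2 = len(c_gen), len(c_gt)
    dist = (10.0 / np.log(10.0)) * np.sqrt(2.0) * cdist(c_gen, c_gt)   # frame-pair MCD terms
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):                                         # cumulative distance D_{i,j}
        for j in range(1, T2 + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    K = max(T1, T2)                    # rough proxy for the warping-path length
    omega = max(T1, T2) / min(T1, T2)  # length penalty: 1.0 when the durations match
    return omega * D[T1, T2] / K
```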

Subjective Evaluation Metric. To further evaluate the quality of generated speech, we conduct a human study by using a subjective evaluation metric. Specifically, following the settings in [Jia2018TransferLF], we use a Mean Opinion Score (MOS) evaluation approach based on subjective listening tests. In this approach, we use the Absolute Category Rating (ACR) scale [rec1996p] with rating scores from 1 to 5 (i.e., from “Bad” to “Excellent”) in 0.5 point increments. Based on such an approach, we mainly evaluate the generated speeches with respect to naturalness and similarity. 1) MOS-naturalness: to assess the naturalness of the generated speech, we randomly sample 100 generated audios from the testing set and divide them into 4 groups. Each group is rated by a single rater. 2) MOS-similarity: to evaluate whether the generated speech is well aligned with the desired voice and prosody, we compare each generated speech with the ground-truth one from the same speaker. We use the same samples as when evaluating MOS-naturalness above. Each pair is rated by the rater according to the similarity between two speeches.

Identity and Emotion Accuracy. To evaluate whether the generated speech carries the proper speaker identity and emotion, we propose an identity accuracy and an emotion accuracy, respectively. The identity accuracy verifies whether the generated speech can be recognised as the same speaker as the input reference audio. Similarly, the emotion accuracy reflects whether the generated speech contains the same emotion as the reference video. To this end, we first use the GE2E [wan2018generalized] model as our audio encoder to obtain a fixed-dimensional audio embedding for each speech. Then, based on the audio embeddings $\{e_{k,j}\}_{j=1}^{N_k}$ of the $k$-th speaker (normalised using the L2 norm), we obtain the centroid of the $k$-th speaker by $c_k = \frac{1}{N_k}\sum_{j=1}^{N_k} e_{k,j}$, where $N_k$ refers to the number of audios belonging to the $k$-th speaker. Last, we compute the cosine similarity between the embedding of the generated speech and each centroid, and then classify it into the category of the most similar centroid. The emotion accuracy is calculated in the same way.
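A sketch of this centroid-based classification is given below, assuming embeddings have already been extracted (e.g., with a GE2E-style encoder); the helper names are hypothetical.

```python
import numpy as np

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def accuracy_by_centroid(gen_embs, gen_labels, ref_embs, ref_labels):
    """Classify each generated-speech embedding to the nearest (cosine) speaker or
    emotion centroid built from reference embeddings, then report accuracy."""
    ref_embs, gen_embs = l2_normalise(ref_embs), l2_normalise(gen_embs)
    classes = sorted(set(ref_labels))
    centroids = np.stack([ref_embs[np.array(ref_labels) == c].mean(axis=0) for c in classes])
    centroids = l2_normalise(centroids)
    sims = gen_embs @ centroids.T                       # cosine similarity (unit vectors)
    pred = [classes[i] for i in sims.argmax(axis=1)]
    return float(np.mean([p == g for p, g in zip(pred, gen_labels)]))
```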

5.2 Quantitative Evaluation

To evaluate the performance of our method, we compare V2C-Net with several state-of-the-art methods. In Table 3, our V2C-Net consistently outperforms the existing VC models (e.g., the MCD-DTW-SL of our V2C-Net is 10.05 while the SV2TTS model only achieves 49.56). Besides, we propose a variant of the SV2TTS model, called SV2TTS*, which takes the video embedding derived from our method as an additional input. From Table 3, SV2TTS* achieves better performance than SV2TTS, which further demonstrates the effectiveness of our video component. Note that the Id. Acc. and Emo. Acc. on the ground truth are not 100%. This is because our pre-trained identity and emotion classification models are not perfect.

5.3 Qualitative Evaluation

Figure 5: Mel-spectrograms of generated and ground-truth audios. Orange curves are F0 contours, where F0 is the fundamental frequency of the audio. Purple curves refer to the energy (volume) of the audio. The horizontal axis is the duration of the audio. We highlight the main differences via red circles.

To further assess the generated speech, we show the visualised results of the proposed method, the baseline method and the ground truth, respectively. In Figure 5, compared with FastSpeech2, both the energy (volume) curve and the fundamental frequency (i.e., F0) curve of the mel-spectrogram generated by our V2C-Net are more similar to the ground-truth (GT) ones. Notably, as the duration of the audio has to be predicted as well, the lengths of the generated audio and the GT one may differ. More visual results can be found in the supplementary.

5.4 Effect of Reference Audio and Video

To investigate the effect of the reference audio and video, we conduct an ablation study that removes them alternately and report the quantitative results (i.e., the identity and emotion accuracy mentioned above) in Table 4. The results show that, with the control of the reference audio, our V2C-Net achieves an obviously higher identity accuracy than the counterpart without the reference audio (i.e., from 25.00% to 59.38%). After further incorporating the information of the reference video, the model obtains the best performance on both metrics.

Method | ref. A | ref. V | Id. Acc. (%) ↑ | Emo. Acc. (%) ↑
V2C-Net | ✗ | ✓ | 25.00 | 47.61
V2C-Net | ✓ | ✗ | 59.38 | 53.13
V2C-Net | ✓ | ✓ | 62.50 | 56.25
Table 4: Effect of the reference audio and video. "ref. A" and "ref. V" denote the reference audio and reference video, respectively. "Id. Acc." and "Emo. Acc." are the identity and emotion accuracy of the generated speech, respectively. "↑" means a higher value is better.

5.5 Comparing Difficulties of V2C and VC Tasks

To investigate whether our V2C task is more challenging than the conventional VC task, we compare the results of the SV2TTS method [Jia2018TransferLF] on our V2C-Animation dataset and on existing VC datasets (e.g., VCTK [yamagishi2019cstr], LibriSpeech [panayotov2015librispeech] and Multi-speaker [skerry2018towards]). In Table 5, SV2TTS obtains 4.07 and 3.98 MOS-naturalness on VCTK and LibriSpeech, respectively, which is higher than on our V2C-Animation dataset (2.03 in Table 5). Besides, the SV2TTS model achieves a 12.37 MCD value on the Multi-speaker dataset [skerry2018towards], which is better than the result of the same model on our V2C-Animation dataset (i.e., 21.08 in Table 5). This demonstrates that the proposed V2C task is more challenging, as the same VC model obtains worse results on our V2C-Animation dataset than on the others.

Task | Dataset | MCD ↓ | MOS-naturalness ↑
VC | VCTK [yamagishi2019cstr] | - | 4.07 ± 0.06
VC | LibriSpeech [panayotov2015librispeech] | - | 3.98 ± 0.06
VC | Multi-speaker [skerry2018towards] | 12.37 | -
V2C | V2C-Animation (Ours) | 21.08 | 2.03 ± 0.22
Table 5: Comparison of the difficulty of the conventional Voice Cloning (VC) and Visual Voice Cloning (V2C) tasks. We show the performance of SV2TTS [Jia2018TransferLF] trained on different datasets. VCTK, LibriSpeech and Multi-speaker are widely used in the VC task. "↑" ("↓") means that a higher (lower) value refers to better performance.

5.6 Future Work and Discussion on Social Impacts

Limitations and Future Work. In the future, we may extend V2C-Net in two aspects. First, to grasp the emotion from video, we simply use I3D model as emotion encoder. However, it may not well disentangle emotion from character identity (see results in Table 4). To alleviate this issue, we may design an emotion-aware loss to capture more discriminative emotion features. Second, we integrate the multi-modal features (i.e., text, audio, video) by a simple operation (i.e., element-wise add), which may result in sub-optimal performance. Thus, a more promising model for feature fusion is necessary, e.g., COOT [ging2020coot] or VATT [akbari2021vatt].

Discussion on Social Impacts. The proposed V2C task benefits many real-world applications, e.g., movie dubbing or restoring the ability to communicate naturally for users who have lost their voice. However, the technology risks being used maliciously for fake voice generation, which could be misused for financial scams when combined with video deepfakes. To mitigate this issue, in this paper we only focus on voice generation based on animated movies, without any personally identifiable information (e.g., the face of a real person).

6 Conclusion

In this paper, we propose a novel task, Visual Voice Cloning (V2C), which extends conventional Voice Cloning. It seeks to convert a paragraph of text to speech with the desired voice and emotion derived from a reference audio and a reference video, respectively. To facilitate research on this new task, we collect the first V2C-Animation dataset. We also design a V2C baseline method, namely the Visual Voice Cloning Network (V2C-Net), based on FastSpeech2 (a widely used TTS framework). Moreover, to assess the quality of the generated speech, we propose a variant of MCD-DTW, called MCD-DTW-SL, which is weighted by speech length. The experimental results demonstrate the effectiveness of our V2C-Net, although its performance is still far from saturation.

References

Appendix A More Analysis of V2C-Animation Dataset

A.1 Word Cloud and Count

In Figure 6, we visualise the texts/subtitles of our V2C-Animation dataset as a Venn-style word cloud [coppersmith2014dynamic], where the size of each word refers to the harmonic mean of its count.

Figure 6: Word cloud of the texts on our V2C-Animation dataset.

Besides, we also provide the top 30 words in our V2C-Animation dataset along with their counts in Figure 7. More results (top 100) are listed below:
(‘know’, 437), (‘oh’, 305), (‘right’, 255), (‘one’, 254), (‘now’, 250), (‘well’, 250), (‘go’, 233), (‘okay’, 217), (‘come’, 210), (‘want’, 201), (‘look’, 196), (‘got’, 181), (‘going’, 173), (‘think’, 167), (‘will’, 165), (‘thing’, 163), (‘gonna’, 163), (‘need’, 159), (‘see’, 155), (‘back’, 153), (‘never’, 151), (‘us’, 147), (‘time’, 141), (‘say’, 139), (‘hey’, 138), (‘mean’, 137), (‘let’, 137), (‘good’, 135), (‘yeah’, 131), (‘guy’, 128), (‘really’, 124), (‘make’, 124), (‘thank’, 124), (‘little’, 112), (‘way’, 108), (‘love’, 108), (‘ye’, 108), (‘find’, 104), (‘help’, 97), (‘tell’, 96), (‘wait’, 95), (‘take’, 93), (‘kid’, 92), (‘please’, 91), (‘sorry’, 88), (‘something’, 87), (‘great’, 87), (‘dad’, 87), (‘friend’, 84), (‘day’, 82), (‘game’, 80), (‘stop’, 75), (‘even’, 75), (‘Uh’, 74), (‘big’, 67), (‘work’, 66), (‘Ralph’, 66), (‘much’, 62), (‘give’, 62), (‘first’, 61), (‘everything’, 60), (‘new’, 59), (‘still’, 58), (‘life’, 58), (‘keep’, 58), (‘dragon’, 58), (‘family’, 57), (‘sure’, 56), (‘made’, 56), (‘talk’, 55), (‘world’, 53), (‘place’, 53), (‘heart’, 53), (‘every’, 53), (‘maybe’, 53), (‘stay’, 52), (‘wanna’, 51), (‘better’, 51), (‘people’, 50), (‘huh’, 50), (‘anything’, 50), (‘getting’, 49), (‘thought’, 48), (‘man’, 48), (‘mom’, 48), (‘listen’, 48), (‘guess’, 47), (‘fine’, 47), (‘around’, 47), (‘gotta’, 46), (‘believe’, 46), (‘two’, 45), (‘someone’, 45), (‘home’, 45), (‘call’, 45), (‘boy’, 45), (‘son’, 44), (‘put’, 43), (‘fix’, 43), (‘always’, 43)

Figure 7: Top 30 words on V2C-Animation along with the counts.

A.2 Distribution of Emotion Labels

Following the categories of FER-2013 [goodfellow2013challenges] (a dataset for human facial expression recognition), we divide the collected video/audio clips into 8 types (i.e., 0: angry, 1: disgust, 2: fear, 3: happy, 4: neutral, 5: sad, 6: surprise, and 7: others). The number and distribution of each emotion label can be found in Table 6 and Figure 8, respectively.

Figure 8: Distribution of emotion labels on V2C-Animation.
Emotion | angry | disgust | fear | happy | neutral | sad | surprise | others
Count | 756 | 64 | 305 | 1799 | 4919 | 572 | 240 | 1562
Table 6: Counts of the emotion labels on the V2C-Animation dataset.

A.3 Distribution of Utterance Length

Figure 9: Distribution of utterance/text length.

Figure 9 exhibits the distribution of utterance/text lengths on the V2C-Animation dataset, which shows that most utterances range from 3 to 8 words. Besides, we also list the number of utterances/texts and their corresponding percentages below (format: length, count, percentage):
(1, 594, 5.81%), (2, 708, 6.93%), (3, 914, 8.95%), (4, 1116, 10.92%), (5, 1232, 12.06%), (6, 1213, 11.87%), (7, 1040, 10.18%), (8, 919, 8.99%), (9, 783, 7.66%), (10, 615, 6.02%), (11, 423, 4.14%), (12, 270, 2.64%), (13, 192, 1.88%), (14, 105, 1.03%), (15, 54, 0.53%), (16, 24, 0.23%), (17, 11, 0.11%), (18, 3, 0.03%), (19, 1, 0.01%)

A.4 More Examples of Subtitle and Video Clip

We show several examples of how to crop movies based on the corresponding subtitle files. Here, we use SRT-format subtitle files. Besides the subtitles/texts, an SRT file also contains starting and ending time-stamps to ensure the subtitles match the video and audio. The sequential number of a subtitle (e.g., No. 726 and No. 1340 in Figure 10) indicates the index of each video clip. Based on the SRT file, we cut the movie into a series of video clips via the FFmpeg toolkit [tomar2006converting] (an automatic audio and video processing toolkit).

Figure 10: Examples of how to cut a movie into a series of video clips according to subtitle files. Note that the subtitle files contain both starting and ending time-stamps for each video clip.

A.5 Samples of Character's Emotion

Figure 11 shows some samples of the reference videos on V2C-Animation dataset with their corresponding emotions.

Figure 11: Samples of the character’s emotion (e.g., happy and sad) involved in the reference video. Here, we take Elsa (a character in movie Frozen) as an example.

A.6 List of Animated Movies and Characters

As shown in Figure 12, we report all the names of our collected animated movies with their corresponding characters/speakers on the V2C-Animation dataset.

Figure 12: Movies with the corresponding speakers/characters on the V2C-Animation dataset.

Appendix B Analysis of Qualitative Results

To further assess the quality of the generated speeches, we provide a video in this supplementary material that compares the generated audios from our V2C-Net, the baseline method (i.e., FastSpeech2 [ren2020fastspeech]), and the ground truth (i.e., "comparison_with_SoTA.mp4"). Besides, to investigate whether the proposed V2C-Net is able to clone the voice from the reference audio, we fix the input text/subtitle and the reference video, and then generate speeches using voices derived from different reference audios (i.e., "voice_cloning.mp4").

Appendix C V2C-Animation vs. Related Datasets

To compare the collected V2C-Animation dataset with several related datasets (i.e., LJ Speech, LibriSpeech and LibriTTS), we visualise the pitch tracks of samples from our dataset and others. Due to the varying lengths of the audios, for a fair comparison, we cut two seconds of audio from each sample. As shown in Figure 13, the audio pitches from the existing datasets are smoother and their frequency (Hz) ranges are narrower than ours.

Figure 13: Visual comparison between our V2C-Animation dataset and the related datasets (i.e., LJ Speech, LibriSpeech and LibriTTS). A pitch of 0 Hz refers to an unvoiced segment.

Appendix D More Visual Results of Mel-spectrogram

We provide more visualised results of our V2C-Net in comparison with the baseline method and the ground truth. As shown in Figure 14, the mel-spectrograms generated by the proposed V2C-Net are more similar to the ground-truth ones. Note that the baseline method FastSpeech2 does not take the reference videos (i.e., emotions) as inputs, which may lead to missing some of the prosody conveyed in the videos. The results further demonstrate the effect of the reference video when generating speech with rich emotions. Besides, the pitch ranges of the mel-spectrograms vary with the different emotions. For example, the pitch of the mel-spectrogram changes more drastically with the emotions "happy" or "sad", while it is smoother when the emotion is "neutral".

Figure 14: More visualised mel-spectrograms of generated and ground-truth audios. The orange curves are F0 contours, where F0 denotes the fundamental frequency of the audio. The purple curves refer to the energy (volume) of the audio. The horizontal axis is the duration of the audio. We highlight the main differences via red circles.

Appendix E Implementation Details

For the speaker encoder, we use the same architecture as [wan2018generalized], comprising three LSTM layers and a linear layer. The speaker encoder maps a sequence of mel-spectrogram frames, derived from the reference audio, to a vector with a fixed dimension of 256. We optimise this model with a generalised end-to-end speaker verification loss, which ensures that features from the same speaker are more similar to each other than to features from different speakers. For the emotion encoder, we use a conventional I3D model [carreira2017quo] trained on our V2C-Animation dataset, whose final output is a 1024-dimensional vector. For our synthesizer, we train the text encoder and the synthesizer in an end-to-end manner on our proposed V2C-Animation dataset. We train all models on a single GPU device (GeForce RTX 3090).

Appendix F Details of Vocoder

To synthesise the waveform of the speech from our generated mel-spectrogram, we use HiFi-GAN [kong2020hifi] as our vocoder. The HiFi-GAN model is based on Generative Adversarial Networks (GANs) [goodfellow2014generative] and consists of one generator and two discriminators, i.e., a multi-period discriminator (MPD) and a multi-scale discriminator (MSD).

The generator of HiFi-GAN can be divided into two major modules: a transposed convolution (ConvTranspose) network and a multi-receptive field fusion (MRF) module. Specifically, we first upsample the mel-spectrogram by ConvTranspose, which aligns the length of the output features with the temporal resolution of the raw waveform. Then, we feed the upsampled features into the MRF module, which consists of multiple residual blocks [he2016deep], and take the sum of the outputs from these blocks as our predicted waveform. Here, residual blocks with different kernel sizes and dilation rates are used to ensure different receptive fields.

For the two discriminators, the multi-period discriminator (MPD) contains several sub-discriminators, where each sub-discriminator handles a specific periodic part of the input audio. By contrast, the multi-scale discriminator (MSD) proposed in MelGAN [kumar2019melgan], consisting of three sub-discriminators, tries to capture the consecutive patterns and long-term dependencies from input audio.

The generator and discriminators are trained adversarially, aiming to improve the training stability and the model performance. Specifically, the vocoder (i.e., HiFi-GAN) is optimised via the objective function that contains an LSGAN-based loss [mao2017least], a mel-spectrogram loss [isola2017image], and a feature matching loss [kumar2019melgan]. In practice, we use the vocoder (i.e., HiFi-GAN) pretrained on the LibriSpeech dataset [panayotov2015librispeech].