Vocoder-Based Speech Synthesis from Silent Videos

04/06/2020 ∙ by Daniel Michelsanti, et al. ∙ Universitat Pompeu Fabra ∙ Aalborg University

Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence results in extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.

1 Introduction

Most of the events that we experience in our life consist of visual and acoustic stimuli. Recordings of such events may lack the acoustic component, for example due to limitations of the recording equipment or technical issues in the transmission of the information. Since acoustic and visual modalities are often correlated, methods to reconstruct audio signals using videos have been proposed [davis2014visual, owens2016visually, zhou2018visual].

In this paper, we focus on one particular case of the aforementioned problem: speech reconstruction (or synthesis) from a silent video. Solving this task might be useful to automatically generate speech for surveillance videos and for extremely challenging speech enhancement applications, e.g. hearing assistive devices, where noise completely dominates the target speech, making the acoustic signal worth less than its video counterpart.

A possible way to tackle the problem is to decompose it into two steps: first, a visual speech recognition (VSR) system [assael2016lipnet, Stafylakis2017, chung2017lip] predicts the spoken sentences from the video; then, a text-to-speech (TTS) model [sotelo2017char2wav, ping2018deep, shen2018natural] synthesises speech based on the output of the VSR system. However, at least two drawbacks can be identified when such an approach is used. In order to generate speech from text, each word should be spoken in its entirety to be processed by the VSR and the TTS systems, imposing great limitations for real-time applications. Furthermore, when the TTS method is applied, useful information that should be captured by the system, such as emotion and prosody, gets lost, making the synthesised speech unnatural. For these reasons, approaches that estimate speech from a video, without using text as an intermediate step, have been proposed.

Le Cornu and Miller [cornu2015reconstructing, le2017generating] developed a video-to-speech method with a focus on speech intelligibility rather than quality. This is achieved by estimating spectral envelope (SP) audio features from visual features and then reconstructing the time-domain signal with the STRAIGHT vocoder [kawahara1999restructuring]. Since the vocoder also requires other audio features, i.e. the fundamental frequency (F0) and the aperiodic parameter (AP), these are artificially created independently of the visual features.

Ephrat and Peleg [ephrat2017vid2speech] treated speech reconstruction as a regression problem using a neural network which takes as input raw visual data and predicts a line spectrum pairs (LSP) representation of linear predictive coding (LPC) coefficients computed from the audio signal. The waveform is reconstructed from the estimated audio features using Gaussian white noise as excitation, producing unnatural speech. This issue is tackled in a subsequent work [ephrat2017improved], where a neural network estimates the mel-scale spectrogram of the audio from video frames and optical flow information derived from the visual input. The time-domain speech signal is reconstructed using either example-based synthesis, in which estimated audio features are replaced with their closest match in the training set, or speech synthesis from predicted linear-scale spectrograms.

Akbari et al. [akbari2018lip2audspec] tried to reconstruct natural sounding speech using a neural network that takes as input the face region of the talker and estimates bottleneck features extracted from the auditory spectrogram by a pre-trained autoencoder. The time-domain signal is obtained with the algorithm in [chi2005multiresolution]. This approach compares favourably with the one in [ephrat2017vid2speech].

All the methods reported until now have a major limitation: they estimate either a magnitude spectrogram, SPs or LSPs, which do not contain all the information of a speech signal. Vougioukas et al. [Vougioukas2019] addressed this issue and proposed an end-to-end model that can directly synthesise audio waveforms from videos using a generative adversarial network (GAN). However, their direct estimation of a time-domain signal causes artefacts in the reconstructed speech.

In this work, we propose an approach, vid2voc, to estimate WORLD vocoder [morise2016world] features from the silent video of a speaker. (Although this paper aims at synthesising speech from frontal-view silent videos, it is worth mentioning that some methods using multi-view video feeds have also been developed [kumar2018harnessing, kumar2018mylipper, kumar2019lipper, Uttam2019].) We trained the systems using either the whole face or the mouth region only, since previous work [ephrat2017vid2speech] shows a benefit in using the entire face. Our method differs from the work in [cornu2015reconstructing, le2017generating], because we predict all the vocoder features (not only SP) directly from raw video frames. Estimating F0 and AP alongside SP yields a framework that focuses on both speech intelligibility (as in [cornu2015reconstructing, le2017generating]) and speech quality, and that is able to outperform even the recently proposed GAN-based approach in [Vougioukas2019] in several conditions. In addition, we train a system that can simultaneously perform speech reconstruction (our main goal) and VSR, in a multi-task learning fashion. This can be useful in all the applications that require video captioning without adding considerable extra complexity to the system. Although Kumar et al. [kumar2019lipper] incorporate a text-prediction model in their multi-view speech reconstruction pipeline, this model is trained separately from the main system and it is quite simple: it classifies encoded audio features estimated with a pre-trained network into 10 text classes. This makes the method dependent on the number of different sentences of the specific database used for training and not suitable for real-time applications. Instead, we make use of the more flexible connectionist temporal classification (CTC) [graves2006connectionist] sequence modelling, which has already shown its success in VSR [assael2016lipnet].

Additional material, including samples of reconstructed speech that the reader is encouraged to listen to for a better understanding of the effectiveness of our approach, can be found in https://danmic.github.io/vid2voc/.

2 Methodology and Experimental Setup

2.1 Audio-Visual Speech Corpus

Experiments are conducted on the GRID corpus [cooke2006audio], which consists of audio and video recordings from 34 speakers (s1-s34), 18 males and 16 females, each of them uttering 1000 six-word sentences with the following structure: command, color, preposition, letter, digit, adverb. Each video has a resolution of 720×576 pixels, a duration of 3 s and a frame rate of 25 frames per second. The audio tracks have the same duration as the videos and a sample frequency of 50 kHz. In addition, text transcription for every utterance is provided.

As in [Vougioukas2019], we evaluate our systems in speaker dependent and speaker independent settings. Regarding the speaker dependent scenario, the data from 4 speakers (s1, s2, s4, s29) is pooled together, then 90% of the data is used for training, 5% for validation and 5% for testing. Regarding the speaker independent scenario, the data from 15 speakers (s1, s3, s5-s8, s10, s12, s14, s16, s17, s22, s26, s28, s32) is used for training, the data from 7 speakers (s9, s20, s23, s27, s29, s30, s34) for validation and the data from 10 speakers (s2, s4, s11, s13, s15, s18, s19, s25, s31, s33) for testing.
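The speaker independent partition can be written down and checked in a few lines. This is only an illustrative sketch: the dictionary layout and the helper name are hypothetical, and the training list is read as 15 speakers that include s5 to s8.

```python
# Speaker-independent partition of GRID, with speaker IDs as integers.
# The training list assumes the range s5 to s8 is included (15 speakers in total).
SPLITS = {
    "train": [1, 3, 5, 6, 7, 8, 10, 12, 14, 16, 17, 22, 26, 28, 32],
    "validation": [9, 20, 23, 27, 29, 30, 34],
    "test": [2, 4, 11, 13, 15, 18, 19, 25, 31, 33],
}


def check_disjoint(splits):
    """Return True if no speaker appears in more than one partition."""
    seen = set()
    for speakers in splits.values():
        if seen & set(speakers):
            return False
        seen |= set(speakers)
    return True


# Number of speakers per partition.
sizes = {name: len(speakers) for name, speakers in SPLITS.items()}
```

The resulting sizes match the 15, 7 and 10 speakers stated above, and no test speaker is seen during training or validation.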

2.2 Audio and Video Preprocessing

The acoustic model used in this work is based on the WORLD vocoder [morise2016world], with a sample frequency of 50 kHz and a hop size of 250 samples (the window length is automatically determined by the WORLD algorithm). WORLD consists of three analysis algorithms to determine SP, F0 and AP features, and a synthesis algorithm which incorporates these three features. Here, we use SWIPE [camacho2007swipe] and D4C [morise2016d4c] to estimate F0 and AP, respectively. As done in [blaauw2017neural], a dimensionality reduction of the features is applied: SP is reduced to 60 log mel-frequency spectral coefficients (MFSCs) and AP is reduced to 5 coefficients according to the D4C band-aperiodicity estimation. In addition, a voiced-unvoiced (VUV) state is obtained by thresholding the F0 obtained with SWIPE. All the acoustic features are min-max normalised using the statistics of the training set as in [chandna2019vocoder].
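The relation between the audio frame rate and the video frame rate follows from the numbers above, and it explains why eight audio frames are later predicted for each video frame. The sketch below works through that arithmetic and adds a minimal min-max normalisation helper. The target range of [0, 1] is an assumption (the text only says "min-max normalised"), and all names are hypothetical.

```python
SAMPLE_RATE_HZ = 50_000  # audio sample frequency stated in the text
HOP_SAMPLES = 250        # vocoder hop size stated in the text
VIDEO_FPS = 25           # GRID video frame rate

# Duration of one vocoder hop in milliseconds.
hop_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE_HZ
# Number of vocoder (audio) frames per second.
audio_frames_per_second = SAMPLE_RATE_HZ // HOP_SAMPLES
# Number of audio frames that span one video frame.
audio_frames_per_video_frame = audio_frames_per_second // VIDEO_FPS


def min_max_normalise(values, train_min, train_max):
    """Scale values with training-set statistics; maps [train_min, train_max] to [0, 1]."""
    span = train_max - train_min
    return [(v - train_min) / span for v in values]
```

With these settings one hop lasts 5 ms, so a 40 ms video frame covers eight vocoder frames.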

As in [Vougioukas2019], videos are preprocessed as follows: first, the faces are aligned to the canonical face (we use the face processor library in https://github.com/DinoMan/face-processor, which makes use of [bulat2017far]); then, the video frames are normalised to a fixed range, resized to 128×96 pixels and, for the models that use only the mouth region as input, cropped preserving the bottom half; finally, the videos are mirrored with a probability of 0.5 during training.
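The last two preprocessing steps (mouth cropping and random mirroring) are simple enough to sketch in plain Python. This is not the authors' pipeline: frames are represented as lists of pixel rows purely for illustration, face alignment and resizing are left out, and the function names are hypothetical.

```python
import random


def crop_bottom_half(frame):
    """Keep the bottom half of a frame given as a list of pixel rows."""
    return frame[len(frame) // 2:]


def maybe_mirror(video, rng, probability=0.5):
    """Flip all frames of a video left-to-right with the given probability.

    The decision is taken once per video, so every frame is flipped consistently.
    """
    if rng.random() < probability:
        return [[row[::-1] for row in frame] for frame in video]
    return video
```

For example, `crop_bottom_half` applied to a frame with 128 rows of 96 pixels returns 64 rows of 96 pixels, and `maybe_mirror(video, random.Random())` would be called only on training videos.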

2.3 Architecture and Training Procedure

As shown in Figure 1, our network maps video frames of a speaker to vocoder features and consists of a video encoder, a recursive module and five decoders: SP decoder, AP decoder, VUV decoder, F0 decoder and VSR decoder. We also train variants without the VSR decoder to assess its impact on performance.

The video encoder is inspired by [Vougioukas2019]: it takes as input one video frame concatenated with the three previous and the three next frames and applies five 3-D convolutions (conv3D). Each of the first four convolutional layers is followed by batch normalisation (BN) [ioffe2015batch], ReLU activation and dropout [srivastava2014dropout], while the last one is followed by Tanh activation.

To model the sequential nature of video data, a recursive module is used: it consists of a single-layer gated recurrent unit (GRU) [cho2014learning], BN, ReLU activation and dropout.

Each decoder takes the GRU features as input. For every video frame, the SP decoder produces an eight-frame-long estimate, $\widehat{SP}$, of the normalised dimensionality-reduced SP through three 2-D transposed convolutions (convT2D), each followed by BN, ReLU activation and dropout, and another convT2D followed by ReLU activation.

The VUV decoder consists of a linear layer followed by ReLU activation. A threshold of 0.2 is applied to the output, obtaining $\widehat{VUV}$, an estimate of the VUV state, $VUV$.

The AP decoder has a structure similar to the SP decoder, with a total of three convT2D in this case. Its output, $O_{AP}$, together with $\widehat{VUV}$ is used to get $\widehat{A}$, an estimate of $A = J - AP$, where $J$ indicates an all-ones matrix with 5 rows and 8 columns, and $AP$ is the normalised dimensionality-reduced AP:

$$\widehat{A}_i = O_{AP,i} \odot \widehat{VUV} \qquad (1)$$

where $X_i$ indicates the $i$-th row of a matrix $X$ and $\odot$ denotes the element-wise product.

The F0 decoder has a linear layer followed by a sigmoid activation function. Its output, $O_{F0}$, is point-wise multiplied with $\widehat{VUV}$ to obtain $\widehat{F0}$, an estimate of the normalised F0:

$$\widehat{F0} = O_{F0} \odot \widehat{VUV} \qquad (2)$$
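The interaction between the VUV and F0 decoders can be illustrated with plain Python lists. This is a minimal sketch rather than the authors' implementation: it assumes that values strictly above the 0.2 threshold are marked as voiced (1.0) and the rest as unvoiced (0.0), and the function names are hypothetical.

```python
VUV_THRESHOLD = 0.2  # threshold stated in the text


def estimate_vuv(vuv_output, threshold=VUV_THRESHOLD):
    """Binarise the VUV decoder output: 1.0 for voiced, 0.0 for unvoiced frames."""
    return [1.0 if value > threshold else 0.0 for value in vuv_output]


def mask_f0(f0_output, vuv_estimate):
    """Multiply the F0 decoder output point-wise with the VUV estimate."""
    return [f0 * vuv for f0, vuv in zip(f0_output, vuv_estimate)]


# Eight audio frames for one video frame (made-up decoder outputs).
vuv_hat = estimate_vuv([0.0, 0.1, 0.9, 0.5, 0.3, 0.0, 0.7, 0.25])
f0_hat = mask_f0([0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7], vuv_hat)
```

The masking forces the estimated F0 to zero in every frame that is classified as unvoiced, so the F0 decoder only needs to be accurate on voiced frames.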

Finally, the VSR decoder, consisting of a linear layer and a softmax layer, outputs a CTC character that is used to predict the text transcription of the utterance.

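The system is trained to minimise the following loss:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{SP} + \lambda_2 \mathcal{L}_{AP} + \lambda_3 \mathcal{L}_{VUV} + \lambda_4 \mathcal{L}_{F0} + \lambda_5 \mathcal{L}_{VSR} \qquad (3)$$

where $\lambda_1, \dots, \lambda_5$ weight the individual terms, and:

  • $\mathcal{L}_{SP}$: mean squared error (MSE) between the estimated and the target spectral envelope.

  • $\mathcal{L}_{AP}$: MSE between the estimated and the target aperiodic parameter.

  • $\mathcal{L}_{VUV}$: MSE between the estimated and the target VUV state.

  • $\mathcal{L}_{F0}$: MSE between the estimated and the target F0.

  • $\mathcal{L}_{VSR}$: CTC loss [graves2006connectionist] between the target text transcription and the estimated one.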
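As a rough illustration of how such a multi-task objective is assembled, the sketch below combines four MSE terms and a CTC term, assuming the total loss is a weighted sum of the five terms. The weights, the dictionary keys and the numeric values are placeholders, not the values used in the paper, and the CTC loss is taken as an already computed number.

```python
def mse(estimate, target):
    """Mean squared error between two equally long sequences of numbers."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def total_loss(terms, weights):
    """Weighted sum of named loss terms (assumed form of the training objective)."""
    return sum(weights[name] * value for name, value in terms.items())


# Placeholder per-task losses and weights, for illustration only.
terms = {
    "sp": mse([0.0, 1.0], [0.0, 3.0]),  # 2.0
    "ap": 0.5,
    "vuv": 0.25,
    "f0": 0.25,
    "vsr": 1.0,  # CTC loss, assumed to be computed elsewhere
}
weights = {"sp": 1.0, "ap": 1.0, "vuv": 1.0, "f0": 1.0, "vsr": 1.0}
loss = total_loss(terms, weights)
```

Dropping the VSR entry from both dictionaries corresponds to the variants trained without the VSR decoder.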

Details regarding architecture and training hyperparameters can be found in Table 1.

Input size: given by $T$, $C$, $F$, $H$ and $W$ (see Extra Information).

Video Encoder

| Layer | Input Channels | Output Channels | Kernel Size | Stride | Padding |
|---|---|---|---|---|---|
| Conv3D | 3 | 64 | (7,4,4) | (1,2,2) | (0,1,1) |
| Conv3D | 64 | 128 | (1,4,4) | (1,2,2) | (0,1,1) |
| Conv3D | 128 | 256 | (1,4,4) | (1,$S$,2) | (0,1,1) |
| Conv3D | 256 | 512 | (1,4,4) | (1,2,2) | (0,1,1) |
| Conv3D | 512 | 128 | (1,$K$,6) | (1,1,1) | (0,0,0) |

Recursive Module

| Layer | Input Size | Hidden Size |
|---|---|---|
| GRU | 128 | 128 |

Spectral Envelope (SP) Decoder

| Layer | Input Channels | Output Channels | Kernel Size | Stride | Padding |
|---|---|---|---|---|---|
| ConvT2D | 128 | 256 | (1,6) | (1,1) | (0,0) |
| ConvT2D | 256 | 128 | (2,4) | (1,2) | (0,0) |
| ConvT2D | 128 | 64 | (4,4) | (1,2) | (0,0) |
| ConvT2D | 64 | 1 | (4,2) | (1,2) | (0,0) |

Aperiodic Parameter (AP) Decoder

| Layer | Input Channels | Output Channels | Kernel Size | Stride | Padding |
|---|---|---|---|---|---|
| ConvT2D | 128 | 128 | (4,1) | (1,1) | (0,0) |
| ConvT2D | 128 | 64 | (3,3) | (1,1) | (0,0) |
| ConvT2D | 64 | 1 | (3,3) | (1,1) | (0,0) |

Voiced-Unvoiced (VUV) Decoder

| Layer | Input Size | Output Size |
|---|---|---|
| Linear | 128 | 8 |

Fundamental Frequency (F0) Decoder

| Layer | Input Size | Output Size |
|---|---|---|
| Linear | 128 | 8 |

Visual Speech Recognition (VSR) Decoder

| Layer | Input Size | Output Size |
|---|---|---|
| Linear | 128 | 28 |

Extra Information

The system is implemented in PyTorch [NEURIPS2019_9015] and trained for $N$ iterations using the Adam optimizer [kingma2014adam] with a learning rate of 0.0001, $\beta_1$=0.5 and $\beta_2$=0.9. The model that performs the best in terms of PESQ on the validation set is used for testing.
$T$=75 (sequence length). $C$=3 (image channels). $F$=7 (consecutive video frames). $W$=96 (video frame width).
If the full face is used as input: $B$=16 (batch size). $H$=128 (video frame height). $S$=3. $K$=5.
If only the mouth is used as input: $B$=24 (batch size). $H$=64 (video frame height). $S$=2. $K$=4.
In the speaker dependent case, the dropout probability of each dropout layer is $p$=0.2. $N$=300000.
In the speaker independent case, $p$=0.5 for the video encoder and the GRU, and $p$=0.2 for the rest. $N$=185000.
Eight is the number of the output audio frames corresponding to the video frame used as input (together with its context).
The 28 CTC characters consist of the 26 letters of the English alphabet, one space character and one blank token.

Table 1: Architecture and training hyperparameters. Activation, batch normalisation and dropout omitted for brevity.
| | Input: Mouth | Input: Face |
|---|---|---|
| w/o VSR Decoder | vid2voc-M | vid2voc-F |
| w/ VSR Decoder | vid2voc-M-VSR | vid2voc-F-VSR |

Table 2: Systems used in this study.
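The layer settings in Table 1 can be sanity-checked with the standard output-size formulas for convolutions and transposed convolutions. The sketch below is plain Python arithmetic, not the authors' code. It assumes that the two input-dependent entries of the video encoder are the height stride of the third layer (3 for the face, 2 for the mouth) and the height kernel of the last layer (5 for the face, 4 for the mouth), following the order in which these values are listed, and that each decoder starts from a 1×1 feature map. All function and variable names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def encoder_output(frames, height, width, stride_h, kernel_h):
    """Propagate a (frames, height, width) shape through the five Conv3D layers."""
    # Each entry is (kernel, stride, padding), ordered as (time, height, width).
    layers = [
        ((7, 4, 4), (1, 2, 2), (0, 1, 1)),
        ((1, 4, 4), (1, 2, 2), (0, 1, 1)),
        ((1, 4, 4), (1, stride_h, 2), (0, 1, 1)),
        ((1, 4, 4), (1, 2, 2), (0, 1, 1)),
        ((1, kernel_h, 6), (1, 1, 1), (0, 0, 0)),
    ]
    shape = (frames, height, width)
    for kernel, stride, padding in layers:
        shape = tuple(
            conv_out(n, k, s, p) for n, k, s, p in zip(shape, kernel, stride, padding)
        )
    return shape


def decoder_output(layers):
    """Propagate a 1x1 map through ConvT2D layers given as (kernel, stride), padding 0."""
    shape = (1, 1)
    for kernel, stride in layers:
        shape = tuple(
            conv_transpose_out(n, k, s, 0) for n, k, s in zip(shape, kernel, stride)
        )
    return shape


SP_LAYERS = [((1, 6), (1, 1)), ((2, 4), (1, 2)), ((4, 4), (1, 2)), ((4, 2), (1, 2))]
AP_LAYERS = [((4, 1), (1, 1)), ((3, 3), (1, 1)), ((3, 3), (1, 1))]

face_shape = encoder_output(7, 128, 96, stride_h=3, kernel_h=5)
mouth_shape = encoder_output(7, 64, 96, stride_h=2, kernel_h=4)
sp_shape = decoder_output(SP_LAYERS)
ap_shape = decoder_output(AP_LAYERS)
```

Under these assumptions both input types collapse to a single 128-channel vector per time step, and the SP and AP decoders expand it to 8×60 and 8×5 maps, which matches the eight audio frames, 60 MFSCs and 5 AP coefficients described in the text.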

2.4 Waveform Reconstruction and Lipreading

The network outputs are used to reconstruct the speech waveform with the WORLD synthesis algorithm [morise2016world] and to get a text transcription adopting the best path CTC decoding scheme [graves2006connectionist].
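Best path CTC decoding takes the most probable class at every time step, merges consecutive repetitions and then removes the blank token. A minimal pure-Python sketch follows. The alphabet ordering, the `_` symbol used for the blank token and the helper names are assumptions made for illustration.

```python
BLANK = "_"  # stand-in symbol for the CTC blank token (hypothetical choice)
# 26 letters, one space character and one blank token: 28 classes in total.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", BLANK]


def one_hot(symbol, alphabet=ALPHABET):
    """Build a toy probability vector that puts all mass on one symbol."""
    return [1.0 if s == symbol else 0.0 for s in alphabet]


def best_path_decode(frame_probs, alphabet=ALPHABET, blank=BLANK):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    decoded = []
    previous = None
    for symbol in best:
        # Keep a symbol only if it differs from the previous frame and is not blank.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)
```

For example, the frame-level sequence `b b _ b i i n` decodes to `bbin`: the blank between the two runs of `b` is what allows a repeated letter to survive the collapsing step.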

2.5 Evaluation Metrics

The system is evaluated in terms of perceptual evaluation of speech quality (PESQ) [rix2001perceptual] and extended short-time objective intelligibility (ESTOI) [jensen2016algorithm], two of the most used measures that provide estimates of speech quality and speech intelligibility, respectively. PESQ scores are in the range from -0.5 to 4.5 and ESTOI scores practically lie between 0 and 1. In both cases, higher values correspond to better performance.

For the systems having the VSR decoder, we also provide the word error rate (WER), a standard metric for automatic speech recognition systems. In this case, lower values correspond to better performance.
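WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between the reference and the recognised sentence, divided by the number of reference words. The sketch below shows that computation. It is a generic implementation rather than the evaluation script used in the paper, and the example sentences are made up in the style of the GRID corpus.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("bin blue at f two now", "bin blue at s two now")
```

Because every GRID sentence has six words, a single wrong word in an utterance already contributes an error of one sixth for that utterance.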

3 Results and Discussion

As shown in Table 2, four systems are trained based on the input (mouth or full face) and the presence of the VSR decoder (only speech synthesis or speech synthesis and VSR).

The systems are compared with the recently proposed GAN-based approach in [Vougioukas2019]. As an additional baseline, we also report the PESQ score for [akbari2018lip2audspec], since this method, which makes use of bottleneck features extracted from auditory spectrograms, outperforms [Vougioukas2019] in terms of estimated speech quality for the speaker dependent case.

Mean scores for the speaker dependent (SD) and the speaker independent (SI) cases:

| Approach | SD PESQ | SD ESTOI | SD WER | SI PESQ | SI ESTOI | SI WER |
|---|---|---|---|---|---|---|
| Approach in [akbari2018lip2audspec] | 1.82 | - | - | - | - | - |
| Approach in [Vougioukas2019] | 1.71 | 0.329 | - | 1.24 | 0.198 | - |
| vid2voc-M | 1.89 | 0.448 | - | 1.20 | 0.214 | - |
| vid2voc-M-VSR | **1.90** | **0.455** | 15.1% | 1.23 | **0.227** | **51.6%** |
| vid2voc-F | 1.85 | 0.439 | - | 1.19 | 0.202 | - |
| vid2voc-F-VSR | 1.88 | 0.447 | **14.4%** | **1.25** | 0.210 | 69.3% |
| WORLD | 3.06 | 0.759 | - | 3.03 | 0.759 | - |

Value taken from the experiments in [Vougioukas2019].
WORLD indicates the reconstruction retrieved from the vocoder features of the clean speech signals and it is a performance upper bound of our systems.

Table 3: Results for the speaker dependent and the speaker independent cases. Best performance (except WORLD) in bold.

3.1 Speaker Dependent Case

Table 3 (left part) shows the speaker dependent results. We observe that our models outperform the approach in [Vougioukas2019] in terms of both PESQ and ESTOI by a considerable margin. Vougioukas et al. [Vougioukas2019] mention that their system produces low-power hum artefacts that affect the performance. They tried to solve the issue by applying average filtering to the output of their network, experiencing a rise of the PESQ score from 1.71 to 1.80 (not shown in Table 3), comparable to [akbari2018lip2audspec], but still appreciably lower than the results we achieve. However, this filtering negatively affected the intelligibility of the produced speech signals, and was not used in the final system.

Among the systems we developed (cf. Table 2), we observe that including the VSR decoder in the pipeline is beneficial for the speech reconstruction task (see Table 3). Moreover, using the mouth as input is not only sufficient to synthesise speech, but also yields higher estimated speech quality and intelligibility than the models that use the whole face of the speaker as input. This might be explained by the fact that handling an input with a larger dimensionality is harder if we want to keep roughly the same deep architecture with a similar number of parameters. However, when the whole face is used as input, the WER is slightly lower, indicating that there might be a performance trade-off between VSR and speech reconstruction that should be further investigated in future work in relation to other multi-task learning techniques.

Figure 2: Results of the vid2voc-M-VSR models for the speaker dependent (SD) and the speaker independent (SI) cases. Each marker indicates the mean score of a speaker.

3.2 Speaker Independent Case

Regarding the speaker independent scenario (cf. right part of Table 3), we observe that the performance gap between the approach in [Vougioukas2019] and our systems is not as large as for the speaker dependent case. Although our models appear to perform slightly better than [Vougioukas2019] in terms of ESTOI, the PESQ scores are similar. This can be explained by the fact that some speech characteristics, e.g. F0, cannot be easily estimated for unseen speakers. Since it is reasonable to think that people having similar facial characteristics (e.g. due to gender, age etc.) have similar speech characteristics (cf. [oh2019speech2face], where the face of a person was predicted from a speech signal), we expect that training a network with a dataset that includes more speakers might be beneficial: such a network can produce an average voice of speakers from the training set that share similar facial traits with an unseen talking face.

Among the systems we developed, the presence of the VSR decoder still gives an advantage for speech reconstruction. Unlike the speaker dependent case, the WER for the model that uses the whole face as input is higher than that of the system using only the mouth. This is due to the early stopping technique that we adopt, which tends to favour speech reconstruction over VSR, indicating again the trade-off between these two tasks.

Finally, Figure 2 shows the results for the vid2voc-M-VSR models by speaker. We can see that the spread of the scores is much higher for the speaker independent case, in particular for WER. This is in line with the observations reported in [Vougioukas2019], and suggests that performance differs between subjects whose facial traits substantially differ from those of the speakers in the training set and the remaining subjects.

4 Conclusion

In this study, we reconstructed speech from silent videos using a deep model that estimates WORLD vocoder features. We tested our approach in both speaker dependent and speaker independent scenarios. In both cases, we were able to obtain speech signals with estimated speech quality and intelligibility generally higher than those of a recently proposed GAN-based approach. In addition, we designed our system to simultaneously perform visual speech recognition by using a decoder that estimates CTC characters from a given video sequence.

Future work includes: (a) the adoption of self-paced multi-task learning techniques; (b) the improvement of the visual speech recognition performance, e.g. with a beam search decoding scheme; (c) the design of a system that can generalise well to unseen speakers in uncontrolled environments.

5 Acknowledgment

The authors would like to thank Konstantinos Vougioukas, Stavros Petridis, Pritish Chandna and Merlijn Blaauw.

This research is partially funded by: the William Demant Foundation; the TROMPA H2020 project (770376); the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the Social European Funds; the MICINN/FEDER UE project (PGC2018-098625-B-I00); the H2020-MSCA-RISE-2017 project (777826 NoMADS).

References