Facial animation plays a major role in computer generated imagery because the face is the primary outlet of information. The problem of generating realistic talking heads is multifaceted, requiring high-quality faces, lip movements synchronized with the audio, and plausible facial expressions. This is especially challenging because humans are adept at picking up subtle abnormalities in facial motion and audio-visual synchronization.
Of particular interest is speech-driven facial animation since speech acoustics are highly correlated with facial movements. These systems could simplify the film animation process through automatic generation from the voice acting. They can also be applied in movie dubbing to achieve better lip-syncing results. Moreover, they can be used to generate parts of the face that are occluded or missing in a scene. Finally, this technology can improve band-limited visual telecommunications by either generating the entire visual content based on the audio or filling in dropped frames.
The majority of research in this domain has focused on mapping audio features (e.g. MFCCs) to visual features (e.g. landmarks, visemes) and using computer graphics (CG) methods to generate realistic faces. Some methods avoid the use of CG by selecting frames from a person-specific database and combining them to form a video [3, 22]. Regardless of which approach is used, these methods are subject dependent and are often associated with a considerable overhead when transferring to new speakers.
Subject independent approaches have been proposed that transform audio features to video frames  but there is still no method to directly transform raw audio to video. Furthermore, many methods restrict the problem to generating only the mouth. Even techniques that generate the entire face are primarily focused on obtaining realistic lip movements, and typically neglect the importance of generating natural facial expressions.
Some methods generate frames based solely on present information , without taking into account the facial dynamics. This makes generating natural sequences, which are characterized by a seamless transition between frames, challenging. Some video generation methods have dealt with this problem by generating the entire sequence at once  or in small batches . However, this introduces a lag in the generation process, prohibiting their use in real-time applications and requiring fixed length sequences for training.
We propose a temporal generative adversarial network (GAN), capable of generating a video of a talking head from an audio signal and a single still image (see Fig. 1 ). First, our model captures the dynamics of the entire face producing not only synchronized mouth movements but also natural facial expressions, such as eyebrow raises, frowns and blinks. This is due to the use of an RNN-based generator and sequence discriminator, which also gives us the added advantage of handling variable length sequences. Natural facial expressions play a crucial role in achieving truly realistic results and their absence is often a clear tell-tale sign of generated videos. This is exploited by methods such as the one proposed in , which detects synthesized videos based on the existence of blinks.
Secondly, our method is subject independent, does not rely on handcrafted audio or visual features, and requires no post-processing. To the best of our knowledge, this is the first end-to-end technique that generates talking faces directly from the raw audio waveform.
The third contribution of this paper is a comprehensive assessment of the performance of the proposed method. An ablation study is performed on the model in order to quantify the effect of each component in the system. We measure the image quality using popular reconstruction and sharpness metrics, and compare it to a non-temporal approach. Additionally, we propose using lip reading techniques to verify the accuracy of the spoken words and face verification to ensure that the identity of the speaker is maintained throughout the sequence. Evaluation is performed in a subject independent way on the GRID  and TCD TIMIT  datasets, where our model achieves truly natural results. Finally, the realism of the videos is assessed through an online Turing test, where users are shown videos and asked to identify them as either real or generated.
2 Related Work
2.1 Speech-Driven Facial Animation
The problem of speech-driven video synthesis is not new in computer vision and has been the subject of interest for decades. Yehia et al. first examined the relationship between acoustics, vocal-tract and facial motion, showing a strong correlation between visual and audio features and a weak coupling between head motion and the fundamental frequency of the speech signal.
Some of the earliest methods for facial animation relied on hidden Markov models (HMMs) to capture the dynamics of the video and speech sequences. Simons and Cox used vector quantization to achieve a compact representation of video and audio features, which were used as the states for their HMM. The HMM was used to recover the most likely mouth shapes for a speech signal. A similar approach is used by Yamamoto et al. to estimate the sequence of lip parameters. Finally, the Video Rewrite method relies on the same principles to obtain a sequence of triphones that are used to look up mouth images from a database.
Although HMMs were initially preferred to neural networks due to their explicit breakdown of speech into intuitive states, recent advances in deep learning have resulted in neural networks being used in most modern approaches. Like past attempts, most of these methods aim at performing a feature-to-feature translation. A typical example of this is the work of Taylor et al., which uses a deep neural network (DNN) to transform a phoneme sequence into a sequence of shapes for the lower half of the face. Using phonemes instead of raw audio ensures that the method is subject independent. Similar deep architectures based on recurrent neural networks (RNNs) have been proposed in [9, 22]; these produce realistic results but are subject dependent and require retraining or re-targeting steps to adapt to new faces.
Convolutional neural networks (CNNs) are used by Karras et al. to transform audio features to 3D meshes of a specific person. This system is conceptually broken into sub-networks responsible for capturing articulation dynamics and estimating the 3D points of the mesh. Finally, Chung et al. proposed a CNN applied on Mel-frequency cepstral coefficients (MFCCs) that generates subject independent videos from an audio clip and a still frame. The method uses a pixel-level reconstruction loss, which results in blurry frames and is why a deblurring step is also required. This pixel-level loss also penalizes any deviation from the target video during training, providing no incentive for the model to produce spontaneous expressions and resulting in faces that are mostly static except for the mouth.
2.2 GAN-Based Video Synthesis
The recent introduction of GANs by Goodfellow et al. has shifted the focus of the machine learning community to generative modelling. GANs consist of two competing networks: a generative network and a discriminative network. The generator’s goal is to produce realistic samples and the discriminator’s goal is to distinguish between the real and generated samples. This competition eventually drives the generator to produce highly realistic samples. GANs are typically associated with image generation since the adversarial loss produces sharper, more detailed images compared to pixel-level reconstruction losses. However, GANs are not limited to these applications and can be extended to handle videos [16, 14, 25, 24].
Straightforward adaptations of GANs for videos are proposed in [25, 20], replacing the 2D convolutional layers with 3D convolutional layers. Using 3D convolutions in the generator and discriminator networks captures temporal dependencies but requires fixed-length videos. This limitation was overcome in  but constraints need to be imposed in the latent space to generate consistent videos.
The MoCoGAN system proposed in  uses an RNN-based generator, with separate latent spaces for motion and content. This relies on the empirical evidence shown in  that GANs perform better when the latent space is disentangled. MoCoGAN uses a 2D and 3D CNN discriminator to judge frames and sequences respectively. A sliding window approach is used so that the 3D CNN discriminator can handle variable length sequences.
GANs have also been used in a variety of cross-modal applications, including text-to-video and audio-to-video. The text-to-video model proposed in  uses a combination of variational auto encoders (VAE) and GANs in its generating network and a 3D CNN as a sequence discriminator. Finally, Chen et al.  propose a GAN-based encoder-decoder architecture that uses CNNs in order to convert audio spectrograms to frames and vice versa.
3 End-to-End Speech-Driven Facial Synthesis
The proposed architecture for speech-driven facial synthesis is shown in Fig. 2. The system is made up of a generator and 2 discriminators, each of which evaluates the generated sequence from a different perspective. The capability of the generator to capture various aspects of natural sequences is directly proportional to the ability of each discriminator to discern videos based on them.
The inputs to the generator networks consist of a single image and an audio signal, which is divided into overlapping frames each corresponding to seconds. The generator network in this architecture can be conceptually divided into subnetworks as shown in Fig. 3. Using an RNN-based generator allows us to synthesize videos frame-by-frame, which is necessary for real-time applications.
3.1.1 Identity Encoder
3.1.2 Context Encoder
Audio frames are encoded using a network consisting of 1D convolutions followed by batch normalization and ReLU activations. The initial convolutional layer starts with a large kernel, as recommended by Dai et al., which helps limit the depth of the network while ensuring that the low-level features are meaningful. Subsequent layers use smaller kernels until an embedding of the desired size is achieved. The audio frame encodings are input into a 2-layer GRU, which produces a context encoding with elements.
3.1.3 Frame Decoder
The identity encoding is concatenated to the context encoding and a noise component to form the latent representation. The 10-dimensional vector is obtained from a Noise Generator, which is a 1-layer GRU that takes Gaussian noise as input. The Frame Decoder is a CNN that uses strided transposed convolutions to produce the video frames from the latent representation. A U-Net  architecture is used with skip connections between the Identity Encoder and the Frame Decoder to help preserve the identity of the subject.
(b) Sequence Discriminator, consisting of an audio encoder, an image encoder, GRUs and a small classifier.
Our system has two different types of discriminator. The Frame Discriminator helps achieve a high-quality reconstruction of the speakers’ face throughout the video. The Sequence Discriminator ensures that the frames form a cohesive video which exhibits natural movements and is synchronized with the audio.
3.2.1 Frame Discriminator
The Frame Discriminator is a 6-layer CNN that determines whether a frame is real or not. Adversarial training with this discriminator ensures that the generated frames are realistic. The original still frame is used as a condition in this network, concatenated channel-wise to the target frame to form the input as shown in Fig. 3. This enforces the person’s identity on the frame.
3.2.2 Sequence Discriminator
The Sequence Discriminator presented in Fig. 3 distinguishes between real and synthetic videos. The discriminator receives a frame at every time step, which is encoded using a CNN and then fed into a 2-layer GRU. A small (2-layer) classifier is used at the end of the sequence to determine if the sequence is real or not. The audio is added as a conditional input to the network, allowing this discriminator to classify speech-video pairs.
The Frame discriminator () is trained on frames that are sampled uniformly from a video using a sampling function . The Sequence discriminator (), classifies based on the entire sequence and audio . The loss of our GAN is an aggregate of the losses associated with each discriminator as shown in Equation 1.
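As an illustration of how the two discriminator terms might be aggregated, the Python sketch below assumes the standard GAN log-likelihood objective and treats the discriminator outputs as plain probabilities. The function names `sample_frames` and `adversarial_loss` are hypothetical, and the exact form of Equation 1 may differ from this sketch.

```python
import math
import random


def sample_frames(video, num_samples, rng):
    """Sampling function: pick frames uniformly at random, without replacement."""
    return rng.sample(video, num_samples)


def adversarial_loss(d_img_real, d_img_fake, d_seq_real, d_seq_fake):
    """Aggregate objective built from discriminator probabilities.

    d_img_real / d_img_fake: per-frame probabilities (frame judged real) for
    sampled real and generated frames.
    d_seq_real / d_seq_fake: one probability each for the whole real and
    generated sequence, conditioned on the audio.
    Assumption: standard GAN log loss; the discriminators maximize this value.
    """
    eps = 1e-12  # guards against log(0)
    frame_term = (
        sum(math.log(p + eps) for p in d_img_real) / len(d_img_real)
        + sum(math.log(1.0 - p + eps) for p in d_img_fake) / len(d_img_fake)
    )
    seq_term = math.log(d_seq_real + eps) + math.log(1.0 - d_seq_fake + eps)
    return frame_term + seq_term


# Example: sample 5 of 75 frame indices, then score a fully confused discriminator.
rng = random.Random(0)
frame_indices = sample_frames(list(range(75)), 5, rng)
confused = adversarial_loss([0.5] * 5, [0.5] * 5, 0.5, 0.5)
```

A discriminator that outputs 0.5 everywhere yields four times log(0.5), which is the value the objective approaches when real and generated samples cannot be told apart.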
A reconstruction loss is also used to improve the synchronization of the mouth movements. However, we only apply the reconstruction loss to the lower half of the image since it discourages the generation of facial expressions. For a ground truth frame and a generated frame with dimensions the reconstruction loss at the pixel level is:
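A minimal Python sketch of a reconstruction loss restricted to the lower half of a frame is given here for illustration. It assumes an absolute-difference (L1) penalty, a single-channel frame stored as a list of rows with the top row first, and a hypothetical function name `lower_half_l1`; none of these details are confirmed by the text.

```python
def lower_half_l1(target, generated):
    """Sum of absolute pixel differences over the lower half of a frame.

    Both frames are H x W lists of rows (top row first), so rows
    H // 2 .. H - 1 make up the lower half. The L1 penalty is an assumption.
    """
    height = len(target)
    if height != len(generated):
        raise ValueError("frames must have the same height")
    total = 0.0
    for row_t, row_g in zip(target[height // 2:], generated[height // 2:]):
        if len(row_t) != len(row_g):
            raise ValueError("frames must have the same width")
        total += sum(abs(t - g) for t, g in zip(row_t, row_g))
    return total


# Differences in the upper half (e.g. eyebrows, eyes) do not contribute.
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
generated = [[1.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
loss = lower_half_l1(target, generated)
```

Masking the upper half in this way leaves expressions around the eyes and eyebrows to be shaped only by the adversarial terms.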
The final objective is to obtain the optimal generator, which satisfies Equation 3. The model is trained until no improvement is observed on the reconstruction metrics on the validation set for 10 epochs. The hyperparameter controls the contribution of each loss factor and was set to following a tuning procedure on the validation set.
We used Adam  for all the networks with a learning rate of for the Generator and Frame Discriminator which decay after epoch 20 with a rate of . The Sequence Discriminator uses a smaller fixed learning rate of .
Our model is implemented in PyTorch and takes approximately a week to train using an Nvidia GeForce GTX 1080 Ti GPU. During inference the average generation time per frame is 7ms on the GPU, permitting the use of our method in real-time applications. A sequence of 75 frames can be synthesized in 0.5s. The frame and sequence generation times increase to 1s and 15s respectively when processing is done on the CPU.
The GRID dataset has 33 speakers each uttering 1000 short phrases, containing 6 words taken from a limited dictionary. The TCD TIMIT dataset has 59 speakers uttering approximately 100 phonetically rich sentences each. We use the recommended data split for the TCD TIMIT dataset but exclude some of the test speakers and use them as a validation set. For the GRID dataset speakers are divided into training, validation and test sets with a split respectively. As part of our preprocessing all faces are aligned to the canonical face and images are normalized. We increase the size of the training set by mirroring the training videos. The amount of data used for training and testing is presented in Table 1.
|Dataset||Samples (Tr)||Hours (Tr)||Samples (V)||Hours (V)||Samples (T)||Hours (T)|
We use common reconstruction metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index to evaluate the generated videos. During the evaluation it is important to take into account the fact that reconstruction metrics penalize videos for any spontaneous expression. The frame sharpness is evaluated using the cumulative probability blur detection (CPBD) measure, which determines blur based on the presence of edges in the image, and the frequency domain blurriness measure (FDBM) proposed by De and Masilamani, which is based on the spectrum of the image. For these metrics larger values imply better quality.
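For reference, the PSNR between a ground truth frame and a generated frame can be computed from the mean squared error. The sketch below uses the standard definition and assumes single-channel frames given as lists of rows with 8-bit pixel values (maximum 255); the function name `psnr` and the data layout are illustrative choices.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    squared_error = 0.0
    count = 0
    for row_r, row_g in zip(reference, generated):
        for r, g in zip(row_r, row_g):
            squared_error += (r - g) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one intensity level gives 20 * log10(255) dB.
reference = [[10, 20], [30, 40]]
generated = [[11, 21], [31, 41]]
score = psnr(reference, generated)
```

Because the score depends only on pixel-wise differences, a blink or eyebrow raise that is absent from the ground truth lowers the PSNR even when it looks natural.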
The content of the videos is evaluated based on how well the video captures the identity of the target and on the accuracy of the spoken words. We verify the identity of the speaker using the average content distance (ACD), which measures the average Euclidean distance of the still image representation, obtained using OpenFace, from the representation of the generated frames. The accuracy of the spoken message is measured using the word error rate (WER) achieved by a pre-trained lip-reading model. We use the LipNet model, which surpasses the performance of human lipreaders on the GRID dataset. For both content metrics lower values indicate better accuracy.
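Both content metrics are simple to state in code. The sketch below assumes that face embeddings are already available as plain lists of floats (standing in for OpenFace representations) and that the lip-reading output is a space-separated transcript. The function names are hypothetical, and the WER uses the usual word-level edit distance.

```python
import math


def average_content_distance(still_embedding, frame_embeddings):
    """Mean Euclidean distance between the still image and each generated frame."""
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(still_embedding, emb)))
        for emb in frame_embeddings
    ]
    return sum(distances) / len(distances)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


acd = average_content_distance([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]])
wer = word_error_rate("bin blue at f two now", "bin blue at f nine now")
```

In this toy example one of six words is substituted, so the WER is 1/6, and the two frame embeddings lie at distances 5 and 0 from the still image, giving an ACD of 2.5.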
4.3 Ablation Study
In order to quantify the effect of each component of our system we perform an ablation study on the GRID dataset (see Table 2). We use the metrics from section 4.2 and a pre-trained LipNet model which achieves a WER of on the ground truth videos. The average value of the ACD for ground truth videos of the same person is whereas for different speakers it is . The loss achieves slightly better PSNR and SSIM results, which is expected as it does not generate spontaneous expressions, which are penalized by these metrics unless they happen to coincide with those in ground truth videos. This variation introduced when generating expressions is likely the reason for the small increase in ACD. The blurriness is minimized when using the adversarial loss as indicated by the higher FDBM and CPBD scores and Fig. 4. Finally, the effect of the sequence discriminator is shown in the lip-reading result achieving a low WER.
|Ground Truth Videos||N/A||N/A||0.121||0.281||0.74||21.40%|
4.4 Qualitative Results
Our method is capable of producing realistic videos of previously unseen faces and audio clips taken from the test set. The examples in Fig. 5 show the same face animated using sentences from different subjects (male and female). The same audio used on different identities is shown in Fig. 6. From visual inspection it is evident that the lips are consistently moving similarly to the ground truth video. Our method not only produces accurate lip movements but also natural videos that display characteristic human expressions such as frowns and blinks, examples of which are shown in Fig. 7.
The works that are closest to ours are those proposed in  and . The former method is subject dependent and requires a large amount of data for a specific person to generate videos. For the latter method there is no publicly available implementation, so we compare our model to a static method that produces video frames using a sliding window of audio samples like that used in . This is a GAN-based method that uses a combination of a reconstruction loss and an adversarial loss on individual frames. We will also use this method as the baseline for our quantitative assessment in the following section. This baseline produces less coherent sequences, characterized by jitter, which becomes worse in cases where the audio is silent (e.g. pauses between words). This is likely because multiple mouth shapes correspond to silence and, since the model has no knowledge of its past state, it generates them at random. Fig. 8 shows a comparison between our approach and the baseline in such cases.
4.5 Quantitative Results
We further evaluate the realism of the generated videos through an online Turing test. In this test users are shown 10 videos, which were chosen at random from GRID and TIMIT consisting of 6 fake videos and 4 real ones. Users are shown the videos in sequence and are asked to label them as real or fake. Responses from 153 users were collected with the average user labeling correctly 63% of the videos. The distribution of user scores is shown in Fig. 9.
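To put the 63% average into context, it helps to look at what pure guessing would produce on a 10-video test. The sketch below assumes each answer is an independent coin flip with probability 0.5 of being correct, which is an idealization and not a model of the actual participants; the function name is hypothetical.

```python
from math import comb


def chance_score_distribution(num_videos=10, p_correct=0.5):
    """Binomial probabilities of getting k = 0..num_videos answers right."""
    return [
        comb(num_videos, k) * p_correct ** k * (1 - p_correct) ** (num_videos - k)
        for k in range(num_videos + 1)
    ]


dist = chance_score_distribution()
# Expected fraction of correct answers under guessing.
expected_accuracy = sum(k * p for k, p in enumerate(dist)) / 10
# Probability that a single guessing user labels all 10 videos correctly.
p_perfect = dist[-1]
# Expected number of perfect scores among 153 guessing users.
expected_perfect_users = 153 * p_perfect
```

Under these assumptions the expected accuracy is 50% and a perfect score has probability 1/1024, so fewer than one perfect score would be expected among 153 users who guess.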
5 Conclusion and Future Work
In this work we have presented an end-to-end model using temporal GANs for speech-driven facial animation. Our model is capable of producing highly detailed frames scoring high in terms of PSNR, SSIM and in terms of the sharpness measures on both datasets. According to our ablation study this can be mainly attributed to the use of a Frame Discriminator.
Furthermore, our method produces more coherent sequences and more accurate mouth movements compared to the static approach, as demonstrated by a resounding user preference and the difference in the WER. We believe that these improvements are not only a result of using a temporal generator but also due to the use of the conditional Sequence Discriminator. Unlike previous approaches  that prohibit the generation of facial expressions, the adversarial loss on the entire sequence encourages spontaneous facial gestures. This has been demonstrated with examples of blinks and frowns. All of the above factors make the videos generated using our approach difficult to separate from real videos as revealed from the Turing test results, with the average user scoring only slightly better than chance. It is also noteworthy that no user was able to perfectly classify the videos.
This model has shown promising results in generating lifelike videos. Moving forward, we believe that different architectures for the sequence discriminator could help produce more natural sequences. Finally, at the moment expressions are generated randomly by the model, so a natural extension of this method would attempt to also capture the mood of the speaker from their voice and reflect it in the facial expressions.
This work has been funded by the European Community Horizon 2020 under grant agreement no. 645094 (SEWA).
Appendix A Supplementary Material
Details regarding the network architecture that were not included in the paper due to lack of space are included here.
A.1 Audio Preprocessing
The sequence of audio samples is divided into overlapping audio frames in a way that ensures a one-to-one correspondence with the video frames. In order to achieve this we pad the audio sequence on both ends and use the following formula for the stride:
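One way to obtain such a one-to-one correspondence is sketched below in Python. It assumes the stride equals the number of audio samples per video frame (sample rate divided by frame rate) and that the padding centres each window on its video frame. These choices, the function name `frame_audio`, and the example values (16 kHz audio, 25 fps video, 0.2 s windows) are illustrative assumptions rather than the settings used in the paper.

```python
def frame_audio(samples, sample_rate, video_fps, window_seconds):
    """Split audio into overlapping windows, one window per video frame."""
    stride = sample_rate // video_fps  # assumed stride: samples per video frame
    window = int(round(window_seconds * sample_rate))
    if window < stride:
        raise ValueError("window must be at least one stride long")
    num_frames = len(samples) // stride
    # Zero-pad both ends so that window i is centred on video frame i.
    pad_left = (window - stride) // 2
    pad_right = (window - stride) - pad_left
    padded = [0.0] * pad_left + list(samples) + [0.0] * pad_right
    return [padded[i * stride: i * stride + window] for i in range(num_frames)]


# Three seconds of audio at 16 kHz paired with 25 fps video gives 75 windows.
audio = [float(i + 1) for i in range(48000)]
frames = frame_audio(audio, sample_rate=16000, video_fps=25, window_seconds=0.2)
```

With these values the stride is 640 samples and each window holds 3200 samples, so consecutive windows overlap by 2560 samples while the number of windows matches the 75 video frames.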
A.2 Network Architecture
This section describes, in detail, the architecture of the networks used in our temporal GAN. All our networks use activations except for the final layers. The encoders and generator use the hyperbolic tangent activation to ensure that their output lies in the set and the discriminator uses a Sigmoid activation.
A.3 Audio Encoder
The Audio Encoder network obtains features for each audio frame. It is made up of 7 layers and produces an encoding of size . This encoding is fed into a 2-layer GRU, which produces the final context encoding.
A.3.1 Noise Generator
A.3.2 Identity Encoder and Frame Decoder
The Identity Encoder is responsible for capturing the identity of the speaker from the still image. The Identity Encoder is a 6-layer CNN which produces an identity encoding of size . This information is concatenated to the context encoding and the noise vector at every instant and fed as input to the Frame Decoder, which generates a frame of the sequence. The Frame Decoder is a 6-layer CNN that uses strided transposed convolutions to generate frames. The Identity Encoder - Frame Decoder architecture is shown in Fig. 12.
The model is evaluated on the GRID and TCD TIMIT datasets. The subjects used for training, validation and testing are shown in Table 4.
|Dataset||Training||Validation||Testing|
|GRID||1, 3, 5, 6, 7, 8, 10, 12, 14, 16, 17, 22, 26, 28, 32||9, 20, 23, 27, 29, 30, 34||2, 4, 11, 13, 15, 18, 19, 25, 31, 33|
|TCD TIMIT||1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 16, 17, 19, 20, 21, 22, 23, 24, 26, 27, 29, 30, 31, 32, 35, 37, 38, 39, 40, 42, 43, 46, 47, 48, 50, 51, 52, 53, 57, 59||34, 36, 44, 45, 49, 54, 58||8, 9, 15, 18, 25, 28, 33, 41, 55, 56|
- Amos et al.  B. Amos, B. Ludwiczuk, and M. Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report 118, 2016.
- Assael et al.  Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. LipNet: End-to-End Sentence-level Lipreading. arXiv preprint arXiv:1611.01599, 2016.
- Bregler et al.  C. Bregler, M. Covell, and M. Slaney. Video Rewrite. In Conference on Computer Graphics and Interactive Techniques, pages 353–360, 1997.
- Chen and Rao  T. Chen and R. R. Rao. Audio-Visual Integration in Multimodal Communication. Proceedings of the IEEE, 86(5):837–852, 1998.
- Chung et al.  J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? In British Machine Vision Conference (BMVC), pages 1–12, 2017.
- Cooke et al.  M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424, 2006.
- Dai et al.  W. Dai, C. Dai, S. Qu, J. Li, and S. Das. Very Deep Convolutional Neural Networks for Raw Waveforms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 421–425, 2017.
- De and Masilamani  K. De and V. Masilamani. Image Sharpness Measure for Blurred Images in Frequency Domain. Procedia Engineering, 64:149–158, 1 2013.
- Fan et al.  B. Fan, L. Wang, F. Soong, and L. Xie. Photo-real talking head with deep bidirectional lstm. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4884–4888, 2015.
- Goodfellow et al.  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. In Advances in neural information processing systems (NIPS), pages 2672–2680, 2014.
- Harte and Gillen  N. Harte and E. Gillen. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5):603–615, 2015.
- Karras et al.  T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(94), 2017.
- Kingma and Ba  D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
- Li et al.  Y. Li, M. R. Min, D. Shen, D. Carlson, and L. Carin. Video Generation From Text. arXiv preprint arXiv:1710.00421, 2017.
- Li et al.  Y. Li, M. Chang, and S. Lyu. In Ictu Oculi : Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. arXiv preprint arXiv:1806.02877, 2018.
- Mathieu et al.  M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
- Narvekar and Karam  N. D. Narvekar and L. J. Karam. A no-reference perceptual image sharpness metric based on a cumulative probability of blur detection. International Workshop on Quality of Multimedia Experience (QoMEx), 20(9):87–91, 2009.
- Radford et al.  A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434, 2015.
- Ronneberger et al.  O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241, 2015.
- Saito et al.  M. Saito, E. Matsumoto, and S. Saito. Temporal Generative Adversarial Nets with Singular Value Clipping. In IEEE International Conference on Computer Vision (ICCV), pages 2830–2839, 2017.
- Simons and Cox  A. D. Simons and S. J. Cox. Generation of mouthshapes for a synthetic talking head. Proceedings of the Institute of Acoustics, Autumn Meeting, 12(January):475–482, 1990.
- Suwajanakorn et al.  S. Suwajanakorn, S. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: Learning Lip Sync from Audio Output Obama Video. ACM Transactions on Graphics (TOG), 36(95), 2017.
- Taylor et al.  S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(93), 2017.
- Tulyakov et al.  S. Tulyakov, M. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing Motion and Content for Video Generation. arXiv preprint arXiv:1707.04993, 2017.
- Vondrick et al.  C. Vondrick, H. Pirsiavash, and A. Torralba. Generating Videos with Scene Dynamics. In Advances In Neural Information Processing Systems (NIPS), pages 613–621, 2016.
- Yamamoto et al.  E. Yamamoto, S. Nakamura, and K. Shikano. Lip movement synthesis from speech based on hidden Markov Models. Speech Communication, 26(1-2):105–115, 1998.
- Yehia et al.  H. Yehia, P. Rubin, and E. Vatikiotis-Bateson. Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1-2):23–43, 1998.
- Yehia et al.  H. C. Yehia, T. Kuratate, and E. Vatikiotis-Bateson. Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30(3):555–568, 2002.