Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks

03/20/2018 ∙ by Seyed Ali Jalalifar, et al. ∙ Sharif Accelerator Ghent University

We present a novel approach to generating photo-realistic images of a face with accurate lip sync, given an audio input. Using a recurrent neural network, we predict mouth landmarks from audio features. We then exploit the power of conditional generative adversarial networks to produce highly realistic faces conditioned on a set of landmarks. Together, these two networks are capable of producing a sequence of natural faces in sync with an input audio track.




1 Introduction

Creating talking heads from audio input is interesting from both scientific and practical viewpoints, e.g. constructing virtual computer-generated characters, aiding hearing-impaired people, and live dubbing of videos with translated audio. Because of this wide variety of applications, audio-to-video mapping has been the focus of intensive research in recent years [1, 2, 3, 4]. Mapping audio to facial images with accurate lip sync is an extremely difficult task, both because it is a mapping from a 1-dimensional to a 3-dimensional space and because humans are expert at detecting any lip movement that is out of sync with the audio.

Facial reenactment has seen considerable progress recently [5, 6, 7, 8]. Approaches to photo-realistic facial reenactment usually rely on computer graphics methods to produce high-quality results. Suwajanakorn et al. [1] generate photo-realistic mouth texture directly from audio using compositing techniques. In [5], the facial expressions of the target video are animated by a source actor through deformation transfer between source and target. Although these methods usually produce highly realistic reenactment, they suffer from occasional failures. A major challenge for them is synthesizing realistic teeth, because of the subtle details in the mouth region. Unlike these approaches, we propose using pure machine learning techniques for facial reenactment, which we believe is more flexible and simpler to implement. By using generative adversarial networks, our model learns the manifold of human faces and lip movements, which greatly helps in avoiding the uncanny valley (objects that closely resemble humans but differ in small details elicit a sense of unfamiliarity, while less similar objects look more familiar).

Generative adversarial networks (GANs), first introduced by Goodfellow et al. [9], are powerful tools for learning image manifolds. They have shown great potential in mimicking the underlying distribution of data, and have produced visually impressive results by sampling random images from the image manifold [10, 11, 12, 13]. Despite their power, GANs are notorious for their uncontrollable output, because the input space is entangled and there is no control over the modes of the data being generated. This was the impetus behind conditional generative adversarial networks [14], which offer some control over the output. Other approaches have also been proposed to learn disentangled, interpretable representations in an unsupervised [15, 16] or supervised [17] manner.

We evaluated several extensions of generative adversarial networks and found that the conditional GAN best suits our problem. We exploited the power of conditional GANs to generate natural faces conditioned on a set of mouth landmarks. Another network, based on an LSTM structure, is trained to produce mouth landmarks from an audio input. By combining these two networks, our model is capable of generating natural faces with accurate lip sync. To the best of our knowledge, this is the first time that C-GANs have been applied to the problem of audio-to-video mapping. The closest work to ours is [1], but unlike our method, they composited mouth texture with proper 3D pose matching for accurate lip syncing. We do a case study on a specific person, President Barack Obama, because of the huge volume of data available from his weekly address and because the videos are online and in the public domain. Generating talking heads of other people is easily achievable using the same pipeline, given that enough data is available.

2 Related Work

The related work can be divided into two categories: creating accurate lip sync given an audio input, and manipulating faces using generative adversarial networks.

Figure 1: Artificial faces of Obama, created entirely from audio input.

2.1 Lip Syncing from Audio

Approaches to automatically generating natural-looking speech animation usually involve manipulating 3D computer-generated faces [18, 19, 20]. It was not until recently that highly realistic facial reenactment became achievable [1, 5]. The typical procedure for generating lip sync from audio consists of extracting features from raw audio [1] or extracting phonemes [2]. A mapping from audio features to a 3D face model for avatars is then learned. In the case of facial reenactment, appropriate facial texture is created from the audio features. Taylor et al. [2] proposed a sliding window predictor that learns an arbitrary non-linear mapping from a phoneme label input sequence to mouth movements. Anderson et al. [21] proposed a pipeline for generating text-driven 3D talking heads from a limited number of 3D scans, using an Active Appearance Model (AAM) to construct 2D talking heads first and then creating 3D models from them. One of the first highly realistic facial reenactment approaches, Face2Face, was introduced by Thies et al. [5]. They proposed a new approach for real-time facial reenactment using monocular video sequences from a source and a target actor. Based on their work, Suwajanakorn et al. [1] introduced a new method for creating talking heads from an audio input using compositing techniques. While Face2Face transfers the mouth from another video sequence, Suwajanakorn et al. synthesize the mouth shape directly from audio.

Although our work is similar to [1] in application, the methods used are fundamentally different. Conventional approaches to facial reenactment rely heavily on computer graphics methods, which are prone to generating uncanny faces because they lack an understanding of the human face manifold. These methods also need to overcome challenges related to synthesizing realistic teeth. Unlike these approaches, we propose a new pipeline for generating highly realistic videos with accurate lip sync from audio by learning the human face manifold. This greatly reduces the complications that typical methods have to deal with and also helps prevent occasional failures.

2.2 Generative Adversarial Networks

Generative adversarial networks have recently received an increasing amount of attention and have produced promising results, especially in image generation [10, 11, 12, 13, 22] and video generation [23]. The power of these networks is that they produce visually impressive outputs because they learn the underlying distribution of the data. They have opened a new door in the field of image editing. Efforts to edit faces in latent space usually rely on supervised or unsupervised methods to disentangle the latent space. Chen et al. [15] proposed an information-theoretic extension of GANs, InfoGAN, which is able to learn a disentangled representation of the latent space in a completely unsupervised manner. This is done by maximizing the mutual information between some latent variables and the observation. They tested their approach on the CelebA dataset and managed to control pose, presence or absence of glasses, hair style and emotion of the generated face images. Semi-Latent GAN, proposed by Yin et al. [17], learns to generate and modify images from attributes by decomposing the noise of a GAN into two parts: user-defined attributes and latent attributes, which are obtained from the data.

GAN-based conditional image generation has also been the focus of research in recent years. In conditional GANs, both generator and discriminator are provided with class information. Ma et al. in [12] proposed a method for pose guided person image generation conditioned on a specific pose. Kaneko et al. [24] presented a generative attribute controller by utilizing conditional filtered generative adversarial networks. In this paper, we use conditional GANs to generate facial images, given a set of landmarks. Our model is capable of generating faces with accurate alignment with given landmarks. Another LSTM network learns to predict facial landmark positions from audio features. These two networks together are able to generate realistic facial images with accurate lip sync, given an audio input.

Figure 2: An overview of the proposed system. First an LSTM network is trained with audio features as input and lip landmark positions as labels. A C-GAN is trained to produce highly-realistic faces with respect to a given set of landmarks. Finally, these two networks together are able to produce convincing faces from an audio track.

3 System Overview

An overview of the system is shown in Fig. 2. At the heart of our system is a conditional GAN, which is trained to produce highly realistic facial images conditioned on a given set of lip landmarks. An LSTM network is used to create lip landmarks from the audio input. Here we briefly introduce the networks implemented in our system.

3.1 LSTM

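Long Short-Term Memory networks, first introduced by Hochreiter et al. [25], are a special type of recurrent neural network (RNN). Unlike feed-forward networks, RNNs have the ability to connect previous information to the present task. While plain RNNs are not capable of handling long-term dependencies [26], LSTMs are explicitly designed to handle such situations. The computation within an LSTM cell can be described as:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $x_t$, $h_{t-1}$ and $c_{t-1}$ are the inputs to the LSTM. $W_f$, $W_i$, $W_o$, $W_c$, $b_f$, $b_i$, $b_o$ and $b_c$ are trainable parameters. $\sigma$ is the sigmoid activation function and $\odot$ denotes element-wise multiplication. $f_t$, $i_t$ and $o_t$ are the forget, input and output gates of a standard LSTM unit, which control the contribution of historical information to the current decision. The outputs of an LSTM cell are $h_t$ and $c_t$.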
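The gate computation of a standard LSTM cell can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the network used in the paper: the hidden size of 8, the weight initialization and the dictionary layout of `params` are hypothetical, while the 28-D input matches the audio feature size described later in Section 4.2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a standard LSTM cell.

    params holds W_f, W_i, W_o, W_c (hidden x (hidden + input))
    and b_f, b_i, b_o, b_c (hidden,).
    """
    v = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ v + params["b_f"])  # forget gate
    i_t = sigmoid(params["W_i"] @ v + params["b_i"])  # input gate
    o_t = sigmoid(params["W_o"] @ v + params["b_o"])  # output gate
    c_cand = np.tanh(params["W_c"] @ v + params["b_c"])
    c_t = f_t * c_prev + i_t * c_cand                 # new cell state
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initialization: small Gaussian weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fioc":
        params["W_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim + input_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

# Run the cell over five random 28-D feature vectors.
params = init_params(input_dim=28, hidden_dim=8)
h, c = np.zeros(8), np.zeros(8)
for x_t in np.random.default_rng(1).normal(size=(5, 28)):
    h, c = lstm_step(x_t, h, c, params)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of the hidden state stays strictly inside (-1, 1), whatever the input scale.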


3.2 Conditional Generative Adversarial Networks

Typically, a GAN consists of two networks, a generator and a discriminator. The generator network $G$ tries to fool the discriminator $D$ by creating samples that look as if they come from the real data distribution. It is the discriminator's task to distinguish between fake and real samples, while the generator tries to learn the true distribution of the data in order to fool the discriminator. As training goes on, the discriminator becomes better and better at separating real and fake samples, so the generator has to produce more realistic samples to deceive it. This leads to a two-player min-max game with the value function $V(D, G)$:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
$$

GANs find a mapping from a prior noise distribution $p_z(z)$ to the data space. Since the values of the latent code $z$ are picked randomly from a distribution, there is no control over the output of the generator. Conditioning both the discriminator and the generator on some extra information $y$ offers some control over the output [14]. $y$ could be any kind of auxiliary information, such as a class label or, as in our case, landmark positions. In the case of conditional GANs, the objective function of the two-player min-max game is:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]
$$

After training the GAN, the discriminator network is discarded and only the generator is used for creating realistic facial images.
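To make the conditional value function concrete, the sketch below estimates it from a handful of discriminator outputs. The function name `cgan_value` and the example probabilities are hypothetical, and a small epsilon is added inside the logarithms purely for numerical safety. The discriminator wants this value high and the generator wants it low.

```python
import math

def cgan_value(d_real, d_fake):
    """Monte Carlo estimate of the conditional GAN value function.

    d_real: discriminator outputs D(x|y) on real samples.
    d_fake: discriminator outputs on generated samples G(z|y).
    """
    eps = 1e-12  # guards against log(0)
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well gives a value near 0.
confident = cgan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# A fully fooled discriminator outputs 0.5 everywhere, giving 2 * log(0.5).
fooled = cgan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The fooled case corresponds to the equilibrium of the game, where the discriminator can no longer tell generated faces from real ones.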

4 Approach

Mapping a sequence of audio to a sequence of images is inherently difficult because of the ambiguities in mapping from a low-dimensional to a high-dimensional space. Our ultimate goal is to estimate the distribution $P(I_t \mid A)$, where $I_t$ is the image at frame $t$ and $A = (a_1, \dots, a_n)$ is the audio feature vector with $n$ as the sequence size. Instead of directly computing $P(I_t \mid A)$, we estimate the distributions $P(l_t \mid A; \theta_L)$ and $P(I_t \mid l_t; \theta_G)$, where $l_t$ consists of 8 landmark positions. The problem then becomes finding $\theta_L$ and $\theta_G$, which represent the model parameters of the LSTM and generator networks respectively. First, an LSTM network is trained to output facial landmark positions based on the mel-frequency cepstral coefficients (MFCCs) extracted from the audio input. A second, generative model is trained to create high-quality realistic faces conditioned on a set of landmarks. These two networks are trained independently and together yield a mapping from MFCC audio features to a sequence of facial images in sync with the given audio.

4.1 Data Acquisition

For training, we used President Obama's weekly address videos because of their availability, high quality and controlled environment. These videos total 14 hours, but we used a subset of the dataset, since we achieved the desired quality with about two hours of video. For each frame, we extracted the face region and the important lip landmarks with the method proposed in [27]. We also extracted mel-frequency cepstral coefficients from the audio input.

4.2 Audio to Landmarks

The shape of the mouth during speech depends not only on the current phoneme but also on the phonemes before and after it. This is called co-articulation, and it can affect up to 10 neighboring phonemes. Inspired by [3] and [1], we used an LSTM to preserve these long-term dependencies. Extensive work has been done on the problem of audio feature extraction [28, 29, 30]. We used the typical mel-frequency cepstral coefficients as the audio features. We computed a discrete Fourier transform over a 33-millisecond sliding window and applied 40 triangular mel-scale filters to the Fourier power spectrum to obtain 13 MFCCs. In addition to these 13 MFCCs, we also used their first temporal derivatives and the log mean energy as extra features to obtain a 28-D feature vector. From the 68 landmark points detected by Dlib [31], we selected those most correlated with speech, which are 8 points around the lips. These 8 points make up a 16-D vector. We used a single-layer LSTM structure followed by two hidden layers for mapping from audio to the lip landmarks. More details about the implementation can be found in Section 5.
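The feature extraction described above can be sketched in plain NumPy. Several details here are assumptions rather than the paper's settings: a 16 kHz sample rate, non-overlapping 33 ms Hamming windows, a DCT-II to obtain the cepstral coefficients, and a 28-D layout made of 13 MFCCs plus log mean energy (14 values) followed by their first temporal derivatives. The indices in `LIP_IDX` are a hypothetical choice of 8 mouth points; the text does not list which 8 of the 68 Dlib landmarks were used.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def audio_features(signal, sr=16000, win_ms=33, n_filters=40, n_ceps=13):
    """Return one 28-D feature vector per 33 ms window."""
    win = int(sr * win_ms / 1000)
    n_frames = len(signal) // win
    frames = signal[: n_frames * win].reshape(n_frames, win)
    power = np.abs(np.fft.rfft(frames * np.hamming(win), axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_filters, win, sr).T + 1e-10)
    # DCT-II basis keeps the first n_ceps cepstral coefficients.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    mfcc = log_mel @ basis.T                                     # (n_frames, 13)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-10)[:, None]
    static = np.hstack([mfcc, log_energy])                       # (n_frames, 14)
    delta = np.gradient(static, axis=0)                          # first temporal derivative
    return np.hstack([static, delta])                            # (n_frames, 28)

rng = np.random.default_rng(0)
signal = rng.normal(size=16000)          # one second of synthetic audio
feats = audio_features(signal)

# Hypothetical selection of 8 mouth points from a 68-point landmark array.
LIP_IDX = [48, 50, 52, 54, 56, 58, 62, 66]
landmarks_68 = rng.uniform(0, 128, size=(68, 2))
lip_vector = landmarks_68[LIP_IDX].reshape(-1)   # 16-D regression target
```

With a 33 ms window and no overlap, each feature vector lines up with roughly one video frame at 30 fps, which keeps the audio sequence and the landmark sequence the same length.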

4.3 Landmarks to Image

We propose using conditional generative adversarial networks to create images from landmarks. We used the positions of distinctive lip landmarks as an extra condition on the generator network. The input of the generator consists of a 50-D noise vector and a 16-D vector of landmark positions. The 66-D input vector ultimately becomes a 128x128x3 image through deconvolution layers. We concatenate the resulting image with the 16-D landmark positions so that the input shape of the discriminator becomes 128x128x19. We followed the typical network structure proposed for DC-GANs, except that we concatenated both the generator and discriminator network inputs with the landmark positions. The structure of the generator and discriminator networks is shown in Fig. 3.
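The tensor shapes in this section can be checked with a short NumPy sketch. It assumes that the landmarks are attached to the image by repeating each of the 16 values over a full 128x128 plane, a common way to condition a convolutional discriminator; the text only states the resulting 128x128x19 shape. The random `fake_image` is a stand-in for the generator output, not an actual generator.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=50)          # 50-D noise vector
landmarks = rng.uniform(size=16)     # 8 lip points as (x, y) pairs

# Generator input: noise and landmark positions concatenated.
gen_input = np.concatenate([noise, landmarks])                 # shape (66,)

# Stand-in for the 128x128x3 image the generator would produce.
fake_image = rng.uniform(-1.0, 1.0, size=(128, 128, 3))

# Repeat each landmark value over a 128x128 plane (assumed scheme),
# then stack the 16 planes behind the 3 colour channels.
landmark_planes = np.broadcast_to(landmarks, (128, 128, 16))
disc_input = np.concatenate([fake_image, landmark_planes], axis=-1)  # (128, 128, 19)
```

Every pixel position in `disc_input` carries the same 16 conditioning values, so each convolutional filter in the discriminator can see the landmarks regardless of where it is applied.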

Figure 3: Our conditional GAN network overview. Deconvolutional network of Generator (Top), Convolutional network of Discriminator (Bottom).

5 Results

In this part we discuss implementation details and results.

Figure 4: Bidirectional LSTM structure. Frames after and before the current frame affect the output.

5.1 Implementation of LSTM

We tested several LSTM architectures and found that a bidirectional LSTM best suits our problem. As mentioned in Section 4.2, the phenomenon called co-articulation causes the mouth shape to depend on the phonemes before and after the current phoneme. The choice of a bidirectional LSTM is therefore natural, since it takes into account both previous and following frames. We used a single-layer bidirectional LSTM, since it produces the desired quality and there is no need to add complexity to the network with extra layers. We used the Adam optimizer [32] for training, with the TensorFlow framework [33]. Fig. 4 shows the bidirectional LSTM structure. In Table 1, we compare the performance of different LSTM structures and parameters.
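The bidirectional idea can be illustrated without a deep learning framework. In the sketch below a plain tanh recurrent cell stands in for the LSTM cell to keep the code short, and all sizes and weights are hypothetical. The point is only the data flow: one pass reads the frames forward, another reads them backward, and the two state sequences are concatenated per frame before being projected to the 16-D landmark vector.

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Run a tanh recurrent cell over xs (T, d_in); returns states (T, d_hid)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, fwd, bwd):
    """Concatenate forward-in-time and backward-in-time states per frame."""
    h_fwd = run_rnn(xs, *fwd)
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]   # flip back to original time order
    return np.concatenate([h_fwd, h_bwd], axis=1)

def make_params(rng, d_in, d_hid):
    return (rng.normal(0.0, 0.1, (d_hid, d_in)),
            rng.normal(0.0, 0.1, (d_hid, d_hid)),
            np.zeros(d_hid))

rng = np.random.default_rng(0)
d_in, d_hid, T = 28, 32, 10
fwd, bwd = make_params(rng, d_in, d_hid), make_params(rng, d_in, d_hid)

xs = rng.normal(size=(T, d_in))              # T frames of 28-D audio features
states = bidirectional(xs, fwd, bwd)         # (T, 64)

W_out = rng.normal(0.0, 0.1, (16, 2 * d_hid))
landmarks = states @ W_out.T                 # (T, 16) predicted lip landmarks

# Perturbing the last frame changes the backward half of earlier states only.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
states_changed = bidirectional(xs_changed, fwd, bwd)
```

Comparing `states` with `states_changed` at the second-to-last frame shows the effect that motivates the choice: the forward half is untouched by the later frame, while the backward half carries information from it.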

| Network structure | Validation loss, 100 epochs | Validation loss, 200 epochs | Validation loss, 300 epochs |
| --- | --- | --- | --- |
| Single-layer bidirectional LSTM | 0.91 | 0.88 | 0.85 |
| Single-layer unidirectional LSTM | 0.93 | 0.91 | 0.93 |
| Two-layer bidirectional LSTM | 0.92 | 0.88 | 0.84 |

Table 1: Validation loss of different LSTM network structures

| Network structure | Dropout rate 0 | Dropout rate 0.3 | Dropout rate 0.5 |
| --- | --- | --- | --- |
| Single-layer bidirectional LSTM | 0.91 | 0.88 | 0.93 |
| Single-layer unidirectional LSTM | 0.94 | 0.92 | 0.95 |
| Two-layer bidirectional LSTM | 0.91 | 0.89 | 0.92 |

Table 2: Validation loss of different LSTM networks versus dropout probability

5.2 Conditional Generative Adversarial Network

Our conditional network is able to create realistic facial images from landmarks. While generating an image sequence from audio input, we need to keep the facial texture and background constant. To achieve this, we limit the C-GAN training dataset in the last epochs to the target video. This keeps the facial texture constant during face generation while preserving the details of the reconstructed face. Some tricks proposed in [34] improved the quality of the output and reduced visual artifacts. Fig. 5 shows some of the results we achieved from a given set of landmarks.

The novelty of our approach is that the two modules that we used, LSTM and C-GAN, are almost independent from each other. This means that our model is able to transfer lip movements of other people, given their audio. Only a simple affine transformation should be applied to the source facial landmarks in order to be aligned with the target landmarks. Fig. 6 shows transfer from Hillary Clinton’s audio speech to President Barack Obama’s lip movements.
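One way to obtain such an affine alignment is a least-squares fit between corresponding landmark sets, sketched below. How the transformation was actually estimated is not stated, so the fitting procedure, the function names `fit_affine` and `apply_affine`, and the synthetic points are all assumptions for illustration.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map taking src (N, 2) onto dst (N, 2).

    Returns a 2x2 matrix A and a translation t with dst ~= src @ A.T + t.
    """
    X = np.hstack([src, np.ones((len(src), 1))])     # homogeneous coordinates
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)      # (3, 2) solution
    return M[:2].T, M[2]

def apply_affine(points, A, t):
    return points @ A.T + t

# Synthetic check: 8 source lip points mapped by a known transformation.
rng = np.random.default_rng(0)
source_lips = rng.uniform(0, 128, size=(8, 2))
A_true = np.array([[1.2, 0.1], [-0.05, 0.9]])
t_true = np.array([3.0, -2.0])
target_lips = apply_affine(source_lips, A_true, t_true)

A_fit, t_fit = fit_affine(source_lips, target_lips)
aligned = apply_affine(source_lips, A_fit, t_fit)
```

In practice the fit would be computed once from a few corresponding frames of the source and target speakers and then applied to every predicted landmark vector before it is passed to the generator.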

Figure 5: Images directly generated from landmarks. Original sequence (Top), generated face using C-GAN (Bottom).
Figure 6: Creating artificial faces of President Obama, given an audio track from Hillary Clinton. From top to bottom: 1) Original video, 2) Audio features, 3) Predicted landmarks, 4) Generated images created from landmarks.

6 Conclusion, Limitations and Future Work

We propose using conditional generative adversarial networks for creating high-quality faces given their mouth landmarks. The mouth landmarks are in turn obtained from audio using an LSTM network. This gives us an end-to-end system with much flexibility, e.g. the ability to manipulate faces without losing their naturalness. This is a major advantage over computer graphics methods, since there is no need to deal with the details of the face, e.g. synthesizing realistic teeth. The LSTM network and the C-GAN network are almost independent of each other, so we can reanimate the target face with audio from sources other than the target person himself. This opens the door to many interesting new applications such as face transformation, Dubsmash-like apps, etc.

We used the Dlib landmark detector for extracting facial landmarks. Newer approaches with more accurate results are available for facial landmark detection, especially in the mouth region [35]. Using these improved methods would improve the ability of the LSTM network to predict mouth shape from audio features.

Sometimes our model fails to create natural faces. This is mainly because the provided lip landmarks are significantly different from what the C-GAN saw during the training phase (Fig. 7). To address this problem, a more comprehensive dataset can be used to cover more head poses and lip landmark positions.

Finally, we used the typical DC-GAN structure and training procedure. New architectures and algorithms such as [11] have been proposed to improve the quality of the output image. With these new structures, images with higher quality and finer details are achievable.

Figure 7: Some cases of failure, mainly caused by irrelevant lip landmarks.