High-Resolution Talking Face Generation via Mutual Information Approximation

by   Aihua Zheng, et al.

Given an arbitrary speech clip and a facial image, talking face generation aims to synthesize a talking face video with precise lip synchronization as well as a smooth transition of facial motion over the entire video. Most existing methods focus on either disentangling the information in a single image or learning temporal information between frames. However, speech audio and video often have cross-modality coherence that has not been well addressed during synthesis. Therefore, this paper proposes a novel high-resolution talking face generation model for arbitrary persons by discovering the cross-modality coherence via Mutual Information Approximation (MIA). By assuming the modality difference between audio and video is larger than that between real video and generated video, we estimate mutual information between real audio and video, and then use a discriminator to force the generated video distribution to approach the real video distribution. Furthermore, we introduce a dynamic attention technique on the mouth to enhance robustness during the training stage. Experimental results on the benchmark dataset LRW surpass the state-of-the-art methods on prevalent metrics, with robustness to gender and pose variations and to high-resolution synthesis.





1 Introduction

Talking face generation aims to generate a realistic talking video for a given still face image and speech clip. It has been an active research topic and has many real-world applications such as animating movies, teleconferencing, talking agents and enhancing speech comprehension while preserving privacy. Recent efforts mainly employ deep generative models to generate the talking face from scratch. They conventionally formulate the task as synthesizing the talking face of a specified target identity from speech. For instance, Rithesh et al. [22] and Supasorn et al. [33] generate the talking face of Obama with the supervision of text and audio respectively. Subsequently, some methods aim to synthesize talking faces for more identities by taking advantage of Generative Adversarial Networks (GANs) [16, 32]. More recently, some researchers have focused on synthesizing talking faces for arbitrary identities that are not required to appear in the dataset [39, 3]. However, since different identities differ greatly in appearance, it is challenging to synthesize the talking face for arbitrary identities. In particular, there are two types of modality difference in arbitrary-identity synthesis: one is between audio and video, and the other is between different identities.

Mutual information (MI) is a commonly used information-theoretic measure of the dependence between two distributions. As a quantity for capturing non-linear statistical dependencies between variables, it has found applications in a wide range of domains and tasks, including clustering gene expression data [28], feature selection [30] and cross-modality localization [1]. One of the pioneering works is [27], which proposes to estimate the mutual information based on kernel density estimators. Kraskov et al. [21] present two closely related families of mutual information estimators. More recently, Belghazi et al. [2] present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-propagation, and strongly consistent. As explained in [34], mutual information can be utilized to learn a parametrized mapping from a given input to a higher-level representation that preserves information about the original input. This is referred to as the infomax principle, which here translates to maximizing the mutual information between the audio input and the frame output of the generative network.

Talking face generation is intrinsically a speech-to-video cross-modality and cross-identity generation problem, where it is crucial to capture the cross-modality coherence between speech and lip movement. One of the challenges in speech-driven talking face generation is that it is difficult to encode the speech audio information into the video modality. Therefore, we propose to explore the speech-to-video cross-modality coherence via Mutual Information Approximation (MIA). By assuming the modality difference between audio and video is larger than that between real video and generated video, we estimate mutual information between real audio and video, and then use a discriminator to force the generated video distribution to approach the real video distribution. Benefiting from mutual information, MIA can learn the cross-modality coherence and help encode the audio modality into the video one. At the same time, the mechanism of GANs pushes the generated distribution towards the real distribution. This is different from MINE [2], which estimates MI directly from the two target distributions (the speech audio distribution and the generated video frame distribution). Experimental results demonstrate that MIA is more applicable to talking face generation for arbitrary identities.

The proposed model consists of three components: a Talking Face Generator, a Frame Discriminator and a Mutual Information Approximator, as shown in Fig. 2. First, the Talking Face Generator is designed to generate target frames from the given input: one audio clip, one still facial image and the previously generated frame. It leverages the temporal information from the previously generated frame. Then, we feed the audio and the generated frame into the Frame Discriminator to detect whether they are matched or not. Additionally, the Mutual Information Approximator is introduced to estimate the mutual information between the audio and video distributions using a neural network based on the MI measure.

The main contributions of our paper can be summarized as follows:

  • We propose to leverage mutual information in cross-modality talking face generation for arbitrary persons, which can better encode audio information into the generated video.

  • A Mutual Information Approximation (MIA) is introduced to describe the coherence between video and speech, which improves reconstruction and inference during adversarial learning.

  • We design an end-to-end model for talking face generation which consists of a Talking Face Generator, a Frame Discriminator and a Mutual Information Approximator.

  • Extensive experiments yield a new state-of-the-art on the benchmark dataset LRW [5], with robustness to gender and pose variations and to high-resolution generation.

Figure 2: Pipeline: (a) Talking Face Generator: we first feed an audio clip to an audio encoder network (Audio Encoder) to extract a 256-dimension audio feature. We then feed the previously generated image to an image encoder network (Image Encoder) to obtain a 256-dimension feature. The two features are concatenated channel-wise to obtain a 512-dimension fusion feature. Finally, we feed the fusion feature to the U-Net, which maintains the facial texture and animates the mouth, and output a generated frame. (b) The Mutual Information Approximator is fed pairs of an audio clip with either the real frame or the generated frame, and its output approximates the mutual information. (c) The Frame Discriminator takes the same kind of pairs, and its output indicates the probability that a pair is matched.

2 Related works

In this section, we briefly review the related works about talking face generation and mutual information estimators.

2.1 Talking Face Generation

Earlier works on talking face generation mainly synthesize a specific identity from the dataset given an arbitrary speech audio. Rithesh et al. [22] use a time-delayed LSTM [12] to generate key points synced to the audio and use another network to generate the video frames conditioned on the key points. Furthermore, Supasorn et al. [33] propose a teeth proxy to improve the quality of the teeth during generation.

Subsequently, Chung et al. [4] adopt an encoder-decoder CNN model to learn the correspondences between raw audio and video data. Karras et al. [18] propose a deep neural network to learn a mapping from input waveforms to the 3D vertex coordinates of a face model. The network simultaneously discovers a latent code to disambiguate facial expression variations. Jalalifar et al. [16] introduce a recurrent neural network into the conditional GAN [11] to produce a sequence of natural faces in sync with an input audio track. Bo et al. [9] utilize an LSTM network [14] to create lip landmarks from the audio input. Vougioukas et al. [36] employ a Temporal GAN [32] to capture temporal information and thereby improve the quality of synthesis. However, these methods are only applicable to synthesizing talking faces for identities from the dataset. Recently, the synthesis of talking faces for arbitrary identities outside the dataset has drawn much attention. Chen et al. [3] propose to leverage optical flow to better express the information between frames. Zhou et al. [39] propose an adversarial learning method to disentangle the different kinds of information in one image during generation. However, since different identities differ greatly in appearance, it is challenging to synthesize the talking face for arbitrary identities.

2.2 Mutual Information Estimator

In information theory, mutual information measures the mutual dependence between two random variables. It has historically been difficult to compute and estimate. A mutual information estimator aims to estimate this hard-to-compute dependence for more general problems.

One of the pioneering approaches calculates the relative frequencies on appropriate partitions to approximate mutual information [7]. Kraskov et al. [21] propose a popular KNN-based estimator modified from the entropy estimator [20]. Recent works try to employ parameter-free approaches [21], or rely on approximate Gaussianity of the data distribution [15] to estimate the mutual information. In order to reduce the bias and preserve the variance, Sricharan et al. [23] propose to estimate the entropy or divergence by ensembling some simple plug-in estimators with varying neighborhood sizes.

Recently, Moon et al. [26] derive the mean squared error convergence rates of kernel density-based plug-in estimators of mutual information measures between two multidimensional random variables. Belghazi et al. [2] propose a backpropagation-based MI estimator that exploits dual representations of the KL-divergence [31] to estimate divergences beyond the minimax objective as formalized in GANs. It is scalable, flexible, and completely trainable.

3 Proposed Method

Our model consists of a Talking Face Generator, a Frame Discriminator and a Mutual Information Approximator. The main architecture is shown in Fig. 2.

3.1 Talking Face Generator

The generator has three inputs: 1) the original frame, which provides the texture information of the output frame; 2) the speech audio clip, which serves as the condition that supervises the change of the mouth; and 3) the previously generated frame, which guarantees the smoothness of the generation by supplying more temporal information. The three inputs are fed to the Identity Encoder, the Audio Encoder and the Image Encoder respectively, and the Frame Decoder outputs the target video frame.

We propose to use conditional generative networks to synthesize frames from the MFCC features of the audio. It is necessary to preserve the background and the identity of the person while generating target video frames from an arbitrary audio input. U-Net, a prevalent architecture that passes contextual information from the encoder to the decoder, has been widely adopted in generation. Therefore, a U-Net [29] architecture is used with skip connections between the Identity Encoder and the Frame Decoder to help preserve the facial texture during generation and maintain the details of the reconstructed face.

The Image Encoder is based on LightCNN-9 [38], which leverages a variation of the maxout activation called Max-Feature-Map; it can not only separate noisy and informative signals but also play the role of feature selection between two feature maps. The output of this encoder is a 256-dimension feature. The Audio Encoder consists of a 3-layer CNN, a 4-layer CNN and a 1-layer classifier. The two CNNs process the input MFCC feature simultaneously; their outputs are flattened, concatenated and fed to the classifier to obtain another 256-dimension feature.

Although the shape of the mouth is determined by the audio, temporal information is also necessary for the generator. We use the Image Encoder to extract a feature from the previously generated frame at each generating step. We consider this feature representative of the previous temporal information as well as of the mouth of a specific image. When generating the first frame, this feature is extracted from the original frame. All parameters are shared across generation stages.
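The frame-by-frame procedure described here can be sketched as a simple feedback loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_video` and the `generator` callable are hypothetical names, and frames and audio clips are treated as opaque objects.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Audio = TypeVar("Audio")


def generate_video(
    generator: Callable[[Frame, Audio, Frame], Frame],
    still_image: Frame,
    audio_clips: Sequence[Audio],
) -> List[Frame]:
    """Generate one frame per audio clip, feeding back the previous output.

    `generator(identity_frame, audio_clip, previous_frame)` is a hypothetical
    stand-in for the Talking Face Generator.
    """
    frames: List[Frame] = []
    previous = still_image  # first step: no generated frame exists yet
    for clip in audio_clips:
        frame = generator(still_image, clip, previous)
        frames.append(frame)
        previous = frame  # temporal cue for the next step
    return frames


def toy_generator(identity, clip, previous):
    """Toy generator that just records which inputs it received."""
    return (identity, clip, previous)


video = generate_video(toy_generator, "face", ["a0", "a1"])
```

In the toy run, the first step receives the still image as its "previous frame", and the second step receives the first generated output.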

However, the U-Net imposes a strong constraint from the input, which may allow the shape of the mouth to change only slightly. We therefore employ two strategies to relax the constraint of the input frame: 1) we remove the outermost two skip connections, because the outer skip connections carry more detailed information; 2) we introduce a dynamic attention to reduce the constraint of the original mouth, as illustrated in Fig. 3. We use the original frame to keep the facial texture constant and the previously generated frame to maintain the temporal and mouth information of the visual part. This can also stabilize the training and improve the quality of the generation.

Figure 3: The simplified pipeline of the proposed dynamic attention, leveraging dual-stream information that represents the texture-related and lip-related information respectively during talking face generation. The blue box at the bottom shows examples of facial images with varying attention rates: the larger the attention rate (smaller than or equal to 1.0), the higher the attention on the mouth. Note that the several images in the green box only illustrate the dynamics; in the training step, only one image is fed to the encoder.

Dynamic attention.

We consider that a talking face video is mainly composed of identity-related and lip-related features. Separating these features can help our model adapt to arbitrary-identity generation, since the strong constraint of the original input mouth might restrict the mouth change while encoding the audio information into the generated frames.

Therefore, in order to improve the transition of talking face generation for arbitrary identities, we introduce a dynamic attention technique, which applies different attention rates to the mouth area during training. As the attention rate becomes smaller, the dynamic attention technique divides the feature of one identity into two parts: an identity-related feature and a lip-related feature. An example of the dynamic attention is shown in Fig. 3.

The attention mask has the same dimensions as the input frame; the rate is smaller than 1 in the area near the mouth and equal to 1 elsewhere. In the training stage, we start from a relatively large rate and progressively decrease it to a relatively small value, then fix it to 1 for the last few epochs. The reason for this strategy is that directly separating the two parts with a small rate (less attention on the mouth) may significantly affect the quality of generation in the early stage, because supervision of the mouth information is lacking. Therefore, we progressively decrease the rate after several training epochs, which forces the visual information of the mouth to derive from the previous frame. This dynamic attention technique is designed to be more robust to large lip movements in the cross-modality generation task. Note that we do not use dynamic attention during testing since it may hurt the generation speed. In the last few epochs of training, we set the rate to 1 to match the real test environment.
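One way to picture the rate schedule is as a function of the training epoch. The sketch below is purely illustrative: the linear decay, the start and end rates (0.9 and 0.3) and the number of final epochs are placeholder assumptions, since the actual values are not recoverable from the text, and `attention_rate` is a hypothetical name.

```python
def attention_rate(epoch: int, total_epochs: int, final_epochs: int = 2,
                   start_rate: float = 0.9, end_rate: float = 0.3) -> float:
    """Hypothetical linear schedule for the mouth attention rate.

    Decays from `start_rate` to `end_rate`, then returns 1.0 for the last
    `final_epochs` epochs so that training matches the test-time input.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    decay_epochs = total_epochs - final_epochs
    if epoch >= decay_epochs:
        return 1.0  # final epochs: no attenuation of the mouth area
    if decay_epochs <= 1:
        return start_rate
    fraction = epoch / (decay_epochs - 1)
    return start_rate + fraction * (end_rate - start_rate)


schedule = [attention_rate(e, total_epochs=6) for e in range(6)]
```

With six epochs, the rate falls over the first four epochs and is reset to 1 for the last two.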

To obtain the attention mask, we apply Dlib [19] to predict the landmarks of the frame and use only the mouth-area landmarks (20 points) to generate a bounding box. In practice, this box is 5 pixels larger than the detected mouth area.
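The mask construction can be sketched in a few lines. This is an assumption-laden illustration rather than the paper's code: `mouth_attention_mask` is a hypothetical helper, the landmarks are passed in as plain (x, y) tuples instead of being detected, and "5 pixels larger" is interpreted as a 5-pixel margin on every side of the tight bounding box.

```python
from typing import List, Sequence, Tuple


def mouth_attention_mask(
    height: int,
    width: int,
    mouth_landmarks: Sequence[Tuple[int, int]],
    rate: float,
    margin: int = 5,
) -> List[List[float]]:
    """Return a height x width mask: `rate` inside the padded mouth box, 1 elsewhere.

    `mouth_landmarks` holds (x, y) pixel coordinates, e.g. from a landmark detector.
    """
    xs = [x for x, _ in mouth_landmarks]
    ys = [y for _, y in mouth_landmarks]
    # Pad the tight bounding box by `margin` pixels and clip to the image.
    left, right = max(min(xs) - margin, 0), min(max(xs) + margin, width - 1)
    top, bottom = max(min(ys) - margin, 0), min(max(ys) + margin, height - 1)
    return [
        [rate if top <= r <= bottom and left <= c <= right else 1.0
         for c in range(width)]
        for r in range(height)
    ]


mask = mouth_attention_mask(32, 32, [(12, 20), (20, 20), (16, 24)], rate=0.5)
```

Multiplying a frame element-wise by such a mask attenuates only the mouth region and leaves the rest of the face untouched.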

3.2 Frame Discriminator

The discriminator network is fed pairs of a frame and an audio clip, where the frame is either the real frame or the generated frame and the audio clip is the corresponding one. The output of the discriminator is the probability that the inputs (audio and frame) are matched.

The discriminator consists of an Image CNN (6 convolution layers), an Audio FC (3 fully connected layers) and a classifier (3 fully connected layers). We flatten the output of the Image CNN to a 16384-dimension feature, and the Audio FC extracts a 4096-dimension feature. These two features are concatenated and fed to the final classifier to produce a 1-dimension output.

3.3 Mutual Information Approximator

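Mutual information is a measure of mutual dependency between two probability distributions,

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}, \tag{1}$$

where $p(x,y)$ is the joint probability function of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability distribution functions of $X$ and $Y$ respectively.

As stated in Eq. (1), mutual information is equivalent to the Kullback-Leibler (KL-) divergence between the joint $P_{XY}$ and the product of the marginal distributions $P_X$ and $P_Y$:

$$I(X;Y) = D_{KL}\left(P_{XY} \,\|\, P_X \otimes P_Y\right), \tag{2}$$

where $D_{KL}$ is defined as,

$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{P}\left[\log \frac{dP}{dQ}\right]. \tag{3}$$

Furthermore, the KL divergence admits the following DV representation [8, 2]:

$$D_{KL}(P \,\|\, Q) = \sup_{T} \; \mathbb{E}_{P}[T] - \log\left(\mathbb{E}_{Q}\left[e^{T}\right]\right), \tag{4}$$

where the supremum is taken over all functions $T$ so that the two expectations are finite. Therefore, we leverage the bound:

$$I(X;Y) \geq I_{\Theta}(X,Y), \tag{5}$$

where $I_{\Theta}$ denotes the neural information measure,

$$I_{\Theta}(X,Y) = \sup_{\theta \in \Theta} \; \mathbb{E}_{P_{XY}}[T_{\theta}] - \log\left(\mathbb{E}_{P_X \otimes P_Y}\left[e^{T_{\theta}}\right]\right). \tag{6}$$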
In this cross-modal problem, we argue that the audio modality contains information about the visual modality, and vice versa. The key to utilizing this information is how to calculate it with a neural network. Following [2], we design a network to estimate the information shared between the audio and visual modalities.
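To make the Donsker-Varadhan bound concrete, the sketch below evaluates it from two lists of network scores: one for matched (joint) pairs and one for randomly paired (marginal) samples. It is a minimal illustration, assuming the scores have already been produced by some statistics network; `dv_lower_bound` is a hypothetical helper name and no training is shown.

```python
import math
from typing import Sequence


def dv_lower_bound(joint_scores: Sequence[float],
                   marginal_scores: Sequence[float]) -> float:
    """Donsker-Varadhan estimate: mean(T on joint) - log(mean(exp(T on marginals)))."""
    mean_joint = sum(joint_scores) / len(joint_scores)
    # log-mean-exp, shifted by the maximum for numerical stability
    shift = max(marginal_scores)
    log_mean_exp = shift + math.log(
        sum(math.exp(s - shift) for s in marginal_scores) / len(marginal_scores)
    )
    return mean_joint - log_mean_exp


# A network that cannot tell matched from mismatched pairs gives a bound of 0;
# one that scores matched pairs higher gives a positive bound.
independent = dv_lower_bound([0.0, 0.0], [0.0, 0.0])
dependent = dv_lower_bound([2.0, 2.0], [-2.0, -2.0])
```

Maximizing this quantity over the network parameters tightens the lower bound on the mutual information.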

Our estimator involves the audio, the frame and a neural network. Samples from the joint distribution are pairs of a real frame and its corresponding audio clip, while samples from the marginal distributions are drawn randomly from the dataset. The neural network, which is fed pairs of a frame and an audio clip, consists of an Image Encoder, an Audio Encoder and a 3-layer classifier. The Image Encoder and the Audio Encoder have the same architecture as defined in the generator, and the output of the classifier is a 1-dimension scalar.

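Inspired by [13], instead of the DV representation of the KL-divergence (Eq. (4)), we adopt another representation (following the formulation of [25]),

$$\hat{I}^{(JSD)}(X;Y) = \mathbb{E}_{P_{XY}}\left[-\mathrm{sp}\left(-T_{\theta}(x,y)\right)\right] - \mathbb{E}_{P_X \otimes P_Y}\left[\mathrm{sp}\left(T_{\theta}(x,y)\right)\right], \tag{7}$$

where $T_{\theta}$ is the neural network, $P_{XY}$ is the joint distribution, $P_X \otimes P_Y$ is the product of the marginals, and $\mathrm{sp}$ represents the softplus operation:

$$\mathrm{sp}(z) = \log\left(1 + e^{z}\right), \tag{8}$$

which is better suited to training GANs. We employ this non-KL divergence for the following two reasons: 1) we are not concerned with the accurate value of MI while maximizing it; 2) this estimator is similar to the binary cross-entropy, which has been well studied in neural network optimization, and works more stably in practice [13].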
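The softplus-based estimator can be evaluated from the same kind of score lists as the DV bound. The sketch below follows the Jensen-Shannon form used in [13] and is only an illustration: `softplus` and `jsd_mi_estimate` are hypothetical helper names, and the scores are assumed to come from an already-defined network.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mi_estimate(joint_scores: Sequence[float],
                    marginal_scores: Sequence[float]) -> float:
    """Jensen-Shannon style estimate: E_joint[-sp(-T)] - E_marginal[sp(T)]."""
    positive = sum(-softplus(-s) for s in joint_scores) / len(joint_scores)
    negative = sum(softplus(s) for s in marginal_scores) / len(marginal_scores)
    return positive - negative


uninformative = jsd_mi_estimate([0.0], [0.0])     # equals -2 * log(2)
well_separated = jsd_mi_estimate([20.0], [-20.0])  # approaches 0 from below
```

The value is not the mutual information itself; it only increases as matched and mismatched pairs become easier to separate, which is why it resembles a binary cross-entropy objective.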

Our Mutual Information Approximator is trained using triplets of a real frame, its matching audio clip and a randomly selected audio clip from the dataset. In the estimating stage, we estimate mutual information using the generated frame and its audio clip. That is, we use real pairs for training and a generated sample for estimation. GANs are usually used to learn a probability distribution consistent with the real data, while mutual information estimates the amount of information shared between two distributions. Therefore, our solution uses mutual information between distributions, as shown in Fig. 1, which can stabilize the convergence and improve the quality of generation.
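A small sketch of how such training triplets might be assembled from matched frame and audio pairs is given below. It is illustrative only: `make_triplets` is a hypothetical helper, frames and clips are represented by strings, and excluding the sample's own clip when drawing the random one is an assumption not stated in the text.

```python
import random
from typing import List, Sequence, Tuple


def make_triplets(pairs: Sequence[Tuple[str, str]],
                  seed: int = 0) -> List[Tuple[str, str, str]]:
    """Return (frame, matched_audio, mismatched_audio) triplets.

    `pairs` holds matched (frame, audio) samples; the mismatched clip is drawn
    at random from the other samples (an assumption made for this sketch).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to draw a mismatched clip")
    rng = random.Random(seed)
    triplets = []
    for i, (frame, audio) in enumerate(pairs):
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        triplets.append((frame, audio, pairs[j][1]))
    return triplets


triplets = make_triplets([("f0", "a0"), ("f1", "a1"), ("f2", "a2")])
```

The matched pair supplies a joint sample and the mismatched pair supplies a marginal sample for the estimator.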

3.4 Training Details

In the training stage, we feed the Mutual Information Approximator and the Frame Discriminator pairs of a frame and an audio clip, using both real and generated frames. The loss of our GAN can be defined as,


In order to synchronize the mouth movements more accurately, we make use of the perceptual loss, which was originally proposed by [17] as a method for image style transfer and super-resolution. It utilizes high-level features to compare generated images and ground-truth images, resulting in better sharpness of the synthesized image. The perceptual loss is computed on the outputs of a feature extraction network.

To focus on the lip movement, we only utilize the mouth area of the frame for the reconstruction loss, where the area is selected by a mask of the mouth.
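A mouth-restricted reconstruction loss can be sketched as follows. The choice of an L1 (mean absolute) distance is an assumption made for illustration, since the exact norm is not recoverable from the text, and `masked_l1_loss` is a hypothetical helper operating on nested lists of pixel values.

```python
from typing import Sequence


def masked_l1_loss(real: Sequence[Sequence[float]],
                   generated: Sequence[Sequence[float]],
                   mouth_mask: Sequence[Sequence[int]]) -> float:
    """Mean absolute error over the pixels where `mouth_mask` is non-zero."""
    total, count = 0.0, 0
    for real_row, gen_row, mask_row in zip(real, generated, mouth_mask):
        for r, g, m in zip(real_row, gen_row, mask_row):
            if m:
                total += abs(r - g)
                count += 1
    return total / count if count else 0.0


real = [[0.0, 1.0], [1.0, 0.0]]
generated = [[0.5, 1.0], [0.0, 0.0]]
mouth_mask = [[0, 0], [1, 1]]  # only the bottom row counts as mouth
loss = masked_l1_loss(real, generated, mouth_mask)
```

The error at the top-left pixel is ignored because it lies outside the mask.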

The perceptual and reconstruction losses measure the distance between visual concepts; it is also important to shorten the distance between the audio and visual modalities in the high-level representation. We implement the mutual information as described in Sec. 3.3 and maximize it between generated frames and audio clips,


Our full model is optimized according to the following objective function:


4 Experiments

In this section, we first introduce the dataset and experimental settings, followed by the qualitative, quantitative results, cross-dataset evaluation, ablation study and generating speed comparison.

4.1 Preparation

Dataset and Settings.

We evaluate our method on the prevalent benchmark datasets LRW [5] and GRID [6]. The former is an in-the-wild dataset that contains up to 1000 utterances of each of 500 different words, spoken by hundreds of different speakers. The latter is captured in a constrained environment and consists of recordings of 1000 sentences spoken by each of 34 speakers (18 male and 16 female). We first extract frames from the raw video files and then detect and align the faces using the RSA algorithm [24]. All frames are resized to a fixed resolution. For the audio stream, we follow the implementation in [39] by extracting the MFCC features at the sampling rate of 5000Hz. We then match each frame with an MFCC audio input of fixed size.

We adopt the Adam optimizer with a fixed learning rate during training. All the parameters in the networks are initialized with Xavier normal initialization [10].


To evaluate the quality of the synthesized talking faces, we use common reconstruction metrics, namely the Peak Signal to Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [37]. The higher the PSNR and SSIM, the better the quality of the video. Furthermore, we use the Landmark Distance (LMD) [3] to evaluate the accuracy of the mouth in the generated video. It calculates the Euclidean distance between the mouth landmarks detected by Dlib [19] on the generated video and on the original video as,

$$LMD = \frac{1}{T} \times \frac{1}{P} \sum_{t=1}^{T} \sum_{p=1}^{P} \left\| LR_{t,p} - LF_{t,p} \right\|_{2},$$

where $T$ represents the frame length of the video and $P$ represents the total number of landmark points on each image. $LR_{t,p}$ and $LF_{t,p}$ indicate the mouth landmarks of the real video and of the generated video at the $p$-th landmark point of the $t$-th frame respectively. The lower the LMD, the better the generation.
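Two of these metrics are simple enough to sketch directly. The code below is an illustrative stand-alone version, not the evaluation script used in the paper: `landmark_distance` and `psnr` are hypothetical helpers, landmarks are plain (x, y) tuples rather than detector output, and the PSNR peak value of 255 assumes 8-bit pixel intensities.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def landmark_distance(real: Sequence[Sequence[Point]],
                      generated: Sequence[Sequence[Point]]) -> float:
    """Mean Euclidean distance over all frames and mouth landmarks."""
    total, count = 0.0, 0
    for real_frame, gen_frame in zip(real, generated):
        for (rx, ry), (gx, gy) in zip(real_frame, gen_frame):
            total += math.hypot(rx - gx, ry - gy)
            count += 1
    return total / count


def psnr(real: Sequence[float], generated: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for two flattened images."""
    mse = sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# One frame with two landmarks: distances are 5.0 and 0.0, so the mean is 2.5.
lmd = landmark_distance([[(0.0, 0.0), (1.0, 1.0)]], [[(3.0, 4.0), (1.0, 1.0)]])
```

Identical images give an infinite PSNR, and identical landmark sets give an LMD of zero, which is the ideal value for ground truth.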

Figure 4: Examples of generating talking faces for arbitrary identities from the testing set of LRW [5] dataset.

4.2 Quantitative Results

We compare our model with four recent state-of-the-art methods: Zhou [39], Vondrick [35], Chung [4] and Chen [3]. Table 1 shows the quantitative results of our method and its competitors. Our method obtains higher PSNR and SSIM and lower LMD, suggesting the best quality of the generated talking face frames.

We observe that the methods can be ordered by ascending PSNR as Zhou [39], Vondrick [35], Chung [4], Chen [3] and our method. Although Zhou [39] obtains the lowest PSNR, it obtains the second highest SSIM, which is significantly better than that of Vondrick [35], Chung [4] and Chen [3]. Our method achieves the highest PSNR and SSIM, demonstrating its effectiveness. Our method also reduces LMD significantly: from 1.92 for the best competitor, Chen [3], to 1.18.

Since the LRW dataset is in-the-wild, unstable videos in the training set (due to the alignment of the dataset) may cause unstable generation. We assume that a high-quality and well-controlled dataset may facilitate training a better generative model. Therefore, we fine-tune our model on the GRID dataset and directly evaluate the fine-tuned model on the LRW dataset (denoted by ‘Ours (F.T. on GRID)’ in Table 1), which achieves the highest scores on PSNR and SSIM. Only LMD degrades slightly (1.21 versus 1.18), since the GRID dataset is captured in a controlled environment, which may affect the structure of the model pre-trained on LRW.

Evaluation on LRW [5]

| Methods | PSNR | SSIM | LMD |
|---|---|---|---|
| G.T. | | | 0 |
| Zhou [39] | 26.8 | 0.884 | |
| Vondrick [35] | 28.03 | 0.34 | 3.28 |
| Chung [4] | 28.06 | 0.46 | 2.225 |
| Chen [3] | 28.65 | 0.53 | 1.92 |
| Ours | 29.64 | 0.92 | 1.18 |
| Ours (F.T. on GRID) | 32.08 | 0.92 | 1.21 |

Table 1: Quantitative results of our method on PSNR, SSIM and LMD compared to the state-of-the-art methods.

4.3 Cross-dataset Evaluation

To further verify the robustness of our method for arbitrary person generation, we evaluate our method on another benchmark dataset, GRID [6], and report the comparison results in Table 2. Note that we directly apply our model trained on the LRW dataset [5], without retraining or fine-tuning on the GRID dataset (denoted by ’Ours’ in Table 1), whereas all compared methods are trained directly on the GRID dataset.

From Table 2, we observe that our model achieves the highest SSIM and the lowest LMD, demonstrating the effectiveness and robustness of our method. Our method does not obtain the highest PSNR, because it is not trained on any samples from the GRID dataset, but its PSNR (29.25) is close to the best one (29.89).

However, when we fine-tune our model on the GRID dataset, all results of our method are further improved. As expected, our method then achieves the best scores on PSNR, SSIM and LMD. Compared to the improvement between Chung [4] and Chen [3], the improvement of our method over its competitors is significant, suggesting the effectiveness of our network structure and mutual information learning.

Evaluation on GRID [6]

| Methods | PSNR | SSIM | LMD |
|---|---|---|---|
| G.T. | | | 0 |
| Vondrick [35] | 28.45 | 0.60 | 2.38 |
| Chung [4] | 29.36 | 0.74 | 1.35 |
| Chen [3] | 29.89 | 0.73 | 1.18 |
| Ours | 29.25 | 0.96 | 0.82 |
| Ours (F.T. on GRID) | 30.67 | 0.97 | 0.73 |

Table 2: Cross-dataset evaluation on the GRID dataset of our method pre-trained on the LRW dataset, compared to the state-of-the-art methods.
(a) Generated frames with large movement and pose variation
(b) Generated frames with gentle movement
Figure 5: Examples of generating talking faces for arbitrary identities from the wild (not existing in the dataset). By fixing the audio supervision, our model can generate smooth talking faces under darker lighting conditions (the first row of each subimage), across genders (the third row of each subimage) and with pose variation (the third row of (a)).

4.4 Qualitative Results

Our method is capable of synthesizing realistic videos of talking faces for new identities. We first show the synthesized talking faces for arbitrary identities from the dataset in Fig. 4. Our method not only synchronizes the lip shapes with the ground truth, but also maintains identity details such as teeth and wrinkles.

Fig. 5 shows the qualitative results of generation for arbitrary identities from the wild (not existing in the dataset) with large movement and pose variation (Fig. 5 (a)) and with gentle movement (Fig. 5 (b)). We observe that, by fixing the audio supervision, our model can synthesize talking faces of arbitrary identities with the desired lip movement under large movement, while preserving the stability of the generation under gentle movement. Furthermore, it can generate smooth talking faces both with the same gender (the first row of each subimage) and across genders (the third row of each subimage).

Figure 6: Qualitative examples of variants of our model in Table 3 (d) and Table 3 (e). The results of the MIA+DA variant are clearly better than those of the OMI+DA and OMI+MIA+DA variants in both synchronization and realism.
| Method | (a) | (b) | (c) | (d) | (e) | (f) | (g) |
|---|---|---|---|---|---|---|---|
| PSNR | 28.88 | 29.19 | 29.41 | 29.08 | 29.64 | 28.77 | 28.93 |
| SSIM | 0.895 | 0.900 | 0.918 | 0.891 | 0.92 | 0.883 | 0.89 |
| LMD | 1.36 | 1.37 | 1.22 | 1.32 | 1.18 | 1.55 | 1.54 |

Table 3: Ablation study on four settings: Dynamic Attention (DA), Original MINE (OMI), Mutual Information Approximation (MIA) and LSTM.

4.5 Ablation Study

In order to quantify the effect of each component of our system, we conduct ablation study experiments to verify the contributions of four settings in our full model: Dynamic Attention (DA), Original MINE (OMI), Mutual Information Approximation (MIA) and LSTM.

As can be seen in Table 3: 1) By introducing dynamic attention (Table 3 (b)) or the original MINE (Table 3 (c)), we achieve better PSNR and SSIM than the baseline (Table 3 (a)). 2) By replacing the original MINE (OMI) with the proposed Mutual Information Approximation (MIA), our full model (Table 3 (e)) achieves the best results on PSNR, SSIM and LMD. 3) Combining DA, OMI and MIA (Table 3 (f)) results in the worst performance on PSNR, SSIM and LMD. We consider that this may be caused by interference between the two mutual information systems (OMI and MIA).

In other video generation tasks such as video prediction, high-level temporal information usually provides better guidance for generating the next frame. However, we observe that it is not the key to talking face generation. In our task, the current shape of the mouth is mainly determined by the audio input and the mouth of the previous frame, and we assume that an LSTM might disturb the current audio input with its memory of previous inputs. We conduct an experiment that leverages an LSTM to obtain temporal information on the LRW dataset [5] and report the results in Table 3 (g). We observe that the proposed model (Table 3 (e)) is sufficient for this task, while the LSTM degrades the performance of our model.

4.6 Generating Speed Comparison

In order to evaluate the generation speed of our method, we further conduct a speed comparison with Zhou et al. [39]. We conduct this experiment on one GPU (NVIDIA 1080ti). For a fair comparison, we only count the time spent in the model itself (model generation speed) and exclude the time for data preparation and for saving the generated output. Our method achieves 160.7916 fps (1.4553 seconds for 234 frames), which is about 45% faster than Zhou et al. [39] with 110.4190 fps (2.1192 seconds for 234 frames).
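The reported frame rates and the relative speed-up follow directly from the frame count and the timings quoted above, as the short check below shows. `frames_per_second` is a helper name introduced only for this illustration.

```python
def frames_per_second(num_frames: int, seconds: float) -> float:
    """Throughput in frames per second."""
    return num_frames / seconds


ours = frames_per_second(234, 1.4553)      # proposed method
baseline = frames_per_second(234, 2.1192)  # Zhou et al. [39]
speedup_percent = (ours / baseline - 1.0) * 100.0
```

Since both runs generate the same 234 frames, the speed-up is simply the ratio of the two timings minus one.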

5 Conclusion

In this paper, we have proposed a novel model of talking face generation for arbitrary identities that explores the cross-modality coherence. Our model leverages a mutual information estimator to learn the correlation between audio features and facial image features, and introduces the mutual information as a loss into the generation framework. In addition, we use a simple dynamic attention technique to simulate the process of disentangling person identity features and lip features. Extensive experimental results on benchmark datasets demonstrate the promising performance of our method, which surpasses the state-of-the-art methods.


  • [1] A. Alempijevic, S. Kodagoda, and G. Dissanayake. Cross-modal localization through mutual information. In International Conference on Intelligent Robots and Systems (IROS), pages 5597–5602. IEEE, 2009.
  • [2] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, R. D. Hjelm, and A. C. Courville. Mutual information neural estimation. In International Conference on Machine Learning (ICML), 2018.
  • [3] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu. Lip movements generation at a glance. CoRR, abs/1803.10404, 2018.
  • [4] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? CoRR, abs/1705.02966, 2017.
  • [5] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision (ACCV), 2016.
  • [6] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424, 2006.
  • [7] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory (TIT), 45(4):1315–1321, 1999.
  • [8] M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
  • [9] B. Fan, L. Wang, F. K. Soong, and L. Xie. Photo-real talking head with deep bidirectional lstm. In International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 4884–4888. IEEE, 2015.
  • [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances In Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
  • [12] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
  • [13] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [15] M. M. V. Hulle. Edgeworth approximation of multivariate differential entropy. Neural Computation, 17(9):1903–1910, 2005.
  • [16] S. A. Jalalifar, H. Hasani, and H. Aghajan. Speech-driven facial reenactment using conditional generative adversarial networks. CoRR, abs/1803.07461, 2018.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711. Springer, 2016.
  • [18] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36:94, 2017.
  • [19] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research (JMLR), 10(Jul):1755–1758, 2009.
  • [20] L. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
  • [21] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
  • [22] R. Kumar, J. Sotelo, K. Kumar, A. de Brébisson, and Y. Bengio. Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442, 2017.
  • [23] D. W. Kumar Sricharan and A. O. Hero III. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory (TIT), 59(7):4374, 2013.
  • [24] Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, and X. Tang. Recurrent scale approximation for object detection in cnn. 2017.
  • [25] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning (ICML), pages 3478–3487, 2018.
  • [26] K. R. Moon, K. Sricharan, and A. O. Hero. Ensemble estimation of mutual information. In International Symposium on Information Theory (ISIT), pages 3030–3034. IEEE, 2017.
  • [27] Y.-I. Moon, B. Rajagopalan, and U. Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.
  • [28] I. Priness, O. Maimon, and I. Ben-Gal. Evaluation of gene-expression clustering via mutual information distance measure. BMC bioinformatics, 8(1):111, 2007.
  • [29] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (MICCAI), pages 234–241. Springer, 2015.
  • [30] F. Rossi, A. Lendasse, D. François, V. Wertz, and M. Verleysen. Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemometrics and Intelligent Laboratory Systems, 80(2):215–226, 2006.
  • [31] A. Ruderman, M. Reid, D. García-García, and J. Petterson. Tighter variational representations of f-divergences via restriction to probability measures. arXiv preprint arXiv:1206.4664, 2012.
  • [32] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In International Conference on Computer Vision (ICCV), volume 2, page 5, 2017.
  • [33] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.
  • [34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research (JMLR), 11(Dec):3371–3408, 2010.
  • [35] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems (NIPS), pages 613–621, 2016.
  • [36] K. Vougioukas, S. Petridis, and M. Pantic. End-to-end speech-driven facial animation with temporal gans. In British Machine Vision Conference (BMVC), 2018.
  • [37] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
  • [38] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [39] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. CoRR, abs/1807.07860, 2018.