Real-world talking faces are often accompanied by natural head movement. However, most existing talking face video generation methods only consider facial animation with a fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which make the synthesized talking face video far from realistic. To address this challenge, we reconstruct 3D face animation and re-render it into synthesized frames. To fine-tune these frames into realistic ones with smooth background transition, we propose a novel memory-augmented GAN module. By first training a general mapping based on a publicly available dataset and fine-tuning the mapping using the input short video of the target person, we develop an effective strategy that only requires a small number of frames (about 300 frames) to learn personalized talking behavior including head pose. Extensive experiments and two user studies show that our method can generate high-quality talking face videos (i.e., with personalized head movements, expressions and good lip synchronization), which look natural and have more distinguishing head movement effects than those of state-of-the-art methods.
Visual and auditory modalities are two important sensory channels in human-to-human or human-to-machine interaction. The information in these two modalities is strongly correlated. Recently, cross-modality learning and modeling have attracted more and more attention in interdisciplinary research, including computer vision, computer graphics and multimedia (e.g., [6, 8, 7, 9, 34, 46]).
In this paper, we focus on talking face video generation that transfers a segment of audio signal of a source person into the visual information of a target person. This kind of audio-driven vision model has a wide range of applications, such as bandwidth-limited video transformation, virtual anchors and role-playing game/move generation. Recently, many works have been proposed for this purpose (e.g., [6, 7, 44, 60]). However, most of them only consider facial animation with a fixed head pose.
In real-world scenarios, natural head movement plays an important role in high-quality communication, and human perception is very sensitive to subtle head movement in real videos. In fact, humans can easily feel uncomfortable when communicating with someone who talks with a fixed head pose. In this paper, we propose a deep neural network model to generate an audio-driven high-quality talking face video with personalized head pose.
Inferring head pose from speech (abbreviated as pose-from-speech) is not a new idea (e.g., [19, 20]). Although some measurable correlations have been observed between speech and head pose [3, 34], predicting head motion from speech is still a challenging problem. A practical method has been suggested that first infers facial activity from speech and then models head pose from facial features. In our work, observing that simultaneously learning two related tasks in a deep network may help improve the performance of both tasks, we simultaneously infer facial expressions and head pose from speech.
Since natural head poses often cause in-plane and out-of-plane head rotations, it is very challenging to synthesize a realistic talking face video with high-quality facial animation and smooth background transition. To circumvent the difficult pose-from-speech problem and focus on addressing the realistic video synthesis challenge, we design the input of our system to include a segment of audio signal of a source person and a short (only a few seconds) talking face video of a target person. Note that with the popularization of smartphones, the cost of capturing a very short video (e.g., 10 seconds) is almost the same as taking a photo (e.g., a selfie). Therefore we use both facial and audio information in the input short video to learn the personalized talking behavior of the target person (e.g., lip and head movements), which greatly simplifies our system.
To output a high-quality synthesized video of the target person with personalized head pose when speaking the input audio signal of the source person, our system reconstructs 3D face animation and re-renders it into video frames. Given a light-weight rendering engine with limited information, these rendered frames are often far from realistic. We therefore propose a novel memory-augmented GAN module that can refine the rough rendered frames into realistic frames with smooth transition, according to the identity feature of the target person. To the best of our knowledge, our proposed method is the first system that can transfer the audio signal of an arbitrary source person into the talking face video of an arbitrary target person with personalized head pose. As a comparison, previous work can only generate a high-quality talking face video with personalized head pose for a specified person (i.e., Obama), since it requires a large amount of training data related to this specified person, and thus it cannot generalize to arbitrary subjects. Furthermore, when the input short talking face video is not available, our method can also use a face image as input and achieve lip synchronization and video quality comparable with previous methods [6, 7, 60]. Our code is publicly available at https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose.
The contributions of this paper are mainly three-fold:
We propose a novel deep neural network model that can transfer an audio signal of an arbitrary source person into a high-quality talking face video of an arbitrary target person, with personalized head pose and lip synchronization.
Different from previous networks that fine-tune the rendering of a specified parametric face model into photo-realistic video frames, our memory-augmented GAN module can generate photo-realistic video frames for various face identities (i.e., corresponding to different target persons).
By first training a general mapping based on a publicly available dataset and fine-tuning the mapping using the input short video of the target person, we develop an effective strategy that only requires a small number of frames (about 300 frames) to learn personalized talking behavior including head pose.
Existing talking face video generation methods can be broadly categorised into two classes according to the driving signal. One driving signal is video frames [49, 1, 38, 56, 30, 58, 59] and the other is audio [13, 46, 56, 7, 60, 6, 44, 53, 48]. Video-driven talking face video generation (a.k.a. face reenactment) transfers expression and sometimes head pose from a driving frame to a face image of the target actor. Traditional optimization methods transferred expression using 3DMM parameters [49, 50] or image warping. Learning-based methods [30, 58] were trained on videos of the target actor or on general audio-visual data, using a GAN model conditioned on an image or additional landmarks. Video-frame-driven methods only use one modality, i.e., visual information.
Audio-driven methods make use of both visual and auditory modalities, and can be further classified into two sub-classes: talking face video generation for a specific face [13, 46] and for an arbitrary target face [6, 7, 44, 60]. The latter methods usually take a clip of audio and one arbitrary face image as input. Chung et al. learned a joint embedding of the face and audio signal, and used an encoder-decoder CNN model to generate talking face video. Zhou et al. proposed a method in which both audio and video can serve as input by learning a joint audio-visual representation. Chen et al. first transferred the audio to facial landmarks and then generated video frames conditioned on the landmarks. Song et al. proposed a conditional recurrent adversarial network that integrated audio and image features in recurrent units. However, in talking face videos generated by these 2D methods, the head pose is almost fixed during talking. This drawback is inherent in 2D-based methods, since it is difficult to model the change of pose naturally using 2D information alone. Although Song et al. mentioned that their method can be extended to personalized pose for a special case, full details on this extension have not yet been presented. In comparison, we introduce 3D geometry information into the proposed system to simultaneously model personalized head pose, expression and lip synchronization.
3D face reconstruction aims to reconstruct the 3D shape and appearance of a human face from 2D images. A large number of methods have been proposed in this area, and the reader is referred to existing surveys and the references therein. Most of these methods are based on the 3D Morphable Model (3DMM), which learns a PCA basis from a scanned 3D face data set to represent general face shapes. Traditional methods fit the 3DMM by an analysis-by-synthesis approach, which optimizes 3DMM parameters by minimizing the difference between the rendered reconstruction and the given image [2, 14, 27].
Learning-based methods [40, 61, 41, 26, 43, 47, 51, 16, 15, 21, 12] use a CNN to learn a mapping from face images to 3DMM parameters. To deal with the lack of sufficient training data, some methods use synthetic data [61, 40, 43, 21], while others use unsupervised or weakly-supervised learning [47, 51, 16, 12]. In this paper, we adopt a CNN-based method for 3D face reconstruction (Section 3.1).
Generative Adversarial Networks (GANs) have been successfully applied to many computer vision problems. Pix2Pix, proposed by Isola et al., has shown great power in image-to-image translation between two different domains. Later it was extended to video-to-video synthesis [55, 54]. It has also been applied to the field of facial animation and texture synthesis. Kim et al. used a GAN to transform rendered face images into realistic video frames. Although this method can achieve good results, it is only suitable for a specific target person, and it has to be trained on thousands of samples related to this specific person. Olszewski et al. proposed a network to generate realistic dynamic textures.
Memory networks have been applied to tasks including image colorization. Since this scheme can remember selected critical information, it is effective for one-shot or few-shot learning. In this paper, we use a GAN augmented with memory networks to fine-tune rendered frames into realistic frames for an arbitrary person.
In this paper, we tackle the problem of generating a high-quality talking face video, given the audio speech of a source person and a short video (about 10 seconds) of a target person. In addition to learning the transformation from the audio speech to lip motion and facial expression, our talking face generation also considers the personalized talking behavior (i.e., head pose) of the target person.
To achieve this goal, our idea is to use 3D facial animation with personalized head pose as the kernel to bridge the gap between audio-visual-driven head pose learning and realistic talking face video generation. The flowchart of our method is illustrated in Figure 1, which can be interpreted in the following two stages.
Stage 1: from audio-visual information to 3D facial animation. We use the LRW video dataset to train a general mapping from the audio speech to the facial expression and common head pose. Then, given an audio signal and a short video, we first reconstruct the 3D face (Section 3.1) and fine-tune the general mapping to learn personalized talking behavior from the input video (Section 3.2). In this way, we obtain the 3D facial animation with personalized head pose.
Stage 2: from 3D facial animation to realistic talking face video generation. We render the 3D facial animation into video frames using the texture and lighting information obtained from the input video. With this limited information, the graphic engine can only provide a rough rendering effect that is usually not realistic enough for a high-quality video. To refine these synthesized frames into realistic ones, we propose a novel memory-augmented GAN module (Section 3.4) that is also trained on the LRW video dataset. This GAN module can deal with various identities and generate high-quality frames containing realistic talking faces that match the face identity extracted from the input video.
Note that both mappings in the above two stages involve two steps: the first is the general mapping learned from the LRW video dataset, and the second is a light-weight fine-tuning step that learns/retrieves personalized talking or rendering information from the input video.
We adopt a state-of-the-art deep learning based method for 3D face reconstruction. It uses a CNN to fit a parametric model of 3D face geometry, texture and illumination to an input face photo $I$. This method reconstructs the 3DMM coefficients $\chi(I) = \{\alpha, \beta, \delta, \gamma, p\}$, where $\alpha$ is the coefficient vector for face identity, $\beta$ is for expression, $\delta$ is for texture, $\gamma$ is the coefficient vector for illumination, and $p$ is the pose vector including rotation and translation. Then the face shape $S$ and face texture $T$ can be represented as $S = \bar{S} + B_{id}\alpha + B_{exp}\beta$ and $T = \bar{T} + B_{tex}\delta$, where $\bar{S}$ and $\bar{T}$ are the average shape and texture, and $B_{id}$, $B_{exp}$ and $B_{tex}$ are the PCA bases for shape, expression and texture, respectively. The Basel Face Model is used for $B_{id}$ and $B_{tex}$, and FaceWareHouse is used for $B_{exp}$.
The illumination is computed using the Lambertian surface assumption and approximated with spherical harmonics (SH) basis functions. The irradiance of a vertex $v_i$ with normal vector $n_i$ and texture $t_i$ is $C(n_i, t_i \mid \gamma) = t_i \cdot \sum_{b=1}^{B^2} \gamma_b \Phi_b(n_i)$, where $\Phi_b$ are SH basis functions, $\gamma_b$ are SH coefficients and $B$ is the number of SH bands. The pose is represented by rotation angles and translation. A perspective camera model is used to project the 3D face model onto the image plane.
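As a minimal sketch of the linear 3DMM described above, the face shape can be assembled as the average shape plus identity and expression offsets, each obtained by multiplying a PCA basis by its coefficient vector. The function names and the tiny toy bases below are hypothetical and chosen only for illustration; real bases have thousands of rows (three per mesh vertex).

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output coordinate


def mat_vec(basis: Matrix, coeffs: Vector) -> Vector:
    """Multiply a (rows x k) basis matrix by a length-k coefficient vector."""
    return [sum(b * c for b, c in zip(row, coeffs)) for row in basis]


def face_shape(mean_shape: Vector, id_basis: Matrix, id_coeffs: Vector,
               exp_basis: Matrix, exp_coeffs: Vector) -> Vector:
    """Shape = average shape + identity offset + expression offset."""
    identity_offset = mat_vec(id_basis, id_coeffs)
    expression_offset = mat_vec(exp_basis, exp_coeffs)
    return [m + i + e
            for m, i, e in zip(mean_shape, identity_offset, expression_offset)]


# Toy example (assumed values): one vertex with 3 coordinates,
# 2 identity components and 1 expression component.
mean_shape = [0.0, 0.0, 0.0]
id_basis = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
exp_basis = [[0.0], [0.0], [2.0]]
shape = face_shape(mean_shape, id_basis, [0.5, -0.5], exp_basis, [0.25])
print(shape)  # [0.5, -0.5, 0.5]
```

The texture is obtained in the same way from the average texture and the texture basis; keeping identity and expression in separate bases is what later allows the expression coefficients to be replaced by audio-driven ones while the identity stays fixed.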
It is well recognized that the audio signal has strong correlation with lip and lower-half face movements. However, talking faces with only lower-half face movements are stiff and far from natural. In other words, upper-half face (including eyes and brows) movements and head pose are also essential for a natural talking face. We use both the audio information and the 3D face geometry information extracted from input video to establish a mapping from the input audio to the facial expression and head pose. Note that although a person may have different head poses when speaking the same word, the speaking style in a short period is often consistent and we provide a correlation analysis between audio and pose in Appendix A.
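We extract the Mel-frequency cepstral coefficient (MFCC) features of the input audio, and model the facial expression and head pose using 3DMM coefficients. To establish the mapping between them, we design an LSTM network as follows. Given the MFCC features of an audio sequence $s = \{s^{(1)}, \ldots, s^{(N)}\}$, a ground-truth expression coefficient sequence $\beta = \{\beta^{(1)}, \ldots, \beta^{(N)}\}$, and a ground-truth pose vector sequence $p = \{p^{(1)}, \ldots, p^{(N)}\}$, we generate a predicted expression coefficient sequence $\tilde{\beta} = \{\tilde{\beta}^{(1)}, \ldots, \tilde{\beta}^{(N)}\}$ and pose vector sequence $\tilde{p} = \{\tilde{p}^{(1)}, \ldots, \tilde{p}^{(N)}\}$. Denoting the LSTM network as $R$, our audio-to-expression-and-head-pose mapping can be formulated as
$$[\tilde{\beta}^{(t)}, \tilde{p}^{(t)}, h^{(t)}, c^{(t)}] = R\big(E(s^{(t)}), h^{(t-1)}, c^{(t-1)}\big), \quad (1)$$
where $E$ is an additional audio encoder that is applied to the MFCC features of the audio sequence $s$, and $h^{(t)}$ and $c^{(t)}$ are the hidden state and cell state of the LSTM unit at time $t$, respectively.
We use a loss function containing four terms to optimize the network: a mean squared error (MSE) loss for expression coefficients, an MSE loss for pose coefficients, an inter-frame continuity loss for pose, and an inter-frame continuity loss for expression. Denoting the shorthand notation of Eq. (1) as $\tilde{\beta} = \phi_1(s)$ and $\tilde{p} = \phi_2(s)$, the loss function is formulated as
$$L(R, E) = \mathbb{E}_{s,\beta}\big[\|\beta - \phi_1(s)\|^2\big] + \lambda_1 \mathbb{E}_{s,p}\big[\|p - \phi_2(s)\|^2\big] + \lambda_2 \mathbb{E}_{s}\Big[\sum_{t=1}^{N-1} \|\phi_2(s)^{(t+1)} - \phi_2(s)^{(t)}\|^2\Big] + \lambda_3 \mathbb{E}_{s}\Big[\sum_{t=1}^{N-1} \|\phi_1(s)^{(t+1)} - \phi_1(s)^{(t)}\|^2\Big],$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights of the loss terms, and the inter-frame continuity loss is computed by the squared norm of the temporal gradient of the pose / expression.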
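The four loss terms described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the continuity term is taken as the sum of squared differences between adjacent frames (a discrete temporal gradient), and the three weights default to 1.0 because their actual values are not given here.

```python
from typing import List

FrameSeq = List[List[float]]  # one coefficient vector per frame


def mse(pred: FrameSeq, target: FrameSeq) -> float:
    """Mean squared error over all frames and coefficients."""
    errors = [(p - t) ** 2
              for pred_frame, target_frame in zip(pred, target)
              for p, t in zip(pred_frame, target_frame)]
    return sum(errors) / len(errors)


def continuity(seq: FrameSeq) -> float:
    """Sum of squared norms of the differences between adjacent frames."""
    return sum((b - a) ** 2
               for prev, curr in zip(seq, seq[1:])
               for a, b in zip(prev, curr))


def total_loss(pred_exp: FrameSeq, gt_exp: FrameSeq,
               pred_pose: FrameSeq, gt_pose: FrameSeq,
               w_pose: float = 1.0, w_pose_smooth: float = 1.0,
               w_exp_smooth: float = 1.0) -> float:
    """Expression MSE + pose MSE + pose continuity + expression continuity.

    The weights are placeholders (assumed), not values from the paper.
    """
    return (mse(pred_exp, gt_exp)
            + w_pose * mse(pred_pose, gt_pose)
            + w_pose_smooth * continuity(pred_pose)
            + w_exp_smooth * continuity(pred_exp))


# Toy sequences with one coefficient per frame (assumed values).
gt_exp = [[0.5], [0.5], [0.5]]
pred_exp = [[0.5], [0.5], [0.5]]
gt_pose = [[0.0], [1.0], [2.0]]
pred_pose = [[0.0], [1.0], [3.0]]
loss = total_loss(pred_exp, gt_exp, pred_pose, gt_pose)
```

The continuity terms penalize frame-to-frame jumps in the predicted sequences, which discourages jittery head motion even when each individual frame is close to the ground truth.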
By reconstructing the 3D face of the target person (Section 3.1) and generating the expression and pose sequences (Section 3.2), we collect a mixed sequence of 3DMM coefficients synchronized with the audio speech, in which the identity, texture and illumination coefficients are from the target person, and the expression and pose coefficients are from the audio. Given this mixed sequence of 3DMM coefficients, we can render a face image sequence using a light-weight rendering engine.
If we compute the albedos from the reconstructed 3DMM coefficients, these albedos are of low frequency and too smooth, resulting in rendered face images that do not appear visually similar to the input face images. An alternative is to compute a detailed albedo from the input face images. Specifically, we first project the reconstructed 3D shape (a face mesh) onto the image plane, and then assign the pixel color to each mesh vertex. The albedo is then computed by dividing the pixel color by the illumination. Finally, the albedo from the frame with the most neutral expression and the smallest rotation angles is set as the albedo of the video.
We use both of the above schemes in our method. In the general mapping, we use the detailed albedo for rendering, because videos in the LRW dataset are very short (about 1 second). In the personalized mapping (i.e., tuning by the input short video), we use the low-frequency albedo to tolerate the change of head pose, and the input video (about 10 seconds) can provide more training data of the target person to fine-tune the synthesized frames (rendered with a low-frequency albedo) into realistic ones.
So far the rendered frames only have the facial part, without the hair and background regions that are also essential for a realistic talking face video. An intuitive solution is to select a background from the input video by matching the head pose. However, for a short video of about 10 seconds, we have fewer than 300 frames from which to select a suitable background, which is very few and can be regarded as very sparse points in the possibly high-dimensional pose space. Our experiments also show that this intuitive solution cannot produce good video frames.
In our method, we propose to extract some keyframes from the synthesized pose sequence, where the keyframes correspond to critical head movements in the synthesized pose sequence. We choose the keyframes to be the frames with the largest head orientation in one axis in a short period of time, e.g., the frame with the leftmost or rightmost head pose. Then we only match backgrounds for these keyframes. We call these matched backgrounds key backgrounds. For the frames between two neighboring keyframes, we use linear interpolation to determine their backgrounds. The pose in each frame is also modified to fit the background. Finally, the whole rendered frames are assembled by including the matched backgrounds.
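One possible reading of the keyframe rule above can be sketched as follows: a frame is treated as a keyframe if its rotation angle about one axis (here, yaw) is the largest or smallest within a short temporal window centered on it. The function name, the window size and the toy yaw values are assumptions for illustration; the exact selection rule used in our system may differ.

```python
from typing import List


def select_keyframes(yaw: List[float], half_window: int) -> List[int]:
    """Return indices whose yaw is an extreme value within its local window.

    half_window is a hypothetical parameter: the number of frames inspected
    on each side of the candidate frame.
    """
    keyframes = []
    for t, angle in enumerate(yaw):
        lo = max(0, t - half_window)
        hi = min(len(yaw), t + half_window + 1)
        window = yaw[lo:hi]
        # Leftmost / rightmost head pose within the short period.
        if angle == max(window) or angle == min(window):
            keyframes.append(t)
    return keyframes


# Toy yaw sequence in degrees (assumed values): turn right, then left.
yaw = [0, 5, 10, 5, 0, -5, -10, -5, 0]
print(select_keyframes(yaw, half_window=2))  # [0, 2, 6, 8]
```

Backgrounds would then be matched only for the returned indices, with the frames in between handled by linear interpolation between the two neighboring keyframes.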
If only a single face image is input instead of a short video, we obtain the matched background by rotating the input image to the predicted pose using a face profiling method.
The synthesized frames rendered by the light-weight graphic engine are usually far from realistic. To refine these frames into realistic ones, we propose a memory-augmented GAN. The differences between our method and previous GAN-based face reenactment (FR) methods are:
FR only refines the frames for a single, specified face identity, while our method can deal with various face identities. That is, given different identity features of target faces, our method can output different frame refinement effects with the same GAN model.
FR uses thousands of frames to train a network for a single, specified face identity, while we only use a few frames for each identity when learning the general mapping. Based on the general mapping, we fine-tune the network using a small number of frames of the target face (from the input short video).
We model the frame refinement process as a function $\Phi$ that maps from the rendered frame (i.e., synthesized frame rendered by the graphic engine) domain $\mathcal{R}$ to the real frame domain $\mathcal{T}$ using paired training data $\{(r_i, g_i)\}$, with $r_i \in \mathcal{R}$ and $g_i \in \mathcal{T}$. To handle multiple-identity refinement, we build a GAN that consists of a conditional generator $G$, a conditional discriminator $D$ and an additional memory network $M$ (Figure 2). The memory network stores paired features, i.e., (spatial feature, identity feature), which are updated during the training process. Its role is to remember representative identities including rare instances in the training set, and to retrieve the best-matching identity feature during the test. The conditional generator takes a window of rendered frames (i.e., a subset $r_W$ of 3 adjacent frames) and an identity feature as input, and synthesizes a refined frame $o_t$ using a U-Net with AdaIN. The conditional discriminator takes a window of rendered frames and either a refined frame or a real frame as input, and decides whether the frame is real or synthesized.
Attention-based generator $G$. We use an attention-based generator to refine rendered frames. Given a window of rendered frames $r_W$ and an identity feature $f$ (extracted from ArcFace), the generator synthesizes both a color mask $c_t$ and an attention mask $a_t$, and outputs a refined frame $o_t$ that is the weighted average of the rendered frame $r_t$ and the color mask:
$$o_t = a_t \cdot r_t + (1 - a_t) \cdot c_t.$$
The attention mask specifies how much each pixel in the generated color mask contributes to the final refinement. Our generator architecture is based on a U-Net structure and has two modifications. (1) To generate two outputs (i.e., color and attention masks), we modify the last convolution block into two parallel convolution blocks, each of which generates one mask. (2) To take both a window of rendered frames and an identity feature as input, we adopt AdaIN to incorporate identity features into our network, where the AdaIN parameters are generated from the input identity features. The attention mechanism has also been used in previous work; the difference in our network is that we also input an additional identity feature into the network, which enables generating different refining effects for different identities. Experimental results show that our network can generate delicate target-person-dependent texture for various identities.
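The per-pixel weighted average performed by the generator's output stage can be sketched as below. This is an illustrative assumption rather than our exact implementation: the attention value is taken to weight the rendered frame and its complement to weight the color mask, images are flattened to lists of intensities in [0, 1], and the function name is hypothetical.

```python
from typing import List


def blend(rendered: List[float], color: List[float],
          attention: List[float]) -> List[float]:
    """Per-pixel weighted average of the rendered frame and the color mask.

    Assumed convention: attention = 1 keeps the rendered pixel,
    attention = 0 takes the generated color pixel.
    """
    return [a * r + (1.0 - a) * c
            for r, c, a in zip(rendered, color, attention)]


# Toy two-pixel example (assumed values).
rendered = [0.2, 0.8]
color = [1.0, 0.0]
print(blend(rendered, color, [1.0, 0.0]))  # [0.2, 0.0]
```

If the attention mask saturated at one extreme everywhere, the output would simply copy one of the two inputs, which is why an attention loss is used to keep the mask informative.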
Memory network $M$. We use a memory network to remember representative identities including rare instances in the training set, so that during the test we can retrieve a similar identity feature from it. We adapt an existing memory network to our system by modifying it to output continuous frames. In particular, our memory network stores paired spatial features and identity features. The spatial feature is extracted by (1) feeding the input rendered frame into ResNet18 pre-trained on ImageNet, (2) extracting the ‘pool5’ feature, and (3) passing the ‘pool5’ feature to a learnable fully connected layer and normalization. The paired identity feature is extracted by feeding the corresponding ground-truth frame into ArcFace.
During the training, we update the memory network using paired features extracted from the training set. This updating includes (1) a threshold triplet loss (using the cosine similarity for both spatial and identity features) to make spatial features of similar identities closer and spatial features of different identities farther apart, and (2) a memory item updating process, where either an existing feature pair is updated or an old pair is replaced by a new pair (an old pair is replaced when the similarity between the current identity feature and the closest identity feature is smaller than a threshold). During the test, we retrieve the identity feature by using the spatial feature as the query, finding its nearest spatial feature in memory and returning the corresponding identity feature. Noting that directly feeding this feature into the generator may lead to jittering effects, we smooth the retrieved features in multiple adjacent frames by interpolation and use the smoothed features as inputs for the generator.
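The test-time retrieval step described above amounts to a nearest-neighbor lookup under cosine similarity. The following sketch uses hypothetical names and tiny 2D toy features; it omits the training-time memory update and the temporal smoothing of retrieved features.

```python
import math
from typing import List, Tuple

Feature = List[float]
MemoryItem = Tuple[Feature, Feature]  # (spatial feature, identity feature)


def cosine_similarity(u: Feature, v: Feature) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_identity(memory: List[MemoryItem], query: Feature) -> Feature:
    """Return the identity feature paired with the most similar spatial feature."""
    _, identity = max(memory,
                      key=lambda item: cosine_similarity(item[0], query))
    return identity


# Toy memory with two stored pairs (assumed values).
memory = [
    ([1.0, 0.0], [0.1, 0.9]),
    ([0.0, 1.0], [0.7, 0.3]),
]
print(retrieve_identity(memory, [0.9, 0.1]))  # [0.1, 0.9]
```

Because the lookup is performed independently per frame, two adjacent frames can return different memory items, which is the source of the jitter that the interpolation step is meant to remove.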
Discriminator $D$. The conditional discriminator takes a window of rendered frames and a checking frame (either a refined frame or a real frame) as input, and discriminates whether the checking frame is real or not. We adopt the PatchGAN architecture as our discriminator.
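Loss function. The loss function of our GAN model has three terms: a GAN loss, an $L_1$ loss, and an attention loss that prevents the attention mask from saturating and also enforces the smoothness of the attention mask. Note that during the training process, the memory network is updated separately, and the GAN is trained after each update of the memory network. Denoting the input rendered frames as $r$, the identity feature as $f$, and the ground-truth real frames as $g$, the loss function of the generator $G$ and the discriminator $D$ is formulated as:
$$L(G, D) = \mathbb{E}_{r,g}\big[\log D(r, g)\big] + \mathbb{E}_{r,f}\big[\log\big(1 - D(r, G(r, f))\big)\big] + \lambda_1 \mathbb{E}_{r,f,g}\big[\|g - G(r, f)\|_1\big] + \lambda_2 L_{att}(G),$$
where the first two terms form the GAN loss, $L_{att}$ is the attention loss, and $\lambda_1$ and $\lambda_2$ are the weights of the loss terms.
We train the GAN model to optimize the loss function:
$$G^* = \arg\min_G \max_D L(G, D).$$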
We implemented our method in PyTorch. All experiments are performed on a PC with a Titan Xp GPU. The code is publicly available at https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose. The dynamic results in this section can be found in the accompanying demo video at https://cg.cs.tsinghua.edu.cn/people/~Yongjin/Yongjin.htm.
In our model, the two components (audio to expression and pose by LSTM, and memory-augmented GAN) involve two training steps: (1) a general mapping trained on the LRW video dataset and (2) fine-tuning the general mapping to learn the personalized talking behavior. At the fine-tuning step, we collect 15 real-world talking face videos, each of a single person, from YouTube. In each video, we use its first 12 seconds (about 300 frames) as the training data. Given the well-trained general mapping, we observe that 300 frames are sufficient for the fine-tuning task (see Appendix B for details of this observation). In Section 4.3, we evaluate our personalized fine-tuning effect (see Figure 3 for two examples) on both the audio from the original YouTube video and the audio from the VoxCeleb and TCD datasets. Below we denote the general mapping and the fine-tuned personalized mapping as Ours-G and Ours-P, respectively. The network is first trained in Ours-G (using the general dataset) and then fine-tuned in Ours-P (for a specific person).
As illustrated in Figure 1, our method involves two stages. Here we evaluate the importance of these two stages.
In the first stage, we predict both the head pose and expression from the input audio. If we only predict the expression without the pose estimation, as shown in the second row of Figure 5, the generated results are good in lip synchronization, but look rigid due to the fixed head position, which is far from natural.
There are two distinct characteristics in the second stage. First, we include the identity feature in the input of the GAN. Second, we add a memory network to the GAN model to store representative identities in the training set and retrieve the best-matching identity feature during the test. If we exclude both the identity feature from the input and the memory network from the GAN model, the refining effect would be the same for different identities and expressions, and the network cannot be well optimized. As shown in the third row of Figure 5, without them, the generated results have bad mouth details (e.g., strange teeth), uneven cheek areas, and black spots on the face. If we exclude the memory network from the GAN model but keep the identity feature input, and use the mean of the identity features in fine-tuning, the results (the middle row in Figure 6) are not as good as our results (the last row in Figure 6), which have much better fine details (e.g., wrinkles) and look more realistic.
As mentioned in Section 4.1, our model involves two important mappings: Ours-G and Ours-P. Note that (1) the inputs to the personalized mapping Ours-P are a short video of 300 frames (to fine-tune) and an audio, and (2) the inputs to the general mapping Ours-G are only one frame (since it does not need to fine-tune) and an audio.
In this section, we show that the Ours-P model can generate realistic talking face video with more distinguishing head movement effects than the state-of-the-art methods. Even for the degenerate case that uses a single face image as input, the Ours-G model can generate comparable lip synchronization with previous methods.
We first compare Ours-P with three state-of-the-art audio-driven talking face generation methods: YouSaidThat, DAVS and ATVG. These three methods are all 2D-based and operate on the image directly, i.e., without using 3D face geometry and rendering. Thus their inputs include only one facial image and an audio. We emphasize that the head positions in the results output from these methods are fixed. Although Ours-P takes more frames (for the purpose of fine-tuning) as input, we learn personalized head pose. Some qualitative results are shown in Figure 7.
It is very challenging to evaluate the visual quality and naturalness of synthesized videos, in particular regarding the human face. We therefore design a user study to perform the assessment based on subjective score evaluation. To fine-tune the network by considering personalized talking behavior, we collect 15 real-world talking face videos and use their first 12 seconds for training 15 personalized mappings. In our user study, for each of these 15 personalized mappings, we test two sets of audio: one is the audio from the remaining portion of the original real video, and the other is an audio chosen from VoxCeleb or TCD. We choose these two datasets because they have long audio segments that better visualize the change of head pose. We can then construct 30 comparison groups. Each group has five videos: one original video and four videos generated by four methods. Each personalized video contributes to two groups, based on its two sets of audio.
20 participants attended the user study, and each of them compared all 30 groups and answered 3 questions for each group. For a fair comparison, the five videos in each group are presented in a randomly shuffled order. Participants are asked to select the best video according to three criteria: image quality, lip synchronization and the naturalness of talking. The results of subjective scores are summarized in Table I, showing that our method achieves better performance in all three criteria.
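Table I: Results of subjective scores in the user study.
| Methods | Image quality | Lip synchronization | Naturalness |
| You said that | 3.50% | 20.50% | 4.17% |
Table II: Comparison of Ours-G with representative audio-driven methods.
| Methods | Chen | Wiles | You said that | DAVS | ATVG | Ours-G |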
Since most previous talking face generation methods do not consider personalized head pose, we further compare Ours-G (i.e., without fine-tuning personalized talking behavior) with representative audio-driven methods [5, 56, 7, 60, 6]. We directly compare the results generated by different methods with the ground-truth videos.
We follow ATVG to apply three widely used metrics for audio-driven talking face generation evaluation, i.e., the classic PSNR and SSIM metrics for image quality evaluation, and the landmark distance (LMD) for accuracy evaluation of lip movement. The results are summarized in Table II, showing that our method has the best PSNR values and has SSIM and LMD values comparable with ATVG. Some qualitative comparisons are shown in Figure 8.
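For reference, two of the three metrics have simple closed forms that can be sketched as follows. PSNR uses its standard definition; the LMD sketch assumes the metric is the mean Euclidean distance between corresponding predicted and ground-truth landmarks, and the function names and toy inputs are hypothetical. SSIM is omitted because it requires windowed local statistics.

```python
import math
from typing import List, Tuple


def psnr(image_a: List[float], image_b: List[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def landmark_distance(pred: List[Tuple[float, float]],
                      truth: List[Tuple[float, float]]) -> float:
    """Mean Euclidean distance between corresponding 2D landmarks (assumed LMD)."""
    distances = [math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(distances) / len(distances)


# Toy inputs (assumed values).
print(psnr([0.0, 0.0], [255.0, 255.0]))                              # 0.0
print(landmark_distance([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))  # 2.5
```

Higher PSNR indicates better image quality, whereas lower LMD indicates more accurate lip movement.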
To objectively evaluate the quality of personalized head pose, we propose a new metric to measure the similarity of head poses between the generated video and the real video. We use the three Euler angles to model head movements, i.e., pitch, yaw, and roll, corresponding to head nod, head shake/turn, and head tilt, respectively. We compute a histogram $P$ of pose angles in the real personalized video, and a histogram $P'$ of pose angles in the generated video. Then we compute the normalized Wasserstein distance $W(P, P')$ between $P$ and $P'$. The lower the distance, the more similar the two head pose distributions. Our new metric is formulated as
$$HS = 1 - W(P, P'),$$
where $HS$ is in the range $[0, 1]$ and a larger $HS$ indicates higher similarity of head pose. Over the 15 pairs of personalized videos, the maximum and minimum scores are 0.956 and 0.702, respectively, showing that our generated videos have a high similarity to real videos in terms of head movement behavior.
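A minimal sketch of such a histogram-based similarity score is given below, under several assumptions that should be read as illustrative rather than as our exact definition: both histograms share the same equally spaced bins over a single Euler angle, the 1D Wasserstein distance is computed from the difference of the cumulative distributions, it is normalized by its maximum possible value (all mass in opposite end bins), and the score is one minus the normalized distance. All names are hypothetical.

```python
from typing import List


def normalized_wasserstein(hist_p: List[float], hist_q: List[float]) -> float:
    """1D Wasserstein distance between two histograms, scaled to [0, 1]."""
    n = len(hist_p)
    if n < 2 or n != len(hist_q):
        raise ValueError("histograms need the same number (>= 2) of bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    cdf_p = cdf_q = 0.0
    distance = 0.0
    for p, q in zip(hist_p, hist_q):
        cdf_p += p / total_p
        cdf_q += q / total_q
        distance += abs(cdf_p - cdf_q)  # in units of one bin width
    return distance / (n - 1)           # maximum distance is n - 1 bins


def pose_similarity(hist_real: List[float], hist_generated: List[float]) -> float:
    """Assumed score: one minus the normalized Wasserstein distance."""
    return 1.0 - normalized_wasserstein(hist_real, hist_generated)


# Toy yaw histograms with four bins (assumed counts).
hist_real = [2, 5, 2, 1]
hist_generated = [1, 5, 3, 1]
score = pose_similarity(hist_real, hist_generated)
```

Identical histograms give a score of 1, and histograms concentrated in opposite end bins give a score of 0; handling all three Euler angles (for example by averaging per-angle scores) is left open in this sketch.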
To evaluate the validity of this new metric, we performed another user study to examine the correlation between the subjective evaluation and our metric values. Twenty participants took part in this user study. Each participant was asked to compare 15 pairs of generated and real videos, and to rank each pair from 1 to 5 based on the head pose similarity of the two videos (1-not similar, 2-maybe not similar, 3-don’t know, 4-maybe similar, 5-similar). The ranks received the following numbers of votes (in parentheses): 1 (25), 2 (47), 3 (35), 4 (110) and 5 (83). The average score is 3.60, and the percentage of scores 4 and 5 (‘maybe similar’ and ‘similar’) is 64.3%. Only 24.0% of the ranks are ‘not similar’ or ‘maybe not similar’. Furthermore, the correlation coefficient between subjective ranking and the metric is , demonstrating that our metric has strong positive correlation with human perception.
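The summary statistics of this study can be recomputed directly from the reported vote counts. The short sketch below does so; the variable names are chosen for this example only.

```python
# Vote counts per rank, as reported for the head pose similarity user study.
votes = {1: 25, 2: 47, 3: 35, 4: 110, 5: 83}

total_votes = sum(votes.values())  # 20 participants x 15 video pairs
mean_score = sum(rank * count for rank, count in votes.items()) / total_votes

# Share of votes for 'maybe similar' or 'similar' (ranks 4 and 5).
share_similar = (votes[4] + votes[5]) / total_votes

# Share of votes for 'not similar' or 'maybe not similar' (ranks 1 and 2).
share_not_similar = (votes[1] + votes[2]) / total_votes

print(f"votes: {total_votes}, mean: {mean_score:.2f}")
print(f"ranks 4-5: {share_similar:.1%}, ranks 1-2: {share_not_similar:.1%}")
```

The total of 300 votes matches 20 participants each rating 15 pairs.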
In this paper, we propose a deep neural network model that generates a high-quality talking face video of a target person speaking the audio of a source person. Natural talking head poses often cause in-plane and out-of-plane head rotations, which makes it difficult to render realistic frames directly from the input video. To overcome this difficulty, we reconstruct the 3D face and use the 3D facial animation to bridge the gap between audio-visual-driven head pose learning and realistic talking face video generation. The 3D facial animation incorporates personalized head pose and is re-rendered into video frames using a graphics engine. Finally, the rendered frames are fine-tuned into realistic ones using a memory-augmented GAN module. Experimental results and user studies show that our method can generate high-quality talking head videos with personalized head pose, a distinct feature that has not been considered in state-of-the-art audio-driven talking face generation methods.
Our proposed deep network model learns a mapping from audio features to facial expression and head pose. We infer the head pose from audio, based on the observation that a person's speaking style is often consistent over a short period. Our model has two training steps: (1) a general mapping trained on LRW, and (2) a fine-tuning step that uses a short video of the target person as training data. The head pose estimation of the target person is mainly learned during the fine-tuning step: LRW contains data from many different people whose head movements vary, so we learn the head movement behavior from the short video of the target person instead.
To verify the observation that we can infer the head pose from the audio features, we conduct a correlation analysis between the audio and pose in these short videos. We represent the audio using the MFCC feature and represent the pose using the three Euler angles (i.e., pitch, yaw, roll). For an MFCC feature , we find all MFCC features in the same short video that are in its local neighborhood, and calculate the distance between each neighboring MFCC pair and the distance between the corresponding poses. We calculate the correlation between these two distances using a spherical neighborhood with radius. The average correlation coefficient over the 15 short videos is 0.45, and the maximum and minimum correlation coefficients are 0.58 and 0.24, respectively. These results indicate that there exists a positive correlation between the audio and pose.
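A minimal sketch of this kind of analysis for a single short video is given below. It assumes Euclidean distance for both MFCC features and poses, a Pearson correlation coefficient, and a caller-supplied neighborhood radius; these choices and the function names are assumptions of this example, and MFCC extraction itself is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def audio_pose_correlation(mfccs, poses, radius):
    """Correlate MFCC distances with pose distances over neighboring frame pairs.

    mfccs[i] and poses[i] belong to the same frame. A pair of frames is used
    only if their MFCC features lie within `radius` of each other.
    """
    audio_distances, pose_distances = [], []
    for i in range(len(mfccs)):
        for j in range(i + 1, len(mfccs)):
            d_audio = euclidean(mfccs[i], mfccs[j])
            if d_audio <= radius:
                audio_distances.append(d_audio)
                pose_distances.append(euclidean(poses[i], poses[j]))
    return pearson(audio_distances, pose_distances)
```

Running this per video and averaging the coefficients would give a summary comparable in form to the one reported above.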
In our method, we use a short video of a target person to fine-tune the general mapping into a personalized mapping, which learns personalized talking behavior. Here, we study the relation between the length of the input short video and the quality of the output talking face video. We generate results from input short videos of different lengths, i.e., 4s (100 frames), 8s (200 frames), 12s (300 frames), 20s (500 frames), and 32s (800 frames). Some qualitative results are shown in Figure 9.
We first conducted an expert interview, asking an expert in video quality assessment to choose the results with the best quality and explain why. The expert chose the results trained with 300, 500 and 800 frames, on the grounds that the results trained with fewer frames have obviously lower image quality around the mouth and teeth areas and look somewhat unnatural.
We then conducted the following user study. We asked each user to (1) watch a real video, (2) watch the results generated by the model trained with frames, and (3) select the best ones (in terms of visual quality) from the generated results. Note that in this user study, more than one result can be selected as the best; i.e., the user can select multiple results that have the same best quality. Eleven participants took part in this user study. For the results generated with frames, users selected them as the best one, respectively. These results confirm that the models trained with fewer than 300 frames produce noticeably worse results, and that the model trained with 300 frames achieves a good balance between visual quality and computational efficiency (using fewer frames for training).
Discussion on frame numbers in personalized and general mapping. Fine-tuning with only a few frames generates low video quality, possibly because in the personalized mapping (i.e., fine-tuning with the input short video) we use the low-frequency albedo (to tolerate changes of head pose), and more frames are required to fine-tune the low-frequency albedo into a realistic one. In the general mapping (i.e., trained on LRW, without fine-tuning on the short video), by contrast, we use the detailed albedo. As a result, using only one frame in the general mapping may generate slightly more realistic results than using a few frames (e.g., 10-30 frames) to fine-tune the albedo in the personalized mapping.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7832–7841.
ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699.
Predicting head pose from speech with a conditional variational autoencoder. In 18th Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3991–3995.
Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976.
Ask me anything: dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1378–1387.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), pp. 919–925.