Delivering information in an audio-visual manner is more attractive to humans than delivering it in an audio-only fashion. Authentic audio-visual content creation from an audio clip and a single image of an arbitrary speaker has received great attention recently and has widespread applications, such as human-machine interaction and virtual reality. A large number of one-shot talking-head works [5, 31, 20, 23, 16, 33] have been proposed to synchronize audio and lip movements. However, most of them neglect to infer head motions from audio, and a still head pose is less satisfactory for human observation. Natural and rhythmic head motions are also one of the key factors in generating authentic talking-head videos [8, 4].
Although recent works [4, 32] take head movements into consideration, they often suffer from the ambiguous correspondences between head motions and audio in the training dataset, and fail to produce realistic head movements. For example, given the same audio content, one performer may move the head from left to right while another may move the head from right to left. Such examples introduce ambiguity in training, and a network will therefore produce a nearly still head to minimize the head motion loss. Moreover, when generating head motions, reducing artifacts in background regions is critical as well. However, these methods [4, 32] only focus on the face regions and thus produce obvious distortions in non-face regions (e.g., hair and background). Therefore, producing realistic head motions and artifact-free frames for one-shot talking-head generation remains challenging.
In this paper, a novel one-shot talking-head video generation framework is proposed to produce natural and rhythmic head motions while remarkably suppressing background artifacts caused by head motions. Firstly, we propose to disentangle head motions and expression changes, since head motions might exhibit ambiguity as mentioned above. The head motion is modeled as a rigid 6 degrees of freedom (DoF) movement and acts as the low-frequency holistic movement of the talking head. We generate head motions via a motion-aware recurrent neural network (RNN) (see Figure 1) and let another network focus on producing detailed facial movements.
Considering the ambiguous correspondences between head motions and audio, we opt to enforce the structural similarity between the generated and the ground-truth head motion matrices, which are made up of successive 6D motion vectors, thus achieving more diverse head motions.
After obtaining head motions, we resort to a keypoint-based dense motion field representation to depict the motions of the entire image content, including facial region, head and background movements. We develop an image motion field generator to produce keypoint-based dense motion fields from the input audio, head poses and reference image, which will be used to synthesize new frames. The relative dense motion field is then described by the differences between the predicted keypoints and those of the reference image (see Figure 2). Compared to the 3D face model or landmark-based representations that are used in prior works, the keypoint-based representation models the motions of the facial regions, head, and backgrounds integrally, allowing better control over the spatial and temporal consistency of the generated videos. Finally, an image generation network [26, 28, 27, 29] is employed to render photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image.
Extensive comparisons with the state-of-the-art methods show that our method achieves superior visual quality and authentic head motions without introducing noticeable artifacts. In summary, we make the following technical contributions:
We develop a new audio-driven talking-head generation framework that produces photo-realistic videos with natural head motions from a single image.
We design a motion-aware recurrent neural network to predict natural-looking head motions that match the input audio rhythm.
We present an image motion field generator that produces keypoint-based dense motion fields and is thus able to govern the spatial and temporal consistency of the generated videos.
We achieve state-of-the-art results in terms of visual quality and rhythmic head motions.
2 Related Work
2.1 Speech-driven Talking-head Generation
Given the input audio, some approaches have achieved great success in synthesizing the talking head of a specific speaker [21, 9, 22, 13]. However, they need to be trained on minutes to hours of videos of each speaker. Other works aim to reduce the speaker’s reference information to a single image. Song et al. ijcai2019-129 and Vougioukas et al. vougioukas2019realistic take the temporal dependency into account and generate the talking face video with GANs. Zhou et al. zhou2019talking design an end-to-end framework to generate videos from the disentangled audio-visual representation. Prajwal et al. prajwal2020lip employ a carefully designed lip-sync discriminator for better lip-sync accuracy. These works directly learn the mapping from audio to facial pixels, which tends to produce blurry results. For better facial details, some works utilize intermediate representations to bridge the gap between audio and pixels. Chen et al. chen2020talking and Zhang et al. zhang2021flowguided predict the coefficients of a 3D face model from audio, which are used to guide the image generation. Chen et al. chen2019hierarchical and Zhou et al. zhou2020makelttalk learn to generate facial landmarks from audio first, and then synthesize images from landmarks. Although these methods improve the visual quality of the inner face, none of these intermediate representations models the non-face regions (e.g., hair and background). Hence, the texture and the temporal consistency of the outer face regions are not as good as those of the inner face.
2.2 Video-driven Talking-head Generation
Video-driven methods control the motions of a subject with a driving video. Subject-specific methods [2, 12] focus on a specific person by training their models on that person. Subject-independent approaches drive the reenactment using landmarks [30, 11, 15], latent embeddings or feature warping. Recently, Siarohin et al. siarohin2019first represent the dense motion flow of the driving video as self-learned keypoints, and use the flow to warp the features of the reference image for reenactment. High visual quality is obtained due to the dense intermediate representation of the entire image.
2.3 Head Motion Prediction
Traditional head motion prediction methods are designed for 3D avatars, which only involve the 3D head rotation [8, 10, 17]. Methods designed for 2D images, on the other hand, need to produce both head rotation and translation. Chen et al. chen2020talking constrain the mean value and the standard deviation of the predicted head pose sequence to be similar to those of the real sequence. However, the statistical constraints cannot model the local motion details. Zhou et al. zhou2020makelttalk constrain the predicted facial landmarks to be consistent with those of the ground truth. Due to the L2 loss term, their method suffers from the ambiguous correspondences between head motion and audio, and converges to a slightly swinging motion. Different from the above two works, our method constrains the full head motion sequence for more natural head motion patterns.
3 Proposed Method
Our method takes a reference image and an audio clip as input, and synthesizes video frames of the reference speaker synchronized with the audio. As illustrated in Figure 2, the pipeline of our method consists of four components.
Head Motion Predictor.
As the representation of low-frequency holistic movements, the head poses are predicted individually. From both the reference image and the audio, the head motion predictor produces a natural-looking and rhythmic head motion sequence.
Motion Field Generator.
The motion field generator produces the self-learned keypoint sequence that controls the dense motion field. This keypoint sequence contains the information of synchronized facial expressions, head motions, and non-face region motions.
Keypoint Detector and Image Generator.
The keypoint detector detects the initial keypoints from the reference image. The image generator renders the synthesized images from the relative dense motion between the predicted keypoints and the initial keypoints. The network architectures of the keypoint detector and the image generator are adopted from prior work.
3.2 Head Motion Predictor
| Dataset | Method | PSNR | SSIM | FID |
|---|---|---|---|---|
| LRW | Prajwal et al. prajwal2020lip | 20.17 | 0.62 | 42.41 |
| LRW | Zhou et al. zhou2020makelttalk | 19.20 | 0.59 | 48.13 |
| GRID | Prajwal et al. prajwal2020lip | 28.82 | 0.86 | 31.14 |
| GRID | Zhou et al. zhou2020makelttalk | 24.55 | 0.78 | 34.57 |
| VoxCeleb | Prajwal et al. prajwal2020lip | 17.33 | 0.50 | 61.58 |
| VoxCeleb | Zhou et al. zhou2020makelttalk | 17.45 | 0.50 | 53.95 |
As preprocessing, the input raw audio is first converted to an acoustic feature sequence, in which each element is an acoustic feature frame. To be consistent with the frequency of the videos sampled at 25 fps, each acoustic feature frame is made up of acoustic features from several successive sliding windows. The acoustic features extracted from each sliding window include 13 Mel Frequency Cepstrum Coefficients (MFCC), 26 Mel-filterbank energy features (FBANK), pitch and voiceless. The sliding window has a window size of 25 ms and a step size of 10 ms.
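The arithmetic behind this alignment can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: it assumes that pitch and voiceless each contribute one dimension, and that sliding windows are grouped per video frame by simple integer division. The function names are hypothetical.

```python
VIDEO_FPS = 25      # from the text
WINDOW_MS = 25      # from the text
STEP_MS = 10        # from the text

MFCC_DIM = 13       # from the text
FBANK_DIM = 26      # from the text
PITCH_DIM = 1       # assumption: pitch is a single value per window
VOICELESS_DIM = 1   # assumption: voiceless is a single value per window


def windows_per_video_frame(video_fps=VIDEO_FPS, step_ms=STEP_MS):
    """Number of sliding-window steps that fit into one video frame."""
    frame_ms = 1000 / video_fps
    return int(frame_ms // step_ms)


def num_sliding_windows(audio_ms, window_ms=WINDOW_MS, step_ms=STEP_MS):
    """Number of complete sliding windows in a clip of audio_ms milliseconds."""
    if audio_ms < window_ms:
        return 0
    return (audio_ms - window_ms) // step_ms + 1


def per_window_dim():
    """Feature dimension of one sliding window under the assumptions above."""
    return MFCC_DIM + FBANK_DIM + PITCH_DIM + VOICELESS_DIM


def group_windows(window_features, group_size):
    """Group consecutive windows so that one group aligns with one video frame.

    Trailing windows that do not fill a complete group are dropped.
    """
    usable = len(window_features) - len(window_features) % group_size
    return [window_features[i:i + group_size] for i in range(0, usable, group_size)]
```

Under these assumptions, a video frame at 25 fps spans 40 ms and therefore covers four window steps of 10 ms each.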
In our work, the head motion predictor is used to predict a head pose sequence describing head rotation and translation in the camera coordinate system. For reference images with different head scales and body positions, the natural head trajectories in pixel coordinates are also different. Specifically, we employ an encoder (ResNet-34) to extract the initial head and body state from the reference image. The extracted spatial embedding encodes the initial head rotation and translation. Afterwards, we use a two-layer LSTM to create a natural head motion sequence that matches the audio rhythm, as shown in Figure 3. At each time step, we first extract the audio embedding from the current acoustic feature frame with another ResNet-34 encoder and concatenate it with the spatial embedding from the previous step. Then, the LSTM takes the concatenated embeddings as input and outputs the current spatial embedding. Such spatial embedding transition (SET) passes the previous spatial embedding to the next time step, and therefore contributes to more natural head motions and better synchronization with audio. Finally, a decoder is used to decode the spatial embedding to a 6D head pose (3 for rotation and 3 for translation). Our head motion predictor supports an arbitrary length of audio input.
In this procedure, the LSTM also carries its hidden state from one step to the next, and the audio and spatial embeddings are joined by concatenation.
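The control flow of this recurrence can be sketched in plain Python. The sketch only illustrates the spatial embedding transition described above: the real model uses ResNet-34 encoders, a two-layer LSTM and a learned decoder, whereas `toy_step` and `toy_decode` below are hypothetical stand-ins chosen so that the example runs without any deep learning library.

```python
import math


def predict_head_poses(initial_spatial_emb, audio_embs, step_fn, decode_fn):
    """Run the recurrence over an audio embedding sequence of any length.

    At each step the audio embedding is concatenated with the spatial
    embedding from the previous step, the recurrent cell produces the
    current spatial embedding, and a decoder maps it to a 6D pose.
    """
    spatial = list(initial_spatial_emb)   # from the reference image encoder
    hidden = None                         # recurrent state carried across steps
    poses = []
    for audio in audio_embs:
        joint = list(audio) + spatial     # concatenation of the two embeddings
        spatial, hidden = step_fn(joint, hidden)
        poses.append(decode_fn(spatial))
    return poses


def toy_step(joint, hidden, dim=4):
    """Hypothetical stand-in for the LSTM cell (not a real LSTM)."""
    prev = hidden if hidden is not None else [0.0] * dim
    mean_in = sum(joint) / len(joint)
    new_hidden = [math.tanh(0.5 * h + mean_in + 0.1 * i) for i, h in enumerate(prev)]
    return list(new_hidden), new_hidden


def toy_decode(spatial):
    """Hypothetical stand-in for the decoder: returns 3 rotation + 3 translation values."""
    total = sum(spatial)
    return [0.1 * (k + 1) * total for k in range(6)]
```

Because the initial spatial embedding enters the first concatenation, two different reference images lead to different pose sequences for the same audio, which is the behavior the text attributes to SET.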
Since the mapping from audio to head motion is a one-to-many mapping, the widely used L1 and L2 losses are not suitable choices. Instead, we treat the predicted head pose sequence as an image formed by stacking the successive 6D pose vectors, and impose a structural constraint on it using the Structural Similarity (SSIM) loss.
The SSIM loss is computed from the mean and standard deviation of the predicted head pose sequence, the mean and standard deviation of the ground-truth head pose sequence extracted by OpenFace, and the covariance between the two sequences, together with two small constants. To improve the fidelity and smoothness of the predicted head motions, we also employ a discriminator based on PatchGAN. Specifically, we adapt the original PatchGAN to perform 1D convolution operations on the head motion sequence along temporal trunks. The total loss function for the head motion predictor combines the SSIM loss and the GAN loss.
The GAN loss is calculated by LSGAN. The head motion predictor is trained on windows of length T.
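To make the structural constraint concrete, the following sketch computes an SSIM-style loss between two head pose matrices with T rows and 6 columns. It uses the standard SSIM expression with statistics taken over the whole matrix as a single window; whether the original implementation uses sliding windows, and the values of the two constants `c1` and `c2`, are assumptions here.

```python
def ssim_loss(pred, target, c1=1e-4, c2=9e-4):
    """Return 1 - SSIM for two equally sized matrices given as lists of rows.

    c1 and c2 are small stabilizing constants (values assumed for illustration).
    """
    x = [v for row in pred for v in row]
    y = [v for row in target for v in row]
    if not x or len(x) != len(y):
        raise ValueError("pred and target must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    )
    return 1.0 - ssim


# A toy ground-truth sequence with motion, and a "still head" prediction
# that holds the mean pose value in every frame.
target = [[0.1 * t * (k + 1) for k in range(6)] for t in range(8)]
mean_value = sum(v for row in target for v in row) / 48
still = [[mean_value] * 6 for _ in range(8)]
```

In this toy case the still prediction matches the mean of the target exactly, yet its SSIM loss is close to 1 because it has no variance and no covariance with the target. This is one way to see why a structural loss discourages the nearly still outputs that the text attributes to L1 and L2 losses.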
3.3 Motion Field Generator
From the reference image, the audio features and the predicted head pose sequence, the motion field generator produces the keypoint sequence that controls the dense motion field. For each frame, the keypoints include N positions and the corresponding N Jacobians. Each position-Jacobian pair represents a local image affine transform. The N local affine transforms, combined with adaptive masks, constitute the dense motion field.
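How a position-Jacobian pair acts as a local affine transform can be sketched as below. The form follows the first-order approximation commonly used with self-learned keypoints and is an assumption here, since the exact formulation is not shown in this section; in particular, a single 2x2 matrix `jac` stands for the relative Jacobian between the reference and the predicted keypoint, and `masks` are per-pixel weights assumed to sum to 1.

```python
def local_affine(z, kp_pred, kp_ref, jac):
    """Map pixel z of the generated frame to a location in the reference image.

    z, kp_pred, kp_ref are (x, y) tuples; jac is a 2x2 matrix as nested lists.
    """
    dx = z[0] - kp_pred[0]
    dy = z[1] - kp_pred[1]
    return (
        kp_ref[0] + jac[0][0] * dx + jac[0][1] * dy,
        kp_ref[1] + jac[1][0] * dx + jac[1][1] * dy,
    )


def dense_motion_at(z, kps_pred, kps_ref, jacobians, masks):
    """Blend the N local affine transforms at pixel z with mask weights."""
    x = y = 0.0
    for kp_pred, kp_ref, jac, weight in zip(kps_pred, kps_ref, jacobians, masks):
        wx, wy = local_affine(z, kp_pred, kp_ref, jac)
        x += weight * wx
        y += weight * wy
    return (x, y)
```

With an identity Jacobian the transform reduces to a pure translation by the keypoint displacement; a non-identity Jacobian additionally rotates, scales or shears the neighborhood of the keypoint.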
Figure 4 shows the structure of the motion field generator. The multimodal inputs (the reference image, the audio features and the head pose sequence) are first converted to a unified structure. Since the reference image is expected to provide the identity constraint for the generated keypoints, we downsample it and repeat it along the temporal dimension to form an image tensor. Instead of directly taking the head pose vectors as input, we draw a 3D box in the camera coordinate system to represent the head pose and project it to the image. In this way, we render a binary image for each pose frame, and stack them to get the pose tensor. As to the audio, we encode each acoustic feature frame into a feature map using an encoder composed of convolution and upsampling operations. The feature maps are also stacked as a tensor. Finally, we construct the network input by concatenating the three tensors along the channel dimension. In our experiments, the spatial height and width of these tensors are set to 64, and the temporal length is also set to 64. Afterwards, we employ the 3D Hourglass Network (Hourglass-3D) to deal with the temporal dependence and ensure the smoothness of the motion field between consecutive frames. The Hourglass-3D takes the concatenated tensor as input and outputs a group of continuous latent features, which are then decoded into keypoint positions and Jacobians by two separate decoders.
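The pose rendering step can be sketched as follows: a 3D box is rotated and translated by the 6D pose, projected with a pinhole camera, and its edges are rasterized into a binary image. This is an illustration under assumptions that do not come from the text: the Euler angle order, the focal length, the box size and the principal point at the image center are all hypothetical choices.

```python
import math


def rotation_matrix(pitch, yaw, roll):
    """Rotation R = Rz(roll) * Ry(yaw) * Rx(pitch); the angle order is an assumption."""
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


# Corner i encodes the signs of (x, y, z) in its three bits; two corners
# share an edge when their indices differ in exactly one bit.
BOX_EDGES = [(i, j) for i in range(8) for j in range(i + 1, 8)
             if bin(i ^ j).count("1") == 1]


def project_pose_box(pose, size=64, focal=64.0, half=0.1):
    """Project the 8 box corners for pose = (pitch, yaw, roll, tx, ty, tz)."""
    rot = rotation_matrix(pose[0], pose[1], pose[2])
    tx, ty, tz = pose[3], pose[4], pose[5]
    corners = [(a * half, b * half, c * half)
               for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)]
    points = []
    for corner in corners:
        x = sum(rot[0][k] * corner[k] for k in range(3)) + tx
        y = sum(rot[1][k] * corner[k] for k in range(3)) + ty
        z = sum(rot[2][k] * corner[k] for k in range(3)) + tz
        if z <= 0:
            raise ValueError("box must lie in front of the camera")
        points.append((focal * x / z + size / 2, focal * y / z + size / 2))
    return points


def render_pose_image(pose, size=64, focal=64.0, half=0.1):
    """Rasterize the projected box edges into a size x size binary image."""
    points = project_pose_box(pose, size=size, focal=focal, half=half)
    image = [[0] * size for _ in range(size)]
    for i, j in BOX_EDGES:
        (x0, y0), (x1, y1) = points[i], points[j]
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for s in range(steps + 1):
            t = s / steps
            u = int(round(x0 + t * (x1 - x0)))
            v = int(round(y0 + t * (y1 - y0)))
            if 0 <= u < size and 0 <= v < size:
                image[v][u] = 1
    return image
```

Stacking one rendered image per pose frame, for example `[render_pose_image(p) for p in poses]`, gives a pose tensor with one binary image per time step, which shares its spatial layout with the repeated reference image and the audio feature maps.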
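The training process of the motion field generator goes through two stages. In the first stage, we use the pretrained keypoint detector as guidance. The loss function is a weighted sum of L1 terms on the keypoint positions, the Jacobians and the keypoint heatmaps.
The target positions and Jacobians are extracted by the pretrained keypoint detector from the training video. The heatmaps are produced by the motion field generator and the keypoint detector, and the keypoint positions are estimated from these heatmaps as in prior work. The heatmap term helps convergence in the beginning. The weights of the position term and the Jacobian term are both set to 10. The weight of the heatmap term is set to 1 and decays to zero after a certain time.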
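Two pieces of this stage can be sketched in plain Python: estimating a keypoint position from a heatmap, and combining the three L1 terms with the weights stated above. The soft-argmax below is a common way to read a position from a heatmap and is assumed here; the linear decay schedule and the `decay_steps` parameter are hypothetical, since the text only says the heatmap weight decays to zero after a certain time.

```python
import math


def soft_argmax(heatmap_logits):
    """Expected (x, y) position in [-1, 1] under a softmax over the heatmap.

    heatmap_logits is a list of rows with at least 2 rows and 2 columns.
    """
    height = len(heatmap_logits)
    width = len(heatmap_logits[0])
    peak = max(max(row) for row in heatmap_logits)
    weights = [[math.exp(v - peak) for v in row] for row in heatmap_logits]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            x += w * (2 * c / (width - 1) - 1)
            y += w * (2 * r / (height - 1) - 1)
    return (x / total, y / total)


def l1(a, b):
    """Mean absolute difference between two flat lists of equal length."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def stage_one_loss(pos, pos_gt, jac, jac_gt, heat, heat_gt, step, decay_steps=1000):
    """Weighted sum of the position, Jacobian and heatmap L1 terms.

    The weights 10, 10 and 1 come from the text; the linear decay of the
    heatmap weight over decay_steps is an assumption.
    """
    heat_weight = max(0.0, 1.0 - step / decay_steps)
    return 10 * l1(pos, pos_gt) + 10 * l1(jac, jac_gt) + heat_weight * l1(heat, heat_gt)
```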
In the second stage, we import the pretrained image generator to help fine-tune the motion field generator. Given the predicted keypoints, the image generator renders the reconstructed frame, on which a reconstruction loss is computed.
The reconstruction loss is computed on the per-channel features of a specific pretrained VGG-16 layer. We apply it on an image pyramid of multiple resolutions and sum the results. For more stable performance, we also adopt two equivariance constraint losses from prior work, to which we refer the reader for more details. The total loss function is a weighted sum of the loss terms.
One of the loss weights is set to 100, and the other three are all set to 10.
4 Experiment Setup
We use the prevalent benchmark datasets VoxCeleb, GRID and LRW to evaluate the proposed method. VoxCeleb consists of speech clips collected from YouTube. GRID contains video clips of 33 speakers recorded in an experimental condition. LRW contains 500 different words spoken by hundreds of people in the wild. Specifically, for VoxCeleb and GRID, we re-crop and resize the original videos following prior work, obtaining 54354 and 25788 short video clips respectively. For LRW, we use its original format by aligning the face in the middle of each frame. We split each dataset into training and testing sets following the setting of previous works. All videos are sampled at 25 fps.
4.2 Implementation Details
All our networks are implemented using PyTorch. We adopt the Adam optimizer during training, with an initial learning rate of 2e-4 and a weight decay of 2e-6. The head motion predictor is trained on VoxCeleb for one day on one RTX 2080 Ti with batch size 64. When training the keypoint detector, the image generator and the motion field generator, we separate LRW and the other two datasets into two groups due to the different cropping strategies. For each group, we first follow the training details of prior work to train the keypoint detector and the image generator, and then train the motion field generator with the keypoint detector and the image generator frozen. The training of the keypoint detector and the image generator takes 3 days with batch size 28, and that of the motion field generator takes one week with batch size 4 on 4 RTX 2080 Ti GPUs. Specifically, for LRW, the window length is set to 32 because of the short video clips.
5 Experimental Results
5.1 Evaluation of Visual Quality
We show the comparisons with the state-of-the-art methods in Figure 5, including Vougioukas et al. vougioukas2019realistic, Chen et al. chen2019hierarchical, Prajwal et al. prajwal2020lip and Zhou et al. zhou2020makelttalk. The samples are generated with the same reference image and audio. Instead of only synthesizing a fixed head or cropped face, our method creates more realistic videos with head motions and full background. Compared to Zhou et al. zhou2020makelttalk, our results produce more plausible head movements and a more stable background with fewer artifacts. Besides, our method holds the identity of the speaker well even after a large pose change, owing to the motion field representation. The last row of Figure 5 shows the video-driven result of Siarohin et al. siarohin2019first. Although our results are generated from audio instead of video, we show comparable results in visual quality. We present more results in Figure 6 for unseen identities, including non-realistic paintings and human-like statues. The results show the excellent generalization ability of our method.
The quality of generated videos is evaluated using the common reconstruction metrics SSIM and PSNR, together with the Fréchet Inception Distance (FID). We compare our approach with recent state-of-the-art methods including Zhou et al. zhou2020makelttalk and Prajwal et al. prajwal2020lip, which also synthesize images with background rather than the cropped face only. The quantitative results are shown in Table 1. Compared with Zhou et al. zhou2020makelttalk and Prajwal et al. prajwal2020lip, we achieve the highest PSNR/SSIM and the lowest FID score on GRID and VoxCeleb. Since videos of LRW are very short (about 1.16 s), the head pose and background barely move in most of them. Prajwal et al. prajwal2020lip only edit the mouth region of the reference image, resulting in the best PSNR and FID on LRW.
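As a reminder of what the PSNR metric measures, the sketch below computes it from the mean squared error between two grayscale images given as lists of rows. It is a generic definition for illustration, not the evaluation script behind Table 1; the peak value of 255 assumes 8-bit images.

```python
import math


def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_a = [v for row in img_a for v in row]
    flat_b = [v for row in img_b for v in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means a smaller pixel-wise error, so a method that leaves most of the reference image untouched can score well on short clips in which the ground truth barely moves.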
5.2 Evaluation of Lip-sync
We employ SyncNet to evaluate the audio-visual synchronization of the proposed method. The metric values (confidence/offset) of each method are listed on the right side of Figure 5. We obtain competitive lip-sync accuracy compared with the state-of-the-art methods, even though we address a more challenging task.
5.3 Evaluation of Head Motion
Figure 6 shows that our head motion predictor creates natural and rhythmic head motions depending on both the audio and the identity. We further compare the head motion predictor with Zhou et al. zhou2020makelttalk (MakeItTalk) using the same audio and two reference images. The head movements of MakeItTalk are detected from the generated videos. For better visualization, we reduce the six-dimensional head motions to one dimension by PCA, and show the sequential results in Figure 7. MakeItTalk hardly changes the head orientation and contains many repetitive behaviors, as shown in the red boxes. Its head motion patterns tend to swing slightly around the initial pose. In contrast, our head motions preserve the rhythm and synchronization with audio, and are much closer to the ground truth, as shown in the blue boxes. Furthermore, we produce corresponding motion sequences with different input identities.
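The dimensionality reduction used for this visualization can be sketched without external libraries. The function below finds the first principal component of a sequence of 6D pose vectors by power iteration on the covariance matrix and projects each frame onto it. It is an illustrative implementation with hypothetical names; the iteration count and the starting vector are arbitrary choices.

```python
import math


def pca_first_component(rows, iters=200):
    """Return (direction, scores) for the first principal component.

    rows is a list of equally long vectors, e.g. one 6D head pose per frame.
    scores holds the 1D projection of each centered row.
    """
    n = len(rows)
    d = len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Slightly asymmetric start to avoid being orthogonal to the main axis.
    vec = [1.0 + 0.1 * k for k in range(d)]
    for _ in range(iters):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break  # no variance in the data
        vec = [v / norm for v in nxt]
    scores = [sum(c[k] * vec[k] for k in range(d)) for c in centered]
    return vec, scores
```

Plotting `scores` against the frame index gives a one-dimensional motion curve per video, which is the kind of curve compared across methods in Figure 7.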
5.4 Ablation Study
We perform a quantitative ablation study on VoxCeleb to illustrate the contribution of each component; the results are shown in Table 2. We construct three variants: GT-keypoint (using GT keypoints), GT-head (using GT head movements) and our full model. Here, we also evaluate the L1 distance between predicted and GT head movement vectors, marked as HE, and the L1 distance between generated and GT keypoints, marked as KE. Although there exist numerical differences, the generated videos are still natural-looking.
We evaluate the effectiveness of the Jacobians by removing them from the model. The generated results with and without Jacobians are shown in Figure 8. Without Jacobians, the lips seem to have only open-and-close patterns. This illustrates that the local affine transformations benefit the lip shape details.
To show the effectiveness of the SSIM loss and SET in the head motion predictor, we construct two variants by replacing the SSIM loss with the L1 loss (with L1), or removing SET from our model (w/o SET). The results are compared with our full method (full) in Figure 9. The model trained with the L1 loss creates unnatural head motion sequences, because it suffers from the one-to-many mapping ambiguity. Results of the model without SET contain fewer dynamics and are not well synchronized with audio, especially in the red and blue boxes.
We then compare the results with and without the second training stage described in Sec. 3.3. We visualize the difference between two consecutive frames in Figure 10. Results without the refinement of the second stage contain texture inconsistency and slight jitters. The pixel-level constraint is helpful for improving the temporal coherence and the fidelity of videos.
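The consecutive-frame difference used for this visualization can be sketched as a per-pixel absolute difference. The helper names are hypothetical and frames are assumed to be grayscale images given as lists of rows.

```python
def frame_difference(frame_a, frame_b):
    """Per-pixel absolute difference between two equally sized frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def mean_consecutive_difference(frames):
    """Average absolute change between each pair of consecutive frames."""
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, cur in zip(frames, frames[1:]):
        for row in frame_difference(prev, cur):
            total += sum(row)
            count += len(row)
    return total / count
```

Large differences in regions that should be static, such as the background, point to the jitter and texture inconsistency that the second training stage is reported to reduce.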
5.5 User Study
We further conduct an online user study to compare the overall quality of our method with that of state-of-the-art methods. We create 4 videos for each method with the same input. A total of 33 participants are asked to rate “does the video look natural?” for each video from 1 to 5. The statistical results are shown in Table 3. Our method significantly outperforms all compared methods in the share of cases that are judged as natural. This indicates that in addition to lip-sync, people are also quite sensitive to both a frozen head pose and background artifacts. Besides, videos of only cropped faces [5, 23] are rated lower than the others.
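The summary statistics of such a rating study can be sketched from the per-score percentages of the compared methods listed in Table 3. The sketch assumes that a video counts as "judged natural" when it receives a score of 4 or 5; this reading is an assumption, although it agrees with the last column of the table up to rounding.

```python
# Percentages of ratings 1..5 for the compared methods, copied from Table 3.
RATINGS = {
    "Chen et al.": [8, 31, 39, 17, 5],
    "Vougioukas et al.": [41, 30, 14, 12, 2],
    "Prajwal et al.": [5, 20, 40, 30, 6],
    "Zhou et al.": [8, 35, 30, 21, 7],
}


def natural_share(distribution):
    """Share of ratings that are 4 or 5 (assumed meaning of 'judged natural')."""
    return distribution[3] + distribution[4]


def mean_score(distribution):
    """Average rating, normalizing by the total in case of rounding drift."""
    return sum((i + 1) * p for i, p in enumerate(distribution)) / sum(distribution)
```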
6 Conclusion and Discussion
In this paper, we propose a novel framework for one-shot talking-head generation from audio, which creates high-fidelity videos with natural-looking and rhythmic head motions. We decouple the head motions from the full-frame audio-dependent motions and predict the head motions individually in accordance with audio dynamics. Then, the motion field generator produces the keypoints that control the dense motion field from audio and head poses. Finally, an image rendering network synthesizes the videos using the dense motion field. Our method is evaluated qualitatively and quantitatively. The evaluation results show that our method predicts natural head motions, and produces few artifacts in non-face regions and between consecutive frames even when the head goes through a large pose change. Our method is shown to have a higher visual quality than the state of the art.
| Method | 1 | 2 | 3 | 4 | 5 | Judged natural |
|---|---|---|---|---|---|---|
| Chen et al. chen2019hierarchical | 8% | 31% | 39% | 17% | 5% | 22.0% |
| Vougioukas et al. vougioukas2019realistic | 41% | 30% | 14% | 12% | 2% | 14.4% |
| Prajwal et al. prajwal2020lip | 5% | 20% | 40% | 30% | 6% | 35.6% |
| Zhou et al. zhou2020makelttalk | 8% | 35% | 30% | 21% | 7% | 28.0% |
Although our method outperforms previous works, our lip-sync accuracy drops on the bilabial and labiodental phonemes such as p, f and m. Compared to methods that focus on lip-sync, our framework trades a slight drop in lip-sync accuracy for much better head motion and visual quality. Such a trade-off is shown to be favored by most participants in our user study. In future work, we will focus on increasing the lip-sync accuracy without decreasing the current visual quality. Besides, our method cannot capture the blink pattern, and fails on input reference images with extreme poses or expressions, which also needs to be addressed in the future.
With the convenience of creating photo-realistic videos for arbitrary identity and audio, our method has widespread positive applications, such as video conferencing and movie dubbing. On the other hand, it may be misused for malicious purposes. To ensure proper use, we will release our code and models to promote progress in detecting fake videos. Besides, we strongly require that any result created using our code and models be marked as synthetic.
- [1] (2018) OpenFace 2.0: facial behavior analysis toolkit. In IEEE International Conference on FG, pp. 59–66. Cited by: §3.2.
- [2] (2018) Recycle-GAN: unsupervised video retargeting. In ECCV, pp. 119–135. Cited by: §2.2.
- [3] (2020) Neural head reenactment with latent pose descriptors. In CVPR, pp. 13786–13795. Cited by: §2.2.
- [4] (2020) Talking-head generation with rhythmic head motion. In ECCV, pp. 35–51. Cited by: §1, §1.
- [5] (2019) Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. pp. 7832–7841. Cited by: §1, §5.5.
- [6] (2016) Lip reading in the wild. In ACCV, pp. 87–103. Cited by: §4.1, §5.2.
- [7] An audio-visual corpus for speech perception and automatic speech recognition. JASA 120 (5), pp. 2421–2424. Cited by: §4.1.
- [8] (2013) Modeling multimodal behaviors from speech prosody. In International Conference on Intelligent Virtual Agents, pp. 217–228. Cited by: §1, §2.3.
- [9] (2019) Text-based editing of talking-head video. ACM TOG 38 (4), pp. 1–14. Cited by: §2.1.
- [10] Predicting head pose from speech with a conditional variational autoencoder. Proc. Interspeech 2017, pp. 3991–3995. Cited by: §2.3.
- [11] (2020) MarioNETte: few-shot face reenactment preserving identity of unseen targets. In AAAI, Vol. 34, pp. 10893–10900. Cited by: §2.2.
- [12] (2018) Deep video portraits. ACM TOG 37 (4), pp. 1–14. Cited by: §2.2.
- [13] Write-a-speaker: text-based emotional and rhythmic talking-head generation. Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §2.1.
- [14] (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv. Cited by: §4.1.
- [15] (2019) FSGAN: subject agnostic face swapping and reenactment. In ICCV, pp. 7184–7193. Cited by: §2.2.
- [16] (2020) A lip sync expert is all you need for speech to lip generation in the wild. In ACM MM, pp. 484–492. Cited by: §1.
- [17] Novel realizations of speech-driven head movements with generative adversarial networks. In ICASSP, pp. 6169–6173. Cited by: §2.3.
- [18] (2019) Animating arbitrary objects via deep motion transfer. In CVPR, pp. 2377–2386. Cited by: §3.3.
- [19] (2019) First order motion model for image animation. In NIPS, pp. 7137–7147. Cited by: §1, §3.1, §3.3, §3.3, §4.1, §4.2.
- [20] (2019) Talking face generation by conditional recurrent adversarial network. In IJCAI, pp. 919–925. Cited by: §1.
- [21] (2017) Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–13. Cited by: §2.1.
- [22] (2020) Neural voice puppetry: audio-driven facial reenactment. In European Conference on Computer Vision, pp. 716–731. Cited by: §2.1.
- [23] (2019) Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, pp. 1–16. Cited by: §1, §5.5.
- [24] (2019) Few-shot video-to-video synthesis. Advances in Neural Information Processing Systems 32, pp. 5013–5024. Cited by: §2.2.
- [25] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §3.2.
- [26] Face super-resolution guided by facial component heatmaps. In ECCV, pp. 217–233. Cited by: §1.
- [27] (2019) Semantic face hallucination: super-resolving very low-resolution face images with supplementary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (11), pp. 2926–2943. Cited by: §1.
- [28] (2018) Imagining the unimaginable faces by deconvolutional networks. TIP 27 (6), pp. 2747–2761. Cited by: §1.
- [29] (2019) Can we see more? Joint frontalization and hallucination of unaligned tiny faces. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (9), pp. 2148–2164. Cited by: §1.
- [30] (2020) FReeNet: multi-identity face reenactment. In CVPR, pp. 5325–5334. Cited by: §2.2.
- [31] (2019) Talking face generation by adversarially disentangled audio-visual representation. In AAAI, Vol. 33, pp. 9299–9306. Cited by: §1.
- [32] (2020) MakeItTalk: speaker-aware talking-head animation. ACM TOG 39 (6), pp. 1–15. Cited by: §1.
- [33] (2020) Arbitrary talking face generation via attentional audio-visual coherence learning. In IJCAI, pp. 2362–2368. Cited by: §1.