Lip-sync studies [kumar2017obamanet, wang2011text] have predominantly focused on synthesizing accurate, realistic lip movements on a talking head. This type of technique is broadly applicable to many useful scenarios such as conversational agents, virtual anchors, gaming, and movie industry. Lip sync has been widely explored in computer graphics literature for years [wang2011text, fan2015photo, suwajanakorn2017synthesizing]
, and in recent years, few computer vision-based methods[kumar2017obamanet, cudeiro2019capture] have emerged as well. Most studies in this topic focus on synthesizing the mouth region from given audio signals and then merge it into a talking face template. So their results usually have a appealing appearance because they only “rewrite" mouths in the original face. However, such a strategy also leads to some visible mismatches between synthesized mouths and original faces. Traditional graphic-based methods overcome this issue by a series of cumbersome operations such as candidate frame selection, jaw correction, etc. The quality of their results usually depends on the number of candidates [suwajanakorn2017synthesizing]. For neural network-based methods, we have rarely seen an effective solution in limited cases.
There is another type of research [karras2017audio, tian2019audio2face], speech-driven facial animation, that can achieve similar results. The difference is that facial animation studies tend to directly generate full-face features without the use of the target face. Such methods can usually avoid the mismatching problem, but, are more inherently ambiguous because of the inconsistent relationship between audio and facial parts. A recent study [vougioukas2019end] attempts to solve this problem using a GAN-based end-to-end model. It is fully trainable and achieves promising results. However, even for GAN-based architectures, it is hard to directly synthesize high-resolution, flawless, and consistent video frames without enough auxiliary information.
Our goal is to synthesize a high-resolution, photorealistic, and human-like talking head from audio signals. We focus on capturing the specific mouth movements of the target person. We also value the processing speed so that it can be applied to time-mattered real applications. Considering the strengths and defects of existing solutions, we decide to implement the lip-sync strategy using an advanced neural network-based framework to achieve both quality and speed. In this paper, we introduce a novel lip-sync solution for producing photorealistic talking characters from speech signals, as shown in Figure 1. The major contributions can be summarized as follows:
We innovatively model the seq-to-seq mapping from audio signals to mouth features by a pair of adversarial Temporal Convolutional Networks (TCN). Experiments show our model outperforms traditional RNN-based baselines in both accuracy and speed.
We propose an effective rendering method of producing photorealistic face appearance from synthetic facial maps. This image-to-image translation-based method not only avoids the cumbersome operations in traditional graphics-based methods but also solves the existing issues in recent neural network-based solutions.
We conduct a comprehensive experiment to evaluate each stage of our lip-sync method. We quantitatively compare the audio-to-mouth model with baselines and qualitatively examine the contribution of the rendering framework. Our experiments provide valuable experience for related studies.
2 Related Work
2.1 Lip Sync
Lip-sync studies [bregler1997video, busso2007rigid] focus on generating realistic human-speaking videos with accurate lip movements, based on the given speech content and a target video clip. This topic has been widely explored for decades in computer graphics literature. The main solution is to learn a mapping from audio features to visual features and then render them into a photorealistic texture [suwajanakorn2017synthesizing, taylor2017deep]. Earlier methods [wang2011text, xie2007coupled]
use probabilistic graphical models, represented by Hidden Markov Models, to capture the correspondences between mouth movement and audio. As for rendering, the graphics-based way is to select the best matching frames from a set of candidates[wang2011text, suwajanakorn2017synthesizing]
. Recently, deep learning practitioners have successfully applied neural models in both audio-to-mouth and rendering stages. A litany of studies[fan2015photo, tian2019audio2face]kumar2017obamanet] shows the power of image-to-image translation models in synthesizing the photorealistic facial appearance. Compared to graphics-based rendering methods, deep learning technologies show the advantage in speed but still have visible flaws in current solutions.
2.2 Facial Animation
Audio-driven facial animation [cudeiro2019capture, chung2017you, karras2017audio] investigates the audio-visual relationship at the full face level. Not limited to the mouth movement, they also consider the relationship between audio signals to facial expression, head pose, etc. Therefore, most facial animation studies do not need a face template to fill with generated elements. Such a difference contributes to more overall consistency in the synthesized result, but the complicated correspondence between audio to facial organs brings more challenges as well. Facial animation techniques are mostly used in the gaming and animation industry [tian2019audio2face]. Recently, the fascinating performance of Generative Adversarial Network (GAN) inspires a series of approaches [vougioukas2019end, zakharov2019few] to directly synthesize photoreal talking head from audio. However, their outputs are still in low resolution and exist visible artifacts.
2.3 Temporal Convolutional Networks
Recurrent neural networks (RNN) are once the common choice in modeling sequence problems. However, some major issues still limit RNN-based models, including vanishing gradients and memory-bandwidth limitation [martens2011learning]. Although some successful derivatives [gers1999learning] relief the problems, they still hardly achieve satisfactory performance on long sequences. Recently, hierarchical models show more power in learning sequential correspondence [culurciello2018fall]. A representative work is Temporal Convolutional Network (TCN) [bai2018empirical]
which distills best prior practices in convolutional network design into a convenient but powerful convolutional sequential model, such as dilated convolutions, causal convolutions, and residual connections. A comprehensive experiment[bai2018empirical] demonstrates that TCN outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory.
3 Proposed Method
We describe our lip-sync framework from three main components: a feature processing module for extracting audio and video features, a TCN-based adversarial model learning the seq-to-seq mapping from audio features to visual features, and a generative rendering module based on image-to-image translation and fine-labeled face maps.
3.1 Feature Processing
3.1.1 Video Feature
We use dlib facial landmark detector to extract 68 facial keypoints from videos. We improve the scheme from [kumar2017obamanet]
to obtain 20 mouth features by 1) removing in-plane rotation by calculating mouth tilt angles. 2) calculating the relative coordinates of all keypoints based on the center of the nose. 3) eliminate the impact of face size by dividing vectors by the L2-norm of all 68 keypoints. 4) applying PCA to de-correlate the original 20 keypoints into 10-D feature vectors. It is worth noting that the first 10 principal components cover nearly 99% variability in original mouth keypoints. The sampling rate of video features is 25 fps.
3.1.2 Audio Feature
Following previous studies [tian2019audio2face, cudeiro2019capture], we process the audio into a high-level handcrafted acoustic feature, Mel-frequency cepstral coefficients (MFCC). As introduced in [vasquez2019melnet], such time-frequency representations are better aligned with human auditory perception. We compare two types of audio features (MFCC and FBank) in later experiments. The sampling rate of 13-D MFCC features is 100 Hz.
3.2 Audio to Mouth
We first model the seq-to-seq mapping from time-series audio features to sequential mouth features using a pair of Temporal Convolutional Networks (TCN). Bai etal [bai2018empirical] has proved that TCN architectures outperform generic RNN-based models on a variety of sequence modeling tasks. TCN has the strength of the large sequential perceptive field, stable gradients, and low memory requirements. To bring such strengths into the lip-sync scenario, we tailor a TCN-based adversarial model in the following content, as shown in Figure 2.
3.2.1 Dilated Non-causal Convolutions
Dilated convolutions [oord2016wavenet] enable TCN architectures to have an exponentially large reception field on the long sequence. In the common TCN setting, causal constraint means convolution kernels cover only the past inputs. However, many studies [suwajanakorn2017synthesizing, fan2015photo] record that lip movements depend not only on the past but also on future speech signals. So the models such as time-delayed LSTM [suwajanakorn2017synthesizing] and bidirectional LSTM [tian2019audio2face] are the preferred choices. In our TCN setting, we employ the dilated non-causal convolutions to consider both future and past signals while keeping a large receptive field.
3.2.2 Adversarial TCN
We build a pair of TCN-based models to form the adversarial framework. One is a TCN-based generator that learns the mapping from audio features to mouth keypoints. Due to the audio and video features have different fps, the mapping is between two sequences of different lengths. Different from Kumar [kumar2017obamanet] that generates mouth features following the audio rate and then downsampling, we wrap the TCN block with 1-D convolutions to downsample within the model. Our TCN-based generator accepts a 200 length audio sequence and outputs a 50 length sequence of mouth features (25 fps).
We also employ a TCN-based discriminator to support the training of the mapping network. Similar to the generator, the discriminator has the same dilated non-causal structure, but differently, it takes the combination of audio and mouth sequences as input and outputs a real or fake label. Both the generator and discriminator has the TCN block with the kernel size 3 and the dilation factor [1,2,4,8], as shown in Figure 2.
3.2.3 Loss Function
Our loss function consists of anregression loss, a pairwise inter-frame loss, and an adversarial loss. We define that the model G as the mapping of audio-to-mouth pairs, which can be denoted as , where and . Therefore, we use to represent the predicted mouth features.
The regression loss measures the mapping accuracy on individual features:
The pairwise inter-frame loss computes the distance between the differences of consecutive frames between predicted feature sequences and target sequences, which is used to increase the temporal stability [cudeiro2019capture]:
Moreover, the adversarial loss is used with the TCN-based discriminator to capture high-level discrepancy between generated and target features, which can be defined as:
Our final objective () is defined as :
3.2.4 Overlapping between Consecutive Sequences
TCN uses paddings to keep the same length between input and output sequences. However, padding signals will impact the predictions in both the head and tail of the output sequence. To avoid this, we create overlaps on both input and output sequences. That means we only use the optimal segment to build the final output. The overlapping range is related to the TCN’s receptive field, and we consider it as a hyperparameter and determined by later experiments.
3.3 Face Rendering
We combine the learning capabilities from two types of models to produce high-quality lip-sync results. One is the aforementioned TCN-based model that produces accurate mouth shapes. For the other one, we employ the image-to-image translation technique [isola2016image] to “translate" mouth shapes into high-resolution photoreal facial appearance.
Inspired by Kumar [kumar2017obamanet], we propose a generative face rendering approach based on the hierarchical image-to-image translation model [wang2018high]. Our fully-trainable solution avoids some cumbersome steps, such as candidate frame selection, in traditional graphics-based methods [wang2010synthesizing, suwajanakorn2017synthesizing]. We also solve some existing issues in the recent neural network-based study [kumar2017obamanet] and will discuss in later experiments.
3.3.1 Jaw Correction
Jaw correction is necessary for past graphic-based lip-sync studies [mcallister1997lip, suwajanakorn2017synthesizing]. For instance, Suwajanakorn [suwajanakorn2017synthesizing] prepare the seamless jawline using optical flow information. In our image-to-image translation setting, the jaw is also important because the encoder-decoder will integrate jaw and mouth into high-level features and then interpret them globally. Figure 3 illustrate the rendering results from the same mouth shape with different jawlines. This result confirms that simply merging the generated mouth into a talking face template will induce mismatches. Considering the nearly rigid relationship between the jawline and mouth shape [jiang2002relationship]
, we use multi-linear regression to model the deviation of the jawline affected by the mouth shape (w,h), as shown in Figure 3. We use a non-talking target video and dynamically tune the jawline with the generated mouth shapes.
3.3.2 Synthetic Facial Maps
The synthetic facial map consists of the generated mouth, tuned jawline, and other original facial elements in the target frame. The pix2pixHD [wang2018high] takes it as input and then generates the photoreal result. When making the training image pairs for pix2pixHD, We combine the Canny edge, facial label, and the dlib facial landmark into one facial map. The edges are used to provide more auxiliary information for pix2pixHD, and more importantly, improve the details and inter-frame consistency for some non-generated parts such as hair, earrings, and clothes. This is a simple and effective way compared with previous solutions [wang2018video, chan2019everybody] that apply the optical flow or a specific temporal-consistent loss to improve consistency.
We provide an empirical evaluation of our method and recent lip-sync studies. We first examine the contribution of our TCN-based audio-to-mouth model and then compare our face rendering method with relevant solutions.
4.1 Dataset and Environment
For training, we captured 3 hours of video from a female actor and a male actor respectively. Each actor was required to read shot scripts (5-15 min) from a teleprompter in the green screen environment. We also asked actors to read at different speeds and avoid head rotation. The resolution of recorded videos is 1920 1080 and the head area is 512
512. We shoot the videos under the stable lighting condition. We extract video features and audio features from the videos at the 25 fps and 100 fps respectively. All models are implemented in Keras 2.2.4 and Tensorflow 1.14.0. We adapted the TCN codes from the work[Philippe2018]. We processed the raw videos on Inter Xeon silver 4110 and trained neural models on 4 NVIDIA Titan V.
|Model||Female actor||Male actor|
|Adversarial TCN (our)||0.00141||0.0261||0.00132||7||0.00413||0.0478||0.00414||8|
4.2 Audio-to-Mouth Evaluation
In the audio-to-shape stage, we quantitively measure the performance of our model and baselines using the Mean Squared Error (MSE) and Mean Absolute Error (MAE). To measure the framewise velocity of generated mouth sequences, we also evaluate the Inter-Frame MSE as:
To avoid interference caused by other steps, these three metrics are measured on the PCA features predicted by seq-to-seq models. Moreover, we provide the Perceptual Score (PS), a subjective score ranged from 1 to 10. Randomly selected volunteers are asked to compare the generated mouth skeletons with ground truth and give higher scores for better similarity.
We compare the proposed model with two RNN-based baselines and two TCN variants. The baselines are the representative deep learning implementations from recent lip-sync studies. The models being tested are as follows:
Time-delayed LSTM is a typical RNN-based model used to learn the audio-to-mouth mapping. Suwajanakorn [suwajanakorn2017synthesizing] claim that the time delay mechanism is sufficient for learning future information. We adapt code of the model from [kumar2017obamanet].
Bidirectional LSTM is popular in speech recognition and facial animation studies [graves2005framewise, tian2019audio2face]. The bidirectional architecture computes both forward state sequence and backward state sequence.
Causal TCN is the common TCN architecture in which the output depends only on to and no any future inputs involved. This baseline is used to verify the importance of future features in our audio-to-mouth task.
Non-Causal TCN covers both past and future audio signals with non-causal convolutions. It is the main component of our audio-to-mouth model. Here we train the above two TCN models only with + loss.
4.2.3 Model Performance
The results of the model performance comparison are shown in Table 1. We observe that TCN-based models significantly outperform RNN-based baselines on both two datasets. The performances of time-delayed LSTM and bidirectional LSTM are almost at the same level. The bidirectional structure only brings a slight improvement. Compared to causal TCN, non-causal architecture effectively reduces the MSE, MAE, and int-MSE. This result confirms the previous statement that both past and future information is important to audio-to-mouth mapping. We also notice that the adversarial TCN has taken another step based on non-causal TCN. Augmenting the TCN architecture with adversarial loss successfully reduces not only two regression metrics but also the inter-frame MSE, which means the TCN discriminator successfully captures the high-level discrepancy and facilitate the training process. The perceptual scores (PS) almost show the same result.
We also find the results are greatly affected by different data sets. In our case, videos captured from the male actor have a higher MSE than the female actor, same as MAE and Int-MSE. Interestingly, our volunteers voted more for the male. After carefully examining the training videos, we believe the difference comes from the different personal speaking styles. The male actor has less lip movement when speaking, which might confuse the mapping model but please our observers. On the contrary, the obvious lip movements of the female actor reduce ambiguity but add more difficulty to synthesize realistic results.
Moreover, we compare the training time and processing time of the models including LSTM, Bidirectional LSTM, and Non-causal TCN. We train the models using the same batch size and record the per batch training time. We also process the one-minute audio using these models and record the processing time. As shown in Table 2, TCN outperforms RNN-based models in both training and processing time, especially in processing. This result shows that the TCN architecture significantly reduces the processing time in our task, which may benefit many real-time applications.
|Models||Training time (s)||Processing time (s)|
|LSTM||0.069 0.005||2.272 0.269|
|Bi-LSTM||0.124 0.007||3.376 0.201|
|TCN||0.068 0.005||0.011 0.005|
4.2.4 Ablation Study
Next, we delve into the TCN architecture using the female dataset. In this part, we utilize Non-causal TCN with + loss to avoid the instability and more training time caused by adversarial loss. Kernel size and dilation factor are the key hyper-parameters of TCN. They determine the size of the receptive field which significantly affects the mapping accuracy in lip-sync studies. Here we first compare MSE between different combinations of kernel size and dilation.
Table 3 shows MSE for three dilation factors with varying kernel sizes. The number below represents how many steps before and after the current node are in the receptive field. We observe that the models with covering steps between 15 to 31 show better performance, and more or less coverage will increase MSE. These steps of audio features nearly cover 150-310 ms audio signals. This result confirms a previous experiment [suwajanakorn2017synthesizing] which recorded that using 200ms future audio is the best in a time-delayed LSTM. Considering the model complexity and training time, we opt for the simple model with the dilation of [1,2,4,8] and the kernel size of 3.
|[1,2,4]||0.00192 (7)||0.00157 (11)||0.00153 (15)|
|[1,2,4,8]||0.00155 (15)||0.00154 (23)||0.00156 (31)|
|[1,2,4,8,16]||0.00166 (31)||0.00163 (47)||0.00164 (63)|
Based on the fixed kernel size and dilation, we continue to investigate the contribution of different numbers of TCN filters. As shown in Table 4, we observe that more filters bring a lower MSE. We finally use 256 filters in our TCN block to balance the mapping accuracy and model complexity. As for audio features, we find MFCC and FBank features have almost the same performance. We opt for MFCC features which have fewer default dimensions (13-D) than FBank (26-D).
4.2.5 Overlapping Range
To support our overlapping strategy, we investigate the change of mapping accuracy within the 50 length output sequence. We convert predicted PCA features to normalized coordinates and then calculate MSE based on each node in the sequence.
As shown in Figure 4, the errors of the first four frames and the last eight frames are significantly larger than others, which indicates that these frames are influenced by the padding signals. Figure 4 also indicates that the mouth movements are more dependent on future audio signals. This finding contradicts some previous studies [kumar2017obamanet, suwajanakorn2017synthesizing] that use shorter future audio signals than the past in LSTMs. But we have not verified it due to the symmetrical structure of non-causal TCN. According to this result, we overlap the output sequence by 10 frames on both head and tail, and the corresponding input sequence is overlapped by 40 frames.
4.3 Shape-to-Face Evaluation
We fairly compare our rendering module with a relevant approach [kumar2017obamanet] which also employ the image-to-image translation models at the shape-to-mouth stage. We perceptually evaluate the results from both methods and further discuss their strengths and defects.
Figure 5 shows some defects from baseline method. We experiment on both original Obama dataset and our self-made dataset. The left Obama photos are produced as [kumar2017obamanet] which uses pix2pix to generate frames from the mouth edges. We observe an obtrusive patch area in their mouth region. For the right results, we replace the pix2pix with advanced pix2pixHD model but still keep the mouth-mask photos as input. We find the patch issue has been alleviated but defects still exist, such as multi-layered teeth and noise pixels. We review source data and conclude that the patch problem is caused by the inconsistent lighting condition in Obama videos, therefore, the result using our well-prepared dataset reduces is not affected. But it is difficult to locate other problems caused by the mouth-mask method, and even pix2pixHD can not make the remedy.
Figure 6 compares our synthesized 512 512 results to the original video frames which correspond to the input audio. We also provide intermediate facial maps as a reference. Our results show good visual compatibility and embouchure consistency. We observe that the final results accurately capture the mouth movements in the original video footage while representing nature realistic facial expressions. However, for some “big" embouchures, the synthesized mouths show less sensitivity. The lower teeth in generated frames seem to be blurrier than upper teeth (frame 2), and the gap between upper and lower teeth has not been fully recovered (frame 6).
Figure 7 illustrates the contribution of our synthetic facial maps. The edges provide more supplementary information for generating high-quality and consistent details, such as hair silk, eardrops, and clothes. Meanwhile, the face labels bring more accurate and sharper facial texture compared to the edge-based baseline method in Figure 5.
Our method obtains high-quality results by combining the learning ability from both the audio-to-mouth model and the face rendering model. We could observe that the mouth shapes in Figure 6 only provide the basic outlines and there is no clear distinction between each other. But when going through the image-to-image translation model, the mouth shapes turn into highly distinguishable mouth movements. It should be noticed that we also applaud the performance of traditional graphics-based rendering methods. They usually keep more original details and better handle the matching problems such as jaw correlation. Moreover, we appreciate some successful attempts that use GAN-based models to build an end-to-end solution. However, we believe that the resolution and consistency of generated videos are still the main obstacles for the methods that aim to entirely synthesize facial details from audio signals.
This paper describes a novel lip-sync approach for synthesizing high-resolution and photoreal talking head from speech signals. We show that the adversarial temporal convolutional networks are an effective solution for modeling the target seq-to-seq mapping. We address the face rendering problem by proposing an image-to-image translation-based method that solves many issues in existing studies. We believe our work could inspire more valuable applications and benefit the communities.