Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
To the best of our knowledge, we first present a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper body motions, where the former is generated by an autoregressive probabilistic model which models the head pose distribution of the target person. Upper body motions are deduced from head poses. In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.READ FULL TEXT VIEW PDF
We propose an audio-driven talking-head method to generate photo-realist...
In this paper, we propose a talking face generation method that takes an...
Lip sync has emerged as a promising technique to generate mouth movement...
In this paper, we propose a novel text-based talking-head video generati...
With the popularization of game and VR/AR devices, there is a growing ne...
Generating high-fidelity talking head video by fitting with the input au...
Non verbal behaviours such as gestures, facial expressions, body posture...
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
Talking-head animation, i.e., synthesizing audio-synchronized video frames of a target person, is valuable to interactive applications like digital avatars, video conferencing, visual effects, virtual reality, visual dubbing and computer games. With recent advances in deep learning, people have made great progress in this long-standing problem. However, achieving a realistic and expressive talking-head animation remains an open challenge. Humans are extremely sensitive to any facial artifacts, leading to high requirements for the desired techniques.
Several factors contribute to the challenge. Firstly, attempts to generate lip-synchronized and personalized facial dynamics face a two-fold difficulty, due in part to the challenge of mapping from 1-D audio signals to facial movements which lie on high-dimensional manifolds, but also due to the domain difference between wild audio and target speech space, which makes the system fail to preserve individual talking idiosyncrasies. Secondly, head and body motion, another critical component of lifelike animation, is not closely correlated to audio. For example, one can swing his head or be still when he says the same words, which depends on many factors - his mood, location, or history poses. Thirdly, synthesizing controllable photorealistic renderings of the target is non-trivial. Nowadays, traditional rendering engines are still far from desired, whose results can be recognized as fake at a glance. Neural renderer shows great power on photorealistic rendering but suffers from performance degradation if the predicted motion is far outside the span of the training corpus (Kim et al., 2018). Last but not least, many interactive scenarios like video conferencing and digital avatars require the entire system to run in real-time, which makes high demands of the system efficiency without damaging the performance.
In this paper, we propose a deep learning architecture, called Live Speech Portraits (LSP), to address these challenges and step further to practical applications. Our system generates personalized talking-head animation stream, including facial expressions and motion dynamics (head pose and upper body motion) driven by audio and allows for photorealistic rendering in real-time.
First of all, we adopt the idea of self-supervised representation learning, which has shown great power in learning semantic or structural representations and benefits various downstream tasks (He et al., 2020; Chen et al., 2020b; Oord et al., 2018), to extract speaker-independent audio features. To achieve realistic and personalized animation on wild audio streams, we further project the wild features to the target feature space and reconstruct them using target features. This process can be seen as domain adaption from source to target. Subsequently, we are able to learn the mapping from reconstructed audio features to facial dynamics.
Another critical component that contributes to realistic talking-head animation is head and body motion. To generate personalized and time-coherent head poses from audio, we make assumptions that the current head pose is partly correlated to audio information and also partly
to history poses. We propose a novel autoregressive probabilistic model to learn the head pose distribution of the target person based on these two conditions. Head poses are sampled from the estimated distribution, and upper body motions are futher deduced from the sampled head poses.
To synthesize photorealistic renderings, we employ an image-to-image translation network conditioned on a feature map and candidate images. We apply the sampled rigid head poses on facial dynamics and project the transformed facial keypoints and upper body positions to the image plane, generating landmark images as our intermediate representations. Although our system comprises several modules, it is still compact enough to run in real-time at over 30 fps. In summary, we present the following contributions:
To the best of our knowledge, we propose Live Speech Portraits (LSP) as the first audio-driven talking-head animation system with photorealistic renderings in real-time. A comprehensive evaluation demonstrates that our approach outperforms prior methods both qualitatively and quantitatively.
A novel audio feature extraction module that generalizes our system to wild audio signals. The key component of this module is a manifold projection that reconstructs the deep speech representations using target speech features.
An elaborately designed probabilistic autoregressive architecture that predicts personalized head pose distributions conditioned on audio signals and history motions. Our system also allows for user-controllable head poses generation.
Mathematically, audio-driven facial animation aims to generate a sequence of talking-head frames from an input audio stream. In the following, we generally review prior work on audio-driven facial animation, as well as related techniques on speech representation learning, head pose estimation and facial reenactment.
Audio-driven Talking-head Animation. Audio-driven talking-head animation is a cross-modal research topic with a long history in the computer graphics community. Prior approaches take two different roads depending on whether they aim for generating photorealistic videos. In the non-photorealistic case, these methods focus on learning a mapping from input waveforms to facial movements, e.g., 3D vertex coordinates (Karras et al., 2017; Cudeiro et al., 2019), reference facial model parameters (Taylor et al., 2017) or rigging parameters (Zhou et al., 2018). These methods usually require high-quality 4D face capture data or rigging parameters with artist interventions. Here, we focus on the photorealistic case which our method belongs to. More than twenty years ago, people have made groundbreaking explorations in this field. Bregler et al. (1997) proposed Video Rewrite to create a new person-talking video using existing footage. Brand (1999) proposed Voice Puppetry to generate full facial animation from an audio track. These techniques can roughly be categorized into video-based editing methods and image-based generation methods. Video-based editing methods yield editing on a target video - usually synthesize a mouth-related region patch and blend it into the target frame while keeping other regions unaltered (Ezzat et al., 2002; Garrido et al., 2015; Thies et al., 2020). Recently, Thies et al. (2020) proposed Neural Voice Puppetry as an upgrade to Voice Puppetry
. They first learned a generalized 3D face model from audio sequences and then fine-tuned the model on the target clip via learning a person-specific blendshape basis, in which case the talking style of the target portrait can be preserved. Lower faces were synthesized via a neural rendering network finally. However, these approaches have several intrinsic limitations. First, animation length is restricted to target video length. A heuristic post-processing to select proper candidate frames is required for generating longer videos. Second, head poses and upper body motions are uncontrollable since the are directly copied from target videos, which may conflict with the audio track and introduce barriers to real-time applications. Notably, Suwajanakorn et al.(2017) employed a re-timing schedule to select target frames with natural and synchronized head motions. Last but not least, these methods rely on successful face tracking and tend to fail when faces are partly unseen or undetected, e.g., lower faces are obscured by hands or in a very dark environment. Skipping these bad frames leads to a temporally inconsistent result. In contrast, our method synthesizes portraits directly. Obstructed frames could be dropped before training without affecting performance.
Image-based generation methods generate a talking-head video based on one or several cropped reference images. This kind of method avoids the pre-mentioned drawbacks but makes the task more challenging for the requirements of manipulating the entire image, including facial details, motion dynamics, and the background. End-to-end training (Chung et al., 2017; Wiles et al., 2018; Zhou et al., 2021) is becoming a strong trend to generate videos with the rising of deep learning. Chuang et al. (2017) generated a talking face video from a still image and an audio sequence by utilizing a CNN model for the first time. Later, GANs are frequently adopted to generate high-fidelity facial images via adversarial learning (Vougioukas et al., 2018, 2019; Zhou et al., 2019). Instead of directly synthesizing a talking face image, Chen et al. (2019) and Zhou et al. (2020) leveraged sparse facial landmarks as an intermediate representation. The landmark dynamics were first deduced from the audio input through an audio-to-landmark module and then worked as a condition of an image-to-image translation network to generate animated videos. One common problem shared by these methods is that they tend to learn average facial dynamics among the training corpus without person-specific talking styles. Note that Zhou et al. (2020) learned speaker-aware
dynamics from speaker embedding vectors, but still fail to learntarget-aware dynamics which may generate uncanny results. Our method focuses on capturing the person-specific talking dynamics using only a short target video (around 3 minutes). We utilize facial landmarks as an intermediate representation and generate controllable head poses and upper body motions, which makes the animated videos more impressive and realistic.
Speech Representation Learning. Speech signals contain rich high-level information, including content, timbre, and prosody. Much prior work require accurate phoneme labels within millisecond timestamps as input. These labels are often assembled into a sequence of diphones or triphones to encode neighborhood information (Fan et al., 2015). However, converting waveforms to phonemes leads to information compression, along with potential performance reduction introduced by error-prone automatic phoneme labeling tools. People also discovered different schemes to get rid of the dependence of phonemes using hand-crafted features (Suwajanakorn et al., 2017). Recently, modeling these semantic and structural representations through deep neural networks has shown great success and outperforms the traditional hand-crafted features (Devlin et al., 2018; Peters et al., 2018). Thies et al. (2020) employed a DeepSpeech (Hannun et al., 2014) network to extract speech features. Zhou et al. (2020) resorted to voice conversion community (Qian et al., 2019)
to disentangle speech content and identity information. Similarly, Our system uses a self-supervised learning method(Chung and Glass, 2020) to extract the high-level speech information. Moreover, manifold projection is applied to improve generalization.
Head Pose Estimation from Audio. Head pose, as a significant component of realistic animation, delivers rich information in talking-head videos. Greenwood et al. (2018) employed a Bi-directional LSTM model to predict character head animation from audio. Zhou et al. (2020) predicted speaker-aware head motion dynamics as 3D facial landmark displacements. They trained a transformer architecture (Vaswani et al., 2017) in an adversarial mechanism to capture long time dependencies and generate natural head dynamics. Recently, Chen et al. (2020a) proposed a 3D-aware generative network to learn target-aware head motion from a 3-second video clip. Different from most previous work which use deterministic models, we use an auto-regressive probabilistic model conditioned on history head poses and speech representations to predict the distribution parameters at the current timestamp. Head poses are sampled from the predicted probabilistic model. Besides, we further deduce upper body motions from the predicted head pose, which shows a great improvement on the animation quality.
Video-based Facial Reenactment. Video-based facial reenactment is another technique related to audio-driven animation. Thies et al. (2015) proposed the first real-time model-based reenactment system using two RGBD cameras. Face2Face (Thies et al., 2016) extended the boundary using only RGB cameras. Moreover, Liu et al. (2015) combined both audio and visual information as input and tackles the problem that tracking results prone to fail when face is occluded or head pose is extreme. Fried et al. (2019) proposed a method for text-based talking-head editing while is slow for viseme search (5 mins for three words). Yao et al. (2021) reduced the video generation time to 40 seconds for one video. Recently, GANs achieved great success in controllable high-fidelity face synthesis (Karras et al., 2019; Wang et al., 2018b, a). Few shot or even one-shot facial animation methods are explored via landmarks pre-defined or learned in an unsupervised scheme (Zakharov et al., 2019; Siarohin et al., 2019; Sun et al., 2020). Most methods rely on an image-to-image mechanism with semantic images as input. Kim et al. (2018) generated portrait videos including head, mouth and gaze from an input reference video. Kim et al. (2019) trained a recurrent GAN to synthesize style-preserving visual dubbing. Very recently, Elgharib at al. (2020) transferred the egocentric view videos to front facing videos using a position conditional GAN. Different from previous methods, our approach generates photorealistic talking head animation from speech only and runs in real-time.
Overview. Given an arbitrary speech stream, our live speech portraits approach generates photorealistic talking-head animation of the target person in real-time (Figure 2). Our approach consists of three stages: deep speech representation extraction, audio-to-face prediction, and photorealistic face rendering. The first stage extracts the speech representation of the input audio (Section 3.1). The representation extractor learns the high-level speech representation and is trained in a self-supervised manner on an unlabelled speech corpus. We then project the representations to the target person’s speech space to improve generalization. The second stage predicts the full motion dynamics. Two elaborate designed neural networks predict the mouth-related motion (Section 3.2) and head pose (Section 3.3
) from the speech representations, respectively. The mouth-related motions are represented as sparse 3D landmarks, and head poses are represented as rigid rotation and translation. Considering that head poses are less related to audio information than mouth-related motions, we employ a probabilistic autoregressive model to learn the poses conditioned on audio information and history poses. Other facial components which have nearly no correlation to audio (e.g., eyes, brows, noses, etc.) are sampled from the training set. We then compute the upper body motion from the predicted head pose. The final stage synthesizes the photorealistic video frames from the previous predictions and a candidate image set (Section3.4) using a conditional image-to-image translation network. In the following, we introduce each module in detail.
Input information, which are speech signals in our case, plays a crucial role because it powers the entire system. As illustrated in Section 2, people have exploited deep learning approaches, commonly trained in a self-supervised mechanism, to learn high-level speaker-independent representations of speech from surface features. These methods greatly improve state-of-the-art performance of downstream tasks, e.g., auto speech recognition, phone classification, and speaker verification (Oord et al., 2018; Chorowski et al., 2019; Liu et al., 2020).
Specifically, we use the autoregressive predictive coding (APC) model (Chung and Glass, 2020)
to extract structural speech representations. The APC model predicts future surface features given history information. In our case, we select 80-dimensional log Mel spectrograms as speech surface features. The model is a standard 3-layer unidirectional Gated Recurrent Units (GRUs):
where is the hidden states of each layer in GRUs. The hidden states in the final GRU layer are our desired deep speech representations. We add a linear layer to map the output to predict the future log Mel spectrograms during training, and the linear layer is dropped during testing time.
Different person owns diverse speaking styles, which are considered as personalized styles. For example, May clip exhibits large lip movements and always the ’O’ style, Ford clip exhibits small lip movements like whispers, and Nadella clip exhibits adhesion of upper & lower lips like lisps. Directly applying the deep speech representation may lead to poor results when input speech representations locate far away from the target’s speech feature space (e.g., animate a woman by men’s voice, foreign languages, or even songs). We perform manifold projection after extracting the speech representations to improve generalization.
The manifold projection operation is inspired by the recent success in face synthesis from sketches (Chen et al., 2020c), which can be generalized to sketches far away from human faces. We apply the locally linear embedding (LLE) assumption on the speech representation manifold: each data point and its neighbors are locally-linear on the high-dimensional manifold (Roweis and Saul, 2000).
Given an extracted speech representation , we compute the LLE reconstructed representation on each dimension. As illustrated in Figure 3, we first find the nearest points of h in a target speech representation database by computing the Euclidean distance. is the number of training frames. Then, we seek a linear combination of the nearest neighbours to best reconstruct h, which is equivalent to compute the barycentric coordinates of h based on its neighbors via solving the following minimization problem:
where is the barycentric weight of -nearest neighbor , which can be computed via solving a least-squares problem. is chosen as 10 empirically in our experiment. At last, we obtain the projected speech representation :
Subsequently, is sent to the motion predictors in Section 3.2 and Section 3.3 as input deep speech representations. Our experiment results show that the manifold projection helps improve our system generalization ability. Non-linear projection hasn’t been considered for the complexity.
Predicting mouth-related motion from audio has been widely researched in the past few years. People use deep learning architectures to learn a mapping from audio features to intermediate representations, e.g., lip-related landmarks (Greenwood et al., 2018; Zhou et al., 2020)
, parameters of a parametric model(Suwajanakorn et al., 2017; Taylor et al., 2017; Chen et al., 2019), 3D vertices (Karras et al., 2017; Cudeiro et al., 2019), or facial blendshapes (Thies et al., 2020; Zhou et al., 2018). In our case, we use 3D displacements with respect to mean positions of the target person in object coordinates as our intermediate representations.
To model sequence dependencies, we use Long Short Term Memory (LSTM) models to learn the mapping from speech representations to mouth-related motions. Similar to(Suwajanakorn et al., 2017), we add a frames delay to make the model accessible to a short future, leading to significant improvement on the quality. We later feed the output of the LSTM network to a Multi-Layer Perception (MLP) and finally predict the 3D displacements . In summary, our mouth-related prediction module works as follows:
where time delay is set to 18 frames, equal to 300 ms delay in our experiments (60 FPS). The LSTM is stacked into three layers, and each layer has a hidden state of size 256. The MLP decoder network has three layers with hidden state size of 256, 512, and 75.
Head pose and upper body motion are another two components that contribute to vivid talking-head animation. For example, people naturally swing their heads and move their bodies when talking, aiming to express emotion and deliver attitudes to the audience. We first describe the method to estimate head pose and then the upper body motion.
Head pose estimation from audio is non-trivial since little relationship exists between them. Considering the intrinsic difficulties, which is a one-to-many mapping from audio to head pose (one can say the same sentence in arbitrary poses), we make two assumptions as prior knowledge.
Assumption 1. Head poses are partly related to audio information, like expression and intonation. For example, people tend to nod heads when expressing agreement and look up when speaking in rising intonation and vice versa.
Assumption 2. Current head pose partly depends on history head poses. For example, there is a large probability that people will turn heads back if they have turned a large angle before.
Current head pose partly depends on history head poses. For example, there is a large probability that people will turn heads back if they have turned a large angle before.
These two assumptions simplify the problem and motivate our architecture design. To satisfy the requirements, the proposed network should have the ability to see the history head poses and current audio information as conditions. Besides, instead of considering it as a regression problem and train it using Euclidean distance loss (Zhou et al., 2020), we should model this mapping as a probabilistic distribution. Recently, probabilistic models are successfully used in motion synthesis (Henter et al., 2020) and outperform the deterministic models. The joint probability of the head motion can be described as follows:
where is the head motion and is the speech representation.
The probabilistic model we use is a multi-dimensional Gaussian distribution. The network architecture is inspired by recent success in conditional probabilistic generative modeling(Oord et al., 2016b, a). The detailed design of the probabilistic model is illustrated in Figure 4. The model is a stack of two residual blocks with seven layers each. Considering the long-time dependencies required to produce natural head motion (one swing his head from left to right may lasts for several seconds), these residual blocks use dilation convolutions layers to capture the dependencies instead of normal convolutions with much fewer parameters. The dilation is doubled seven times for each layer in our architecture and then repeated twice: 1, 2, 4, 8, 16, 32, 64, 1, 2, 4, 8, 16, 32, 64. As a result, the history receptive field size
of our model is 255 frames, equal to 4.25 seconds in our experiments. The output of each layer is summed up and processed by a post-processing network (a stack of 2 relu-conv layers) to generate the current distribution. In particular, the model outputs the mean values
and the standard deviationsof the estimated Gaussian. Then we sample in the distribution to get final rigid head pose , composed of 3D rotation and translation
. We also tried with a Gaussian Mixture Model but found no obvious improvement. After sampling, we encode the current pose as input pose information for the next time step, forming an autoregressive mechanism. In summary, the head pose estimation can be illustrated as follows:
Upper Body Motion. For upper body motion estimation, an ideal method is to build a body model and estimate the parameters (Mehta et al., 2020). To avoid making the algorithm too complex (upper body always lies only at the bottom part of the image), we assign the upper body part as a billboard (Cao et al., 2016) which is shaped by several shoulder landmarks manually defined. The initial depth of the billboard is set as the average depth of the landmarks in full training sequences and same for all. We translate the billboard model with the 50% translation part in the predicted head motion in most cases as the results.
The last stage of our approach is to generate the photorealistic facial renderings from previous predictions, as illustrated in Figure 2. Our rendering network is inspired by the recent advances in synthesizing photorealistic and controllable facial videos (Isola
et al., 2017; Thies
et al., 2019; Kim et al., 2018; Elgharib et al., 2020). We use a conditional image-to-image translation network as our backbone along with adversarial training. The network takes a channel-wise concatenation of a conditional feature map and candidate images of the target person to produce photorealistic renderings.
Conditional Feature Maps. To provide the clues of the face and the upper body, we draw a conditional feature map for each frame from the above predictions. An example of the conditional map is shown in Figure 5. The feature map consists of a facial part and an upper body part. Drawing semantic regions with colors, or even further one region, one channel brings richer information and more drawing time. We didn’t find obvious improvements on these two alternatives. Note that both the sparse facial landmarks and the upper body billboard we predict locate in object coordinates. Therefore, we need to project these 3D positions to the 2D image plane via pre-computed camera intrinsic parameters . The camera model we use is a pinhole camera model and , where is the focal length and is the principal point. The consecutive 2D projected components are line-connected in a pre-defined semantic sequence, resulting in the conditional feature map of size .
Candidate Image set. Besides the conditional feature map, we additionally input a candidate image set of the target person to provide detailed scene and texture clues. We found that adding such a candidate set helps the network generate consistent backgrounds considering the changing camera motions in the training set and relieves the pressure of the network to synthesize subtle details, like teeth and pores. These images are automatically selected. For the first two, we choose the 100th minimum/maximum mouth area. For the rest, we sample x- and y-axis rotation by uniform intervals and choose the nearest samples from intervals. Therefore, the size of the final concatenated input image becomes .
The network is a 8-layer UNet-like (Ronneberger et al., 2015; Esser et al., 2018; Han et al., 2019)convolutional neural network with skip connections in each resolution layer. The resolution of each layer are (256, 128, 64, 32, 16, 8, 4, 2
) and the corresponding numbers of feature channels are (64, 128, 256, 512, 512, 512, 512, 512). Each encoder layer consists of one convolution (stride 2) and one residual block. The symmetric decoder layer is almost the same, except the first convolution is replaced by a nearest up-sample operation with a scale factor of 2. Examples of our photorealistic renderings are shown in Figure6.
In this section, we describe the aspects relevant to the implementation of our approach: dataset acquisition and pre-processing (Section 4.1
), loss functions (Section4.2), training setup (Section 4.3), and real-time animation settings (Section 4.4).
We apply our approach to 8 different target sequences of 7 different subjects for training and testing. These sequences span a range of 3-5 minutes. All videos are extracted at 60 frames per second (FPS) and the synchronized audio waves are sampled at 16 Hz frequency. We first crop the video to keep the face at the center and then resize to . All input images and output images share the same resolution. We split videos as 80% / 20% for training and validation. Please refer to Appendix A for more details.
We detect the 73 pre-defined facial landmarks for all videos using an off-the-shelf tool. To provide the groundtruth of the 3D mouth shape and head pose, we employ an optimization-based 3D face tracking algorithm similar to (Shi et al., 2014; Thies et al., 2016). For camera calibration, we use a binary search to compute the focal length as demonstrated in (Cao et al., 2013). The principle point is set as the center of the image. Note that we do camera calibration and 3D face tracking on the original image and compute a transformation matrix according to the crop and resize parameters. The upper body motion feature points are manually selected once for first frame of each sequence and tracked for the rest frames using LK optical flow (Bouguet et al., 2001), and the OpenCV implementation (Bradski, 2000). For more details about monocular 3D face tracking, we recommend readers to refer to the summary paper (Zollhöfer et al., 2018).
To train the APC speech representation extractor, we use the Mandarin Chinese part of the Common Voice dataset (Ardila et al., 2020)
that provides unlabelled wild utterances. Specifically, The subset contains 889 different speakers with various accents. In total, there are about 26 hours of unlabelled utterances. We use 80-dimensional log Mel spectrograms as surface features. The log mel spectrograms are computed with 1/60 second frame length, 1/120 second frame-shift, and 512-point Short-Time Fourier Transform (STFT). Although our APC model is trained in Mandarin, we find that our system still works well in other languages because the model learns the high-level and semantic information. Also the manifold projection improves the generalization ability.
The training of the APC model is fully self-supervised via predicting the surface features frames ahead. Given a sequence of log mel spectrograms , the APC model processes each element at time step and outputs a prediction , generating a predicted sequence . We optimize the model by minimizing the L1 loss between the input sequence and the predicted as follows:
where following the setting in (Chung and Glass, 2020).
To learn the mapping from audio to mouth-related motion, we minimize the distance between the groundtruth mouth displacements and the predicted displacements. Specifically, the loss can be written as:
where represents the number of consecutive frames sent to the model at each iteration. is the number of the pre-defined sparse mouth-related 3D points in our experiments.
Apart from learning the mapping from audio to mouth-related motion, we also aim to estimate the target’s head pose during training. The upper body motion can be deduced from the head pose as mentioned in Section 3.3. Specifically, we employ an autoregressive probabilistic model to model the head pose distribution. We train the model by minimizing the negative log-likelihood of the pose distribution. Given a sequence of history head pose and speech representation , the probabilistic loss is:
where is the input head pose and speech representation at time . This loss term forces the model to output the mean values and standard deviations of the Gaussian distribution. To increase numerical stability, we output the negative log sigma instead of the sigma directly. Each element in pose sequence are composed of the current pose and a linear velocity term . Although we only use the first six dimensions of rotation and translation after sampling in the distribution, we find that adding such a velocity term could implicitly force the model to focus on the motion speed, leading to smoother results.
Finally, we train a neural renderer to synthesis photorealistic talking-head images. The training procedure follows the adversarial training mechanism (Goodfellow et al., 2014). We adopt the multi-scale PatchGAN architecture (Isola et al., 2017; Wang et al., 2018b) as the backbone of discriminator D. The image-to-image translation network G is trained to generate ”realistic” images to fool the discriminator D, while the discriminator D is trained to tell the generated images from groundtruth images. Specifically, we employ LSGAN loss (Mao et al., 2017) as the adversarial loss to optimize the discriminator D:
where is the discriminator classification output when input the groundtruth image and the generated rendering , respectively. We additionally use a color loss, a perceptual loss (Johnson et al., 2016) and a feature matching loss (Wang et al., 2018b):
where is the adversarial loss, is the color loss, is the perceptual loss and is the feature matching loss. The weights of each loss are set to in all our experiments empirically. The color loss is a per-pixel loss to minimize the difference from generated images and groundtruth images We tried higher weights (x10) on mouth, yep mouth-related errors drop, but full-image errors rise. Considering the full-image generation task, we choose equal weight.
For perceptual loss, we adopt a VGG19 network (Simonyan and Zisserman, 2014) to extract the perceptual features from and and minimize their distance:
where denotes the layers we use and denotes the -th layer. Finally, to improve the training speed and stability, we adopt a feature matching loss:
where is the number of spatial layers in discriminator D. The -based feature matching loss is designed to match the statistics of features extracted by the discriminator from and .
All our models are trained on PyTorch (Python)(Paszke et al., 2019) using Adam optimizer with hyper-parameters . The learning rate is set to and linearly decay to in all experiments. The APC model contains 4.064 parameters, the mouth-related position predictor contains 3.064 parameters, the head pose estimator contains 4.267 parameters and the renderer contains 76.204 parameters.
We train the first three models on a Nvidia 1080Ti GPU and takes about (11, 0.5, 5) hours in total (200, 200, 200) epochs with a batch size of 32, respectively. The photorealistic images renderer is trained on 4 Nvidia 1080Ti GPUs for an average of 22 hours in 60 epochs and a batch size of 8. During testing, we select all models with minimized validation loss.
We implement and test our real-time animation system using C++ on a desktop PC with an Intel Core i7-9700K CPU (32 GB RAM) and an NVIDIA GeForce RTX 2080 (8 GB RAM). The deep speech representation extraction module takes around 2.9 ms for inference (1.4 ms for APC model forward pass and 1.5 ms for manifold projection). Learning facial dynamics from audio representations via 3-layer LSTM and MLP networks takes around 2.5 ms. Besides, the Audio2Mouth module leads to around 300 ms latency for obtaining 18 frames of future audio information. We then use Tensorrt to accelerate the last two models. Specifically, the head poses estimation model takes 4.4 ms and the photorealistic renderer takes 20.1 ms after acceleration. Time of memory copying between CPU and GPU has already been included here. Therefore, the entire system takes about 27.4 ms for inference at over 30 FPS with 300 ms latency.
Discussion. Here we discuss the run time with related work Zhou et al. (2020) and Thies et al. (2020). We emphasize that our system is the first realization of an end-to-end live system for photorealistic audio-driven talking-head animation and step further to practical applications, considering that these papers haven’t shown an actual live demo. Besides, Zhou et al. (2020) is not designed for live streaming generation. The self-attention network (Vaswani et al., 2017) in speaker-aware animation works as a post weighted combination of previous landmarks predictions and is inappropriate for live applications like video conferencing, requiring a low latency. Thies et al. (2020) will meet more difficulties. Their method is restricted to the target video length, and therefore a sufficiently long target video without obstruction is required, which is hard to acquire. To generate longer frames, an additional heuristic schedule to select proper candidate frames is needed. Also, pose-audio inconsistency appears for the lacking control on head motion (Section 2). These factors result in obstacles in the live implementation.
Our live speech portraits method generates personalized and photorealistic talking-head animation from audio input in real-time. We recommend readers watch the supplementary video.
In the following, we present the results of our approach, evaluate the design of our method both qualitatively and quantitatively, compare to the state-of-the-art techniques, and show results of a user study. We further demonstrate the potentials on several applications, e.g., dubbing, video conferencing and virtual avatars.
Figure 6 shows a gallery of our results. Our approach allows animating target portraits driven by audio sequences while preserving the personal talking styles. Our method produces facial dynamics, natural head poses, and upper body motions and synthesizes temporally coherent renderings in high-fidelity, e.g., clear teeth textures. It also successfully works on subjects with long hair, glasses, and earrings. Figure 1 and the supplementary video show examples of our real-time animation system and results of different people driven by the same utterance. Results prove that we preserve their own talking styles in the training video.
Figure 7 demonstrates the pose controllability of our method. We synthesize results with different head poses, either generated by our system or sampled from the training set (Columns 1 and 3). We further evaluate our model with mirrors of the previous two poses and a self-defined pose (Columns 2, 4, and 5). Even though these challenging poses partly lay outside the training corpus and therefore may lead to artifacts, our system still generates correct lip motions and temporally smooth renderings.
We analyze the effectiveness of manifold projection in Figure 8. The operation is designed to project the original speech representations to the target speech space by minimizing the reconstruction loss. As shown in the figure, the reconstructed features (green) are much closer to the target speech space than the original wild features (blue). More quantitative comparison results of the manifold projection can be seen in our supplementary video. We found that using manifold projection generates more accurate lip synchronization than not using it, especially when encountering the audio of different genders or foreign languages.
Figure 9 depicts a qualitative evaluation of the generated head poses using t-SNE visualization. We select 8 different targets and denote them with different colors. In Figure 9 (left), we visualize the generated head poses of these targets from the same input audio. It can be seen that our predictions for one person lie in a nearby region and are far away from different targets. We further analyze the connections between generated poses and training poses in Figure 9 (right). Solid dot denotes training corpus poses and star
denote our predictions. We downsample the training poses using K-means to 30 to reduce data size. The visualization demonstrates that our predictions locate closely to training corpus poses.
|Val Loss (mm)||5.309||5.280||5.117||5.335|
|Val Loss (mm)||5.160||4.916||5.248||5.539|
We also qualitatively evaluate the design of the head pose estimation module from audio. Results are shown in Figure 10. We recommend readers to watch the supplementary video for better visual comparisons. To confirm our proposed two prior assumptions, here we perform an ablative study of the input, architecture, and loss design by training and testing four alternative variants: ”LSTM (L2)” (LSTM network trained using L2 loss), ”LSTM (Probabilistic)” (LSTM network trained using probabilistic loss), ”Ours (w/o Hist. Poses)” (Our architecture without history poses input), ”Ours (L2)” (Our architecture trained using L2 loss). A quantitative evaluation of head pose estimation can be found in Section 5.2.
The variants ”LSTM (L2)” and ”LSTM (Probabilistic)” generate more temporal jitters than other variants, which indicates that directly using LSTM architecture may not be a good choice in our task. A reasonable explanation is that RNN is easy to overfit on a small dataset, which is an around 3-minute length video in our case. On the other hand, LSTMs have a theoretically infinite receptive field of the history information depending on the forget and memory mechanism. Our architecture has a fixed receptive field which intuitively becomes more robust to long history information and will not easily overfit.
The variant ”Ours (w/o Hist. Poses)” also performs worse than our full model. It tends to generate unchanged poses in the Obama video and struggled poses in the woman video (see the supplementary video). Without modeling the history poses, the model is trained to learn a mapping from audio to the head poses solely. Therefore, the network generates pose with the most probability of the current audio clip, which may be far away from adjacent ones, leading to time-incoherent results.
The variant ”Ours (L2)” generates best results among all variants but there is still room for improvement in the term of time consistency and motion variation. The only difference is that this variant replaces the probabilistic loss with the L2 distance loss. It makes the problem a regression problem, which means the model needs to find the deterministic position given history poses and audio information. Therefore, the model struggles to find the best balance point between audio and history poses, resulting in time-inconsistency and minor motion variations. A probabilistic model is suitable to handle these ambiguities and finally achieves the best results both in time consistency and motion richness.
|Ours (w/o Hist. Poses)||3.9%||0.9%||4.2/8.9%|
|(Zhou et al., 2020)||4.6%||0.9%||6.1/10.1%|
Here, we perform a thorough quantitative evaluation of the proposed method. Table 1 evaluates the importance of time latency of input audio. Choosing a proper latency is important for a real-time system because it greatly affects the user experience. In this experiment, we train the audio-to-mouth network with different time delays on a 30-minutes subset of the Obama Weekly Address dataset (Suwajanakorn et al., 2017). The subset was split into a training dataset (80%) and a validation dataset (20%). Validation losses are computed as Euclidean distance between predicted mouth-related landmarks positions and the tracked groundtruth. As can be seen, the time delay of 300 ms (18 frames) gives minimized validation loss. Too short or too long delays both lead to performance reduction. When the network fails to have access to a certain range of future information (smaller than 300 ms), it cannot model the coarticulation. Too long delays (longer than 300 ms) always cover several phonemes and introduce redundant information. In most real-time scenarios, e.g., video conferencing, 300 ms is a tolerable latency. Therefore, we use this latency in all experiments.
Now we perform quantitative evaluation of our head pose estimation module. We compare our method with using four alternative variants pre-mentioned in Section 5.1 as well as Zhou et al. (2020). We created a test set composed of 6 different videos (average lasts for 45 seconds each clip) and the speeches and videos are unseen during training. Specifically, We evaluate these methods on the testing set by computing the metrics D-L, D-V and D-Rot/Pos defined in (Zhou et al., 2020). D-L and D-V denote the normalized Euclidean position and velocity difference between predictions and the groundtruth. D-Rot/Pos denotes the rotation angle differences and normalized translation distances. Lower is better for all these metrics.
Table 2 reports the evaluation results. Our model obviously outperforms other alternative variants, tallying with the subjective evaluation in Section 5.1. In particular, we observe that the LSTM-based variants produce higher error than other variants ((see ”LSTM (L2)” and ”LSTM (Probabilistic)”)). This results confirms that LSTM is prone to overfit in the training set. Replacing the probabilistic modeling with L2-regression generates slightly worse results (see ”Ours (L2)”). Our full method generates similar head motion with respect to the groundtruth, confirming that we learn the distribution of the target person. We also compare with Zhou et al. (2020) which learn speaker-aware head motions. Results demonstrate that our method outperforms their method.
|Obama1||w/o U.B.M.||6.131 1.841||25.364 2.259||0.850 0.043||0.704 0.407|
|w/o Cand.||6.683 2.507||24.964 2.483||0.845 0.048||0.746 0.422|
|w/ Both||5.713 1.671||26.006 2.275||0.862 0.038||0.698 0.348|
|Obama2||w/o U.B.M.||7.634 5.105||23.143 2.758||0.843 0.102||1.136 1.279|
|w/o Cand.||4.006 0.754||27.927 2.228||0.917 0.020||0.366 0.111|
|w/ Both||3.993 0.732||27.576 2.174||0.926 0.020||0.370 0.103|
|May||w/o U.B.M.||5.957 1.322||27.171 1.820||0.807 0.037||0.790 0.260|
|w/o Cand.||5.577 1.246||27.979 1.683||0.823 0.033||0.747 0.280|
|w/ Both||5.539 1.234||28.044 1.782||0.823 0.035||0.746 0.209|
|Obama1||Unet||6.540 1.743||25.087 2.046||0.832 0.035||0.875 0.395|
|Normal||5.782 1.687||25.920 2.229||0.861 0.038||0.727 0.346|
|Large||5.713 1.671||26.006 2.275||0.862 0.038||0.698 0.348|
|Obama2||Unet||4.155 0.794||27.414 2.182||0.907 0.022||0.479 0.124|
|Normal||4.010 0.763||27.406 2.211||0.912 0.021||0.376 0.106|
|Large||3.993 0.732||27.576 2.174||0.926 0.020||0.370 0.103|
|May||Unet||6.007 1.709||27.528 1.835||0.806 0.037||1.018 0.389|
|Normal||5.578 1.694||27.924 1.882||0.818 0.049||0.828 0.449|
|Large||5.539 1.234||28.044 1.782||0.823 0.035||0.746 0.209|
|Obama1||0.5 mins||9.403 3.068||22.481 2.166||0.788 0.042||1.421 0.533|
|1 min||8.259 2.552||23.939 2.427||0.812 0.046||1.233 0.523|
|3 mins||5.713 1.671||26.006 2.275||0.862 0.038||0.698 0.348|
|Obama2||0.5 mins||6.941 2.015||22.960 2.195||0.858 0.030||0.945 0.382|
|1 min||5.285 1.303||24.873 2.010||0.888 0.024||0.585 0.239|
|3 mins||3.993 0.732||27.576 2.174||0.926 0.020||0.370 0.103|
|May||0.5 mins||9.539 2.740||23.432 2.087||0.739 0.033||1.731 0.582|
|1 min||7.265 2.366||25.875 2.086||0.739 0.050||1.175 0.685|
|3 mins||5.539 1.234||28.044 1.782||0.823 0.035||0.746 0.209|
. We evaluate the model on three videos. The first three minutes of these videos are used as training data while the rest work as the testing set unless stated otherwise. Note that we test models with groundtruth head poses for numerical evaluation. We report the numerical errors and corresponding standard deviation of average L1-photometric loss (range of 0-255), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), and deep perceptual distance (LPIPS)(Zhang et al., 2018) in Table 3
. The standard deviation indicates the temporal stability around the mean loss - higher variance means stronger incoherence. The L1 loss and its standard deviation are shown directly on the top-left corner of error heat-maps. We refer readers to the supplementary video for better visualization. A large-version generator works as a baseline design to test with different ablative conditions.
Figure 11 and Table 3 (top) show the importance of our two conditional input. Removing upper body motions leads to strong shakiness especially around the neck and shoulders. It is hard for the network to synthesize time-consistent upper body motions without this strong location condition. Removing the candidate image set also leads to significant performance decrease on backgrounds and facial contours, especially when the training videos includes camera motions. For example, Obama video in Figure 11 consists of several different camera motions and confuses the model to learn the one-to-many mappings. The candidate image set works as a clue to tell the model what the scene is and guides the network synthesize the right and consistent background. Besides, it also relieve the network pressure to synthesize high-fidelity details since it has included details as input.
Figure 12 and Table 3 (middle) evaluate the generator architecture design. We compare our model with a baseline model Unet (Isola et al., 2017) and a larger model with 2 residual blocks each layer. We found significant performance degradation using Unet (see red boxes in Figure 12). They fail to synthesize clear teeth, ears and other facial details compared to other architectures. Increasing the res-block in each layer takes over 59% more parameters (121.790M vs 76.204M) but obtains no significant improvements on image quality. Considering the balance of performance and efficiency, we use normal architecture (1 Res-block) as our default architecture.
Finally we evaluate the importance of training dataset size. In this experiment, we train the model using 0.5/1/3 minutes frames (see Figure 13). Larger training set generates better results since it covers more pose and expression variations (see Table 3 (bottom)), and the best results are achieved using the full training set.
Now we compare our method with state-of-the-art audio-driven talking-head animation techniques. All the test input audio sequences are unseen during training. We strongly recommend readers watch the supplementary video for better comparisons.
Comparisons to image-based generation methods. We first compare our approach with image-based generation methods for synthesizing talking-head videos. Specially, we compare with Chen et al. (2019), Vougioukas et al. (2019) and Zhou et al. (2020). Figure 14 (upper the solid line) shows results driven by wild audios, and Figure 14 ( lower the solid line) shows the results driven by voices of the target person in images. These methods are trained to generalize to unseen faces and therefore lack personal talking styles (they tend to generate the same lip motions for everyone) and facial details. Chen et al. (2019) and Vougioukas et al. (2019) generate the talking videos in a cropped and normalized face region and therefore fail to handle head poses. Zhou et al. (2020) generate speaker-aware talking-head videos but not the target person style. They warp both the background and the talking-head, giving the impression that the foreground head carries the background and moves together (see the green boxes). Moreover, the mouth of the portrait tends to twist and the synthesized region is blurred. Compared against these methods, our method successfully captures the talking style of the target personal and synthesizes sharper images with higher fidelity, e.g., wrinkles and teeth (see the red boxes). We also note that the head pose generated by our method is different from the groundtruth but exhibits a similar distribution in the training set.
We further compare with Chen et al. (2020a) in the supplementary video, which generate talking faces with rhythmic head movements. We notice that Chen et al. (2020a) do not disentangle the head motion and the background, resulting in an associated movement of the both. Besides, our method keeps the facial details and generates facial images with higher quality.
Comparisons to video-based editing methods. We compare our approach against video-based editing methods. In particular, we compare against Suwajanakorn et al. (2017) and Thies et al. (2020). Please watch the supplementary video for the results. These two methods both synthesize a lower face patch and blend it into the target frame. They rely on additional facial tracking algorithm to provide accurate and stable rigid head motions and 3D mouth locations. Our method directly generates full head renderings as well as backgrounds. Suwajanakorn et al. (2017) synthesize high-quality talking videos of Obama, which is trained on a huge amount of his weekly address videos (17 hours). Neural Voice Puppetry (Thies et al., 2020) additionally trains person-specific blendshape basis using 2-3 minute videos aside from training a general model basis on around 3 hours videos. Our method generates visually comparable and controllable results while using only around 3 minutes videos for training. Considering the intrinsic features of video-based editing methods, which limits the application scenarios, our approach is more applicable to other target person. Last but not least, our system runs in real-time.
Comparisons to model-based methods. We also compare our approach with model-based methods (Karras et al., 2017; Taylor et al., 2017; Zhou et al., 2018; Cudeiro et al., 2019). These methods focus on learning a 3D face mesh or rigging parameters from audio. They usually require a 4D training corpus obtained by a high-cost vision-based capture system or rigging parameters with artist interventions, while our method uses sparse 3D landmarks as an intermediate representation and works on internet videos. Also, our method generates photorealistic results.
We finally conduct three user studies to quantitatively evaluate the quality of our method. We compare our results with state-of-the-art open-source methods(Vougioukas et al., 2019; Chen et al., 2019; Zhou et al., 2020). We prepare 20 audio clips for each method and generate 80 video clips in total. All audio clips are wild and unseen in the training set. These user studies are web-based, and 48 participants with computer science backgrounds finished our questionnaire. During the study, the web page shows one video at a time in randomized order, and the participant is asked to evaluate the video w.r.t three statements: ’The video looks realistic to me.’, ’The mouth motion is sync with the audio.’ and ’The head motion of the portrait is natural.’ on a scale between 1 to 5 (5-strongly agree, 4-agree, 3-neither agree nor disagree, 2-disagree, 1-strongly disagree). Figure 15 shows the average scores of different methods on three statements. The head pose evaluation is only conducted on Zhou et al. (2020) since the other two methods animate the cropped face and hardly have any pose motions.
As can be seen, our method achieves the highest scores over three tasks. For the first task, our method has the highest score of 4.8, which means our results are the most photorealistic. Meanwhile, we have the best results in terms of lip-synchronization. We believe that is because our manifold projection works which improves the generalization. Finally, we compare the head motion against Zhou et al. (2020). Our method generates more natural head motions because we model the target-aware head motion.
Our method synthesizes photorealistic talking-head animation from audio streams in real-time, thus having a wide range of applications, e.g., dubbing, video conferencing, and virtual avatars. We refer readers to our supplementary video. Figure 16 demonstrates the potential applications. On the top of the figure, we show the audio-driven dubbing results of the target person. Compared to video-based dubbing methods (Kim et al., 2019), our method avoids generating implausible facial dynamics of the target person because we model the personal characteristics.
Video conferencing is another application (see Figure 16 (b)). In scenarios that people cannot deliver visual signals, e.g., they are outdoor or have limited bandwidth, our method can generate high-fidelity video frames only driven by audio in real-time.
We finally demonstrate our potentials in virtual avatars such as virtual anchors, assistants. Our supplementary video shows a real-time demo of virtual avatars, e.g., the portrait of Theresa May is animated to sing a song driven by the actor’s voice. Figure 16 (c) shows results driven by the Text-to-Speech (TTS) system. The supplementary video also includes a comparison with Zhou et al. (2020) and Thies et al. (2020). Our method generates more realistic frames and more accurate lip synchronization.
We presented a deep learning approach for generating photorealistic talking-head animation of the target person in real-time. Our method can handle new audio clips not seen during training and still synthesize personalized video frames. The full system is only needed to be trained on a several-minute length video. Our pipeline contains three stages: deep audio features extraction, facial dynamics and motion generation, and photorealistic image synthesis. The first stage includes a manifold projection on deep audio features, which helps generalize to wild audio. In the second stage, facial dynamics, head poses, and upper body motions are generated. An autoregressive probabilistic head pose estimation network is trained to learn the target actor’s pose distribution. This network led to personalized head pose generation and avoided the potential performance degradation of the subsequent neural renderer. Finally, we generated intermediate feature maps from these predictions and sent them with a candidate image set to an image-to-image translation network to synthesize video frames. Thorough experiments and a user study show that our method outperforms the state-of-the-art techniques both qualitatively and quantitatively. Our method can be applied in many scenarios, especially the ones required to run in real-time, like dubbing, video conferencing, and virtual avatars. We hope this work could open a new avenue for future researches in this field.
Limitations and Future Work. While we have demonstrated impressive results of our method in a wide variety of scenarios, there are still several limitations to our approach. Our real-time system does not always capture well with the plosive and nasal consonants, e.g., /p/b/m/. The reasons behind are various. First, /p/b/m/ usually sounds in low volumes and may be ignored by the fore-end as environmental noise. Second, our live system runs at over 30 FPS, and it may miss these short sounds. It also fails to capture the speakings at a very fast speed, like a quarrel scenario. Our offine results (60FPS) are better, which partly verifies our supposition. Applying model pruning is a promising solution to decrease the parameters and increase the running speed. Besides, the spectrum construction we use tends to miss those short phonemes, which can be tackled by using pure deep features, e.g., wav2vec (Schneider et al., 2019). The face tracking algorithm we used is not state-of-the-art, we believe that better reconstruction leads to better lip-sync results.
Similar to most learning-based methods, the style of the generated videos is restricted to the training corpus. Our method preserves the talking styles in training sequences (3-5 minutes) via manifold projection, a domain transfer method to find the most similar samples. This mechanism alleviates this problem to some extent. We believe that a complete solution is to apply a perfect audio disentanglement algorithm like (Qian et al., 2020) to split each components, i.e., content, pitch, timbre and rhythm, and find the best mapping of these components.
Emotional audio may generate unsatisfied results when the model was trained on a neutral-style video. Our method cannot directly control the emotion of the generated videos. Recent work (Ji et al., 2021) shows promising emotion manipulation results when training on an emotional dataset. It would be interesting to apply such progress to our system.
Although we successfully handle the shadows and the lighting reflections when people swing their heads, we still can not explicitly control these parameters. Relighting techniques (Sun et al., 2019) can be applied directly to our rendering results to control the environment lighting. Gestures are another important component for people to deliver expression. We are looking forward to future work on gesture generation driven by audio.
With the rapid development of ’Deepfake’ techniques, the threshold for people to synthesize fake videos of arbitrary person is becoming lower. In most kinds, they facilitate the movie and entertainment industry and reduces the bandwidth of video streaming by sending the audio signals only. However, these techniques can be misused. Due to the fact that it’s more difficult for people to distinguish fake videos, the algorithm may be utilized to spread misinformation or obtain illegal profits. Our method achieves real-time photorealistic talking-head animation and only needs to be trained on a several-minute length video, which can be easily found on the Internet. For non-celebrities, their faces and voices are harder to recognize than celebrities, and therefore generating their fake videos is more deceptive. Potential solutions like digital face forensics methods (Rössler et al., 2018; Rossler et al., 2019) to detect deepfakes must be considered. We hope the public be aware of the potential risks of the misuse of new techniques.
We would like to thank Shuaizhen Jing for the help with the Tensorrt implementation. We are grateful to Qingqing Tian for the facial capture. Yuanxun Lu would also like to thank Xinya Ji for her mental support and proof-reading during the project. This work was supported by the NSFC grant 62025108, 61627804 and Leading Technology of Jiangsu Basic Research Plan (BK20192003).
European Conference on Computer Vision. Springer, 35–51.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7832–7841.
International conference on machine learning. PMLR, 1597–1607.
Unsupervised speech representation learning using wavenet autoencoders.IEEE/ACM transactions on audio, speech, and language processing 27, 12 (2019), 2041–2053.
Finet: Compatible and diverse fashion image inpainting. InProceedings of the IEEE/CVF International Conference on Computer Vision. 4481–4491.
Perceptual losses for real-time style transfer and super-resolution. InEuropean conference on computer vision. Springer, 694–711.
A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9299–9306.
In this appendix, we describe all the sequences we used in detail (Table 4).
|Actor A||3 min 45s|
|Ford||3 min 10s|