Face transfer is a method for mapping the facial performance of one individual onto the facial animation of another. It uses the facial expressions and head poses from a video of a source actor to generate a video of a target character. A variety of methods have been developed for face transfer and have achieved impressive results. Previous work typically models the faces of the source and the target, transfers the corresponding features from the source to the target, and finally re-renders the target face and blends it into the original target image [Vlasic et al.2005, Shi et al.2014, Thies et al.2016]. A data-driven approach was proposed by [Li et al.2012]. They retrieved frames from a database based on a similarity metric, using optical flow as a measure of appearance and velocity, and then searched for the k-nearest neighbors based on time stamps and flow distance.
Unlike the methods above, which divide the task into several steps and explicitly model facial attributes, we use a deep neural network [LeCun, Bengio, and Hinton2015] to develop an end-to-end approach. Our work takes a talking video of a source actor as input. For every frame in the video, a face image of the target character with the corresponding facial expression and head pose is generated. Combining the generated frames yields the corresponding video of the target character.
Face transfer is a special case of image-to-image translation [Isola et al.2016, Zhu et al.2017]. The characteristic of the input video is the appearance of the character in the video, while the identity of each frame is the character’s facial expression and head pose. Our method uses a Generative Adversarial Network (GAN) [Goodfellow et al.2014] to learn to transform the characteristic while preserving the identity. There are two key requirements for this task. First, a source face image should be mapped to a target face image with the corresponding facial expression and head pose. Second, the generated image should be of high quality, i.e., it should look natural and be indistinguishable from a real image to a human observer.
In this paper, to generate images with matching facial expression and head pose, we use CycleGAN [Zhu et al.2017] to carry the identity across the two image sets. CycleGAN was proposed to capture the special characteristics of one image collection and translate them into another image collection in the absence of any paired training samples. CycleGAN learns a one-to-one mapping, which ensures that each input face image can be mapped to a target image with the corresponding facial expression and head pose [Kim et al.2017a].
GAN was originally proposed to map its inputs to a real data distribution. A discriminator that models whole images requires the generated images to be close to real images as a whole, which may restrict the creativity of the generator. [Isola et al.2016] proposed PatchGAN as an effective architecture for image-to-image translation tasks. Instead of judging the whole image, a PatchGAN discriminator has a receptive field smaller than the image. It models images only at the patch level and explicitly requires every image patch to be real. The generated image, composed of realistic image patches, can then be more diverse, which enhances the generator’s creativity. We study the impact of different receptive field sizes on the generators and use a model with a pair of discriminators, one with a large and one with a small receptive field, to capture both global coherence and local patterns.
To sum up, our contributions are as follows.
To the best of our knowledge, our work is the first to apply a Generative Adversarial Network to end-to-end face transfer, which we formalize as an image-to-image translation problem.
We explore the impact of discriminators with different receptive field sizes on the quality of generated images. We propose an architecture of two discriminators with different receptive field sizes. This enables the generator to create images with a head pose that does not occur in the real image set.
The demo video is provided at goo.gl/RBbR9y.
[Vlasic et al.2005] performed face transfer based on a multi-linear model of 3D face meshes that separably parameterizes the space of geometric variations due to different facial attributes. The authors tracked a face template and re-rendered it under different expression parameters. [Dale et al.2011] tracked the facial performance in both videos. The authors warped the source to the target face and re-timed the source to match the target performance using the corresponding 3D geometry. [Garrido et al.2014] proposed a reenactment pipeline conceived as part image retrieval and part face transfer. [Li et al.2012] took advantage of an existing facial performance database of the target person. They used a query image to retrieve frames from a database based on similarity metrics. [Thies et al.2016] investigated face trackers and expression modeling to transfer facial expressions and achieved real-time face transfer.
Compared with the above works, our approach uses a generative adversarial network to achieve end-to-end face transfer between two given characters without any supervision. We directly generate the target frames from the input frames of the source character.
[Isola et al.2016] pointed out that many problems in image processing, computer graphics, and computer vision can be formulated as image-to-image translation tasks, for example labels to scene, aerial to map, day to night, edges to photo and grayscale to color. Some problems in face synthesis can also be regarded as image-to-image translation tasks, specifically changing attributes of face images such as gender, hair style [Kim et al.2017b], age, expression, beard and glasses [Shen and Liu2016]. In this paper, we also formulate face transfer as an image-to-image translation task.
Generative Adversarial Networks
Generative Adversarial Networks (GAN) [Goodfellow et al.2014] have attracted much attention in unsupervised learning over the past three years. Conditional GAN, a variant of GAN, is widely used in various computer vision scenarios. In some image-to-image translation tasks, the inputs are images rather than noise. [Zhu et al.2017, Kim et al.2017a, Yi et al.2017] investigated a similar cycle architecture and named it CycleGAN, DiscoGAN and DualGAN respectively. In this paper, we refer to this architecture as CycleGAN.
A traditional GAN has only one generator, mapping a source domain to a target domain, and one discriminator on the target domain. CycleGAN adds a second generator, mapping the target domain back to the source domain, and a second discriminator on the source domain. The two GANs form a cycle transformation, and a cycle consistency loss is introduced to push the cycle transformation toward the identity. Such a cycle architecture can be applied to unpaired data [Zhu et al.2017]. We leverage the cycle architecture to transfer the facial performance from the source character to the target character using two unpaired videos, one for each character.
[Isola et al.2016] proposed PatchGAN as an effective architecture for image-to-image translation tasks. It restricts the discriminator to image patches in order to model high-frequency structures, and the authors showed that PatchGAN is effective on a wide range of problems. A similar PatchGAN architecture was previously proposed in [Li and Wand2016] for the purpose of capturing local style statistics. Such a discriminator models the image as a Markov random field [Li and Wand2016]. PatchGAN is also used in further studies such as unpaired settings [Zhu et al.2017] and dual learning [Yi et al.2017]. In this paper, we explore the effect of PatchGAN discriminators with different receptive field sizes.
GAN with Multi-Discriminators
Despite their impressive results, GANs are notoriously difficult to train. It is hard to balance the generator and the discriminator, and training easily runs into mode collapse. [Durugkar, Gemp, and Mahadevan2016] extended GANs to multiple discriminators. For one generator, several discriminators of the same structure but with different random initializations are used as teachers for the generator. The authors suggested that GANs with multiple discriminators can better approximate the optimal discriminator and provide more stable and reliable feedback to the generator.
In this paper, we experiment with the framework of two discriminators against one generator. Note that our discriminators are PatchGAN discriminators. We explore the case of two discriminators with the same structure and the case of two discriminators with different receptive field sizes. For the latter case, one discriminator has a large receptive field and another has a small receptive field.
In this section, we discuss some preliminaries of our models, including GANs and the CycleGAN framework.
A Generative Adversarial Network is a generative model that consists of two neural networks. A generator G learns to map a random noise vector to the real data distribution. A discriminator D tries to distinguish real data from generated samples. They are trained iteratively to play a two-player min-max game.
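For reference, the two-player game can be written in the standard form given by [Goodfellow et al.2014]. The symbols below (data distribution p_data, noise prior p_z, generator G, discriminator D) follow that paper and are assumed here, not taken from this section.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```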
In image-to-image translation tasks, the generator takes images as input instead of noise and maps images from a source domain X to a target domain Y, G : X → Y. For the adversarial loss, we follow the choice of [Zhu et al.2017], which adopts a least-squares loss instead of the negative log-likelihood.
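A minimal sketch of a least-squares adversarial loss may make this choice concrete. It assumes the usual convention of a target score of 1 for real images and 0 for generated ones; the function names are hypothetical, and the discriminator scores are passed in as plain lists so the example runs without any framework.

```python
from typing import Sequence


def lsgan_discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Least-squares discriminator loss: real scores are pushed to 1, fake scores to 0."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return real_term + fake_term


def lsgan_generator_loss(d_fake: Sequence[float]) -> float:
    """Least-squares generator loss: scores of generated images are pushed to 1."""
    return sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)


if __name__ == "__main__":
    # An undecided discriminator that scores everything 0.5.
    print(lsgan_discriminator_loss([0.5, 0.5], [0.5, 0.5]))
    print(lsgan_generator_loss([0.5, 0.5]))
```

Compared with a log-likelihood loss, the squared error still produces a gradient for generated samples that the discriminator already rejects with high confidence.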
CycleGAN extends the traditional GAN with two generator-discriminator pairs. Following the notation of [Zhu et al.2017], a generator G maps X to Y with a discriminator D_Y on domain Y, and a generator F maps Y to X with a discriminator D_X on domain X. The two GANs are trained simultaneously. Each image in domain X transformed by G to domain Y is then transformed back to domain X by F. CycleGAN introduces a cycle consistency loss which enforces G and F to be consistent with each other. Such a cycle architecture can thus be applied to unpaired data.
The cycle consistency loss measures how well an image is recovered after passing through both generators, G and F, in turn. It consists of two parts, one for each direction of the cycle, and both parts are conditioned on both G and F. When the loss on a real image x is minimized, G is optimized to transform x into a generated sample G(x) that contains sufficient information to be transformed back to x, while F is optimized to map the fake sample G(x) back to the real image x.
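The two parts can be sketched as follows. This illustrates the original cycle consistency loss of [Zhu et al.2017] with an L1 reconstruction error, not the modified variant introduced in this paper; images are represented as flat lists of floats and the toy generators in the usage example are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence

Image = List[float]  # a flattened image, used here only for illustration
Generator = Callable[[Image], Image]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G: Generator, F: Generator,
                           xs: Sequence[Image], ys: Sequence[Image]) -> float:
    """Sum of the L1 reconstruction errors of both cycle directions.

    forward cycle:  x -> G(x) -> F(G(x)) should return to x
    backward cycle: y -> F(y) -> G(F(y)) should return to y
    """
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def double(img: Image) -> Image:
    """Toy generator X -> Y: doubles every pixel."""
    return [2.0 * p for p in img]


def halve(img: Image) -> Image:
    """Toy generator Y -> X: exact inverse of `double`."""
    return [0.5 * p for p in img]


if __name__ == "__main__":
    xs = [[0.0, 1.0], [2.0, 4.0]]
    ys = [[1.0, 3.0]]
    print(cycle_consistency_loss(double, halve, xs, ys))  # inverse pair: 0.0
```

When the two generators are exact inverses, both cycles reproduce their input and the loss is zero; any mismatch between them shows up as a positive reconstruction error.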
We empirically find that the original definition of the cycle consistency loss makes the generators prone to producing artifacts that are meaningless to a human observer. We therefore make a small modification: both generators take only real images as input and are not trained on fake images.
Such a modification helps reduce some artifacts.
The original CycleGAN fixes the weight of the adversarial losses to 1.0, and the two parts of the cycle consistency loss share one weight parameter. We find that datasets differ in their susceptibility to mode collapse, so the ratio between cycle loss and adversarial loss should differ from dataset to dataset. Therefore, in the full objective function we introduce separate weights for the different cycle passes and treat them as hyperparameters.
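One way such a weighted objective could be assembled is sketched below. The parameter names `lambda_adv`, `lambda_cyc_x` and `lambda_cyc_y` are hypothetical placeholders, and the exact form and values used in this paper are not assumed; the sketch only shows the idea of giving each cycle pass its own weight.

```python
def full_objective(adv_g: float, adv_f: float,
                   cyc_x: float, cyc_y: float,
                   lambda_adv: float,
                   lambda_cyc_x: float,
                   lambda_cyc_y: float) -> float:
    """Weighted sum of both adversarial losses and both cycle passes.

    adv_g, adv_f : adversarial losses of the two generators
    cyc_x, cyc_y : cycle consistency losses of the X -> Y -> X and Y -> X -> Y passes
    lambda_*     : hypothetical weights, tuned per dataset
    """
    adversarial = lambda_adv * (adv_g + adv_f)
    cycle = lambda_cyc_x * cyc_x + lambda_cyc_y * cyc_y
    return adversarial + cycle


if __name__ == "__main__":
    # Illustrative numbers only.
    print(full_objective(1.0, 2.0, 0.5, 0.25,
                         lambda_adv=1.0, lambda_cyc_x=10.0, lambda_cyc_y=4.0))
```

Setting the two cycle weights equal and `lambda_adv` to 1.0 recovers the single shared weight of the original CycleGAN objective.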
For discriminators, we employ Markovian PatchGAN discriminator [Isola et al.2016, Li and Wand2016], which models images only at patch level rather than the whole image. It assumes independence between pixels separated by more than a patch diameter. The discriminator is run convolutionally across the image and the losses of all image patches are averaged to provide the final loss of the discriminator. PatchGAN discriminator is effective in capturing local high-frequency features but less effective in modeling global structure.
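The averaging step can be sketched as follows, assuming the convolutional discriminator has already produced a two-dimensional map with one score per image patch. The least-squares error and the function name are assumptions made for illustration.

```python
from typing import Sequence


def patch_discriminator_loss(score_map: Sequence[Sequence[float]], target: float) -> float:
    """Average the squared error between every patch score and the target label.

    score_map : 2-D grid of per-patch scores produced by running the
                discriminator convolutionally across one image
    target    : 1.0 for a real image, 0.0 for a generated one
    """
    scores = [s for row in score_map for s in row]
    return sum((s - target) ** 2 for s in scores) / len(scores)


if __name__ == "__main__":
    # A 2x2 grid of patch scores for one real image.
    score_map = [[1.0, 0.5],
                 [0.0, 1.0]]
    print(patch_discriminator_loss(score_map, target=1.0))  # 0.3125
```

Because every patch contributes equally, a single unrealistic patch raises the loss no matter how plausible the image looks as a whole.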
The choice of receptive field size is a delicate problem. In some image-to-image translation tasks, such as edges to photo, labels to street scene and grayscale to color [Isola et al.2016], the generator changes the local style while preserving spatial information, and the shape of the objects in the picture usually remains unchanged. The choice of receptive field in these tasks can be more arbitrary than in our case. For example, [Isola et al.2016] experimented with several receptive field sizes and obtained visually similar results with ImageGAN and PatchGAN.
In our face transfer task, we find that the receptive field size strongly affects the quality of the generated faces. With a small receptive field, only individual image patches are required to be realistic, which results in more diverse generated images and enhances the generator’s creativity, especially on a dataset with limited samples. However, because we are modeling an image of an entire face, we also need a discriminator with a receptive field close to the image size. If the receptive field is too small, the generated faces show unreasonable deformation.
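To make the notion of receptive field size concrete, the sketch below computes the receptive field of a stack of convolutions from their kernel sizes and strides using the standard recurrence. The layer lists are illustrative assumptions based on the 4x4-kernel PatchGAN design commonly associated with [Isola et al.2016], not the architectures used in this paper.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field (in input pixels) of one output unit.

    layers : (kernel_size, stride) of each convolution, from input to output
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between neighbouring units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    # Three stride-2 convolutions followed by two stride-1 convolutions, all 4x4.
    print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # 70
    # Adding one more stride-2 convolution widens the view considerably.
    print(receptive_field([(4, 2)] * 4 + [(4, 1), (4, 1)]))  # 142
```

Each additional strided layer roughly doubles the area a single discriminator output inspects, which is why the depth of the discriminator is a direct handle on the trade-off between local and global modeling.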
A discriminator with limited capacity may fail to push the generator toward realistic images. However, modifying the discriminator’s structure often makes GANs harder to train. When the discriminator becomes far superior to the generator, the generator may stop making progress [Durugkar, Gemp, and Mahadevan2016, Arjovsky and Bottou2017, Neyshabur, Bhojanapalli, and Chakrabarti2017]. When the distribution of generated samples has little overlap with the real image distribution, the generator cannot receive useful gradients from the discriminator to improve its performance, which is referred to as the vanishing gradient problem in GANs [Arjovsky, Chintala, and Bottou2017].
We train two discriminators against one generator as a more stable way to improve the capacity of the discriminators, similar to [Durugkar, Gemp, and Mahadevan2016]. Specifically, the multi-discriminator architecture can better approximate the optimal discriminator, and if one of the discriminators becomes far superior to the generator, the generator can still receive instructive gradients from the other discriminator.
Choice of Receptive Fields
In the two-discriminator setting, we conduct experiments with three different pairs of receptive field sizes. For a fair comparison, all models have two discriminators against one generator. The performance of these models is shown and analyzed in the Experiments section.
In the first case, both discriminators have a large receptive field. The two discriminators share the same structure but are randomly initialized, and we simply average their losses to obtain the final adversarial loss. Such a receptive field is close to the image size, so the discriminators model real images from a global view. Therefore, if a generated face image exhibits a pose that never occurs in the real image set, the discriminators give it a low score and prevent the generator from generating such images, which restricts the creativity of the generator.
In the second case, we use two discriminators with small receptive fields. Because such a discriminator models only local patterns of real images, it does not require the whole generated image to be similar to real images. The generator is therefore only weakly restricted in global structure, which allows it to transform images with head poses that never occur in the target domain into much better samples than in the first case. In other words, it enhances the generator’s creativity. However, discriminators with small receptive fields cannot model global features, which results in global inconsistencies such as excessive deformation of the generated eyes and face.
In the third case, one discriminator has a large receptive field, while the other has a small receptive field. This is a trade-off between the two cases above that models global and local structure simultaneously. We add a parameter to tune the weights of the two adversarial losses, so that the model can take both global and local features into consideration. While the small-receptive-field discriminator improves the generator’s creativity, the large-receptive-field discriminator helps eliminate the abrupt deformations caused by the absence of global inspection.
For each mapping function, the final adversarial loss is a weighted combination of the adversarial losses of the two discriminator instances, with a hyperparameter that sets the ratio between them. In the first two cases, this weight is set to 0.5, as the final results are not sensitive to the balance between the two losses. In the third case, the weight needs to be tuned carefully, because the performance is sensitive to it.
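A minimal sketch of such a combination is given below. It assumes the two adversarial losses are mixed as a convex combination, which is consistent with averaging at a weight of 0.5; the parameter name `weight` is hypothetical.

```python
def two_discriminator_adversarial_loss(loss_d1: float, loss_d2: float,
                                       weight: float = 0.5) -> float:
    """Blend the adversarial losses of two discriminator instances.

    loss_d1, loss_d2 : adversarial losses computed with each discriminator
    weight           : share given to the first discriminator, in [0, 1];
                       0.5 reduces to a plain average
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * loss_d1 + (1.0 - weight) * loss_d2


if __name__ == "__main__":
    # Equal weighting, as in the two cases with identical discriminators.
    print(two_discriminator_adversarial_loss(1.0, 3.0))  # 2.0
    # Shifting the balance toward the second discriminator.
    print(two_discriminator_adversarial_loss(1.0, 3.0, weight=0.25))  # 2.5
```

With two discriminators of different receptive field sizes, this single weight decides how strongly global coherence is enforced relative to local realism.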
We conduct experiments on three clips:
Barack Obama in weekly address
Joe Biden in weekly address
Li Xiuping in CCTV News
We manually crop the three videos and extract the frames from the cropped videos. Each clip mainly shows one character talking to the camera. We chose these videos for the following reasons. Because we aim to transfer the facial performance, each video contains the face of the acting character. Body parts rarely appear in the cropped videos and the shooting angles are fixed; otherwise, the generator might treat irrelevant body movements and changes of shooting angle as important identities. The images are scaled to a fixed size.
Most input images with common facial expression and head pose are transformed into realistic samples in the target domain. Example results are shown in Figure 1.
Performance of Different Receptive Fields
In this section, we compare the performance of three models, each with two discriminators trained against one generator. The first model has two PatchGAN discriminators with a large receptive field; we refer to it as the large-field model. The second has two discriminators with a small receptive field; we refer to it as the small-field model. The third has one discriminator with a large receptive field and one with a small receptive field; we refer to it as the mixed model.
To illustrate the difference between the three models, we compare their performance on the task Joe Biden → Li Xiuping. This is the most difficult of our tasks for two reasons. First, Joe Biden is an English speaker while Li Xiuping is a Chinese speaker, so some mouth shapes in Joe Biden’s video have no counterpart. Second, Joe Biden moves his head freely in his video while Li Xiuping moves only within a small range. We pick representative frames that illustrate the advantages and disadvantages of the three models.
A discriminator with a large receptive field models the global structure of images, which restricts the generator’s creativity. It makes the generator produce noisy and distorted face images if the source image exhibits a head pose unseen in the target image set. A discriminator with a small receptive field allows the generator to produce uncommon head poses but cannot ensure global features. The mixed model is a trade-off between global and local features. By tuning the weights of the two discriminators’ adversarial losses, we achieve high-quality results.
Creativity on Unseen Head Poses
In the video of Joe Biden, he sometimes raises his head, sometimes tilts his head drastically, while Li Xiuping never does so in her video. So there is no image that can be considered as a direct reference.
Joe Biden in the source frames raises his head, as shown in Figure 3. The large-field model generates distorted faces, because its discriminators model whole images and treat the unseen head pose as a fake sample, which hinders the generator from creating unseen head poses. The other two models generate meaningful faces, although there are some artifacts on the neck. The results of these two models are similar.
In another video fragment, Figure 3, Joe Biden tilts his head drastically. The large-field model generates a face image with noise everywhere. The small-field model generates a clear face but causes unpleasant deformation of the generated face. This is because its discriminators cannot model the whole face; they inspect generated images only from a local perspective. Without a restriction on global features, they cannot penalize the obvious global deformation. The mixed model, whose discriminators cover both global and local structures, produces much better results.
In this section, we discuss the limitations of our face transfer method. We show the performance of our three models on generating unseen expressions in Figure 4. We cut a fragment of video in which Joe Biden grins broadly to pronounce ’any’. It is hard to find a similar mouth shape in Chinese, let alone in a short video of Li Xiuping. Without a reference, it is hard to generate a sharp and realistic mouth with our approach. The generated mouth resembles the source image in overall shape but lacks the details of the teeth.
CycleGAN encourages the generator to map a source image to a target image with identical facial expression and head pose. However, when the source and target video do not match in the diversity of head pose and facial expressions, it is hard to learn a perfect one-to-one mapping, which results in noisy and distorted results as shown in Figure 2, Figure 3 and Figure 4. To address this issue we adopt our two-discriminator architecture to create some unseen poses.
In this paper, we leverage CycleGAN and PatchGAN to achieve end-to-end face transfer. CycleGAN learns a one-to-one mapping, which ensures that each source face image is mapped to a target image with the corresponding facial expression and head pose. In practice, with limited training samples, it is difficult to generate a good target face image when the facial expression or head pose is unseen in the target dataset. To improve the generalization ability of the generator, we propose to adopt a discriminator with a small receptive field to relax the restriction on the generator and a discriminator with a large receptive field to ensure global coherence. This two-discriminator architecture achieves the best results in our experiments.
As future work, loss re-weighting on image patches could help improve the quality of generated images. Investigating the impact of receptive field size on other image-to-image translation tasks would also be of interest.
- [Arjovsky and Bottou2017] Arjovsky, M., and Bottou, L. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
- [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein gan. ICML.
- [Dale et al.2011] Dale, K.; Sunkavalli, K.; Johnson, M. K.; Vlasic, D.; Matusik, W.; and Pfister, H. 2011. Video face replacement. ACM Transactions on Graphics (TOG) 30(6):130.
- [Durugkar, Gemp, and Mahadevan2016] Durugkar, I.; Gemp, I.; and Mahadevan, S. 2016. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673.
- [Garrido et al.2014] Garrido, P.; Valgaerts, L.; Rehmsen, O.; Thormahlen, T.; Perez, P.; and Theobalt, C. 2014. Automatic face reenactment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4217–4224.
- [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
- [Isola et al.2016] Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2016. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.
- [Kim et al.2017a] Kim, T.; Cha, M.; Kim, H.; Lee, J.; and Kim, J. 2017a. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192.
- [Kim et al.2017b] Kim, T.; Kim, B.; Cha, M.; and Kim, J. 2017b. Unsupervised visual attribute transfer with reconfigurable generative adversarial networks. arXiv preprint arXiv:1707.09798.
- [LeCun, Bengio, and Hinton2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.
- [Li and Wand2016] Li, C., and Wand, M. 2016. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, 702–716. Springer.
- [Li et al.2012] Li, K.; Xu, F.; Wang, J.; Dai, Q.; and Liu, Y. 2012. A data-driven approach for facial expression synthesis in video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 57–64. IEEE.
- [Lin et al.2017] Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.
- [Neyshabur, Bhojanapalli, and Chakrabarti2017] Neyshabur, B.; Bhojanapalli, S.; and Chakrabarti, A. 2017. Stabilizing gan training with multiple random projections. arXiv preprint arXiv:1705.07831.
- [Shen and Liu2016] Shen, W., and Liu, R. 2016. Learning residual images for face attribute manipulation. arXiv preprint arXiv:1612.05363.
- [Shi et al.2014] Shi, F.; Wu, H.-T.; Tong, X.; and Chai, J. 2014. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Transactions on Graphics (TOG) 33(6):222.
- [Thies et al.2016] Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; and Nießner, M. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2387–2395.
- [Vlasic et al.2005] Vlasic, D.; Brand, M.; Pfister, H.; and Popović, J. 2005. Face transfer with multilinear models. In ACM transactions on graphics (TOG), volume 24, 426–433. ACM.
- [Yi et al.2017] Yi, Z.; Zhang, H.; Gong, P. T.; et al. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510.
- [Zhu et al.2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.