Face-off is an interesting case of style transfer in which the facial expressions and attributes of one person are fully transferred onto another person's face. We are interested in an unsupervised training process that requires only two unaligned sequences of video frames, one from each person, and learns automatically which shared attributes to extract. In this project, we explored various improvements to adversarial training (i.e. CycleGAN [Zhu et al., 2017]) to deal with the common problem of mode collapse and to capture details in facial expressions and head poses, and thus to transfer facial expressions with higher consistency and stability.
1.1 Our Contribution
This project explored different approaches to generating better face-off videos based on CycleGAN [Zhu et al., 2017].
Please see this video first (use headphones!): https://drive.google.com/file/d/1QFh-n0L6Q-3kwfO1WIKjf7_O5dsCwClL/view?usp=sharing
Code: https://github.com/ShangxuanWu/CycleGAN-Face-off
Our experimental results are summarized below:
We proposed several working methods for improving CycleGAN face-off results:
Using two distinct discriminators[Xu et al., 2017].
Using face segmentation masks as weight on cycle-consistency loss.
Using SSIM loss[Wang et al., 2003] with loss weight within a proper range learns better poses and structure, but needs more tuning to improve facial details.
2 Related Work
Facial expression transfer is a classic topic in computer vision and graphics using facial landmark localization and face morphing, and it became more popular with the adoption of neural style transfer and GANs [Zhu et al., 2017] [Goodfellow et al., 2014] [Radford et al., 2015] [Isola et al., 2016]. GAN models such as Pix2pix [Isola et al., 2016] and CycleGAN [Zhu et al., 2017] have shown impressive results on image-to-image translation, which learns to relate two different data domains.
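The loss function of CycleGAN is composed of two parts: the traditional GAN loss and a new cycle-consistency loss, which encourages cycle consistency between the two mappings $G: X \to Y$ and $F: Y \to X$:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F),$$
where the cycle-consistency loss represents how similar $F(G(x))$ is to $x$ and $G(F(y))$ is to $y$:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\|G(F(y)) - y\|_1\right].$$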
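The cycle-consistency term can be illustrated with a minimal NumPy sketch. The two generators here are toy stand-in functions (hypothetical, not trained networks), and the expectation is approximated by a mean over all pixels, which is an assumption about normalization rather than the project's actual implementation.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: mean |F(G(x)) - x| + mean |G(F(y)) - y|."""
    forward = np.mean(np.abs(F(G(x)) - x))   # X -> Y -> X
    backward = np.mean(np.abs(G(F(y)) - y))  # Y -> X -> Y
    return forward + backward

# Toy stand-ins for the two generators (hypothetical, for illustration only).
G = lambda img: img + 0.1   # maps domain X to domain Y
F = lambda img: img - 0.1   # maps domain Y to domain X; exact inverse of G

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8, 3))  # batch of "person A" frames
y = rng.random((4, 8, 8, 3))  # batch of "person B" frames

# F inverts G, so both cycles reconstruct their inputs.
loss_consistent = cycle_consistency_loss(x, y, G, F)
# An identity F does not undo G's shift, so each cycle is off by 0.1.
loss_inconsistent = cycle_consistency_loss(x, y, G, lambda img: img)
```

When the two mappings invert each other the loss vanishes; any mismatch between them is penalized in proportion to the per-pixel reconstruction error.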
However, in our experiments we found it very challenging to obtain good transfer results on unaligned datasets. In the next section, we discuss several techniques for dealing with this.
In this section, we explore how to generate better face-off sequences on unaligned datasets in the following aspects: (1) using WGAN loss [Gulrajani et al., 2017] in the adversarial part; (2) adding SSIM loss [Wang et al., 2003] to the cycle-consistency loss; (3) adding face segmentation masks as reconstruction weights; (4) using skip layers to increase multi-scale invariance; and (5) increasing the capacity of the discriminators.
3.1 WGAN Loss
We consider WGAN [Gulrajani et al., 2017] to deal with the common problem of mode collapse in adversarial training and to achieve more stable results. In our tests, we found that some of the expressions of person A were transferred to the same pose and expression of person B. The standard discriminator loss uses cross-entropy and suffers from vanishing gradients. To address this, we adopted the improvements proposed in the WGAN paper.
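For orientation, the critic objective proposed by Gulrajani et al. [2017] can be written as below. This is the formulation from the cited paper; which of its components and which penalty coefficient $\lambda$ were used in this project is not stated here, so it should be read as background rather than as the exact loss we trained with.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right]
    - \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
      \left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]
```

Here $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated distributions, and $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples. The critic outputs an unbounded score instead of a probability, which avoids the saturating cross-entropy term.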
3.2 SSIM (Structural Similarity) Loss
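SSIM loss [Wang et al., 2003] matches the luminance ($l$), contrast ($c$), and structure ($s$) information of the generated image and the input image, and it has proved very helpful for improving the quality of image generation. Multi-scale SSIM considers SSIM over $M$ scales as follows:
$$\text{MS-SSIM}(x, y) = \left[l_M(x, y)\right]^{\alpha_M} \cdot \prod_{j=1}^{M} \left[c_j(x, y)\right]^{\beta_j} \left[s_j(x, y)\right]^{\gamma_j}$$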
We added SSIM loss in CycleGAN in order to enforce the similarity between recovered image and original image (as illustrated in Figure 1).
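As an illustration of what an SSIM-based reconstruction term measures, the sketch below computes a single-scale SSIM over the whole image (no sliding window, no multi-scale pyramid) and turns it into a loss as `1 - SSIM`. Both simplifications, the function names, and the standard constants 0.01 and 0.03 are assumptions made for this example; the project's actual implementation may differ.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (simplified, no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # Contrast and structure combined, using the usual choice C3 = C2 / 2.
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure

def ssim_loss(x, y):
    """Loss that is 0 for identical images and grows as structure diverges."""
    return 1.0 - global_ssim(x, y)

rng = np.random.default_rng(0)
original = rng.random((16, 16))
loss_same = ssim_loss(original, original)            # perfect reconstruction
loss_inverted = ssim_loss(original, 1.0 - original)  # structure flipped
```

An inverted image has the same mean brightness and contrast as the original but negatively correlated structure, so it is penalized heavily even though a brightness-only comparison would not notice the difference.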
3.3 Background Subtraction and Face Mask
We observed severe background corruption in the video shown in [Wei, 2017]. One possible reason is that rather than explicitly separating foreground and background, CycleGAN treats the whole image as one object and transfers the domain implicitly.
Therefore, in this project, we explicitly dealt with foreground and background, trying to get a sharper boundary of the target. We segmented the input faces as shown in Figure 2, then fed the mask as weight on pixel-wise reconstruction error.
We experimented with both fully convolutional networks (FCN) [Long et al., 2015] and DLib [King, 2017] in this project. Using FCN-8s pretrained on the Pascal VOC dataset, we can segment the whole person from the image, which guides the network to put more focus on the whole body. However, we noticed that the segmentation result is very unstable and does not give a clear contour of the person. We therefore turned to DLib, with which we extracted the facial landmarks and then converted the face polygons to masks. Using the face masks is more helpful for focusing on facial expressions.
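The polygon-to-mask step can be sketched without any vision library. The function below rasterizes a polygon with even-odd ray casting on pixel centers; the function name and the example coordinates are hypothetical, and in practice the polygon would come from the outline of the detected landmarks rather than from hand-written points.

```python
import numpy as np

def polygon_to_mask(polygon, height, width):
    """Rasterize a polygon [(x, y), ...] into a binary mask (even-odd rule)."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5  # test pixel centers
    inside = np.zeros((height, width), dtype=bool)
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        # Does this edge span the pixel's row?
        crosses = (y0 > py) != (y1 > py)
        # x position where the edge meets the pixel's row.
        x_at_py = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        # Toggle pixels that lie to the left of the crossing.
        inside ^= crosses & (px < x_at_py)
    return inside.astype(np.float32)

# Hypothetical face outline: an axis-aligned square for easy checking.
face_polygon = [(2, 2), (6, 2), (6, 6), (2, 6)]
mask = polygon_to_mask(face_polygon, height=8, width=8)
```

A pixel is marked as face when a ray cast from its center crosses the outline an odd number of times, which also handles non-convex outlines.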
There are two intuitive approaches to leveraging the face mask:
Crop out the face part only as network input and neglect all the other parts.
Based on the segmentation mask, apply pixel-wise weights to the original cycle-consistency loss.
We experimented with both of them, and the results are analyzed in Section 5.4.
3.4 Generator Variants
3.5 Discriminator Variants
According to [Xu et al., 2017], increasing the capacity of the discriminator in GANs results in more natural and higher-resolution image generation. Therefore, we propose some ways to strengthen the discriminative process:
Vanilla CycleGAN uses a 3-layer convolutional network as its discriminator. We first extend the depth of this subnetwork to 5 convolutional layers.
Alternatively, we can use two different discriminators at each side and average their losses using a given weight, which is set to 0.5 in our experiments. The discriminator loss is then the weighted average of the two discriminators' losses.
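A minimal sketch of this weighting is given below. The function name, the argument names, and the convex-combination form `w * a + (1 - w) * b` are assumptions made for illustration; with the weight of 0.5 used in our experiments it reduces to a plain average.

```python
def combined_discriminator_loss(loss_a, loss_b, weight=0.5):
    """Weighted average of two discriminators' losses on the same side."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * loss_a + (1.0 - weight) * loss_b

# Hypothetical per-discriminator losses for one training step.
shallow_loss = 0.2  # e.g. from the 3-layer discriminator
deep_loss = 0.6     # e.g. from the 5-layer discriminator

balanced = combined_discriminator_loss(shallow_loss, deep_loss)
shallow_only = combined_discriminator_loss(shallow_loss, deep_loss, weight=1.0)
```

Moving the weight toward 1 or 0 recovers the single-discriminator setting, which makes it easy to compare the variants with the same training code.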
We evaluated the proposed methods on four videos:
A video of Shangxuan talking to the camera shot by ourselves.
A video of Ye talking to the camera shot by ourselves.
A video of Xiaohan talking to the camera shot by ourselves.
We created four datasets from the videos described above by extracting 2,200 frames from each clip, of which 2,000 belong to the training set and 200 are test images. All videos were shot against a relatively plain background, but the faces in each are not perfectly aligned, as shown in Figure 4. Each image is randomly cropped when being input to the network for data augmentation. Our face-off experiments were performed on three pairs of datasets: A2B, A2C, and A2D.
Dataset links: https://drive.google.com/drive/folders/1Oepq5qBleF9mDrzdulEzWnY04FAlPUA1?usp=sharing and https://drive.google.com/drive/folders/1yGZ0NJJeqxhSYONIUjOc7v30NhpAS65U?usp=sharing
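The split and the crop augmentation can be sketched as follows. The file names, the sequential (rather than shuffled) split, and the toy image and crop sizes are all hypothetical choices for this example; only the 2,000/200 frame counts come from the description above.

```python
import numpy as np

def split_frames(frames, n_train=2000, n_test=200):
    """Use the first n_train frames for training and the next n_test for testing."""
    if len(frames) < n_train + n_test:
        raise ValueError("not enough frames for the requested split")
    return frames[:n_train], frames[n_train:n_train + n_test]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of an H x W x C image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)   # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

frames = [f"frame_{i:04d}.png" for i in range(2200)]  # hypothetical file names
train_frames, test_frames = split_frames(frames)

rng = np.random.default_rng(0)
image = np.zeros((12, 12, 3))         # toy size, not the real resolution
crop = random_crop(image, 8, 8, rng)  # toy crop size
```

Because the crop offset is drawn anew for every sample, the network sees the same face at slightly different positions, which is useful when the faces are not aligned in the first place.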
5 Experiments and Result Analysis
The results of the different experiments (baseline, SSIM loss, WGAN loss, U-Net structure, face masks, and discriminator structures) are shown in Figure 5 and analyzed below.
We trained a baseline CycleGAN model with a 3-layer discriminator (3 convolution blocks). We back-propagate using the least-squares GAN (LS-GAN) loss and the cycle-consistency ($L_1$) loss. The model was trained on the unaligned datasets, where two people appear with different poses and at different spatial locations, as mentioned in the Dataset section.
For face-off between Shangxuan and Russ, the baseline model already learns the mapping of human poses well, but the results are quite unstable at edges. Since the faces of Shangxuan and Russ are not perfectly aligned, the model does not deal with scale well.
For face-off between Ye and Russ, the baseline model learns to transform well between different gender domains, but the results are again very shaky and inconsistent between frames.
For face-off between Xiaohan and Russ, the baseline model learns to transform human poses and gender as well, but it maps Xiaohan's hairstyle onto Russ, which creates many noisy results.
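The least-squares adversarial loss used by the baseline can be sketched as below. The function names, the target labels of 1 for real and 0 for generated, and the factor of 0.5 on the discriminator loss are common conventions assumed for this example, not details confirmed by the text above.

```python
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    """Push discriminator outputs toward 1 on real and 0 on generated images."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_generator_loss(d_fake):
    """Push discriminator outputs on generated images toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

# A discriminator that separates real from generated perfectly.
d_loss_perfect = lsgan_discriminator_loss(np.ones(4), np.zeros(4))
g_loss_perfect = lsgan_generator_loss(np.zeros(4))

# A discriminator that outputs 0.5 everywhere, i.e. cannot tell them apart.
d_loss_confused = lsgan_discriminator_loss(np.full(4, 0.5), np.full(4, 0.5))
g_loss_confused = lsgan_generator_loss(np.full(4, 0.5))
```

Unlike cross-entropy, the squared error keeps a non-zero gradient for generated samples that are classified confidently, which is one reason it tends to train more stably.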
5.2 WGAN Loss
We implemented WGAN loss to improve the training of GANs. However, the training was very unstable, even after we tuned the learning rate and clipped the gradients. Training with WGAN has a high failure rate and is slow. The results show that the least-squares GAN loss produces better results than the WGAN loss.
5.3 SSIM Loss
SSIM loss should in principle be better at matching luminance, contrast, and structure information. Although it has been shown to produce perceptually preferable images, it did not improve image details in our experiments. After some tuning, we found that a proper range for the SSIM weight is around 0.0001 to 0.01. The weight should not be too large, or it may dominate the reconstruction loss, in which case we do not see as much detail as with the baseline model using the $L_1$ loss. With SSIM loss added at weight 0.01, Figure 5 shows that SSIM helps the model learn the pose well; it still needs more tuning to recover more facial details, but it is generally helpful.
5.4 Face Mask
We want the model to learn to focus on facial expressions, and to allow higher gradient flow to train those areas better. We therefore experimented with face masks generated by DLib. There are two ways to use the face masks: the first is to crop the input images and feed in the masked image; the second is to apply the mask to the loss function.
5.4.1 Mask-out Background
When we mask out the background, the remaining face is only a small portion of the image for the model to learn from. In such a scenario, the generator exhibits very poor diversity among generated samples, so the discriminator cannot tag them as fake, which makes training the discriminator ineffective. The model collapsed easily. Common ways to solve this problem are to increase the dataset diversity or batch size, to let the discriminator learn more about edge cases, or to use WGAN to improve the training of GANs. However, we tried WGAN, and it is hard to tune as well.
5.4.2 Mask as Loss Weight
From the previous experiment, we conclude that it is better to keep the background while increasing the weight of the facial parts. Thus, during training, we input binary masks together with the images and applied the element-wise product of the masks with the reconstruction loss. We can see from Figure 5 that this adds more facial details, such as teeth, as well as more natural facial expressions. With higher gradient flow on faces, the network learns to focus more on facial details.
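One way to realize such a mask-weighted reconstruction loss is sketched below. The function name and the two weights are hypothetical: setting `background_weight=0` gives the literal element-wise product with a binary mask, while a non-zero background weight keeps the background in the loss and only emphasizes the face. The exact weighting used in our experiments is not specified here.

```python
import numpy as np

def masked_l1_loss(reconstructed, original, face_mask,
                   face_weight=2.0, background_weight=1.0):
    """Weighted mean absolute error; face pixels count more than background."""
    # Per-pixel weights (H x W), broadcast over the channel axis.
    weights = np.where(face_mask > 0, face_weight, background_weight)[..., None]
    abs_err = np.abs(reconstructed - original)
    normalizer = weights.sum() * original.shape[-1]
    return float((weights * abs_err).sum() / normalizer)

original = np.zeros((4, 4, 3))
face_mask = np.zeros((4, 4))
face_mask[1:3, 1:3] = 1            # toy 2x2 "face" region

err_on_face = original.copy()
err_on_face[1, 1] = 1.0            # unit error on one face pixel
err_on_background = original.copy()
err_on_background[0, 0] = 1.0      # same error on one background pixel

loss_face = masked_l1_loss(err_on_face, original, face_mask)
loss_background = masked_l1_loss(err_on_background, original, face_mask)
```

The same reconstruction error costs twice as much inside the face region in this toy setting, so gradients from facial pixels carry correspondingly more weight.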
5.5 U-Net in Generator
We observed severe mode collapse when using U-Net as the generator. The generated images are nearly identical even when the input images have different poses. A likely reason is that, compared with the vanilla ResNet generator structure, U-Net is less capable of extracting image representations. We suggest using a stronger network if the generator is to be replaced in the future.
5.6 Multiple Deeper Discriminators
When the number of discriminator layers increases, the receptive field size is reduced, forcing the model to learn a more detailed translation from one domain to another. The results show that the model with a 5-layer discriminator does a better job of imitating the facial expressions of the input, although global structure, such as the head-shoulder ratio, is not preserved as well.
The multiple-discriminator GAN, as opposed to the one with a single discriminator, increases the model capacity and is more robust to random noise. It noticeably outperforms the other settings on images with an unseen pose (as shown in the fifth and sixth rows of Figure 5). With a reasonable trade-off between patterns learned from different receptive fields, the generator combines the subtle facial expressions of the source person without deviating too much from the target person's features.
5.7 Training Loss Comparison
We plotted the three training losses (generator loss, discriminator loss and cycle-consistency loss) for three different architectures: baseline (vanilla CycleGAN), a good face-off network (Double Discriminator CycleGAN) and a bad face-off network (U-Net Generator CycleGAN). The plots are shown below:
The patterns in the loss curves (shown in Figure 6) reveal that the discriminator loss and the cycle-consistency loss decrease over epochs. However, for all three networks, the generator loss increases after some point, which indicates imbalanced training of the generator and the discriminator, with the discriminator becoming too strong. Note that the generator loss includes both the reconstruction loss and the adversarial term for fooling the discriminator, so an increase in generator loss might also indicate that the network is learning to add more random details in order to fool the discriminator.
We can also see that there is not much difference among the loss curves of the three networks. This is consistent with the observation that loss is not a useful indicator of generated visual quality in GAN training.
5.8 Discussion of Evaluation Metrics
It is known that GAN results are hard to evaluate as the applications are usually on the edge of art and technology.
We looked into evaluation metrics such as the Inception score used in the WGAN paper [Gulrajani et al., 2017] and the FID score [Heusel et al., 2017], both of which correlate well with human judgment. However, it is important to evaluate the Inception score on a large enough number of samples (e.g. 50k) and on samples from different classes, as part of this metric measures diversity. Alternatively, the FID score captures the similarity of generated images to real ones better than the Inception score. However, a minimum sample size of 10k is recommended for computing FID; otherwise the true FID of the generator is underestimated.
Since our datasets mainly contain only one class (person) and each dataset is quite small (around 2k images), the Inception score and FID cannot tell us much about how good the generated images are. Considering that there are not many videos to evaluate and that the differences are easy to see, we relied mainly on manual inspection of the visual fidelity of the generated videos. Links to the generated videos are given in the first section and the appendix.
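For reference, the standard definitions of the two metrics are given below. They are stated here from the general literature as background; the symbols are introduced only for this block.

```latex
\mathrm{IS} = \exp\left( \mathbb{E}_{x \sim p_g}
    \left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right)

\mathrm{FID} = \|\mu_r - \mu_g\|_2^2
    + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Here $p(y \mid x)$ is the class distribution predicted by an Inception network for a generated image $x$, $p(y)$ is its marginal over generated images, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features of real and generated images. The Inception score rewards a spread of predicted classes, which explains why it is uninformative on a single-class dataset, and FID needs enough samples to estimate the covariance matrices reliably.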
6 Conclusion and Future Work
This project explored different approaches to generating better face-off videos based on CycleGAN [Zhu et al., 2017]. Our experiments show that using two distinct discriminators, using deeper discriminator networks, or applying face segmentation masks as weights on the cycle-consistency loss results in smoother and more stable face-off results.
During this project, we also found some drawbacks of the existing CycleGAN structure, and we want to address the following problems in the future:
Exploring the possibility of domain transfer using CycleGAN, i.e. training transfer between different types of objects.
Increasing the capability of CycleGAN face-off in complicated backgrounds.
Extending face-off to body-off: training GANs for generating full-body pose transfer.
- Goodfellow et al.  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Gulrajani et al.  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
- Heusel et al.  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
- Isola et al.  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
- King  D. King. DLib C++ Library, 2017. URL http://dlib.net/.
- Long et al.  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- Radford et al.  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Ronneberger et al.  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- Wang et al.  Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398–1402. IEEE, 2003.
- Wei  T. Wei. Cyclegan face-off, 2017. URL https://www.youtube.com/watch?v=Fea4kZq0oFQ.
- Xu et al.  R. Xu, Z. Zhou, W. Zhang, and Y. Yu. Face transfer with generative adversarial network. arXiv preprint arXiv:1710.06090, 2017.
- Zhu et al.  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Appendix A
A.1 Russ with Great Face, Hair and Pose!
A.2 More Generated Videos
The following are our test sequences and results for Shangxuan's and Xiaohan's sequences. Please use headphones!
Shangxuan’s test sequence:
Xiaohan’s test sequence: