High Fidelity Face Manipulation with Extreme Pose and Expression

03/28/2019 · by Chaoyou Fu, et al. · Horizon Robotics

Face manipulation has shown remarkable advances with the flourishing of Generative Adversarial Networks. However, due to the difficulty of controlling structure and texture at high resolution, it is challenging to simultaneously model pose and expression during manipulation. In this paper, we propose a novel framework that simplifies face manipulation with extreme pose and expression into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. In the first stage, we propose to use a boundary image for joint pose and expression modeling. An encoder-decoder network is employed to predict the boundary image of the target face in a semi-supervised way. Pose and expression estimators are used to improve the prediction accuracy. In the second stage, the predicted boundary image and the original face are encoded into the structure and texture latent spaces by two encoder networks, respectively. A proxy network and a feature threshold loss are further imposed as constraints to disentangle the latent space. In addition, we build a new high quality Multi-View Face (MVF-HQ) database that contains 120K high-resolution face images of 479 identities with pose and expression variations, which will be released soon. Qualitative and quantitative experiments on four databases show that our method pushes forward the advance of extreme face manipulation from 128 × 128 resolution to 1024 × 1024 resolution, and significantly improves face recognition performance under large poses.


1 Introduction

Photo-realistic face manipulation with arbitrary pose and expression is a meaningful task in a wide range of fields, such as the movie industry, entertainment and photography. With the flourishing of Generative Adversarial Networks (GANs) [9], face manipulation has achieved significant advances in recent years [7, 25, 16, 14, 27]. However, existing face manipulation methods mainly focus on only one facial variation (e.g., pose or expression), and methods for large pose or expression are still limited to low resolution (128 × 128). In particular, joint pose and expression modeling is challenging [40], especially when high-resolution facial images exhibit extreme poses and expressions.

Figure 2: Visual comparisons (512 × 512) of different methods. (a) Direct image-to-image translation [33]: the local structures, e.g., the mouth, are unclear. (b) Directly concatenating the original input face and the boundary of the target face [14]: the local structures are ambiguous and the textures are confused. (c) Directly utilizing a face recognition network to disentangle structure and texture [2]: the local structures are clear, but the textures are somewhat lost. (d) Our method: the structures and textures are well maintained.

For face manipulation, a straightforward way is to apply image-to-image translation [17, 33]. However, in the case of high resolution with extreme pose and expression, it is difficult to preserve the local facial structures in this way. As shown in Fig. 2 (a), the local facial structures, such as the eyes, nose and mouth, are unclear. Recent observations in [33] show that boundary information is crucial in high fidelity image synthesis. Hence, we argue that the lack of geometry guidance makes it difficult to synthesize extreme high-resolution face images. Several geometry guided methods have been proposed for face manipulation [14, 18, 29]. For example, CAPG-GAN [14] utilizes facial landmarks to control face rotation. SC-FEGAN [18] realizes local facial editing by sketch. Most geometry guided methods directly concatenate a face image and its geometry guidance in the image space. However, since there is no disentanglement between structure and texture, such a concatenation struggles to maintain the facial structure and texture. As shown in Fig. 2 (b), the structures of the synthesized face are ambiguous and its textures are confused. [2] proposes a simple disentanglement strategy: it introduces a face recognition network to learn structure-invariant features, and then concatenates these structure-invariant features with structure features to synthesize faces. As shown in Fig. 2 (c), this disentanglement does make the structure of the synthesized face clearer, but the textures of the synthesized high-resolution face are somewhat lost. We argue that this is because the features of the face recognition network are too compact, leading to severe texture loss in such a high-resolution case.

Based on the above observations, we propose a novel framework for high-resolution face manipulation with extreme pose and expression, as shown in Fig. 3. Our framework simplifies this challenging task into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. The first stage utilizes a boundary image for joint pose and expression modeling. It employs an encoder-decoder network to predict the boundary image of the target face in a semi-supervised way [26]. Pose and expression estimators are introduced to improve the prediction accuracy. The second stage encodes the predicted boundary image and the original face into the structure and texture latent spaces by two encoder networks, respectively. A proxy network and a feature threshold loss are proposed to disentangle the latent space. Specifically, since it is hard to directly disentangle structure and texture [28], we introduce a face recognition network as a proxy to facilitate disentanglement. Different from [2], which directly utilizes the compact features of the proxy network, we propose a simple yet effective feature threshold loss to control the compactness between our learned face features and the compact features, as shown in Fig. 3. Our method disentangles the structure and texture while keeping the integrity of the texture, as shown in Fig. 2 (d). More high-resolution extreme face manipulation results are presented in Fig. 1 (512 × 512) and Fig. 5 (1024 × 1024).

Moreover, we introduce a new high quality Multi-View Face (MVF-HQ) database. It contains 120K high-resolution face images from 479 identities with diverse pose and expression variations, whose facial areas reach 2048 × 2048 resolution. We will release this database soon, along with 5 precise facial landmarks per image annotated by humans.

In summary, the main contributions are as follows:

  • The high-resolution face manipulation problem with extreme pose and expression is formulated as a stage-wise learning problem that contains two correlated stages: a boundary prediction stage and a disentangled face synthesis stage.

  • We realize joint pose and expression modeling via boundary image translation in the first stage. In addition, a proxy network and a feature threshold loss are introduced in the second stage to disentangle structure and texture, so as to better utilize the boundary image.

  • To the best of our knowledge, this is the first work to explore extreme high-resolution face manipulation. A new high-resolution (2048 × 2048) face database is created. This work is expected to promote the development of high-resolution image synthesis.

  • Experiments on the MultiPIE [10], RaFD [22], CelebA-HQ [19] and our MVF-HQ databases show that our method pushes forward the advance of extreme face manipulation from 128 × 128 resolution to 1024 × 1024 resolution, and significantly improves the face recognition performance under large poses.

2 Related Work

Figure 3: The framework of our method, which consists of a boundary prediction stage and a disentangled face synthesis stage. The first stage predicts the boundary image of the target face in a semi-supervised way. A pose estimator and an expression estimator are used to improve the prediction accuracy. The second stage utilizes the predicted boundary image to synthesize the refined face. A proxy network and a feature threshold loss are introduced to disentangle the structure and texture in the latent space.

2.1 Face Manipulation

Face manipulation has attracted great attention in computer vision and graphics [3, 34, 37, 6, 21, 30, 23]. Recently, Generative Adversarial Networks (GANs) [9] have shown great potential in the field of face manipulation. For example, StarGAN [7] realizes multi-domain face attribute transfer with a single generator. By controlling the magnitude of Action Units (AUs), GANimation [25] renders expressions in a continuum. TP-GAN [16] realizes photo-realistic facial frontalization from a single image. FaceID-GAN [27] introduces an identity classifier as a competitor to better preserve identity when pose and expression change. However, extreme face manipulation methods [16, 27, 14] are still limited to low resolution (128 × 128). High-resolution face manipulation with extreme pose and expression remains unexplored.

2.2 High Fidelity Image Synthesis

High fidelity image synthesis is a hot topic in the computer vision community. It includes unconditional and conditional settings. Unconditional high-resolution image synthesis generates images from noise without any condition. PG-GAN [19] synthesizes high-resolution face images by progressively growing the generator and discriminator; it also introduces the 1024 × 1024 resolution CelebA-HQ database. IntroVAE [15] first utilizes a variational model to synthesize high-resolution images without a discriminator. For conditional high-resolution image synthesis, the synthesized images need to meet the given conditions. pix2pixHD [33] proposes a coarse-to-fine generator and a multi-scale discriminator for high fidelity image translation. Video-to-video [32] extends pix2pixHD with a spatio-temporal adversarial objective, achieving temporally coherent high-resolution video translation. BigGAN [4] first achieves high-resolution (512 × 512) conditional image synthesis on ImageNet. StyleGAN [20] introduces an alternative generator to automatically learn attributes and releases the 1024 × 1024 resolution FFHQ database.

3 Method

Given an original face $I^a$, the goal of our method is to synthesize the target face $\hat{I}^b$ according to a given pose vector $c^p$ and an expression vector $c^e$. In addition, we denote the boundary images of the original face and the target face as $B^a$ and $B^b$, respectively. In order to better realize high fidelity face synthesis with extreme pose and expression, we explicitly divide the face manipulation task into two stages: a boundary prediction stage and a disentangled face synthesis stage, as shown in Fig. 3. In the rest of this section, we present these two stages in detail.
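To make the stage-wise design concrete, the following PyTorch-style sketch chains the two stages at inference time. The module names and argument layout are our own illustrative placeholders, not the authors' released implementation.

```python
import torch.nn as nn


class TwoStageManipulator(nn.Module):
    """Illustrative wrapper for the two-stage pipeline (names are assumptions)."""

    def __init__(self, boundary_predictor: nn.Module, face_synthesizer: nn.Module):
        super().__init__()
        self.boundary_predictor = boundary_predictor  # stage 1: boundary prediction
        self.face_synthesizer = face_synthesizer      # stage 2: disentangled synthesis

    def forward(self, face_a, boundary_a, pose_vec, expr_vec):
        # Stage 1: predict the target boundary image B_hat^b from the source
        # boundary B^a and the desired pose / expression vectors.
        boundary_b = self.boundary_predictor(boundary_a, pose_vec, expr_vec)
        # Stage 2: synthesize the target face I_hat^b from the predicted
        # boundary (structure) and the original face (texture).
        face_b = self.face_synthesizer(boundary_b, face_a)
        return boundary_b, face_b
```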

3.1 Boundary Prediction

The boundary prediction stage predicts the target boundary image according to the given conditional vectors, including a pose vector and an expression vector. As shown in Fig. 3, we utilize an encoder network $E_b$ and a decoder network $G_b$ to realize this conditional boundary prediction. Specifically, through $E_b$, we first map the original input boundary image $B^a$ into a latent representation $z$. Then, the pose vector $c^p$ and the expression vector $c^e$ are concatenated with the hidden variable $z$ to provide conditional information. Last, the target boundary image $\hat{B}^b$ is generated by the decoder network $G_b$.
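A minimal sketch of this conditional encoder-decoder, assuming the pose and expression vectors are tiled spatially and concatenated channel-wise with the latent feature map before decoding (the exact fusion is not specified in the text):

```python
import torch
import torch.nn as nn


class ConditionalBoundaryGenerator(nn.Module):
    """Stage-1 sketch: encode B^a, inject (c^p, c^e), decode B_hat^b."""

    def __init__(self, enc: nn.Module, dec: nn.Module):
        super().__init__()
        self.enc = enc  # E_b: boundary image -> latent feature map z
        self.dec = dec  # G_b: [z, tiled conditions] -> target boundary image

    def forward(self, boundary_a, pose_vec, expr_vec):
        z = self.enc(boundary_a)                       # (N, C, H, W)
        cond = torch.cat([pose_vec, expr_vec], dim=1)  # (N, P + E)
        # Tile the condition vector over the spatial grid of z and concatenate
        # it channel-wise (our assumed fusion scheme).
        cond_map = cond[:, :, None, None].expand(-1, -1, z.size(2), z.size(3))
        return self.dec(torch.cat([z, cond_map], dim=1))
```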

The poses and expressions in a database are discrete; e.g., the MultiPIE database [10] only has 15 discrete poses and 6 discrete expressions. However, we expect this stage to generate boundary images with arbitrary poses and expressions, both within and beyond the database. Hence, we introduce a semi-supervised training manner. For poses and expressions in the database, we utilize the corresponding ground truth to constrain the generated boundary image. For poses and expressions that do not exist in the database, we utilize two pre-trained estimators, a pose estimator $F_p$ and an expression estimator $F_e$, to constrain the generated boundary image by conditional regression.

The loss functions involved in this stage are described below, including a pixel-wise loss and a conditional regression loss.

Pixel-Wise Loss. For the poses and expressions that belong to the database, a pixel-wise loss is utilized to constrain the predicted boundary image $\hat{B}^b$:

$$\mathcal{L}^{B}_{pix} = \lVert \hat{B}^{b} - B^{b} \rVert_{1}, \quad (1)$$

where $B^b$ is the ground truth target boundary image.

Conditional Regression Loss. For poses and expressions that do not exist in the database, we first randomly produce $c^p$ and $c^e$ to generate a boundary image $\hat{B}$. Then, we utilize the pose estimator $F_p$ and the expression estimator $F_e$ to estimate the pose $\hat{c}^p$ and the expression $\hat{c}^e$, respectively. The estimated $\hat{c}^p$ and $\hat{c}^e$ are used to constrain the generated boundary image. The intuition is that the estimated $\hat{c}^p$ and $\hat{c}^e$ of $\hat{B}$ should be equal to the conditional vectors $c^p$ and $c^e$, respectively. Hence, a conditional regression loss, including a pose regression term and an expression regression term, is formulated as:

$$\mathcal{L}_{reg} = \lVert F_p(\hat{B}) - c^{p} \rVert_{2}^{2} + \lVert F_e(\hat{B}) - c^{e} \rVert_{2}^{2}. \quad (2)$$

The parameters of the pre-trained $F_p$ and $F_e$ are fixed during the training procedure.
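A hedged sketch of the two stage-1 losses as reconstructed in Eqs. (1)-(2); `pose_net` and `expr_net` stand in for the frozen estimators $F_p$ and $F_e$, and the L1 / squared-L2 choices are our reading of the text:

```python
import torch.nn.functional as F


def boundary_pixel_loss(pred_boundary, gt_boundary):
    # Eq. (1): L1 distance to the ground-truth target boundary image B^b.
    return F.l1_loss(pred_boundary, gt_boundary)


def conditional_regression_loss(pred_boundary, pose_vec, expr_vec, pose_net, expr_net):
    # Eq. (2): the frozen estimators F_p / F_e should recover the randomly
    # sampled condition vectors from the generated boundary image. Their
    # parameters are frozen (requires_grad=False) outside this function,
    # while gradients still flow back to pred_boundary.
    pose_hat = pose_net(pred_boundary)
    expr_hat = expr_net(pred_boundary)
    return F.mse_loss(pose_hat, pose_vec) + F.mse_loss(expr_hat, expr_vec)
```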

3.2 Disentangled Face Synthesis

This stage utilizes the predicted boundary image to perform refined face synthesis. As shown in Fig. 3, we first utilize two encoders $E_s$ and $E_t$ to map the predicted boundary image $\hat{B}^b$ and the original input face $I^a$ to the structure features $f_s$ and the texture features $f_t$, respectively. Then, we disentangle the structure and texture in the latent space by a proxy network and a feature threshold loss. After disentanglement, the boundary features $f_s$ and the image features $f_t$ are concatenated and fed into the decoder $G_f$, synthesizing the final target face $\hat{I}^b$.

The loss functions in this stage are presented below, including a feature threshold loss, a multi-scale pixel-wise loss, a multi-scale conditional adversarial loss and an identity preserving loss.

Feature Threshold Loss. The feature threshold loss is designed to assist in disentangling the structure and texture in the latent space. Considering that directly disentangling structure and texture is difficult, we utilize a pre-trained face recognition network as a proxy network, whose features are thought to be structure invariant. In addition, instead of directly utilizing the compact features, which results in texture loss as shown in Fig. 2 (c), we introduce a feature threshold loss to better disentangle the structure and texture. Specifically, it controls the feature distance between the face features $f_t$ and the compact features $f_c$ extracted by the proxy network:

$$\mathcal{L}_{ft} = \max\left(\lVert f_t - f_c \rVert_{2} - t,\ 0\right), \quad (3)$$

where $t$ is a threshold value. As the loss decreases, the face features $f_t$ move closer to the compact features $f_c$, which means the structure and texture are more disentangled. Meanwhile, the threshold value $t$ controls the degree of compactness of the face features $f_t$, which is used to maintain the texture. The parameter analysis of $t$ is presented in Section 4.4.
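A minimal sketch of Eq. (3) as reconstructed here, with `proxy_net` standing in for the frozen face recognition proxy and `t` the threshold (set to 7 in Section 4.1); the pooled (N, C) feature shapes are an assumption:

```python
import torch


def feature_threshold_loss(face_features, face_image, proxy_net, t=7.0):
    # f_c: compact, structure-invariant features from the frozen proxy network.
    with torch.no_grad():
        compact_features = proxy_net(face_image)
    # Penalize the feature distance only when it exceeds the threshold t,
    # so the learned features move toward f_c without collapsing onto it
    # (which would discard texture, as in Fig. 2 (c)).
    dist = torch.norm(face_features - compact_features, p=2, dim=1)
    return torch.clamp(dist - t, min=0.0).mean()
```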

Figure 4: Synthesis results (512 × 512) with different poses and expressions on the MVF-HQ database. For each image pair, the left is the original input and the right is the synthesized result. Zoom in for details.

Multi-Scale Pixel-Wise Loss. We introduce a multi-scale pixel-wise loss to constrain the synthesized face at different scales. Specifically, by downsampling with factors of 2 and 4, we first obtain a 3-scale image pyramid for the synthesized face and the ground truth face, respectively. Then, we calculate the pixel-wise loss at these three scales:

$$\mathcal{L}_{pix} = \sum_{s=1}^{3} \lVert \hat{I}^{b}_{s} - I^{b}_{s} \rVert_{1}, \quad (4)$$

where $s$ denotes the scale. The pixel-wise loss at the top of the image pyramid pays more attention to global information, because each pixel has a larger receptive field. On the contrary, the pixel-wise loss at the bottom of the image pyramid is more concerned with the recovery of details.
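A short sketch of Eq. (4); average pooling is our choice of downsampling operator, which the text does not specify:

```python
import torch.nn.functional as F


def multiscale_pixel_loss(pred_face, gt_face):
    # Eq. (4): L1 loss over a 3-level image pyramid (scales 1, 1/2, 1/4).
    loss = F.l1_loss(pred_face, gt_face)
    for factor in (2, 4):
        loss = loss + F.l1_loss(F.avg_pool2d(pred_face, factor),
                                F.avg_pool2d(gt_face, factor))
    return loss
```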

Multi-Scale Conditional Adversarial Loss. To improve the sharpness of the synthesized face images, we also introduce a conditional adversarial loss. The discriminator $D$ tries to distinguish the fake image pair $(\hat{B}^b, \hat{I}^b)$ from the real image pair $(\hat{B}^b, I^b)$, and the generator tries to fool the discriminator:

$$\mathcal{L}_{adv} = \mathbb{E}\big[\log D(\hat{B}^{b}, I^{b})\big] + \mathbb{E}\big[\log\big(1 - D(\hat{B}^{b}, \hat{I}^{b})\big)\big]. \quad (5)$$

In order to improve the ability of the discriminator, we adopt the multi-scale discriminator strategy [33]. It utilizes three discriminators to discriminate the synthesized images at three different scales.
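A hedged sketch of this multi-scale conditional adversarial objective; pairing each face with the predicted target boundary as the condition and using the vanilla GAN objective are our assumptions, while the three-scale discriminator setup follows pix2pixHD [33]:

```python
import torch
import torch.nn.functional as F


def multiscale_adversarial_loss(discriminators, boundary_b, real_face, fake_face):
    # Three discriminators judge (condition, face) pairs at scales 1, 1/2, 1/4.
    # Returns (d_loss, g_loss); fake images are detached for the D update.
    d_loss, g_loss = 0.0, 0.0
    for i, D in enumerate(discriminators):
        s = 2 ** i
        cond = F.avg_pool2d(boundary_b, s) if s > 1 else boundary_b
        real = F.avg_pool2d(real_face, s) if s > 1 else real_face
        fake = F.avg_pool2d(fake_face, s) if s > 1 else fake_face
        real_logit = D(torch.cat([cond, real], dim=1))
        fake_logit = D(torch.cat([cond, fake.detach()], dim=1))
        d_loss = d_loss + F.binary_cross_entropy_with_logits(
            real_logit, torch.ones_like(real_logit))
        d_loss = d_loss + F.binary_cross_entropy_with_logits(
            fake_logit, torch.zeros_like(fake_logit))
        gen_logit = D(torch.cat([cond, fake], dim=1))
        g_loss = g_loss + F.binary_cross_entropy_with_logits(
            gen_logit, torch.ones_like(gen_logit))
    return d_loss, g_loss
```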

Identity Preserving Loss. In order to further preserve the identity information of the synthesized faces, we adopt an identity preserving loss as in [14]. Specifically, a pre-trained Light CNN [35] is introduced as a feature extractor $\phi$. It forces the identity features of the synthesized face to be as close as possible to the identity features of the real face. The identity preserving loss is formulated as:

$$\mathcal{L}_{ip} = \lVert \phi_{p}(\hat{I}^{b}) - \phi_{p}(I^{b}) \rVert_{2}^{2} + \lVert \phi_{fc}(\hat{I}^{b}) - \phi_{fc}(I^{b}) \rVert_{2}^{2}, \quad (6)$$

where $\phi_{p}$ and $\phi_{fc}$ denote the outputs of the last pooling layer and the fully connected layer of $\phi$, respectively.
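A minimal sketch of Eq. (6), assuming a Light CNN wrapper that returns both the last pooling feature and the fully connected feature and is kept frozen during training:

```python
import torch.nn.functional as F


def identity_preserving_loss(fake_face, real_face, light_cnn):
    # Eq. (6): match Light CNN features of the synthesized and real faces.
    fake_pool, fake_fc = light_cnn(fake_face)
    real_pool, real_fc = light_cnn(real_face)
    return F.mse_loss(fake_pool, real_pool) + F.mse_loss(fake_fc, real_fc)
```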

3.3 Overall Loss

The boundary prediction stage and the disentangled face synthesis stage are trained separately. We first train the boundary prediction stage, and then utilize the predicted boundary to train the face synthesis stage. For the boundary prediction stage, the overall loss is:

$$\mathcal{L}_{stage1} = \lambda_{1} \mathcal{L}^{B}_{pix} + \lambda_{2} \mathcal{L}_{reg}. \quad (7)$$

For the face synthesis stage, the overall loss is:

$$\mathcal{L}_{stage2} = \lambda_{3} \mathcal{L}_{ft} + \lambda_{4} \mathcal{L}_{pix} + \lambda_{5} \mathcal{L}_{adv} + \lambda_{6} \mathcal{L}_{ip}, \quad (8)$$

where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, $\lambda_{4}$, $\lambda_{5}$ and $\lambda_{6}$ are the trade-off parameters.
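A sketch of how the two overall objectives could be assembled; the mapping of the reported trade-off weights (Section 4.1) to individual terms is not stated in the text, so no default weights are assumed here:

```python
def stage1_total_loss(l_pix_boundary, l_reg, lam1, lam2):
    # Eq. (7) sketch: boundary pixel-wise loss plus conditional regression loss.
    return lam1 * l_pix_boundary + lam2 * l_reg


def stage2_total_loss(l_ft, l_pix, l_adv, l_ip, lam3, lam4, lam5, lam6):
    # Eq. (8) sketch: feature threshold + multi-scale pixel-wise + adversarial
    # + identity preserving losses. The paper reports the weights
    # (1, 0.1, 0.01, 50, 0.5, 0.02) but not their term-wise assignment.
    return lam3 * l_ft + lam4 * l_pix + lam5 * l_adv + lam6 * l_ip
```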

4 Experiments

We evaluate our method on four databases. The details of databases and experimental settings are first introduced in Section 4.1. Then, qualitative and quantitative results are presented in Sections 4.2 and 4.3, respectively. Finally, experimental analysis is described in Section 4.4.

Figure 5: Synthesis results (1024 × 1024) on the MVF-HQ database. The lower right corner is the input face.

4.1 Databases and Settings

Classic Databases. Three classic face databases, MultiPIE [10], RaFD [22] and CelebA-HQ [19], are used in our experiments. MultiPIE contains 337 identities under 15 poses, 20 illumination levels and 6 expressions. In our quantitative experiments, the division of training and testing sets is the same as Setting 2 in [38], which only contains the neutral expression. In our qualitative experiments, we also use the data of the other 5 expressions. RaFD consists of 8,040 images, including 73 participants with 8 expressions, 3 gaze directions and 5 poses. We randomly select 10 identities as the testing set and use the remaining identities as the training set. CelebA-HQ is an in-the-wild database that consists of 30,000 celebrity images. Considering that most of the images in CelebA-HQ are frontal views, we utilize a 3D model [43] to generate corresponding paired profiles. We randomly choose 3,000 images as the testing set and use the remaining images as the training set. Note that all face images in MultiPIE are aligned to 128 × 128 resolution, while the images in RaFD and CelebA-HQ are aligned to 512 × 512 resolution.

Database | Images | Resolution | Identities | Poses | Expressions | Paired | Year
RaFD [22] | 8,040 | 512 × 512 | 73 | 5 | 8 | Yes | 2008
CelebA-HQ [19] | 30,000 | 1024 × 1024 | No Label | No Label | No Label | No | 2017
FFHQ [20] | 70,000 | 1024 × 1024 | No Label | No Label | No Label | No | 2018
MVF-HQ (Ours) | 120,283 | 2048 × 2048 | 479 | 13 | 3 | Yes | 2019
Table 1: Comparisons of existing high-resolution face databases. Resolution denotes the maximum resolution of the facial area that can be aligned.

MVF-HQ Database. In order to verify the effectiveness of our high fidelity method in the extreme case, we need a higher-resolution database with various poses and expressions. However, most existing face manipulation databases, e.g., MultiPIE, are limited to low resolution. Although the RaFD database can be aligned to 512 × 512, its size (8,040 images) and pose diversity (5 poses) are limited. The recently released high-resolution databases CelebA-HQ [19] and FFHQ [20] have greatly pushed forward the advances of high-resolution image synthesis, but their pose variations are also limited.

Therefore, we create a new high quality Multi-View Face (MVF-HQ) database that consists of 120,283 images (some face images were removed while manually cleaning the database) from 479 identities, including 13 poses, 3 expressions and 7 illuminations. The facial area of MVF-HQ can be aligned up to 2048 × 2048 resolution. Comparisons with existing public high-resolution face databases are presented in Table 1, which shows the advantages of our MVF-HQ database. More information about this database is presented in the supplementary materials. In our experiments, we randomly select 336 identities as the training set and treat the remaining 143 identities as the testing set. There is no identity overlap between training and testing. In addition, due to limited GPU memory, we only conduct experiments at 512 × 512 and 1024 × 1024 resolutions. Higher resolutions will be explored in future work. We will release this database soon, along with 5 precise facial landmarks per image annotated by humans.

Experimental Settings. The facial boundary image is obtained from the facial landmarks. Thanks to advances in facial landmark detection [5], we first detect 68 facial landmarks and then connect adjacent landmarks to obtain a boundary image. Meanwhile, pose vectors are calculated from the detected facial landmarks. Moreover, we utilize Action Units (AUs) [8] as our expression vectors, which are collected by the open source toolkit [1]. The pose estimator and expression estimator in Section 3.1 are pre-trained on the above four databases and a large-scale in-the-wild database, CelebA [24]. Our method is implemented in PyTorch. The parameters $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, $\lambda_{4}$, $\lambda_{5}$ and $\lambda_{6}$ in Section 3.3 are set to 1, 0.1, 0.01, 50, 0.5 and 0.02, respectively. The threshold $t$ in Eq. 3 is set to 7. The learning rate is set to 0.0002. Note that the high-resolution experiments on the MVF-HQ database are conducted on 8 NVIDIA Titan X GPUs with 12GB memory. Training takes about 12 days for 1024 × 1024 resolution and about 7 days for 512 × 512 resolution.
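A hedged sketch of how a boundary image can be rendered from 68 detected landmarks by connecting adjacent points; the part grouping follows the standard 68-point convention, and the canvas size and line thickness are illustrative assumptions rather than the authors' exact settings:

```python
import cv2
import numpy as np

# Standard 68-point grouping; adjacent landmarks inside each part are
# connected to draw the boundary image.
PARTS = {
    "jaw": range(0, 17), "right_brow": range(17, 22), "left_brow": range(22, 27),
    "nose_bridge": range(27, 31), "nose_base": range(31, 36),
    "right_eye": range(36, 42), "left_eye": range(42, 48),
    "outer_lip": range(48, 60), "inner_lip": range(60, 68),
}
CLOSED = {"right_eye", "left_eye", "outer_lip", "inner_lip"}


def landmarks_to_boundary(landmarks, size=(512, 512), thickness=2):
    """landmarks: (68, 2) array of (x, y) points in pixel coordinates."""
    canvas = np.zeros((*size, 3), dtype=np.uint8)
    for name, idx in PARTS.items():
        pts = landmarks[list(idx)].astype(np.int32).reshape(-1, 1, 2)
        # Draw white polylines; eye/lip contours are closed, the others open.
        cv2.polylines(canvas, [pts], name in CLOSED, (255, 255, 255), thickness)
    return canvas
```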

Figure 6: Visual comparisons with CAPG-GAN [14] on the MultiPIE Setting 2. Our method achieves better results in texture, e.g., the freckles in the first set of images. Zoom in for details.
Figure 7: Synthesis results on the MultiPIE database. The boundary images are generated by our boundary prediction stage.
Figure 8: Facial expression and pose synthesis (512 × 512) on the RaFD database. The first column is the input, and the remaining columns are synthesized results with different expressions and poses.
Figure 9: Visual comparisons (512 × 512) with pix2pixHD [33] on the RaFD database (left) and the MVF-HQ database (right). Zoom in for details.

4.2 Qualitative Experiments

Experimental Results on the MultiPIE. According to the given conditional vectors, our method can render an input face with an arbitrary pose and expression. Another state-of-the-art work on a similar task is CAPG-GAN [14], which rotates a face to an arbitrary pose controlled by 5 facial landmarks. CAPG-GAN directly concatenates the original face and the target landmarks as input, and then feeds them into the generator. The comparison results between our method and CAPG-GAN are shown in Fig. 6. We can see that the synthesized images of CAPG-GAN cannot preserve the texture well, e.g., the freckles in the first set of images. This is because CAPG-GAN does not disentangle structure and texture in the latent space. On the contrary, the images synthesized by our method are closer to the ground truth. Besides, different from CAPG-GAN, our method can also render expressions. Fig. 7 presents more synthesized results. We can observe that the structures and textures of our synthesized images are well preserved, even under extreme poses and expressions.

Experimental Results on the RaFD. We first compare our method with pix2pixHD [33], a state-of-the-art high-resolution conditional image-to-image translation method, as shown in Fig. 9. Compared with pix2pixHD, which lacks the guidance of structure, the images synthesized by our method have better quality. More results under different expressions and poses are shown in Fig. 8. Note that the number of training images in the RaFD database is small, which poses a considerable challenge to network training. Fig. 8 and Fig. 9 show the ability of our method to achieve extreme high-resolution results with limited training images.

Experimental Results on the MVF-HQ. The comparisons between our method and pix2pixHD [33] are shown in Fig. 9. Compared with pix2pixHD, our method still maintains sharpness under such extreme poses. More synthesis results of our method with different poses and expressions are shown in Fig. 1 and Fig. 4. We observe that our method can faithfully synthesize photo-realistic details, including the eyebrows, eyes, teeth, hair, etc. Moreover, we also extend our method to 1024 × 1024 resolution. As seen in Fig. 5, our method achieves convincing results in this challenging situation.

Figure 10: Synthesis results (512 × 512) on the CelebA-HQ database. The lower right corner is the input face.

Experimental Results on the CelebA-HQ. In order to further explore the generalization of our method to the in-the-wild setting, we perform a visual comparison of face synthesis on the CelebA-HQ database. Fig. 10 shows the results of synthesizing frontal faces from profiles. Our method not only preserves the overall facial structures, but also recovers the unseen textures.

4.3 Quantitative Experiments

In this section, we evaluate the identity preserving property and the synthesis quality of our method. As shown in Fig. 7, our method can effectively restore the structure and texture from profile faces, which can be used to improve face recognition under large poses [14, 36, 12]. Hence, we compare the face recognition accuracy of our method with state-of-the-art face normalization methods, including 3D-PIM [42], CAPG-GAN [14], PIM [41], TP-GAN [16], FF-GAN [39] and DR-GAN [31], on the MultiPIE Setting 2. The comparison results are shown in Table 2. We can see that our method significantly outperforms its competitors, especially under extreme poses. In addition, Table 3 further tabulates the results of different methods on the MVF-HQ database. Compared with Setting 2 of MultiPIE, which only contains the neutral expression, our new database is more challenging because of its complicated expressions. We observe that our method still outperforms other state-of-the-art methods, including pix2pixHD [33] and CAPG-GAN. The face recognition results on MultiPIE and MVF-HQ suggest that our method can effectively improve recognition performance under large poses.

Besides, in order to evaluate the quality of the synthesized images, we also compare the Fréchet Inception Distance (FID) [13] with CAPG-GAN and pix2pixHD. We calculate the FID between the real faces and the synthesized faces. The results in Table 4 (a) further demonstrate the high-quality synthesis capability of our method.
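For reference, a minimal sketch of this FID protocol using torchmetrics (our choice of tooling, requiring the torchmetrics image extra; not necessarily what the authors used):

```python
from torchmetrics.image.fid import FrechetInceptionDistance


def compute_fid(real_batches, fake_batches):
    # real_batches / fake_batches: iterables of uint8 tensors of shape (N, 3, H, W).
    fid = FrechetInceptionDistance(feature=2048)
    for batch in real_batches:
        fid.update(batch, real=True)
    for batch in fake_batches:
        fid.update(batch, real=False)
    return fid.compute()  # lower is better
```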

Method
DR-GAN [31] - -
FF-GAN [39]
TP-GAN [16]
PIM [41]
CAPG-GAN [14]
3D-PIM [42]
Ours
Table 2: Comparisons of Rank-1 recognition rates (%) across views under the MultiPIE Setting 2.
Method
CAPG-GAN [14]
pix2pixHD [33]
Light CNN [35]
Ours
Table 3: Comparisons of Rank-1 recognition rates (%) across views under the MVF-HQ database.
CAPG-GAN pix2pixHD Ours
FID
(a) Comparisons with the state-of-the-art methods.
w/o $\mathcal{L}_{reg}$ | w/o $\mathcal{L}_{ft}$ | w/o $\mathcal{L}_{pix}$ | w/o $\mathcal{L}_{adv}$ | w/o $\mathcal{L}_{ip}$
FID
(b) Comparisons with our variants.
Table 4: The FID comparisons (lower is better) with CAPG-GAN [14], pix2pixHD [33] and our variants on the MVF-HQ database.

4.4 Experimental Analysis

Ablation Study. In this subsection, we study the roles of the five loss functions in our method. Both qualitative visualization results and quantitative FID results are reported for a comprehensive comparison. Fig. 11 shows the visual comparisons between our method and its five variants. We can see that without $\mathcal{L}_{reg}$, the generated boundary image that does not belong to the database becomes unclear, resulting in an incomplete synthesized face. Without $\mathcal{L}_{ft}$, the local structures (e.g., the eyes and nose) are ambiguous and the textures are confused, indicating the importance of disentanglement. Without $\mathcal{L}_{pix}$ or $\mathcal{L}_{adv}$, the synthesized faces display different degrees of blur, revealing the validity of the multi-scale pixel-wise loss and the multi-scale adversarial loss (note that w/o $\mathcal{L}_{pix}$ means only utilizing a single-scale pixel-wise loss). Without $\mathcal{L}_{ip}$, the local textures (e.g., the beard) are largely missing. Table 4 (b) further tabulates the FID results of the different variants of our method. We can see that the FID increases whenever one loss is removed, which is consistent with the visualization results. These visualization and FID results verify that each component in our method is essential for extreme high-resolution face manipulation.

Parameter Analysis. As mentioned in Section 3.2, the threshold $t$ in Eq. 3 affects the degree of disentanglement. We present the visual results for different values of $t$ in Fig. 11. We observe that when $t$ is too large, the synthesized faces are blurred due to weak disentanglement. On the contrary, when $t$ is too small, the texture of the synthesized faces is somewhat lost because the face features become too compact. The best result is obtained when $t = 7$.

Figure 11: The results (512 × 512) of the experimental analysis. The first row shows the results of the ablation study and the second row presents the parameter analysis of $t$ in Eq. 3.

5 Conclusion

This paper has developed a stage-wise framework for high fidelity face manipulation with extreme pose and expression. It simplifies face manipulation into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. The first stage predicts the boundary image of the target face in a semi-supervised way, modeling pose and expression jointly. The second stage utilizes the predicted boundary to perform refined face synthesis. It introduces a proxy network and a novel feature threshold loss to disentangle the structure and texture in the latent space. Further, a new high-resolution MVF-HQ database is created to promote the development of high-resolution face synthesis. Extensive experiments show that our method pushes forward the advance of extreme face manipulation from 128 × 128 resolution to 1024 × 1024 resolution, and significantly improves the face recognition performance under large poses.

References

  • [1] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. Openface 2.0: Facial behavior analysis toolkit. In FG, 2018.
  • [2] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In CVPR, 2018.
  • [3] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In CGF, 2003.
  • [4] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019.
  • [5] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017.
  • [6] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. TVCG, 2014.
  • [7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [8] P. Ekman and W. V. Friesen. Facial action coding system: A technique for the measurement of facial movement. Palo Alto, 1978.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [10] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 2010.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [12] R. He, X. Wu, Z. Sun, and T. Tan. Wasserstein cnn: Learning invariant features for nir-vis face recognition. TPAMI, 2018.
  • [13] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a nash equilibrium. In NIPS, 2017.
  • [14] Y. Hu, X. Wu, B. Yu, R. He, and Z. Sun. Pose-guided photorealistic face rotation. In CVPR, 2018.
  • [15] H. Huang, Z. Li, R. He, Z. Sun, and T. Tan. Introvae: Introspective variational autoencoders for photographic image synthesis. In NIPS, 2018.
  • [16] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV, 2017.
  • [17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [18] Y. Jo and J. Park. Sc-fegan: Face editing generative adversarial network with user’s sketch and color. arXiv:1902.06838, 2019.
  • [19] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
  • [20] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2018.
  • [21] I. Kemelmacher-Shlizerman, S. Suwajanakorn, and S. M. Seitz. Illumination-aware age progression. In CVPR, 2014.
  • [22] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg. Presentation and validation of the radboud faces database. Cognition and Emotion, 2010.
  • [23] P. Li, Y. Hu, R. He, and Z. Sun. Global and local consistent wavelet-domain age synthesis. TIFS, 2018.
  • [24] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [25] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In ECCV, 2018.
  • [26] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. In CVPR, 2018.
  • [27] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang. Faceid-gan: Learning a symmetry three-player gan for identity-preserving face synthesis. In CVPR, 2018.
  • [28] Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In ECCV, 2018.
  • [29] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan. Geometry guided adversarial facial expression synthesis. In ACM MM, 2018.
  • [30] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In CVPR, 2016.
  • [31] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, 2017.
  • [32] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In NIPS, 2018.
  • [33] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
  • [34] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under arbitrary unknown lighting conditions. TPAMI, 2009.
  • [35] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. TIFS, 2018.
  • [36] X. Wu, H. Huang, V. M. Patel, R. He, and Z. Sun. Disentangled variational representation for heterogeneous face recognition. In AAAI, 2019.
  • [37] F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas. Expression flow for 3d-aware face component transfer. TOG, 2011.
  • [38] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. In CVPR, 2015.
  • [39] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In ICCV, 2017.
  • [40] F. Zhang, T. Zhang, Q. Mao, and C. Xu. Joint pose and expression modeling for facial expression recognition. In CVPR, 2018.
  • [41] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, S. Yan, and J. Feng. Towards pose invariant face recognition in the wild. In CVPR, 2018.
  • [42] J. Zhao, L. Xiong, Y. Cheng, Y. Cheng, J. Li, L. Zhou, Y. Xu, J. Karlekar, S. Pranata, S. Shen, J. Xing, S. Yan, and J. Feng. 3d-aided deep pose-invariant face recognition. In IJCAI, 2018.
  • [43] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In CVPR, 2016.

6 Supplementary Materials

6.1 Multi-View Face (MVF-HQ) Database

The data acquisition system is shown in Fig. 12 (a). It consists of 13 digital cameras (Canon EOS 1300D/1500D with 55mm prime lenses), located at the same height as the head. The angle between two adjacent cameras is 15°. All the cameras are connected to computers to take photos simultaneously. The original images from different views are shown in Fig. 12 (c). The resolution is 6000 × 4000 and the facial area can reach 2048 × 2048. Besides, 7 illuminations are provided in the acquisition system, including above, front, front-above, front-below, behind, left and right. A total of 479 volunteers participated in the database collection and all volunteers signed a license. Each participant was asked to display three facial expressions: neutral, smile and surprise, as shown in Fig. 12 (b). Therefore, the total number of images is 130,767 (479 identities × 3 expressions × 13 views × 7 illuminations). After manually cleaning the database, the final number of images is 120,283. In addition, we also manually mark 5 facial landmarks for each image.

Figure 12: Technical setup and examples (6000 × 4000 resolution) of our MVF-HQ database. (a) Technical setup. (b) Examples of the three expressions. (c) Examples of the thirteen views.

6.2 Network Architectures

The network architectures of the encoder, the decoder and the discriminator are presented in Table 5, Table 6 and Table 7, respectively (a PyTorch sketch of the encoder and decoder stacks is given after Table 7). Conv contains convolution, instance normalization and ReLU, while Conv* contains only convolution. Meanwhile, Deconv contains deconvolution with one output padding, instance normalization and ReLU. In addition, the two encoders share the same architecture, except that one of them uses half the number of channels in each convolution layer. Res-block denotes a residual block [11]. The output shapes in Table 7 are for an input resolution of 1024 × 1024; the same discriminator architecture is directly applied to the other two scales of the input image (512 × 512 and 256 × 256), the same as [33].

Layer | Filter/Stride/Padding | Output Shape
Input image | - | 3 × 1024 × 1024
Conv | 7 × 7 / 1 / 3 | 32 × 1024 × 1024
Conv | 3 × 3 / 2 / 1 | 64 × 512 × 512
Conv | 3 × 3 / 2 / 1 | 128 × 256 × 256
Conv | 3 × 3 / 2 / 1 | 256 × 128 × 128
Conv | 3 × 3 / 2 / 1 | 512 × 64 × 64
Conv | 3 × 3 / 2 / 1 | 256 × 32 × 32
Conv | 3 × 3 / 2 / 1 | 128 × 16 × 16
Conv | 3 × 3 / 2 / 1 | 128 × 8 × 8
Table 5: The architecture of the encoder.
Layer | Filter/Stride/Padding | Output Shape
Input feature | - | 192 × 8 × 8
Res-block | 3 × 3 / 1 / 1 | 512 × 8 × 8
Res-block | 3 × 3 / 1 / 1 | 512 × 8 × 8
Res-block | 3 × 3 / 1 / 1 | 512 × 8 × 8
Deconv | 3 × 3 / 2 / 1 | 512 × 16 × 16
Deconv | 3 × 3 / 2 / 1 | 512 × 32 × 32
Deconv | 3 × 3 / 2 / 1 | 256 × 64 × 64
Deconv | 3 × 3 / 2 / 1 | 256 × 128 × 128
Deconv | 3 × 3 / 2 / 1 | 128 × 256 × 256
Deconv | 3 × 3 / 2 / 1 | 64 × 512 × 512
Deconv | 3 × 3 / 2 / 1 | 32 × 1024 × 1024
Conv* | 7 × 7 / 1 / 3 | 3 × 1024 × 1024
Table 6: The architecture of the decoder.
Discriminator | Filter/Stride/Padding | Output Shape
Input image | - | 6 × 1024 × 1024
Conv | 4 × 4 / 2 / 1 | 64 × 512 × 512
Conv | 4 × 4 / 2 / 1 | 128 × 256 × 256
Conv | 4 × 4 / 2 / 1 | 256 × 128 × 128
Conv | 4 × 4 / 1 / 1 | 512 × 127 × 127
Conv* | 4 × 4 / 1 / 1 | 1 × 126 × 126
Table 7: The architecture of the discriminator.
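To make Tables 5 and 6 concrete, the following hedged PyTorch sketch stacks the listed layers; kernel sizes, strides and channel widths follow the tables, while the block definitions and the 1 × 1 shortcut in the residual block are our assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F


def conv_in_relu(c_in, c_out, k, s, p):
    # "Conv" block: convolution + instance normalization + ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p),
                         nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True))


def deconv_in_relu(c_in, c_out):
    # "Deconv" block: transposed convolution (output_padding=1) + IN + ReLU.
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 3, 2, 1, output_padding=1),
                         nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True))


class ResBlock(nn.Module):
    # Residual block [11]; the 1x1 shortcut when channel counts differ is our
    # assumption (Table 6 goes from 192 to 512 channels at the first block).
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(conv_in_relu(c_in, c_out, 3, 1, 1),
                                  nn.Conv2d(c_out, c_out, 3, 1, 1),
                                  nn.InstanceNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))


# Encoder of Table 5: 3 x 1024 x 1024 -> 128 x 8 x 8.
encoder = nn.Sequential(
    conv_in_relu(3, 32, 7, 1, 3),
    conv_in_relu(32, 64, 3, 2, 1),
    conv_in_relu(64, 128, 3, 2, 1),
    conv_in_relu(128, 256, 3, 2, 1),
    conv_in_relu(256, 512, 3, 2, 1),
    conv_in_relu(512, 256, 3, 2, 1),
    conv_in_relu(256, 128, 3, 2, 1),
    conv_in_relu(128, 128, 3, 2, 1),
)

# Decoder of Table 6: 192 x 8 x 8 -> 3 x 1024 x 1024.
decoder = nn.Sequential(
    ResBlock(192, 512), ResBlock(512, 512), ResBlock(512, 512),
    deconv_in_relu(512, 512), deconv_in_relu(512, 512),
    deconv_in_relu(512, 256), deconv_in_relu(256, 256),
    deconv_in_relu(256, 128), deconv_in_relu(128, 64),
    deconv_in_relu(64, 32),
    nn.Conv2d(32, 3, 7, 1, 3),  # "Conv*": plain convolution output layer
)
```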

6.3 Additional Results on the CelebA-HQ

Due to uncontrolled variations, such as illumination and background, high-resolution face frontalization in the in-the-wild setting is challenging. More visualization results (512 × 512 resolution) on the CelebA-HQ database are shown in Fig. 13 and Fig. 14, which demonstrate the effectiveness of our method in the uncontrolled situation.

6.4 Additional Results on the MVF-HQ

More face rotation results on the MVF-HQ database are presented in Fig. 15 (512 × 512 resolution), Fig. 16 (1024 × 1024 resolution) and Fig. 17 (1024 × 1024 resolution). The results (512 × 512 resolution) of continuous pose change are shown as a GIF file in our folder (the frontal image is the original input face and the others are synthesized).

6.5 Additional Results on the MultiPIE

Fig. 18 shows the results of rotating an input frontal face to arbitrary poses and expressions. We can observe that the synthesis results are impressive, even under extreme poses and expressions. Fig. 19 presents the results of frontalizing a ±90° profile face with expression changes. It is challenging to frontalize a profile face under ±90°, let alone change the expression at the same time. We can see that the results of our method are very close to the ground truth.

6.6 Additional Results on the RaFD

By continuously controlling the expression vector, our method can realize continuous expression changes. The results (512 × 512 resolution) are shown as a GIF file in our folder (the first image, with a neutral expression, is the original face and the others are synthesized).

Figure 13: Visualization results (512 × 512 resolution) on the CelebA-HQ database. The lower right corner is the input profile.
Figure 14: Visualization results (512 × 512 resolution) on the CelebA-HQ database. The lower right corner is the input profile.
Figure 15: Visualization results (512 × 512 resolution) on the MVF-HQ database. The lower right corner is the input face.
Figure 16: Visualization results (1024 × 1024 resolution) on the MVF-HQ database. The lower right corner is the input face.
Figure 17: Visualization results (1024 × 1024 resolution) on the MVF-HQ database. The lower right corner is the input face.
Figure 18: Visualization results on the MultiPIE database. The first column is the input, and the remaining columns are the synthesized results.
Figure 19: Visual results on the MultiPIE database. For each image set, the first image is the input, the second image is the synthesized result, and the last image is the ground truth.