M²FPA: A Multi-Yaw Multi-Pitch High-Quality Database and Benchmark for Facial Pose Analysis

03/30/2019

by Peipei Li, et al.

Facial images in surveillance or mobile scenarios often have large view-point variations in terms of pitch and yaw angles. These jointly occurring angle variations make face recognition challenging. Current public face databases mainly consider the case of yaw variations. In this paper, a new large-scale Multi-yaw Multi-pitch high-quality database is proposed for Facial Pose Analysis (M²FPA), including face frontalization, face rotation, facial pose estimation and pose-invariant face recognition. It contains 397,544 images of 229 subjects with yaw, pitch, attribute, illumination and accessory variations. M²FPA is the most comprehensive multi-view face database for facial pose analysis. Further, we provide an effective benchmark for face frontalization and pose-invariant face recognition on M²FPA with several state-of-the-art methods, including DR-GAN, TP-GAN and CAPG-GAN. We believe that the new database and benchmark can significantly push forward the advance of facial pose analysis in real-world applications. Moreover, a simple yet effective parsing guided discriminator is introduced to capture the local consistency during GAN optimization. Extensive quantitative and qualitative results on M²FPA and Multi-PIE demonstrate the superiority of our face frontalization method. Baseline results for both face synthesis and face recognition from state-of-the-art methods demonstrate the challenge posed by this new database.


1 Introduction

With the development of deep learning, face recognition systems have achieved 99% accuracy [21, 3, 28] on some popular databases [11, 16]. However, in some real-world surveillance or mobile scenarios, the captured face images often contain extreme view-point variations, which significantly degrade face recognition performance. Recently, great progress in face synthesis [10, 12, 33] has pushed forward the development of recognition via generation. TP-GAN [12] and CAPG-GAN [10] perform face frontalization to improve recognition accuracy under large poses. DA-GAN [33] simulates profile face images to facilitate pose-invariant face recognition. However, the performance of these methods often depends on the diversity of pose variations in the training databases.

The existing face databases with pose variations can be categorized into two classes. The first class, such as LFW [11], IJB-A [17] and VGGFace2 [3], is collected from the Internet. In these databases, the pose variations follow a long-tailed distribution, that is, there are few profile faces. Moreover, obtaining accurate pose labels for these databases is difficult. The second class, including CMU PIE [24], CAS-PEAL-R1 [6] and CMU Multi-PIE [8], is captured under constrained environments across accurate poses. These databases mostly pay attention to yaw angles without considering pitch angles. However, in many surveillance or mobile scenarios, a facial image often has large yaw and pitch angles simultaneously. Face recognition across both yaw and pitch angles therefore needs to be extensively evaluated in order to ensure the robustness of recognition systems. It is thus crucial to provide researchers with a multi-yaw multi-pitch high-quality face database for facial pose analysis, including face frontalization, face rotation, facial pose estimation and pose-invariant face recognition.

In this paper, a Multi-yaw Multi-pitch high-quality database for Facial Pose Analysis (M²FPA) is proposed to address this issue. Comparisons with the existing facial pose analysis databases are summarized in Table 1. The main advantages lie in the following aspects: (1) Large scale. M²FPA includes 397,544 images of 229 subjects with 62 poses, 4 attributes and 7 illuminations. (2) Accurate and diverse poses. We design an acquisition system to simultaneously capture 62 poses, including 13 yaw angles (ranging from -90° to +90°), 5 pitch angles (ranging from -30° to +45°) and 44 yaw-pitch angles. (3) High resolution. All the images are captured by the SHL-200WS (a 2.0-megapixel CMOS camera), which yields high-resolution (1920×1080) images. (4) Accessory variations. We use several types of glasses as accessories to further increase the diversity of our database with occlusions.

Database          Yaw       Pitch     Yaw-Pitch  Attributes  Illuminations  Subjects  Images    Image Size  Size [GB]  Year
PIE [24]          9         2         2          4           21             68        41,000+   640×486     40         2003
LFW [11]          No label  No label  No label   No label    No label       5,749     13,233    250×250     0.17       2007
CAS-PEAL-R1 [6]   7         2         12         5           15             1,040     30,863    640×480     26.6       2008
Multi-PIE [8]     13        0         2          6           19             337       755,370   640×480     305        2009
IJB-A [17]        No label  No label  No label   No label    No label       500       25,809    1026×698    14.5       2015
CelebA [19]       No label  No label  No label   No label    No label       10,177    202,599   505×606     9.49       2016
CelebA-HQ [14]    No label  No label  No label   No label    No label       No label  30,000    1024×1024   27.5       2017
FFHQ [15]         No label  No label  No label   No label    No label       No label  70,000    1024×1024   89.3       2018
M²FPA (Ours)      13        5         44         4           7              229       397,544   1920×1080   421        2019
Table 1: Comparisons of existing facial pose analysis databases. Image Size is the average size across all images in a database. In Multi-PIE, some frontal images are 3072×2048 in size, but most are 640×480. IJB-A images contain large background regions.

To the best of our knowledge, M²FPA is the most comprehensive multi-view face database, covering variations in yaw, pitch, attribute, illumination and accessory. M²FPA will support researchers in developing and evaluating new algorithms for facial pose analysis, including face frontalization, face rotation, facial pose estimation and pose-invariant face recognition. Further, to provide an effective benchmark for face frontalization and pose-invariant face recognition on the M²FPA database, we implement and evaluate several state-of-the-art methods, including DR-GAN [27], TP-GAN [12] and CAPG-GAN [10].

In addition, we propose a simple yet effective parsing guided discriminator, which introduces the parsing map [18] as a flexible attention to capture local consistency during GAN optimization. First, a pre-trained face parser extracts three local masks, covering the hairstyle, the skin and the facial features (eyes, nose and mouth). Second, we treat these parsing masks as soft attention and apply them to both the synthesized frontal image and the ground truth. The resulting local features are then fed into a discriminator, called the parsing guided discriminator, to ensure the local consistency of the synthesized frontal images. In this way, we can synthesize photo-realistic frontal images with extreme yaw and pitch variations on the M²FPA and Multi-PIE databases.

The main contributions of this paper are as follows:

  • We introduce a Multi-yaw Multi-pitch high-quality database for Facial Pose Analysis (M²FPA). It contains 397,544 images of 229 subjects with yaw, pitch, attribute, illumination and accessory variations.

  • We provide a comprehensive qualitative and quantitative benchmark of several state-of-the-art methods for face frontalization and pose-invariant face recognition, including DR-GAN [27], TP-GAN [12] and CAPG-GAN [10], on the M²FPA database.

  • We propose a simple yet effective parsing guided discriminator, which introduces parsing maps as soft attention to capture the local consistency during GAN optimization. In this way, we can synthesize photo-realistic frontal images on M²FPA and Multi-PIE.

Figure 1: An example of the yaw and pitch variations in our M²FPA database. From top to bottom, the pitch angles of the 6 camera layers are +45°, +30°, +15°, 0°, -15° and -30°, respectively. The yaw pose of each image is shown in the green box. Zoom in for a better view.

2 Related Work

2.1 Databases

The existing face databases with pose variations can be categorized into two classes. Some databases, including LFW [11], IJB-A [17], VGGFace2 [3], CelebA [19] and CelebA-HQ [14], are collected from the Internet. The pose variations in these databases therefore follow a long-tailed distribution: there are many nearly frontal faces but few profile ones. In addition, it is expensive to obtain precise pose labels for these facial images, which hinders face frontalization, face rotation and facial pose estimation. Others, such as CMU PIE [24], CMU Multi-PIE [8] and CAS-PEAL-R1 [6], are captured under constrained environments with precise control of angles. Both CMU PIE and CMU Multi-PIE vary only yaw angles, ranging from -90° to +90°. CAS-PEAL-R1 contains 14 poses with pitch variations, but these are obtained by asking the subjects to look upward or downward, which leads to inaccurate pose labels. Moreover, CAS-PEAL-R1 contains accessory variations only for frontal faces. Different from the existing databases, M²FPA contains attribute, illumination and accessory variations across precise yaw and pitch angles.

2.2 Face Rotation

Face rotation is an extremely challenging ill-posed task in computer vision. In recent years, benefiting from the Generative Adversarial Network (GAN) [7], face rotation has made great progress. Current state-of-the-art face rotation algorithms can be categorized into 2D-based [27, 12, 10, 31, 23, 26] and 3D-based [30, 33, 4, 32, 2, 20] methods. Among 2D-based methods, Tran et al. [27] propose DR-GAN to disentangle pose variations from facial images. TP-GAN [12] employs a two-pathway model, with global and local generators, to synthesize photo-realistic frontal faces. Hu et al. [10] incorporate landmark heatmaps as geometric guidance to synthesize face images with arbitrary poses. PIM [31] performs face frontalization in a mutual boosting way with a dual-path generator. FaceID-GAN [23] extends the conventional two-player GAN to three players, where a classifier competes with the generator by distinguishing the identities of real and synthesized faces. Among 3D-based methods, FF-GAN [30] incorporates a 3DMM into GAN to provide shape and appearance priors. DA-GAN [33] employs a dual-agent architecture to refine 3D-simulated profile faces. UV-GAN [4] treats face rotation as a UV map completion task. 3D-PIM [32] incorporates a simulator based on a 3D Morphable Model to obtain shape and appearance priors for face frontalization. Moreover, DepthNet [20] infers plausible 3D transformations from one face pose to another to realize face frontalization.

3 The M²FPA Database

In this section, we present an overview of the M²FPA database, including how it was collected, cleaned and annotated, as well as its statistics. To the best of our knowledge, M²FPA is the first publicly available database that contains precise and multiple yaw and pitch variations. In the rest of this section, we first introduce the hardware configuration and data collection, then describe the cleaning and annotating procedure. Finally, we present the statistics of M²FPA, including the yaw and pitch poses, the types of attributes and the positions of illuminations.

3.1 Data Acquisition

We design a flexible multi-camera acquisition system to capture faces with multiple yaw and pitch angles. Figure 2 shows an overview of the acquisition system. It is built from many removable brackets, forming an approximate hemisphere with a diameter of 3 meters. As shown in Figure 3, the acquisition system contains 7 horizontal layers, where the first six (Layer1-Layer6) are camera layers and the last one is the balance layer. The interval between two adjacent layers is 15°. Layer4 has the same height as the center of the hemisphere (red circle in Figure 3), so we set the pitch angle of Layer4 to 0°. As a result, from top to bottom, the pitch angles of the remaining 5 camera layers relative to Layer4 are +45°, +30°, +15°, -15° and -30°, respectively.

Figure 2: Overview of the acquisition system. It contains a total of 7 horizontal layers; the bottom one is the balance layer and the rest are camera layers.
Figure 3: The diagram of camera positions. The left and right are the cutaways of frontal and side views, respectively.

A total of 62 SHL-200WS cameras (2.0-megapixel CMOS cameras with 12mm prime lenses) are mounted on the 6 camera layers. As shown in Figure 3, there are 5, 9, 13, 13, 13 and 9 cameras on Layers 1, 2, 3, 4, 5 and 6, respectively. On each layer, the cameras are evenly distributed from -90° to +90° in yaw. The detailed yaw and pitch angles of each camera can be found in Figure 1 and Table 2. All 62 cameras are connected to 6 computers through USB interfaces, and a master computer synchronizes these computers. We developed software to simultaneously control the 62 cameras and collect all 62 images in one shot to ensure consistency. In addition, as shown in Figure 4, 7 light sources in different directions are mounted on the acquisition system: above, front, front-above, front-below, behind, left and right. To keep the background consistent, we place brackets and white canvas behind the acquisition system, as shown in the upper left corner of Figure 2.
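For illustration only, a camera's approximate 3D position follows from its yaw and pitch by simple spherical geometry. The sketch below is ours, not part of the acquisition software; the axis conventions, the helper name `camera_position` and the 1.5 m radius (half the stated 3 m diameter) are assumptions.

```python
import math

def camera_position(yaw_deg, pitch_deg, radius_m=1.5):
    """Approximate 3D position (x, y, z) of a camera on the hemisphere.

    Hypothetical helper: assumes the subject's head sits at the origin,
    yaw is measured in the horizontal plane, and pitch is measured
    above/below the Layer4 plane.
    """
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    x = radius_m * math.cos(pitch) * math.sin(yaw)   # left/right
    y = radius_m * math.sin(pitch)                   # up/down
    z = radius_m * math.cos(pitch) * math.cos(yaw)   # toward the subject
    return x, y, z

# e.g. the topmost frontal camera (yaw 0°, pitch +45°) on Layer1
print(camera_position(0.0, 45.0))
```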

A total of 300 volunteers were recruited to create M²FPA, and all participants signed a license agreement. During the collection procedure, we fix a chair and provide a headrest to ensure that the face is positioned at the center of the hemisphere. Each participant is captured with 4 attributes: neutral, wearing glasses, smile and surprise. Figure 5 shows some examples of the attributes. In total, we capture 520,800 (300 × 62 × 4 × 7) facial images.
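The stated totals follow from the per-subject combinatorics (62 poses × 4 attributes × 7 illuminations = 1,736 images per subject). A minimal arithmetic check in plain Python:

```python
# Sanity check of the collection statistics described above: 300
# volunteers captured, 229 retained after cleaning (Section 3.2).
POSES, ATTRIBUTES, ILLUMINATIONS = 62, 4, 7
images_per_subject = POSES * ATTRIBUTES * ILLUMINATIONS   # 1,736

captured = 300 * images_per_subject    # images collected
retained = 229 * images_per_subject    # images after cleaning

assert captured == 520_800
assert retained == 397_544
print(f"captured: {captured:,}, retained: {retained:,}")
```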

Figure 4: The diagram of illumination positions. The left and right are the cutaways of frontal and side views, respectively.
Figure 5: Examples of the four attributes in M²FPA.

3.2 Data Cleaning and Annotating

After collection, we manually check all the facial images and remove those participants whose entire head is not captured by one or more cameras. In the end, we eliminate 71 participants due to missing information, and the remaining 229 participants form the final M²FPA database. Facial landmark detection is an essential preprocessing step in facial pose analysis tasks such as face rotation and pose-invariant face recognition. However, current methods [1, 25] often fail to accurately detect facial landmarks under extreme yaw and pitch angles. To ease the use of our database, we manually annotate five facial landmarks for each image in M²FPA.

3.3 The Statistics of M²FPA

Poses          Pitch +45°: Yaw 0°, ±45°, ±90°
               Pitch +30°: Yaw 0°, ±22.5°, ±45°, ±67.5°, ±90°
               Pitch +15°: Yaw 0°, ±15°, ±30°, ±45°, ±60°, ±75°, ±90°
               Pitch 0°:   Yaw 0°, ±15°, ±30°, ±45°, ±60°, ±75°, ±90°
               Pitch -15°: Yaw 0°, ±15°, ±30°, ±45°, ±60°, ±75°, ±90°
               Pitch -30°: Yaw 0°, ±22.5°, ±45°, ±67.5°, ±90°
Attributes     Happy, Normal, Wearing glasses, Surprise
Illuminations  Above, Front, Front-above, Front-below, Behind, Left, Right
Table 2: The poses, attributes and illuminations in M²FPA.

After manual cleaning, we retain 397,544 facial images of 229 subjects, covering 62 poses, 4 attributes and 7 illuminations. Table 2 presents the poses, attributes and illuminations of the M²FPA database. Compared with the existing facial pose analysis databases summarized in Table 1, the main advantages of M²FPA are four-fold:

  • Large scale. M²FPA contains 397,544 facial images of 229 subjects with 62 poses, 4 attributes and 7 illuminations. It took almost one year to establish the multi-camera acquisition system and collect this number of images.

  • Accurate and diverse poses. Our acquisition system simultaneously captures 62 poses in one shot, including 13 yaw angles (ranging from -90° to +90°), 5 pitch angles (ranging from -30° to +45°) and 44 yaw-pitch angles. To the best of our knowledge, M²FPA is the first publicly available database that contains precise and multiple yaw and pitch angles.

  • High resolution. All the images are captured by the SHL-200WS (a 2.0-megapixel CMOS camera), leading to a high resolution of 1920×1080.

  • Accessory variations. To further increase the diversity of M²FPA, we add five types of glasses as accessories: dark sunglasses, pink sunglasses, round glasses, librarian glasses and rimless glasses.

4 Approach

In this section, we introduce a parsing guided local discriminator into GAN training, as shown in Figure 6. We use the parsing map [18] as a flexible attention to capture the local consistency between real and synthesized frontal images. In this way, our method can effectively frontalize a face with yaw-pitch variations and accessory occlusions on the new M²FPA database.

Figure 6: The overall framework of our method.

4.1 Network Architecture

Given a profile facial image $x^p$ and its corresponding frontal face $x^f$, we obtain the synthesized frontal image $\hat{x}^f$ by a generator $G$:

$$\hat{x}^f = G_{\theta_G}(x^p), \tag{1}$$

where $\theta_G$ denotes the parameters of $G$. The architecture of the generator is detailed in the Supplementary Material.

As shown in Figure 6, we introduce two discriminators during GAN optimization: a global discriminator $D_g$ and a parsing guided local discriminator $D_l$. Specifically, $D_g$ aims to distinguish the real image $x^f$ from the synthesized frontal image $\hat{x}^f$ from a global view. For photo-realistic visualization, especially for faces with extreme yaw-pitch angles or accessories, it is crucial to ensure the local consistency between the synthesized frontal image and the ground truth. First, we utilize a pre-trained face parser $P$ [18] to obtain three local masks, the hairstyle mask $M_{hair}$, the skin mask $M_{skin}$ and the facial feature mask $M_{face}$, from the real frontal image $x^f$:

$$\{M_{hair}, M_{skin}, M_{face}\} = P(x^f), \tag{2}$$

where the values of the three masks range from 0 to 1. Second, we treat these masks as soft attention and apply them to the synthesized frontal image and the ground truth as follows:

$$\{F_{hair}, F_{skin}, F_{face}\} = \{M_{hair} \odot x^f,\; M_{skin} \odot x^f,\; M_{face} \odot x^f\}, \tag{3}$$
$$\{\hat{F}_{hair}, \hat{F}_{skin}, \hat{F}_{face}\} = \{M_{hair} \odot \hat{x}^f,\; M_{skin} \odot \hat{x}^f,\; M_{face} \odot \hat{x}^f\}, \tag{4}$$

where $\odot$ denotes the Hadamard product, $F_{hair}$, $F_{skin}$ and $F_{face}$ denote the hairstyle, skin and facial feature information from $x^f$, while $\hat{F}_{hair}$, $\hat{F}_{skin}$ and $\hat{F}_{face}$ are from $\hat{x}^f$. These local features are then fed into the parsing guided local discriminator $D_l$. As shown in Figure 6, three subnets encode the masked hairstyle, skin and facial feature inputs, respectively. Finally, we concatenate the three encoded feature maps and apply a binary cross-entropy loss to decide whether the input local features are real or fake. The parsing guided local discriminator thus efficiently enforces the local consistency between the synthesized frontal images and the ground truth.
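To make the data flow concrete, the following PyTorch sketch mirrors Eqs. (2)-(4) and the three-subnet design of $D_l$. The layer widths, kernel sizes and mask dictionary keys are our assumptions; the exact architecture is deferred to the Supplementary Material.

```python
import torch
import torch.nn as nn

def local_features(image, masks):
    """Apply the parsing masks (Eqs. 3-4) to an image as soft attention.

    `image` is (N, 3, H, W); `masks` is a dict of (N, 1, H, W) tensors in
    [0, 1] with keys "hair", "skin", "face" (key names are our own).
    """
    return {k: m * image for k, m in masks.items()}   # Hadamard product

class ParsingGuidedDiscriminator(nn.Module):
    """Three subnets encode the masked hair/skin/face inputs; their
    feature maps are concatenated and mapped to a real/fake probability."""
    def __init__(self, ch=32):
        super().__init__()
        def subnet():
            return nn.Sequential(
                nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.hair, self.skin, self.face = subnet(), subnet(), subnet()
        self.head = nn.Sequential(
            nn.Conv2d(6 * ch, 1, 4, stride=1, padding=1), nn.Sigmoid())

    def forward(self, feats):
        h = torch.cat([self.hair(feats["hair"]),
                       self.skin(feats["skin"]),
                       self.face(feats["face"])], dim=1)
        return self.head(h)
```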

4.2 Training Losses

Multi-Scale Pixel Loss. Following [10], we employ a multi-scale pixel loss to enhance the content consistency between the synthesized image $\hat{x}^f$ and the ground truth image $x^f$:

$$\mathcal{L}_{pixel} = \sum_{s=1}^{S} \frac{1}{C W_s H_s} \left\| \hat{x}^f_s - x^f_s \right\|_1, \tag{5}$$

where $C$ is the number of channels, $s$ indexes the image scale, $S$ is the number of scales, and $W_s$ and $H_s$ represent the width and height of the $s$-th image scale, respectively.
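A minimal PyTorch sketch of Eq. (5); the scale set below is an assumption, since the resolutions used at each scale are not stated in this copy.

```python
import torch
import torch.nn.functional as F

def multi_scale_pixel_loss(fake, real, scales=(32, 64, 128)):
    """Eq. (5): mean absolute error accumulated over several image scales.

    `fake` and `real` are (N, C, H, W) tensors; `scales` are assumed
    square resolutions.
    """
    loss = 0.0
    for s in scales:
        f = F.interpolate(fake, size=(s, s), mode="bilinear", align_corners=False)
        r = F.interpolate(real, size=(s, s), mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(f, r)   # averages over C * W_s * H_s
    return loss
```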

Global-Local Adversarial Loss. We adopt a global-local adversarial loss to synthesize photo-realistic frontal face images. Specifically, the global discriminator $D_g$ distinguishes the synthesized face image $\hat{x}^f$ from the real image $x^f$:

$$\mathcal{L}_{adv}^{g} = \mathbb{E}\left[\log D_g(x^f)\right] + \mathbb{E}\left[\log\left(1 - D_g(\hat{x}^f)\right)\right]. \tag{6}$$

The parsing guided local discriminator $D_l$ aims to make the synthesized local facial details $\hat{F}_{hair}$, $\hat{F}_{skin}$ and $\hat{F}_{face}$ close to the real $F_{hair}$, $F_{skin}$ and $F_{face}$:

$$\mathcal{L}_{adv}^{l} = \mathbb{E}\left[\log D_l(F_{hair}, F_{skin}, F_{face})\right] + \mathbb{E}\left[\log\left(1 - D_l(\hat{F}_{hair}, \hat{F}_{skin}, \hat{F}_{face})\right)\right]. \tag{7}$$
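The two adversarial terms can be sketched in binary cross-entropy form as follows. The function names, the dict interface of $D_l$ (matching the sketch in Section 4.1) and the non-saturating generator objective are our choices, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(d_global, d_local, real, fake, real_feats, fake_feats):
    """Global-local adversarial loss (Eqs. 6-7) in BCE form.

    `real_feats`/`fake_feats` are the masked local features of Eqs. (3)-(4).
    """
    ones = lambda t: torch.ones_like(t)
    zeros = lambda t: torch.zeros_like(t)

    # Discriminator objectives: real -> 1, synthesized -> 0.
    p_real, p_fake = d_global(real), d_global(fake.detach())
    d_loss = F.binary_cross_entropy(p_real, ones(p_real)) \
           + F.binary_cross_entropy(p_fake, zeros(p_fake))
    q_real = d_local(real_feats)
    q_fake = d_local({k: v.detach() for k, v in fake_feats.items()})
    d_loss = d_loss + F.binary_cross_entropy(q_real, ones(q_real)) \
                    + F.binary_cross_entropy(q_fake, zeros(q_fake))

    # Generator objective: fool both discriminators (non-saturating form).
    g_fake, g_local = d_global(fake), d_local(fake_feats)
    g_loss = F.binary_cross_entropy(g_fake, ones(g_fake)) \
           + F.binary_cross_entropy(g_local, ones(g_local))
    return d_loss, g_loss
```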

Identity Preserving Loss. An identity preserving loss is employed to constrain the identity consistency between $\hat{x}^f$ and $x^f$. We utilize a pre-trained LightCNN-29 [28] to extract the identity features of $\hat{x}^f$ and $x^f$. The identity preserving loss is defined as:

$$\mathcal{L}_{ip} = \left\| \phi_{fc}(\hat{x}^f) - \phi_{fc}(x^f) \right\|_2^2 + \left\| \phi_{pool}(\hat{x}^f) - \phi_{pool}(x^f) \right\|_F^2, \tag{8}$$

where $\phi_{fc}$ and $\phi_{pool}$ denote the fully connected layer and the last pooling layer of the pre-trained LightCNN, respectively, and $\|\cdot\|_2$ and $\|\cdot\|_F$ represent the vector 2-norm and the matrix Frobenius norm, respectively.
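A sketch of Eq. (8); the `light_cnn` wrapper that returns both the fc and the last-pooling features is an assumption about the extractor interface (the public LightCNN-29 code exposes the fc feature, so the pooling hook would have to be added).

```python
import torch

def identity_preserving_loss(light_cnn, fake, real):
    """Eq. (8): match the fc features (squared 2-norm) and the last
    pooling features (squared Frobenius norm) of fake and real faces."""
    fc_fake, pool_fake = light_cnn(fake)
    with torch.no_grad():                     # the extractor stays fixed
        fc_real, pool_real = light_cnn(real)
    loss_fc = (fc_fake - fc_real).norm(p=2, dim=1).pow(2).mean()
    loss_pool = (pool_fake - pool_real).flatten(1).norm(p=2, dim=1).pow(2).mean()
    return loss_fc + loss_pool
```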

Total Variation Regularization. We introduce a total variation regularization term [13] to remove unfavorable artifacts:

$$\mathcal{L}_{tv} = \sum_{c=1}^{C}\sum_{w=1}^{W-1}\sum_{h=1}^{H-1} \left| \hat{x}^f_{c,w+1,h} - \hat{x}^f_{c,w,h} \right| + \left| \hat{x}^f_{c,w,h+1} - \hat{x}^f_{c,w,h} \right|, \tag{9}$$

where $C$, $W$ and $H$ are the channel number, width and height of the synthesized image $\hat{x}^f$.
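Eq. (9) is the standard anisotropic total variation penalty; a compact PyTorch sketch:

```python
import torch

def total_variation(img):
    """Eq. (9): sum of absolute differences between neighbouring pixels
    of a (N, C, H, W) image, averaged over the batch."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().sum()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().sum()
    return (dh + dw) / img.shape[0]
```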

Overall Loss. Finally, the total loss is a weighted sum of the above losses. The generator and the two discriminators, i.e., the global discriminator and the parsing guided local discriminator, are trained alternately to play a min-max game. The overall loss is written as:

$$\min_{G}\max_{D_g, D_l} \; \lambda_1 \mathcal{L}_{pixel} + \lambda_2 \mathcal{L}_{adv}^{g} + \lambda_3 \mathcal{L}_{adv}^{l} + \lambda_4 \mathcal{L}_{ip} + \lambda_5 \mathcal{L}_{tv}, \tag{10}$$

where $\lambda_1, \ldots, \lambda_5$ are the trade-off parameters.
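Putting the terms together, a generator step for Eq. (10) might look as follows, reusing the loss sketches above. The weights follow Section 5.1 except `lam_tv`, whose value is not recoverable from this copy and is a placeholder.

```python
# Trade-off weights from Section 5.1; lam_tv is an assumed placeholder.
lam_pixel, lam_adv_g, lam_adv_l, lam_ip, lam_tv = 20.0, 1.0, 1.0, 0.08, 1e-4

def generator_objective(fake, real, g_adv_global, g_adv_local, light_cnn):
    """Weighted sum of the five loss terms for one generator update."""
    return (lam_pixel * multi_scale_pixel_loss(fake, real)
            + lam_adv_g * g_adv_global        # generator side of Eq. (6)
            + lam_adv_l * g_adv_local         # generator side of Eq. (7)
            + lam_ip * identity_preserving_loss(light_cnn, fake, real)
            + lam_tv * total_variation(fake))
```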

5 Experiments

We evaluate our method qualitatively and quantitatively on the proposed M²FPA database. For qualitative evaluation, we show face frontalization results for several yaw and pitch faces. For quantitative evaluation, we perform pose-invariant face recognition based on both the original and synthesized face images. We also provide three face frontalization benchmarks on M²FPA, including DR-GAN [27], TP-GAN [12] and CAPG-GAN [10]. To further demonstrate the effectiveness of the proposed method and assess the difficulty of M²FPA, we also conduct experiments on the Multi-PIE [8] database, which is widely used in facial pose analysis. In the following subsections, we begin with an introduction of the databases and settings, especially the training and testing protocols of M²FPA. Then we present the qualitative frontalization results and quantitative recognition results on M²FPA and Multi-PIE. Lastly, we conduct an ablation study to demonstrate the effect of each part of our method.

5.1 Databases and Settings

Databases. The M²FPA database contains 397,544 images of 229 subjects under 62 poses, 4 attributes and 7 illuminations. 57 of the 62 poses are used in our experiments, excluding the five poses at the +45° pitch angle. We randomly select 162 subjects as the training set, i.e., 258,552 images in total. The remaining 67 subjects form the testing set. For testing, one gallery image with frontal view, neutral attribute and above illumination is employed for each of the 67 subjects. The remaining yaw and pitch face images are treated as probes. The numbers of probe and gallery images are 105,056 and 67, respectively. We will release the original M²FPA database together with the annotated five facial landmarks and the training and testing protocols.
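The split sizes are consistent with the per-subject combinatorics (57 poses × 4 attributes × 7 illuminations = 1,596 images per subject). A small arithmetic check, assuming that all frontal-pose images of the test subjects are excluded from the probes, which the stated counts imply:

```python
# Bookkeeping for the M²FPA protocol described above (plain arithmetic).
POSES_USED, ATTRS, ILLUMS = 57, 4, 7        # +45° pitch poses excluded

train_images = 162 * POSES_USED * ATTRS * ILLUMS
probe_images = 67 * (POSES_USED - 1) * ATTRS * ILLUMS   # non-frontal poses

assert train_images == 258_552
assert probe_images == 105_056
print(train_images, probe_images, 67)       # training / probe / gallery
```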

The Multi-PIE database [8] is a popular database for evaluating face synthesis and recognition across yaw angles. Following [10], we use the Setting 2 protocol in our experiments. There are 161,460, 72,000 and 137 images in the training, probe and gallery sets, respectively.

Implementation Details. Following previous methods [27, 12, 10], we crop and align 128×128 face images on M²FPA and Multi-PIE for experimental evaluation. Besides, we also conduct experiments on 256×256 face images on the M²FPA database for high-resolution face frontalization under multiple yaw and pitch variations. A pre-trained LightCNN-29 [28] is chosen for calculating the identity preserving loss and is fixed during training. Our model is implemented in PyTorch. We choose the Adam optimizer with $\beta_1$ of 0.5 and $\beta_2$ of 0.99. The learning rate is linearly decayed after each epoch until it reaches 0. The batch size is 16 for 128×128 resolution and 8 for 256×256 resolution on a single NVIDIA TITAN Xp GPU with 12GB memory. In all experiments, we empirically set the trade-off parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ to 20, 1, 1 and 0.08, respectively, with a small weight on the total variation term.
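A minimal PyTorch sketch of this optimizer configuration. The three modules are placeholders for the generator and the two discriminators of Section 4, and `base_lr` and `epochs` are assumed values, since the initial learning rate is not recoverable from this copy.

```python
import torch
import torch.nn as nn

# Placeholder modules stand in for G, D_g and D_l from Section 4.
generator = nn.Conv2d(3, 3, 3, padding=1)
d_global = nn.Conv2d(3, 1, 4, stride=2, padding=1)
d_local = nn.Conv2d(3, 1, 4, stride=2, padding=1)
base_lr, epochs = 2e-4, 20                     # assumed values

g_opt = torch.optim.Adam(generator.parameters(), lr=base_lr, betas=(0.5, 0.99))
d_opt = torch.optim.Adam(
    list(d_global.parameters()) + list(d_local.parameters()),
    lr=base_lr, betas=(0.5, 0.99))

# Linear decay to zero over the run, stepped once per epoch.
decay = lambda epoch: max(0.0, 1.0 - epoch / epochs)
g_sched = torch.optim.lr_scheduler.LambdaLR(g_opt, lr_lambda=decay)
d_sched = torch.optim.lr_scheduler.LambdaLR(d_opt, lr_lambda=decay)
```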

5.2 Evaluation on M²FPA

5.2.1 Face Frontalization

Figure 7: Frontalized 128×128 results of our method under different poses on M²FPA. From top to bottom, the rows show four different yaw angles. For each subject, the first column is the generated frontal image, the second column is the input profile, and the last column is the ground-truth frontal image.
Figure 8: Frontalized results of different methods under extreme poses on M²FPA. For each subject, the first row shows the visualizations (256×256) of our method; from left to right: our frontalized result, the input profile and the ground truth. The second row shows the frontalized results (128×128) of the benchmark methods; from left to right: CAPG-GAN [10], TP-GAN [12], DR-GAN [27] (96×96) and the online demo.

The collected M²FPA database makes face frontalization under various yaw and pitch angles possible. Benefiting from the global-local adversary, our method can frontalize face images with large yaw and pitch variations. Synthesis results across different yaw and pitch angles are shown in Figure 7. We observe that not only the global facial structure but also the local texture details are recovered in an identity-consistent way. Notably, sunglasses under extreme poses are also well preserved. Besides, the existing databases for large-pose face frontalization are limited to yaw variations and low resolutions (e.g., 128×128). The collected M²FPA has higher quality and supports face frontalization at 256×256 resolution with multiple yaw and pitch angles. The frontalized results of our method on M²FPA are presented in Figure 8, where high-quality and photo-realistic frontal faces are obtained. More frontalized results are given in the supplementary material due to the page limitation.

In addition, we provide several benchmark face frontalization results on M²FPA, including DR-GAN [27], TP-GAN [12] and CAPG-GAN [10]. We re-implement CAPG-GAN and TP-GAN according to the original papers. For DR-GAN, we provide two results: one from a re-implemented version (https://github.com/zhangjunh/DR-GAN-by-pytorch) and the other from the online demo (http://cvlab.cse.msu.edu/cvl-demo/DR-GAN-DEMO/index.html). Figure 8 presents the comparison results. We observe that our method, CAPG-GAN and TP-GAN achieve good visualizations, while DR-GAN fails to preserve the attributes and facial structures due to its unsupervised learning procedure. However, most methods still produce unsatisfactory details, such as the hair and the face shape. These results demonstrate the difficulty of synthesizing photo-realistic frontal faces from extreme yaw and pitch angles. We therefore expect the collected M²FPA to push forward the advance of multi-yaw multi-pitch face synthesis.

5.2.2 Pose-invariant Face Recognition

Method           ±15°   ±30°   ±45°   ±60°   ±75°   ±90°
LightCNN-29 v2
Original         100    100    99.8   98.6   86.9   51.7
DR-GAN [27]      98.9   97.9   95.7   89.5   70.3   35.5
TP-GAN [12]      99.9   99.8   99.4   97.3   87.6   62.1
CAPG-GAN [10]    99.9   99.7   99.4   96.4   87.2   63.9
Ours             100    100    99.9   98.4   90.6   67.6
IR-50
Original         99.7   99.7   99.2   97.2   87.2   35.3
DR-GAN [27]      97.8   97.6   95.6   89.9   70.6   26.5
TP-GAN [12]      99.7   99.2   98.2   96.3   86.6   48.0
CAPG-GAN [10]    98.8   98.5   97.0   93.4   81.9   50.1
Ours             99.5   99.5   99.0   97.3   89.6   55.8
Table 3: Rank-1 recognition rates (%) across yaw angles at 0° pitch on M²FPA.
Method           0°     ±15°   ±30°   ±45°   ±60°   ±75°   ±90°
LightCNN-29 v2
Original         100    100    100    99.8   97.5   76.5   34.3
                 99.9   100    99.8   99.7   97.3   81.8   45.9
DR-GAN [27]      99.1   98.8   98.0   94.8   85.6   61.1   20.8
                 98.1   98.2   96.5   93.3   83.1   62.7   31.0
TP-GAN [12]      99.8   99.8   99.7   99.5   95.7   81.6   50.9
                 99.9   99.9   99.6   99.2   95.9   84.1   56.9
CAPG-GAN [10]    99.8   99.9   99.8   98.9   95.0   81.4   54.4
                 99.8   99.9   99.7   98.7   95.1   85.5   65.6
Ours             99.9   99.9   99.8   99.7   97.5   86.2   56.2
                 99.9   99.9   99.8   99.7   97.4   88.1   66.5
IR-50
Original         99.8   99.9   99.6   98.7   95.7   77.1   23.4
                 98.7   99.4   99.2   98.1   95.7   78.8   27.9
DR-GAN [27]      98.5   98.2   97.8   94.0   84.8   60.9   17.0
                 95.8   97.2   96.2   93.3   84.8   60.3   20.8
TP-GAN [12]      99.0   99.6   99.1   98.5   94.7   79.1   40.6
                 98.2   98.9   98.1   97.2   94.8   80.9   43.5
CAPG-GAN [10]    98.9   99.0   98.5   95.8   91.5   75.7   40.7
                 98.5   98.5   97.9   95.3   90.3   76.0   47.8
Ours             99.7   99.6   99.4   98.7   96.1   84.5   43.6
                 98.6   99.1   98.7   98.8   96.5   83.9   49.7
Table 4: Rank-1 recognition rates (%) across yaw angles at ±15° pitch on M²FPA; for each method, the two rows correspond to the two pitch directions.
Method           0°     ±22.5°  ±45°   ±67.5°  ±90°
LightCNN-29 v2
Original         99.7   99.2    96.5   71.6    24.5
                 98.6   98.2    93.6   69.9    22.1
DR-GAN [27]      93.8   91.5    83.4   52.0    16.9
                 91.7   90.6    79.1   46.6    16.6
TP-GAN [12]      99.7   98.8    95.8   77.2    43.4
                 98.2   97.6    93.4   75.7    38.9
CAPG-GAN [10]    98.8   98.4    94.1   79.5    48.0
                 98.9   98.3    93.8   75.3    49.3
Ours             99.7   99.1    97.7   81.9    48.2
                 98.9   98.7    95.8   82.2    49.3
IR-50
Original         99.2   98.1    94.7   73.5    17.6
                 97.1   97.3    93.0   67.2    9.0
DR-GAN [27]      92.9   92.3    83.8   56.4    13.9
                 93.0   92.0    82.1   50.3    7.5
TP-GAN [12]      98.1   97.3    94.4   76.8    34.5
                 95.7   96.1    92.2   71.6    27.5
CAPG-GAN [10]    97.1   96.2    90.5   73.1    34.5
                 95.8   95.4    89.2   67.6    33.0
Ours             98.6   97.8    96.0   79.6    36.4
                 97.2   97.4    95.1   76.7    33.1
Table 5: Rank-1 recognition rates (%) across yaw angles at ±30° pitch on M²FPA; for each method, the two rows correspond to the two pitch directions.

Face recognition accuracy is a commonly used metric to evaluate the identity-preserving ability of frontalization methods: the higher the recognition accuracy, the more identity information is preserved during synthesis. Hence, we quantitatively evaluate our method and compare it with several state-of-the-art frontalization methods on M²FPA, including DR-GAN [27], TP-GAN [12] and CAPG-GAN [10]. We employ two open-source pre-trained recognition models, LightCNN-29 v2 (https://github.com/AlfredXiangWu/LightCNN) and IR-50 (https://github.com/ZhaoJ9014/face.evoLVe.PyTorch), as feature extractors and define the distance metric as the average of the distances of the original image pair and the generated image pair. Tables 3, 4 and 5 present the Rank-1 accuracies of different methods on M²FPA under 0°, ±15° and ±30° pitch angles, respectively. When keeping the yaw angle fixed, we observe that the larger the pitch angle, the lower the obtained accuracy, which shows the great challenge posed by pitch variations. Besides, through recognition via generation, TP-GAN, CAPG-GAN and our method achieve better recognition performance than the original data under large yaw and pitch poses. We further observe that the accuracy of DR-GAN is inferior to the original data. This may be because DR-GAN is trained in an unsupervised way and there are too many pose variations in M²FPA.
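The fused metric can be sketched as follows; using cosine distance and the variable names below are our assumptions about the exact implementation.

```python
import torch
import torch.nn.functional as F

def fused_rank1(probe_feats, probe_gen_feats, gallery_feats,
                gallery_gen_feats, probe_ids, gallery_ids):
    """Rank-1 accuracy with the averaged distance described above: the
    probe-gallery distance is the mean of the cosine distances of the
    original pair and the frontalized pair."""
    def cos_dist(a, b):
        return 1.0 - F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()

    dist = 0.5 * (cos_dist(probe_feats, gallery_feats)
                  + cos_dist(probe_gen_feats, gallery_gen_feats))
    pred = gallery_ids[dist.argmin(dim=1)]       # nearest gallery identity
    return (pred == probe_ids).float().mean().item()
```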

5.3 Evaluation on Multi-PIE

In this section, we present the quantitative and qualitative evaluations on the popular Multi-PIE [8] database. Figure 9 shows the frontalized images of our method. We observe that our method achieves photo-realistic visualizations compared with other state-of-the-art methods, including CAPG-GAN [10], TP-GAN [12] and FF-GAN [30]. Table 6 further tabulates the Rank-1 performance of different methods under Setting 2 of Multi-PIE. Our method clearly outperforms its competitors, including FIP+LDA [35], MVP+LDA [36], CPF [29], DR-GAN [27], FF-GAN [30], TP-GAN [12] and CAPG-GAN [10].

Figure 9: Comparisons with different methods under two extreme poses (first two rows and last two rows) on Multi-PIE.
Method           ±15°   ±30°   ±45°   ±60°   ±75°   ±90°
FIP+LDA [35]     90.7   80.7   64.1   45.9   -      -
MVP+LDA [36]     92.8   83.7   72.9   60.1   -      -
CPF [29]         95.0   88.5   79.9   61.9   -      -
DR-GAN [27]      94.0   90.1   86.2   83.2   -      -
FF-GAN [30]      94.6   92.5   89.7   85.2   77.2   61.2
TP-GAN [12]      98.68  98.06  95.38  87.72  77.43  64.64
CAPG-GAN [10]    99.82  99.56  97.33  90.63  83.05  66.05
Ours             99.96  99.78  99.53  96.18  88.74  75.33
Table 6: Rank-1 recognition rates (%) across views under Setting 2 on Multi-PIE.

5.4 Ablation Study

We report both quantitative recognition results and qualitative visualization results of our method and its four variants for a comprehensive ablation study. Due to the page limitation, the details are given in the supplementary material (Section 7.2).

6 Conclusion

This paper has introduced a new large-scale Multi-yaw Multi-pitch high-quality database for Facial Pose Analysis (M²FPA), supporting face frontalization, face rotation, facial pose estimation and pose-invariant face recognition. To the best of our knowledge, M²FPA is the most comprehensive multi-view face database that covers variations in yaw, pitch, attribute, illumination and accessory. We also provide an effective benchmark for face frontalization and pose-invariant face recognition on M²FPA. Several state-of-the-art methods, such as DR-GAN, TP-GAN and CAPG-GAN, are implemented and evaluated. Moreover, we propose a simple yet effective parsing guided local discriminator to capture the local consistency during GAN optimization. In this way, we can synthesize photo-realistic frontal images with extreme yaw and pitch variations on Multi-PIE and M²FPA. We believe that the new database and benchmark can significantly push forward the advance of facial pose analysis in the community.

References

  • [1] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017.
  • [2] J. Cao, Y. Hu, H. Zhang, R. He, and Z. Sun. Learning a high fidelity pose invariant model for high-resolution face frontalization. In NeurIPS, 2018.
  • [3] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. FG, 2018.
  • [4] J. Deng, S. Cheng, N. Xue, Y. Zhou, and S. Zafeiriou. Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition. In CVPR, 2018.
  • [5] C. Ferrari, G. Lisanti, S. Berretti, and A. Del Bimbo. Effective 3d based frontalization for unconstrained face recognition. In ICPR, 2016.
  • [6] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, and D. Zhao. The cas-peal large-scale chinese face database and baseline evaluations. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 38(1):149–161, 2008.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [8] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010.
  • [9] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, 2015.
  • [10] Y. Hu, X. Wu, B. Yu, R. He, and Z. Sun. Pose-guided photorealistic face rotation. In CVPR, 2018.
  • [11] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
  • [12] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV, 2017.
  • [13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [14] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [15] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
  • [16] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. CVPR, 2016.
  • [17] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In CVPR, 2015.
  • [18] S. Liu, J. Yang, C. Huang, and M.-H. Yang. Multi-objective convolutional learning for face labeling. In CVPR, 2015.
  • [19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [20] J. R. A. Moniz, C. Beckham, S. Rajotte, S. Honari, and C. Pal. Unsupervised depth estimation, 3d face rotation and replacement. In NeurIPS, 2018.
  • [21] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
  • [22] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In AVSS, 2009.
  • [23] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang. Faceid-gan: Learning a symmetry three-player gan for identity-preserving face synthesis. In CVPR, 2018.
  • [24] T. Sim, S. Baker, and M. Bsat. The cmu pose, illumination, and expression (pie) database. In FG. IEEE, 2002.
  • [25] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, 2013.
  • [26] Y. Tian, X. Peng, L. Zhao, S. Zhang, and D. N. Metaxas. Cr-gan: learning complete representations for multi-view generation. arXiv preprint arXiv:1806.11191, 2018.
  • [27] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, 2017.
  • [28] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [29] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. In CVPR, 2015.
  • [30] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In ICCV, 2017.
  • [31] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, et al. Towards pose invariant face recognition in the wild. In CVPR, 2018.
  • [32] J. Zhao, L. Xiong, Y. Cheng, Y. Cheng, J. Li, L. Zhou, Y. Xu, J. Karlekar, S. Pranata, S. Shen, et al. 3d-aided deep pose-invariant face recognition. In IJCAI, 2018.
  • [33] J. Zhao, L. Xiong, P. K. Jayashree, J. Li, F. Zhao, Z. Wang, P. S. Pranata, P. S. Shen, S. Yan, and J. Feng. Dual-agent gans for photorealistic and identity preserving profile face synthesis. In NeurIPS, 2017.
  • [34] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
  • [35] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In ICCV, 2013.
  • [36] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NeurIPS, 2014.

7 Supplementary Material

In this supplementary material, we first introduce the network architectures of the generator and discriminators in our method (Section 7.1). We then present the ablation study in Section 7.2. Additional in-the-wild experiments on LFW and CelebA-HQ are shown in Sections 7.3 and 7.4, respectively. 256×256 frontalization results for all 57 poses are given in Section 7.5. Finally, in Section 7.6, we conduct face frontalization at 512×512 resolution on the new M²FPA database, which highlights its advantages.

7.1 Network Architecture

Our generator adopts an encoder-decoder architecture. Taking 256×256 resolution as an example, the detailed structure of the generator $G$ is listed in Table 7. In the encoder, each convolution layer is followed by one residual block. The decoder has three parts: the first is a simple deconvolution structure that upsamples the fc2 features; the second contains stacked deconvolution layers for reconstruction, each followed by two residual blocks; the third involves several convolution layers that recover face images at different scales.

Table 7: Structure of the generator $G$.
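Since the layer-by-layer details of Table 7 did not survive extraction, the following PyTorch sketch only illustrates the described pattern (a convolution followed by one residual block in the encoder; deconvolutions each followed by two residual blocks in the decoder). All channel widths and depths are placeholders.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class GeneratorSketch(nn.Module):
    """Two encoder stages (conv + residual block) and two decoder stages
    (deconv + two residual blocks); channel widths are placeholders."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), ResBlock(64),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), ResBlock(128))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            ResBlock(64), ResBlock(64),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        return self.dec(self.enc(x))
```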

The detailed structures of the global discriminator $D_g$ and the parsing guided local discriminator $D_l$ are shown in Tables 8 and 9, respectively. Each block in $D_g$ and $D_l$ contains a convolution layer, an instance normalization layer and a leaky ReLU layer. The last layers in $D_g$ and $D_l$ produce probabilistic outputs through sigmoid functions. Note that we employ the same network architectures for the 128×128 experiments (in the main text) and the 512×512 experiments (in this supplementary material), except for the channel numbers.

Table 8: Structure of the global discriminator $D_g$.
Table 9: Structure of the parsing guided local discriminator $D_l$.
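A single such block could be written as below; the kernel size and the leaky ReLU slope are assumed values, since the row-level contents of Tables 8 and 9 did not survive extraction.

```python
import torch.nn as nn

def d_block(in_ch, out_ch, stride=2):
    """One discriminator block as described above: convolution, instance
    normalization, leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True))
```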

7.2 Ablation Study

In this section, we report both qualitative visualization results and quantitative recognition results for a comprehensive ablation study. Figure 10 presents visual comparisons between our method and its four incomplete variants on the new M²FPA database. Without the multi-scale pixel loss, the synthesized faces are obviously blurry. Without the identity preserving loss, much identity information is lost during face frontalization. Without the total variation regularization, there are more artifacts on the synthesized faces. Notably, without the parsing guided local adversarial loss, the structures of the facial features differ considerably from the ground truth: the eyes and mouth show deformations. These observations indicate that the parsing guided local discriminator ensures the local consistency between real and synthesized frontal images.

Method             ±15°   ±30°   ±45°   ±60°   ±75°   ±90°
LightCNN-29 v2
w/o L_pixel        99.8   99.7   99.4   97.3   86.1   63.1
w/o L_ip           99.8   99.6   99.5   97.9   88.6   67.1
w/o L_tv           99.9   99.7   99.0   96.9   86.3   56.5
w/o L_adv^l        100    100    99.7   98.4   89.3   63.5
Ours               100    100    99.9   98.4   90.6   67.6
IR-50
w/o L_pixel        99.7   99.3   98.3   94.9   82.1   44.9
w/o L_ip           99.4   99.4   98.5   96.2   87.7   52.0
w/o L_tv           99.2   99.0   98.3   95.3   83.8   43.4
w/o L_adv^l        99.7   99.3   98.3   95.7   82.4   45.9
Ours               99.5   99.5   99.0   97.3   89.6   55.8
Table 10: Model comparisons: Rank-1 recognition rates (%) at 0° pitch on M²FPA; each "w/o" row removes one loss term.

Table 10 further presents the Rank-1 performance of the different variants of our method on M²FPA. Consistent with the visual ablation study, we observe that the Rank-1 accuracy decreases whenever one loss is removed. These results indicate that each component of our method is essential for synthesizing photo-realistic frontal images.

Figure 10: Model comparisons: synthesis results of our method and its variants.

7.3 Additional Results on LFW

Method           ACC (%)   AUC (%)
Ferrari [5]      -         94.29
LFW-3D [9]       93.62     88.36
LFW-HPEN [34]    96.25     99.39
FF-GAN [30]      96.42     99.45
CAPG-GAN [10]    99.37     99.90
Ours             99.41     99.92
Table 11: Face verification accuracy (ACC) and area-under-curve (AUC) results on LFW.

Additional frontalization results and comparisons with previous methods on LFW are shown in Figures 11 and 12, respectively. Same as TP-GAN [12] and CAPG-GAN [10], our model is trained only on Multi-PIE and tested on LFW. In Figure 11, for each subject, the input image is on the left and the frontalized result is on the right. We observe that both the visual realism and the identity information are well preserved during frontalization. In addition, as shown in Figure 12, our method obtains good visualization results that are comparable to or better than those of previous methods, including LFW-3D [9], LFW-HPEN [34], TP-GAN [12] and CAPG-GAN [10]. The quantitative results on LFW are presented in Table 11.

Figure 11: Visualization results on LFW. For each subject, the left is the input and the right is the frontalized result.
Figure 12: Visualization comparisons on LFW. For each subject, from left to right are the synthesized results of LFW-3D [9], HPEN [34], TP-GAN [12], CAPG-GAN [10] and our method, followed by the input image.

7.4 Additional Results on CelebA-HQ

CelebA-HQ [14] is a recently proposed high-quality database with small pose variations for face synthesis. We conduct additional experiments on CelebA-HQ to demonstrate the effectiveness of our method under such in-the-wild settings. We observe that the images in CelebA-HQ are almost all frontal views. To take advantage of the high-quality images, following [2], we utilize a 3DMM model [22] to produce a paired profile image for each frontal image. We randomly select 3,451 images as the testing set, and the frontalization results of our method are presented in Figure 13. Note that there are no overlapping subjects between the training and testing sets.

Figure 13: High-quality frontalization results on CelebA-HQ. For each subject, the left is the input and the right is the synthesized result.

7.5 Additional 256×256 Results on M²FPA

Additional 256×256 frontalization results under the 57 poses on M²FPA are shown in Figure 14. For each subject, the top is the input with a different pose and the bottom is the synthesized result. As expected, our method can frontalize faces with sunglasses. In addition, most frontalization results preserve both the visual realism and the identity information well, even under extreme yaw and pitch poses.

Figure 14: 256×256 frontalization results of our method under the 57 poses on M²FPA. From top to bottom, the pitch angles of Layers 2-6 are +30°, +15°, 0°, -15° and -30°, respectively. From left to right, the yaw angles range from -90° to +90°. For each subject, the top is the input and the bottom is the synthesized result.

7.6 Additional 512×512 Results on M²FPA

Generating high-resolution results is important for broadening the applications of face rotation. However, the current facial pose analysis databases, collected in constrained environments, only support frontalization at 128×128 resolution. Our proposed M²FPA supports resolutions up to 512×512 and contains various yaw and pitch angles. Additional 512×512 frontalization results of our method on M²FPA are shown in Figure 15. We observe that the high-resolution results have richer textures and look more plausible. We believe that the high-resolution M²FPA can push forward the advance of facial pose analysis in mobile and surveillance applications.

Figure 15: 512×512 frontalization results of our method under extreme poses on M²FPA. For each subject, the bottom left corner shows the input image.