GridFace: Face Rectification via Learning Local Homography Transformations

08/19/2018
by   Erjin Zhou, et al.
Megvii Technology Limited

In this paper, we propose a method, called GridFace, to reduce facial geometric variations and improve the recognition performance. Our method rectifies the face by local homography transformations, which are estimated by a face rectification network. To encourage the image generation with canonical views, we apply a regularization based on the natural face distribution. We learn the rectification network and recognition network in an end-to-end manner. Extensive experiments show our method greatly reduces geometric variations, and gains significant improvements in unconstrained face recognition scenarios.

1 Introduction

Despite the recent academic and commercial progress made in deep learning [34], [31], [30], [47], [41], [28], [18], [37], [20], [36], [14], [42], [43], [39], it is still hard to claim that face recognition has been solved in unconstrained settings. One of the remaining challenges for in-the-wild recognition is facial geometric variation. Variations in pose and misalignment (introduced by face detection bounding box localization) substantially degrade the face representation and recognition performance.

The commonly adopted way to deal with this issue is to use a 2D transformation that calibrates the facial landmarks to pre-defined templates (i.e., 2D mean face landmarks or a 3D mean face model). However, such pre-processing is not optimized towards the recognition system, and it relies heavily on hand-tuned parameters and accurate facial landmarks. To address this problem, recent works use the Spatial Transformer Network (STN) [15] to perform an end-to-end optimization that considers both face alignment and detection/recognition [5], [44]. However, the transformation learned in these works uses a holistic parametric model that can only capture coarse geometric information, such as facial orientation, and may introduce notable distortion in the rectified results.
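For concreteness, the conventional pre-processing can be sketched as below. This is a minimal illustration assuming OpenCV and a 5-point landmark detector; the template coordinates, crop size, and function name are hypothetical, not taken from the paper or any particular system.

```python
import cv2
import numpy as np

# Hypothetical 5-point template (x, y) for a 112x112 crop; illustrative only.
TEMPLATE_5PT = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                           [41.5, 92.4], [70.7, 92.2]])

def align_by_similarity(image, landmarks, size=(112, 112)):
    """Fit a similarity transform (rotation, scale, translation) from the
    detected 5-point landmarks to a fixed template and warp the face crop.
    This is the hand-tuned pre-processing step that learned rectification
    aims to replace."""
    landmarks = np.float32(landmarks).reshape(5, 2)
    M, _ = cv2.estimateAffinePartial2D(landmarks, TEMPLATE_5PT)
    return cv2.warpAffine(image, M, size)
```

Because the template and the landmark detector are fixed a priori, any error in either propagates directly into the recognition network, which motivates learning the warping jointly with recognition.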

Figure 2: System Overview. The system contains two modules: the rectification module and the recognition module. The rectification module extracts deep features with the rectification network and warps the image with a group of local homography transformations (Sec. 3.2). The rectified output is regularized by an implicit canonical-view face prior, which is optimized through a Denoising Autoencoder (Sec. 3.3). The red arrows on the face in the regularization box indicate the approximate gradients estimated by the DAE. With the rectified faces as input, the recognition network learns a discriminative face representation (Sec. 3.4) via metric learning. The whole system is optimized end-to-end with stochastic gradient descent.

In this paper, we propose a novel method called GridFace to reduce the facial geometric variations and boost the recognition performance. As shown in Fig. 2, our system contains two main modules: the rectification module and the recognition module.

In the rectification module, we apply a face rectification network to estimate a group of local homography transformations for rectifying the input facial image (Sec. 3.2). We approximate the underlying 3D canonical face shape by a group of deformable plane cells. When a face with geometric variations is fed in, local homography transformations are estimated to model the warping of each cell respectively. To encourage generation with canonical views, we introduce a regularization based on the canonical-view face distribution (Sec. 3.3). This natural face distribution is not explicitly modeled. Instead, inspired by previous work [27, 1], we use a Denoising Autoencoder (DAE) to estimate the gradient of the logarithm of the probability density. The recognition module (Sec. 3.4) takes the rectified image as input and learns a discriminative representation via metric learning.

In Sec. 4, we first evaluate our method with qualitative and quantitative results to demonstrate the effectiveness of face rectification for recognition in the wild. Then we present extensive ablation studies to show the importance of each of the above components. We finally evaluate our method on four challenging public benchmarks: LFW, YTF, IJB-A, and Multi-PIE. We obtain large improvements on all benchmarks, and achieve superior or comparable results compared with recent face frontalization and recognition works.

Our contributions are summarized as follows:

  1. We propose a novel method to improve face recognition performance by reducing facial geometric variations with local homography transformations.

  2. We introduce a canonical face prior and a Denoising Autoencoder based approximation method to regularize the face rectification process for better rectification quality.

  3. Extensive experiments in constrained and unconstrained environments demonstrate the excellent performance of our method.

2 Related Works

Deep Face Recognition. Early works [31], [34] learn face representations with multi-class classification networks. Features learned from thousands of individuals' faces demonstrate good generalization ability. Sun et al. [30] improve the performance by jointly learning identification and verification losses. Schroff et al. [28] formulate representation learning in a metric learning framework, and introduce the triplet loss and hard negative mining to boost the performance further. Recent works [37], [18] propose the center loss and sphere loss to further reduce intra-class variations in the feature space. Du and Ling [8] propose age-invariant features. Bhattarai et al. [3] introduce multi-task learning for large-scale face retrieval. Zhang et al. [43] develop a range loss to effectively utilize long-tailed training data. Pose-invariant representation is a key step towards robust real-world recognition systems, and has been the focus of many works. For example, Masi et al. [20] build face representations by fusing multiple pose-aware CNN models. Peng et al. [25] untangle identity and pose in the representation by reconstruction in the feature space. Lu et al. [19] propose a joint optimization framework for face and pose tasks.

Face Frontalization and Canonicalization. Prior works in face frontalization and canonicalization optimize an image warping to fit a 3D face model [45], [12] based on localized 2D facial landmarks. Recently, several attempts have been made to improve the generated face quality with neural networks. Early works [46], [47] calibrate faces of various poses into canonical views, and disentangle the pose factor from identity with convolutional neural networks. Yim et al. [41] improve the identity-preserving ability by introducing an auxiliary task that reconstructs the input data. Cole et al. [6] decompose the generation module into geometry and texture parts, training with differentiable warping.

Recent works further improve the generation quality with the Generative Adversarial Network (GAN) [9]. Tran et al. [36] propose DR-GAN to simultaneously learn frontal face generation and a discriminative representation disentangled from pose variations. Yin et al. [42] introduce a 3DMM reconstruction module in the proposed FF-GAN framework to provide a better shape and appearance prior. Huang et al. [14] incorporate both global structure and local details in their generator with landmark-located patch networks. In contrast, our method does not require the frontal and profile training pairs needed in previous work, and our rectification process is recognition oriented, which leads to better recognition performance.

Spatial Transformer Network. The Spatial Transformer Network (STN) [15] performs spatial transforms on images or feature maps with a differentiable module, which can be integrated into a deep learning pipeline and optimized end-to-end. The application of STN most relevant to our work is image alignment. Kanazawa et al. [16] match fine-grained objects by establishing correspondences between two input images with non-rigid deformations. Chen et al. [5] use STN to warp face proposals to a canonical view with detected facial landmarks. Zhong et al. [44] use STN for face alignment before recognition. Lin et al. [17] provide a theoretical connection between STN and the Lucas-Kanade algorithm, and introduce the inverse compositional STN to reduce input variations.

The recent work of Wu et al. [39] proposes a recursive spatial transformer (ReST) for alignment-free face recognition, also integrating the recognition network in an end-to-end optimization. There are two major differences between our approach and ReST. First, instead of manually dividing the face into several regions to allow non-rigid transformation modeling, we use a group of deformable plane cells to deal with complex warping effects. Second, we introduce a canonical-view face prior as regularization to achieve better rectification.

3 Approach

Notation. Let $I$ and $\hat{I}$ denote the original image and the rectified image. We define the coordinate system of the original image as the original coordinate, and that of the rectified image as the rectified coordinate. Let $p = (x, y)$ and $q = (u, v)$ denote points in the original and rectified coordinates, respectively. We use $\bar{p}$ and $\bar{q}$ to denote the homogeneous coordinates $\bar{p} = (x, y, 1)$ and $\bar{q} = (u, v, 1)$. Without loss of generality, we assume the coordinates of pixels are normalized to $[0, 1] \times [0, 1]$.

3.1 Overview

The system contains two parts: the rectification module and the recognition module (Fig. 2). In the rectification process, the rectification network, with parameters $\theta$, maps the original face image $I$ into the rectified one $\hat{I}$ by non-rigid image warping. Then the recognition network is trained with metric learning on the rectified image $\hat{I}$. We further introduce a regularization that encourages the rectified face to be in a canonical view, modeled as a prior under the distribution of natural faces with canonical views.

3.2 Face Rectification Network

In this section, we present the rectification process. Different from recent face frontalization techniques [36, 42, 14], which generate faces from abstract features, we define rectification as warping pixels from the original image to the canonical one, as illustrated in Fig. 3.


Figure 3: Local Homography Transformation. The rectification process approximates the 3D face as plane cells and canonicalizes it with local homographies. The rectified image is partitioned into cells, and the corresponding homographies are estimated by the rectification network. We put springs at the corners of the cells as soft constraints to avoid large discontinuities in the boundaries.

Formally, we define a template by partitioning the rectified image into $n \times n$ non-overlapping cells

$\{ C_{ij} \}_{i,j=0}^{n-1}, \qquad C_{ij} = [\tfrac{i}{n}, \tfrac{i+1}{n}] \times [\tfrac{j}{n}, \tfrac{j+1}{n}].$   (1)

For each cell $C_{ij}$, we compute the corresponding deformed cell in the original image by estimating a local homography $H_{ij}$.

Specifically, we formulate the homography matrix as

$H_{ij} = E + \Delta H_{ij},$   (2)

where $E$ is the $3 \times 3$ identity matrix. The rectification network takes the original image $I$ as input and predicts the residual matrices $\Delta H_{ij}$. Then the rectified image $\hat{I}$ at cell $C_{ij}$ is obtained with the homographies as

$\hat{I}(q) = I(p), \qquad \bar{p} \sim H_{ij}\,\bar{q}, \qquad q \in C_{ij},$   (3)

where $\bar{p}$ and $\bar{q}$ are the homogeneous coordinates of $p$ and $q$.

Let $V$ denote the collection of corners of all cells. Since the local homographies are estimated separately, a cell corner $q \in V$ in the rectified image is mapped to multiple points in the original image (see Fig. 3). In order to avoid large discontinuities between the boundaries of neighboring cells in $\hat{I}$, we further introduce a soft constraint, called the deformation constraint. Specifically, let $P_q$ denote the collection of $q$'s mapped coordinates in the original image, one for each cell sharing the corner $q$. A soft constraint is added to enforce conformity between every pair of points in $P_q$. We incorporate this soft constraint into the learning objective, and cast it as the deformation loss of the rectification network:

$L_{def} = \sum_{q \in V} \sum_{p, p' \in P_q} \| p - p' \|^2.$   (4)
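To make the per-cell warping concrete, the following is a minimal PyTorch sketch of Eqs. (2)-(3) under the notation above; the tensor shapes, the grid_sample-based bilinear sampling, and the function name are assumptions rather than the authors' implementation, and a recent PyTorch version is assumed for torch.meshgrid with indexing="ij". The deformation loss of Eq. (4) can be computed in the same spirit by mapping each shared cell corner through the homographies of all adjacent cells and penalizing the pairwise squared distances.

```python
import torch
import torch.nn.functional as F

def rectify_with_local_homographies(image, delta_H, n=8, out_size=128):
    """Backward-warp a rectified face from the original image with one
    homography per grid cell (Eqs. 2-3).
    image:   (B, C, H, W) original faces, pixel coordinates in [0, 1]
    delta_H: (B, n*n, 3, 3) residual matrices from the rectification network"""
    B = image.size(0)
    H = torch.eye(3, device=image.device) + delta_H                   # Eq. (2)
    # Pixel grid of the rectified image, normalized to [0, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, out_size, device=image.device),
        torch.linspace(0, 1, out_size, device=image.device), indexing="ij")
    q_bar = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).view(-1, 3)
    # Index of the cell each rectified pixel falls into.
    cell = ((ys * n).clamp(max=n - 1).long() * n
            + (xs * n).clamp(max=n - 1).long()).view(-1)
    H_pix = H[:, cell]                                   # (B, S*S, 3, 3)
    p_bar = torch.einsum("bkij,kj->bki", H_pix, q_bar)   # map q to original coords
    p = p_bar[..., :2] / p_bar[..., 2:].clamp(min=1e-6)  # dehomogenize
    grid = (2 * p - 1).view(B, out_size, out_size, 2)    # to grid_sample's [-1, 1]
    return F.grid_sample(image, grid, align_corners=False)           # Eq. (3)
```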

3.3 Regularization by Denoising Autoencoder

The regularization encourages the rectification process to generate faces in canonical views. We define it as an image prior directly based on the distribution $p(\cdot)$ of natural canonical-view faces:

$L_{reg} = -\log p(\hat{I}).$   (5)

In general, this optimization is non-trivial. We do not explicitly model the distribution; instead, we consider the gradient of $\log p(\hat{I})$ and maximize it with stochastic gradient descent:

$\dfrac{\partial \log p(\hat{I})}{\partial \theta} = \dfrac{\partial \log p(\hat{I})}{\partial \hat{I}} \cdot \dfrac{\partial \hat{I}}{\partial \theta}.$   (6)

Using results from [27], [1], which are also used in image generation [23] and restoration [29], [4], [22], we approximate the gradient of the prior as

$\dfrac{\partial \log p(\hat{I})}{\partial \hat{I}} \approx \dfrac{r(\hat{I}) - \hat{I}}{\sigma^2}.$   (7)

Here $r(\cdot)$ is the optimal denoising autoencoder trained on the true data distribution (canonical-view faces in our work) with corrupted inputs $\hat{I} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 E)$, at an infinitesimal noise level $\sigma$. Using these results, we optimize Eqn. 5 by first training a Denoising Autoencoder on the canonical-view face dataset, and then estimating the approximate gradient during backpropagation via Eqn. 7.
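As a sketch of how Eq. (7) can be injected during training, the snippet below adds a loss term whose gradient with respect to the rectified image is exactly the negated approximate gradient of the log-prior; the noise level value and the function names are assumptions, and the DAE is assumed to be pre-trained on canonical-view faces and kept frozen.

```python
import torch

def dae_prior_loss(x_rect, dae, sigma=0.05):
    """Regularization via Eq. (7): for a denoising autoencoder r(.) trained on
    canonical-view faces, grad log p(x) ~= (r(x) - x) / sigma^2.  The gradient
    of this loss w.r.t. x_rect is (x_rect - r(x_rect)) / sigma^2, i.e. the
    negated approximate prior gradient.  sigma = 0.05 is an assumed value."""
    with torch.no_grad():          # the frozen DAE only supplies the target
        target = dae(x_rect)
    return ((x_rect - target) ** 2).sum() / (2 * sigma ** 2)
```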

3.4 Face Recognition Network

Given the rectified face $\hat{I}$, we extract the face representation with a deep convolutional recognition network. Following previous work [28], we train the recognition network with the triplet loss. Let $(I^a, I^p, I^n)$ denote the three images forming a face triplet, where $I^a$ and $I^p$ are from the same person while $I^n$ is from a different person. The recognition loss is

$L_{recog} = \max\big(0,\ \alpha + D(\hat{I}^a, \hat{I}^p) - D(\hat{I}^a, \hat{I}^n)\big),$   (8)

where $D(\cdot, \cdot)$ is the Euclidean distance between the feature representations of two rectified faces. The hyper-parameter $\alpha$ controls the margin between the intra-person distance and the inter-person distance in the triplet loss.

In summary, we jointly optimize the rectification network and the recognition network by minimizing an objective consisting of a recognition term, a deformation term, and a regularization term:

$L = L_{recog} + \lambda_1 L_{def} + \lambda_2 L_{reg}.$   (9)
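A minimal sketch of the triplet loss of Eq. (8) and of how the three terms of Eq. (9) combine; the margin value and the weight names lambda_def and lambda_reg are assumptions (the paper tunes its hyper-parameters on the validation set).

```python
import torch.nn.functional as F

def triplet_loss(f_anchor, f_pos, f_neg, alpha=0.2):
    """Eq. (8) on embedding batches, using Euclidean distances and an assumed
    margin alpha = 0.2."""
    d_ap = F.pairwise_distance(f_anchor, f_pos)
    d_an = F.pairwise_distance(f_anchor, f_neg)
    return F.relu(d_ap - d_an + alpha).mean()

# Overall objective of Eq. (9), with L_def from the deformation constraint and
# L_reg from the DAE prior sketched above:
#   loss = triplet_loss(fa, fp, fn) + lambda_def * L_def + lambda_reg * L_reg
```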
Table 1: Network Details.

Rectification Network: Input; Stage-1: Conv[8, 3, 2, 1], MaxPool[2, 2, 1], Conv[32, 3, 2, 1]; Stage-2: InceptionModule[16], MaxPool[2, 2]; Stage-3: InceptionModule[32]*2, MaxPool[2, 2]; Stage-4: InceptionModule[64]*2, MaxPool[2, 2]; Stage-5: FullyConnected[128], FullyConnected[N].

Denoising Autoencoder: Conv[8, 3, 2, 1], Conv[12, 3, 2, 1], Conv[16, 3, 2, 1], Conv[24, 3, 2, 1], FullyConnected[1536], DeConv[24, 3, 2, 1], DeConv[16, 3, 2, 1], DeConv[12, 3, 2, 1], DeConv[8, 3, 2, 1], Conv[3, 3, 1, 1].

Conv[n, k, s, p] denotes a convolution layer with n output feature maps, kernel size k, stride s, and padding p. The deconvolution layer DeConv[n, k, s, p] is implemented as the gradient of convolution with respect to the data, and its parameters keep the same convolutional meaning. MaxPool[k, s] is a max-pooling layer with a k-by-k window and stride s. FullyConnected[n] is a fully-connected layer with n output neurons, and N denotes the number of transformation parameters. InceptionModule[n] denotes a modified Inception module with the same number n of feature maps in each branch.
Figure 4: Qualitative Analysis of SNFace Testset. We sample the data from the SNFace test set with pose, expression, and illumination variations, and visualize the rectified results under different rectification methods.

4 Experiments

4.1 Experimental Details

Dataset. Our models are learned from a large collection of photos in social networks, referred to as the Social Network Face Dataset (SNFace). The SNFace dataset contains about 10M images and 200K individuals. We randomly choose 2K individuals as the validation set, 2K as the test set, and use the remaining ones as the training set. The 5-point facial landmarks are detected and the face is aligned with similarity transformation.

Network Architecture. In all experiments in this paper, we use GoogLeNet [33], [28] as our recognition network. The rectification network is based on a modified Inception module, which contains fewer parameters and a simpler structure; it adds very few parameters and little computation time compared with the recognition network. The Denoising Autoencoder uses a convolutional autoencoder structure. The network details are described in Tab. 1.

Implementation Details. The original and rectified faces processed by the rectification network share the same resolution, and the pixel-level activations are normalized by dividing by 255. The Denoising Autoencoder is trained on a subset of the SNFace dataset containing 100K faces in canonical views. The end-to-end optimization is conducted after the Denoising Autoencoder is ready. In the training phase, each mini-batch contains 1024 image triplets. We set an equal learning rate of 0.1 for all trainable layers, which is shrunk by a factor of 10 once the validation error stops decreasing. The hyper-parameters $\lambda_1$, $\lambda_2$, and $\alpha$ are determined on the validation set. In all experiments, we use the same metric learning method with the triplet loss; no other data processing or training protocols are used. In the testing phase, we use the Euclidean distance as the verification metric between two face representations.

4.2 What is Learned in Face Rectification?

In this section, we study what is learned by the rectification network. All approaches are evaluated on the SNFace test set. We evaluate our model with $n = 8$ (i.e., 64 cells for the local homography transformations), referred to as Grid-8. We compare with several alternative approaches: the baseline model has no face rectification; the global model Grid-1 performs face rectification with a single global homography transformation; the model Grid-8\reg is trained without the face prior regularization.

Moreover, in order to compare with the 3D face frontalization techniques used in face recognition (e.g., the 3D alignment used in DeepFace [34]), we process the full SNFace dataset to synthesize frontal views using the recent face frontalization method of Hassner et al. [12], and compare with the model trained on this synthesized data (called baseline-3D) to verify the effectiveness of our rectification process and joint optimization.

Evaluation on SNFace Testset
Method  FAR (loosest)  FAR (middle)  FAR (strictest)
baseline 92.94 81.76 63.41
baseline-3D 94.02 80.36 58.20
Grid-1 93.49 83.94 66.15
Grid-2 94.02 85.24 68.70
Grid-4 94.38 86.23 71.09
Grid-8\reg 94.10 85.44 69.05
Grid-8 94.92 87.81 72.71
Table 2: Quantitative Results on the SNFace Testset. We compare our method Grid-8 against several other approaches and report verification accuracy on the SNFace test set.

Qualitative Analysis. Fig. 4 shows the original images and the corresponding rectified images. The global homography transformation Grid-1 can capture coarse geometric information, such as 2D rotation and translation, which has also been reported in previous works [44], [39]. However, due to its limited capacity, Grid-1 is unable to satisfactorily rectify out-of-plane rotation and local geometric details, and generates results with notable distortion (e.g., the enlarged nose in faces with large pose). Hassner et al. [12] do better, generating good frontal-view faces, but the ghosting effect (most faces under large pose in Fig. 4) and the change of facial shape (e.g., the nose of the fourth individual in Fig. 4) may introduce further noise into the recognition system. In contrast, Grid-8 captures rich facial structure details: local geometric warping is detected and approximated by local homographies. Compared with the original images and the results of the other approaches, the proposed Grid-8 greatly reduces geometric variations and produces better canonical-view results.

Quantitative Analysis. We report quantitative results under the verification protocol in Tab. 2. Grid-8 achieves the best performance, outperforming the baseline by a large margin (from 63.41 to 72.71 at the strictest False Alarm Rate, FAR). The global transformation Grid-1 consistently improves recognition performance over the baseline; but, as seen in the visualization results, it is limited by its transformation capacity and thus introduces notable distortion for recognition.

The recognition model trained on the synthesized frontal-view data, baseline-3D, obtains high performance at the loosest FAR (94.02, better than the baseline and Grid-1 trained on the original data). However, its performance drops dramatically at stricter operating points and finally falls below the baseline at the strictest FAR (58.20 vs. 63.41). In contrast, our method Grid-8 consistently outperforms baseline-3D, with a 14.51-point gain at the strictest FAR.

Figure 5: Synthetic 2D Transformations. Visualization of the image and perturbed samples in the synthetic 2D transformation experiment. (a) Original image, where the colored boxes correspond to different noise levels (red for small, green for middle, and blue for large). (b) Cropped faces with noisy landmarks. (c) Rectified faces produced by our method Grid-8. Most of the scale, rotation, and translation variations are reduced.

Figure 6: Quantitative Results under Synthetic 2D Transformations. Verification accuracy of our model Grid-8 and the baseline at a fixed FAR under 2D transformations with increasing noise levels (left to right):
baseline 92.94 91.66 86.58 74.95
Grid-8 94.92 93.51 90.35 85.00
Figure 7: Ablation Studies. (a) i. Rectification without regularization. ii. Rectification with regularization. (b) Rectification with different number of cells. (c) i. Rectification without recognition supervision. ii. Joint learning of rectification and recognition.

Evaluation on Synthetic 2D Transformations. We investigate the effectiveness of face rectification for reducing 2D in-plane transformations, which are typically introduced by inaccurate facial landmarks. The perturbed data are generated by performing face alignment with noisy landmarks, synthesized by adding i.i.d. Gaussian noise to the detected landmarks. The Gaussian noise mimics inaccurate facial landmarks in a real system, and introduces scale, rotation, and translation variations into the face alignment. We normalize the face size by the interocular distance, and generate perturbed data at several noise levels.
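A minimal sketch of this perturbation protocol, assuming a 5-point landmark layout with the two eye centers first; the concrete noise levels and the exact normalization used in the paper are not reproduced, and the function name is hypothetical. The noisy landmarks are then passed to a standard similarity-transform alignment (such as the one sketched in the introduction) to produce the perturbed crops.

```python
import numpy as np

def perturb_landmarks(landmarks, noise_level, rng=None):
    """Add i.i.d. Gaussian noise to 5-point landmarks, with the standard
    deviation expressed as a fraction of the interocular distance."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(landmarks, dtype=np.float64).reshape(5, 2)
    iod = np.linalg.norm(pts[1] - pts[0])   # assumes pts[0], pts[1] are the eyes
    return pts + rng.normal(scale=noise_level * iod, size=pts.shape)
```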

Fig. 5 presents the visualization of synthetic data at small (red boxes), middle (green boxes), and large (blue boxes) noise levels. As shown in Fig. 5(c), the rectification network generates canonical-view faces that greatly reduce the in-plane variance. Fig. 6 reports the quantitative comparison between the baseline and Grid-8. The baseline suffers from large in-plane variations and its accuracy drops rapidly, while the rectification network Grid-8 yields much better performance even under large geometric variations.

Effectiveness of Regularization. We further explore the effectiveness of the regularization. Visualization results of rectified faces are shown in Fig. 7(a). The first two rows present the rectification trained without regularization, and the last two rows show the results with regularization. We observe that the regularization helps the rectification process generate more canonical-view faces, and reduces the cropping variations in the rectified results. Quantitative results are reported in Tab. 2: compared with Grid-8\reg, the regularization brings improvements of 2.37 and 3.66 points at the two stricter FAR operating points.

Figure 8: Evaluation in Challenging Situations. (a) Qualitative results under large pose and occlusion. (b) Comparison of the standard deviation of facial landmarks under different pose variations.

Number of Partition Cells. We investigate the influence of the number of partition cells in the rectification network. Visualized results for n = 1, 2, 4, and 8 are presented in Fig. 7(b), and the quantitative results on the SNFace test set are shown in Tab. 2. As the number of cells increases, the distortion introduced into the rectified face decreases and the verification performance increases, benefiting from the local homography transformations.

Necessity of Joint Learning. To evaluate the contribution of jointly learning face rectification and recognition, we run an ablation experiment that learns each part sequentially: the model first learns face rectification without recognition supervision, and then trains the recognition module with the rectification module fixed. Fig. 7(c) provides the qualitative results. The consequences of the lack of recognition supervision are obvious and irreversible: the noisy gradient provided by the Denoising Autoencoder introduces many artifacts, and the misaligned objective further removes facial details (e.g., closing the mouth of the first and second individuals). In contrast, joint learning of rectification and recognition greatly reduces artifacts in the rectification results and keeps most of the facial details. The recognition accuracy of this sequential model is far below both the joint learning model and even the original baseline.

Evaluation in Challenging Situations. Fig. 8(a) presents the rectification results under challenging situations such as large pose and occlusion by sunglasses. The rectification process does not hallucinate the missing parts; it reduces geometric variations and aligns the visible parts. We further evaluate the variation of facial landmarks on the Multi-PIE dataset [10]: four facial landmarks on the right side of the face are considered and their standard deviations are computed. Fig. 8(b) shows the landmark variations in the original and rectified faces under different face poses. The variation of each landmark across poses is much smaller in the rectified faces than in the original ones, which suggests that our rectification process is robust to pose variation and significantly reduces facial geometric variations.

4.3 Evaluation on Public Benchmarks

To verify the cross-data generalization of the learned models, we report our performance on four challenging public benchmarks, which cover large pose, expression, and illumination variations. We also report models trained on the public dataset MS-Celeb-1M [11], referred to as baseline-Pub and Grid-8-Pub.

LFW and YTF. On the LFW dataset [13], we follow the standard unrestricted, labeled outside data protocol and report the mean accuracy (mAC) on the 10-fold verification set. We also follow the identification protocol proposed by Best-Rowden et al. [2], and report closed-set performance measured by the rank-1 rate and open-set performance measured by the Detection and Identification Rate (DIR) at a fixed FAR. On the YTF dataset [38], we follow the standard protocol and report the mAC on the 10-fold video verification set. We perform video-to-video verification by averaging the similarity scores between every pair of images.
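For reference, a minimal sketch of this video-to-video scoring; using the negative Euclidean distance as the per-pair similarity is an assumption, chosen to match the Euclidean verification metric of Sec. 4.1.

```python
import numpy as np

def video_to_video_score(feats_a, feats_b):
    """Average the pairwise similarity over all frame pairs of two videos.
    feats_a: (Na, D) frame features; feats_b: (Nb, D) frame features."""
    d = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    return -d.mean()
```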

Figure 9: Qualitative Analysis on Public Benchmarks. Left Top: LFW; Left Bottom: YTF; Right Top: IJB-A; Right Bottom: Multi-PIE.

Results on LFW and YTF. Tab. 3 shows our results. On the LFW verification benchmark, our method consistently improves performance, from 99.05 to 99.68 with the MS-Celeb training set and from 99.15 to 99.70 with the SNFace training set. Our results are comparable with FaceNet [28] (99.64) despite using considerably less training data (10M training faces vs. 200M faces). Under the LFW identification protocol, our method boosts the baseline significantly (e.g., from 91.7 to 96.7 rank-1 in the closed-set protocol and from 80.3 to 94.1 DIR in the open-set protocol with the SNFace training set), and achieves the state of the art. On the YTF benchmark, our methods Grid-8 (95.6) and Grid-8-Pub (95.2) also provide consistent improvements over baseline (94.0) and baseline-Pub (93.4). Fig. 9 shows rectification results on LFW and YTF.

Multi-PIE. The Multi-PIE dataset [10] contains 754,200 images from 337 subjects, covering large variations in pose and illumination. We follow the protocol from [41], where the last 137 subjects with 13 poses, 20 illuminations and neutral expression are selected for testing. For each subject, we randomly choose one image with frontal pose and neutral illumination as the gallery, and leave all the rest as probe images.

Results on Multi-PIE. Tab. 4 shows our results. Our method outperforms the baselines by a large margin, improving the identification rate at the largest yaw angle from 44.3 (baseline-Pub) to 62.0 (Grid-8-Pub) and from 65.5 (baseline) to 75.4 (Grid-8). We achieve the best performance and outperform the recent GAN-based face frontalization methods [36, 14, 42]. Moreover, we do not observe performance degradation on frontal faces, which indicates that our method introduces few artifacts in frontal faces and gains consistent improvements across pose variations. Fig. 9 shows qualitative results under different pose variations; the rectification process is robust to pose changes and reduces the geometric variations of the visible parts.

IJB-A. The IJB-A dataset is a challenging benchmark due to its unconstrained setting. It defines set-to-set recognition as face template matching. In our evaluation, we do not use complicated strategies, and perform set-to-set recognition via media pooling, following the previous method [26]. Specifically, the template feature is extracted by first averaging all image features within each media ID, and then averaging across media.
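A minimal sketch of this media pooling, with hypothetical function and argument names:

```python
from collections import defaultdict
import numpy as np

def template_feature(features, media_ids):
    """Average image features within each media ID first, then average the
    per-media means to obtain the template feature.
    features: (N, D) array; media_ids: length-N sequence of media identifiers."""
    by_media = defaultdict(list)
    for feat, mid in zip(features, media_ids):
        by_media[mid].append(feat)
    media_means = [np.mean(v, axis=0) for v in by_media.values()]
    return np.mean(media_means, axis=0)
```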

Results on IJB-A. Tab. 5 and Fig. 9 report our results on IJB-A. It is worth pointing out that we employ strong baselines for both verification (FAR at 0.001) and rank-1 identification. By adding our rectification process, Grid-8-Pub and Grid-8 outperform these strong baselines by a large margin, improving the verification accuracy at FAR=0.001 and substantially reducing the error rate on rank-1 identification. It is noteworthy that multiple-frame aggregation methods [40], [18] can achieve better performance in set-to-set recognition scenarios (e.g., IJB-A and YTF); these techniques could also be applied to our method and are left for future work.

Table 3: Evaluation on LFW and YTF
Method LFW mAC LFW Rank-1 LFW DIR@ YTF mAC
DeepFace [34] 97.35 64.9 44.5 91.4
VGGFace [24] 99.13 - - 97.4
FaceNet [28] 99.64 - - 95.1
DeepID2+ [32] 99.47 95.0 80.7 93.2
WST Fusion [35] 98.37 82.5 61.9 -
SphereFace [18] 99.42 - - 95.0
RangeLoss [43] 99.52 - - 93.7
HiReST-9+ [39] 99.03 93.4 80.9 95.4
Baseline-Pub 99.05 88.9 78.8 93.4
Grid-8-Pub 99.68 96.4 93.1 95.2
Baseline 99.15 91.7 80.3 94.0
Grid-8 99.70 96.7 94.1 95.6
Table 4: Evaluation on Multi-PIE
Method 0° 15° 30° 45° 60° 75° 90°
Yim et al. [41] 99.5 95.0 88.5 79.9 61.9 - -
DRGAN [36] 97.0 94.0 90.1 86.2 83.2 - -
TPGAN [14] - 98.7 98.1 95.4 87.7 77.4 64.6
FF-GAN [42] 95.7 94.6 92.5 89.7 85.2 77.2 61.2
Baseline-Pub 100.0 100.0 100.0 98.9 92.9 78.4 44.3
Grid-8-Pub 100.0 100.0 100.0 99.3 96.1 86.7 62.0
Baseline 100.0 100.0 100.0 100.0 98.7 92.6 65.5
Grid-8 100.0 100.0 100.0 100.0 99.2 94.7 75.4
Table 5: Evaluation on IJB-A
Method Verification Identification
Metric @FAR=0.01 @FAR=0.001 @Rank-1 @Rank-5
PAM [20]
Masi et al.[21]
TripEmbd. [26] -
TempAdpt. [7]
DRGAN [36]
FFGAN [42]
Baseline-Pub
Grid-8-Pub
Baseline
Grid-8

5 Conclusion

In this paper, we develop a method called GridFace to reduce facial geometric variations. We propose a novel non-rigid face rectification method based on local homography transformations, and regularize it by imposing a natural frontal-face prior through a Denoising Autoencoder. Empirical results show that our method greatly reduces geometric variations and improves recognition performance.

References

  • [1] Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research 15(1), 3563–3593 (2014)
  • [2] Best-Rowden, L., Han, H., Otto, C., Klare, B.F., Jain, A.K.: Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Transactions on Information Forensics and Security 9(12), 2144–2157 (2014)
  • [3] Bhattarai, B., Sharma, G., Jurie, F.: CP-mtML: Coupled projection multi-task metric learning for large scale face retrieval. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  • [4] Bigdeli, S.A., Jin, M., Favaro, P., Zwicker, M.: Deep mean-shift priors for image restoration. arXiv preprint arXiv:1709.03749 (2017)
  • [5] Chen, D., Hua, G., Wen, F., Sun, J.: Supervised transformer network for efficient face detection. In: European Conference on Computer Vision. pp. 122–138. Springer (2016)
  • [6] Cole, F., Belanger, D., Krishnan, D., Sarna, A., Mosseri, I., Freeman, W.T.: Synthesizing normalized faces from facial identity features. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [7] Crosswhite, N., Byrne, J., Stauffer, C., Parkhi, O., Cao, Q., Zisserman, A.: Template adaptation for face verification and identification. In: Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on. pp. 1–8. IEEE (2017)
  • [8] Du, L., Ling, H.: Cross-age face verification by coordinating with cross-face age verification. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
  • [10] Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-pie. Image and Vision Computing 28(5), 807–813 (2010)
  • [11] Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision. pp. 87–102. Springer (2016)
  • [12] Hassner, T., Harel, S., Paz, E., Enbar, R.: Effective face frontalization in unconstrained images. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [13] Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Tech. Rep. 07-49, University of Massachusetts, Amherst (October 2007)
  • [14] Huang, R., Zhang, S., Li, T., He, R.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [15] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28. pp. 2017–2025 (2015)
  • [16] Kanazawa, A., Jacobs, D.W., Chandraker, M.: Warpnet: Weakly supervised matching for single-view reconstruction. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  • [17] Lin, C.H., Lucey, S.: Inverse compositional spatial transformer networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [18] Liu, Y., Yan, J., Ouyang, W.: Quality aware network for set to set recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [19] Lu, B.L., Zheng, J., Chen, J.C., Chellappa, R.: Pose-robust face verification by exploiting competing tasks. In: Applications of Computer Vision (WACV) (June 2017)
  • [20] Masi, I., Rawls, S., Medioni, G., Natarajan, P.: Pose-aware face recognition in the wild. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  • [21] Masi, I., Tran, A.T., Hassner, T., Leksut, J.T., Medioni, G.: Do We Really Need to Collect Millions of Faces for Effective Face Recognition?, pp. 579–596 (2016)
  • [22] Meinhardt, T., Moller, M., Hazirbas, C., Cremers, D.: Learning proximal operators: Using denoising networks for regularizing inverse imaging problems. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [23] Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.: Plug & play generative networks: Conditional iterative generation of images in latent space. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [24] Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference (2015)
  • [25] Peng, X., Yu, X., Sohn, K., Metaxas, D.N., Chandraker, M.: Reconstruction-based disentanglement for pose-invariant face recognition. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [26] Sankaranarayanan, S., Alavi, A., Castillo, C.D., Chellappa, R.: Triplet probabilistic embedding for face verification and clustering. In: Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on. pp. 1–8. IEEE (2016)
  • [27] Särelä, J., Valpola, H.: Denoising source separation. Journal of machine learning research 6(Mar), 233–272 (2005)
  • [28] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [29] Sønderby, C.K., Caballero, J., Theis, L., Shi, W., Huszár, F.: Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490 (2016)
  • [30] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems 27, pp. 1988–1996 (2014)
  • [31] Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2014)
  • [32] Sun, Y., Wang, X., Tang, X.: Deeply learned face representations are sparse, selective, and robust. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [33] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [34] Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level performance in face verification. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2014)
  • [35] Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Web-scale training for face identification. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [36] Tran, L., Yin, X., Liu, X.: Disentangled representation learning GAN for pose-invariant face recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [37] Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision. pp. 499–515. Springer (2016)
  • [38] Wolf, L., Hassner, T., Maoz, I.: Face recognition in unconstrained videos with matched background similarity. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 529–534 (2011)
  • [39] Wu, W., Kan, M., Liu, X., Yang, Y., Shan, S., Chen, X.: Recursive spatial transformer (rest) for alignment-free face recognition. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [40] Yang, J., Ren, P., Zhang, D., Chen, D., Wen, F., Li, H., Hua, G.: Neural aggregation network for video face recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [41] Yim, J., Jung, H., Yoo, B., Choi, C., Park, D., Kim, J.: Rotating your face using multi-task deep neural network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [42] Yin, X., Yu, X., Sohn, K., Liu, X., Chandraker, M.: Towards large-pose face frontalization in the wild. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [43] Zhang, X., Fang, Z., Wen, Y., Li, Z., Qiao, Y.: Range loss for deep face recognition with long-tailed training data. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [44] Zhong, Y., Chen, J., Huang, B.: Toward end-to-end face recognition through alignment learning. IEEE Signal Processing Letters 24(8), 1213–1217 (Aug 2017). https://doi.org/10.1109/LSP.2017.2715076
  • [45] Zhu, X., Lei, Z., Yan, J., Yi, D., Li, S.Z.: High-fidelity pose and expression normalization for face recognition in the wild. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [46] Zhu, Z., Luo, P., Wang, X., Tang, X.: Deep learning identity-preserving face space. In: The IEEE International Conference on Computer Vision (ICCV) (December 2013)
  • [47] Zhu, Z., Luo, P., Wang, X., Tang, X.: Multi-view perceptron: a deep model for learning face identity and view representations. In: Advances in Neural Information Processing Systems 27. pp. 217–225 (2014)