The human visual system is remarkably good at recognition across variations in pose, for which two theoretical accounts are preferred. The first postulates invariance based on familiarity, where separate view-specific visual representations or templates are learned [6, 26]. The second suggests that structural descriptions are learned from images that specify relations among viewpoint-invariant primitives. Analogously, pose-invariance for face recognition in computer vision also falls into two such categories.
The use of powerful deep neural networks (DNNs) has led to dramatic improvements in recognition accuracy. However, for objects such as faces, where minute discrimination is required among a large number of identities, a straightforward implementation is still ineffective when faced with factors of variation such as pose changes. Consider the feature space of VGGFace evaluated on MultiPIE, shown in Figure 1, where examples from the same identity class that differ in pose are mapped to distant regions of the feature space. One avenue to address this is to increase the pose variation in training data. For instance, DeepFace and FaceNet are each trained on millions of labelled face images. Another approach is to learn a mapping from different view-specific feature spaces to a common feature space through methods such as Canonical Correlation Analysis (CCA). Yet another direction is to ensemble over view-specific recognition modules that approximate the non-linear pose manifold with locally linear intervals [20, 12].
There are several drawbacks to the above class of approaches. First, conventional datasets, including those sourced from the Internet, have long-tailed pose distributions. Thus, it is expensive to collect and label data that provides good coverage for all subjects. Second, there are applications for recognition across pose changes where the dataset does not contain such variations, for instance, recognizing an individual in surveillance videos against a dataset of photographs from identification documents. Third, the learned feature space does not provide insights since factors of variation such as identity and pose might still be entangled. Besides the above limitations, view-specific or multiview methods require extra pose information or images under multiple poses at test time, which may not be available.
In contrast, we propose to learn a novel reconstruction-based feature representation that is invariant to pose and does not require extensive pose coverage in training data. A challenge with pose-invariant representations is that the discrimination power of the learned feature is harder to preserve, which we overcome with our holistic approach. First, inspired by prior work, Section 3.1 proposes to enhance the diversity of training data with images under various poses (along with pose labels), at no additional labeling expense, by designing a face generation network. But unlike approaches that frontalize non-frontal faces, we generate rich pose variations from frontal examples, which better preserves details and enriches rather than normalizes within-subject variations. Next, to achieve a rich feature embedding with good discrimination power, Section 3.2 presents a joint learning framework for identification, pose estimation and landmark localization. By jointly optimizing those three tasks, a rich feature embedding including both identity and non-identity information is learned. But this learned feature is still not guaranteed to be pose-invariant.
To achieve pose invariance, Section 3.3 proposes a feature reconstruction-based structure to explicitly disentangle identity and non-identity components of the learned feature. The network accepts a reference face image in frontal pose and another image under pose variation and extracts features corresponding to the rich embedding learned above. Then, it minimizes the error between two types of reconstructions in feature space. The first is self-reconstruction, where the reference sample’s identity feature is combined with its non-identity feature. The second is cross-reconstruction, where the reference sample’s non-identity feature is combined with the pose-variant sample’s identity feature. This encourages the network to regularize the pose-variant sample’s identity feature to be close to that of the reference sample. Thus, non-identity information is distilled away, leaving a disentangled identity representation for recognition at test time.
Section 5 demonstrates the significant advantages of our approach on both controlled datasets and uncontrolled ones for recognition in-the-wild, especially on large-pose cases. In particular, we achieve strong improvements over state-of-the-art methods on the 300WLP, MultiPIE, and CFP datasets. These improvements become increasingly significant as we consider performance under larger pose variations. We also present ablative studies to demonstrate the utility of each component in our framework, namely pose-variant face generation, rich feature embedding and disentanglement by feature reconstruction.
To summarize, our key contributions are:
To the best of our knowledge, we are the first to propose a novel reconstruction-based feature learning that disentangles factors of variation such as identity and pose.
A comprehensively designed framework cascading rich feature embedding with the feature reconstruction, achieving pose-invariance in face recognition.
A generation approach to enrich the diversity of training data, without incurring the expense of labeling large datasets spanning pose variations.
Strong performance on both controlled and uncontrolled datasets, especially for large pose variations up to profile views.
2 Related Work
While face recognition is an extensively studied area, we provide a brief overview of works most relevant to ours.
Blanz and Vetter pioneered 3D morphable models (3DMM) for high quality face reconstruction and recently, blend shape-based techniques have achieved real-time rates. For face recognition, such techniques are introduced in DeepFace, where face frontalization is used for enhancing face recognition performance. As an independent application, specific frontalization techniques have also been proposed. Another line of work pertains to 3D face reconstruction from photo collections [29, 18, 42] or a single image [19, 50, 40], where the latter have been successfully used for face normalization prior to recognition. Most of these methods align a 3DMM with 2D face landmarks [47, 46, 25] and then conduct further refinement. In contrast, our use of 3DMM for face synthesis is geared towards enriching the diversity of training data.
Deep face recognition
Several frameworks have recently been proposed that use DNNs to achieve impressive performance [22, 32, 37, 38, 39, 43, 44]. DeepFace achieved verification rates comparable to human labeling on large test datasets, with further improvements from works such as DeepID. Collecting face images from the Internet, FaceNet trains on 200 million images from 8 million subjects. Such very deep networks reach their potential only when given a huge volume of training data. We also use DNNs, but adopt the contrasting approach of learning pose-invariant features, since large-scale datasets with pose variations are expensive to collect, or do not exist in several applications such as surveillance.
Pose-invariant face recognition
Early works use Canonical Correlation Analysis (CCA) to analyze the commonality among different pose subspaces [8, 21]. Further works consider generalization across multiple viewpoints and multiview inter and intra discriminant analysis. With the introduction of DNNs, prior works aim to transfer information from pose-variant inputs to a frontalized appearance [41, 45], which is then used for face recognition. The frontal appearance reconstruction usually relies on a large amount of training data, and the pairing across poses is too strict to be practical. Stacked progressive autoencoders (SPAE) map face appearances under larger non-frontal poses to those under smaller ones in a continuous way by setting up hidden layers. The regression-based mapping highly depends on training data and may lack generalization ability. Hierarchical-PEP employs a probabilistic elastic part (PEP) model to match facial parts from different yaw angles for unconstrained face recognition scenarios. The 3D face reconstruction method synthesizes appearance that is missing due to large viewpoint changes, which may introduce noise. Rather than compensating for the missing information caused by severe pose variations at the appearance level, we target learning a pose-invariant representation at the feature level, which preserves discrimination power through deep training.
Disentangle factors of variation
Contractive discriminative analysis learns disentangled representations in a semi-supervised framework by regularizing representations to be orthogonal to each other. Disentangling Boltzmann machine regularizes representations to be specific to each target task via manifold interaction. These methods involve a non-trivial training procedure, and the pose variation is limited to half-profile views. Inverse graphics network learns an interpretable representation by learning and decoding graphics codes, each of which encodes different factors of variation, but has been demonstrated only on a database generated from 3D CAD models. Multi-View Perceptron disentangles pose and identity factors by cross-reconstruction of images synthesized from deterministic identity neurons and random hidden neurons. But it does not account for factors such as illumination or expression that are also needed for image-level reconstruction. In contrast, we use carefully designed embeddings as reconstruction targets instead of pixel-level images, which reduces the burden of reconstructing irrelevant factors of variation.
3 Proposed Method
We propose a novel pose-invariant feature learning method for large pose face recognition. Figure 2 provides an overview of our approach. Pose-variant face generation utilizes a 3D facial model to augment the training data with faces of novel viewpoints, besides generating ground-truth pose and facial landmark annotations. Rich feature embedding is then achieved by jointly learning the identity and non-identity features using multi-source supervision. Finally, disentanglement by feature reconstruction is performed to distill the identity feature from the non-identity one for better discrimination ability and pose-invariance.
3.1 Pose-variant Face Generation
The goal is to generate a series of pose-variant faces from a near-frontal image. This choice of generation approach is deliberate, since it can avoid hallucinating missing textures due to self-occlusion, which is a common problem with former approaches [9, 5] that rotate non-frontal faces to a normalized frontal view. More importantly, enriching instead of reducing intra-subject variations provides important training examples in learning pose-invariant features.
We reconstruct the 3D shape from a near-frontal face to generate new face images. Let $\mathcal{X}$ be the set of frontal face images. A straightforward solution is to learn a nonlinear mapping that maps an image $x \in \mathcal{X}$ to the coordinates of a 3D mesh. However, it is non-trivial to do so for a large number of vertices (15k), as required for a high-fidelity reconstruction.
Instead, we employ the 3D Morphable Model (3DMM) to learn a nonlinear mapping that embeds $x$ to a low-dimensional parameter space. The 3DMM parameters $p$ control the rigid affine transformation and non-rigid deformation from a 3D mean shape $\bar{S}$ to the instance shape $S$. Please refer to Figure 2 for an illustration:
$$S(p) = s R \left( \bar{S} + \Phi_{id}\,\alpha_{id} + \Phi_{exp}\,\alpha_{exp} \right) + T,$$
where $p = \{s, R, T, \alpha_{id}, \alpha_{exp}\}$, including scale $s$, rotation $R$, translation $T$, identity coefficient $\alpha_{id}$ and expression coefficient $\alpha_{exp}$. The eigenbases $\Phi_{id}$ and $\Phi_{exp}$ are learned offline using 3D face scans to model the identity and expression subspaces, respectively.
Once the 3D shape is recovered, we rotate the near-frontal face by evenly manipulating the yaw angle across the range from the left profile to the right profile. We follow prior work in using a z-buffer to collect texture information, and render the background for high-quality recovery. The rendered face is then projected to 2D to generate new face images from novel viewpoints.
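The geometric part of this generation step can be sketched in a few lines. The following is a minimal pure-Python illustration, not the paper's implementation: the function names (`yaw_matrix`, `instance_shape`, `transform`, `render_yaws`), the three-vertex toy mesh, the single toy identity component and the orthographic projection are all assumptions made for the example, and texture collection with the z-buffer is omitted.

```python
import math


def yaw_matrix(deg):
    """Return a 3x3 rotation about the vertical (y) axis by `deg` degrees."""
    c = math.cos(math.radians(deg))
    s = math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def instance_shape(mean, basis_id, a_id, basis_exp, a_exp):
    """Non-rigid part: mean shape plus weighted identity and expression offsets.

    `mean` is a list of [x, y, z] vertices; each basis is a list with one
    entry per coefficient, holding per-vertex offsets laid out like `mean`.
    """
    shape = [list(v) for v in mean]
    for basis, coeffs in ((basis_id, a_id), (basis_exp, a_exp)):
        for component, a in zip(basis, coeffs):
            for v, dv in zip(shape, component):
                for k in range(3):
                    v[k] += a * dv[k]
    return shape


def transform(shape, scale, rot, trans):
    """Rigid part: scale * rot @ v + trans for every vertex v."""
    return [
        [scale * sum(rot[r][k] * v[k] for k in range(3)) + trans[r] for r in range(3)]
        for v in shape
    ]


def render_yaws(shape, yaws, scale=1.0, trans=(0.0, 0.0, 0.0)):
    """Rotate the shape to each yaw and project orthographically (drop z)."""
    views = {}
    for yaw in yaws:
        pts = transform(shape, scale, yaw_matrix(yaw), trans)
        views[yaw] = [(p[0], p[1]) for p in pts]
    return views


# Toy example: three vertices and one identity component (hypothetical data).
mean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
basis_id = [[[0.1, 0.0, 0.0], [0.1, 0.0, 0.0], [0.1, 0.0, 0.0]]]
shape = instance_shape(mean, basis_id, [2.0], [], [])
views = render_yaws(shape, [-90, 0, 90])
```

Each entry of `views` holds the projected 2D vertex positions for one yaw angle; the yaw used for rendering can be stored alongside the image as its pose label.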
3.2 Rich Feature Embedding
Most existing face recognition algorithms [19, 20, 32, 43] learn face representation using only identity supervision. An underlying assumption of their success is that deep networks can “implicitly” learn to suppress non-identity factors after seeing a large volume of images with identity labels [32, 39].
However, this assumption does not always hold when extensive non-identity variations exist. As shown in Figure 1 (a), the face representation and pose changes still present substantial correlations, even though this representation is learned through a very deep neural network (VGGFace) with large-scale training data (2.6M).
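This indicates that using only identity supervision might not suffice to achieve an invariant representation. Motivated by this observation, we propose to utilize multi-source supervision to learn a rich feature embedding $e^r$, which can be “explicitly” branched into an identity feature $e^i$ and a non-identity feature $e^n$, respectively. As we will show in the next section, the two features can collaborate to effectively achieve an invariant representation.
More specifically, as illustrated in Figure 3, $e^n$ can be further branched as $e^p$ and $e^l$ to represent pose and landmark cues. For our multi-source training data that are not generated, we apply the CASIA-WebFace database and provide the supervision from an off-the-shelf pose estimator. Therefore, we have:
$$e^r = f(x; \theta^r), \quad e^i = f(e^r; \theta^i), \quad e^n = f(e^r; \theta^n), \quad e^p = f(e^n; \theta^p), \quad e^l = f(e^n; \theta^l),$$
where the mapping $f(\cdot\,; \theta)$ takes its input and generates an embedding vector, and $\theta$ denotes the mapping parameters. Here, $\theta^r$ can be any off-the-shelf recognition network. $\theta^i$ and $\theta^n$ are used to bridge two embedding vectors. We jointly learn all embeddings by optimizing:
$$\min_{\theta^r, \theta^i, \theta^n, \theta^p, \theta^l} \; \sum_{x} \; \lambda_i \, \mathcal{L}_{ce} + \lambda_p \, \| e^p - y^p \|_2^2 + \lambda_l \, \| e^l - y^l \|_2^2,$$
where $\mathcal{L}_{ce}$ is the cross-entropy identification loss computed from the identity feature $e^i$ and the identity label $y^i$; $y^i$, $y^p$ and $y^l$ are identity, pose and landmark annotations; and $\lambda_i$, $\lambda_p$ and $\lambda_l$ balance the weights between the cross-entropy and $\ell_2$ losses.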
By resorting to multi-source supervision, we can learn the rich feature embedding that “explicitly” encodes both identity and non-identity cues in the identity feature $e^i$ and the non-identity feature $e^n$, respectively. The remaining challenge is to distill $e^i$ by disentangling it from $e^n$ to achieve an identity-only representation.
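The joint objective described above, a weighted sum of a cross-entropy identification term and squared-error terms for pose and landmarks, can be illustrated for a single sample as follows. This is a minimal sketch under stated assumptions: the function names and the weight arguments `w_id`, `w_pose` and `w_lmk` are hypothetical, and the default weights of 1.0 are placeholders rather than values from the paper.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def squared_l2(pred, target):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def multi_task_loss(id_logits, id_label, pose_pred, pose_gt, lmk_pred, lmk_gt,
                    w_id=1.0, w_pose=1.0, w_lmk=1.0):
    """Weighted sum of identification, pose and landmark losses for one sample."""
    return (w_id * cross_entropy(id_logits, id_label)
            + w_pose * squared_l2(pose_pred, pose_gt)
            + w_lmk * squared_l2(lmk_pred, lmk_gt))


# Toy example: two identities, a 2-d pose vector and a 2-d landmark vector.
loss = multi_task_loss([0.0, 0.0], 0, [1.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.5, 0.5])
```

In practice the three terms would be averaged over a mini-batch and back-propagated through the shared layers, so that the pose and landmark supervision shapes the non-identity branch while the identification term shapes the identity branch.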
3.3 Disentanglement by Feature Reconstruction
The identity and non-identity features above are jointly learned under different supervision. However, there is no guarantee that the identity factor has been fully disentangled from the non-identity one since there is no supervision applied on the decoupling process. This fact motivates us to propose a novel reconstruction-based framework for effective identity and non-identity disentanglement.
Recall that we have generated a series of pose-variant faces for each training subject in Section 3.1. These images share the same identity but have different viewpoints. We categorize these images into two groups according to their absolute yaw angles: near-frontal faces and non-frontal faces. The two groups are used to sample image pairs that follow a specially designed configuration: a reference image $x_1$, which is randomly selected from the near-frontal group, and a peer image $x_2$, which is randomly picked from the non-frontal group.
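The pair configuration can be sketched as a small sampling routine. This is an illustrative sketch only: `sample_pair` and its arguments are hypothetical names, and the yaw threshold separating the two groups is left as a required parameter because its value is not given in the text above.

```python
import random


def sample_pair(images, threshold_deg, rng=random):
    """Sample (reference, peer) for one subject.

    `images` is a list of (image_id, yaw_deg) tuples of the same identity.
    Images with |yaw| <= threshold_deg form the near-frontal group; the rest
    form the non-frontal group.
    """
    near_frontal = [im for im in images if abs(im[1]) <= threshold_deg]
    non_frontal = [im for im in images if abs(im[1]) > threshold_deg]
    if not near_frontal or not non_frontal:
        raise ValueError("subject needs at least one image in each group")
    return rng.choice(near_frontal), rng.choice(non_frontal)


# Toy example with a hypothetical 10-degree threshold.
subject_images = [("img_a", 0.0), ("img_b", -60.0), ("img_c", 45.0)]
reference, peer = sample_pair(subject_images, 10.0, random.Random(0))
```

Because both images come from the same subject, every sampled pair shares an identity label and differs mainly in viewpoint.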
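The next step is to obtain the identity and non-identity embeddings of two faces that have the same identity but different viewpoints. As shown in Figure 4, a pair of images $x_t$, $t \in \{1, 2\}$, with $x_1$ the near-frontal reference and $x_2$ the non-frontal peer, is fed into the network to output the corresponding rich embedding and its identity and non-identity features:
$$e^r_t = f(x_t; \theta^r), \quad e^i_t = f(e^r_t; \theta^i), \quad e^n_t = f(e^r_t; \theta^n), \quad t \in \{1, 2\}.$$
Note that $\theta$ is not indexed by $t$ as the network shares weights to process images of the same pair.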
Our goal is to eventually push the identity features of the two images, $e^i_1$ and $e^i_2$, close to each other to achieve a pose-invariant representation. A simple solution is to directly minimize the distance between the two features in the embedding subspace. However, this constraint only considers the identity branch, which might be entangled with non-identity, and completely ignores the non-identity factor, which provides strong supervision to purify the identity. Our experiments also indicate that such a hard constraint suffers from limited performance in large-pose conditions.
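To address this issue, we propose to relax the constraint under a reconstruction-based framework. More specifically, we first introduce two reconstruction tasks:
$$\hat{e}^r_{11} = g(e^i_1, e^n_1; \theta^c), \quad \hat{e}^r_{21} = g(e^i_2, e^n_1; \theta^c),$$
where $\hat{e}^r_{11}$ denotes the self reconstruction of the near-frontal rich embedding, while $\hat{e}^r_{21}$ denotes the cross reconstruction of the non-frontal rich embedding. Here, $g$ is the reconstruction mapping with parameter $\theta^c$; $e^i_1$ and $e^n_1$ are the identity and non-identity features of the near-frontal reference, and $e^i_2$ is the identity feature of the non-frontal peer.
The identity and non-identity features can be rebalanced from the rich feature embedding by minimizing the self and cross reconstruction loss under the cross-entropy constraint:
$$\min_{\theta^i, \theta^n, \theta^c} \; \sum_{(x_1, x_2)} \; \gamma_i \, \mathcal{L}_{ce} + \gamma_s \, \| \hat{e}^r_{11} - e^r_1 \|_2^2 + \gamma_c \, \| \hat{e}^r_{21} - e^r_1 \|_2^2,$$
where $\mathcal{L}_{ce}$ is the cross-entropy identification loss on the identity feature, $e^r_1$ is the rich embedding of the near-frontal reference, and $\gamma_i$, $\gamma_s$ and $\gamma_c$ weigh the different constraints. Note that compared to the objective in Section 3.2, here we only finetune $\theta^i$ and $\theta^n$ (as well as $\theta^c$) to rebalance the identity and non-identity features while keeping $\theta^r$ fixed, which is an important strategy to maintain the previously learned rich embedding.
In this objective, we regularize both self and cross reconstructions to be close to the near-frontal rich embedding $e^r_1$. Thus, the portions of $e^r$ assigned to $e^i$ and $e^n$ are dynamically rebalanced to make the non-frontal peer $e^i_2$ similar to the near-frontal reference $e^i_1$. In other words, we encourage the network to learn a normalized feature representation across pose variations, thereby disentangling pose information from identity.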
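The self and cross reconstruction terms can be illustrated with plain lists of floats. The sketch below is a toy under stated assumptions: the dictionary layout with keys `rich`, `id` and `nonid`, the function names, and the identity-function stand-in for the reconstruction mapping are all hypothetical, and the default weights are placeholders.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_losses(recon, ref, peer):
    """Return (self_loss, cross_loss) for one (reference, peer) pair.

    `ref` and `peer` are dicts with keys 'rich', 'id' and 'nonid'.
    `recon` maps a concatenated [identity, non-identity] vector to a vector
    in the rich-embedding space.
    """
    self_rec = recon(ref["id"] + ref["nonid"])    # reference identity + reference non-identity
    cross_rec = recon(peer["id"] + ref["nonid"])  # peer identity + reference non-identity
    target = ref["rich"]                          # both are pulled toward the reference
    return squared_l2(self_rec, target), squared_l2(cross_rec, target)


def total_loss(id_loss, self_loss, cross_loss, w_id=1.0, w_self=1.0, w_cross=1.0):
    """Weighted sum of the identification and the two reconstruction losses."""
    return w_id * id_loss + w_self * self_loss + w_cross * cross_loss


# Toy example: the stand-in reconstruction mapping simply returns its input.
ref = {"rich": [1.0, 0.0, 0.5], "id": [1.0, 0.0], "nonid": [0.5]}
peer = {"rich": [1.0, 1.0, 0.9], "id": [1.0, 1.0], "nonid": [0.9]}
self_loss, cross_loss = reconstruction_losses(lambda v: list(v), ref, peer)
```

In the toy example the self reconstruction is exact, while the cross reconstruction is penalized by the gap between the peer's and the reference's identity features. Minimizing the cross term therefore pulls the peer's identity feature toward the reference's, which is the relaxed form of the pairwise constraint discussed above.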
The proposed feature-level reconstruction is significantly different from former methods [32, 9] that attempt to frontalize faces at the image level. It can be directly optimized for pose invariance without suffering from artifacts that are common issues in face frontalization. Besides, our approach is an end-to-end solution that does not rely on extensive preprocessing usually required for image-level face normalization.
Our approach is also distinct from existing methods [20, 19] that synthesize pose-variant faces for data augmentation. Instead of feeding the network with a large number of augmented faces and letting it automatically learn pose-invariant or pose-specific features, we utilize the reconstruction loss to supervise the feature decoupling procedure. Moreover, factors of variation other than pose are also present in training, even though we only use pose as the driver for disentanglement. The cross-entropy loss in the reconstruction objective plays an important role in preserving the discriminative power of identity features across various factors.
4 Implementation Details
Pose-variant face generation We use pre-trained weights learned from ImageNet to initialize the network instead of training from scratch. To further improve the performance, we make two important changes: (1) we use stride-2 convolution instead of max pooling to preserve the structure information when halving the feature maps; (2) the dimension of 3DMM parameters is changed to 66-d (30 identity, 29 expression and 7 pose) instead of the 235-d used in prior work. We evenly sample new viewpoints at a fixed yaw interval from near-frontal faces to left/right profiles to cover the full range of pose variations.
Rich feature embedding The network is designed based on CASIA-net with some improvements. As illustrated in Figure 3, we change the last fully connected layer to 512-d for the rich feature embedding, which is then branched into 256-d neurons for the identity feature and 128-d neurons for the non-identity feature. To utilize multi-source supervision, the non-identity feature is further forked into 7-d neurons for the pose embedding and 136-d neurons for the landmark coordinates. Three different datasets are used to train the network: CASIA-WebFace, 300WLP and MultiPIE. We use the Adam stochastic optimizer with an initial learning rate that drops by a factor of 0.25 every 5 epochs until convergence. Note that we train the network from scratch on purpose, since a pre-trained recognition model usually has limited ability to re-encode non-identity features.
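The step schedule described above can be written as a one-line function. This sketch assumes that "drops by a factor of 0.25" means the learning rate is multiplied by 0.25 at each step; the function name `step_lr` is hypothetical, and the initial learning rate is left as an argument because its value is not given in the text above.

```python
def step_lr(initial_lr, epoch, drop=0.25, every=5):
    """Learning rate at a 0-indexed epoch under a step schedule.

    The rate is multiplied by `drop` once every `every` epochs.
    """
    return initial_lr * drop ** (epoch // every)


# Learning rates for the first 12 epochs, relative to an initial rate of 1.0.
schedule = [step_lr(1.0, epoch) for epoch in range(12)]
```

With these defaults the rate stays constant for epochs 0 to 4, is a quarter of the initial value for epochs 5 to 9, and a sixteenth from epoch 10 onward.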
Disentanglement by reconstruction Once $\theta^r$, $\theta^i$ and $\theta^n$ are learned in the rich feature embedding, we freeze $\theta^r$ and finetune $\theta^i$ and $\theta^n$ to rebalance the identity and non-identity features, as explained in Figure 4 and the reconstruction objective of Section 3.3. The reconstruction network takes the concatenation (384-d) of the identity feature $e^i$ and the non-identity feature $e^n$ and outputs the reconstructed embedding (512-d). The mapping is achieved by passing through two fully connected layers, each of which has 512-d neurons. We have tried different network configurations but get similar performance. The initial learning rate is set to 0.0001 and the hyper-parameters are determined via 5-fold cross-validation. We also find that it is important to do early stopping for effective reconstruction-based regularization. In the objectives of Sections 3.2 and 3.3, we use the cross-entropy loss to preserve the discriminative power of the identity feature. Other identity regularizations, e.g. triplet loss, can be easily applied in a plug-and-play manner.
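The dimensions of the reconstruction mapping can be checked with a small forward pass. The sketch below is pure Python and only illustrates the shapes (256-d identity plus 128-d non-identity in, 512-d out, through two 512-d fully connected layers). The ReLU between the layers, the uniform initialization and all function names are assumptions, since the activation and initialization are not specified in the text above.

```python
import random


def make_fc(n_in, n_out, rng):
    """Create a fully connected layer as (weights, biases) with small random weights."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def fc(x, layer):
    """Apply a fully connected layer: W @ x + b."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def reconstruct(e_id, e_nonid, layers):
    """Map concatenated identity and non-identity features to a rich-embedding-sized vector."""
    hidden = relu(fc(e_id + e_nonid, layers[0]))  # assumed non-linearity
    return fc(hidden, layers[1])


rng = random.Random(0)
layers = [make_fc(384, 512, rng), make_fc(512, 512, rng)]
reconstructed = reconstruct([0.1] * 256, [0.2] * 128, layers)
```

A real implementation would express the same two layers in a deep learning framework and train them jointly with the identity and non-identity branches while the shared trunk stays frozen.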
5 Experiments
We evaluate our feature learning method on three main pose-variant databases: MultiPIE, 300WLP and CFP. We also compare with two top general face recognition frameworks, VGGFace and N-pair loss face recognition, and three state-of-the-art pose-invariant face recognition methods, namely MvDA, GMA and MvDN. Further, we present an ablation study to emphasize the significance of each module that we carefully designed, and a cross-database validation that demonstrates the good generalization ability of our method.
5.1 Evaluation on MultiPIE
MultiPIE is composed of 754,200 images of 337 subjects with different factors of variation such as pose, illumination, and expression. There are 15 different head poses, of which we only use the 13 head poses with yaw angles from −90° to 90°, with a 15° difference between consecutive pose bins, in this experiment.
We split the data into train and test by subjects, of which the first 229 subjects are used for training and the remaining 108 are used for testing. This is similar to the experimental setting in prior work, but we use the entire data, including both illumination and expression variations, for training while excluding only those images taken with top-down views. Rank-1 recognition accuracy of non-frontal face images is reported. We take the non-frontal faces as query and the frontal faces (0° yaw) as gallery, while restricting the illumination condition to be neutral.
To be consistent with the prior experimental setting, we form a gallery set by randomly selecting 2 frontal face images per subject, of which there are a total of 216 images. We evaluate the recognition accuracy for all query examples, of which there are 619 images per pose. The procedure is repeated with 10 random selections of gallery sets and the mean accuracy is reported.
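The evaluation protocol above (random gallery of 2 frontal images per subject, rank-1 matching of every query, mean over 10 random galleries) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is used as the matching score because the text above does not name a metric, and the function names and toy 2-d features are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank1_accuracy(gallery, queries):
    """Fraction of queries whose most similar gallery item has the same subject id.

    `gallery` and `queries` are lists of (subject_id, feature) tuples.
    """
    correct = 0
    for subject_id, feature in queries:
        best = max(gallery, key=lambda item: cosine(feature, item[1]))
        correct += int(best[0] == subject_id)
    return correct / len(queries)


def mean_rank1(frontal_by_subject, queries, per_subject=2, trials=10, seed=0):
    """Average rank-1 accuracy over several randomly drawn gallery sets."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        gallery = [(subject_id, feature)
                   for subject_id, features in frontal_by_subject.items()
                   for feature in rng.sample(features, per_subject)]
        accuracies.append(rank1_accuracy(gallery, queries))
    return sum(accuracies) / trials


# Toy example with two subjects and 2-d features (hypothetical data).
frontal = {
    "A": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]],
    "B": [[0.0, 1.0], [0.1, 0.9], [0.05, 1.0]],
}
queries = [("A", [0.8, 0.2]), ("B", [0.3, 0.7])]
accuracy = mean_rank1(frontal, queries)
```

Reporting the accuracy per pose would amount to calling `mean_rank1` once for the queries of each yaw angle.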
Evaluation is shown in Table 1. The recognition accuracy at every interval of yaw angle is reported while averaging its symmetric counterpart with respect to the 0-yaw axis. For the two general face recognition algorithms, VGGFace and N-pair loss, we clearly observe an accuracy drop of more than 30% as the head pose approaches the most extreme yaw angles. Our method significantly reduces the drop by more than 20%. The general methods are trained on very large databases that span different poses, but our method has the additional benefit of explicitly aiming for a pose-invariant feature representation.
The pose-invariant methods GMA, MvDA and MvDN demonstrate good performance at moderate yaw angles, but again the performance starts to degrade significantly at larger yaw angles. On extreme poses, our method achieves better accuracy than the best reported. Besides the improved performance, our method has an advantage over MvDN, since it does not require pose information at test time. MvDN, on the other hand, is composed of multiple sub-networks, each of which is specific to a certain pose variation, and therefore requires additional information on head pose for recognition.
5.2 Evaluation on 300WLP
300WLP is a database in which a 3D morphable model is established and the face appearance is reconstructed with varying head poses. It consists of 122,430 images overall from 3,837 subjects. Compared to MultiPIE, the overall volume is smaller, but the number of subjects is significantly larger. For each subject, the images have uniformly distributed, continuously varying head poses, in contrast to MultiPIE’s strictly controlled head pose intervals. The lighting conditions as well as the background are almost identical. Thus, it is an ideal dataset to evaluate algorithms for pose variation.
We randomly split 500 subjects of 8014 images as testing data and the remaining 3337 subjects of 106,402 images as the training data. Among the testing data, two head pose images per subject form the gallery and the remaining 7014 images serve as the probe. Table 2 shows the comparison with two state-of-the-art general face recognition methods, i.e. VGGFace and N-pair loss face recognition. To the best of our knowledge, we are the first to apply a pose-invariant face recognition framework to this dataset. Thus, we only compare our method with the two general face recognition frameworks.
Since head poses in 300WLP vary continuously, we group the test samples into 6 pose intervals. For short annotation, we mark each interval with its end point. From Table 2, our method achieves consistently better accuracy, especially at large pose angles, which is clearly contributed by our feature reconstruction-based disentanglement.
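Grouping continuously varying poses into intervals labeled by their end points can be sketched with the standard `bisect` module. The interval edges used below (six equal 15° bins up to 90°) are a hypothetical choice made only for illustration, since the actual intervals are not recoverable from the text above; the function names are also assumptions.

```python
import bisect
from collections import defaultdict


def pose_bin(yaw_deg, edges):
    """Return the end point of the interval that contains |yaw_deg|.

    `edges` is a sorted list of interval end points; each interval includes
    its end point.
    """
    index = bisect.bisect_left(edges, abs(yaw_deg))
    if index == len(edges):
        raise ValueError("yaw angle exceeds the last interval")
    return edges[index]


def accuracy_per_bin(results, edges):
    """Average correctness per pose interval.

    `results` is a list of (yaw_deg, is_correct) tuples.
    """
    grouped = defaultdict(list)
    for yaw_deg, is_correct in results:
        grouped[pose_bin(yaw_deg, edges)].append(1.0 if is_correct else 0.0)
    return {end: sum(vals) / len(vals) for end, vals in sorted(grouped.items())}


# Hypothetical edges and toy results.
EDGES = [15, 30, 45, 60, 75, 90]
per_bin = accuracy_per_bin([(10, True), (-12, False), (80, True), (-88, True)], EDGES)
```

Taking the absolute yaw merges left and right poses into the same interval, which mirrors the symmetric averaging used for the MultiPIE results.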
5.3 Evaluation on CFP
The Celebrities in Frontal-Profile (CFP) database  focuses on extreme head pose face verification. It consists of 500 subjects, with 10 frontal images and 4 profile images for each, in a wild setting. The evaluation is conducted by averaging the performance of 10 randomly selected splits with 350 identical and 350 non-identical pairs. Our MSMT+SR finetuned on MultiPIE with N-pair loss is the model evaluated in this experiment. The reported human performance is 94.57% accuracy on the frontal-profile protocol and 96.24% on the frontal-frontal protocol, which shows the challenge of recognizing profile views.
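Averaging verification accuracy over splits can be sketched as below. This is a simplified illustration, not the official CFP protocol: a pair is declared "same identity" when its similarity score reaches a threshold, and the fixed threshold, function names and toy scores are all assumptions (in practice the threshold would typically be chosen on held-out splits).

```python
def verification_accuracy(scores, labels, threshold):
    """Fraction of pairs classified correctly.

    `scores` are similarity values; `labels` are 1 for same-identity pairs
    and 0 for different-identity pairs.
    """
    correct = sum(int((score >= threshold) == bool(label))
                  for score, label in zip(scores, labels))
    return correct / len(scores)


def mean_accuracy(splits, threshold):
    """Average verification accuracy over a list of (scores, labels) splits."""
    accuracies = [verification_accuracy(scores, labels, threshold)
                  for scores, labels in splits]
    return sum(accuracies) / len(accuracies)


# Toy example with two tiny splits (hypothetical scores).
splits = [([0.9, 0.2], [1, 0]), ([0.4, 0.6], [1, 0])]
average = mean_accuracy(splits, 0.5)
```

For the setting described above, each split would contain 350 same-identity and 350 different-identity pairs, and the mean would be taken over the 10 splits.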
Results in Table 4 suggest that our method achieves consistently better performance compared to state-of-the-art. We reach the same Frontal-Frontal accuracy as Chen et al.  while being significantly better on Frontal-Profile by 1.8%. We are slightly better than DR-GAN  on extreme pose evaluation and 0.8% better on frontal cases. DR-GAN is a recent generative method that seeks the identity preservation at the image level, which is not a direct optimization on the features. Our feature reconstruction method preserves identity even when presented with profile view faces. In particular, as opposed to prior methods, ours is the only one that obtains very high accuracy on both the evaluation protocols.
5.4 Control Experiments
We extensively evaluate recognition performance on various baselines to study the effectiveness of each module in our proposed framework. Specifically, we evaluate and compare the following models:
SS: trained on a single source (e.g., CASIA-WebFace) using softmax loss only.
SS-FT: SS fine-tuned on a target dataset (e.g., MultiPIE or 300WLP) using softmax loss only.
MSMT: trained on multiple data sources (e.g., CASIA + MultiPIE or 300WLP) using softmax loss for identity and $\ell_2$ loss for pose.
MSMT+L2: fine-tuned on MSMT models using softmax loss and Euclidean loss on pairs.
MSMT+SR: fine-tuned on MSMT models using softmax loss and Siamese reconstruction loss.
MSMT (N-pair): trained on the same multiple data sources as MSMT, using N-pair metric loss for identity and $\ell_2$ loss for pose.
MSMT+SR (N-pair): fine-tuned on MSMT (N-pair) models with N-pair loss and reconstruction loss.
The SS model serves as the weakest baseline. We observe that simultaneously training the network on multiple sources of CASIA and MultiPIE (or 300WLP) using a multi-task objective (i.e., identification loss, pose or landmark estimation loss) is more effective than single-source training followed by fine-tuning. We believe that our MSMT learning can be viewed as a form of curriculum learning, since the multiple objectives introduced by multi-source and multi-task learning are at different levels of difficulty (e.g., pose and landmark estimation or identification on MultiPIE and 300WLP are relatively easier than identification on CASIA-WebFace) and easier objectives allow the network to train faster and converge to a better solution.
As an alternative to reconstruction regularization, one may consider reducing the distance between the identity-related features of the same subject under different pose directly (MSMT+L2). Learning to reduce the distance improves the performance over the MSMT model, but is not as effective as our proposed reconstruction regularization method, especially on face images with large pose variations.
Further, we observe that employing the N-pair loss within our framework also boosts performance, which is shown by the improvements from MSMT to its N-pair variant and from MSMT+SR to its N-pair variant. We note that the N-pair MSMT baseline is not explored in prior works on pose-invariant face recognition. It provides a different way to achieve similar goals as the proposed reconstruction method. Indeed, a collateral observation from the relative performances of the softmax and N-pair MSMT models is that the softmax loss is not good at disentangling pose from identity, while metric learning excels at it. Our feature reconstruction might be seen as achieving a similar goal; thus, its improvements over the N-pair MSMT model are marginal, while those over the softmax MSMT model are large.
5.5 Cross Database Evaluation
We evaluate our models, which are trained on CASIA with MultiPIE or 300WLP, on the cross test set 300WLP or MultiPIE, respectively. Results are shown in Table 7 to validate the generalization ability. There are obvious accuracy drops on both databases. However, such performance drops are expected since there exists a large gap in the distribution between MultiPIE and 300WLP.
Interestingly, we observe significant improvements when compared to VGGFace. These are fair comparisons since neither network is trained on the training set of the target dataset. When evaluated on MultiPIE, our MSMT model trained on 300WLP and CASIA improves over VGGFace, and the model with reconstruction regularization shows a stronger improvement. Similarly, the MSMT and MSMT+SR models trained on MultiPIE and CASIA both improve over VGGFace when evaluated on the 300WLP test set. This partially confirms that our performance is not an artifact of overfitting to a specific dataset, but is generalizable across different datasets of unseen images.
6 Conclusion
In this paper, we propose a new reconstruction loss to regularize identity feature learning for face recognition. We also introduce a data synthesis strategy to enrich the diversity of pose, requiring no additional training data. The rich embedding already shows promising effects in our control experiments, which we interpret as a form of curriculum learning. The self and cross reconstruction regularization achieves successful disentanglement of identity and pose, yielding significant improvements on MultiPIE, 300WLP and CFP. Cross-database evaluation further verifies that our model generalizes well across databases. Future work will focus on closing the systematic gap among databases and further improving the generalization ability.
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
-  V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. TPAMI, 25(9):1063–1074, 2003.
-  C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: a 3D facial expression database for visual computing. TVCG, 20(3):413–425, Mar. 2014.
-  J.-C. Chen, J. Zheng, V. Patel, and R. Chellappa. Fisher vector encoded deep convolutional features for unconstrained face verification. In ICIP, 2016.
-  C. N. Duong, K. Luu, K. G. Quach, and T. D. Bui. Beyond principal components: Deep boltzmann machines for face modeling. In CVPR, 2015.
-  S. Edelman and H. H. Bülthoff. Orientation dependence in the recognition of familiar and novel views of three-dimensional objects. Vision Research, 32(12):2385–2400, 1992.
-  R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 2009.
-  D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: an overview with application to learning methods. Neural Comput., 16, 2004.
-  T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained image. In CVPR, 2015.
-  J. E. Hummel and I. Biederman. Dynamic binding in a neural network for shape recognition. Psychological Review, 99(3):480–517, 1992.
-  M. Kan, S. Shan, H. Chang, and X. Chen. Stacked progressive auto-encoders (spae) for face recognition across poses. In CVPR, 2014.
-  M. Kan, S. Shan, and X. Chen. Multi-view deep network for cross-view classification. In CVPR, 2016.
-  M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen. Multi-view discriminant analysis. In ECCV, 2012.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint: 1412.6980, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
-  T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
-  H. Li and G. Hua. Hierarchical-pep model for real-world face recognition. In CVPR, 2015.
-  S. Liang, L. Shapiro, and I. Kemelmacher-Shlizerman. Head reconstruction from internet photos. In ECCV, 2016.
-  I. Masi, A. T. Trần, T. Hassner, J. T. Leksut, and G. Medioni. Do we really need to collect millions of faces for effective face recognition? In ECCV, 2016.
-  I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In CVPR, 2016.
-  A. Nielson. Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data. IEEE Trans. on Image Processing, 11(3), 2002.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
-  P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. In AVSS, 2009.
-  X. Peng, J. Huang, Q. Hu, S. Zhang, A. Elgammal, and D. Metaxas. From circle to 3-sphere: Head pose estimation by instance parameterization. Computer Vision and Image Understanding, 136:92–102, 2015.
-  X. Peng, S. Zhang, Y. Yu, and D. N. Metaxas. Toward personalized modeling: Incremental and ensemble alignment for sequential faces in the wild. International Journal of Computer Vision, pages 1–14, 2017.
-  T. Poggio and S. Edelman. A network that learns to recognize 3-dimensional objects. Nature, 343(6255):263–266, 1990.
-  S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
-  S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV, 2012.
-  J. Roth, Y. Tong, and X. Liu. Unconstrained 3d face reconstruction. In CVPR, 2015.
-  C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCVW, 2013.
-  S. Sankaranarayanan, A. Alavi, C. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In arXiv preprint, volume 1605.05396, 2016.
-  F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
-  S. Sengupta, J.-C. Chen, C. Castillo, V. Patel, R. Chellappa, and D. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
-  A. Sharma, A. Kumar, H. D. III, and D. Jacobs. Generalized multiview analysis: A discriminative latent space. In CVPR, 2012.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv preprint, 2014.
-  K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, pages 1988–1996. 2014.
-  Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to Human-Level performance in face verification. In CVPR, 2014.
-  A. T. Tran, T. Hassner, I. Masi, and G. G. Medioni. Regressing robust and discriminative 3d morphable models with a very deep neural network. CoRR, abs/1612.04904, 2016.
-  L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, 2017.
-  X. Wang, G. Guo, M. Merler, N. C. Codella, M. Rohith, J. R. Smith, and C. Kambhamettu. Leveraging multiple cues for recognizing family photos. Image and Vision Computing, 58:61–75, 2017.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. In CoRR, 2014.
-  X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In ICCV, 2017.
-  X. Yu, J. Huang, S. Zhang, and D. N. Metaxas. Face landmark fitting via optimized part mixtures and cascaded deformable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2212 – 2226, 2015.
-  X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas. Consensus of regression for occlusion-robust facial feature localization. In ECCV, 2014.
-  X. Yu, F. Zhou, and M. Chandraker. Deep deformation network for object landmark localization. In ECCV, 2016.
-  X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Li. Face alignment across large poses: A 3d solution. In CVPR, 2016.
-  X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
-  Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In ICCV, 2013.
-  Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.
1 Summary of The Supplementary
This supplementary file includes two parts: (a) additional implementation details, presented to improve reproducibility; (b) more experimental results that validate our approach in different aspects, which were not shown in the main submission due to space limitations.
2 Additional Implementation Details
Pose-variant face generation. We designed a network to predict 3DMM parameters from a single face image. The design is mainly based on VGG16. We use the same number of convolutional layers as VGG16 but replace all max pooling layers with stride-2 convolutions. The fully connected (fc) layers also differ: we first use two fc layers, each with 1024 neurons, to connect with the convolutional modules; then, an fc layer of 30 neurons is used for identity parameters, an fc layer of 29 neurons for expression parameters, and an fc layer of 7 neurons for pose parameters. Unlike prior work that uses 199 parameters to represent the identity coefficients, we truncate the number of identity eigenvectors to 30, keeping only the leading modes of variation. This truncation leads to faster convergence and less overfitting. For texture, we only generate non-frontal faces from frontal ones, which significantly mitigates the texture hallucination issue caused by self-occlusion and guarantees high-fidelity reconstruction. We apply the Z-Buffer algorithm to prevent ambiguous pixel intensities when points at different depths project to the same image plane position.
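The Z-Buffer step mentioned above can be illustrated with a minimal sketch. It is not the implementation used in our pipeline: the function name `zbuffer_render`, the point-based (rather than triangle-based) rasterization, and the convention that a smaller depth value is closer to the camera are all assumptions made for illustration.

```python
import math

def zbuffer_render(points, width, height):
    """Rasterize projected points, keeping the nearest one per pixel.

    points: iterable of (x, y, depth, intensity), with x and y in pixel units.
    Assumption: a smaller depth value means the point is closer to the camera.
    """
    depth = [[math.inf] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for x, y, z, value in points:
        col, row = int(round(x)), int(round(y))
        if not (0 <= col < width and 0 <= row < height):
            continue  # the point projects outside the image
        if z < depth[row][col]:
            # A closer point overwrites whatever was stored at this pixel.
            depth[row][col] = z
            image[row][col] = value
    return image, depth

# Two points land on pixel (1, 1); only the closer one (depth 2.0) survives.
pts = [(1.0, 1.0, 5.0, 0.2), (1.0, 1.0, 2.0, 0.9),
       (0.0, 2.0, 1.0, 0.5), (5.0, 5.0, 0.1, 1.0)]
image, depth = zbuffer_render(pts, width=3, height=3)
```

Pixels that no point reaches stay `None`, which makes self-occluded or uncovered regions easy to detect in a later step.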
Rich feature embedding. The design of the rich embedding network is mainly based on the architecture of CASIA-net, since it is widely used in prior approaches and achieves strong performance in face recognition. During training, CASIA+MultiPIE or CASIA+300WLP is used. As shown in Figure 3 of the main submission, after the convolutional layers of CASIA-net, we use a 512-d fc layer for the rich feature embedding, which is further branched into a 256-d identity feature and a 128-d non-identity feature. The 128-d non-identity feature is further connected to a 136-d landmark prediction and a 7-d pose prediction. Note that in the face generation network, the number of pose parameters is 7 instead of 3 because we need to uniquely describe the projection from the 3D model to the 2D face shape in the image domain, which includes scale, pitch, yaw, roll, x translation, y translation, and z translation.
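The branching structure described above can be sketched with plain fully connected layers. Only the output sizes (512, 256, 128, 136, 7) come from the text; the input size `conv_dim`, the random initialization, the placement of ReLU activations, and the helper names `linear`, `run_layer` and `forward` are assumptions for illustration.

```python
import random

def linear(in_dim, out_dim, rng):
    """Create a randomly initialised fully connected layer (weights, bias)."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def run_layer(layer, x, relu=False):
    """Apply y = Wx + b, optionally followed by a ReLU."""
    weights, bias = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out

rng = random.Random(0)
conv_dim = 320  # assumption: size of the flattened convolutional feature
layers = {
    "rich": linear(conv_dim, 512, rng),      # 512-d rich embedding
    "identity": linear(512, 256, rng),       # 256-d identity branch
    "non_identity": linear(512, 128, rng),   # 128-d non-identity branch
    "landmarks": linear(128, 136, rng),      # 136-d landmark prediction
    "pose": linear(128, 7, rng),             # 7-d pose prediction
}

def forward(conv_feature):
    rich = run_layer(layers["rich"], conv_feature, relu=True)
    identity = run_layer(layers["identity"], rich)
    non_identity = run_layer(layers["non_identity"], rich, relu=True)
    return {
        "rich": rich,
        "identity": identity,
        "non_identity": non_identity,
        "landmarks": run_layer(layers["landmarks"], non_identity),
        "pose": run_layer(layers["pose"], non_identity),
    }

outputs = forward([rng.gauss(0.0, 1.0) for _ in range(conv_dim)])
shapes = {name: len(value) for name, value in outputs.items()}
```

Note that the landmark and pose heads read only from the non-identity branch, so their supervision does not act directly on the identity feature.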
Disentanglement by feature reconstruction. Once the rich embedding network is trained, we feed genuine pairs that share the same identity but differ in viewpoint into the network to obtain the corresponding rich embedding, identity and non-identity features. To disentangle the identity and pose factors, we concatenate the identity and non-identity features and pass them through two 512-d fully connected layers to output a reconstructed rich embedding of 512 neurons. Both the self and cross reconstruction losses are designed to eventually push the two identity features close to each other. At the same time, a cross-entropy loss is applied to the near-frontal identity feature to maintain the discriminative power of the learned representation. The disentanglement of identity and pose is finally achieved by the proposed feature reconstruction based metric learning.
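A minimal sketch of the self and cross reconstruction terms follows. The layer sizes come from the description above; the squared L2 distance, the pairing used for the cross term (near-frontal identity combined with the other view's non-identity feature), the weight `gamma`, and all function names are assumptions for illustration, and the cross-entropy term is omitted.

```python
import random

def make_fc(in_dim, out_dim, rng):
    """Random weight matrix for a bias-free fully connected layer."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]

def fc(weights, x, relu=False):
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out

def reconstruct(params, identity, non_identity):
    """Concatenate both features and map them to a 512-d reconstruction."""
    hidden = fc(params[0], identity + non_identity, relu=True)
    return fc(params[1], hidden)

def squared_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def reconstruction_losses(params, frontal, profile):
    """frontal, profile: dicts with 'rich', 'identity', 'non_identity' lists."""
    # Self reconstruction: both factors come from the near-frontal image.
    self_rec = reconstruct(params, frontal["identity"], frontal["non_identity"])
    # Cross reconstruction (assumed pairing): near-frontal identity plus the
    # other view's non-identity feature should recover that view's embedding.
    cross_rec = reconstruct(params, frontal["identity"], profile["non_identity"])
    return (squared_l2(self_rec, frontal["rich"]),
            squared_l2(cross_rec, profile["rich"]))

rng = random.Random(0)
params = (make_fc(256 + 128, 512, rng), make_fc(512, 512, rng))

def random_features():
    # Random stand-ins for the features of one image.
    return {"rich": [rng.gauss(0.0, 1.0) for _ in range(512)],
            "identity": [rng.gauss(0.0, 1.0) for _ in range(256)],
            "non_identity": [rng.gauss(0.0, 1.0) for _ in range(128)]}

frontal, profile = random_features(), random_features()
self_loss, cross_loss = reconstruction_losses(params, frontal, profile)
gamma = 0.5  # hypothetical weighting between the two terms
total_loss = gamma * self_loss + (1.0 - gamma) * cross_loss
```

Under this pairing, the cross term can only be small if the near-frontal identity feature carries the same information as the identity feature of the other view, which is one way to read how the two identity features are pushed together.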
3 Additional Experimental Results
In addition to the main submission, we present more experimental results in this section to further validate our approach in different aspects.
3.1 P1 and P2 protocol on MultiPIE
In the main submission, due to space considerations, we only report the mean accuracy over 10 random training and testing splits, on MultiPIE and 300WLP separately. In Table 6, we also report the standard deviation of our method for a more complete comparison. The standard deviation of our method is very small, which suggests that the performance is consistent across all trials. We also compare cross-database evaluation on both mean accuracy and standard deviation in Table 7, which shows the models trained on 300WLP and tested on MultiPIE with both the P1 and P2 protocols. Note that with the P2 protocol, our method still achieves better performance on MultiPIE than MvDN, with a 0.7% gap. Further, across different testing protocols, the proposed method consistently outperforms the baseline method MSMT, which clearly shows the effectiveness of our proposed Siamese reconstruction based regularization for pose-invariant feature representation.
3.2 Control Experiments with P2 on MultiPIE
The P2 testing protocol utilizes all the frontal images as the gallery. The performance is expected to be better than that reported under the P1 protocol in the main submission, since more images are used for reference. There is no standard deviation in this experiment because the gallery is fixed by using all the frontal images. The results are shown in Table 8 and confirm that the proposed feature reconstruction based regularization is effective in obtaining pose-invariant and highly discriminative feature representations for face recognition.
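Gallery-based identification of this kind can be sketched as nearest-neighbour matching. The use of cosine similarity and rank-1 matching is an assumption, since the matching metric is not stated here; the function names and the toy 2-d features are purely illustrative and are not data from our experiments.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def rank1_accuracy(gallery, probes):
    """gallery, probes: lists of (subject_id, feature_vector).

    Each probe is assigned the identity of its most similar gallery entry.
    The gallery may hold several images per subject.
    """
    correct = 0
    for true_id, feature in probes:
        best_id, _ = max(gallery, key=lambda item: cosine(feature, item[1]))
        correct += best_id == true_id
    return correct / len(probes)

# Toy features: subject "A" has two gallery images, subject "B" has one.
gallery = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("A", [0.9, 0.1])]
probes = [("A", [0.8, 0.2]), ("B", [0.1, 0.9]), ("B", [0.7, 0.3])]
accuracy = rank1_accuracy(gallery, probes)
```

With more reference images per subject, a probe has more chances to find a close match of the correct identity, which is consistent with the expectation that P2 is easier than P1.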
3.3 Recognition Accuracy on LFW
We also carried out additional experiments on LFW [lfwdatabase]. LFW contains mostly near-frontal faces. To better reveal the contribution of our method, which is designed to regularize pose variations, we break the performance down by pose range (number of correct pairs / total number of pairs in the range). Table 9 shows the results. Our approach outperforms VGG-Face, especially in non-frontal settings (>30°), which demonstrates the effectiveness of the proposed method in handling pose variations.
3.4 Feature Embedding of MultiPIE
Figure 5 shows a t-SNE visualization [Maaten2014] of the VGGFace feature space and the proposed reconstruction-based disentangling feature space on MultiPIE. For clarity, we only visualize 10 randomly selected subjects from the test set at several yaw angles. Figure 5 (a) shows that samples in the VGGFace feature embedding overlap heavily across different subjects. In contrast, Figure 5 (b) shows that our approach tightly clusters samples of the same subject, which leads to little overlap between different subjects, since identity features have been disentangled from pose in this case.
3.5 Feature Embedding of 300WLP
Figure 6 shows a t-SNE visualization [Maaten2014] of the VGGFace feature space and the proposed reconstruction-based disentangling feature space, with 10 subjects from 300WLP. Similar to the results on MultiPIE, the VGGFace feature embedding space shows entanglement between identity and pose, e.g., images of the man with the phone overlap with frontal-view images of other persons. In contrast, the feature embeddings of our method are largely separated from one subject to another, while feature embeddings of the same subject are clustered together even when there are extensive pose variations.
3.6 Probe and Gallery Examples
In Figure 7, we show examples of gallery and probe images that are used in testing. Figure 7 (a) shows the gallery images from MultiPIE. Each subject has only one frontal image for reference. Figure 7 (b) shows probe images of various poses and expressions from MultiPIE. Each subject presents all possible poses and expressions such as neutral, happy, surprise, etc. The illumination is controlled with plain front lighting. Figure 7 (c) shows the gallery images from 300WLP, with two near-frontal images of each subject randomly selected. Figure 7 (d) shows all poses of the same subject from 300WLP.
3.7 Failure cases in MultiPIE and 300WLP
In Figure 8, we show typical failure cases of the proposed method on both MultiPIE and 300WLP. For MultiPIE, the most challenging cases come from exaggerated expression variations, e.g., the second row of Figure 8 (a). For 300WLP, the challenges mostly come from head pose and illumination variations. However, the images in most failure pairs are visually similar.
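Table 9: Recognition accuracy on LFW by pose range, given as accuracy (correct pairs / total pairs in the range).
| VGG-Face | 0.973 (5304/5524) | 0.967 (410/424) | 0.961 (49/51) | 1.00 (1/1) | 0.964 |
| Ours | 0.986 (5445/5524) | 0.981 (416/424) | 1.00 (51/51) | 1.00 (1/1) | 0.983 |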