UV space embeds the manifold of a 3D face as a 2D contiguous atlas. Contiguous UV spaces are natural products of many 3D scanning devices and are often used for 3D Morphable Model (3DMM) construction [1, 2, 3]. Although UV space by nature cannot be constructed from an arbitrary 2D image, a UV map may still be obtained by fitting a 3DMM to the image and sampling the corresponding texture. We illustrate this procedure in Figure 1. Unfortunately, due to self-occlusion of the face, such UV maps are always incomplete and are likely to miss informative facial parts. Once completed, the UV map, combined with the corresponding 3D face, is extremely useful, as it can be used to synthesise 2D faces of arbitrary poses. We can then probe image pairs of similar poses to improve recognition performance. Hence, the success of pose-invariant face recognition relies on the quality of UV map completion.
Recovering UV maps from a sequence of related facial frames can be addressed by employing robust principal component analysis (RPCA) with missing data. This is because self-occlusion at large poses leads to incomplete and missing data, and imperfection in fitting leads to regional errors. Principal Component Pursuit (PCP), as proposed in [7, 8], and its variants, e.g., [9, 10, 11, 12, 13], are popular algorithms to solve RPCA. PCP employs the nuclear norm and the $\ell_1$-norm (convex surrogates of the rank and sparsity constraints, respectively) in order to approximate the original $\ell_0$-norm regularised rank minimisation problem. Unfortunately, PCP operates in an isolated manner where domain-dependent prior knowledge, i.e., side information, is always ignored. Moreover, real-world visual data rarely satisfies the stringent assumptions imposed by PCP for exact recovery. These shortcomings call for a more powerful framework that can assimilate useful priors to alleviate the degenerate or suboptimal solutions of PCP.
It has already been shown that side information is propitious in the context of matrix completion [17, 18] and compressed sensing. Recently, noiseless features have been capitalised on in the PCP framework [20, 21, 22, 23]. In particular, an error-free orthogonal column space was used to drive a person-specific facial deformable model. Features also remove the dependency on row-coherence, which is beneficial in the case of a union of multiple subspaces [22, 23]. More generally, Chiang et al. used both a column and a row space to recover only the weights of their interaction in a simpler problem. The main hindrance to the success of such methods is the need for a set of clean, noise-free data samples in order to determine the column and/or row spaces of the low-rank component. However, there is no prescribed way to find them in practice.
This paper builds on our preliminary work and extends it 1) to the problem of UV completion and 2) to incorporate side information provided by generative adversarial networks. As such, we have extended PCP to take advantage of noisy prior information, aiming to realise better UV map reconstruction. We then perform pose-invariant face recognition experiments using the completed UV maps. Experimental results indicate the superiority of our framework. The overall workflow is illustrated in Figure 2. Our contributions are summarised as follows:
A novel convex program is proposed to use side information, which is a noisy approximation of the low-rank component, within the PCP framework. The proposed method is able to handle missing values while the developed optimization algorithm has convergence guarantees.
Furthermore, we extend our proposed PCP model using side information to exploit prior knowledge regarding the column and row spaces of the low-rank component in a more general algorithmic framework.
In the case of UV completion, we suggest the use of generative adversarial networks to provide subspace features and side information, leading to a seamless integration of deep learning into the robust PCA framework.
We demonstrate the applicability and effectiveness of the proposed approaches on synthetic data as well as on facial image denoising, UV texture completion and pose-invariant face recognition experiments with both quantitative and qualitative evaluation.
The remainder of this paper is organised as follows. We discuss relevant literature in Section 2, while the proposed robust principal component analysis using side information with missing values (PCPSM) along with its extension that incorporates features (PCPSFM) is presented in Section 3. In Section 4, we first evaluate our proposed algorithms on synthetic and real-world data. Then we introduce GAN as a source of features and side information for the subject of UV completion. Finally, face recognition experiments are presented in the last subsection.
Notations Lowercase letters denote scalars and uppercase letters denote matrices, unless otherwise stated. For norms of a matrix $A$, $\|A\|_F$ is the Frobenius norm; $\|A\|_*$ is the nuclear norm; and $\|A\|_1$ is the sum of absolute values of all matrix entries. Moreover, $\langle A, B \rangle$ represents tr($A^T B$) for real matrices $A, B$. Additionally, $A \circ B$ symbolises element-wise multiplication of two matrices of the same dimension.
2 Related work
We discuss two different lines of research, namely low-rank recovery as well as image completion.
2.1 Robust principal component analysis
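Suppose that there is a matrix $L_0 \in \mathbb{R}^{n_1 \times n_2}$ with rank $r \ll \min(n_1, n_2)$ and a sparse matrix $E_0 \in \mathbb{R}^{n_1 \times n_2}$ with entries of arbitrary magnitude. If we are provided with the observation matrix $X = L_0 + E_0$, RPCA aims to recover them by solving the following objective:
$$\min_{L, E} \; \mathrm{rank}(L) + \lambda \|E\|_0 \quad \text{s.t.} \quad X = L + E, \tag{1}$$
where $\lambda$ is a regularisation parameter. However, (1) cannot be readily solved because it is NP-hard. PCP instead solves the following convex surrogate:
$$\min_{L, E} \; \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = L + E, \tag{2}$$
which, under mild conditions, is equivalent to (1). There exist many efficient solvers for (2) and its applications include background modelling from surveillance video and removing shadows and specularities from face images.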
One of the first methods for incorporating a dictionary was proposed in the context of subspace clustering [22, 23]. The LRR algorithm assumes that we have available an orthogonal column space , where , and optimises the following:
Given an orthonormal statistical prior of facial images, LRR can be used to construct person-specific deformable models from erroneous initialisations.
A generalisation of the above was proposed as Principal Component Pursuit with Features (PCPF) where further row spaces , , were assumed to be available with the following objective:
There is a stronger equivalence relation between (4) and (1) than (2). The main drawback of the above mentioned models is that features need to be accurate and noiseless, which is not trivial in practical scenarios.
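For orientation, the two feature-based objectives discussed above can plausibly be written as below. The display equations did not survive in this text, so the notation is an assumption made for this sketch: $X$ is the observation matrix, $U \in \mathbb{R}^{n_1 \times d_1}$ and $V \in \mathbb{R}^{n_2 \times d_2}$ are the column and row features, $K$ and $H$ are the coefficient matrices to be recovered, $E$ is the sparse error and $\lambda > 0$ is the regularisation parameter.

```latex
% LRR with a known orthogonal column space U (assumed notation)
\min_{K, E} \; \|K\|_* + \lambda \|E\|_1
\quad \text{s.t.} \quad X = U K + E

% PCPF with column features U and row features V (assumed notation)
\min_{H, E} \; \|H\|_* + \lambda \|E\|_1
\quad \text{s.t.} \quad X = U H V^T + E
```

Under this reading, setting $V$ to the identity in the second problem gives back the first, which matches the description of PCPF as a generalisation of LRR.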
2.2 Image completion neural networks
Recent advances in convolutional neural networks (CNN) also show great promise in visual feature learning. Context encoders (CE) use an encoder-decoder pipeline where the encoder takes an input image with missing regions and produces a latent feature representation, and the decoder takes the feature representation and generates the missing image content. CE uses a joint loss function:
where is the reconstruction loss and is the adversarial loss. The reconstruction loss is given by:
where is a binary mask, is an example image and CE produces an output . The adversarial loss is based on Generative Adversarial Networks (GAN). GAN learns both a generative model from noise distribution to data distribution and a discriminative model by the following objective:
For CE, the adversarial loss is modified to
Generative face completion uses two discriminators instead with the following objective
is a parsing loss of pixel-wise softmax between the estimated UV texture and the ground truth texture of width and height
Patch synthesis optimises a loss function of three terms: the holistic content term, the local texture term and the TV-loss term. The content constraint penalises the difference between the optimisation result and the previous content prediction
where if the optimisation result at a coarser scale. The texture constraint penalises the texture appearance across the hole,
where is the corresponding mask in the VGG-19 feature map , is the number of patches sampled in , is the local neural patch at location , and is the nearest neighbor of . The TV loss encourages smoothness:
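The exact TV term used in the cited work is not reproduced in this text. As an illustration only, a common squared-difference variant penalises differences between vertically and horizontally adjacent pixels. The function name `tv_loss` and the `weight` argument are hypothetical choices for this sketch.

```python
import numpy as np

def tv_loss(img, weight=1.0):
    """Squared-difference total variation of an H x W (or H x W x C) image.

    This is one common TV variant, assumed here for illustration; it is not
    claimed to be the exact term of the patch synthesis method.
    """
    img = np.asarray(img, dtype=float)
    dh = img[1:, ...] - img[:-1, ...]        # differences between vertical neighbours
    dw = img[:, 1:, ...] - img[:, :-1, ...]  # differences between horizontal neighbours
    return weight * (np.sum(dh ** 2) + np.sum(dw ** 2))
```

A constant image has zero TV loss, while sharp edges contribute the squared jump across each neighbouring pixel pair.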
3 Robust Principal Component Analysis Using Side Information
In this section, we propose models of RPCA using side information. In particular, we incorporate side information into PCP by using the trace distance of the difference between the low-rank component and the noisy estimate, which can be seen as a generalisation of compressed sensing with prior information, where the $\ell_1$ norm has been used to minimise the distance between the target signal and side information.
3.1 The PCPSM and PCPSFM models
Assuming that a noisy estimate of the low-rank component of the data is available, we propose the following model of PCP using side information with missing values (PCPSM):
where are parameters that weigh the effects of side information and noise sparsity.
The proposed PCPSM can be revamped to generalise the previous attempt of PCPF by the following objective of PCP using side information with features and missing values (PCPSFM):
where are bilinear mappings for the recovered low-rank matrix and side information respectively. Note that the low-rank matrix is recovered from the optimal solution () to objective (16) via . If side information is not available, PCPSFM reduces to PCPF with missing values by setting to zero. If the features are not present either, PCP with missing values can be restored by fixing both of them at identity. However, when only the side information is accessible, objective (16) is transformed back into PCPSM.
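Since the display equations are missing from this text, the following is a hedged reconstruction of how the two models can be written, consistent with the description above. All symbols are assumptions for this sketch: $X$ is the observation, $S$ the noisy side information, $W$ a binary mask of observed entries, $U$ and $V$ the column and row features, $H$ and $D$ the bilinear mappings, and $\alpha, \lambda > 0$ the two weights.

```latex
% PCPSM: side information S, missing values encoded by the mask W
\min_{L, E} \; \|L\|_* + \alpha \|L - S\|_* + \lambda \|W \circ E\|_1
\quad \text{s.t.} \quad X = L + E

% PCPSFM: additionally uses features U and V
\min_{H, E} \; \|H\|_* + \alpha \|H - D\|_* + \lambda \|W \circ E\|_1
\quad \text{s.t.} \quad X = U H V^T + E, \quad D = U^T S V
```

In this form the reductions stated above can be checked directly: $\alpha = 0$ removes the side-information term (PCPF with missing values), $U = V = I$ removes the features (PCPSM), and doing both leaves PCP with missing values. The low-rank matrix would then be read off as $L = U H V^T$.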
3.2 The algorithm
If we substitute for and orthogonalise and , the optimisation problem (16) is identical to the following convex but non-smooth problem:
which is amenable to the multi-block alternating direction method of multipliers (ADMM).
The corresponding augmented Lagrangian of (17) is:
where and are Lagrange multipliers and is the learning rate.
The ADMM operates by carrying out repeated cycles of updates till convergence. During each cycle, the primal variables are updated serially by minimising (18) with other variables fixed. Afterwards, the Lagrange multipliers are updated at the end of each iteration. Direct solutions to the single variable minimisation subproblems rely on the shrinkage and the singular value thresholding operators. Let $\mathcal{S}_\tau(a) \equiv \mathrm{sgn}(a)\max(|a| - \tau, 0)$ serve as the shrinkage operator, which naturally extends to matrices, $\mathcal{S}_\tau(A)$, by applying it to matrix $A$ element-wise. Similarly, let $\mathcal{D}_\tau(A) \equiv M \mathcal{S}_\tau(\Sigma) Y^T$ be the singular value thresholding operator on real matrix $A$, with $A = M \Sigma Y^T$ being the singular value decomposition (SVD) of $A$.
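The two operators described above are short to implement. The sketch below uses NumPy; the function names `shrink` and `svt` are illustrative choices, not names taken from the paper.

```python
import numpy as np

def shrink(A, tau):
    """Element-wise shrinkage (soft-thresholding): sgn(a) * max(|a| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    """Singular value thresholding: soft-threshold the singular values of A."""
    M, sigma, Yt = np.linalg.svd(A, full_matrices=False)
    # Scale each left singular vector by its thresholded singular value.
    return (M * shrink(sigma, tau)) @ Yt
```

`shrink` is the proximal operator of the scaled $\ell_1$ norm and `svt` that of the scaled nuclear norm, which is why both appear as closed-form solutions of the single-variable subproblems.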
Minimising (18) w.r.t. at fixed is equivalent to the following:
where . Its solution is shown to be . Furthermore, for ,
where , whose update rule is , and for ,
where with a closed-form solution . Finally, Lagrange multipliers are updated as usual:
The overall algorithm is summarised in Algorithm 1.
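To make the update cycle concrete, here is a minimal NumPy sketch for the featureless case only (PCPSM). It is not Algorithm 1 itself. It assumes the model $\min \|L\|_* + \alpha\|D\|_* + \lambda\|W \circ E\|_1$ subject to $X = L + E$ and $D = L - S$, and the update order, initial learning rate, growth factor `rho` and stopping rule are illustrative assumptions.

```python
import numpy as np

def shrink(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    M, sigma, Yt = np.linalg.svd(A, full_matrices=False)
    return (M * np.maximum(sigma - tau, 0.0)) @ Yt

def pcpsm(X, S, W, alpha=1.0, lam=None, rho=1.2, tol=1e-7, max_iter=1000):
    """Sketch of 3-block ADMM for PCP with side information S and mask W.

    X: observations, S: side information, W: 1 = observed, 0 = missing.
    Parameter defaults are assumptions, not the paper's tuned values.
    """
    n1, n2 = X.shape
    lam = 1.0 / np.sqrt(max(n1, n2)) if lam is None else lam
    X = W * X                                  # zero-fill unobserved entries
    mu = 1.25 / np.linalg.norm(X, 2)           # assumed initial learning rate
    mu_max = 1e7 * mu
    L, E, D = (np.zeros_like(X) for _ in range(3))
    Z1, Z2 = np.zeros_like(X), np.zeros_like(X)
    norm_X = np.linalg.norm(X)
    for _ in range(max_iter):
        # L-step: two quadratic terms merge into one, hence threshold 1/(2*mu).
        P = 0.5 * (X - E + Z1 / mu + D + S + Z2 / mu)
        L = svt(P, 0.5 / mu)
        # E-step: shrink observed entries only; missing entries are unpenalised.
        E = shrink(X - L + Z1 / mu, lam * W / mu)
        # D-step: auxiliary variable for L - S.
        D = svt(L - S - Z2 / mu, alpha / mu)
        R1, R2 = X - L - E, D - L + S          # constraint residuals
        Z1 = Z1 + mu * R1
        Z2 = Z2 + mu * R2
        if max(np.linalg.norm(R1), np.linalg.norm(R2)) <= tol * norm_X:
            break
        mu = min(rho * mu, mu_max)             # continuation on the learning rate
    return L, E
```

On unobserved entries the threshold `lam * W / mu` is zero, so `E` absorbs the residual there and those entries place no constraint on `L`.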
3.3 Complexity and convergence
Orthogonalisation of the features via the Gram-Schmidt process has an operation count of and respectively. The update in Step is the most costly step of each iteration in Algorithm 1. Specifically, the SVD required in the singular value thresholding action dominates with complexity.
A direct extension of the ADMM has been applied to our 3-block separable convex objective. Its global convergence is proved in Theorem 1. We have also used the fast continuation technique already applied to the matrix completion problem to increase the learning rate incrementally for accelerated superlinear performance. The cold start initialisation strategies for variables and Lagrange multipliers are described in . Besides, we have scheduled to be updated first and taken the initial learning rate as suggested in. As for stopping criteria, we have employed the Karush-Kuhn-Tucker (KKT) feasibility conditions. Namely, within a maximum number of iterations, when the maximum of and falls below a pre-defined threshold , the algorithm is terminated, where signifies values at the th iteration.
Theorem 1. Let the iterative sequence be generated by the direct extension of ADMM, Algorithm 1; then the sequence converges to a solution of the Karush-Kuhn-Tucker (KKT) system in the fully observed case.
We first show that function is sub-strong monotonic. From, we know that is a KKT point, where , , , if and , otherwise. Since is convex, by definition, we have
Since is identity in (17), we have
where the third line follows from when and when , and the fourth line follows from , and . But is bounded, so there always exists such that
Thus, overall we have
Combining with (24), we arrive at
which shows that satisfies the sub-strong monotonicity assumption.
4 Experimental results
4.1 Parameter calibration
In this section, we illustrate the enhancement made by side information through both numerical simulations and real-world applications. First, we explain how parameters used in our implementation are tuned. Second, we compare the recoverability of our proposed algorithms with state-of-the-art methods for incorporating features or a dictionary, viz. PCPF and LRR, on synthetic data, as well as the baseline PCP when there are no features available. Last, we show how powerful side information can be for the task of UV completion in pose-invariant face recognition, where both features and side information are derived from generative adversarial networks.
and the heuristics for predicting the dimension of principal singular space is not adopted here due to its lack of validity on uncharted real data. We also include Partial Sum of Singular Values (PSSV) in our comparison for its stated advantage in view of the limited number of images available. The stopping criteria for PCPF, LRR, PCP and PSSV are all set to the same KKT optimality conditions for reasons of consistency.
In order to tune the algorithmic parameters, we first conduct a benchmark experiment as follows: a low-rank matrix is generated from , where have entries from a distribution; a sparse matrix is generated by randomly setting entries to zero with others taking values of with equal probability; side information is assumed perfect, that is, ; is set as the left-singular vectors of ; and is set as the right-singular vectors of ; all entries are observed. It has been found that a scaling ratio , a tolerance threshold and a maximum step size to avoid ill-conditioning can bring all models except PSSV to convergence with a recovered of rank , a recovered of sparsity and an accuracy on the order of . Still, these apply to PSSV as is done similarly in.
Although theoretical determination of and is beyond the scope of this paper, we nevertheless provide empirical guidance based on extensive experiments. A parameter sweep in the space for perfect side information is shown in Figure 3(a) and for observation as side information in Figure 3(b) to impart a lower bound and an upper bound respectively. It can be easily seen that (or for a general matrix of dimension ) from Robust PCA works well in both cases. Conversely, depends on the quality of the side information. When the side information is accurate, a large should be selected to capitalise upon the side information as much as possible, whereas when the side information is improper, a small should be picked to sidestep the dissonance caused by the side information. Here, we have discovered that a κ value of 0.2 works best with synthetic data and a value of 0.5 is suited for public video sequences, both of which will be used in all experiments in subsequent sections together with other aforementioned parameter settings. It is worth emphasising again that prior knowledge of the structural information about the data yields more appropriate values for and .
4.2 Phase transition on synthetic datasets
We now focus on the recoverability problem, i.e. recovering matrices of varying ranks from errors of varying sparsity. True low-rank matrices are created via , where matrices have independent elements drawn randomly from a Gaussian distribution of mean and variance so is the rank of . Next, we generate error matrices , which possess non-zero elements located randomly within the matrix. We consider two types of entries for : Bernoulli and , where is the projection operator. thus becomes the simulated observation. For each pair, three observations are constructed. The recovery is successful if for all these three problems, from the recovered . In addition, let be the SVD of . Feature is formed by randomly interweaving column vectors of with arbitrary orthonormal bases for the null space of , while permuting the expanded columns of with random orthonormal bases for the kernel of forms feature . Hence, the feasibility conditions are fulfilled: , , where is the column space operator.
For each trial, we construct the side information by directly adding small Gaussian noise to each element of : , . As a result, the standard deviation of the error in each element is of that among the elements themselves. On average, the Frobenius percent error, , is . Such side information is genuine in regard to the fact that classical PCA with accurate rank is not able to eliminate the noise. We set to 10 throughout.
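The construction of one synthetic trial can be sketched as follows. The numerical settings are not recoverable from this text, so every default below (matrix size `n`, rank `r`, error `density`, `noise_ratio`, feature dimension `d`, zero mean and variance `1/n` for the Gaussian factors) is an assumption, as are the helper names.

```python
import numpy as np

def expand_basis(Q_full, r, d, rng):
    """Mix the first r columns of an orthogonal matrix with d - r columns
    from its remaining (null-space) columns, in random order."""
    extra = rng.choice(np.arange(r, Q_full.shape[0]), size=d - r, replace=False)
    cols = rng.permutation(np.concatenate([np.arange(r), extra]))
    return Q_full[:, cols]

def make_problem(n=200, r=10, density=0.1, noise_ratio=0.1, d=20, seed=0):
    rng = np.random.default_rng(seed)
    # Low-rank matrix from two Gaussian factors (assumed mean 0, variance 1/n).
    J = rng.normal(0.0, np.sqrt(1.0 / n), (n, r))
    K = rng.normal(0.0, np.sqrt(1.0 / n), (n, r))
    L0 = J @ K.T
    # Sparse errors: random support with +/-1 entries of equal probability.
    support = rng.random((n, n)) < density
    E0 = support * rng.choice([-1.0, 1.0], size=(n, n))
    X = L0 + E0
    # Noisy side information: per-entry noise scaled by the spread of L0.
    S = L0 + noise_ratio * L0.std() * rng.standard_normal((n, n))
    # Features whose column spaces contain those of L0 and L0^T.
    U_full, _, Vt_full = np.linalg.svd(L0, full_matrices=True)
    U = expand_basis(U_full, r, d, rng)
    V = expand_basis(Vt_full.T, r, d, rng)
    return X, L0, E0, S, U, V
```

Because `U` and `V` are built from columns of orthogonal matrices, they stay orthonormal, and projecting `L0` onto their spans leaves it unchanged, which is the feasibility condition described above.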
Full observation Figures 4(a.I) and (a.II) plot results from PCPF, LRR and PCPSFM. On the other hand, the situation with no available features is investigated in Figures 4(a.III) and 4(a.IV) for PCP and PCPSM. The frontier of PCPF has been advanced by PCPSFM everywhere for both sign types. Especially at low ranks, errors with much higher density can be removed. Without features, PCPSM surpasses PCP by and large with significant expansion at small sparsity for both cases. Results from RPCAG and PSSV are worse than PCP with LRR marginally improving (see Figures 4(b.I), (b.II), (b.III) and (b.IV)).
Partial observation Figures 5(a.I) and (a.II) map out the results for PCPF, LRR and PCPSFM when of the elements are occluded and Figures 5(a.III) and (a.IV) for featureless PCP and PCPSM. In all cases, areas of recovery are reduced. However, there are now larger gaps between PCPF, PCPSFM and PCP, PCPSM. This marks the usefulness of side information particularly in the event of missing observations. We remark that in unrecoverable areas, PCPSM and PCPSFM still obtain much smaller values of . FRPCAG fails to recover anything at all.
4.3 Face denoising
If a surface is convex Lambertian and the lighting is isotropic and distant, then the rendered model spans a 9-D linear subspace . Nonetheless, facial images are only approximately so because facial harmonic planes have negative pixels and real lighting conditions entail unavoidable occlusion and albedo variations. It is thus more reasonable to decompose facial image formation as a low-rank component for face description and a sparse component for defects. In pursuit of this low-rank portrayal, we suggest that there can be further boost to the performance of facial characterisation by leveraging an image which faithfully represents the subject.
We consider images of a fixed pose under different illuminations from the extended Yale B database for testing. All 64 images were studied for each person. observation matrices were formed by vectorising each image and the side information was chosen to be the average of all images, tiled to the same size as the observation matrix for each subject. In addition, of randomly selected pixels of each image were set as missing entries.
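The data preparation described above amounts to a few array operations. In the sketch below, images are assumed to be stored as columns of a pixels-by-images matrix; the function names and the `missing_fraction` argument are hypothetical, since the actual fraction is not recoverable from this text.

```python
import numpy as np

def mean_side_information(X):
    """Average all image columns and tile the mean image to the shape of X."""
    mean_img = X.mean(axis=1, keepdims=True)
    return np.tile(mean_img, (1, X.shape[1]))

def random_observation_mask(shape, missing_fraction, rng):
    """Binary mask: 1 marks an observed pixel, 0 a pixel set as missing."""
    return (rng.random(shape) >= missing_fraction).astype(float)
```

For example, `W = random_observation_mask(X.shape, 0.1, np.random.default_rng(0))` would hide roughly one pixel in ten, and `W * X` gives the zero-filled observation matrix.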
For LRR, PCPF and PCPSFM to run, we learn the feature dictionary following an approach by Vishal et al. In a nutshell, the feature learning process can be treated as a sparse encoding problem. More specifically, we simultaneously seek a dictionary and a sparse representation such that:
where is the number of atoms, ’s count the number of non-zero elements in each sparsity code and is the sparsity constraint factor. This can be solved by the K-SVD algorithm. Here, feature is the dictionary and feature corresponds to a similar solution using the transpose of the observation matrix as input. For implementation details, we set to , to and used iterations for each subject.
As a visual illustration, two challenging cases are exhibited in Figure 6. For subject , it is clearly evident that PCPSM and PCPSFM outperform the best existing methods through the complete elimination of acquisition faults. More surprisingly, PCPSFM even manages to restore the flash in the pupils that is barely present in the side information. For subject , PCPSM indubitably reconstructs a more vivid right eye than that from PCP, which is only just discernible. With that said, PCPSFM still prevails by uncovering more shadows, especially around the medial canthus of the right eye, and revealing a more distinct crease in the upper eyelid as well as a more translucent iris. We further unmask the strength of PCPSM and PCPSFM by considering the stringent side information made of the average of 10 other subjects. Surprisingly, PCPSM and PCPSFM still manage to remove the noise, recovering an authentic image (Figures 6(c.IV) and 6(c.VII)). We also notice that PSSV, RPCAG and FRPCAG do not improve upon PCP, as in the simulation experiments. Hence, we will focus on comparisons with PCP, LRR and PCPF only.
4.4 UV map completion
We concern ourselves with the problem of completing the UV texture for each of a sequence of video frames. That is, we apply PCPSM and PCPSFM to a collection of incomplete textures lifted from the video. This parameter-free approach has an advantage over a statistical texture model such as the 3D Morphable Model (3DMM) [40, 41], which has difficulty reconstructing unseen images captured 'in-the-wild' (by any commercial camera in arbitrary conditions).
4.4.1 Texture extraction
Given a 2D image, we extract the UV texture by fitting a 3DMM. Specifically, following , three parametric models are employed. These are a 3D shape model (31), a texture model (32) and a camera model (33):
where and are shape, texture and camera parameters to optimise; and are the shape and texture eigenbases respectively with the number of vertices in shape model; and are the means of shape and texture models correspondingly, learnt from 10000 face scans of different individuals; is a perspective camera transformation function.
The complete cost function for 3DMM fitting is:
where denotes the operation of sampling the feature image onto the projected 2D locations. The second term is a landmark term with weighting used to accelerate in-the-wild 3DMM fitting, where 2D shape is provided by. The final two terms are regularisation terms to counter over-fitting, where and are diagonal matrices with the main diagonal being eigenvalues of the shape and texture models respectively. (34) is solved by the Gauss-Newton optimisation framework (see for details).
4.4.2 Quantitative evaluation
We quantitatively evaluate the completed UV maps by our proposed methods on the 4DFAB dataset. 4DFAB is the first 3D dynamic facial expression dataset designed for biometric applications, where 180 participants were invited to attend four sessions at different times. Hence, to complete UV maps for one session, we can leverage images from another session as the side information. For each of 5 randomly selected subjects, one dynamic sequence of 155 frames was randomly cut from the second session. After vectorisation, a observation matrix was formed. To produce UV masks of different poses, we rotate each face with different yaw and pitch angles. The yaw angle ranges from to in steps of , whereas the pitch angle is selected from . Therefore, for each subject, a set of 155 unique masks was generated. We also tiled one image of the same subject from the first session into a matrix as side information. was provided by the left and right singular vectors of the original sequence while was set to the identity.
From Figure 7, we observe that (I) RPCA approaches can deal with cases when more than of the pixels are missing; (II) imperfect side information (shaved beard, removed earrings and different lighting) still helps with the recovery process. We record peak signal-to-noise ratios (PSNR) and structural similarity indices (SSIM) between the completed UV maps and the original map in Table I. It is evident that with the assistance of side information much higher fidelity can be achieved. The use of imperfect side information is nearly on a par with perfect features.
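For reference, PSNR between a ground-truth map and a completed map can be computed as in the sketch below. The function name and the `data_range` argument (the maximum possible pixel value, assumed to be 1.0 for normalised images) are choices made for this example. SSIM is more involved and is usually taken from a library implementation.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher values indicate a closer match; for 8-bit images `data_range` would be 255.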
4.4.3 Generative adversarial networks
More often than not, ground-truth , are not accessible to us for in-the-wild videos. Learning methods such as (30) must be leveraged to acquire or . However, (30) is not ideal: (I) it is not robust to errors; (II) it cannot handle missing values; (III) it requires exhaustive search of optimal parameters which vary from video to video; (IV) it only admits greedy solutions. On the other hand, we can use GAN to produce authentic pseudo ground-truth and then obtain and accordingly. Moreover, such a completed sequence provides us with good side information. For GAN, we employ the image-to-image conditional adversarial network (appropriately customised for UV map completion) to conduct UV completion. Details regarding the architecture and training of GAN can be found in the supplementary material.
4.4.4 Qualitative demonstration
To examine the ability of our proposed methods on in-the-wild images, we perform experiments on the 300VW dataset. This dataset contains 114 in-the-wild videos that exhibit large variations in pose, expression, illumination, background, occlusion, and image quality. Each video shows exactly one person, and each frame is annotated with 68 facial landmarks. We performed 3DMM fitting on these videos and lifted one corresponding UV map for each frame, where the visibility mask was produced by z-buffering based on the fitted mesh. The side information was generated by taking the average of the completed UVs from GAN. and were assigned to the singular vectors of the completed texture sequence from GAN.
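The step from a GAN-completed sequence to the inputs of the robust PCA models can be sketched as follows. Here `G` is assumed to be a pixels-by-frames matrix of vectorised completed UV maps; the function name and the optional `rank` truncation are hypothetical additions, as the text only states that the singular vectors and the average were used.

```python
import numpy as np

def features_from_completed(G, rank=None):
    """Derive column features U, row features V and side information S
    from a completed texture sequence G (pixels x frames)."""
    M, _, Yt = np.linalg.svd(G, full_matrices=False)
    if rank is not None:
        # Optional truncation to the leading singular vectors (an assumption).
        M, Yt = M[:, :rank], Yt[:rank]
    # Side information: the average completed UV, repeated for every frame.
    S = np.tile(G.mean(axis=1, keepdims=True), (1, G.shape[1]))
    return M, Yt.T, S
```

With the thin SVD the returned features have orthonormal columns, so no further orthogonalisation would be needed before running the solver.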
We display results for one frame from each of 9 arbitrary videos in Figure 8. As evident from the images, GAN alone has unavoidable drawbacks: (I) when the 3DMM fitting is not accurate, GAN is unable to correct such defects; (II) when the image itself contains errors, GAN is unable to remove them. On the other hand, PCP often fails to produce a complete UV. PCPSM always produces a completed UV texture, which is an improvement over PCP, but it makes boundaries undesirably visible. Visually, LRR and PCPSFM have the best performance, being able to produce good completed UVs for a large variety of poses, identities, lighting conditions and facial characteristics. This justifies the quality of subspaces and side information from GAN for use in the robust PCA framework. We also synthesise 2D faces of three different poses using the completed UV maps in Figure 9.
4.5 Face recognition
More specifically, to fully exploit the existing vast sources of in-the-wild data, one needs to recognise faces in a pose-invariant way. Modern approaches include pose-robust feature extraction, multi-view subspace learning, face synthesis, etc. These often fall short of expectations either due to fundamental limitations or an inability to fuse with other useful methods. For example, Generalized Multiview Analysis cannot take into account pose normalisation or deep neural network-based pose-robust feature extraction, and vice versa. It is thus fruitful to provide a framework where information from different perspectives can be fused together to deliver better prediction.
We quantitatively evaluate our proposed fusion methods by carrying out face recognition experiments on the completed UVs. The experiments are performed on three standard databases, i.e. VGG, CFP and YTF. We perform both ablation as well as video-based face recognition experiments against known benchmarks.
VGG The VGG2 dataset contains a training set of 8,631 identities (3,141,890 images) and a test set of 500 identities (169,396 images). VGG2 has large variations in pose, age, illumination, ethnicity and profession. To facilitate the evaluation of face matching across different poses, VGG2 also provides a face template list for each of some 368 subjects, which contains 2 front templates, 2 three-quarter templates and 2 profile templates. Each template includes 5 images.
CFP The CFP dataset consists of 500 subjects, each of which has 10 frontal and 4 profile images. As such, we define the evaluation protocol for frontal-frontal (FF) and frontal-profile (FP) face verification on 500 same-person pairs and 500 different-person pairs.
YTF The YTF dataset consists of videos of different people. The clip duration varies from frames to frames, with an average length of frames. We follow its unrestricted with labelled outside data protocol and report the results on video pairs.
4.5.2 Face Feature Embedding
We use ResNet-27 [61, 62] for - facial feature embedding with softmax loss (see Supplementary Material for more details). Figure 10 illustrates the set-based face feature embedding used for face recognition and verification. After 3DMM fitting, we extract 3D face shapes and incomplete UV maps. Then, we utilise the proposed UV completion methods (GAN, PCP, PCPSM, LRR and PCPSFM) to derive completed UV maps. Frontal faces are synthesised from full UV maps and the 3D shapes through which - features are obtained from the last fully connected layer of the feature embedding network. Finally, we calculate the feature centre as the feature description of this set of face images.
4.5.3 Ablation experiments
We conduct face verification and identification experiments for the proposed methods on CFP and VGG datasets. Since the proposed methods are based on a sequence of images, we design our experiments for ablation study as follows: for experiments on VGG, we use the 368 identities from the pose template list, for each of which we divide the total 30 face images into two sets so that each set has 15 face images from frontal, three-quarter and profile templates; for experiments on CFP, we use all of the 500 identities, for each of which we divide the total 14 face images into two sets so that each set has 5 frontal face images and 2 profile face images. For face verification experiments, we construct positive set pairs and hard negative set pairs (nearest inter-class), where is 368 for the VGG dataset and 500 for the CFP dataset. For face identification experiments, one set of the face images from each identity is used as the gallery and the other set of the face images is taken to be the probe. So, for each probe, we predict its identity by searching for the nearest gallery in the feature space. Rank-1 accuracy is used here to assess the face identification performance.
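The identification protocol described above can be sketched as follows: each image set is summarised by its feature centre, and each probe is assigned the identity of the nearest gallery centre. Normalising the centres and using cosine similarity as the nearness measure are assumptions of this sketch, as are the function names.

```python
import numpy as np

def feature_centre(F):
    """Mean of per-image features (images x dim), scaled to unit length."""
    c = np.asarray(F, dtype=float).mean(axis=0)
    return c / np.linalg.norm(c)

def rank1_accuracy(gallery_sets, probe_sets):
    """Rank-1 identification accuracy; set i in both lists belongs to identity i."""
    G = np.stack([feature_centre(F) for F in gallery_sets])
    P = np.stack([feature_centre(F) for F in probe_sets])
    similarity = P @ G.T                     # cosine similarity of unit vectors
    predicted = similarity.argmax(axis=1)    # nearest gallery for each probe
    return float(np.mean(predicted == np.arange(len(probe_sets))))
```

Here `gallery_sets[i]` and `probe_sets[i]` would hold the embeddings of the two disjoint image sets of identity `i`.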
Detailed comparisons of the proposed methods against the baseline methods, PCP and LRR, are tabulated in Table II. For face verification experiments on the VGG dataset, we observe that feature subspace and side information from GAN (LRR and PCPSM) improve the vanilla PCP in terms of accuracy scores with a further boost in performance if both of them are considered together (PCPSFM). On the CFP dataset, the performance is slightly worse than that on VGG, but the improvement can still be made by exploiting features and side information obtained from GAN. A similar trend is seen for the face identification experiments. These results confirm the visual outcomes we see from the previous section.
|Accuracy ()||Ver.||Id. (Rank-1)||Ver.||Id. (Rank-1)|
4.5.4 Video-based face recognition
| Method | Training images | Accuracy (%) |
|---|---|---|
| VGG Face | 2.6M | |
| Deep Face | 4M | 91.40 |
| Center Loss | 0.7M | 94.9 |
| Range Loss | 1.5M | 93.70 |
| Sphere Loss | 0.5M | 95.0 |
| Marginal Loss | 4M | 95.98 |
Face recognition on in-the-wild videos is a challenging task because of the rich pose variations. Videos from the YTF (YouTube Face) dataset also suffer from low resolution and serious compression artifacts, which make the problem even worse. To conduct experiments on YTF, we follow the "restricted" protocol, which forbids the use of subject identity labels during training. Therefore, we remove the 230 overlapping identities from VGG and re-train our GAN model for video-based face recognition experiments. For each method, individual frontal face image projection from the completed UV is produced for facial feature extraction. For each video, the final feature representation is a - feature centre of all the frames.
We compare the face verification performance of the proposed methods with state-of-the-art approaches on the YTF dataset. The mean verification rates of the best-performing deep learning methods are listed in Table III. Our GAN model alone is among the best reported architectures and outperforms the classical PCP. Nonetheless, their fusion (PCPSM, LRR and PCPSFM) is superior to either of them. More specifically, PCPSM improves on both PCP and GAN, and so does LRR. PCPSFM achieves the largest gain over both PCP and LRR. The corresponding ROC curves are plotted in Figure 12 and show that the proposed PCPSFM clearly improves the accuracy of video-based face recognition. We randomly select one sample video (Barbra Streisand) and plot the distributions of the cosine distance between each face frame and the video feature centre in Figure 11. As is evident from the plots, the mean of PCPSFM is the highest, which indicates that the proposed method is able to remove the noise from each video frame and generate new frontal faces with low intra-variance. These findings agree with our recognition results above and confirm the advantages of combining GAN and robust PCA for face recognition.
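The video-level representation and the per-frame statistic plotted in Figure 11 can be sketched as below. The sketch makes several assumptions that the text does not state: frame features are L2-normalised before averaging, the feature centre is their mean, and the reported quantity is the cosine similarity between each frame and the centre (so that a higher mean corresponds to lower intra-video variance). The function names are hypothetical.

```python
import numpy as np

def video_feature_centre(frame_features):
    """Mean of the L2-normalised frame features of one video, shape (d,)."""
    f = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    return f.mean(axis=0)

def cosine_to_centre(frame_features):
    """Cosine similarity between every frame feature and the video centre."""
    f = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    centre = video_feature_centre(frame_features)
    centre = centre / np.linalg.norm(centre)
    return f @ centre
```

A histogram of `cosine_to_centre(frames)` for each completion method would give one curve per method; frames that agree closely with the centre concentrate near 1.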
In this paper, we study the problem of robust principal component analysis with features acting as side information in the presence of missing values. For the application domain of UV completion, we also propose the use of generative adversarial networks to extract side information and subspaces, which, to the best of our knowledge, is the first occasion where RPCA and GAN have been fused. We also prove the convergence of ADMM for our convex objective. Through synthetic and real-world experiments, we demonstrate the advantages of side information. Using in-the-wild data, we corroborate our fusion strategy. Finally, face recognition benchmarks confirm the efficacy of our proposed approach over state-of-the-art methods.
6 Supplementary Material
6.1 Generative adversarial networks
For the GAN, we employ an image-to-image conditional adversarial network for UV completion. As shown in Figure 13, it has two main components: a generator module and a discriminator module.
Generator Module Given an incomplete UV texture as input, the generator works as an auto-encoder to construct completed instances. We adopt the pixel-wise $\ell_1$ norm as the reconstruction loss:

$$\mathcal{L}_{gen} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left| I_{i,j} - I^{*}_{i,j} \right|,$$

where $I_{i,j}$ is the estimated UV texture and $I^{*}_{i,j}$ is the ground truth texture of width $W$ and height $H$. To preserve the image information in the original resolution, we follow an encoder-decoder design in which skip connections are made between mirrored layers in the encoder and decoder stacks. We first fill the incomplete UV texture with random noise and then concatenate it with its mirror image as the generator input. Since the face is not exactly symmetric, we avoid the symmetry loss used in prior work. Also, unlike the original GAN model, which is initialised from a noise vector, the hidden representations obtained from our encoder capture more variations as well as relationships between invisible and visible regions, and thus help the decoder fill in the missing regions.
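The generator input and the reconstruction loss described above can be sketched in NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the pixel-wise norm is taken to be the $\ell_1$ norm, the noise is uniform in $[0, 1]$, the mirror image is a horizontal flip, and the two images are concatenated along the channel axis. The function names and the `mask` argument are hypothetical.

```python
import numpy as np

def make_generator_input(uv, mask, rng):
    """Fill invisible pixels with noise and append the mirrored texture.

    uv   : (H, W, C) partial UV texture with values in [0, 1]
    mask : (H, W) boolean array, True where the texture is visible
    rng  : numpy random Generator (assumed source of the fill noise)
    """
    noise = rng.uniform(0.0, 1.0, size=uv.shape)
    filled = np.where(mask[..., None], uv, noise)
    mirrored = filled[:, ::-1, :]  # flip along the width axis
    # Assumption: concatenation is along channels, giving (H, W, 2C).
    return np.concatenate([filled, mirrored], axis=-1)

def l1_reconstruction_loss(estimated, ground_truth):
    """Sum of absolute differences divided by W x H."""
    h, w = ground_truth.shape[:2]
    return float(np.abs(estimated - ground_truth).sum() / (w * h))
```

For multi-channel textures the sketch sums over channels as well; dividing by the channel count instead would only rescale the loss.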
Discriminator Module Although the previous generator module can fill missing pixels with small reconstruction errors, it does not guarantee the output textures to be visually realistic and informative. With only the pixel-wise reconstruction loss, the UV completion results would be quite blurry and missing important details. To improve the quality of synthetic images and encourage more photo-realistic results, we adopt a discriminator module to distinguish real and fake UVs. The adversarial loss, which is a reflection of how the generator could maximally fool the discriminator and how well the discriminator could distinguish between real and fake UVs, is defined as
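
$$\mathcal{L}_{adv} = \min_{G} \max_{D} \; \mathbb{E}_{x \sim p_d(x),\, y \sim p_d(y)} \left[ \log D(x, y) \right] + \mathbb{E}_{x \sim p_d(x),\, z \sim p_z(z)} \left[ \log \left( 1 - D(x, G(x, z)) \right) \right],$$

where $p_z(z)$, $p_d(x)$ and $p_d(y)$ represent the distributions (Gaussian) of the noise variable $z$, the partial UV texture $x$ and the full UV texture $y$ respectively.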
Objective Function The final loss function for the proposed UV-GAN is a weighted sum of generator loss and discriminator loss:
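
$$\mathcal{L} = \mathcal{L}_{gen} + \lambda \mathcal{L}_{adv},$$

where $\mathcal{L}_{gen}$ is the generator (reconstruction) loss, $\mathcal{L}_{adv}$ is the discriminator (adversarial) loss and $\lambda$ is the weight that balances the two. $\lambda$ is set empirically.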
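To make the weighted objective concrete, the following sketch evaluates the standard conditional adversarial term from discriminator outputs and combines it with a reconstruction term. It assumes the usual log-likelihood form of the conditional GAN loss and places the weight on the adversarial term; the function names are hypothetical, and the weight passed to `total_loss` in any example is a placeholder, not the value used in the paper.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """E[log D(x, y)] + E[log(1 - D(x, G(x, z)))].

    d_real : discriminator probabilities for (partial, full) real pairs
    d_fake : discriminator probabilities for (partial, generated) pairs
    eps    : small constant guarding against log(0)
    """
    real_term = np.mean(np.log(d_real + eps))
    fake_term = np.mean(np.log(1.0 - d_fake + eps))
    return float(real_term + fake_term)

def total_loss(l_gen, l_adv, lam):
    """Weighted sum of reconstruction and adversarial terms (lam is assumed)."""
    return l_gen + lam * l_adv
```

The discriminator is trained to increase `adversarial_loss`, while the generator is trained to decrease `total_loss`; a discriminator that cannot tell real from fake outputs 0.5 everywhere, giving an adversarial term of $2 \log 0.5$.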
The encoder unit consists of convolution, batch normalisation and ReLU, and the decoder unit consists of deconvolution, batch normalisation and ReLU. The convolutions use strided spatial filters: they downsample in the encoder and the discriminator, and upsample in the decoder.
As shown in Figure 14(a), the generator uses the U-Net architecture, which has skip connections between layer $i$ in the encoder and layer $n-i$ in the decoder, where $n$ is the total number of layers. These skip connections concatenate the activations from layer $i$ with those of layer $n-i$. Note that batch normalisation is not applied to the first Conv64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, whereas ReLUs in the decoder are not leaky.
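The skip-connection pattern and the leaky activation can be illustrated with a few lines of Python. The sketch assumes the usual U-Net pairing of layer $i$ with layer $n-i$ for layers numbered from 1 to $n$; the helper names are hypothetical and the code only enumerates the pairing rather than building a network.

```python
import numpy as np

def skip_connection_pairs(n):
    """Pairs (i, n - i) linking encoder layer i to decoder layer n - i.

    Only pairs with i < n - i are returned, so the bottleneck is not
    connected to itself.
    """
    return [(i, n - i) for i in range(1, n) if i < n - i]

def leaky_relu(x, slope=0.2):
    """Leaky ReLU with the slope of 0.2 quoted for the encoder."""
    return np.where(x >= 0, x, slope * x)
```

For a hypothetical network with `n = 6` layers, `skip_connection_pairs(6)` pairs layer 1 with layer 5 and layer 2 with layer 4, leaving the bottleneck layer 3 unpaired.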
For the discriminator, we use the PatchGAN architecture, which is depicted in Figure 14(b). Again, batch normalisation is not applied to the first Conv64 layer. However, all ReLUs are now leaky, with slope 0.2. We also change the stride of the last two encoder modules.
Training We train our networks from scratch, initialising the weights from a zero-mean Gaussian distribution. To train our UV completion model on paired image data, we make use of both controlled and in-the-wild UV datasets. For the controlled UV data, we randomly select 180 subjects from the 4DFAB dataset. For the in-the-wild UV data, we build pseudo-complete UVs from the UMD video dataset via Poisson blending. We deliberately choose videos with large pose variations so that the coverage of different poses is adequate. In the end, we have a combined UV dataset of 1,892 identities with 5,638 unique UV maps.
6.2 Deep face feature embedding networks
The network uses max-pooling with a fixed kernel size and stride. It is initialised from a Gaussian distribution and trained on the VGG training set (c. 3.1 million images) under the supervision of a soft-max loss. Starting from an initial learning rate of 0.1, we successively reduce it by a factor of 10 at three later epochs. We train the network in parallel on four GPUs.
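The learning-rate policy described above is a step decay, which can be sketched as follows. The initial rate of 0.1 and the factor of 10 come from the text; the milestone epochs in the default argument are hypothetical placeholders, since the actual epochs are not given here.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(10, 20, 30), factor=10.0):
    """Step-decay schedule: divide base_lr by `factor` at each milestone.

    `milestones` holds placeholder epochs chosen for illustration only.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops
```

With these placeholder milestones the rate passes through four values over training: 0.1, then one tenth of the previous value after each of the three milestones.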
-  J. Booth and S. Zafeiriou, “Optimal uv spaces for facial morphable model construction,” in ICIP, 2014.
-  V. Blanz and T. Vetter, “A morphable model for the synthesis of 3d faces,” in SIGGRAPH, 1999.
-  A. Patel and W. A. P. Smith, “3d morphable face models revisited,” in CVPR, 2009.
-  J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou, “3d face morphable models in-the-wild,” in CVPR, 2017.
-  Q. Cao, L. Shen, W. Xie, O. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in arXiv:1710.08092, 2017.
-  F. Shang, Y. Liu, J. Cheng, and H. Cheng, “Robust principal component analysis with missing data,” in CIKM, 2014, pp. 1149–1158.
-  E. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, pp. 1–37, 2011.
-  V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
-  A. Aravkin, S. Becker, V. Cevher, and P. Olsen, “A variational approach to stable principal component pursuit,” in UAI, 2014, pp. 32–41.
-  B. Bao, G. Liu, C. Xu, and S. Yan, “Inductive robust principal component analysis,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3794 – 3800, 2012.
-  R. Cabral, F. De la Torre, J. Costeira, and A. Bernardino, “Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition,” in ICCV, 2013.
-  H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3047–3064, 2012.
-  Z. Zhou, X. Li, J. Wright, E. Candès, and Y. Ma, “Stable principal component pursuit,” in ISIT, 2010.
-  J. Jiao, T. Courtade, K. Venkat, and T. Weissman, “Justification of logarithmic loss via the benefit of side information,” IEEE Transactions on Information Theory, vol. 61, no. 10, pp. 5357–5365, 2015.
-  A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976.
-  E. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, no. 9, pp. 589–592, 2008.
-  K. Chiang, C. Hsieh, and I. Dhillon, “Matrix completion with noisy side information,” in NIPS, 2015.
-  M. Xu, R. Jin, and Z. Zhou, “Speedup matrix completion with side information: Application to multi-label learning,” in NIPS, 2013.
-  J. Mota, N. Deligiannis, and M. Rodrigues, “Compressed sensing with prior information: Strategies, geometry, and bounds,” IEEE Transactions on Information Theory, 2017.
-  K. Chiang, C. Hsieh, and I. Dhillon, “Robust principal component analysis with side information,” in ICML, 2016.
-  C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Raps: Robust and efficient automatic construction of person-specific deformable models,” in CVPR, 2014.
-  G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in ICML, 2010, pp. 663–670.
-  G. Liu, Q. Liu, and P. Li, “Blessing of dimensionality: Recovering mixture data via dictionary pursuit,” TPAMI, vol. 39, no. 1, pp. 47–60, 2017.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, 2016.
-  C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” in CVPR, 2017.
-  Y. Li, S. Liu, J. Yang, and M. Yang, “Generative face completion,” in CVPR, 2017.
-  N. Xue, Y. Panagakis, and S. Zafeiriou, “Side information in robust principal component analysis: Algorithms and applications,” in ICCV, 2017.
-  Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis, “Low-rank matrix recovery from errors and erasures,” IEEE Transactions on Information Theory, vol. 59, no. 7, pp. 4324–4337, 2013.
-  K.-C. Toh and S. Yun, “An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems,” Pacific Journal of Optimization, vol. 6, pp. 615–640, 2010.
-  R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM journal on control and optimization, vol. 14, no. 5, pp. 877–898, 1976.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
-  Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices,” UIUC Technical Report, 2009.
-  H. Sun, J. Wang, and T. Deng, “On the global and linear convergence of direct extension of ADMM for 3-block separable convex minimization models,” Journal of Inequalities and Applications, no. 227, p. 227, 2016.
-  M. Hintermüller and T. Wu, “Robust principal component pursuit via inexact alternating minimization on matrix manifolds,” Journal of Mathematical Imaging and Vision, vol. 51, no. 3, pp. 361–377, 2015.
-  T. Oh, Y. Tai, J. Bazin, H. Kim, and I. Kweon, “Partial sum minimization of singular values in robust PCA: Algorithm and applications,” TPAMI, vol. 38, no. 4, pp. 744–758, 2016.
-  A. Shabalin and A. Nobel, “Reconstruction of a low-rank matrix in the presence of gaussian noise,” Journal of Multivariate Analysis, vol. 118, pp. 67–76, 2013.
-  R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” TPAMI, vol. 25, no. 2, pp. 218–233, 2003.
-  V. Patel, T. Wu, S. Biswas, P. Phillips, and R. Chellappa, “Dictionary-based face recognition under variable lighting and pose,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 954–965, 2012.
-  M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
-  V. Blanz and T. Vetter, “Face recognition based on fitting a 3d morphable model,” TPAMI, vol. 25, no. 9, pp. 1063–1074, 2003.
-  J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway, “A 3d morphable model learnt from 10,000 faces,” in CVPR, 2016.
-  J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou, “3d face morphable models in-the-wild,” in CVPR, 2017.
-  A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d and 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks),” in ICCV, 2017.
-  S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou, “4dfab: A large scale 4d facial expression database for biometric applications,” in arXiv:1712.01443, 2017.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in arXiv:1611.07004, 2016.
-  J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic, “The first facial landmark tracking in-the-wild challenge: Benchmark and results,” in ICCVW, 2015, pp. 1003–1011.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in NIPS, 2014, pp. 1988–1996.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
-  J. Chen, V. Patel, L. Liu, V. Kellokumpu, G. Zhao, M. Pietikäinen, and R. Chellappa, “Robust local features for remote face recognition,” Image and Vision Computing, 2017.
-  R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” in arXiv:1703.09507, 2017.
-  S. Shekhar, V. M. Patel, and R. Chellappa, “Synthesis-based robust low resolution face recognition,” in arXiv:1707.02733, 2017.
-  C. Ding and D. Tao, “A comprehensive survey on pose-invariant face recognition,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 7, no. 3, pp. 1–42, 2016.
-  A. Sharma, A. Kumar, H. Daume, and D. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in CVPR, 2012.
-  C. Ding, J. Choi, D. Tao, and L. Davis, “Multi-directional multi-level dual-cross patterns for robust face recognition,” TPAMI, 2015.
-  M. Kan, S. Shan, H. Chang, and X. Chen, “Stacked progressive auto-encoders (spae) for face recognition,” in CVPR, 2014.
-  Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in arXiv:1710.08092, 2017.
-  S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, “Frontal to profile face verification in the wild,” in WACV, 2016.
-  L. Wolf, T. Hassner, and I. Maoz, “Face recognition in unconstrained videos with matched background similarity,” in CVPR, 2011.
-  X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for deep face recognition with long-tail,” in ICCV, 2017.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in ECCV, 2016.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in CVPR, 2017.
-  J. Deng, Y. Zhou, and S. Zafeiriou, “Marginal loss for deep face recognition,” in CVPRW, 2017.
-  R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis,” in ICCV, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
-  A. Bansal, C. Castillo, R. Ranjan, and R. Chellappa, “The do’s and don’ts for cnn-based face verification,” in arXiv:1705.07426, 2017.
-  P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” in TOG, 2003.