
Batch Face Alignment using a Low-rank GAN

by   Jiabo Huang, et al.

This paper studies the problem of aligning a set of face images of the same individual into a normalized image while removing outliers such as partial occlusion, extreme facial expression, and significant illumination variation. Our model seeks an optimal image-domain transformation such that the matrix of misaligned images can be decomposed as the sum of a sparse noise matrix and a rank-one matrix of aligned images. The transformation is learned in an unsupervised manner, meaning that ground-truth aligned images are unnecessary for our model. Specifically, we exploit the remarkable non-linear transforming ability of the generative adversarial network (GAN) and guide it with low-rank generation as well as a sparse noise constraint to achieve face alignment. We verify the efficacy of the proposed model with extensive experiments on real-world face databases, demonstrating higher accuracy and efficiency than existing methods.



1 Introduction

Along with the rapid development of the internet in recent years, an increasing number of image- and video-sharing sites such as Facebook and YouTube have arisen, leading to a dramatic increase in the amount of human face data available online and inspiring renewed interest in large, unconstrained face datasets [9, 21]. However, domain transformations caused by significant illumination variation, partial occlusion, and poor or even absent alignment make it difficult for most existing vision algorithms, such as 3D face model reconstruction and face recognition, to work. Batch image alignment aims to align multiple images of an object or objects of interest to a fixed canonical template [1, 12]. With the help of an effective batch image alignment algorithm, unconstrained image sets can be normalized and the information encoded in them can be harnessed intelligently. In this work, we focus on batch face alignment and define this problem as aligning multiple face images of an individual to a fixed canonical template with normalized poses, expressions, illumination conditions and occlusions.

To a large extent, progress in batch image alignment has been driven by the introduction of increasingly sophisticated measures of image similarity [17]. Some algorithms try to transform a misaligned matrix¹ into a rank-one matrix in order to obtain a set of similar images. For example, Learned-Miller's influential congealing algorithm [11, 8] minimizes the sum of entropies of pixel values at each pixel location, while the least-squares congealing procedure of [3, 4] seeks an alignment that minimizes the sum of squared distances between pairs of images. However, if there is large illumination variation across the images, the aligned matrix¹ might have an unknown rank higher than one. In this case, Vedaldi et al. [20] choose to minimize the rank of the aligned matrix. But such algorithms are unable to handle the corruptions and occlusions that often occur in real images. Inspired by the Robust Principal Component Analysis (RPCA) algorithm of [2], Robust Alignment by Sparse and Low-rank Decomposition (RASL) [16], which considers both large illumination variation and gross pixel corruptions, decomposes a misaligned matrix as the sum of a sparse noise matrix and a low-rank matrix of recovered aligned images. Overall, most existing batch image alignment algorithms solve the problem in a linear manner, so it is difficult for them to handle non-linear variations such as illumination conditions, partial occlusions and extreme facial expressions.

¹ Flatten and stack a set of aligned/misaligned images as the columns of a matrix.

GANs [6], one of the most popular topics of recent years, have been shown to have a remarkable ability to transform between domains. The adversarial loss, which guides the model to produce indistinguishable synthetic images, is the key to GANs' success. GANs have reached unprecedented heights in image generation [5, 18] and image editing [23]. Recently, they have also achieved impressive results in conditional image generation applications, such as text-to-image synthesis [19] and image inpainting [15]. In all these works, the distributions of both the source and target domains are explicit, which means that there are numerous samples the GAN can use to fit the distributions of both sides. In this work, we show that, given a proper signal, a GAN is capable of learning the transformation from an explicit source domain to an implicit target domain in an unsupervised way.

Contributions. In this paper, we propose a novel unsupervised learning model for robustly and efficiently aligning human face images despite partial occlusions and large variations in illumination. Our solution builds on recent advances in GANs, sparse decomposition, and low-rank analysis. It solves the batch face alignment problem by guiding the training of a GAN with low-rank generation and a sparse noise constraint. We show how a GAN can be trained to fit an implicit distribution if a proper signal is given. We also verify the efficacy and efficiency of our model through experiments on real face images.

Organization. The remainder of this paper is organized as follows. In Section 2, we review how RASL aligns a set of images and how GANs transfer between domains, and introduce our proposed method in detail. We then provide experimental results in Section 3 to showcase the efficacy of our model on real images. Section 4 provides concluding remarks and proposes potential extensions to our model; the acknowledgements come last.

2 Methodology

In this section, we present our method for batch face alignment. The proposed network is a variant of GAN supervised by low-rank generation and a sparse noise constraint, which learns an effective transformation from the source domain to an implicit target domain.

2.1 Sparse and Low-rank Decomposition

It is common to measure the similarity of a set of images by the rank of the matrix whose columns are the individual images. A well-aligned matrix should have low rank because its images are linearly correlated. However, in most practical scenarios this low-rank structure breaks down easily if the images are even slightly misaligned with respect to each other, or if there are any occlusions or corruptions in the images.
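As a concrete illustration of this rank-based similarity measure, the following NumPy sketch (using a toy 3×4 "image"; the variable names are illustrative) stacks flattened copies of an image as matrix columns: identical copies give a rank-one matrix, while a single one-pixel shift of one copy already raises the rank.

```python
import numpy as np

# A toy 3x4 "image"; real images would be flattened the same way.
img = np.arange(12.0).reshape(3, 4)

# Five perfectly aligned copies stacked as columns: rank one.
D_aligned = np.stack([img.ravel()] * 5, axis=1)

# Replace one copy with a one-pixel horizontal shift: the rank rises.
shifted = np.roll(img, 1, axis=1)
D_misaligned = np.stack([img.ravel()] * 4 + [shifted.ravel()], axis=1)
```

Here `np.linalg.matrix_rank(D_aligned)` is 1, while the misaligned stack has rank 2, showing how fragile the low-rank structure is.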

RASL [16] models batch image alignment as a sparse and low-rank decomposition, formulated as Eq. (1):

    min_{A,E,τ} ‖A‖_* + λ‖E‖_1   s.t.   D∘τ = A + E,    (1)

where D represents the input misaligned matrix and τ is a set of invertible transformations such that D∘τ = A + E. The low-rank matrix A consists of well-aligned images, while the sparse matrix E represents the errors caused by occlusions and corruptions. λ is a parameter that trades off the rank of the solution versus the sparsity of the error. The objective searches for a set of transformations τ such that the rank of the transformed images becomes as small as possible once the sparse errors are subtracted. The low-rank matrix A provides a well-aligned image set that is free of occlusion and corruption.
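A minimal numerical sketch of this sparse-plus-low-rank decomposition (omitting the transformations τ) can be written with NumPy, using the standard inexact augmented Lagrange multiplier scheme for RPCA [2]; the parameter defaults below are common choices for that scheme, not values from the paper.

```python
import numpy as np

def shrink(M, tau):
    """Soft-thresholding: proximal operator of the l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def sv_threshold(M, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(D, lam=None, iters=300):
    """Decompose D into A (low rank) + E (sparse) via inexact ALM."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(D, 2)  # common initial penalty weight
    Y = np.zeros_like(D)              # Lagrange multiplier
    E = np.zeros_like(D)
    for _ in range(iters):
        A = sv_threshold(D - E + Y / mu, 1.0 / mu)
        E = shrink(D - A + Y / mu, lam / mu)
        Y = Y + mu * (D - A - E)
        mu = min(mu * 1.05, 1e7)      # gradually tighten the constraint
    return A, E

# Usage: recover a rank-one matrix corrupted by two large sparse errors.
rng = np.random.default_rng(0)
L = np.outer(rng.standard_normal(30), rng.standard_normal(20))
S = np.zeros((30, 20))
S[3, 4], S[17, 9] = 10.0, -8.0
A, E = rpca(L + S)
```

The alternation applies the two proximal operators in turn: singular-value thresholding drives A toward low rank, while soft-thresholding keeps E sparse.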

In the original RASL method, since the size of occlusions in different image sets can differ greatly, the λ must be adjusted manually to balance the weights of the low-rank and sparse noise constraints for different inputs. Moreover, because the transformations are learned for each misaligned image separately, every time a new image set is fed into the model, RASL must search for a set of transformations from scratch, which is quite inefficient. Therefore, both the robustness and the efficiency of RASL leave considerable room for improvement.

2.2 GAN

GANs enjoy a good reputation for being able to learn between-domain transformations. A GAN is usually constructed from a generator and a discriminator. The generator aims to map samples from the source domain into the target domain, while the discriminator tries to distinguish real images from the target domain against the fake images produced by the generator. The objective of a GAN can be formulated as:

    min_G max_D V(D, G) = 𝔼_{y∼p_Y(y)}[log D(y)] + 𝔼_{x∼p_X(x)}[log(1 − D(G(x)))],

where y are images from the target domain Y while x are samples from the source domain X. G tries to transfer the input x into an image G(x) that follows the distribution of the target domain, and D produces the probability of whether its input is a real sample from Y.
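To make the minimax value concrete, here is a tiny NumPy sketch of V(D, G) evaluated on finite batches of discriminator outputs; `gan_value` is an illustrative helper, not code from the paper. A discriminator that is completely fooled (outputting 0.5 everywhere) yields the well-known equilibrium value −log 4.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(y)] + E[log(1 - D(G(x)))] on finite batches.

    d_real: discriminator outputs on real target-domain images y.
    d_fake: discriminator outputs on generated images G(x).
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A completely fooled discriminator gives the equilibrium value -log 4.
v_eq = gan_value(np.full(8, 0.5), np.full(8, 0.5))

# A discriminator that separates real from fake attains a higher value.
v_good = gan_value(np.full(8, 0.9), np.full(8, 0.1))
```

The discriminator ascends this value while the generator descends it, which is exactly the min-max game written above.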

A GAN can learn the between-domain mapping only when a large number of samples are available to help the model determine the distributions of both domains. However, the target domain may be implicit in many problems. For example, in batch image alignment it is difficult to define the aligned result for an input image set, since if even a single image in the input set is modified, the aligned result will be totally different. In such a problem, no ground-truth samples are available to determine the target distribution. Besides, the original GAN generally learns an image-to-image mapping, but in batch face alignment we are searching for a set-to-set mapping instead. Therefore, although GANs show an impressive ability to map between different domains, it is not straightforward to adopt them for our task.

2.3 Low-rank GAN

Inspired by RASL as well as GANs, our proposed model takes a set of misaligned images as input and generates a low-rank matrix as well as a sparse noise matrix. We make the aligned matrix a rank-one matrix, as [3, 4] did, since ideally the well-aligned images should be identical in every aspect. Therefore, we drop the explicit low-rank constraint and make the generator produce only one aligned image from the whole misaligned image set. We keep the sparse noise constraint to ensure that the transformations made by the GAN actually perform the alignment as expected, which relates the output to all the input images. Furthermore, since the transformations of different images in an image set are supposed to be correlated, we concatenate a fixed-length image set along the channel dimension as the input of the generator. The objective of our model is expressed as Eq. (5):


where is a set of misaligned images from domain concatenated in the channels dimension. tries to synthesis image that look similar to images from domain as well as the images in the input , while aims to distinguish between synthetic image and real samples . is the of the noise matrix and controls the relative importance of adversarial objective and sparse noise constraint’s objective. We aim to solve:

    G* = arg min_G max_D L(G, D).    (6)

The optimization of Eq. (6) is not directly tractable because of the non-convexity of the ℓ0-norm. It has been shown that optimizing the ℓ0-norm can be replaced by optimizing the ℓ1-norm as long as the number of non-zero entries in the matrix E is not too large [2]. Therefore, the objective of the sparse noise constraint can be modified into

    λ‖E‖_1.
The updates can use any standard gradient-based learning rule; we use the standard Adam solver [10] in our experiments.
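As an illustration of how the ℓ1 sparse-noise term can be computed in our rank-one setting, the NumPy sketch below builds E by subtracting the single generated aligned image from every image in the input set; the column layout and the helper name are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def sparse_noise_loss(image_set, aligned, lam=0.1):
    """lam * ||E||_1, where column i of E is the difference between
    input image i and the single generated aligned image."""
    E = np.stack([img.ravel() - aligned.ravel() for img in image_set], axis=1)
    return lam * np.abs(E).sum()

# Identical inputs incur no penalty; a single corrupted pixel adds lam * |delta|.
base = np.ones((4, 4))
clean_set = [base.copy() for _ in range(8)]
corrupted = base.copy()
corrupted[1, 2] += 3.0
```

Because every column of E is measured against the same generated image, minimizing this term ties the generator's single output to the whole input set, which is what makes the rank-one formulation work.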

2.4 Network Architecture

We adopt the network architecture from CycleGAN [24], which has achieved impressive results in neural style transfer. The generator contains two stride-2 convolution layers, 9 residual blocks, and two fractionally-strided convolution layers with stride 1/2. The discriminator is a 70 × 70 PatchGAN: since it classifies whether several overlapping patches of an image are real or fake in a fully convolutional manner, it can be applied to arbitrarily-sized inputs, and such a patch-level discriminator has fewer parameters than a full-image discriminator. Instance normalization is also used during training. The overall network structure is shown in Fig. 1. The generator is constructed from an encoder, a transformer and a decoder. It takes multiple images of an individual as input, concatenates them along the channel dimension before feeding them to the encoder, and finally produces a synthetic image of the same size as the input. After that, both the input images and the synthetic image are fed to the discriminator, which then produces the probabilities of whether they are real or fake.
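The set-to-one mapping above amounts to concatenating the fixed-length image set along the channel axis before the encoder; a minimal shape sketch (NumPy, assuming an HxWxC layout and the helper name for illustration):

```python
import numpy as np

def make_generator_input(image_set):
    """Concatenate a fixed-length image set along the channel dimension."""
    return np.concatenate(image_set, axis=-1)

# Eight HxWx3 face images become a single HxWx24 generator input.
image_set = [np.zeros((64, 64, 3)) for _ in range(8)]
x = make_generator_input(image_set)
```

Fixing the set length keeps the encoder's input channel count constant, which is why the model uses fixed-length image sets.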

Figure 1: Network structure. The network is constructed from a generator and a discriminator; the generator consists of an encoder, a transformer and a decoder. Best viewed in color.

3 Experiments

We evaluate our model on the AR Face Database [14]. This database contains over 4,000 color images of 70 men and 56 women. All images contain one frontal face with varying facial expressions, illumination conditions and occlusions (sunglasses and scarves). We separate the database into a training set and a testing set containing 116 and 10 people, respectively. We first compare the aligned results of RASL with those of our model to show that our model is more robust to large occluded areas and extreme expression variation. Then, we reconstruct 3D face models from the aligned results of both models. Through this experiment, we aim to show that high-quality aligned images are crucial for face image processing tasks such as 3D face reconstruction.

3.1 Implementation Details

For all input images, we first crop the face region with the help of the MTCNN face detector [22], then resize the crops to a fixed resolution. Each image is standardized to zero mean and unit standard deviation. We randomly separate all images of each individual into several image sets of 8 images each. We replace the negative log-likelihood objective with a least-squares loss [13], which is more stable during training and helps the model generate higher-quality results. We set the trade-off weight λ to a fixed value and use the standard Adam solver with a batch size of 16. All networks are trained from scratch with a fixed learning rate of 0.0002. Our model is implemented in TensorFlow and all experiments are run on a standard PC with an NVIDIA Titan Pascal GPU.
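Two pieces of the pipeline above can be sketched directly in NumPy: the per-image standardization, and the least-squares losses of [13] that replace the negative log-likelihood objective (the helper names and the epsilon guard are illustrative, not the paper's exact code).

```python
import numpy as np

def standardize(img, eps=1e-8):
    """Standardize an image to zero mean and unit standard deviation."""
    return (img - img.mean()) / (img.std() + eps)

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss [13]: push real outputs to 1, fake to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push discriminator outputs on fakes to 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

Unlike the log-likelihood loss, these quadratic terms keep gradients non-vanishing for samples the discriminator already classifies confidently, which is the stability benefit noted above.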

3.2 Aligned Results

For RASL, we leave all parameters at their defaults and take the average of the images in the produced low-rank matrix as its output. Both the input and output of the low-rank decomposition in RASL are grayscale. Fig. 2 shows a few examples from the testing set. Comparing the outputs in the second row, when there is a large occluded area such as a scarf, RASL cannot remove it completely, whereas our model handles it better. Comparing the outputs in the third and last rows, when there is large expression variation in the input images, the mouth region in RASL's output is blurred while ours is sharper.

Figure 2: Comparison of the aligned results from RASL and our model. Each row shows samples from one individual in the testing set. Columns (a)-(h) are the input images, (i) is the aligned result from RASL while (j) is the result of our model.

3.3 3D Reconstruction

Hu et al. proposed a 3D reconstruction algorithm [7] that takes frontal face images of an individual to reconstruct a 3D face model. We make use of it and carefully set all parameters, as well as the illumination direction of each image, as it requires. For each individual, we use 8 original images, 8 of RASL's aligned images, and the single aligned image from our model for reconstruction. From Fig. 3 (b), it is clear that occlusion degrades the quality of the reconstructed 3D model. The second row of Fig. 3 (c) shows that a partially blurred aligned face from RASL can sometimes make the situation worse, while (d) shows that our aligned result, which encodes all the useful information and is free of occlusion, does help the reconstruction.

Figure 3: Comparison of the 3D reconstructed results. (a) is sample images from input. The complete input image sets are shown in Fig. 2. (b), (c) and (d) compare the 3D reconstructed results of original images, aligned images from RASL and our model respectively.

4 Conclusion

We propose an effective batch face alignment method based on GANs and use low-rank and sparse constraints to supervise the training of our model, demonstrating that a GAN is capable of learning a mapping to an implicit domain. Our model is more efficient than traditional matrix-decomposition-based methods since, after off-line training, it does not need to learn transformations from scratch when a new image set is fed in. The efficacy of our model is also verified by the experiments. In the future, we would like to consider the case of pose variation.


Acknowledgements. This project is supported by the NSFC (No. U1611461, 61672544), the Guangdong Natural Science Foundation (No. 2015A030311047), the Fundamental Research Funds for the Central Universities (No. 161gpy41), and the Tip-top Scientific and Technical Innovative Youth Talents of Guangdong Special Support Program (No. 2016TQ03X263).


References

  • [1] L. G. Brown (1992) A survey of image registration techniques. CSUR 24 (4), pp. 325–376. Cited by: §1.
  • [2] E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis? JACM 58 (3), pp. 11. Cited by: §1, §2.3.
  • [3] M. Cox, S. Sridharan, S. Lucey, and J. Cohn (2008) Least squares congealing for unsupervised alignment of images. In CVPR, pp. 1–8. Cited by: §1, §2.3.
  • [4] M. Cox, S. Sridharan, S. Lucey, and J. Cohn (2009) Least-squares congealing for large numbers of images. In ICCV, pp. 1949–1956. Cited by: §1, §2.3.
  • [5] E. L. Denton, S. Chintala, R. Fergus, et al. (2015) Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pp. 1486–1494. Cited by: §1.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [7] J. Hu, W. Zheng, X. Xie, and J. Lai (2017) Sparse transfer for facial shape-from-shading. PR 68, pp. 272–285. Cited by: §3.3.
  • [8] G. B. Huang, V. Jain, and E. Learned-Miller (2007) Unsupervised joint alignment of complex images. In ICCV, pp. 1–8. Cited by: §1.
  • [9] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §1.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.3.
  • [11] E. G. Learned-Miller (2006) Data driven image models through continuous joint alignment. IEEE TPAMI 28 (2), pp. 236–250. Cited by: §1.
  • [12] J. A. Maintz and M. A. Viergever (1998) A survey of medical image registration. Medical image analysis 2 (1), pp. 1–36. Cited by: §1.
  • [13] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang (2016) Multi-class generative adversarial networks with the L2 loss function. CoRR abs/1611.04076. Cited by: §3.1.
  • [14] A. Martínez and R. Benavente (1998) The AR face database. Computer Vision Center, Technical Report. Cited by: §3.
  • [15] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of CVPR, pp. 2536–2544. Cited by: §1.
  • [16] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma (2012) RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE TPAMI 34 (11), pp. 2233–2246. Cited by: §1, §2.1.
  • [17] J. P. Pluim, J. A. Maintz, and M. A. Viergever (2003) Mutual-information-based registration of medical images: a survey. IEEE transactions on medical imaging 22 (8), pp. 986–1004. Cited by: §1.
  • [18] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1.
  • [19] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §1.
  • [20] A. Vedaldi, G. Guidi, and S. Soatto (2008) Joint data alignment up to (lossy) transformations. In CVPR, pp. 1–8. Cited by: §1.
  • [21] L. Wolf, T. Hassner, and I. Maoz (2011) Face recognition in unconstrained videos with matched background similarity. In CVPR, pp. 529–534. Cited by: §1.
  • [22] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §3.1.
  • [23] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In ECCV, pp. 597–613. Cited by: §1.
  • [24] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593. Cited by: §2.4.