With the rapid development of the internet in recent years, a growing number of image and video sharing sites such as Facebook and YouTube have emerged. This has led to a dramatic increase in the amount of human face data available online and has renewed interest in large, unconstrained face datasets [9, 21]. However, domain transformations caused by significant illumination variation, partial occlusion, and poor or even absent alignment make it difficult for most existing vision algorithms, such as 3D face reconstruction and face recognition, to work. Batch image alignment aims to align multiple images of an object or objects of interest to a fixed canonical template [1, 12]. With an effective batch image alignment algorithm, an unconstrained image set can be normalized and the information encoded in it can be harnessed intelligently. In this work, we focus on batch face alignment and redefine the problem as aligning multiple face images of an individual to a fixed canonical template with normalized pose, expression, illumination and occlusion.
To a large extent, progress in batch image alignment has been driven by the introduction of increasingly sophisticated measures of image similarity. Some algorithms try to transform a misaligned matrix (a set of images flattened and stacked as the columns of a matrix) into a rank-one matrix in order to obtain a set of similar images. For example, Learned-Miller's influential congealing algorithm [11, 8] minimizes the sum of entropies of pixel values at each pixel location, while the least squares congealing procedure of [3, 4] seeks an alignment that minimizes the sum of squared distances between pairs of images. However, if the images exhibit large illumination variation, the aligned matrix might have an unknown rank higher than one. In this case, Vedaldi et al. [20] choose to minimize the rank of the aligned matrix. But such algorithms are unable to handle the corruptions and occlusions that often occur in real images. Inspired by the Robust Parameterized Component Analysis (RPCA) algorithm of [2], Robust Alignment by Sparse and Low-rank Decomposition (RASL) [16] accounts for both large illumination variation and gross pixel corruptions by decomposing a misaligned matrix into the sum of a sparse noise matrix and a low-rank matrix of recovered aligned images. Overall, most existing batch image alignment algorithms solve the problem in a linear manner, so it is difficult for them to handle non-linear variations such as illumination changes, partial occlusion and extreme facial expressions.
GANs [6], one of the most popular topics of the last few years, have shown a remarkable ability to transform between domains. The adversarial loss, which guides the model to produce synthetic images that are indistinguishable from real ones, is the key to GANs' success. GANs have reached unprecedented heights in image generation [5, 18] and image editing [23]. Recently, they have also achieved impressive results in conditional image generation applications such as text-to-image synthesis [19] and image inpainting [15]. In all these works, the distributions of both the source and target domains are explicit, meaning that numerous samples are available for the GAN to fit the distributions of both sides. In this work, we show that, given a proper signal, a GAN is capable of learning the transformation from an explicit source domain to an implicit target domain in an unsupervised way.
Contributions. In this paper, we propose a novel unsupervised learning model for robustly and efficiently aligning human face images despite partial occlusions and large variations in illumination. Our solution builds on recent advances in GANs, sparse decomposition and low-rank analysis. It solves the batch face alignment problem by guiding the training of a GAN with low-rank generation and a sparse noise constraint. We show how a GAN can be trained to fit an implicit distribution if a proper signal is given. We also verify the efficacy and efficiency of our model through experiments on real face images.
Organization. The remainder of this paper is organized as follows. In Section 2, we describe how RASL aligns a set of images and how a GAN transfers between domains, and then introduce our proposed method in detail. In Section 3, we provide experimental results that showcase the efficacy of our model on real images. Section 4 provides concluding remarks and proposes potential extensions to our model, followed by the acknowledgements.
In this section, we present our method for batch face alignment. The proposed network is a variant of GAN supervised by low-rank generation and a sparse noise constraint, which learns an effective transformation from the source domain to an implicit target domain.
2.1 Sparse and Low-rank Decomposition
It is common to measure the similarity of a set of images by the rank of the matrix whose columns are the flattened images. A well-aligned matrix should have low rank because its images are linearly correlated. However, in most practical scenarios, this low-rank structure breaks down easily if the images are even slightly misaligned with respect to each other or if the images contain any occlusion or corruption.
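The effect of misalignment on rank can be illustrated with a minimal NumPy sketch. The synthetic random "image", the brightness gains and the horizontal shifts are all illustrative assumptions, not data or settings from the paper.

```python
import numpy as np

def stack_as_columns(images):
    """Flatten each H x W image and stack the results as matrix columns."""
    return np.stack([img.ravel() for img in images], axis=1)

rng = np.random.default_rng(0)
base = rng.random((16, 16))  # hypothetical stand-in for a face image

# Aligned set: the same image under different global brightness gains,
# so every column is a scalar multiple of the first one.
gains = [0.5, 0.8, 1.0, 1.3]
aligned = [g * base for g in gains]

# Misaligned set: each copy is additionally shifted by a different offset.
shifts = [0, 1, 2, 3]
misaligned = [np.roll(g * base, shift=s, axis=1) for g, s in zip(gains, shifts)]

rank_aligned = np.linalg.matrix_rank(stack_as_columns(aligned))
rank_misaligned = np.linalg.matrix_rank(stack_as_columns(misaligned))
print(rank_aligned, rank_misaligned)
```

In this toy setting the aligned stack has rank one, while even one-pixel shifts make the columns linearly independent and raise the rank.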
RASL [16] formulates batch alignment as the following optimization problem:

$$\min_{A, E, \tau} \; \operatorname{rank}(A) + \gamma \|E\|_0 \quad \text{s.t.} \quad D \circ \tau = A + E,$$

where $D$ represents the input misaligned matrix and $\tau$ is a set of invertible transformations such that $D \circ \tau = A + E$. The low-rank matrix $A$ consists of well-aligned images, while the sparse matrix $E$ represents the errors caused by occlusions and corruptions. $\gamma$ is a parameter that trades off the rank of the solution against the sparsity of the error. The objective searches for a set of transformations such that the rank of the transformed images becomes as small as possible once the sparse errors are subtracted. The low-rank matrix $A$ provides a well-aligned image set that is free of occlusion and corruption.
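As a rough illustration of how this objective scores a candidate decomposition, the sketch below evaluates the rank of the low-rank part plus a weighted count of non-zero errors. The names `A`, `E`, `gamma` and `D_tau` (the already transformed observations), the toy data and the value `gamma=0.1` are assumptions made for the example only; the search over transformations is not shown.

```python
import numpy as np

def rasl_objective(A, E, gamma):
    """Evaluate rank(A) + gamma * ||E||_0 for a candidate decomposition."""
    return np.linalg.matrix_rank(A) + gamma * np.count_nonzero(E)

rng = np.random.default_rng(0)
template = rng.random(64)                      # hypothetical flattened aligned image
A = np.outer(template, [1.0, 0.9, 1.1, 0.8])   # rank-one part: 4 scaled copies
E = np.zeros_like(A)
E[:5, 2] = 1.0                                 # five corrupted pixels in the third image
D_tau = A + E                                  # transformed observations (constraint holds)

# Decomposition that separates the sparse corruption from the low-rank part.
separated = rasl_objective(A, E, gamma=0.1)

# Trivial decomposition that leaves the corruption inside the "low-rank" part.
trivial = rasl_objective(D_tau, np.zeros_like(D_tau), gamma=0.1)
print(separated, trivial)
```

With these toy values the separated decomposition attains the lower objective, which is the behaviour the trade-off parameter is meant to control.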
In the original RASL method, since the size of the occlusions can differ greatly between image sets, the trade-off parameter must be adjusted manually to balance the low-rank constraint against the sparse noise constraint for each input. In addition, because the transformations are learned for each misaligned image separately, RASL must search for a set of transformations from scratch every time a new image set is fed into the model, which is quite inefficient. Therefore, both the robustness and the efficiency of RASL leave considerable room for improvement.
GANs are well known for their ability to learn transformations between domains. A GAN usually consists of a generator and a discriminator. The generator aims to map samples from the source domain into the target domain, while the discriminator tries to distinguish real images from the target domain from fake images produced by the generator. The objective of GAN can be formulated as:
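$$\min_G \max_{D_Y} \; \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))],$$

where $y$ are images from the target domain $Y$ while $x$ are samples from the source domain $X$. The generator $G$ tries to transfer the input $x$ into an image $G(x)$ that follows the distribution of the target domain $Y$, and the discriminator $D_Y$ produces the probability that its input is a real sample from $Y$.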
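To make the adversarial objective concrete, the following sketch estimates its value from a batch of discriminator outputs. The function name `gan_value`, the clipping constant and the example probabilities are illustrative assumptions; no network is trained here.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(y)] + E[log(1 - D(G(x)))].

    d_real: discriminator outputs on real target-domain samples, in (0, 1).
    d_fake: discriminator outputs on generated samples, in (0, 1).
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)  # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that separates real from fake well yields a high value ...
confident = gan_value(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
# ... while a fully fooled discriminator (0.5 everywhere) yields -2 log 2.
fooled = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(confident, fooled)
```

The discriminator pushes this value up, while the generator pushes it down by making its outputs harder to tell apart from real samples.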
A GAN can learn a mapping between domains only when a large number of samples is available to help the model determine the distributions of both domains. However, the target domain is implicit in many problems. For example, in batch image alignment it is difficult to define the alignment result for an input image set, since the aligned result can be totally different even if only one image in the input set is modified. In such a problem, there are no ground-truth samples to help determine the target distribution. Moreover, the original GAN generally aims to learn an image-to-image mapping, whereas in batch face alignment we search for a set-to-set mapping instead. Therefore, although GANs show an impressive ability to map between different domains, it is not straightforward to adopt them for our task.
2.3 Low-rank GAN
Inspired by RASL as well as GAN, our proposed model takes a set of misaligned images as input and generates a low-rank matrix as well as a sparse matrix of noise. Following [3, 4], we make the aligned matrix a rank-one matrix, since well-aligned images are ideally identical in every aspect. We can therefore abandon the low-rank constraint and make the generator produce only one aligned image from the whole misaligned image set. We keep the sparse noise constraint to make sure that the transformations made by the GAN actually perform the alignment we expect, which depends on all the input images. Moreover, since the transformations of different images in an image set are supposed to be correlated, we concatenate a fixed-length image set in the channel dimension as the input of the generator. The objective of our model is expressed as Eq. (5).
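$$\mathcal{L}(G, D_Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] + \lambda \|E\|_0, \qquad (5)$$

where $x$ is a set of misaligned images from domain $X$ concatenated in the channel dimension. The generator $G$ tries to synthesize an image $G(x)$ that looks similar to images from domain $Y$ as well as to the images in the input $x$, while the discriminator $D_Y$ aims to distinguish between the synthetic image $G(x)$ and real samples $y$. $\|E\|_0$ is the $\ell_0$-norm of the noise matrix $E$, and $\lambda$ controls the relative importance of the adversarial objective and the sparse noise constraint. We aim to solve:

$$G^* = \arg\min_G \max_{D_Y} \mathcal{L}(G, D_Y). \qquad (6)$$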
The optimization of Eq. (6) is not directly tractable because of the nonconvexity of the $\ell_0$-norm. It was shown that minimizing the $\ell_0$-norm can be replaced by minimizing the $\ell_1$-norm as long as the number of non-zero entries in the noise matrix $E$ is not too large [2]. Therefore, the objective of the sparse noise constraint can be modified into

$$\mathcal{L}_{sparse} = \|E\|_1.$$
Any standard gradient-based learning rule can be used for the updates; we used the standard Adam solver [10] in our experiments.
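The sparse noise constraint discussed above can be sketched as follows. This example assumes, purely for illustration, that the noise matrix is the residual between each input image and the single generated aligned image; the exact definition is not recoverable from the text, and the helper names and toy data are hypothetical.

```python
import numpy as np

def noise_matrix(inputs, aligned):
    """Residuals between N input images (N, H, W) and one aligned image (H, W).

    Returns a matrix with one flattened residual per column (assumed definition).
    """
    residual = inputs - aligned[None]
    return residual.reshape(inputs.shape[0], -1).T

def l0_norm(E):
    """Number of non-zero entries (nonconvex, not differentiable)."""
    return int(np.count_nonzero(E))

def l1_norm(E):
    """Sum of absolute values (convex surrogate usable with gradient methods)."""
    return float(np.abs(E).sum())

aligned = np.full((4, 4), 0.5)                # hypothetical generator output
inputs = np.repeat(aligned[None], 3, axis=0)  # three inputs identical to it ...
inputs[1, 0, :2] = 1.0                        # ... except two occluded pixels

E = noise_matrix(inputs, aligned)
print(E.shape, l0_norm(E), l1_norm(E))
```

Here only two entries of the residual are non-zero, so both norms are small; a generator output unrelated to the inputs would make the residual dense and be penalized.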
2.4 Network Architecture
We adopt the network architecture of CycleGAN [24]. The discriminator is a PatchGAN: since it classifies whether overlapping patches of an image are real or fake in a fully convolutional manner, it can be applied to arbitrarily sized inputs, and such a patch-level discriminator has fewer parameters than a full-image discriminator. Instance normalization is also used during training. The overall network structure is shown in Fig. 1. The generator consists of an encoder, a transformer and a decoder. It takes multiple images of an individual as input, concatenates them in the channel dimension before feeding them to the encoder, and finally produces a synthetic image of the same size as the input. After that, both the input images and the synthetic image are fed to the discriminator, which produces the probabilities that they are real or fake.
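The channel-wise concatenation of an image set can be sketched in a few lines of NumPy. The set size of 8 matches the experimental setup described later; the 32 x 32 resolution, the channels-last layout and the helper name `concat_set` are assumptions made for illustration.

```python
import numpy as np

def concat_set(images):
    """Concatenate N images of shape (H, W, C) into one (H, W, N*C) array."""
    return np.concatenate(images, axis=-1)

# Eight hypothetical RGB images; image i is filled with the constant value i
# so that its position in the concatenated tensor is easy to verify.
image_set = [np.full((32, 32, 3), i, dtype=np.float32) for i in range(8)]

x = concat_set(image_set)
print(x.shape)  # spatial size is unchanged, channels are stacked
```

Channels `3*i` to `3*i + 2` of `x` hold image `i`, so the generator sees the whole set at once while its output keeps the spatial size of a single image.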
We evaluate our model on the AR Face Database [14]. This database contains over 4,000 color images of 70 men and 56 women. Every image contains one frontal face, with varying facial expressions, illumination conditions and occlusions (sunglasses and scarf). We split the database into a training set and a testing set containing 116 and 10 people, respectively. We first compare the alignment results of RASL with those of our model to show that our model is more robust to large occlusions and extreme variations of expression. Then, we reconstruct 3D face models from the aligned results of both methods. Through this experiment, we aim to show that a high-quality aligned image is important for face image processing tasks such as 3D face reconstruction.
3.1 Implementation Details
For all input images, we first crop the face region with the MTCNN face detector [22] and then resize the crops to a fixed size. Each image is standardized to zero mean and unit standard deviation. We randomly split the images of each individual into several image sets of 8 images each. We replace the negative log-likelihood objective with a least-squares loss [13], which is more stable during training and helps the model generate higher-quality results. We use the standard Adam solver with a batch size of 16. All networks are trained from scratch with a fixed learning rate of 0.0002. Our model is implemented in TensorFlow, and all experiments are run on a standard PC with an NVIDIA Titan Pascal GPU.
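The preprocessing and loss choices above can be sketched as follows. This is a minimal NumPy illustration, not the TensorFlow implementation used in the experiments: the helper names, the 16 x 16 image size, the decision to drop leftover images, and the 0/1 label convention in the least-squares losses are all assumptions.

```python
import numpy as np

def standardize(img, eps=1e-8):
    """Shift and scale one image to zero mean and unit standard deviation."""
    return (img - img.mean()) / (img.std() + eps)

def split_into_sets(images, set_size=8, rng=None):
    """Randomly partition one individual's images into sets of `set_size`.

    Leftover images that do not fill a complete set are dropped (assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(images))
    n_sets = len(images) // set_size
    return [[images[i] for i in order[k * set_size:(k + 1) * set_size]]
            for k in range(n_sets)]

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss with targets 1 (real) and 0 (fake)."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push discriminator outputs on fakes to 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

rng = np.random.default_rng(0)
images = [rng.random((16, 16, 3)) for _ in range(20)]  # hypothetical face crops
normalized = [standardize(im) for im in images]
sets = split_into_sets(normalized, set_size=8, rng=rng)
print(len(sets), len(sets[0]))
```

Unlike the log-likelihood form, the squared-error losses stay finite and give non-vanishing gradients even when the discriminator is very confident, which is one common explanation for the improved stability.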
3.2 Aligned Results
For RASL, we leave all parameters at their defaults and take the average of the images contained in the produced low-rank matrix as its output. Both the input and the output of the low-rank decomposition in RASL are in grayscale. Fig. 2 shows a few examples from the testing set. Comparing the outputs in the second row, when there is a large occlusion such as a scarf, RASL cannot remove it completely, whereas our model handles it better. Comparing the outputs in the third and last rows, when the input images vary widely in expression, the mouth region in RASL's output is blurred while ours is sharper.
3.3 3D Reconstruction
Hu et al. proposed a 3D reconstruction algorithm [7] that takes frontal face images of an individual and reconstructs a 3D face model. We use it and carefully set all the parameters, as well as the direction of illumination in each image, as it requires. For each individual, we use 8 original images, 8 images aligned by RASL and one aligned image from our model for the reconstruction. From Fig. 3 (b), it is clear that occlusion affects the quality of the reconstructed 3D model. The second row of Fig. 3 (c) shows that a partially blurred aligned face from RASL can sometimes make the situation worse, while (d) shows that our aligned result, which encodes all the useful information and is free of occlusion, does help the reconstruction.
We propose an effective batch face alignment method based on GAN and use low-rank and sparse constraints to supervise the training of our model, showing that a GAN is capable of learning a mapping to an implicit domain. Our model is more efficient than traditional methods based on matrix decomposition since, after off-line training, it does not need to learn transformations from scratch when a new image set is fed in. The efficacy of our model is also verified by the experiments. In the future, we would like to consider the case of pose variation.
This project is supported by the NSFC (No. U1611461, 61672544), Guangdong Natural Science Foundation (No. 2015A030311047), Fundamental Research Funds for the Central Universities (No. 161gpy41), and Tip-top Scientific and Technical Innovative Youth Talents of Guangdong special support program (No. 2016TQ03X263).
[1] (1992) A survey of image registration techniques. CSUR 24 (4), pp. 325–376.
[2] Robust principal component analysis. JACM 58 (3), pp. 11.
[3] (2008) Least squares congealing for unsupervised alignment of images. In CVPR, pp. 1–8.
[4] (2009) Least-squares congealing for large numbers of images. In ICCV, pp. 1949–1956.
[5] (2015) Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494.
[6] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[7] (2017) Sparse transfer for facial shape-from-shading. PR 68, pp. 272–285.
[8] (2007) Unsupervised joint alignment of complex images. In ICCV, pp. 1–8.
[9] (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.
[10] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] (2006) Data driven image models through continuous joint alignment. IEEE TPAMI 28 (2), pp. 236–250.
[12] (1998) A survey of medical image registration. Medical Image Analysis 2 (1), pp. 1–36.
[13] Multi-class generative adversarial networks with the l2 loss function. CoRR, abs/1611.04076.
[14] (2007) The AR face database, 1998. Computer Vision Center, Technical Report 3, pp. 5.
[15] (2016) Context encoders: feature learning by inpainting. In Proceedings of CVPR, pp. 2536–2544.
[16] (2012) RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE TPAMI 34 (11), pp. 2233–2246.
[17] (2003) Mutual-information-based registration of medical images: a survey. IEEE Transactions on Medical Imaging 22 (8), pp. 986–1004.
[18] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[19] (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
[20] (2008) Joint data alignment up to (lossy) transformations. In CVPR, pp. 1–8.
[21] (2011) Face recognition in unconstrained videos with matched background similarity. In CVPR, pp. 529–534.
[22] (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503.
[23] (2016) Generative visual manipulation on the natural image manifold. In ECCV, pp. 597–613.
[24] Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.