An implementation of the Attention Conditional GANs (AcGAN) model.
Face aging is of great importance for cross-age recognition and entertainment-related applications. Recently, conditional generative adversarial networks (cGANs) have achieved impressive results for face aging. Existing cGAN-based methods usually require a pixel-wise loss to keep the identity and background consistent. However, minimizing the pixel-wise loss between the input and synthesized images is likely to result in a ghosted or blurry face. To address this deficiency, this paper introduces an Attention Conditional GANs (AcGANs) approach for face aging, which utilizes an attention mechanism to alter only the regions relevant to face aging. In doing so, the synthesized face can well preserve the background information and personal identity without using the pixel-wise loss, and ghost artifacts and blurriness can be significantly reduced. On the benchmark dataset Morph, both qualitative and quantitative experimental results demonstrate superior performance over existing algorithms in terms of image quality, personal identity, and age accuracy.
Face aging, also known as age progression, aims to render a given face image with natural aging effects for a certain age or age group. In recent years, face aging has attracted major attention due to its extensive use in numerous applications, such as entertainment, finding missing children, cross-age face recognition, etc. Although impressive results have been achieved recently [4, 5, 6, 7, 8], many challenges remain due to the intrinsic complexity of aging in nature and the insufficiency of labeled aging data. Intuitively, the generated face images should be photo-realistic, e.g., without serious ghosting artifacts. In addition, the aging accuracy and identity permanence of the generated face images should be guaranteed simultaneously.
Recently, generative adversarial networks (GANs) have shown an impressive ability to generate synthetic images and to perform face aging [11, 4, 5, 6, 7, 8]. These approaches render faces with more natural aging effects in terms of image quality, identity consistency, and aging accuracy than previous conventional solutions, such as prototype-based and physical-model-based methods [12, 13]. However, the problems have not been completely solved. For example, Zhang et al.
first proposed a conditional adversarial autoencoder (CAAE) for face aging by traversing a low-dimensional face manifold, but it cannot preserve the identity information of the generated faces well. To solve this problem, Yang et al. and Wang et al. proposed conditional GANs with a pre-trained neural network to preserve the identity of the generated images. Most existing GAN-based methods train the model with a pixel-wise loss [4, 8] to preserve identity consistency and keep the background information. However, minimizing the Euclidean distance between the synthesized and input images easily causes the synthesized images to become ghosted or blurred. In particular, this problem becomes more severe as the gap between the input age and the target age grows.
Inspired by the success of attention mechanisms in image-to-image translation, in this paper we propose Attention Conditional GANs (AcGANs) to tackle the above-mentioned issues. Specifically, the proposed AcGANs consist of a generator and a discriminator. The generator receives an input image and a target age code, and outputs an attention mask and a color mask. The attention mask learns which regions should be modified for face aging, and the color mask learns how to modify them. The final output of the generator is a combination of the attention and color masks. Since the attention mechanism modifies only the regions relevant to face aging, it can preserve the background information and the personal identity well without using a pixel-wise loss in training. The discriminator consists of an image discriminator and an age classifier, aiming to make the generated face more photo-realistic and to guarantee that the synthesized face lies in the target age group.
The main contributions of this paper are: i) We propose a novel approach for face aging, which utilizes an attention mechanism to modify only the regions relevant to face aging. Thus, the synthesized face can well preserve the background information and personal identity without using a pixel-wise loss, and ghost artifacts and blurriness can be significantly reduced. ii) Both qualitative and quantitative experimental results on Morph demonstrate the effectiveness of our model in terms of image quality, personal identity, and age accuracy.
We divide faces of different ages into five non-overlapping groups, i.e., 10-20, 21-30, 31-40, 41-50, and 51+. Given a face image x ∈ R^{H×W×3}, where H and W are the height and width of the image, respectively, we use a one-hot label c_t to indicate the age group that x belongs to. The aim is to learn a generator G that generates a synthesized face image x̂ = G(x, c_t) that lies in the target age group c_t, looks realistic, and has the same identity as the input face image x.
The proposed approach, shown in Fig. 1, consists of two main modules: i) a generator G trained to generate a synthesized face x̂ with target age c_t; ii) a discriminator D that aims to make x̂ look realistic and guarantees that x̂ lies in the target age group c_t.
Generator. Given an input face image x and a target age group c_t, we first pad the one-hot label c_t into feature maps of size H × W × 5. Then, we form the input of the generator as the channel-wise concatenation [x, c_t].
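The padding of the one-hot label into spatial feature maps can be sketched as follows. This is a framework-agnostic NumPy illustration, not the paper's PyTorch code; the function name and shapes are hypothetical.

```python
import numpy as np

def make_generator_input(image, age_group, n_groups=5):
    """Concatenate an image with a spatially tiled one-hot age code.

    image: (H, W, 3) float array; age_group: int in [0, n_groups).
    Returns an (H, W, 3 + n_groups) array, mirroring the channel-wise
    concatenation of the face and the target-age code described above.
    """
    h, w, _ = image.shape
    # One-hot label padded into H x W x n_groups feature maps:
    # the target group's map is all ones, the others all zeros.
    code = np.zeros((h, w, n_groups), dtype=image.dtype)
    code[:, :, age_group] = 1.0
    return np.concatenate([image, code], axis=-1)

x = np.random.rand(128, 128, 3).astype(np.float32)
g_in = make_generator_input(x, age_group=3)
print(g_in.shape)  # (128, 128, 8)
```

This mirrors how the conditional feature maps act as a spatial one-hot code: exactly one channel is filled with ones.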
One key ingredient of our approach is to make G focus on those regions of the image that are relevant to face aging, while keeping the background unchanged and preserving identity consistency. For this purpose, we embed an attention mechanism in the generator. Concretely, instead of regressing a full image, our generator outputs two masks, an attention mask A and a color mask C. The final generated image can be obtained as:

x̂ = (1 − A) ⊙ C + A ⊙ x,

where ⊙ denotes the element-wise product, A ∈ [0, 1]^{H×W}, and C ∈ R^{H×W×3}. The mask A indicates to what extent each pixel of the input x contributes to the generated image x̂.
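The attention-based blending can be sketched as below. This is a minimal NumPy sketch assuming the GANimation-style composition x̂ = (1 − A)·C + A·x, where A → 1 keeps the original pixel; the function name is hypothetical.

```python
import numpy as np

def compose(x, attention, color):
    """Blend the input image x and the color mask via the attention mask.

    attention: (H, W) values in [0, 1]. Where attention -> 1 the
    original pixel is kept; where it -> 0 the color mask takes over,
    so only aging-relevant regions are modified.
    """
    a = attention[..., None]          # broadcast over the RGB channels
    return a * x + (1.0 - a) * color

x = np.random.rand(4, 4, 3)
color = np.random.rand(4, 4, 3)
attn = np.ones((4, 4))                # all-ones mask: output equals input
out = compose(x, attn, color)
print(np.allclose(out, x))  # True
```

An all-ones mask reproduces the input exactly, which is why the attention loss below must keep A from saturating to 1.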
Discriminator. This module consists of an image discriminator D_img and an age classifier D_cls, aiming to make the generated face realistic and to guarantee that the synthesized face lies in the target age group. Note that D_img and D_cls share parameters, as shown in Fig. 1, which improves the performance of the discriminator significantly.
The loss function includes three terms: 1) the adversarial loss proposed by Gulrajani et al., which pushes the distribution of the generated images toward the distribution of the training images; 2) the attention loss, which drives the attention masks to be smooth and prevents them from saturating; 3) the age classification loss, which makes the generated facial image more accurate in age.
Adversarial Loss. Specifically, the original GAN formulation is based on the Jensen-Shannon (JS) divergence and aims to maximize the probability of correctly classifying real and fake images while the generator tries to fool the discriminator. This loss is potentially not continuous with respect to the parameters of the generator and can locally saturate, leading to vanishing gradients in the discriminator. WGAN addresses this by replacing the JS divergence with the continuous Earth Mover's Distance. To maintain the Lipschitz constraint, WGAN-GP adds a gradient penalty for the critic network, computed as the norm of the gradients with respect to the critic input.
Formally, let P_x be the distribution of the input images x, and P_x̃ be the random interpolation distribution between the real and generated images. Then, the adversarial loss can be written as:

L_adv = E_{x∼P_x}[D_img(G(x, c_t))] − E_{x∼P_x}[D_img(x)] + λ_gp E_{x̃∼P_x̃}[(‖∇_x̃ D_img(x̃)‖_2 − 1)^2],

where λ_gp is a penalty coefficient.
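The gradient-penalty term can be illustrated with a toy critic whose input gradient is known in closed form. This is a NumPy sketch of the WGAN-GP idea, not the paper's implementation: in a real framework the gradient comes from autograd, and `critic_grad_fn` is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(critic_grad_fn, real, fake, lam=10.0):
    """WGAN-GP penalty: lam * (||grad of D at a random interpolate|| - 1)^2.

    critic_grad_fn returns the gradient of the critic w.r.t. its input
    (analytic here; computed by autograd in practice).
    """
    eps = rng.uniform(0.0, 1.0)
    interp = eps * real + (1.0 - eps) * fake   # random interpolation point
    grad = critic_grad_fn(interp)
    return lam * (np.linalg.norm(grad) - 1.0) ** 2

# Toy linear critic D(x) = w . x, whose input gradient is simply w.
w = np.array([0.6, 0.8])                       # ||w|| = 1 -> zero penalty
real, fake = rng.standard_normal(2), rng.standard_normal(2)
gp = gradient_penalty(lambda x: w, real, fake)
print(gp)  # 0.0 (this critic is exactly 1-Lipschitz)
```

A critic with gradient norm 2 would instead incur a penalty of lam·(2 − 1)² = 10, pushing it back toward the 1-Lipschitz set.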
Attention Loss. Note that when training the model, we have no ground-truth annotation for the attention masks A. As with the color masks C, they are learned from the gradients of the discriminative module and the age classification loss. However, the attention masks can easily saturate to 1, which makes the attention module ineffective. To prevent this, we regularize the mask with an l2-weight penalty. Besides, to enforce a smooth spatial color transformation when combining the pixels of the input image with the color transformation C, we apply a Total Variation regularization over A. The attention loss can be defined as:

L_att = λ_tv Σ_{i,j} [(A_{i+1,j} − A_{i,j})^2 + (A_{i,j+1} − A_{i,j})^2] + ‖A‖_2,

where A_{i,j} is the (i, j) entry of A and λ_tv is a penalty coefficient.
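The two regularizers can be sketched numerically as follows. This is a NumPy sketch under my reading of the text (an unsquared l2 norm plus squared-difference total variation); the function name and the λ_tv value are hypothetical.

```python
import numpy as np

def attention_loss(A, lam_tv=1e-4):
    """Total-variation smoothness term plus an l2 penalty on the mask A.

    The TV term sums squared differences between vertically and
    horizontally neighbouring mask entries; the l2 term discourages
    A from saturating to all ones.
    """
    tv = np.sum((A[1:, :] - A[:-1, :]) ** 2) + np.sum((A[:, 1:] - A[:, :-1]) ** 2)
    return lam_tv * tv + np.sqrt(np.sum(A ** 2))

A_flat = np.full((8, 8), 0.5)     # constant mask: the TV term vanishes
print(attention_loss(A_flat))     # only the l2 norm remains: sqrt(64 * 0.25) = 4.0
```

A perfectly flat mask incurs no smoothness cost but still pays the l2 penalty, so the optimum trades saturation against smoothness.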
Age Classification Loss. While reducing the image adversarial loss, the generator must also reduce the age error measured by the age classifier D_cls. The age classification loss has two components: an age estimation loss on fake images, used to optimize G, and an age estimation loss on real images, used to learn the age classifier. This loss is computed as:

L_cls = E_x[ℓ(D_cls(G(x, c_t)), c_t)] + E_x[ℓ(D_cls(x), c_o)],

where c_o is the age label of the input image x and ℓ corresponds to the softmax loss.
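The two softmax-loss terms can be sketched as below. This is a minimal NumPy illustration; the logit values are made up, and in the paper the classifier output would come from D_cls.

```python
import numpy as np

def softmax_ce(logits, label):
    """Softmax cross-entropy: the per-sample age classification loss."""
    z = logits - logits.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Fake-image term trains G toward the target group; real-image term
# trains the classifier on the true label of the input face.
logits_fake, target_group = np.array([0.1, 2.0, 0.3, 0.0, -1.0]), 1
logits_real, true_group = np.array([3.0, 0.2, 0.1, 0.0, 0.0]), 0
loss_cls = softmax_ce(logits_fake, target_group) + softmax_ce(logits_real, true_group)
print(loss_cls > 0)  # True
```

Five logits correspond to the five age groups; uniform logits give the maximal-uncertainty loss log 5.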
Final Loss. To generate the target-age image x̂, we build the loss function by linearly combining all previous losses:

L = L_adv + λ_att L_att + λ_cls L_cls,

where λ_att and λ_cls are hyper-parameters that control the relative importance of each loss term. Finally, we define the following minimax problem:

G* = arg min_G max_D L,

where x ∼ P_x draws samples from the data distribution. Additionally, we constrain our discriminator to lie in the set of 1-Lipschitz functions.
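The linear combination itself is straightforward; a minimal sketch follows. The weight values are hypothetical placeholders, not the paper's settings.

```python
def generator_loss(l_adv, l_att, l_cls, lam_att=1.0, lam_cls=1.0):
    """Weighted sum of the three loss terms described in the text.

    lam_att and lam_cls are illustrative defaults, not the paper's values.
    """
    return l_adv + lam_att * l_att + lam_cls * l_cls

print(generator_loss(0.5, 0.2, 0.3))  # 1.0
```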
In this section, we introduce our implementation details and then evaluate the proposed model both qualitatively and quantitatively on the large public dataset Morph, which contains 55,000 face images of 13,617 subjects aged from 16 to 77. To better demonstrate the superiority of our method in preserving identity features, we also compare against two state-of-the-art methods: the Conditional Adversarial Autoencoder network (CAAE) and Identity-Preserved Conditional Generative Adversarial Networks (IPCGANs).
Following prior works [4, 5, 6], before being fed into the networks, the faces are (1) aligned using the five facial landmarks detected by MTCNN, (2) cropped with 10% additional area so that not only the hair but also the beard is covered, and (3) divided into five age groups, i.e., 10-20, 21-30, 31-40, 41-50, and 51+. Consequently, a total of 54,539 faces are collected, and we split the Morph dataset into two non-overlapping parts: 90% for training and the rest for testing. AcGANs is implemented with the open-source PyTorch framework (the code has been released at https://github.com/JensonZhu14/AcGAN).
During training, we adopt the architecture shown in Fig. 1. Unlike CAAE and IPCGANs, our generator receives the image and the condition feature maps concatenated together along the channel dimension as input, at a resolution larger than those of CAAE and IPCGANs, and thus generates clearer results. Furthermore, the conditional feature maps resemble a one-hot code: only one map is filled with ones while the rest are filled with zeros. For IPCGANs, we first train the age classifier, which is fine-tuned from AlexNet on CACD, and set the other parameters according to the original paper. For CAAE, we remove the gender information and use 5 age groups for a fair comparison. For AcGANs, the penalty coefficients are set to 10, 2, 100, and 10, respectively. For all models, including AcGANs, we choose Adam to optimize both G and D with batch size 64, training G and D in turn at every iteration, for a total of 100 epochs, on four 2080 Ti GPUs.
In this subsection, we first visualize the aging process from the perspective of what AcGANs has learned from the input image, i.e., the attention mask and the color mask. As shown in Fig. 2, we randomly select four face images from the test set regardless of their original age group, and show the aging results in the first row, the attention masks in the second row, and the color masks in the third row. From the attention masks, we can convincingly conclude that AcGANs indeed learns which parts of the face should be aged.
We further qualitatively compare the faces generated by the different methods in Fig. 3. All three sets of results show that AcGANs is more capable of removing ghost artifacts. Meanwhile, the adornments marked with red rectangles in the last two faces are preserved intact by AcGANs, which again demonstrates that AcGANs has learned what should be aged in the face.
There are two critical evaluation metrics in age progression, i.e., identity permanence and aging accuracy. We first generate elder faces from young faces, i.e., faces in the 10-20 age group, and then evaluate the two metrics separately.
To estimate the aging accuracy, we use the Face++ API to estimate the age distributions, i.e., the mean values for both generic and generated faces in each age group, where a smaller discrepancy between real and fake images indicates a more accurate simulation of aging effects. For simplicity, we report the mean values of the age distributions, with the discrepancy from the generic age distribution shown in brackets (see Table 1). For identity permanence, face verification experiments are also conducted with the Face++ API, where a high verification confidence and verification rate indicate a strong ability to preserve identity information. In Table 2, the top rows give the verification confidence between the ground-truth young faces and their aged counterparts generated by AcGANs, and the bottom rows give the verification rate between them, i.e., the accuracy with which they are judged to be the same person. The best values in each column of both Table 1 and Table 2 are indicated in bold.
On Morph, AcGANs consistently outperforms CAAE and IPCGANs on both metrics across all four aging processes. Although IPCGANs preserves identity information better than CAAE, it generates inferior aging faces, while CAAE fails to keep the original identity. AcGANs, however, both achieves better aging results and preserves identity consistently.
[Table 1: Estimated age distributions]
In this paper, we propose a novel approach based on an attention mechanism for face aging. Since the attention mechanism modifies only the regions relevant to face aging, the proposed approach can well preserve the background information and the personal identity without using a pixel-wise loss, significantly reducing ghost artifacts and blurring. Besides, the proposed approach is simple, as it consists of only a generator and a discriminator sub-network, and it can be learned without additional pre-trained models. Moreover, both qualitative and quantitative experiments validate the effectiveness of our approach.