1 Introduction
Facial aging and facial rejuvenation, also referred to as age progression and age regression, aim to render a face photograph with natural aging and rejuvenating effects on the individual’s face [Fu et al.2010]. They have broad applications, including facial appearance prediction, cross-age face recognition [Park et al.2010], finding missing children, and movie entertainment [Wang et al.2016].
Although the problem has attracted much attention from the research community, serious challenges remain, primarily from the intrinsic complexity of aging in the physical world and a shortage of labeled aging photograph data. Generally speaking, aging accuracy and identity permanence are commonly regarded as the two crucial metrics for evaluating the quality of facial aging and rejuvenation in the recent literature [Shu et al.2015, Suo et al.2010, Yang et al.2016].
1.1 Related Work
Early approaches were mainly based on the skin’s anatomical structure and simulated the profile growth and facial muscle changes with respect to the elapsed time [Ramanathan and Chellappa2008]. Although these methods provided novel insights for facial aging synthesis, they are difficult to generalize for other tasks because of their complex modeling techniques.
The later data-driven approaches can be roughly categorized into prototype-based methods [Tiddeman et al.2001, KemelmacherShlizerman et al.2014, Shu et al.2015] and physical-model-based ones [Suo et al.2010, Park et al.2010, Suo et al.2012]. By dividing training data into several disjoint age groups, prototype-based approaches learn a transformation over these age groups [Burt and Perrett1995, KemelmacherShlizerman et al.2014]. Because they only consider the general aging mechanism, they are simple and fast. However, they neglect personalized information, which causes them to generate unrealistic images. Although [Shu et al.2015] utilized dictionary learning to estimate the age pattern of each age group from the corresponding sub-dictionary, this approach produces serious ghosting artifacts. Physical-model-based approaches, on the other hand, employ parametric models to simulate the aging mechanisms of the muscles [Suo et al.2012], the wrinkles [Ramanathan and Chellappa2008, Suo et al.2010], the skin, and the skull of a particular individual [Lanitis et al.2002, Ramanathan and Chellappa2006]. Nevertheless, they suffer from a complex modeling procedure with high computational cost. Moreover, it is difficult for these approaches to collect a large ground-truth face dataset, with a long time span for each individual, to model the subtle aging mechanism.
Recently, generative adversarial networks (GANs) have shown an impressive ability in generating synthetic images [Goodfellow et al.2014, Gauthier2014, Radford et al.2015] and in facial aging and rejuvenation [Wang et al.2016, Duong et al.2017, Zhang et al.2017, Yang et al.2017]. For example, [Wang et al.2016] transformed faces across different ages smoothly by modeling the intermediate transition states in an RNN model, and [Zhang et al.2017] proposed a conditional adversarial autoencoder to simulate facial muscle sagging caused by aging. These approaches render faces with more appealing aging effects and fewer ghosting artifacts than the earlier methods. However, aging accuracy and identity permanence can hardly be achieved simultaneously. The reason is that these approaches focus on modeling the facial transformation between age groups, where the age factor plays a dominant role while the identity information plays a subordinate role. Furthermore, learning facial aging between age groups does not allow the generation of facial images for an arbitrary age.
1.2 Our Approach
In this paper, we propose a novel GAN-based approach, named Conditional Multi-Adversarial Autoencoder with Ordinal Regression (CMAAE-OR), which combines the advantage of GANs in synthesizing visually plausible images with that of ordinal regression in accurate age estimation. Compared with existing methods, our method better handles identity permanence and aging accuracy simultaneously in facial aging and rejuvenation. Concretely, CMAAE-OR utilizes a convolutional encoder to extract a latent feature from an input face photograph and then projects the feature onto the face manifold, conditioned on age, through a deconvolutional generator. The encoder and the generator are trained with four loss terms: (1) an age-distance-based weighted squared Euclidean loss in the image space, (2) an identity loss that minimizes the input-output distance in a latent feature representation, which embeds personalized characteristics from a pretrained encoder, (3) a GAN loss that encourages generated faces to be indistinguishable from actual faces, and (4) an ordinal regression loss that forces generated faces to exhibit the desired aging effect. Together, these four terms ensure that the resultant faces present the desired aging effects while the identity properties remain stable. In contrast to previous approaches that take an age group as the conditional input, the proposed method accepts a specific age as input and utilizes an age estimation technique to ensure aging accuracy. Consequently, CMAAE-OR produces more photorealistic and aging-accurate images, as shown in Figure 1. The main contributions are summarized as follows:

The proposed method incorporates face verification and age estimation techniques to preserve identity permanence and achieve high aging accuracy. In addition, our framework accepts an arbitrary age as the conditional input, instead of a predefined discrete age group.

An age-distance-based weighted squared Euclidean loss in the facial image space is utilized in our framework to emphasize the aging effect.

Experimental results illustrate the appealing performance of the proposed method in facial aging and rejuvenation. In addition, our method is robust against variations in pose, eyeglasses, and occlusion.
2 The Method
We first introduce the proposed Conditional Multi-Adversarial Autoencoder with Ordinal Regression (CMAAE-OR) in detail. Then the objective function of CMAAE-OR is described.
2.1 Conditional Multi-Adversarial Autoencoder with Ordinal Regression
In our framework, the provided actual face image is first mapped to a latent vector through a convolutional encoder. The vector is then projected onto the face manifold, conditioned on a desired age, through a deconvolutional generator. The latent vector preserves personalized face features, and the age controls facial aging or rejuvenation. To predict the aging trend well and keep person-specific information stable, a compound training procedure with four different loss functions is employed. Specifically, (1) an age-distance-based weighted squared Euclidean loss in the image space is used for eliminating the input-output gap, (2) the identity loss minimizes the input-output distance in a high-level feature representation that embeds the personalized characteristics, (3) the GAN loss encourages generated faces to be indistinguishable from actual faces, and (4) the ordinal regression loss ensures the aging accuracy of the generated faces. The detailed structure of CMAAE-OR is shown in Figure 2.
Encoder & Generator: Facial aging and rejuvenation only require a forward pass through the encoder and the generator. The encoder maps the input face to a feature vector of fixed dimension. Given this feature vector and a conditional age label, the generator produces the output face. Unlike existing GAN-related works [Zhang et al.2017, Yang et al.2017], the age label is a specific age with one dimension rather than an age group with a one-hot age label, so that a face of a specific age can be generated.
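The difference between a one-dimensional age condition and a one-hot age-group condition can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names, the 50-dimensional latent vector, the group boundaries, and the maximal age of 116 (the upper end of the UTKFace age range) are assumptions chosen for the example.

```python
import numpy as np

def scalar_age_condition(z, age, max_age):
    """Append one normalized age value to the latent vector (hypothetical helper)."""
    label = np.array([age / max_age], dtype=z.dtype)
    return np.concatenate([z, label])

def group_age_condition(z, age, group_edges):
    """Append a one-hot age-group label, as in group-based conditioning."""
    # np.digitize returns the index of the bin that contains `age`.
    group = int(np.digitize(age, group_edges))
    one_hot = np.zeros(len(group_edges) + 1, dtype=z.dtype)
    one_hot[group] = 1.0
    return np.concatenate([z, one_hot])

z = np.zeros(50, dtype=np.float32)   # placeholder latent feature (assumed size)
edges = [20, 30, 40, 50, 60]         # hypothetical age-group boundaries

a = scalar_age_condition(z, 52, max_age=116)
b = scalar_age_condition(z, 58, max_age=116)
ga = group_age_condition(z, 52, edges)
gb = group_age_condition(z, 58, edges)
```

In this sketch, ages 52 and 58 yield different scalar conditions but identical one-hot group conditions, which is why group-based conditioning cannot target an arbitrary age.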
Discriminator: According to the principle of conditional generative adversarial networks (cGANs) [Mirza and Osindero2014], the discriminator on face images forces the generator to yield more realistic faces. The goal of the generator is to confuse the discriminator by capturing the distribution of true faces, whereas the discriminator is optimized to distinguish natural face images from the ones produced by the generator. The risk function of this minimax two-player game can be written as:
Here, an actual face image follows a certain data distribution, and the conditional age label has one dimension. After the process converges, the distribution of the synthesized images is equivalent to that of the actual face images. Accordingly, the training process alternately minimizes the following equations:
(3) 
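For orientation, the standard conditional GAN objective of [Mirza and Osindero2014], adapted to an encoder-generator pair, can be sketched as below. The symbols E (encoder), G (generator), D (discriminator), x (actual face image), l (age label), and p_data (data distribution) are assumed names, and the exact form used in this work may differ.

```latex
% Minimax game (assumed notation)
\min_{E,G}\,\max_{D}\;
\mathbb{E}_{x,l \sim p_{data}(x,l)}\big[\log D(x,l)\big]
+ \mathbb{E}_{x,l \sim p_{data}(x,l)}\big[\log\big(1 - D(G(E(x),l),\,l)\big)\big]

% Alternating updates derived from the same two terms
\mathcal{L}_{D} = -\,\mathbb{E}_{x,l}\big[\log D(x,l)\big]
                  -\,\mathbb{E}_{x,l}\big[\log\big(1 - D(G(E(x),l),\,l)\big)\big]
\qquad
\mathcal{L}_{G} = \mathbb{E}_{x,l}\big[\log\big(1 - D(G(E(x),l),\,l)\big)\big]
```

Minimizing the first loss with respect to D and the second with respect to E and G, in alternation, corresponds to the max and min steps of the game.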
Ordinal Regression: Once the aforementioned procedure is completed, an age estimation technique is used to make the generated image more accurate at the age level. In our case, a CNN-based ordinal regression method is introduced to estimate the age of the generated face image, because it has achieved remarkable accuracy in age estimation [Niu et al.2016]. The regression loss for the generated image is written as
(4) 
Here the loss is the distance between feature representations. For more implementation details of the ordinal regression network, readers are referred to [Niu et al.2016].
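The general idea behind ordinal regression for age estimation, as described in [Niu et al.2016], is to turn one age into a series of binary questions of the form "is the age greater than r?". The sketch below illustrates that encoding only. It does not reproduce the regression loss of Eq. (4); the helper names and the cross-entropy form are assumptions, and the age range 16 to 77 is taken from the MORPH description.

```python
import numpy as np

def encode_ordinal(age, min_age, max_age):
    """Turn an integer age into binary 'is the age greater than r?' targets."""
    thresholds = np.arange(min_age, max_age)   # r = min_age, ..., max_age - 1
    return (age > thresholds).astype(float)

def decode_ordinal(probabilities, min_age, threshold=0.5):
    """Recover an age by counting the binary outputs that fire."""
    return min_age + int(np.sum(np.asarray(probabilities) > threshold))

def ordinal_loss(probabilities, targets, eps=1e-7):
    """Mean binary cross-entropy over the ordinal outputs (assumed loss form)."""
    p = np.clip(probabilities, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

targets = encode_ordinal(30, min_age=16, max_age=77)
recovered = decode_ordinal(targets, min_age=16)
```

Because the binary targets are ordered, a prediction that is off by one year changes a single output, while a prediction that is off by twenty years changes twenty of them, so the loss grows with the age error.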
Identity Preservation: Another core issue of facial aging and rejuvenation is to keep the person-dependent or person-specific properties consistent. By measuring the input-output distance in a proper feature space that is sensitive to identity change and relatively robust to other variations, we incorporate an associated constraint into our proposed model. Specifically, we utilize an encoder (pretrained on the training dataset) to extract a latent vector, which preserves personal identity in a high-level feature representation for each face image. Then the identity loss can be written as
(5) 
The architecture of the pretrained encoder is the same as that of the encoder used for synthesis.
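One plausible form of such an identity loss, assuming a squared Euclidean distance in the feature space of the pretrained encoder, is shown below. The symbols E_id (pretrained encoder), E, G, x, and l are assumed names, and the choice of distance is an assumption rather than the published Eq. (5).

```latex
\mathcal{L}_{id} = \big\| E_{id}(x) - E_{id}\big(G(E(x),\,l)\big) \big\|_2^2
```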
2.2 Objective Function
Besides the three aforementioned loss functions, an age-distance-based weighted squared Euclidean loss in the image space is adopted for further eliminating the input-output gap, e.g., the color aberration. Because the original squared Euclidean loss may suppress the facial aging effect to some degree, we propose an age-distance-based weighted squared Euclidean loss to avoid this issue. The proposed weighting strategy is based on the intuition that the larger the gap in age, the larger the difference between the input and output faces. The formulation of this loss can be written as
(6) 
where the normalization terms correspond to the shape of the image (e.g., width, height, and channel), and the weight is computed from the age labels of the input face and the generated face.
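The weighting intuition can be sketched in a few lines of NumPy. The linear weight used here is a hypothetical choice made purely for illustration, not the weighting defined in Eq. (6); the function name, the 128 x 128 image size, and the maximal age of 116 are likewise assumptions.

```python
import numpy as np

def age_weighted_pixel_loss(x_in, x_out, age_in, age_out, max_age):
    """Squared Euclidean image loss, down-weighted as the age gap grows.

    The linear weight 1 - |age_in - age_out| / max_age is a hypothetical
    choice for illustration, not the weighting defined in the paper.
    """
    height, width, channels = x_in.shape
    weight = 1.0 - abs(age_in - age_out) / max_age
    squared_error = np.sum((x_in - x_out) ** 2)
    return weight * squared_error / (height * width * channels)

rng = np.random.default_rng(0)
face_in = rng.random((128, 128, 3))    # placeholder input image
face_out = rng.random((128, 128, 3))   # placeholder generated image

small_gap = age_weighted_pixel_loss(face_in, face_out, 30, 35, max_age=116)
large_gap = age_weighted_pixel_loss(face_in, face_out, 30, 80, max_age=116)
```

For the same pair of images, the penalty is smaller when the target age is far from the input age, so the pixel loss leaves more room for visible aging effects.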
Finally, the objective functions for the generator and the discriminator are written as:
(8) 
where the four coefficients weight the pixel loss, identity loss, GAN loss, and ordinal regression loss, respectively. These coefficients trade off face aging accuracy against identity permanence.
In our framework, we first pretrain a convolutional encoder as a face identity descriptor and a CNN-based ordinal regression network as an age estimator on the training dataset. Then the generator and the discriminator are trained alternately by Eqs. (2.2) and (8) until convergence. Finally, the generator learns the desired age transformation pattern and the discriminator becomes reliable.
3 Experimental Results
We perform a comprehensive comparison between our proposed approach and several published methods.
3.1 Data Collection
In our experiments, we utilized three datasets for training and evaluation: the MORPH dataset [Ricanek and Tesafaye2006], the UTKFace dataset [Zhang et al.2017], and the FG-NET dataset (http://www.prima.inrialpes.fr/FGnet/).
MORPH is a large publicly available aging database annotated with each subject’s ethnicity, height, weight, and gender. It contains 55,608 facial images, and the age of the subjects ranges from 16 to 77 years, with an average age of approximately 33 years. UTKFace is also a large-scale face dataset with a long age span, ranging from 0 to 116 years. This dataset includes 23,709 facial images with annotations of age, gender, and ethnicity. The third facial aging dataset, FG-NET, consists of only 1,002 images of 82 subjects and is used for testing in our work. In addition, images from these three datasets cover wide variations in eyeglasses, pose, facial expression, illumination, occlusion, etc.
To make the training phase effective, we align all faces according to 68 landmarks detected in each face [Kazemi and Josephine2014]. Each facial image is then cropped to a fixed size.
3.2 Implementation Details
The architecture of CMAAE-OR is constructed as in Figure 2. Specifically, we normalize four quantities into the same range: (1) the pixel values of the input images, (2) the output of the encoder, by using a sigmoid activation function, (3) the value of the age label, by dividing it by the maximal age of each dataset, and (4) the output of the network, likewise by using the sigmoid function. Furthermore, the desired age label is concatenated to the latent vector, forming the input of the generator. In our experience, such normalization helps the training process converge faster. In the convolutional networks, convolution with stride 2 is employed instead of pooling (e.g., max pooling) because strided convolution is fully differentiable and allows the network to learn its own spatial downsampling [Radford et al.2015]. Note that we do not use batch normalization (BN) for the encoder and the generator because it blurs personal features and makes the generated faces drift far away from the inputs in testing. However, BN makes the framework more stable if it is applied to the discriminator. All intermediate layers of each block use ReLU as the activation function. Further, padding is added to the layers to make the sizes of the input and the output identical.
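How strided convolution replaces pooling as the downsampling step can be seen from the standard output-size formula for a convolution. The sketch below assumes a 128-pixel input, 5 x 5 kernels, and four stride-2 layers; none of these values are taken from the paper.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    """Padding that keeps the size unchanged for stride 1 and an odd kernel."""
    return (kernel - 1) // 2

size = 128       # hypothetical input resolution
sizes = [size]
for _ in range(4):
    # Each stride-2 convolution halves the resolution, like a 2x2 pooling step,
    # but with learned weights.
    size = conv_output_size(size, kernel=5, stride=2, padding=same_padding(5))
    sizes.append(size)

unchanged = conv_output_size(128, kernel=5, stride=1, padding=same_padding(5))
```

With these assumed settings, each layer halves the spatial resolution, while the same padding with stride 1 leaves the resolution unchanged.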
The four loss coefficients are set to 0.10, 1.00, 1.00, and 0.02 for MORPH, and 0.50, 1.00, 1.00, and 0.01 for UTKFace, respectively. At the training stage, we employ Adam [Kingma and Ba2014] with weight decay. After the ordinal regression network and the identity encoder are pretrained, we alternately update the discriminator with the GAN loss and the generator with the GAN loss, age estimation loss, pixel-level loss, and identity loss at every iteration. The networks are trained with a batch size of 100 for 200 epochs in total, which takes around 2.5 hours on four GTX 1080Ti GPUs.
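As a rough illustration of how these coefficients enter training, the generator objective can be read as a weighted sum of the four loss terms. The sketch assumes that the coefficients are listed in the order pixel, identity, GAN, ordinal regression, as in Section 2.2; the function name and the example loss values are hypothetical.

```python
# Coefficient order: pixel, identity, GAN, ordinal regression (assumed to
# follow the order in which the losses are listed in Section 2.2).
COEFFICIENTS = {
    "MORPH":   (0.10, 1.00, 1.00, 0.02),
    "UTKFace": (0.50, 1.00, 1.00, 0.01),
}

def generator_objective(pixel, identity, gan, ordinal, dataset):
    """Weighted sum of the four generator loss terms for one dataset."""
    w_pixel, w_id, w_gan, w_ord = COEFFICIENTS[dataset]
    return w_pixel * pixel + w_id * identity + w_gan * gan + w_ord * ordinal

# Made-up loss values, only to show how the terms are combined.
total = generator_objective(pixel=2.0, identity=0.5, gan=0.7, ordinal=10.0,
                            dataset="MORPH")
```

Under this reading, the small coefficient on the ordinal regression term keeps a loss with a comparatively large raw magnitude from dominating the identity and GAN terms.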
3.3 Performance Comparison
Facial Aging and Rejuvenation: Given an input face and its target age label, CMAAE-OR generates faces at the target age along the direction of facial aging or rejuvenation. We evaluate its performance on the FG-NET and MORPH datasets. For MORPH, we randomly divide the whole dataset into a training set (80%) and a testing set (20%) without overlap. For FG-NET, we use the UTKFace dataset for training and FG-NET for testing, following the setting in [Zhang et al.2017]. The facial aging and rejuvenation results of our method on these two datasets are shown in Figures 3 and 4, respectively. It can be seen that CMAAE-OR preserves personal identity well even over a long age span and produces richer textures, such as wrinkles, in older faces, making older faces more realistic.
Aging Accuracy and Effect of Ordinal Regression: To verify the accuracy of facial aging and rejuvenation of our method, an ordinal regression network is trained as the age estimator (its architecture is shown in Figure 2), which achieves remarkable performance on the MORPH and FG-NET datasets (see the left column of Table 1). We then compare the facial images generated by CMAAE-OR with and without the ordinal regression network to justify its effectiveness in our framework. The age estimation errors of the synthesized images under these two conditions are shown in the right column of Table 1. It can be seen that the ordinal regression remarkably improves the age accuracy of the synthesized facial images. Further, Figure 5 shows the effect of the ordinal regression network on the synthesized images. It clearly helps the framework generate aged faces with high accuracy. Although the outputs without it can present aging, the effect is subtle, since the ordinal regression network brings age-level details to the generated face. The detailed results are listed in Table 1. Note that other published methods should have a mean absolute error (MAE) similar to that of our method without the ordinal regression network. The reason is that those methods evaluate MAE in discrete age groups, so a 50-year-old face and a 60-year-old face can be placed in the same group, producing a large MAE value.
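Table 1: Age estimation errors on actual faces and on synthesized faces, with and without ordinal regression.
Dataset | Actual Faces (Estimator) | Synthesized Faces (with) | Synthesized Faces (without)
MORPH | | |
FG-NET | | |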
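The effect of group-based evaluation on the mean absolute error (MAE) can be illustrated with a toy calculation. All ages below are invented for the example and are not experimental results; the sketch assumes that a group-based method renders every target in a 50 to 60 group at roughly the same apparent age.

```python
import numpy as np

def mean_absolute_error(target_ages, estimated_ages):
    """Mean absolute error between target ages and estimated ages."""
    target_ages = np.asarray(target_ages, dtype=float)
    estimated_ages = np.asarray(estimated_ages, dtype=float)
    return float(np.mean(np.abs(target_ages - estimated_ages)))

targets = [50, 60]
# Hypothetical group-based output: both targets fall into one 50-60 group,
# so both synthesized faces are estimated at about 55.
group_mae = mean_absolute_error(targets, [55, 55])
# Hypothetical exact-age output whose estimated ages track the targets.
exact_mae = mean_absolute_error(targets, [51, 59])
```

In this toy case the group-based output cannot separate the two targets, so its error is bounded below by the spread of ages inside the group.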
Comparison with Prior Works: We compare our synthetic results with several representative prior works, including the Face Transformer (FT) demo (http://cherry.dcs.aber.ac.uk/transformer/), CDL: coupled dictionary learning [Shu et al.2015], RFA: recurrent facial aging [Wang et al.2016], and CAAE: conditional adversarial autoencoder [Zhang et al.2017]. As a prototype-based method, the FT demo regards different age groups as aging prototypes to learn the aging pattern. Unlike FT, CDL utilizes dictionary learning to estimate the age pattern between age groups. RFA transforms faces across different ages by modeling the intermediate transition states in an RNN model, while CAAE utilizes a conditional adversarial autoencoder network to achieve bidirectional face aging. For a fair comparison, we use the same input faces as these works and directly cite their synthetic results, as most prior works did. From the results in Figure 6, it can be seen that the age change of the synthetic faces is not obvious in these prior works. In contrast, our method achieves facial aging with clearer changes. Moreover, our network achieves age progression and regression in the same framework and can generate a facial image at an arbitrary age instead of a predefined discrete age group.
Robustness: As mentioned above, the input images may have large variations in eyeglasses, pose, and occlusion. To demonstrate the robustness of CMAAE-OR, we select faces with eyeglass variation, non-frontal pose, and occlusion, respectively, as shown in Figure 7. Note that previous works [Wang et al.2016, KemelmacherShlizerman et al.2014] often apply face normalization to alleviate the variation of pose and expression, but they may still suffer from the occlusion issue. In contrast, CMAAE-OR generates faces without the need to remove these variations, paving the way to robust performance in real-world applications.
4 Conclusions
In this paper, a novel GAN-based method, named the Conditional Multi-Adversarial Autoencoder with Ordinal Regression (CMAAE-OR), is proposed to predict facial aging and rejuvenation. It involves age estimation, face verification, and synthesis of visually plausible images, and it eliminates the input-output gap by using a weighted pixel-level penalty. Unlike previous approaches, this method addresses face aging accuracy and identity permanence simultaneously. Moreover, to our knowledge, it is the first method to generate an aged face at a specific age instead of a discrete age group. This helps face aging prediction obtain higher accuracy and finer estimates. Finally, experimental results on both face aging and rejuvenation demonstrate the effectiveness and robustness of the proposed method.
References
 [Burt and Perrett1995] D. Michael Burt and David I. Perrett. Perception of age in adult Caucasian male faces: Computer graphic manipulation of shape and colour information. In Proc. R. Soc. Lond. B, volume 259, pages 137–143, 1995.
 [Duong et al.2017] Chi Nhan Duong, Kha Gia Quach, Khoa Luu, T. Hoang Ngan Le, and Marios Savvides. Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition. arXiv preprint arXiv:1703.08617, 2017.
 [Fu et al.2010] Yun Fu, Guodong Guo, and Thomas S. Huang. Age synthesis and estimation via faces: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11):1955–1976, 2010.
 [Gauthier2014] Jon Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, 2014(5):2, 2014.
 [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [Kazemi and Josephine2014] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In The 27th IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.
 [KemelmacherShlizerman et al.2014] Ira Kemelmacher-Shlizerman, Supasorn Suwajanakorn, and Steven M. Seitz. Illumination-aware age progression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3334–3341, 2014.
 [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Lanitis et al.2002] Andreas Lanitis, Christopher J. Taylor, and Timothy F. Cootes. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):442–455, 2002.
 [Mirza and Osindero2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [Niu et al.2016] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Ordinal regression with multiple output CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4920–4928, 2016.
 [Park et al.2010] Unsang Park, Yiying Tong, and Anil K. Jain. Age-invariant face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):947–954, 2010.
 [Radford et al.2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [Ramanathan and Chellappa2006] Narayanan Ramanathan and Rama Chellappa. Modeling age progression in young faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 387–394, 2006.
 [Ramanathan and Chellappa2008] Narayanan Ramanathan and Rama Chellappa. Modeling shape and textural variations in aging faces. In The 8th IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–8, 2008.
 [Ricanek and Tesafaye2006] Karl Ricanek and Tamirat Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In The 7th International Conference on Automatic Face and Gesture Recognition, pages 341–345, 2006.
 [Shu et al.2015] Xiangbo Shu, Jinhui Tang, Hanjiang Lai, Luoqi Liu, and Shuicheng Yan. Personalized age progression with aging dictionary. In Proceedings of the IEEE International Conference on Computer Vision, pages 3970–3978, 2015.
 [Suo et al.2010] Jinli Suo, Song-Chun Zhu, Shiguang Shan, and Xilin Chen. A compositional and dynamic model for face aging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):385–401, 2010.
 [Suo et al.2012] Jinli Suo, Xilin Chen, Shiguang Shan, Wen Gao, and Qionghai Dai. A concatenational graph evolution aging model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2083–2096, 2012.
 [Tiddeman et al.2001] Bernard Tiddeman, Michael Burt, and David Perrett. Prototyping and transforming facial textures for perception research. IEEE Computer Graphics and Applications, 21(5):42–50, 2001.
 [Wang et al.2016] Wei Wang, Zhen Cui, Yan Yan, Jiashi Feng, Shuicheng Yan, Xiangbo Shu, and Nicu Sebe. Recurrent face aging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2378–2386, 2016.
 [Yang et al.2016] Hongyu Yang, Di Huang, Yunhong Wang, Heng Wang, and Yuanyan Tang. Face aging effect simulation using hidden factor analysis joint sparse representation. IEEE Transactions on Image Processing, 25(6):2493–2507, 2016.
 [Yang et al.2017] Hongyu Yang, Di Huang, Yunhong Wang, and Anil K. Jain. Learning Face Age Progression: A Pyramid Architecture of GANs. arXiv preprint arXiv:1711.10352, 2017.
 [Zhang et al.2017] Zhifei Zhang, Yang Song, and Hairong Qi. Age Progression/Regression by Conditional Adversarial Autoencoder. arXiv preprint arXiv:1702.08423, 2017.