1 Introduction
Facial age analysis, including age translation, age generation and age estimation, is one of the crucial components of modern face analysis for entertainment and forensics. Age translation (also known as face aging) aims to aesthetically render the facial appearance at a given age. In recent years, with the development of generative adversarial networks (GANs) [7] and image-to-image translation [12, 39], impressive progress [38, 32, 29, 18, 17] has been achieved on age translation. These methods often utilize a target age vector as a condition to control the facial appearance translation. Fig. 2 summarizes the commonly used frameworks for age translation. As shown in Fig. 2 (a), IPCGAN [29] directly incorporates the target age with the inputs to synthesize facial aging images, while CAAE [38] (shown in Fig. 2 (b)) guides age translation by concatenating the age label with the facial image representations in the latent space. The performance of such face aging methods therefore depends on accurate age labels. Although previous methods [32, 18, 17] have achieved remarkable visual results, in practice it is difficult to collect labeled images of continuous ages for intensive aging progression. Since all the existing datasets exhibit a long-tailed age distribution, researchers often employ a time span of 10 years as the age clusters for age translation. This age cluster formulation potentially limits the diversity of aging patterns, especially for younger ages.
Recently, the variational autoencoder (VAE) [15] has shown a promising ability to discover interpretable latent factors [3, 11, 31]. By augmenting the VAE with a hyperparameter $\beta$, $\beta$-VAE [19] controls the degree of disentanglement in the latent space. Benefiting from the disentangling abilities of the VAE, this paper proposes a novel Universal Variational Aging (UVA) framework to formulate facial age priors in a disentangling manner. Compared with the existing methods, UVA disentangles the facial images into an age-related distribution and an age-irrelevant distribution in the latent space, rather than directly utilizing age labels as conditions for age translation. To be specific, the proposed method introduces two latent variables to model the age-related and age-irrelevant information, and employs the variational evidence lower bound (ELBO) to encourage these two parts to be disentangled in the latent space. As shown in Fig. 2 (c), the age-related distribution is assumed to be a Gaussian distribution whose mean value is the real age of the input image, while the age-irrelevant prior is set to a standard normal distribution $\mathcal{N}(0, I)$. The disentangling manner makes UVA more flexible and controllable for facial age analysis. Additionally, to synthesize photorealistic facial images, an extended conditional version of the introspective adversarial learning mechanism [11] is introduced to UVA, which self-estimates the differences between the real and generated samples.
In this way, by manipulating the mean value of the age-related distribution, UVA can easily realize facial age translation to an arbitrary age (shown in Fig. 1), whether or not the age label exists in the training dataset. We further observe an interesting phenomenon: when sampling noise from the age-irrelevant distribution, UVA can generate photorealistic face images with a specific age. Moreover, given a face image as input, we can easily obtain its age label from the mean value of the age-related distribution, which indicates the ability of UVA to achieve age estimation. As stated above, we can implement three different tasks for facial age analysis in a unified framework. To the best of our knowledge, UVA is the first attempt to achieve facial age analysis, including age translation, age generation and age estimation, in a universal framework. The main contributions of UVA are as follows:

We propose a novel Universal Variational Aging (UVA) framework to formulate the continuous facial aging mechanism in a disentangling manner. It leads to a universal framework for facial age analysis, including age translation, age generation and age estimation.

Benefiting from the variational evidence lower bound, in UVA, the facial images are encoded and disentangled into an age-related distribution and an age-irrelevant distribution in the latent space. An extended conditional introspective adversarial learning mechanism is introduced to obtain photorealistic facial image synthesis.

Different from the existing conditional age translation methods, which utilize the age labels/clusters as a condition, UVA tries to estimate an age distribution from the long-tailed facial age dataset. This age distribution estimation provides a new means of conditioning to model continuous ages, which contributes to the interpretability of the image synthesis for facial age analysis.
2 Related Work
2.1 Variational Autoencoder
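Variational Autoencoder (VAE) [15, 22] consists of two networks: an inference network $q_\phi(z|x)$ that maps the data $x$ to the latent variable $z$, which is assumed to follow a Gaussian distribution, and a generative network $p_\theta(x|z)$ that reversely maps the latent variable $z$ to the visible data $x$. The objective of the VAE is to maximize the variational lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \qquad (1)$$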
2.2 Age Translation and Generation
Recently, deep conditional generative models have shown considerable ability in age translation [38, 29, 18, 17]. Zhang et al. [38] propose a Conditional Adversarial Autoencoder (CAAE) to transform an input facial image to the target age. To capture the rich textures in the local facial parts, Li et al. [18] propose a Global and Local Consistent Age Generative Adversarial Network (GLCA-GAN) method. Meanwhile, Identity-Preserved Conditional Generative Adversarial Networks (IPCGAN) [29] introduces an identity-preserved term and an age classification term into age translation. Although these methods have achieved promising visual results, they are limited in modeling a continuous aging mechanism.
With the development of deep generative models, such as the Generative Adversarial Network (GAN) [7] and the Variational Autoencoder (VAE), face generation has achieved dramatic success in recent years. Karras et al. [13] provide a new progressive training scheme for GANs, which grows both the generator and the discriminator progressively. Huang et al. [11] propose the Introspective Variational Autoencoder (IntroVAE), in which the inference model and the generator are jointly trained in an introspective way.
3 Approach
In this section, we propose a universal variational aging framework for age translation, age generation and age estimation. The key idea is to employ an inference network $E$ to embed the facial image into two disentangled variational representations, where one is age-related and the other is age-irrelevant, and a generator network $G$ to produce photorealistic images from the re-sampled latent representations. As depicted in Fig. 3, we assign two different priors to regularize the inferred representations, and train $E$ and $G$ with age preserving regularization in an introspective adversarial manner. The details are given in the following.
3.1 Disentangled Variational Representations
In the original VAE [15], a probabilistic latent variable model is learnt by maximizing the variational lower bound to the marginal likelihood of the observable variables. However, the latent variable $z$ is difficult to interpret and control, because each element of $z$ is treated equally during training. To alleviate this, we manually split $z$ into two parts, i.e., one part $z_R$ representing the age-related information and another part $z_I$ representing the age-irrelevant information.
Assume $z_R$ and $z_I$ are independent of each other; then the posterior distribution factorizes as $q_\phi(z|x) = q_\phi(z_R|x)\,q_\phi(z_I|x)$. The prior distribution is $p(z) = p_R(z_R)\,p_I(z_I)$, where $p_R(z_R)$ and $p_I(z_I)$ are the prior distributions for $z_R$ and $z_I$, respectively. According to Eq. (1), the optimization objective for the modified VAE is to maximize the lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z_R, z_I|x)}\left[\log p_\theta(x|z_R, z_I)\right] - D_{KL}\left(q_\phi(z_R|x)\,\|\,p_R(z_R)\right) - D_{KL}\left(q_\phi(z_I|x)\,\|\,p_I(z_I)\right) \qquad (2)$$
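To make $z_R$ and $z_I$ correspond to different types of facial information, $p_R(z_R)$ and $p_I(z_I)$ are set to be different distributions. Both are isotropic multivariate Gaussians, but $p_R(z_R) = \mathcal{N}(y, I)$ and $p_I(z_I) = \mathcal{N}(0, I)$, where $y$ is a vector filled with the age label of $x$. With these age-related and age-irrelevant priors, maximizing the above variational lower bound encourages the posterior distributions $q_\phi(z_R|x)$ and $q_\phi(z_I|x)$ to be disentangled and to model the age-related and age-irrelevant information, respectively.
Following the original VAE [15], we assume the posteriors $q_\phi(z_R|x)$ and $q_\phi(z_I|x)$ each follow a centered isotropic multivariate Gaussian, i.e., $q_\phi(z_R|x) = \mathcal{N}(z_R; \mu_R, \sigma_R^2)$ and $q_\phi(z_I|x) = \mathcal{N}(z_I; \mu_I, \sigma_I^2)$. As depicted in Fig. 3, $\mu_R$, $\sigma_R$, $\mu_I$ and $\sigma_I$ are the output vectors of the inference network $E$. The input of the generator network $G$ is the concatenation of $z_R$ and $z_I$, where $z_R$ and $z_I$ are sampled from $\mathcal{N}(z_R; \mu_R, \sigma_R^2)$ and $\mathcal{N}(z_I; \mu_I, \sigma_I^2)$ using the reparameterization trick, respectively, i.e., $z_R = \mu_R + \epsilon_R \odot \sigma_R$, $z_I = \mu_I + \epsilon_I \odot \sigma_I$, where $\epsilon_R \sim \mathcal{N}(0, I)$, $\epsilon_I \sim \mathcal{N}(0, I)$.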
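The reparameterization step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-based vectors and the toy dimension of 4 are assumptions made here, whereas the paper's model operates on PyTorch tensors produced by the inference network.

```python
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + eps * sigma elementwise, with eps ~ N(0, 1)."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]


def generator_input(mu_r, sigma_r, mu_i, sigma_i, rng):
    """Concatenate an age-related sample and an age-irrelevant sample."""
    z_r = reparameterize(mu_r, sigma_r, rng)  # age-related part
    z_i = reparameterize(mu_i, sigma_i, rng)  # age-irrelevant part
    return z_r + z_i


# Toy example: a 4-dimensional posterior centered at age 30
# and a 4-dimensional age-irrelevant posterior centered at 0.
rng = random.Random(0)
z = generator_input([30.0] * 4, [1.0] * 4, [0.0] * 4, [1.0] * 4, rng)
```

Because the noise enters only through the elementwise product with the standard deviation, the sample stays a deterministic function of the network outputs, which is what allows gradients to flow back through the sampling step.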
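For convenience of description, the optimization objective in Eq. (2) is rewritten in its negative form:
$$\mathcal{L}_{ELBO} = \mathcal{L}_{rec} + \mathcal{L}_{kl}^{age} + \mathcal{L}_{kl} \qquad (3)$$
where $\mathcal{L}_{rec}$, $\mathcal{L}_{kl}^{age}$ and $\mathcal{L}_{kl}$ denote the three terms in Eq. (2), respectively. They can be computed as below:
$$\mathcal{L}_{rec} = \frac{1}{2}\left\|x - x_r\right\|_F^2 \qquad (4)$$
$$\mathcal{L}_{kl}^{age} = \frac{1}{2}\sum_{i=1}^{C}\left((\mu_R^i - y)^2 + (\sigma_R^i)^2 - \log\left((\sigma_R^i)^2\right) - 1\right) \qquad (5)$$
$$\mathcal{L}_{kl} = \frac{1}{2}\sum_{i=1}^{C}\left((\mu_I^i)^2 + (\sigma_I^i)^2 - \log\left((\sigma_I^i)^2\right) - 1\right) \qquad (6)$$
where $x_r$ is the reconstruction of the input $x$, $y$ is the age label of $x$, and $C$ denotes the dimension of $z_R$ and $z_I$.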
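Since both priors have identity covariance and the posteriors are diagonal Gaussians, the two KL terms have the standard closed form for Gaussians. The sketch below evaluates that closed form; the function name and the toy values are hypothetical, and the age label 30 is chosen only for illustration.

```python
import math


def kl_to_identity_cov_prior(mu, sigma, prior_mean):
    """KL( N(mu, diag(sigma^2)) || N(prior_mean, I) ), summed over dimensions."""
    total = 0.0
    for m, s, p in zip(mu, sigma, prior_mean):
        total += (m - p) ** 2 + s ** 2 - math.log(s ** 2) - 1.0
    return 0.5 * total


dim = 4
age_label = 30.0

# Age-related term: the prior mean is a vector filled with the age label.
age_kl = kl_to_identity_cov_prior([29.0] * dim, [1.0] * dim, [age_label] * dim)

# Age-irrelevant term: the prior is the standard normal distribution.
irrelevant_kl = kl_to_identity_cov_prior([0.5] * dim, [1.0] * dim, [0.0] * dim)
```

The only difference between the two terms is the prior mean, which is what ties the first half of the latent code to the age label while leaving the second half age-agnostic.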
3.2 Introspective Adversarial Learning
To alleviate the problem of generating blurry samples in VAEs, the introspective adversarial learning mechanism [11] is introduced to the proposed UVA. This enables the model to self-estimate the differences between the real and generated samples without extra adversarial discriminators. The inference network $E$ is encouraged to distinguish between the real and generated samples, while the generator network $G$ tries to fool it, as in GANs.
Different from IntroVAE [11], the proposed method employs a part of the posterior distribution, rather than the whole distribution, to serve as the estimator of image reality. Specifically, the age-irrelevant posterior $q_\phi(z_I|x)$ is selected to help the adversarial learning. When training $E$, the model minimizes the KL-distance of the posterior $q_\phi(z_I|x)$ from its prior $p_I(z_I)$ for the real data and maximizes it for the generated samples. When training $G$, the model minimizes this KL-distance for the generated samples.
Similar to IntroVAE [11], the proposed UVA is trained to discriminate the real data from both the model reconstructions and samples. As shown in Fig. 3, these two types of samples are the reconstruction sample $x_r = G(z_R, z_I)$ and the new sample $x_s = G(z_R, \hat{z}_I)$, where $z_R$, $z_I$ and $\hat{z}_I$ are sampled from $q_\phi(z_R|x)$, $q_\phi(z_I|x)$ and $p_I(z_I)$, respectively.
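The adversarial training objectives for $E$ and $G$ are defined as below:
$$\mathcal{L}_{adv}^{E} = \mathcal{L}_{kl}(\mu_I, \sigma_I) + \alpha\left\{\left[m - \mathcal{L}_{kl}(\mu_I', \sigma_I')\right]^{+} + \left[m - \mathcal{L}_{kl}(\mu_I'', \sigma_I'')\right]^{+}\right\} \qquad (7)$$
$$\mathcal{L}_{adv}^{G} = \alpha\left\{\mathcal{L}_{kl}(\mu_I', \sigma_I') + \mathcal{L}_{kl}(\mu_I'', \sigma_I'')\right\} \qquad (8)$$
where $[\cdot]^{+} = \max(0, \cdot)$, $m$ is a positive margin, $\alpha$ is a weighting coefficient, and $(\mu_I, \sigma_I)$, $(\mu_I', \sigma_I')$ and $(\mu_I'', \sigma_I'')$ are computed from the real data $x$, the reconstruction sample $x_r$ and the new sample $x_s$, respectively.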
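The adversarial objectives can be sketched as below, assuming the hinge-with-margin form used by IntroVAE [11]: the inference network keeps the KL term small for real data and pushes it above a margin for generated data, while the generator pulls it back down. The function names and the placement of the weighting coefficient are assumptions for illustration, not a transcription of the authors' code; the inputs are precomputed age-irrelevant KL values for the real image, the reconstruction and the new sample.

```python
def hinge(value):
    """Positive part: max(0, value)."""
    return max(0.0, value)


def adversarial_losses(kl_real, kl_rec, kl_sample, margin, alpha):
    """Return (inference-network loss, generator loss) for one image."""
    # Inference network: low KL on real data, KL above the margin on fakes.
    loss_e = kl_real + alpha * (hinge(margin - kl_rec) + hinge(margin - kl_sample))
    # Generator: make the fakes look real, i.e. lower their KL.
    loss_g = alpha * (kl_rec + kl_sample)
    return loss_e, loss_g


# Toy values: the reconstruction already exceeds the margin, the sample does not.
loss_e, loss_g = adversarial_losses(2.0, 1500.0, 400.0, margin=1000.0, alpha=0.5)
```

With these toy values only the new sample contributes to the hinge terms, since the reconstruction's KL already exceeds the margin.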
The proposed UVA can be viewed as a conditional version of IntroVAE [11]. The disentangled variational representation and the modified introspective adversarial learning make it superior to IntroVAE in the interpretability and controllability of image generation.
3.3 Age Preserving Regularization
Age accuracy is important to facial age analysis. Most current face aging methods [38, 29, 18, 17, 32] utilize an additional pre-trained age estimation network [30] to supervise the generated face images. In our method, by contrast, facial age can be estimated using the inference network $E$ by computing the mean value of the inferred vector $\mu_R$. To better capture and disentangle age-related information, an age regularization term is applied to the learned representation $\mu_R$:
$$\mathcal{L}_{reg} = \left|\frac{1}{C}\sum_{i=1}^{C}\mu_R^i - y\right| \qquad (9)$$
where $C$ denotes the dimension of $\mu_R$, $\mu_R$ is the output vector for the input $x$, and $y$ is the age label of $x$.
To further supervise the generator $G$ to reconstruct the age-related information, a similar age regularization term is also employed on the reconstructed and generated samples in Fig. 3, i.e., $x_r$ and $x_s$. The age preserving loss is formulated as:
$$\mathcal{L}_{age\_keep} = \mathcal{L}_{reg}(\mu_R', y) + \mathcal{L}_{reg}(\mu_R'', y) \qquad (10)$$
where $\mu_R'$ and $\mu_R''$ are computed by $E$ from $x_r$ and $x_s$, respectively.
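A minimal sketch of the age regularization and age preserving terms follows. It assumes the regularizer is the absolute difference between the mean of the age-related output vector and the age label; the exact distance used by the authors is not recoverable from the text, so this choice and all function names are assumptions.

```python
def mean_of(vector):
    """Mean value of a vector given as a list of floats."""
    return sum(vector) / len(vector)


def age_regularization(mu_r, age_label):
    """Assumed form: |mean(mu_r) - age_label|."""
    return abs(mean_of(mu_r) - age_label)


def age_preserving_loss(mu_r_rec, mu_r_sample, age_label):
    """Apply the same regularizer to the reconstruction and the new sample."""
    return (age_regularization(mu_r_rec, age_label)
            + age_regularization(mu_r_sample, age_label))


# Toy example with a 2-dimensional age-related vector and age label 30.
reg = age_regularization([29.0, 31.0], 30.0)
keep = age_preserving_loss([32.0, 32.0], [27.0, 29.0], 30.0)
```

Both terms reuse the inference network as the age estimator, which is why no separately pre-trained age network is needed.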
3.4 Training UVA networks
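As shown in Fig. 3, the inference network $E$ embeds the facial image $x$ into two disentangled distributions, i.e., $\mathcal{N}(z_R; \mu_R, \sigma_R^2)$ and $\mathcal{N}(z_I; \mu_I, \sigma_I^2)$, where $(\mu_R, \sigma_R)$ and $(\mu_I, \sigma_I)$ are the output vectors of $E$. The generator network $G$ produces the reconstruction $x_r = G(z_R, z_I)$ and the sample $x_s = G(z_R, \hat{z}_I)$, where $z_R$, $z_I$ and $\hat{z}_I$ are sampled from $q_\phi(z_R|x)$, $q_\phi(z_I|x)$ and $p_I(z_I)$, respectively. To learn the disentangled variational representations and produce high-fidelity facial images, the network is trained using a weighted sum of the above losses, defined as:
$$\mathcal{L}_{E} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{kl}^{age} + \lambda_2 \mathcal{L}_{adv}^{E} + \lambda_3 \mathcal{L}_{reg} \qquad (11)$$
$$\mathcal{L}_{G} = \mathcal{L}_{rec} + \lambda_4 \mathcal{L}_{adv}^{G} + \lambda_5 \mathcal{L}_{age\_keep} \qquad (12)$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are the trade-off parameters that balance the importance of the losses. Note that the third term in the ELBO loss in Eq. (3) is already contained in the adversarial loss for $E$, i.e., the first term of $\mathcal{L}_{adv}^{E}$ in Eq. (7).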
3.5 Inference and Sampling
By regularizing the disentangled representations with the age-related prior $p_R(z_R) = \mathcal{N}(y, I)$ and the age-irrelevant prior $p_I(z_I) = \mathcal{N}(0, I)$, UVA is a universal framework for age translation, generation and estimation tasks, as illustrated in Fig. 2 (c).
Age Translation. To achieve the age translation task, we concatenate the age-irrelevant variable $z_I$ and a target age variable $\hat{z}_R$ as the input of the generator $G$, where $z_I$ is sampled from the posterior distribution $q_\phi(z_I|x)$ over the input face $x$ and $\hat{z}_R$ is sampled from the age-related prior $p_R(z_R)$ centered at the target age. The age translation result $x_t$ is written as:
$$x_t = G(\hat{z}_R, z_I) \qquad (13)$$
Age Generation. For the age generation task, there are two settings. One is to generate an image from noise $\hat{z}_I$ with an arbitrary age. To be specific, we concatenate a target age variable $\hat{z}_R$ and an age-irrelevant variable $\hat{z}_I$ as the input of $G$, where $\hat{z}_R$ and $\hat{z}_I$ are sampled from $p_R(z_R)$ and $p_I(z_I)$, respectively. Then the age generation result is:
$$x_g = G(\hat{z}_R, \hat{z}_I) \qquad (14)$$
The other is to generate an image from noise $\hat{z}_I$ that shares the same age-related variable $z_R$ with the input. Specifically, we concatenate the age-related variable $z_R$ and an age-irrelevant variable $\hat{z}_I$ as the input of $G$, where $z_R$ is sampled from the posterior distribution $q_\phi(z_R|x)$ over the input face $x$ and $\hat{z}_I$ is sampled from $p_I(z_I)$. The age generation result is formulated as:
$$x_g' = G(z_R, \hat{z}_I) \qquad (15)$$
Age Estimation. In this paper, age estimation is also conducted by the proposed UVA to verify how well the age-related variable is extracted and disentangled. We calculate the mean value of the $C$-dimensional vector $\mu_R$ as the age estimation result, defined as:
$$\hat{y} = \frac{1}{C}\sum_{i=1}^{C}\mu_R^i \qquad (16)$$
where $\mu_R$ is one of the output vectors of the inference network $E$.
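The three inference modes differ only in where each half of the latent code comes from. The sketch below builds the generator input for translation and for generation from noise, and estimates age from the age-related mean vector. It does not include the generator itself, and every name in it is hypothetical; the unit standard deviation of the age-related prior follows the identity-covariance prior stated in Section 3.1.

```python
import random


def sample_gaussian(mean, std, rng):
    """Elementwise sample from independent Gaussians."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def translation_code(mu_i, sigma_i, target_age, rng):
    """Target age from the age-related prior, identity from the input's posterior."""
    dim = len(mu_i)
    z_r = sample_gaussian([target_age] * dim, [1.0] * dim, rng)
    z_i = sample_gaussian(mu_i, sigma_i, rng)
    return z_r + z_i


def generation_code(target_age, dim, rng):
    """Both halves drawn from the priors: a new face at the target age."""
    z_r = sample_gaussian([target_age] * dim, [1.0] * dim, rng)
    z_i = sample_gaussian([0.0] * dim, [1.0] * dim, rng)
    return z_r + z_i


def estimate_age(mu_r):
    """Age estimate: mean value of the age-related mean vector."""
    return sum(mu_r) / len(mu_r)


rng = random.Random(0)
code_60 = translation_code([0.5, 0.5, 0.5], [0.0, 0.0, 0.0], 60.0, rng)
new_face_code = generation_code(25.0, 3, rng)
estimated = estimate_age([29.0, 31.0, 30.0])
```

In each case the resulting list would be passed to the generator; swapping only the age-related half while keeping the age-irrelevant half fixed is what makes continuous age translation possible.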
4 Experiments
4.1 Datasets and Settings
4.1.1 Datasets
Cross-Age Celebrity Dataset (CACD2000) [4] consists of 163,446 color facial images of 2,000 celebrities, where the ages range from 14 to 62 years old. However, it contains a large amount of noisy data, which makes model training challenging. We choose the classical 80-20 split on CACD2000.
Morph [23] is the largest publicly available dataset collected in a constrained environment. It contains 55,349 color facial images of 13,672 subjects with ages ranging from 16 to 77 years old. The conventional 80-20 split is conducted on Morph.
UTKFace [36] is a large-scale facial age dataset with a long age span, which ranges from 0 to 116 years old. It contains over 20,000 facial images in the wild with large variations in expression, occlusion, resolution and pose. We employ the classical 80-20 split on UTKFace.
MegaAge-Asian [37] is a newly released facial age dataset, consisting of 40,000 Asian faces with ages ranging from 0 to 70 years old. There are extreme variations, such as distortion, large-area occlusion and heavy makeup, in MegaAge-Asian. Following [37], we reserve 3,945 images for testing and the remainder is treated as the training set.
FG-NET [16] contains 1,002 facial images of 82 subjects. We employ it as the testing set to evaluate the generalization of UVA.
4.1.2 Experimental Settings
Following [17], we employ the multi-task cascaded CNN [35] to detect the faces. All the facial images are cropped and aligned to 224 × 224, while most existing methods [29, 38, 18] only achieve age translation at 128 × 128. Our model is implemented with PyTorch. During training, we choose the Adam optimizer [14] with $\beta_1$ of 0.9, $\beta_2$ of 0.99, a fixed learning rate and a batch size of 28. The trade-off parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are all set to 1. Besides, the margin $m$ is set to 1000 and the weighting coefficient $\alpha$ is set to 0.5. More details of the network architectures and training processes are provided in the supplementary materials.
4.2 Qualitative Evaluation of UVA
By manipulating the mean value and sampling from the age-related distribution, the proposed UVA can translate the input to an arbitrary age. Fig. 4 and 5 present the age translation results on CACD2000 and Morph, respectively. We observe that the synthesized faces become progressively older as the target age increases. Specifically, the face contours become longer, the forehead hairline becomes higher and the nasolabial folds deepen.
4.2.1 Age Translation
Since both CACD2000 and Morph lack images of children, we conduct age translation on MegaAge-Asian and UTKFace. Fig. 6 shows the translation results on MegaAge-Asian from 10 to 70 years old. Fig. 7 shows the results on UTKFace from 0 to 115 years old with an interval of 5 years. From birth to adulthood, the aging effect is mainly shown in craniofacial growth, while the aging effect from adulthood to old age is reflected in skin aging, which is consistent with human physiology.
Furthermore, unlike most existing methods [38, 29, 18, 17], which can only generate images for specific age groups, UVA is able to realize continuous age translation with a 1-year interval. The translation results on UTKFace from 0 to 119 years old with a 1-year interval are presented in the supplementary materials, due to the page limitation.
To evaluate the model generalization, we test UVA on FG-NET and show the translation results in Fig. 8. The left image of each subject is the input and the right one is the image translated to 60 years old. Note that UVA is trained only on CACD2000 and tested on FG-NET.
4.2.2 Age Generation
Benefiting from the disentangled age-related and age-irrelevant distributions in the latent space, the proposed UVA is able to realize age generation. On the one hand, conditioned on a given facial image, UVA can generate new faces with the same age-related representation as the given image by fixing the age-related variable $z_R$ and sampling the age-irrelevant variable $\hat{z}_I$. Fig. 10 presents some results under this setting on the Morph and CACD2000 datasets. UVA generates diverse faces (such as different genders, appearances and haircuts) with the same age as the input facial image.
On the other hand, by setting an arbitrary mean of the age-related distribution and sampling both the age-related variable $\hat{z}_R$ and the age-irrelevant variable $\hat{z}_I$, UVA has the ability to generate faces with various ages. Fig. 11 and 12 display the age generation results from 20 to 60 on the CACD2000 and Morph datasets, respectively. Each row has the same age-irrelevant representation $\hat{z}_I$ and each column has the same age-related representation $\hat{z}_R$. We can observe that the age-irrelevant information, such as gender and identity, is preserved across rows, and the age-related information, such as white hair and wrinkles, is preserved across columns. These results demonstrate that UVA effectively disentangles age-related and age-irrelevant representations in the latent space. More results on the UTKFace and MegaAge-Asian datasets are presented in the supplementary materials, due to the page limitation.
4.2.3 Generalized Abilities of UVA
Benefiting from the disentangling manner, UVA is able to estimate the age-related distribution from the facial images, even if the training dataset is long-tailed. Although the facial images in Morph range in age from 16 to 77, UVA can still synthesize photorealistic images when performing age translation to 10 years old, as shown in Fig. 13. This observation demonstrates the generalization ability of our framework, and also indicates that UVA can efficiently and accurately estimate the age-related distribution from the facial images, even if the training dataset exhibits a long-tailed age distribution.
4.3 Quantitative Evaluation of UVA
4.3.1 Age Estimation
To further demonstrate the disentangling ability of UVA, we conduct age estimation experiments on Morph and MegaAge-Asian, both of which are widely used datasets for the age estimation task. We detail the evaluation metrics in the supplementary materials.
For the experiment on Morph, following [21], we choose the 80-20 random split protocol and report the mean absolute error (MAE). The results are listed in Table 1, where our model has not been pre-trained on external data. We can see from Table 1 that the proposed UVA achieves performance comparable to the state-of-the-art methods.
Table 1: Age estimation results (MAE) on Morph.
Methods          Pre-trained   MAE
OR-CNN [20]      -             3.34
DEX [24]         IMDB-WIKI     2.68
Ranking [5]      Adience       2.96
Posterior [37]   -             2.87
SSR-Net [34]     IMDB-WIKI     2.52
MV Loss [21]     -             2.514
ThinAgeNet [6]   MS-Celeb-1M   2.346
UVA              -             2.52
Used partial data of the dataset.
For age estimation on MegaAge-Asian, following [37], we reserve 3,945 images for testing and employ the Cumulative Accuracy (CA) as the evaluation metric. We report CA(3), CA(5) and CA(7), as in [37, 34], in Table 2. Note that our model has not been pre-trained on external data and still achieves comparable performance.
Table 2: Age estimation results (CA, %) on MegaAge-Asian.
Methods          Pre-trained   CA(3)   CA(5)   CA(7)
Posterior [37]   IMDB-WIKI     62.08   80.43   90.42
MobileNet [34]   IMDB-WIKI     44.0    60.6    -
DenseNet [34]    IMDB-WIKI     51.7    69.4    -
SSR-Net [34]     IMDB-WIKI     54.9    74.1    -
UVA              -             60.47   79.95   90.44
The age estimation results on Morph and MegaAge-Asian are nearly as good as the state of the art, which also demonstrates that the age-related representation is well disentangled from the age-irrelevant representation.
4.3.2 Aging Accuracy
We conduct aging accuracy experiments with UVA on Morph and CACD2000; aging accuracy is an essential quantitative metric for evaluating age translation. For a fair comparison, following [33, 18], we utilize the online face analysis tool of Face++ [2] to evaluate the ages of the synthetic results, and divide the testing data of Morph and CACD2000 into four age groups: 30 and younger (AG0), 31-40 (AG1), 41-50 (AG2) and 51+ (AG3). We choose AG0 as the input and synthesize images in AG1, AG2 and AG3. Then we estimate the ages of the synthesized images and calculate the average age for each group.
Table 3: Average estimated ages of the synthesized images for each age group.
(a) on Morph
Method             Input   AG1     AG2     AG3
CAAE [38]          -       28.13   32.50   36.83
Yang et al. [33]   -       42.84   50.78   59.91
UVA                -       36.59   50.14   60.78
Real Data          28.19   38.89   48.10   58.22
(b) on CACD2000
Method             Input   AG1     AG2     AG3
CAAE [38]          -       31.32   34.94   36.91
Yang et al. [33]   -       44.29   48.34   52.02
UVA                -       39.19   45.73   51.32
Real Data          30.73   39.08   47.06   53.68
As shown in Table 3, we compare UVA with CAAE [38] and Yang et al. [33] on Morph and CACD2000. We observe that the ages produced by UVA are closer to the real data than those produced by CAAE [38] and are comparable to those of Yang et al. [33]. Note that, for age translation among multiple ages, the proposed UVA only needs to train one model, whereas Yang et al. [33] need to train a separate model for each transformation, such as Input→AG1, Input→AG2 and Input→AG3.
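The grouping and averaging used in this evaluation can be sketched as follows. The age estimates themselves come from an external tool in the paper; here they are simply passed in as a list, and the function names and sample values are hypothetical.

```python
import bisect

# Upper bounds (inclusive) of AG0, AG1 and AG2; anything older falls in AG3.
GROUP_UPPER_BOUNDS = [30, 40, 50]
GROUP_NAMES = ["AG0", "AG1", "AG2", "AG3"]


def age_group(age):
    """Map an age in years to one of the four age groups."""
    return GROUP_NAMES[bisect.bisect_left(GROUP_UPPER_BOUNDS, age)]


def mean_estimated_age(estimated_ages):
    """Average of the ages estimated for one group of synthesized images."""
    return sum(estimated_ages) / len(estimated_ages)


# Toy example: two synthesized images targeted at AG1.
average_ag1 = mean_estimated_age([36.0, 37.0])
```

An average that lands close to the real-data average of the target group indicates accurate aging, which is how the rows of Table 3 are meant to be compared.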
4.3.3 Diversity
In this subsection, we utilize the Fréchet Inception Distance (FID) [9] to evaluate the quality and diversity of the generated data. FID measures the distance between the generated and the real distributions in the feature space. Here, the testing images are generated from noise $\hat{z}_I$ with target ages $\hat{z}_R$, where $\hat{z}_R$ and $\hat{z}_I$ are sampled from the age-related prior $p_R(z_R)$ and the age-irrelevant prior $p_I(z_I)$, respectively. The FID results are detailed in Table 4.
Table 4: FID results on the four datasets.
       Morph    CACD2000   UTKFace   MegaAge-Asian
FID    17.154   37.08      31.18     32.79
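For reference, the standard definition of FID from [9] compares Gaussian statistics of Inception features. The symbols $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$, denoting the feature mean and covariance of the real and generated images, are introduced here for illustration and do not appear in the paper:

$$\mathrm{FID} = \left\|\mu_r - \mu_g\right\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

Lower values indicate that the generated feature distribution is closer to the real one, so the score reflects both sample quality and diversity.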
4.4 Ablation Study
We conduct an ablation study on Morph to evaluate the effectiveness of introspective adversarial learning and age preserving regularization. The details are given in the supplementary materials, due to the page limitation.
5 Conclusion
This paper has proposed a Universal Variational Aging framework for continuous age analysis. Benefiting from the variational evidence lower bound, the facial images are disentangled into an age-related distribution and an age-irrelevant distribution in the latent space. A conditional introspective adversarial learning mechanism is introduced to improve the quality of facial aging synthesis. In this way, we can deal with long-tailed data and implement three different tasks, including age translation, age generation and age estimation. To the best of our knowledge, UVA is the first attempt to achieve facial age analysis in a universal framework. The qualitative and quantitative experiments demonstrate the superiority of the proposed UVA on five popular datasets, including CACD2000, Morph, UTKFace, MegaAge-Asian and FG-NET. This indicates that UVA can efficiently formulate the facial age prior, which contributes to both photorealistic and interpretable image synthesis for facial age analysis.
References
[1] Agingbooth. PiVi and Co. https://itunes.apple.com/us/app/agingbooth/id357467791?mt=8/, [Online].
[2] Face++ research toolkit. Megvii Inc. http://www.faceplusplus.com, [Online].
[3] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
[4] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Transactions on Multimedia, 17(6):804–815, 2015.
[5] S. Chen, C. Zhang, M. Dong, J. Le, and M. Rao. Using ranking-CNN for age estimation. In CVPR, 2017.
[6] B.-B. Gao, H.-Y. Zhou, J. Wu, and X. Geng. Age estimation using expectation of label distribution learning. In IJCAI, 2018.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[8] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
[9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[10] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In WACV. IEEE, 2017.
[11] H. Huang, R. He, Z. Sun, T. Tan, et al. IntroVAE: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018.
[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[13] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[16] A. Lanitis, C. J. Taylor, and T. F. Cootes. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):442–455, 2002.
[17] P. Li, Y. Hu, R. He, and Z. Sun. Global and local consistent wavelet-domain age synthesis. IEEE Transactions on Information Forensics and Security, 2018.
[18] P. Li, Y. Hu, Q. Li, R. He, and Z. Sun. Global and local consistent age generative adversarial networks. In ICPR, 2018.
[19] L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[20] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua. Ordinal regression with multiple output CNN for age estimation. In CVPR, 2016.
[21] H. Pan, H. Han, S. Shan, and X. Chen. Mean-variance loss for deep age estimation from a face. In CVPR, 2018.
[22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[23] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In FGR. IEEE, 2006.
[24] R. Rothe, R. Timofte, and L. Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
[25] S. Semeniuta, A. Severyn, and E. Barth. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390, 2017.
[26] X. Shu, J. Tang, H. Lai, L. Liu, and S. Yan. Personalized age progression with aging dictionary. In ICCV, Dec. 2015.
[27] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
[28] W. Wang, Z. Cui, Y. Yan, J. Feng, S. Yan, X. Shu, and N. Sebe. Recurrent face aging. In CVPR, Jun. 2016.
[29] Z. Wang, X. Tang, W. Luo, and S. Gao. Face aging with identity-preserved conditional generative adversarial networks. In CVPR, 2018.
[30] X. Wu, R. He, Z. Sun, and T. Tan. A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[31] X. Wu, H. Huang, V. M. Patel, R. He, and Z. Sun. Disentangled variational representation for heterogeneous face recognition. In AAAI, 2019.
[32] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of GANs. In CVPR, 2018.
[33] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of GANs. In CVPR, 2018.
[34] T.-Y. Yang, Y.-H. Huang, Y.-Y. Lin, P.-C. Hsiu, and Y.-Y. Chuang. SSR-Net: A compact soft stagewise regression network for age estimation. In IJCAI, 2018.
[35] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[36] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR. IEEE, 2017.
[37] Y. Zhang, L. Liu, C. Li, et al. Quantifying facial age by posterior of age comparisons. arXiv preprint arXiv:1708.09687, 2017.
[38] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.
[39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
6 Supplementary Materials
In this supplementary material, we first introduce the network architecture and the training algorithm of the proposed UVA. Then we describe the metrics used for the age estimation experiments. Additional comparisons are conducted to demonstrate the continuous manner of UVA in Section 6.4. Besides, Section 6.5 presents the ablation study, followed by additional results of age generation and translation.
6.1 Network Architecture
The network architectures of the proposed UVA are shown in Table 5. All images are aligned to 256 × 256 as the inputs.
Table 5: Network architectures of UVA.
Layer          Input                Filter/Stride   Output Size
Encoder
Econv1
Eavgpool1      Econv1
Eres1          Eavgpool1
Eavgpool2      Eres1
Eres2          Eavgpool2
Eavgpool3      Eres2
Eres3          Eavgpool3
Eavgpool4      Eres3
Eres4          Eavgpool4
Eavgpool5      Eres4
Eres5          Eavgpool5
Eavgpool6      Eres5
Eres6          Eavgpool6
Eflatten, fc   Eres6                -               1024
Esplit         Efc                  -               256, 256, 256, 256
Erepara1       Esplit[0,1]          -               256
Erepara2       Esplit[2,3]          -               256
Ecat           Erepara1, Erepara2   -               512
Generator
Gfc, reshape   Ecat                 -
Gres1          Gfc
Gupsample1     Gres1                nearest
Gres2          Gupsample1
Gupsample2     Gres2                nearest
Gres3          Gupsample2
Gupsample3     Gres3                nearest
Gres4          Gupsample3
Gupsample4     Gres4                nearest
Gres5          Gupsample4
Gupsample5     Gres5                nearest
Gres6          Gupsample5
Gupsample6     Gres6                nearest
Gres7          Gupsample6
Gconv1         Gres7
6.2 Training Algorithm
The training process of UVA is described in Algorithm 1.
6.3 Evaluation Metrics of Age Estimation
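We evaluate the performance of UVA on the age estimation task with the Mean Absolute Error (MAE) and Cumulative Accuracy (CA).
MAE is widely used to evaluate the performance of age estimation models, and is defined as the average distance between the inferred age and the ground truth:
$$\mathrm{MAE} = \frac{1}{K}\sum_{i=1}^{K}\left|\hat{y}_i - y_i\right| \qquad (17)$$
where $K$ is the number of test images, and $\hat{y}_i$ and $y_i$ are the inferred age and the ground-truth age of the $i$-th image.
The Cumulative Accuracy (CA) is defined as:
$$\mathrm{CA}(n) = \frac{K_n}{K} \times 100\% \qquad (18)$$
where $K_n$ is the number of test images whose absolute estimated error is smaller than $n$.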
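Both metrics are straightforward to compute from paired lists of predicted and ground-truth ages. The sketch below follows the wording above, counting an image toward CA only when its absolute error is strictly smaller than the threshold; the function names and the toy data are illustrative assumptions.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and ground-truth ages."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def cumulative_accuracy(predicted, actual, threshold):
    """Percentage of images whose absolute error is smaller than threshold."""
    hits = sum(1 for p, a in zip(predicted, actual) if threshold > abs(p - a))
    return 100.0 * hits / len(actual)


# Toy data: absolute errors are 2, 3, 1 and 10 years.
predicted_ages = [25, 33, 41, 60]
true_ages = [27, 30, 40, 50]

mae = mean_absolute_error(predicted_ages, true_ages)
ca3 = cumulative_accuracy(predicted_ages, true_ages, 3)
ca5 = cumulative_accuracy(predicted_ages, true_ages, 5)
```

MAE is sensitive to the single 10-year outlier, whereas CA only records whether each error falls under the threshold, which is why the two metrics are reported together.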
6.4 Additional Comparison Results on FGNET
Additional comparison results with the conditional adversarial autoencoder (CAAE) [38] on FG-NET are shown in Fig. 14. Note that our model is trained on UTKFace. We observe that CAAE can only translate the input into specific age groups, including 0-5, 6-10, 11-15, 16-20, 21-30, 31-40, 41-50, 51-60, 61-70 and 71-80, while UVA can perform age translation to an arbitrary age. In addition, CAAE only focuses on the facial appearance without hair, while UVA achieves age translation on the whole face. As shown in Fig. 14, the hairline gradually becomes higher as age increases.
6.5 Ablation Study
In this section, we conduct an ablation study on Morph to evaluate the effectiveness of the introspective adversarial loss, the age preserving loss and the age regularization term, respectively. We report qualitative visualization results for a comprehensive comparison. Fig. 15 shows visual comparisons between UVA and its three incomplete variants. We observe that the proposed UVA is visually better than its variants across all ages. Without the age preserving loss or the age regularization term, the aging images lack detailed texture information. Besides, without the introspective adversarial loss, the aging faces are obviously blurry. The visualization results demonstrate that all three components are essential for UVA.
6.6 Age Generation
Benefiting from the disentangled age-related and age-irrelevant distributions in the latent space, the proposed UVA is able to realize age generation. On the one hand, given a facial image, when fixing the age-related distribution and sampling the age-irrelevant variable from noise, UVA can generate various faces with the specific age. Fig. 18 shows the age generation results on Morph, CACD2000, MegaAge-Asian and UTKFace. On the other hand, by manipulating the mean of the age-related distribution and sampling the age-irrelevant variable from noise, UVA can generate faces with various ages. Fig. 16 shows the age generation results from 10 to 70 on MegaAge-Asian.
6.7 Age Translation
As shown in Fig. 17, UTKFace exhibits a long-tailed age distribution. The age translation results on UTKFace are presented in Fig. 19, 20 and 21. The ages range from 0 to 119 years old with a 1-year aging interval. Although the images in UTKFace only range from 0 to 116 years old, UVA can synthesize photorealistic aging images even at 117, 118 and 119 years old. These results suggest that UVA can efficiently and accurately estimate the age-related distribution from the facial images, even if the training dataset exhibits a long-tailed age distribution. Considering that generating high-resolution images is important for broadening the applications of face aging, in the future we will build a new facial age dataset to support age-related studies.