1 Introduction
Facial age analysis, including age translation, age generation and age estimation, is one of the crucial components of modern face analysis for entertainment and forensics. Age translation (also known as face aging) aims to aesthetically render a facial appearance at a given age. In recent years, with the development of generative adversarial networks (GANs) [7] and image-to-image translation [12, 39], impressive progress [38, 32, 29, 18, 17] has been achieved on age translation. These methods often utilize a target age vector as a condition to dominate the facial appearance translation. Fig. 2 summarizes the commonly used frameworks for age translation. As shown in Fig. 2 (a), IPC-GAN [29] directly incorporates the target age with the inputs to synthesize facial aging images, while CAAE [38] (shown in Fig. 2 (b)) guides age translation by concatenating the age label with the facial image representations in the latent space. Obviously, the performance of face aging depends on accurate age labels. Although previous methods [32, 18, 17] have achieved remarkable visual results, in practice it is difficult to collect labeled images of continuous ages for intensive aging progression. Since all the existing datasets exhibit a long-tailed age distribution, researchers often employ time spans of 10 years as age clusters for age translation. This age cluster formulation potentially limits the diversity of aging patterns, especially for younger ages.
Recently, the variational autoencoder (VAE) [15] has shown a promising ability to discover interpretable latent factors [3, 11, 31]. By augmenting the VAE objective with a hyper-parameter $\beta$, $\beta$-VAE [19] controls the degree of disentanglement in the latent space. Benefiting from the disentangling ability of VAEs, this paper proposes a novel Universal Variational Aging (UVA) framework that formulates facial age priors in a disentangling manner. Compared with existing methods, UVA disentangles a facial image into an age-related distribution and an age-irrelevant distribution in the latent space, rather than directly utilizing age labels as conditions for age translation. To be specific, the proposed method introduces two latent variables to model the age-related and age-irrelevant information, and employs the variational evidence lower bound (ELBO) to encourage these two parts to be disentangled in the latent space. As shown in Fig. 2 (c), the age-related prior is assumed to be a Gaussian distribution whose mean is the real age $y$ of the input image, while the age-irrelevant prior is set to a standard normal distribution $\mathcal{N}(0, I)$. This disentangling manner makes UVA more flexible and controllable for facial age analysis. Additionally, to synthesize photorealistic facial images, an extended conditional version of the introspective adversarial learning mechanism [11] is introduced to UVA, which self-estimates the differences between the real and generated samples.
In this way, by manipulating the mean value of the age-related distribution, UVA can easily realize facial age translation at an arbitrary age (shown in Fig. 1), whether or not the age label exists in the training dataset. We further observe an interesting phenomenon: when sampling noise from the age-irrelevant distribution, UVA can generate photorealistic face images at a specific age. Moreover, given a face image as input, we can easily obtain its age from the mean value of the age-related distribution, which indicates the ability of UVA to achieve age estimation. As stated above, we can implement three different tasks of facial age analysis in a unified framework. To the best of our knowledge, UVA is the first attempt to achieve facial age analysis, including age translation, age generation and age estimation, in a universal framework. The main contributions of UVA are as follows:
- We propose a novel Universal Variational Aging (UVA) framework that formulates the continuous facial aging mechanism in a disentangling manner. This leads to a universal framework for facial age analysis, covering age translation, age generation and age estimation.
- Benefiting from the variational evidence lower bound, UVA encodes and disentangles facial images into an age-related distribution and an age-irrelevant distribution in the latent space. An extended conditional introspective adversarial learning mechanism is introduced to obtain photorealistic facial image synthesis.
- Different from existing conditional age translation methods, which utilize age labels/clusters as conditions, UVA estimates an age distribution from a long-tailed facial age dataset. This age distribution estimation provides a new conditioning mechanism for modeling continuous ages, which contributes to the interpretability of image synthesis for facial age analysis.
2 Related Work
2.1 Variational Autoencoder
Variational Autoencoder (VAE) [15, 22] consists of two networks: an inference network $q_\phi(z|x)$ that maps the data $x$ to the latent variable $z$, which is assumed to follow a Gaussian distribution, and a generative network $p_\theta(x|z)$ that reversely maps the latent variable $z$ to the visible data $x$. The objective of VAE is to maximize the variational lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \quad (1)$$
2.2 Age Translation and Generation
Recently, deep conditional generative models have shown considerable ability in age translation [38, 29, 18, 17]. Zhang et al. [38] propose a Conditional Adversarial Autoencoder (CAAE) to transform an input facial image to a target age. To capture the rich textures in local facial parts, Li et al. [18] propose a Global and Local Consistent Age Generative Adversarial Network (GLCA-GAN). Meanwhile, Identity-Preserved Conditional Generative Adversarial Networks (IPCGAN) [29] introduce an identity-preserving term and an age classification term into age translation. Although these methods have achieved promising visual results, they have limitations in modeling the continuous aging mechanism.
With the development of deep generative models, i.e., the Generative Adversarial Network (GAN) [7] and the Variational Autoencoder (VAE), face generation has achieved dramatic success in recent years. Karras et al. [13] provide a new progressive training scheme for GANs, growing both the generator and discriminator progressively. Huang et al. [11] propose the Introspective Variational Autoencoder (IntroVAE), in which the inference model and the generator are jointly trained in an introspective way.
3 Approach
In this section, we propose a universal variational aging framework for age translation, age generation and age estimation. The key idea is to employ an inference network $E$ to embed the facial image into two disentangled variational representations, one age-related and the other age-irrelevant, and a generator network $G$ to produce photo-realistic images from the re-sampled latent representations. As depicted in Fig. 3, we assign two different priors to regularize the inferred representations, and train $E$ and $G$ with age preserving regularization in an introspective adversarial manner. The details are given in the following.
3.1 Disentangled Variational Representations
In the original VAE [15], a probabilistic latent variable model is learnt by maximizing the variational lower bound on the marginal likelihood of the observable variables. However, the latent variable $z$ is difficult to interpret and control, since each element of $z$ is treated equally during training. To alleviate this, we manually split $z$ into two parts, i.e., $z_{age}$ representing the age-related information and $z_{id}$ representing the age-irrelevant information.
Assuming $z_{age}$ and $z_{id}$ are independent of each other, the posterior distribution factorizes as $q_\phi(z_{age}, z_{id}|x) = q_\phi(z_{age}|x)\,q_\phi(z_{id}|x)$. The prior distribution is $p(z_{age}, z_{id}) = p(z_{age})\,p(z_{id})$, where $p(z_{age})$ and $p(z_{id})$ are the prior distributions for $z_{age}$ and $z_{id}$, respectively. According to Eq. (1), the optimization objective of the modified VAE is to maximize the lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z_{age}, z_{id}|x)}\left[\log p_\theta(x|z_{age}, z_{id})\right] - D_{KL}\left(q_\phi(z_{age}|x)\,\|\,p(z_{age})\right) - D_{KL}\left(q_\phi(z_{id}|x)\,\|\,p(z_{id})\right) \quad (2)$$

To make $z_{age}$ and $z_{id}$ correspond to different types of facial information, $p(z_{age})$ and $p(z_{id})$ are set to be different distributions. Both are isotropic multivariate Gaussians, but $p(z_{age}) = \mathcal{N}(y, I)$ and $p(z_{id}) = \mathcal{N}(0, I)$, where $y$ is a vector filled with the age label of $x$. Through these two age-related and age-irrelevant priors, maximizing the above variational lower bound encourages the posterior distributions $q_\phi(z_{age}|x)$ and $q_\phi(z_{id}|x)$ to be disentangled and to model the age-related and age-irrelevant information, respectively.
Following the original VAE [15], we assume the posteriors $q_\phi(z_{age}|x)$ and $q_\phi(z_{id}|x)$ follow two isotropic multivariate Gaussians, i.e., $q_\phi(z_{age}|x) = \mathcal{N}(\mu_{age}, \sigma_{age}^2 I)$ and $q_\phi(z_{id}|x) = \mathcal{N}(\mu_{id}, \sigma_{id}^2 I)$. As depicted in Fig. 3, $\mu_{age}$, $\sigma_{age}$, $\mu_{id}$ and $\sigma_{id}$ are the output vectors of the inference network $E$. The input of the generator network $G$ is the concatenation of $z_{age}$ and $z_{id}$, where $z_{age}$ and $z_{id}$ are sampled from $q_\phi(z_{age}|x)$ and $q_\phi(z_{id}|x)$ using the reparameterization trick, i.e., $z_{age} = \mu_{age} + \sigma_{age} \odot \epsilon_1$ and $z_{id} = \mu_{id} + \sigma_{id} \odot \epsilon_2$, where $\epsilon_1, \epsilon_2 \sim \mathcal{N}(0, I)$.
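As a concrete illustration of this split-and-resample step, the following PyTorch sketch shows how a 1024-d encoder output can be divided into the four posterior vectors and re-sampled with the reparameterization trick. The 256-d latent sizes follow the architecture summary in the supplementary material; the tensor and function names are our own assumptions, not the authors' released code.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I)
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

# Suppose the inference network E ends with a fully connected layer
# producing a 1024-d vector per image (four 256-d chunks).
h = torch.randn(8, 1024)  # stand-in for E's output on a batch of 8 images

mu_age, logvar_age, mu_id, logvar_id = h.chunk(4, dim=1)

z_age = reparameterize(mu_age, logvar_age)  # sample from q(z_age | x)
z_id = reparameterize(mu_id, logvar_id)     # sample from q(z_id | x)

z = torch.cat([z_age, z_id], dim=1)  # 512-d input to the generator G
```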
For convenience, the optimization objective in Eq. (2) is rewritten in its negative form:

$$L_{ELBO} = L_{rec} + L_{kl}^{age} + L_{kl}^{id} \quad (3)$$

where $L_{rec}$, $L_{kl}^{age}$ and $L_{kl}^{id}$ denote the three terms in Eq. (2), respectively. They can be computed as below:

$$L_{rec} = \frac{1}{2}\,\|x - x_r\|_2^2 \quad (4)$$

$$L_{kl}^{age} = \frac{1}{2}\sum_{i=1}^{N}\left((\mu_{age,i} - y)^2 + \sigma_{age,i}^2 - \log \sigma_{age,i}^2 - 1\right) \quad (5)$$

$$L_{kl}^{id} = \frac{1}{2}\sum_{i=1}^{N}\left(\mu_{id,i}^2 + \sigma_{id,i}^2 - \log \sigma_{id,i}^2 - 1\right) \quad (6)$$

where $x_r$ is the reconstruction of the input $x$, $y$ is the age label of $x$, and $N$ denotes the dimension of $z_{age}$ and $z_{id}$.
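Under the Gaussian assumptions above, the three terms of Eq. (3) have simple closed forms. The following PyTorch sketch computes them for a batch; the per-batch averaging and the broadcasting of the scalar age label into the vector $y$ are implementation assumptions on our part, not a specification from the paper.

```python
import torch
import torch.nn.functional as F

def elbo_losses(x, x_r, mu_age, logvar_age, mu_id, logvar_id, age_label):
    """Compute the three terms of Eq. (3), averaged over the batch.

    age_label: per-image scalar ages, broadcast to the dimension of z_age
    so that the age-related prior is N(y, I) with y filled by the label.
    """
    l_rec = 0.5 * F.mse_loss(x_r, x, reduction="sum") / x.size(0)

    y = age_label.view(-1, 1).expand_as(mu_age)  # vector filled with the age label
    var_age = logvar_age.exp()
    l_kl_age = 0.5 * ((mu_age - y).pow(2) + var_age - logvar_age - 1).sum(dim=1).mean()

    var_id = logvar_id.exp()
    l_kl_id = 0.5 * (mu_id.pow(2) + var_id - logvar_id - 1).sum(dim=1).mean()

    return l_rec, l_kl_age, l_kl_id
```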
3.2 Introspective Adversarial Learning
To alleviate the problem of VAEs generating blurry samples, the introspective adversarial learning mechanism [11] is introduced into the proposed UVA. This enables the model to self-estimate the differences between the real and generated samples without extra adversarial discriminators. The inference network $E$ is encouraged to distinguish between the real and generated samples, while the generator network $G$ tries to fool it, as in GANs.
Different from IntroVAE [11], the proposed method employs only a part of the posterior distribution, rather than the whole distribution, as the estimator of image reality. Specifically, the age-irrelevant posterior $q_\phi(z_{id}|x)$ is selected for the adversarial learning. When training $E$, the model minimizes the KL-divergence of this posterior from its prior for the real data and maximizes it for the generated samples. When training $G$, the model minimizes this KL-divergence for the generated samples.
Similar to IntroVAE [11], the proposed UVA is trained to discriminate the real data from both the model reconstructions and new samples. As shown in Fig. 3, these two types of samples are the reconstruction $x_r = G(z_{age}, z_{id})$ and the new sample $x_p = G(z_{age}, z_{id}')$, where $z_{age}$, $z_{id}$ and $z_{id}'$ are sampled from $q_\phi(z_{age}|x)$, $q_\phi(z_{id}|x)$ and $\mathcal{N}(0, I)$, respectively.
The adversarial training objectives for $E$ and $G$ are defined as:

$$L_E^{adv} = L_{kl}^{id}(x) + \alpha \sum_{s \in \{r,\,p\}} \big[m - L_{kl}^{id}(x_s)\big]^{+} \quad (7)$$

$$L_G^{adv} = \alpha \sum_{s \in \{r,\,p\}} L_{kl}^{id}(x_s) \quad (8)$$

where $[\cdot]^{+} = \max(0, \cdot)$, $m$ is a positive margin, $\alpha$ is a weighting coefficient, and $L_{kl}^{id}(x)$, $L_{kl}^{id}(x_r)$ and $L_{kl}^{id}(x_p)$ are computed from the real data $x$, the reconstruction sample $x_r$ and the new sample $x_p$, respectively.
The proposed UVA can be viewed as a conditional version of IntroVAE [11]. The disentangled variational representation and the modified introspective adversarial learning make it superior to IntroVAE in the interpretability and controllability of image generation.
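The hinge-style objectives in Eqs. (7) and (8) can be sketched in PyTorch as below. `kl_id` reuses the closed form of Eq. (6), and the `.detach()` placement, which stops gradients so that $E$ and $G$ play adversarial roles, follows the IntroVAE training scheme; the four-tuple interface of $E$ is the one assumed in the earlier sketches.

```python
import torch

def kl_id(mu_id, logvar_id):
    # KL(q(z_id|x) || N(0, I)), Eq. (6), averaged over the batch
    return 0.5 * (mu_id.pow(2) + logvar_id.exp() - logvar_id - 1).sum(dim=1).mean()

def adversarial_losses(E, x, x_r, x_p, m=1000.0, alpha=0.5):
    """Hinge-style introspective objectives of Eqs. (7) and (8).

    E is assumed to return (mu_age, logvar_age, mu_id, logvar_id).
    """
    _, _, mu, logvar = E(x)
    kl_real = kl_id(mu, logvar)

    # E treats generated samples as fakes: push their KL above the margin m.
    _, _, mu_r, logvar_r = E(x_r.detach())
    _, _, mu_p, logvar_p = E(x_p.detach())
    l_e = kl_real + alpha * (torch.relu(m - kl_id(mu_r, logvar_r))
                             + torch.relu(m - kl_id(mu_p, logvar_p)))

    # G tries to fool E: pull the KL of generated samples back down.
    _, _, mu_r, logvar_r = E(x_r)
    _, _, mu_p, logvar_p = E(x_p)
    l_g = alpha * (kl_id(mu_r, logvar_r) + kl_id(mu_p, logvar_p))
    return l_e, l_g
```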
3.3 Age Preserving Regularization
Age accuracy is important for facial age analysis. Most current face aging methods [38, 29, 18, 17, 32] utilize an additional pre-trained age estimation network [30] to supervise the generated face images. In our method, facial age can instead be estimated with the inference network $E$ by computing the mean value of the inferred vector $\mu_{age}$. To better capture and disentangle the age-related information, an age regularization term is applied to the learned representation:

$$L_{reg}(x) = \left(\frac{1}{N}\sum_{i=1}^{N}\mu_{age,i} - y\right)^2 \quad (9)$$

where $N$ denotes the dimension of $z_{age}$, $\mu_{age}$ is the output vector of $E$ for the input $x$, and $y$ is the age label of $x$.
To further supervise the generator to reconstruct the age-related information, a similar age regularization term is also imposed on the reconstructed and generated samples in Fig. 3, i.e., $x_r$ and $x_p$. The age preserving loss is then formulated as:

$$L_{ap} = L_{reg}(x) + L_{reg}(x_r) + L_{reg}(x_p) \quad (10)$$

where the $\mu_{age}$ vectors of $x_r$ and $x_p$ are computed by $E$ from $x_r$ and $x_p$, respectively.
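A minimal sketch of Eqs. (9) and (10) follows, assuming $E$ returns the four posterior vectors as in the earlier sketches; reusing the input's age label for $x_r$ and $x_p$ reflects the fact that both samples share the input's age representation.

```python
import torch

def age_reg(E, image, age_label):
    # Eq. (9): squared error between the mean of mu_age and the age label
    mu_age, _, _, _ = E(image)
    return (mu_age.mean(dim=1) - age_label).pow(2).mean()

def age_preserving_loss(E, x, x_r, x_p, age_label):
    # Eq. (10): the same regularizer on the input, its reconstruction,
    # and the new sample that shares the input's age representation
    return (age_reg(E, x, age_label) + age_reg(E, x_r, age_label)
            + age_reg(E, x_p, age_label))
```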
3.4 Training UVA networks
As shown in Fig. 3, the inference network $E$ embeds the facial image $x$ into two disentangled distributions, i.e., $\mathcal{N}(\mu_{age}, \sigma_{age}^2 I)$ and $\mathcal{N}(\mu_{id}, \sigma_{id}^2 I)$, where $\mu_{age}$, $\sigma_{age}$, $\mu_{id}$ and $\sigma_{id}$ are the output vectors of $E$. The generator network $G$ produces the reconstruction $x_r = G(z_{age}, z_{id})$ and the sample $x_p = G(z_{age}, z_{id}')$, where $z_{age}$, $z_{id}$ and $z_{id}'$ are sampled from $q_\phi(z_{age}|x)$, $q_\phi(z_{id}|x)$ and $\mathcal{N}(0, I)$, respectively. To learn the disentangled variational representations and produce high-fidelity facial images, the networks are trained using a weighted sum of the above losses:

$$L_E = \lambda_1 L_{rec} + \lambda_2 L_{kl}^{age} + \lambda_3 L_E^{adv} + \lambda_4 L_{ap} \quad (11)$$

$$L_G = \lambda_1 L_{rec} + \lambda_5 L_G^{adv} + \lambda_4 L_{ap} \quad (12)$$

where $\lambda_1, \ldots, \lambda_5$ are trade-off parameters that balance the importance of the losses. Note that the third term of the ELBO loss in Eq. (3), $L_{kl}^{id}$, is already contained in the adversarial loss for $E$, i.e., it is the first term of $L_E^{adv}$ in Eq. (7).
3.5 Inference and Sampling
By regularizing the disentangled representations with the age-related prior $\mathcal{N}(y, I)$ and the age-irrelevant prior $\mathcal{N}(0, I)$, UVA becomes a universal framework for the age translation, generation and estimation tasks, as illustrated in Fig. 2 (c).
Age Translation To achieve age translation, we concatenate the age-irrelevant variable $z_{id}$ and a target age variable $\hat{z}_{age}$ as the input of the generator $G$, where $z_{id}$ is sampled from the posterior distribution $q_\phi(z_{id}|x)$ over the input face $x$ and $\hat{z}_{age}$ is sampled from $\mathcal{N}(\hat{y}, I)$ for a target age $\hat{y}$. The age translation result is written as:

$$x_t = G(\hat{z}_{age}, z_{id}) \quad (13)$$

Age Generation For the age generation task, there are two settings. One is to generate images from noise at an arbitrary age $\hat{y}$. To be specific, we concatenate a target age variable $\hat{z}_{age}$ and an age-irrelevant variable $\hat{z}_{id}$ as the input of $G$, where $\hat{z}_{age}$ and $\hat{z}_{id}$ are sampled from $\mathcal{N}(\hat{y}, I)$ and $\mathcal{N}(0, I)$, respectively. The age generation result is:

$$x_g = G(\hat{z}_{age}, \hat{z}_{id}) \quad (14)$$

The other is to generate images from noise that share the same age-related variable as the input. Specifically, we concatenate the age-related variable $z_{age}$ and an age-irrelevant variable $\hat{z}_{id}$ as the input of $G$, where $z_{age}$ is sampled from the posterior distribution $q_\phi(z_{age}|x)$ over the input face $x$ and $\hat{z}_{id}$ is sampled from $\mathcal{N}(0, I)$. The age generation result is formulated as:

$$x_g = G(z_{age}, \hat{z}_{id}) \quad (15)$$

Age Estimation In this paper, age estimation is also conducted by the proposed UVA, to verify how well the age-related variable is extracted and disentangled. We calculate the mean value of the $N$-dimensional vector $\mu_{age}$ as the age estimate:

$$\hat{y} = \frac{1}{N}\sum_{i=1}^{N}\mu_{age,i} \quad (16)$$

where $\mu_{age}$ is one of the output vectors of the inference network $E$.
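The three tasks then reduce to different ways of assembling the generator input. The sketch below illustrates Eqs. (13)-(16) in PyTorch, again assuming the interfaces of $E$ and $G$ used in the earlier sketches; `dim` is the (assumed) 256-d size of each latent part.

```python
import torch

def translate(E, G, x, target_age, dim=256):
    # Eq. (13): keep the input's age-irrelevant code, draw the age code from N(y_hat, I)
    _, _, mu_id, logvar_id = E(x)
    z_id = mu_id + torch.exp(0.5 * logvar_id) * torch.randn_like(mu_id)
    z_age = target_age + torch.randn(x.size(0), dim)
    return G(torch.cat([z_age, z_id], dim=1))

def generate(G, target_age, n=8, dim=256):
    # Eq. (14): both codes drawn from the priors N(y_hat, I) and N(0, I)
    z_age = target_age + torch.randn(n, dim)
    z_id = torch.randn(n, dim)
    return G(torch.cat([z_age, z_id], dim=1))

def generate_same_age(E, G, x):
    # Eq. (15): keep the input's age code, draw a new age-irrelevant code from N(0, I)
    mu_age, logvar_age, _, _ = E(x)
    z_age = mu_age + torch.exp(0.5 * logvar_age) * torch.randn_like(mu_age)
    return G(torch.cat([z_age, torch.randn_like(z_age)], dim=1))

def estimate_age(E, x):
    # Eq. (16): the age estimate is the mean of mu_age over its dimensions
    mu_age, _, _, _ = E(x)
    return mu_age.mean(dim=1)
```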
4 Experiments
4.1 Datasets and Settings
4.1.1 Datasets
Cross-Age Celebrity Dataset (CACD2000) [4] consists of 163,446 color facial images of 2,000 celebrities, with ages ranging from 14 to 62 years old. However, it contains many noisy samples, which makes model training challenging. We adopt the classical 80-20 split on CACD2000.
Morph [23] is the largest publicly available dataset collected in a constrained environment. It contains 55,349 color facial images of 13,672 subjects, with ages ranging from 16 to 77 years old. The conventional 80-20 split is conducted on Morph.
UTKFace [36] is a large-scale facial age dataset with a long age span, ranging from 0 to 116 years old. It contains over 20,000 in-the-wild facial images with large variations in expression, occlusion, resolution and pose. We employ the classical 80-20 split on UTKFace.
MegaAge-Asian [37] is a newly released facial age dataset consisting of 40,000 Asian faces with ages ranging from 0 to 70 years old. It contains extreme variations, such as distortion, large-area occlusion and heavy makeup. Following [37], we reserve 3,945 images for testing and treat the remainder as the training set.
FG-NET [16] contains 1,002 facial images of 82 subjects. We employ it as a testing set to evaluate the generalization of UVA.
4.1.2 Experimental Settings
Following [17], we employ the multi-task cascaded CNN [35] to detect faces. All facial images are cropped and aligned to 224×224, whereas most existing methods [29, 38, 18] only achieve age translation at 128×128. Our model is implemented in PyTorch. During training, we use the Adam optimizer [14] with $\beta_1$ of 0.9, $\beta_2$ of 0.99, a fixed learning rate and a batch size of 28. The trade-off parameters $\lambda_1, \ldots, \lambda_5$ are all set to 1. Besides, the margin $m$ is set to 1000 and $\alpha$ is set to 0.5. More details of the network architectures and training processes are provided in the supplementary materials.
4.2 Qualitative Evaluation of UVA
By manipulating the mean value $\hat{y}$ and sampling $\hat{z}_{age}$ from the age-related distribution, the proposed UVA can translate the input face to an arbitrary age. Fig. 4 and 5 present the age translation results on CACD2000 and Morph, respectively. We observe that the synthesized faces become progressively older as the target age grows. Specifically, the face contours become longer, the forehead hairlines recede and the nasolabial folds deepen.
4.2.1 Age Translation
Since both CACD2000 and Morph lack images of children, we conduct age translation on MegaAge-Asian and UTKFace. Fig. 6 shows the translation results on MegaAge-Asian from 10 to 70 years old. Fig. 7 shows the results on UTKFace from 0 to 115 years old at intervals of 5 years. Clearly, from birth to adulthood the aging effect mainly manifests as craniofacial growth, while from adulthood to old age it manifests as skin aging, which is consistent with human physiology.
Furthermore, unlike most existing methods [38, 29, 18, 17], which can only generate images for specific age groups, UVA is able to realize continuous age translation at 1-year intervals. The translation results on UTKFace from 0 to 119 years old at 1-year intervals are presented in the supplementary materials, due to the page limitation.
To evaluate the model's generalization, we test UVA on FG-NET and show the translation results in Fig. 8. The left image of each subject is the input and the right one is the translation to 60 years old. Note that our UVA is trained only on CACD2000 and tested on FG-NET.
4.2.2 Age Generation
Benefiting from the disentangled age-related and age-irrelevant distributions in the latent space, the proposed UVA is able to perform age generation. On the one hand, conditioned on a given facial image, UVA can generate new faces with the same age-related representation as the given image by fixing $z_{age}$ and sampling $\hat{z}_{id}$ from the prior. Fig. 10 presents such results on the Morph and CACD2000 datasets. UVA generates diverse faces (with different genders, appearances and haircuts) at the same age as the input facial image.
On the other hand, by manipulating the mean $\hat{y}$ of the age-related distribution and sampling both $\hat{z}_{age}$ and $\hat{z}_{id}$, UVA can generate faces with various ages. Fig. 11 and 12 display the age generation results from 20 to 60 on the CACD2000 and Morph datasets, respectively. Each row shares the same age-irrelevant representation $\hat{z}_{id}$ and each column shares the same age-related representation $\hat{z}_{age}$. We observe that the age-irrelevant information, such as gender and identity, is preserved within each row, and the age-related information, such as white hair and wrinkles, is preserved within each column. This demonstrates that UVA effectively disentangles the age-related and age-irrelevant representations in the latent space. More results on the UTKFace and MegaAge-Asian datasets are presented in the supplementary materials, due to the page limitation.
4.2.3 Generalized Abilities of UVA
Benefiting from the disentangling manner, UVA can estimate the age-related distribution from facial images even when the training dataset is long-tailed. Although the facial images in Morph only range from 16 to 77 years old, UVA can still synthesize photorealistic images at 10 years old, as shown in Fig. 13. This observation demonstrates the generalization ability of our framework and indicates that UVA can efficiently and accurately estimate the age-related distribution from facial images, even when the training dataset exhibits a long-tailed age distribution.
4.3 Quantitative Evaluation of UVA
4.3.1 Age Estimation
To further demonstrate the disentangling ability of UVA, we conduct age estimation experiments on Morph and MegaAge-Asian, both of which are widely used datasets for age estimation. The evaluation metrics are detailed in the supplementary materials.
For the experiment on Morph, following [21], we adopt the 80-20 random split protocol and report the mean absolute error (MAE). The results are listed in Table 1; note that our model has not been pre-trained on external data. As Table 1 shows, the proposed UVA achieves performance comparable to the state-of-the-art methods.
Table 1: MAE comparison on Morph.

| Methods | Pre-trained | Morph MAE |
|---|---|---|
| OR-CNN [20] | - | 3.34 |
| DEX [24] | IMDB-WIKI | 2.68 |
| Ranking [5] | Audience | 2.96 |
| Posterior [37] | - | 2.87 |
| SSR-Net [34] | IMDB-WIKI | 2.52 |
| M-V Loss [21] | - | 2.514 |
| ThinAgeNet [6] | MS-Celeb-1M* | 2.346 |
| UVA | - | 2.52 |

*Used partial data of the dataset.
For age estimation on MegaAge-Asian, following [37], we reserve 3,945 images for testing and employ the Cumulative Accuracy (CA) as the evaluation metric. We report CA(3), CA(5) and CA(7), as in [37, 34], in Table 2. Note that our model has not been pre-trained on external data and achieves comparable performance.
Table 2: CA (%) comparison on MegaAge-Asian.

| Methods | Pre-trained | CA(3) | CA(5) | CA(7) |
|---|---|---|---|---|
| Posterior [37] | IMDB-WIKI | 62.08 | 80.43 | 90.42 |
| MobileNet [34] | IMDB-WIKI | 44.0 | 60.6 | - |
| DenseNet [34] | IMDB-WIKI | 51.7 | 69.4 | - |
| SSR-Net [34] | IMDB-WIKI | 54.9 | 74.1 | - |
| UVA | - | 60.47 | 79.95 | 90.44 |
The age estimation results on Morph and MegaAge-Asian are nearly as good as the state of the art, which also demonstrates that the age-related representation is well disentangled from the age-irrelevant representation.
4.3.2 Aging Accuracy
We conduct aging accuracy experiments on Morph and CACD2000; aging accuracy is an essential quantitative criterion for evaluating age translation. For a fair comparison, following [33, 18], we utilize the online face analysis tool of Face++ [2] to estimate the ages of the synthetic results, and divide the testing data of Morph and CACD2000 into four age groups: 30- (AG0), 31-40 (AG1), 41-50 (AG2) and 51+ (AG3). We choose AG0 as the input and synthesize images in AG1, AG2 and AG3. We then estimate the ages of the synthesized images and calculate the average age for each group, as sketched below.
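As a sketch of this protocol, with hypothetical arrays standing in for the ages returned by the Face++ tool, the evaluation reduces to averaging the estimated ages per target group:

```python
import numpy as np

# Hypothetical Face++ age estimates for images synthesized into each
# target age group from AG0 inputs (illustrative values only).
estimated = {
    "AG1": np.array([35.2, 38.1, 36.4]),
    "AG2": np.array([49.7, 51.0, 50.2]),
    "AG3": np.array([61.3, 59.8, 60.9]),
}

for group, ages in estimated.items():
    print(f"{group}: mean estimated age = {ages.mean():.2f}")
```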
(a) on Morph

| Method | Input | AG1 | AG2 | AG3 |
|---|---|---|---|---|
| CAAE [38] | - | 28.13 | 32.50 | 36.83 |
| Yang et al. [33] | - | 42.84 | 50.78 | 59.91 |
| UVA | - | 36.59 | 50.14 | 60.78 |
| Real Data | 28.19 | 38.89 | 48.10 | 58.22 |

(b) on CACD2000

| Method | Input | AG1 | AG2 | AG3 |
|---|---|---|---|---|
| CAAE [38] | - | 31.32 | 34.94 | 36.91 |
| Yang et al. [33] | - | 44.29 | 48.34 | 52.02 |
| UVA | - | 39.19 | 45.73 | 51.32 |
| Real Data | 30.73 | 39.08 | 47.06 | 53.68 |
As shown in Table 3, we compare UVA with CAAE [38] and Yang et al. [33] on Morph and CACD2000. The ages of the faces transformed by UVA are closer to those of the natural data than CAAE's [38] and are comparable to Yang et al. [33]. Note that, for age translation among multiple ages, the proposed UVA only needs to train one model, whereas Yang et al. [33] train a separate model for each transformation, such as Input→AG1, Input→AG2 and Input→AG3.
4.3.3 Diversity
In this subsection, we utilize the Fréchet Inception Distance (FID) [9] to evaluate the quality and diversity of the generated data. FID measures the distance between the generated and real distributions in a deep feature space. Here, the testing images are generated from noise with target ages, i.e., $x_g = G(\hat{z}_{age}, \hat{z}_{id})$, where $\hat{z}_{age}$ and $\hat{z}_{id}$ are sampled from $\mathcal{N}(\hat{y}, I)$ and $\mathcal{N}(0, I)$, respectively. The FID results are detailed in Table 4.
| | Morph | CACD2000 | UTKFace | MegaAge-Asian |
|---|---|---|---|---|
| FID | 17.154 | 37.08 | 31.18 | 32.79 |
4.4 Ablation Study
We conduct an ablation study on Morph to evaluate the effectiveness of the introspective adversarial learning and the age preserving regularization. The details are given in the supplementary materials, due to the page limitation.
5 Conclusion
This paper has proposed a Universal Variational Aging (UVA) framework for continuous facial age analysis. Benefiting from the variational evidence lower bound, facial images are disentangled into an age-related distribution and an age-irrelevant distribution in the latent space. A conditional introspective adversarial learning mechanism is introduced to improve the quality of facial aging synthesis. In this way, we can deal with long-tailed data and implement three different tasks: age translation, age generation and age estimation. To the best of our knowledge, UVA is the first attempt to achieve facial age analysis in a universal framework. Qualitative and quantitative experiments demonstrate the superiority of the proposed UVA on five popular datasets: CACD2000, Morph, UTKFace, MegaAge-Asian and FG-NET. This indicates that UVA can efficiently formulate the facial age prior, which contributes to both photorealistic and interpretable image synthesis for facial age analysis.
References
- [1] AgingBooth. PiVi and Co. https://itunes.apple.com/us/app/agingbooth/id357467791?mt=8/, [Online].
- [2] Face++ research toolkit. Megvii Inc. http://www.faceplusplus.com, [Online].
- [3] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
- [4] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Transactions on Multimedia, 17(6):804–815, 2015.
- [5] S. Chen, C. Zhang, M. Dong, J. Le, and M. Rao. Using ranking-cnn for age estimation. In CVPR, 2017.
- [6] B.-B. Gao, H.-Y. Zhou, J. Wu, and X. Geng. Age estimation using expectation of label distribution learning. In IJCAI, 2018.
- [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
- [8] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
- [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
- [10] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In WACV. IEEE, 2017.
- [11] H. Huang, R. He, Z. Sun, T. Tan, et al. Introvae: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018.
- [12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- [13] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
- [16] A. Lanitis, C. J. Taylor, and T. F. Cootes. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):442–455, 2002.
- [17] P. Li, Y. Hu, R. He, and Z. Sun. Global and local consistent wavelet-domain age synthesis. IEEE Transactions on Information Forensics and Security, 2018.
- [18] P. Li, Y. Hu, Q. Li, R. He, and Z. Sun. Global and local consistent age generative adversarial networks. ICPR, 2018.
- [19] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
- [20] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua. Ordinal regression with multiple output cnn for age estimation. In CVPR, 2016.
- [21] H. Pan, H. Han, S. Shan, and X. Chen. Mean-variance loss for deep age estimation from a face. In CVPR, 2018.
- [22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
- [23] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In FGR. IEEE, 2006.
- [24] R. Rothe, R. Timofte, and L. Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
- [25] S. Semeniuta, A. Severyn, and E. Barth. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390, 2017.
- [26] X. Shu, J. Tang, H. Lai, L. Liu, and S. Yan. Personalized age progression with aging dictionary. In ICCV, Dec. 2015.
- [27] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
- [28] W. Wang, Z. Cui, Y. Yan, J. Feng, S. Yan, X. Shu, and N. Sebe. Recurrent face aging. In CVPR, Jun. 2016.
- [29] Z. Wang, X. Tang, W. Luo, and S. Gao. Face aging with identity-preserved conditional generative adversarial networks. In CVPR, 2018.
- [30] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
- [31] X. Wu, H. Huang, V. M. Patel, R. He, and Z. Sun. Disentangled variational representation for heterogeneous face recognition. In AAAI, 2019.
- [32] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of gans. In CVPR, 2018.
- [33] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of gans. CVPR, 2018.
- [34] T.-Y. Yang, Y.-H. Huang, Y.-Y. Lin, P.-C. Hsiu, and Y.-Y. Chuang. Ssr-net: A compact soft stagewise regression network for age estimation. In IJCAI, 2018.
- [35] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
- [36] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR. IEEE, 2017.
- [37] Y. Zhang, L. Liu, C. Li, et al. Quantifying facial age by posterior of age comparisons. arXiv preprint arXiv:1708.09687, 2017.
- [38] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.
- [39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017.
6 Supplementary Materials
In this supplementary material, we first introduce the network architecture and the training algorithm of the proposed UVA. We then describe the metrics used for the age estimation experiments. Additional comparisons demonstrating the continuous manner of UVA are provided, followed by the ablation study and additional results of age generation and translation.
6.1 Network Architecture
The network architectures of the proposed UVA are summarized below. All images are aligned to 256×256 as the inputs.
The encoder $E$ stacks a convolutional layer (Econv1) and six residual blocks, each followed by average pooling (Eres1 through Eres6, with Eavgpool1 through Eavgpool6). The result is flattened and passed through a fully connected layer producing a 1024-d vector, which is split into four 256-d vectors ($\mu_{age}$, $\sigma_{age}$, $\mu_{id}$, $\sigma_{id}$); two reparameterization steps (Erepara1, Erepara2) yield the 256-d $z_{age}$ and $z_{id}$, which are concatenated into a 512-d latent code (Ecat).
The generator $G$ mirrors the encoder: a fully connected layer with reshaping (Gfc), followed by seven residual blocks interleaved with nearest-neighbor upsampling (Gres1/Gupsample1 through Gres7), and a final convolutional layer (Gconv1) producing the output image.
6.2 Training Algorithm
The training process of UVA is described in Algorithm 1.
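As a hedged sketch of this procedure, one training iteration can be written as below, reusing the helper functions from the earlier sketches (`elbo_losses`, `adversarial_losses`, `age_preserving_loss`); the alternating E/G update order mirrors IntroVAE and is an assumption on our part, not the authors' exact algorithm.

```python
import torch

def train_step(E, G, opt_e, opt_g, x, age_label, lambdas, m=1000.0, alpha=0.5):
    # Encode the batch and re-sample the two latent parts.
    mu_age, logvar_age, mu_id, logvar_id = E(x)
    z_age = mu_age + torch.exp(0.5 * logvar_age) * torch.randn_like(mu_age)
    z_id = mu_id + torch.exp(0.5 * logvar_id) * torch.randn_like(mu_id)

    x_r = G(torch.cat([z_age, z_id], dim=1))                    # reconstruction
    x_p = G(torch.cat([z_age, torch.randn_like(z_id)], dim=1))  # new sample

    l_rec, l_kl_age, _ = elbo_losses(x, x_r, mu_age, logvar_age,
                                     mu_id, logvar_id, age_label)
    l_e_adv, l_g_adv = adversarial_losses(E, x, x_r, x_p, m, alpha)
    l_ap = age_preserving_loss(E, x, x_r, x_p, age_label)

    # Update E with Eq. (11), then G with Eq. (12).
    loss_e = (lambdas[0] * l_rec + lambdas[1] * l_kl_age
              + lambdas[2] * l_e_adv + lambdas[3] * l_ap)
    opt_e.zero_grad()
    loss_e.backward(retain_graph=True)  # graph reused for the G update below
    opt_e.step()

    loss_g = lambdas[0] * l_rec + lambdas[4] * l_g_adv + lambdas[3] * l_ap
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```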
6.3 Evaluation Metrics of Age Estimation
We evaluate the performance of UVA on the age estimation task with the Mean Absolute Error (MAE) and the Cumulative Accuracy (CA).
MAE is widely used to evaluate age estimation models. It is defined as the average distance between the inferred age and the ground truth:

$$MAE = \frac{1}{K}\sum_{k=1}^{K}\left|\hat{y}_k - y_k\right| \quad (17)$$

where $\hat{y}_k$ and $y_k$ denote the estimated and ground-truth ages of the $k$-th test image, and $K$ is the number of test images.
The Cumulative Accuracy (CA) is defined as:

$$CA(n) = \frac{K_n}{K} \times 100\% \quad (18)$$

where $K_n$ is the number of test images whose absolute estimation error is smaller than $n$.
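A minimal sketch of both metrics, assuming `pred` and `target` are arrays of predicted and ground-truth ages:

```python
import numpy as np

def mae(pred, target):
    # Eq. (17): mean absolute error between predicted and ground-truth ages
    return np.abs(pred - target).mean()

def cumulative_accuracy(pred, target, n):
    # Eq. (18): fraction (in %) of test images with absolute error below n
    return 100.0 * (np.abs(pred - target) < n).mean()

pred = np.array([23.4, 31.0, 57.8])
target = np.array([25.0, 30.0, 55.0])
print(mae(pred, target), cumulative_accuracy(pred, target, n=3))
```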
6.4 Additional Comparison Results on FG-NET
Additional comparison results with the conditional adversarial autoencoder (CAAE) [38] on FG-NET are shown in Fig. 14. Note that our model is trained on UTKFace. We observe that CAAE can only translate the input into specific age groups (0-5, 6-10, 11-15, 16-20, 21-30, 31-40, 41-50, 51-60, 61-70 and 71-80), while UVA can perform age translation at arbitrary ages. In addition, CAAE only models the facial appearance without hair, while UVA achieves age translation on the whole face. As shown in Fig. 14, the hairline gradually recedes as age increases.
6.5 Ablation Study
In this section, we conduct an ablation study on Morph to evaluate the effectiveness of the introspective adversarial loss, the age preserving loss and the age regularization term, respectively. We report qualitative visualization results for a comprehensive comparison. Fig. 15 shows visual comparisons between UVA and its three incomplete variants. We observe that the proposed UVA is visually better than its variants across all ages. Without the age preserving loss or the age regularization term, the aging images lack detailed texture information. Without the introspective adversarial loss, the aging faces are obviously blurry. These visualization results demonstrate that all three components are essential for UVA.
6.6 Age Generation
Benefiting from the disentangled age-related and age-irrelevant distributions in the latent space, the proposed UVA is able to perform age generation. On the one hand, given a facial image, UVA can generate various faces at that image's age by fixing $z_{age}$ and sampling $\hat{z}_{id}$ from noise. Fig. 18 shows such age generation results on Morph, CACD2000, MegaAge-Asian and UTKFace. On the other hand, by manipulating the mean $\hat{y}$ of the age-related distribution and sampling $\hat{z}_{age}$ and $\hat{z}_{id}$ from noise, UVA can generate faces with various ages. Fig. 16 shows the age generation results from 10 to 70 on MegaAge-Asian.
6.7 Age Translation
Figure 18: Age generation results with the age-related feature extracted from the given image on Morph, CACD2000, MegaAge-Asian and UTKFace. For each subject, the first column is the input face, while the remaining three columns are generated faces.
As shown in Fig. 17, UTKFace exhibits a long-tailed age distribution. The age translation results on UTKFace are presented in Fig. 19, 20 and 21, with ages ranging from 0 to 119 years old at a 1-year aging interval. Since the images in UTKFace only range from 0 to 116 years old, it is remarkable that UVA can synthesize photo-realistic aging images even at 117, 118 and 119 years old. These results suggest that UVA can efficiently and accurately estimate the age-related distribution from facial images, even when the training dataset exhibits a long-tailed age distribution. Considering that generating high-resolution images is significant for enlarging the application field of face aging, we will build a new facial age dataset to support age-related studies in the future.