UVA: A Universal Variational Framework for Continuous Age Analysis

03/30/2019
by Peipei Li, et al.

Conventional methods for facial age analysis tend to utilize accurate age labels in a supervised way. However, existing age datasets lie in a limited range of ages, leading to a long-tailed distribution. To alleviate this problem, this paper proposes a Universal Variational Aging (UVA) framework to formulate facial age priors in a disentangling manner. Benefiting from the variational evidence lower bound, the facial images are encoded and disentangled into an age-irrelevant distribution and an age-related distribution in the latent space. A conditional introspective adversarial learning mechanism is introduced to boost the image quality. In this way, by manipulating the age-related distribution, UVA can achieve age translation to arbitrary ages. Further, by sampling noise from the age-irrelevant distribution, we can generate photorealistic facial images at a specific age. Moreover, given an input face image, the mean value of the age-related distribution can be treated as an age estimate. These results indicate that UVA can efficiently and accurately estimate the age-related distribution in a disentangling manner, even if the training dataset exhibits a long-tailed age distribution. UVA is the first attempt to address the facial age analysis tasks, including age translation, age generation and age estimation, in a universal framework. Qualitative and quantitative experiments demonstrate the superiority of UVA on five popular datasets: CACD2000, Morph, UTKFace, MegaAge-Asian and FG-NET.


1 Introduction

Facial age analysis, including age translation, age generation and age estimation, is one of the crucial components in modern face analysis for entertainment and forensics. Age translation (also known as face aging) aims to aesthetically render the facial appearance at a given age. In recent years, with the development of generative adversarial networks (GAN) [7] and image-to-image translation [12, 39], impressive progress [38, 32, 29, 18, 17] has been achieved on age translation. These methods often utilize a target age vector as a condition to dominate the facial appearance translation. Fig. 2 summarizes the commonly used frameworks for age translation. As shown in Fig. 2 (a), IPC-GAN [29] directly incorporates the target age with the inputs to synthesize facial aging images, while CAAE [38] (shown in Fig. 2 (b)) guides age translation by concatenating the age label with the facial image representation in the latent space. It is obvious that the performance of face aging depends on accurate age labels. Although previous methods [32, 18, 17] have achieved remarkable visual results, in practice it is difficult to collect labeled images of continuous ages for the intensive aging progression. Since all the existing datasets exhibit a long-tailed age distribution, researchers often employ time spans of 10 years as age clusters for age translation. This age cluster formulation potentially limits the diversity of aging patterns, especially for the young.

Figure 1: Continuous age translation results on UTKFace. The first column shows the inputs and the remaining columns show the synthesized faces from 0 to 100 years old. Note that there are only 8 images at age 100 in UTKFace.

Recently, the variational autoencoder (VAE) [15] has shown a promising ability to discover interpretable latent factors [3, 11, 31]. By augmenting the VAE objective with a hyper-parameter $\beta$, $\beta$-VAE [19] controls the degree of disentanglement in the latent space. Benefiting from the disentangling ability of VAEs, this paper proposes a novel Universal Variational Aging (UVA) framework to formulate facial age priors in a disentangling manner. Compared with existing methods, UVA disentangles the facial images into an age-related distribution and an age-irrelevant distribution in the latent space, rather than directly utilizing age labels as conditions for age translation. To be specific, the proposed method introduces two latent variables to model the age-related and age-irrelevant information, and employs the variational evidence lower bound (ELBO) to encourage these two parts to be disentangled in the latent space. As shown in Fig. 2 (c), the age-related distribution is assumed to be a Gaussian distribution $\mathcal{N}(y, I)$ whose mean $y$ is the real age of the input image, while the age-irrelevant prior is set to a standard normal distribution $\mathcal{N}(0, I)$. This disentangling manner makes UVA more flexible and controllable for facial age analysis. Additionally, to synthesize photorealistic facial images, an extended conditional version of the introspective adversarial learning mechanism [11] is introduced to UVA, which self-estimates the differences between real and generated samples.

Figure 2: Compared with the previous conditional generative models [29, 38], UVA learns disentangled age-related and age-irrelevant distributions, which is more suitable for long-tailed data.

In this way, by manipulating the mean value of the age-related distribution, UVA can easily realize facial age translation at arbitrary ages (shown in Fig. 1), whether or not the age label exists in the training dataset. We further observe an interesting phenomenon: when sampling noise from the age-irrelevant distribution, UVA can generate photorealistic face images with a specific age. Moreover, given a face image as input, we can easily obtain its age label from the mean value of the age-related distribution, which indicates the ability of UVA to achieve age estimation. As stated above, we can implement three different tasks for facial age analysis in a unified framework. To the best of our knowledge, UVA is the first attempt to achieve facial age analysis, including age translation, age generation and age estimation, in a universal framework. The main contributions of UVA are as follows:

  • We propose a novel Universal Variational Aging (UVA) framework to formulate the continuous facial aging mechanism in a disentangling manner. It leads to a universal framework for facial age analysis, including age translation, age generation and age estimation.

  • Benefiting from the variational evidence lower bound, in UVA, the facial images are encoded and disentangled into an age-related distribution and an age-irrelevant distribution in the latent space. An extended conditional introspective adversarial learning mechanism is introduced to obtain photorealistic facial image synthesis.

  • Different from existing conditional age translation methods, which utilize age labels/clusters as a condition, UVA estimates an age distribution from the long-tailed facial age dataset. This age distribution estimation provides a new conditioning mechanism for modeling continuous ages, which contributes to the interpretability of image synthesis for facial age analysis.

  • The qualitative and quantitative experiments demonstrate that UVA successfully formulates the facial age prior in a disentangling manner, obtaining state-of-the-art results on the CACD2000 [4], Morph [23], UTKFace [36], MegaAge-Asian [37] and FG-NET [16] datasets.

2 Related Work

2.1 Variational Autoencoder

Variational Autoencoder (VAE) [15, 22] consists of two networks: an inference network that maps the data $x$ to a latent variable $z$, which is assumed to follow a Gaussian distribution, and a generative network that reversely maps the latent variable $z$ to the visible data $x$. The objective of VAE is to maximize the variational lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \tag{1}$$
Figure 3: Overview of the architecture and training flow of our approach. Our model contains two components, the inference network $E$ and the generative network $G$. $x$, $x_r$ and $x_p$ denote the real sample, the reconstruction sample and the new sample, respectively. Please refer to Section 3 for more details.

VAE has shown a promising ability to generate complicated data, including faces [11], natural images [8], text [25] and segmentations [10, 27]. Following IntroVAE [11], we adopt introspective adversarial learning in our method to produce high-quality and realistic face images.

2.2 Age Translation and Generation

Recently, deep conditional generative models have shown considerable ability in age translation [38, 29, 18, 17]. Zhang et al. [38] propose a Conditional Adversarial Autoencoder (CAAE) to transform an input facial image to a target age. To capture the rich textures in local facial parts, Li et al. [18] propose a Global and Local Consistent Age Generative Adversarial Network (GLCA-GAN). Meanwhile, the Identity-Preserved Conditional Generative Adversarial Network (IPCGAN) [29] introduces an identity-preserving term and an age classification term into age translation. Although these methods have achieved promising visual results, they have limitations in modeling the continuous aging mechanism.

With the development of deep generative models, i.e., the Generative Adversarial Network (GAN) [7] and the Variational Autoencoder (VAE), face generation has achieved dramatic success in recent years. Karras et al. [13] provide a new progressive training scheme for GANs, which grows both the generator and the discriminator progressively. Huang et al. [11] propose the Introspective Variational Autoencoder (IntroVAE), in which the inference model and the generator are jointly trained in an introspective way.

3 Approach

In this section, we propose a universal variational aging framework for age translation, age generation and age estimation. The key idea is to employ an inference network $E$ to embed the facial image into two disentangled variational representations, where one is age-related and the other is age-irrelevant, and a generator network $G$ to produce photo-realistic images from the re-sampled latent representations. As depicted in Fig. 3, we assign two different priors to regularize the inferred representations, and train $E$ and $G$ with age preserving regularization in an introspective adversarial manner. The details are given in the following.

3.1 Disentangled Variational Representations

In the original VAE [15], a probabilistic latent variable model is learnt by maximizing the variational lower bound on the marginal likelihood of the observable variables. However, the latent variable is difficult to interpret and control, since each of its elements is treated equally during training. To alleviate this, we manually split the latent variable into two parts, i.e., an age-related part $z_a$ and an age-irrelevant part $z$.

Assume $z_a$ and $z$ are independent of each other; then the posterior distribution factorizes as $q_\phi(z_a, z|x) = q_\phi(z_a|x)\,q_\phi(z|x)$ and the prior as $p(z_a, z) = p(z_a)\,p(z)$, where $p(z_a)$ and $p(z)$ are the prior distributions of $z_a$ and $z$, respectively. According to Eq. (1), the optimization objective of the modified VAE is to maximize the lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z_a, z|x)}\left[\log p_\theta(x|z_a, z)\right] - D_{KL}\left(q_\phi(z_a|x)\,\|\,p(z_a)\right) - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \tag{2}$$

To make $z_a$ and $z$ correspond to different types of facial information, $p(z_a)$ and $p(z)$ are set to be different distributions. Both are isotropic multivariate Gaussians, but $p(z_a) = \mathcal{N}(y, I)$ while $p(z) = \mathcal{N}(0, I)$, where $y$ is a vector filled with the age label of $x$. By using these two age-related and age-irrelevant priors, maximizing the above variational lower bound encourages the posteriors $q_\phi(z_a|x)$ and $q_\phi(z|x)$ to be disentangled and to model the age-related and age-irrelevant information, respectively.

Following the original VAE [15], we assume the posteriors $q_\phi(z_a|x)$ and $q_\phi(z|x)$ follow two isotropic multivariate Gaussians, i.e., $q_\phi(z_a|x) = \mathcal{N}(\mu_a, \sigma_a^2 I)$ and $q_\phi(z|x) = \mathcal{N}(\mu_z, \sigma_z^2 I)$. As depicted in Fig. 3, $\mu_a$, $\sigma_a$, $\mu_z$ and $\sigma_z$ are the output vectors of the inference network $E$. The input of the generator network $G$ is the concatenation of $z_a$ and $z$, where $z_a$ and $z$ are sampled from $q_\phi(z_a|x)$ and $q_\phi(z|x)$ using the reparameterization trick, i.e., $z_a = \mu_a + \sigma_a \odot \epsilon$ and $z = \mu_z + \sigma_z \odot \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
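To make the latent split concrete, the following minimal PyTorch sketch shows the split of the encoder output and the reparameterization step (the paper states its implementation is PyTorch). The variable names and the log-variance convention are our own; the 1024-d encoder output split into four 256-d vectors follows Table 5 in the supplementary material.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I)
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

# A random tensor stands in for the encoder output E(x) here.
h = torch.randn(8, 1024)                          # stand-in for E(x)
mu_a, logvar_a, mu_z, logvar_z = h.chunk(4, dim=1)

z_a = reparameterize(mu_a, logvar_a)              # age-related code z_a
z = reparameterize(mu_z, logvar_z)                # age-irrelevant code z
code = torch.cat([z_a, z], dim=1)                 # 512-d input to G
```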

For convenience of description, the optimization objective in Eq. (2) is rewritten in its negative version:

$$\mathcal{L}_{ELBO} = \mathcal{L}_{AE} + \mathcal{L}_{KL_a} + \mathcal{L}_{KL_z} \tag{3}$$

where $\mathcal{L}_{AE}$, $\mathcal{L}_{KL_a}$ and $\mathcal{L}_{KL_z}$ denote the three terms in Eq. (2), respectively. They can be computed as below:

$$\mathcal{L}_{AE} = \frac{1}{2}\left\|x - x_r\right\|^2 \tag{4}$$
$$\mathcal{L}_{KL_a} = \frac{1}{2}\sum_{i=1}^{M_a}\left(\left(\mu_{a,i} - y\right)^2 + \sigma_{a,i}^2 - \log \sigma_{a,i}^2 - 1\right) \tag{5}$$
$$\mathcal{L}_{KL_z} = \frac{1}{2}\sum_{i=1}^{M_z}\left(\mu_{z,i}^2 + \sigma_{z,i}^2 - \log \sigma_{z,i}^2 - 1\right) \tag{6}$$

where $x_r$ is the reconstruction of the input $x$, $y$ is the age label of $x$, and $M_a$ and $M_z$ denote the dimensions of $z_a$ and $z$, respectively.
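A PyTorch sketch of these three terms, under the same log-variance convention as above; here `y_vec` is a (batch, M_a) tensor filled with each image's age label:

```python
import torch
import torch.nn.functional as F

def elbo_terms(x, x_r, mu_a, logvar_a, mu_z, logvar_z, y_vec):
    """Eqs. (4)-(6) with priors p(z_a) = N(y, I) and p(z) = N(0, I)."""
    # L_AE: reconstruction error between x and x_r, averaged over the batch
    l_ae = 0.5 * F.mse_loss(x_r, x, reduction='sum') / x.size(0)
    # L_KL_a: KL(N(mu_a, sigma_a^2 I) || N(y, I))
    l_kl_a = 0.5 * ((mu_a - y_vec).pow(2) + logvar_a.exp()
                    - logvar_a - 1).sum(dim=1).mean()
    # L_KL_z: KL(N(mu_z, sigma_z^2 I) || N(0, I))
    l_kl_z = 0.5 * (mu_z.pow(2) + logvar_z.exp()
                    - logvar_z - 1).sum(dim=1).mean()
    return l_ae, l_kl_a, l_kl_z
```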

3.2 Introspective Adversarial Learning

To alleviate the problem of blurry samples in VAEs, the introspective adversarial learning mechanism [11] is introduced to the proposed UVA. This enables the model to self-estimate the differences between real and generated samples without extra adversarial discriminators. The inference network $E$ is encouraged to distinguish between real and generated samples, while the generator network $G$ tries to fool it, as in GANs.

Different from IntroVAE [11], the proposed method employs a part of the posterior distribution, rather than the whole distribution, to serve as the estimator of image reality. Specifically, the age-irrelevant posterior $q_\phi(z|x)$ is selected for the adversarial learning. When training $E$, the model minimizes the KL-divergence of this posterior from its prior for real data and maximizes it for generated samples. When training $G$, the model minimizes this KL-divergence for the generated samples.

Similar to IntroVAE [11], the proposed UVA is trained to discriminate the real data from both the model reconstructions and new samples. As shown in Fig. 3, these two types of samples are the reconstruction $x_r = G(z_a, z)$ and the new sample $x_p = G(z_a, z_p)$, where $z_a$, $z$ and $z_p$ are sampled from $q_\phi(z_a|x)$, $q_\phi(z|x)$ and $\mathcal{N}(0, I)$, respectively.

The adversarial training objectives for $E$ and $G$ are defined as below:

$$\mathcal{L}_{adv}^{E} = \mathcal{L}_{KL_z}(x) + \alpha\left(\left[m - \mathcal{L}_{KL_z}(x_r)\right]^{+} + \left[m - \mathcal{L}_{KL_z}(x_p)\right]^{+}\right) \tag{7}$$
$$\mathcal{L}_{adv}^{G} = \alpha\left(\mathcal{L}_{KL_z}(x_r) + \mathcal{L}_{KL_z}(x_p)\right) \tag{8}$$

where $[\cdot]^{+} = \max(0, \cdot)$, $m$ is a positive margin, $\alpha$ is a weighting coefficient, and $\mathcal{L}_{KL_z}(x)$, $\mathcal{L}_{KL_z}(x_r)$ and $\mathcal{L}_{KL_z}(x_p)$ are computed from the real data $x$, the reconstruction $x_r$ and the new sample $x_p$, respectively.
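The exact form of Eqs. (7)-(8) is not fully recoverable from this copy of the paper, so the sketch below follows the IntroVAE-style hinge formulation described above, with the KL computed only on the age-irrelevant posterior; `m` and `alpha` default to the values reported in Section 4.1.2.

```python
import torch

def kl_z(mu_z, logvar_z):
    # KL of the age-irrelevant posterior from N(0, I), averaged over the batch
    return 0.5 * (mu_z.pow(2) + logvar_z.exp()
                  - logvar_z - 1).sum(dim=1).mean()

def adv_losses(kl_real, kl_rec, kl_new, m=1000.0, alpha=0.5):
    """Hinge-style objectives in the spirit of Eqs. (7)-(8); kl_real, kl_rec
    and kl_new are kl_z(...) values computed from x, x_r and x_p."""
    loss_adv_E = kl_real + alpha * (torch.clamp(m - kl_rec, min=0.0)
                                    + torch.clamp(m - kl_new, min=0.0))
    loss_adv_G = alpha * (kl_rec + kl_new)
    return loss_adv_E, loss_adv_G
```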

The proposed UVA can be viewed as a conditional version of IntroVAE [11]. The disentangled variational representation and the modified introspective adversarial learning make it superior to IntroVAE in the interpretability and controllability of image generation.

3.3 Age Preserving Regularization

Age accuracy is important for facial age analysis. Most current face aging methods [38, 29, 18, 17, 32] utilize an additional pre-trained age estimation network [30] to supervise the generated face images. In our method, by contrast, the facial age can be estimated with the inference network $E$ by computing the mean value of the inferred vector $\mu_a$. To better capture and disentangle the age-related information, an age regularization term is imposed on the learned representation:

$$\mathcal{L}_{age} = \left(\frac{1}{M_a}\sum_{i=1}^{M_a}\mu_{a,i} - y\right)^2 \tag{9}$$

where $M_a$ denotes the dimension of $\mu_a$, $\mu_a$ is the output vector of $E$ for the input $x$, and $y$ is the age label of $x$.

To further supervise the generator to reconstruct the age-related information, a similar age regularization term is also employed on the reconstructed and generated samples in Fig. 3, i.e., $x_r$ and $x_p$. The age preserving loss is reformulated as:

$$\mathcal{L}_{AP} = \mathcal{L}_{age}(\mu_a) + \mathcal{L}_{age}(\mu_a^r) + \mathcal{L}_{age}(\mu_a^p) \tag{10}$$

where $\mu_a^r$ and $\mu_a^p$ are computed by $E$ from $x_r$ and $x_p$, respectively.
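A sketch of Eqs. (9)-(10) under our reading that each term is the squared gap between the mean of the inferred age vector and the age label (the per-term weighting is not recoverable from this copy):

```python
import torch

def age_reg(mu_a, y):
    # Eq. (9): squared gap between the mean of mu_a and the age label y
    return (mu_a.mean(dim=1) - y).pow(2).mean()

def age_preserving_loss(mu_a, mu_a_r, mu_a_p, y):
    # Eq. (10): Eq. (9) applied to the codes inferred from the input x,
    # the reconstruction x_r and the new sample x_p. During training, x_p
    # reuses the z_a inferred from x (Section 3.4), so all three should
    # match the same label y.
    return age_reg(mu_a, y) + age_reg(mu_a_r, y) + age_reg(mu_a_p, y)
```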

3.4 Training UVA networks

As shown in Fig. 3, the inference network $E$ embeds the facial image into two disentangled distributions, i.e., $q_\phi(z_a|x)$ and $q_\phi(z|x)$, where $\mu_a$, $\sigma_a$, $\mu_z$ and $\sigma_z$ are the output vectors of $E$. The generator network $G$ produces the reconstruction $x_r$ and the new sample $x_p$, where $z_a$, $z$ and $z_p$ are sampled from $q_\phi(z_a|x)$, $q_\phi(z|x)$ and $\mathcal{N}(0, I)$, respectively. To learn the disentangled variational representations and produce high-fidelity facial images, the network is trained using a weighted sum of the above losses, defined as:

$$\mathcal{L}_E = \lambda_1 \mathcal{L}_{AE} + \lambda_2 \mathcal{L}_{KL_a} + \lambda_3 \mathcal{L}_{adv}^{E} + \lambda_4 \mathcal{L}_{AP} \tag{11}$$
$$\mathcal{L}_G = \lambda_1 \mathcal{L}_{AE} + \lambda_2 \mathcal{L}_{KL_a} + \lambda_5 \mathcal{L}_{adv}^{G} + \lambda_4 \mathcal{L}_{AP} \tag{12}$$

where $\lambda_1, \ldots, \lambda_5$ are trade-off parameters that balance the importance of the losses. Note that the third term of the ELBO loss in Eq. (3), $\mathcal{L}_{KL_z}$, is already contained in the adversarial loss for $E$, i.e., the first term of $\mathcal{L}_{adv}^{E}$ in Eq. (7).
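Putting the pieces together, here is a schematic training step under Eqs. (11)-(12), reusing the helper sketches from the previous subsections. The assignment of the five trade-off weights to the individual terms is our assumption; since the paper sets all of them to 1, any consistent assignment yields the same objective.

```python
import torch

def train_step(E, G, opt_E, opt_G, x, y, lam=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """One schematic UVA update. E(x) returns (mu_a, logvar_a, mu_z, logvar_z)
    and G maps the concatenated code (z_a, z) to an image."""
    l1, l2, l3, l4, l5 = lam
    y_vec = y.unsqueeze(1)  # broadcast the age label over the M_a dimensions

    def forward():
        mu_a, logvar_a, mu_z, logvar_z = E(x)
        z_a = reparameterize(mu_a, logvar_a)
        z = reparameterize(mu_z, logvar_z)
        x_r = G(torch.cat([z_a, z], dim=1))                    # reconstruction
        x_p = G(torch.cat([z_a, torch.randn_like(z)], dim=1))  # new sample
        return mu_a, logvar_a, mu_z, logvar_z, x_r, x_p

    # ---- update E: pull the real KL down, push generated KLs above m ----
    mu_a, logvar_a, mu_z, logvar_z, x_r, x_p = forward()
    l_ae, l_kl_a, _ = elbo_terms(x, x_r, mu_a, logvar_a, mu_z, logvar_z, y_vec)
    adv_E, _ = adv_losses(kl_z(mu_z, logvar_z),
                          kl_z(*E(x_r.detach())[2:]),
                          kl_z(*E(x_p.detach())[2:]))
    l_ap = age_preserving_loss(mu_a, E(x_r.detach())[0], E(x_p.detach())[0], y)
    loss_E = l1 * l_ae + l2 * l_kl_a + l3 * adv_E + l4 * l_ap
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # ---- update G: fool E by driving the generated KLs down ----
    mu_a, logvar_a, mu_z, logvar_z, x_r, x_p = forward()
    l_ae, l_kl_a, _ = elbo_terms(x, x_r, mu_a, logvar_a, mu_z, logvar_z, y_vec)
    _, adv_G = adv_losses(kl_z(mu_z, logvar_z),
                          kl_z(*E(x_r)[2:]), kl_z(*E(x_p)[2:]))
    l_ap = age_preserving_loss(mu_a, E(x_r)[0], E(x_p)[0], y)
    loss_G = l1 * l_ae + l2 * l_kl_a + l5 * adv_G + l4 * l_ap
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```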

3.5 Inference and Sampling

By regularizing the disentangled representations with the age-related prior $\mathcal{N}(y, I)$ and the age-irrelevant prior $\mathcal{N}(0, I)$, UVA becomes a universal framework for the age translation, generation and estimation tasks, as illustrated in Fig. 2 (c).

Age Translation To achieve the age translation task, we concatenate the age-irrelevant variable $z$ and a target age variable $z_t$ as the input of the generator $G$, where $z$ is sampled from the posterior distribution $q_\phi(z|x)$ over the input face $x$ and $z_t$ is sampled from $\mathcal{N}(y_t, I)$ for a target age $y_t$. The age translation result is written as:

$$x_t = G(z_t, z) \tag{13}$$

Age Generation For the age generation task, there are two settings. One is to generate an image from noise at an arbitrary age $y_t$. To be specific, we concatenate a target age variable $z_t$ and an age-irrelevant variable $z_p$ as the input of $G$, where $z_t$ and $z_p$ are sampled from $\mathcal{N}(y_t, I)$ and $\mathcal{N}(0, I)$, respectively. The age generation result is:

$$x_p = G(z_t, z_p) \tag{14}$$

The other is to generate an image from noise that shares the same age-related variable as the input. Specifically, we concatenate the age-related variable $z_a$ and an age-irrelevant variable $z_p$ as the input of $G$, where $z_a$ is sampled from the posterior distribution $q_\phi(z_a|x)$ over the input face $x$ and $z_p$ is sampled from $\mathcal{N}(0, I)$. The age generation result is formulated as:

$$x_p = G(z_a, z_p) \tag{15}$$

Age Estimation In this paper, age estimation is also conducted with the proposed UVA, to verify how well the age-related variable is extracted and disentangled. We calculate the mean value of the $M_a$-dimensional vector $\mu_a$ as the age estimate:

$$\hat{y} = \frac{1}{M_a}\sum_{i=1}^{M_a}\mu_{a,i} \tag{16}$$

where $\mu_a$ is one of the output vectors of the inference network $E$.
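Each task then reduces to a few lines at inference time. Below is a hedged sketch of Eqs. (13), (14) and (16); Eq. (15) is Eq. (14) with `z_t` replaced by a `z_a` inferred from the input. The 256-d code sizes follow the supplementary architecture and the helper `reparameterize` is the sketch given earlier.

```python
import torch

@torch.no_grad()
def age_translation(E, G, x, target_age, dim_a=256):
    # Eq. (13): keep the inferred age-irrelevant code, draw z_t ~ N(y_t, I)
    _, _, mu_z, logvar_z = E(x)
    z = reparameterize(mu_z, logvar_z)
    z_t = target_age + torch.randn(x.size(0), dim_a, device=x.device)
    return G(torch.cat([z_t, z], dim=1))

@torch.no_grad()
def age_generation(G, target_age, n, dim_a=256, dim_z=256):
    # Eq. (14): both codes drawn from their priors N(y_t, I) and N(0, I)
    z_t = target_age + torch.randn(n, dim_a)
    z_p = torch.randn(n, dim_z)
    return G(torch.cat([z_t, z_p], dim=1))

@torch.no_grad()
def age_estimation(E, x):
    # Eq. (16): the mean of mu_a is read out as the predicted age
    mu_a = E(x)[0]
    return mu_a.mean(dim=1)
```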

4 Experiments

4.1 Datasets and Settings

4.1.1 Datasets

Cross-Age Celebrity Dataset (CACD2000) [4] consists of 163,446 color facial images of 2,000 celebrities, with ages ranging from 14 to 62 years old. However, it contains a considerable amount of noisy data, which makes model training challenging. We adopt the classical 80-20 split on CACD2000.

Morph [23] is the largest publicly available dataset collected in a constrained environment. It contains 55,349 color facial images of 13,672 subjects, with ages ranging from 16 to 77 years old. The conventional 80-20 split is used on Morph.

UTKFace [36] is a large-scale facial age dataset with a long age span, ranging from 0 to 116 years old. It contains over 20,000 facial images in the wild with large variations in expression, occlusion, resolution and pose. We employ the classical 80-20 split on UTKFace.

MegaAge-Asian [37] is a newly released facial age dataset consisting of 40,000 Asian faces with ages ranging from 0 to 70 years old. It exhibits extreme variations, such as distortion, large-area occlusion and heavy makeup. Following [37], we reserve 3,945 images for testing and use the remainder for training.

FG-NET [16] contains 1,002 facial images of 82 subjects. We employ it as the testing set to evaluate the generalization of UVA.

4.1.2 Experimental Settings

Following [17], we employ the multi-task cascaded CNN [35] to detect the faces. All facial images are cropped and aligned to 224×224, whereas most existing methods [29, 38, 18] only achieve age translation at 128×128. Our model is implemented in PyTorch. During training, we use the Adam optimizer [14] with $\beta_1$ of 0.9, $\beta_2$ of 0.99, a fixed learning rate and a batch size of 28. The trade-off parameters $\lambda_1, \ldots, \lambda_5$ are all set to 1. Besides, the margin $m$ is set to 1000 and the weighting coefficient $\alpha$ is set to 0.5. More details of the network architectures and training processes are provided in the supplementary materials.

4.2 Qualitative Evaluation of UVA

Figure 4: Age translation results on CACD2000. The first column shows the input and the remaining five columns show the synthesized results.

By manipulating the mean value of the age-related distribution and sampling from it, the proposed UVA can translate the input face to an arbitrary age. Fig. 4 and Fig. 5 present the age translation results on CACD2000 and Morph, respectively. We observe that the synthesized faces grow steadily older as the target age increases. Specifically, the face contours become longer, the hairlines recede and the nasolabial folds deepen.

Figure 5: Age translation results on Morph. The first column shows the input and the remaining five columns show the synthesized results.

4.2.1 Age Translation

Figure 6: Age translation results on MegaAge-Asian. The first column is the input and the rest are the synthesized results.

Since both CACD2000 and Morph lack images of children, we conduct age translation on MegaAge-Asian and UTKFace. Fig. 6 shows the translation results on MegaAge-Asian from 10 to 70 years old. Fig. 7 shows the results on UTKFace from 0 to 115 years old with an interval of 5 years. Evidently, from birth to adulthood the aging effect is mainly shown in craniofacial growth, while from adulthood to old age it is reflected in skin aging, which is consistent with human physiology.

Figure 7: Age translation results on UTKFace from 0 to 115 years old.
Figure 8: Cross-dataset age translation results on FG-NET.
Figure 9: Comparison with the previous works on FG-NET.

Furthermore, unlike most existing methods [38, 29, 18, 17], which can only generate images for specific age groups, our UVA is able to realize continuous age translation at 1-year intervals. The translation results on UTKFace from 0 to 119 years old at 1-year intervals are presented in the supplementary materials, due to page limitations.

To evaluate the model generalization, we test UVA on FG-NET and show the translation results in Fig. 8. The left image of each subject is the input and the right one is the translated image at 60 years old. Note that our UVA is trained only on CACD2000 and tested on FG-NET.

Different from previous age translation works, which divide the data into four or nine age groups, our proposed UVA models the continuous aging mechanism. A comparison with prior works, including the AgingBooth App [1], CDL [26], RFA [28] and Yang et al. [33], is depicted in Fig. 9.

4.2.2 Age Generation

Figure 10: Age generation results on Morph and CACD2000. For each image group, the first column is the input and the remaining three columns are the generated faces with the same age-related representation as the input face.

Benefiting from the disentangled age-related and age-irrelevant distributions in the latent space, the proposed UVA is able to realize age generation. On the one hand, conditioned on a given facial image, UVA can generate new faces with the same age-related representation as the given image by fixing $z_a$ and sampling $z_p$. Fig. 10 presents some results in this setting on the Morph and CACD2000 datasets. UVA generates diverse faces (with different genders, appearances and haircuts) that share the same age as the input facial image.

Figure 11: Age generation results on CACD2000. On the left, each row has the same age-irrelevant representation and each column has the same age-related representation. The right shows the most similar image found on Google for the adjacent generated image. (All four subjects are generated from noise.)
Figure 12: Age generation results on Morph. On the left, each row has the same age-irrelevant representation and each column has the same age-related representation. The right shows the most similar image found on Google for the adjacent generated image. (All four subjects are generated from noise.)

On the other hand, by setting the mean of the age-related distribution to an arbitrary age and sampling both $z_t$ and $z_p$, UVA can generate faces at various ages. Fig. 11 and Fig. 12 display the age generation results from 20 to 60 years old on the CACD2000 and Morph datasets, respectively. Each row has the same age-irrelevant representation $z_p$ and each column has the same age-related representation $z_t$. We observe that the age-irrelevant information, such as gender and identity, is preserved within each row, and the age-related information, such as white hair and wrinkles, is preserved within each column. This demonstrates that UVA effectively disentangles the age-related and age-irrelevant representations in the latent space. More results on the UTKFace and MegaAge-Asian datasets are presented in the supplementary materials, due to page limitations.

4.2.3 Generalized Abilities of UVA

Benefiting from the disentangling manner, UVA is able to estimate the age-related distribution from facial images even if the training dataset is long-tailed. Although the facial images in Morph only range from 16 to 77 years old, UVA can synthesize photorealistic images at 10 years old, as shown in Fig. 13. This observation demonstrates the generalization ability of our framework, and also indicates that UVA can efficiently and accurately estimate the age-related distribution from facial images even if the training dataset exhibits a long-tailed age distribution.

Figure 13: Age translation results on Morph. Note that the target age (10 years old) does not exist in the Morph dataset.

4.3 Quantitative Evaluation of UVA

4.3.1 Age Estimation

To further demonstrate the disentangling ability of UVA, we conduct age estimation experiments on Morph and MegaAge-Asian, both of which are widely used datasets for the age estimation task. We detail the evaluation metrics in the supplementary materials.

For the experiment on Morph, following [21], we adopt the 80-20 random split protocol and report the mean absolute error (MAE). The results are listed in Table 1; note that our model is not pre-trained on external data. As Table 1 shows, the proposed UVA achieves competitive performance with the state-of-the-art methods.

Method           Pre-trained    MAE on Morph
OR-CNN [20]      -              3.34
DEX [24]         IMDB-WIKI      2.68
Ranking [5]      Audience       2.96
Posterior [37]   -              2.87
SSR-Net [34]     IMDB-WIKI      2.52
M-V Loss [21]    -              2.514
ThinAgeNet [6]   MS-Celeb-1M    2.346
UVA              -              2.52

  • Used partial data of the dataset.

Table 1: Comparisons with state-of-the-art methods on Morph. Lower MAE is better.

For age estimation on MegaAge-Asian, following [37], we reserve 3,945 images for testing and employ the Cumulative Accuracy (CA) as the evaluation metric. We report CA(3), CA(5) and CA(7), as in [37, 34], in Table 2. Note that our model is not pre-trained on external data and still achieves competitive performance.

Method           Pre-trained    CA(3)    CA(5)    CA(7)
Posterior [37]   IMDB-WIKI      62.08    80.43    90.42
MobileNet [34]   IMDB-WIKI      44.0     60.6     -
DenseNet [34]    IMDB-WIKI      51.7     69.4     -
SSR-Net [34]     IMDB-WIKI      54.9     74.1     -
UVA              -              60.47    79.95    90.44

Table 2: Comparisons with state-of-the-art methods on MegaAge-Asian. The unit of CA(n) is %. Higher CA(n) is better.

The age estimation results on Morph and MegaAge-Asian are nearly as good as the state of the art, which demonstrates that the age-related representation is well disentangled from the age-irrelevant representation.

4.3.2 Aging Accuracy

We conduct aging accuracy experiments for UVA on Morph and CACD2000; aging accuracy is an essential quantitative metric for evaluating age translation. For a fair comparison, following [33, 18], we utilize the online face analysis tool Face++ [2] to estimate the ages of the synthesized results, and divide the testing data of Morph and CACD2000 into four age groups: 30− (AG0), 31–40 (AG1), 41–50 (AG2) and 51+ (AG3). We take AG0 as the input and synthesize images in AG1, AG2 and AG3. We then estimate the ages of the synthesized images and compute the average age for each group.

(a) Morph
Method             Input    AG1      AG2      AG3
CAAE [38]          -        28.13    32.50    36.83
Yang et al. [33]   -        42.84    50.78    59.91
UVA                -        36.59    50.14    60.78
Real Data          28.19    38.89    48.10    58.22

(b) CACD2000
Method             Input    AG1      AG2      AG3
CAAE [38]          -        31.32    34.94    36.91
Yang et al. [33]   -        44.29    48.34    52.02
UVA                -        39.19    45.73    51.32
Real Data          30.73    39.08    47.06    53.68

Table 3: Comparisons of the aging accuracy (estimated average age per group).

As shown in Table 3, we compare UVA with CAAE [38] and Yang et al. [33] on Morph and CACD2000. The ages of faces transformed by UVA are closer to those of the real data than CAAE's [38], and comparable to those of Yang et al. [33]. Note that for age translation among multiple age groups, the proposed UVA only needs to train one model, whereas Yang et al. [33] need to train a separate model for each transformation, such as Input→AG1, Input→AG2 and Input→AG3.

4.3.3 Diversity

In this subsection, we utilize the Fréchet Inception Distance (FID) [9] to evaluate the quality and diversity of the generated data. FID measures the distance between the generated and real distributions in a feature space. Here, the testing images are generated from noise at target ages $y_t$, where $z_t$ and $z_p$ are sampled from $\mathcal{N}(y_t, I)$ and $\mathcal{N}(0, I)$, respectively. The FID results are detailed in Table 4.

Dataset    Morph     CACD2000    UTKFace    MegaAge-Asian
FID        17.154    37.08       31.18      32.79

Table 4: FID results on the four datasets. Lower is better.
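The paper does not specify which FID implementation it used; for readers who want to reproduce such a measurement, one option is the torchmetrics implementation (which additionally requires the torch-fidelity package). Random tensors stand in for real and generated images here.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Images are expected as uint8 tensors of shape [N, 3, H, W].
fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())
```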

4.4 Ablation Study

We conduct an ablation study on Morph to evaluate the effectiveness of the introspective adversarial learning and the age preserving regularization. The details are given in the supplementary materials, due to page limitations.

5 Conclusion

This paper has proposed a Universal Variational Aging (UVA) framework for continuous age analysis. Benefiting from the variational evidence lower bound, the facial images are disentangled into an age-related distribution and an age-irrelevant distribution in the latent space. A conditional introspective adversarial learning mechanism is introduced to improve the quality of facial aging synthesis. In this way, we can deal with long-tailed data and implement three different tasks: age translation, age generation and age estimation. To the best of our knowledge, UVA is the first attempt to achieve facial age analysis in a universal framework. The qualitative and quantitative experiments demonstrate the superiority of the proposed UVA on five popular datasets: CACD2000, Morph, UTKFace, MegaAge-Asian and FG-NET. This indicates that UVA can efficiently formulate the facial age prior, which contributes to both photorealistic and interpretable image synthesis for facial age analysis.

References

  • [1] AgingBooth. PiVi and Co. https://itunes.apple.com/us/app/agingbooth/id357467791?mt=8/, [Online].
  • [2] Face++ research toolkit. Megvii Inc. http://www.faceplusplus.com, [Online].
  • [3] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599, 2018.
  • [4] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Transactions on Multimedia, 17(6):804–815, 2015.
  • [5] S. Chen, C. Zhang, M. Dong, J. Le, and M. Rao. Using ranking-cnn for age estimation. In CVPR, 2017.
  • [6] B.-B. Gao, H.-Y. Zhou, J. Wu, and X. Geng. Age estimation using expectation of label distribution learning. In IJCAI, 2018.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [8] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
  • [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • [10] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In WACV. IEEE, 2017.
  • [11] H. Huang, R. He, Z. Sun, T. Tan, et al. Introvae: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018.
  • [12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [13] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [16] A. Lanitis, C. J. Taylor, and T. F. Cootes. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):442–455, 2002.
  • [17] P. Li, Y. Hu, R. He, and Z. Sun. Global and local consistent wavelet-domain age synthesis. IEEE Transactions on Information Forensics and Security, 2018.
  • [18] P. Li, Y. Hu, Q. Li, R. He, and Z. Sun. Global and local consistent age generative adversarial networks. ICPR, 2018.
  • [19] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [20] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua. Ordinal regression with multiple output cnn for age estimation. In CVPR, 2016.
  • [21] H. Pan, H. Han, S. Shan, and X. Chen. Mean-variance loss for deep age estimation from a face. In CVPR, 2018.
  • [22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • [23] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In FGR. IEEE, 2006.
  • [24] R. Rothe, R. Timofte, and L. Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
  • [25] S. Semeniuta, A. Severyn, and E. Barth. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390, 2017.
  • [26] X. Shu, J. Tang, H. Lai, L. Liu, and S. Yan. Personalized age progression with aging dictionary. In ICCV, Dec. 2015.
  • [27] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
  • [28] W. Wang, Z. Cui, Y. Yan, J. Feng, S. Yan, X. Shu, and N. Sebe. Recurrent face aging. In CVPR, Jun. 2016.
  • [29] Z. Wang, X. Tang, W. Luo, and S. Gao. Face aging with identity-preserved conditional generative adversarial networks. In CVPR, 2018.
  • [30] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [31] X. Wu, H. Huang, V. M. Patel, R. He, and Z. Sun. Disentangled variational representation for heterogeneous face recognition. In AAAI, 2019.
  • [32] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of gans. In CVPR, 2018.
  • [33] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of gans. CVPR, 2018.
  • [34] T.-Y. Yang, Y.-H. Huang, Y.-Y. Lin, P.-C. Hsiu, and Y.-Y. Chuang. Ssr-net: A compact soft stagewise regression network for age estimation. In IJCAI, 2018.
  • [35] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  • [36] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR. IEEE, 2017.
  • [37] Y. Zhang, L. Liu, C. Li, et al. Quantifying facial age by posterior of age comparisons. arXiv preprint arXiv:1708.09687, 2017.
  • [38] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.
  • [39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017.

6 Supplementary Materials

In this supplementary material, we first introduce the network architecture and the training algorithm of the proposed UVA. We then describe the metrics used in the age estimation experiments. Additional comparisons are provided to demonstrate the continuous manner of UVA, followed by the ablation study and additional age generation and translation results.

6.1 Network Architecture

The network architectures of the proposed UVA are shown in Table 5. All images are aligned to 256×256 as the inputs.

Encoder E: an input convolution (Econv1), followed by six stages each consisting of a 2× average-pooling layer and a residual block (Eavgpool1-6 / Eres1-6). The final feature map is flattened and mapped by a fully-connected layer to a 1024-d vector (Eflatten, Efc), which is split into four 256-d vectors μ_a, σ_a, μ_z and σ_z (Esplit). Two reparameterization steps (Erepara1 on Esplit[0,1], Erepara2 on Esplit[2,3]) yield the 256-d codes z_a and z, which are concatenated into a 512-d latent code (Ecat).

Generator G: a fully-connected layer with reshape (Gfc), followed by seven residual blocks (Gres1-7) interleaved with six 2× nearest-neighbor upsampling layers (Gupsample1-6), and a final output convolution (Gconv1).

Table 5: Network architectures of UVA.
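As a reading aid, here is a schematic PyTorch rendering of the encoder topology in Table 5. The channel widths, kernel sizes and activation are assumptions for illustration only; the stage structure and the 1024-d output split into four 256-d vectors follow the table.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    # A generic residual block; the exact block design is an assumption.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c_out, c_out, 3, padding=1))
        self.skip = (nn.Conv2d(c_in, c_out, 1)
                     if c_in != c_out else nn.Identity())
    def forward(self, x):
        return nn.functional.leaky_relu(self.body(x) + self.skip(x), 0.2)

class Encoder(nn.Module):
    # Input conv, then six (2x avg-pool + res-block) stages, then a
    # fully-connected layer to the 1024-d vector split into
    # (mu_a, logvar_a, mu_z, logvar_z). Widths are hypothetical.
    def __init__(self, widths=(32, 64, 128, 256, 512, 512, 512)):
        super().__init__()
        layers = [nn.Conv2d(3, widths[0], 5, padding=2)]
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.AvgPool2d(2), ResBlock(c_in, c_out)]
        self.trunk = nn.Sequential(*layers)
        self.fc = nn.Linear(widths[-1] * 4 * 4, 1024)  # 256x256 -> 4x4
    def forward(self, x):
        h = self.trunk(x).flatten(1)
        return self.fc(h).chunk(4, dim=1)  # mu_a, logvar_a, mu_z, logvar_z
```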

6.2 Training Algorithm

The training process of UVA is described in Algorithm 1.


Algorithm 1 Training UVA Model
1:  Initialize the network parameters of E and G
2:  while not converged do
3:      X ← random mini-batch from the training set
4:      (μ_a, σ_a, μ_z, σ_z) ← E(X)
5:      Z_a ← μ_a + σ_a ⊙ ε,  Z ← μ_z + σ_z ⊙ ε,  with ε ~ N(0, I)
6:      Z_p ← samples from the prior N(0, I)
7:      X_r ← G(Z_a, Z),  X_p ← G(Z_a, Z_p)
8:      Compute L_AE, L_KL_a and L_AP via Eqs. (4), (5) and (10)
9:      Compute L_KL_z(X), L_KL_z(X_r) and L_KL_z(X_p) via Eq. (6)
10:     Compute L_adv^E and L_adv^G via Eqs. (7) and (8)
11:     L_E ← λ1 L_AE + λ2 L_KL_a + λ3 L_adv^E + λ4 L_AP
12:     L_G ← λ1 L_AE + λ2 L_KL_a + λ5 L_adv^G + λ4 L_AP
13:     Perform an Adam update of E with ∇L_E
14:     Perform an Adam update of G with ∇L_G
15: end while

6.3 Evaluation Metrics of Age Estimation

We evaluate the performance of UVA on the age estimation task with the Mean Absolute Error (MAE) and the Cumulative Accuracy (CA).

MAE is widely used to evaluate the performance of age estimation models. It is defined as the average absolute distance between the inferred age and the ground truth:

$$MAE = \frac{1}{K}\sum_{k=1}^{K}\left|\hat{y}_k - y_k\right| \tag{17}$$

The Cumulative Accuracy (CA) is defined as:

$$CA(n) = \frac{K_n}{K} \times 100\% \tag{18}$$

where $K_n$ is the number of test images whose absolute estimation error is smaller than $n$, and $K$ is the total number of test images.
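Both metrics are a few lines of NumPy; a sketch for completeness:

```python
import numpy as np

def mae(pred, gt):
    # Eq. (17): mean absolute error over the K test images
    return np.abs(np.asarray(pred) - np.asarray(gt)).mean()

def cumulative_accuracy(pred, gt, n):
    # Eq. (18): percentage of test images with absolute error below n
    err = np.abs(np.asarray(pred) - np.asarray(gt))
    return 100.0 * (err < n).mean()
```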

6.4 Additional Comparison Results on FG-NET

Figure 14: Comparison with CAAE on FG-NET.

Additional comparison results with the conditional adversarial autoencoder (CAAE) [38] on FG-NET are shown in Fig. 14. Note that our model is trained on UTKFace. We observe that CAAE can only translate the input into specific age groups (0-5, 6-10, 11-15, 16-20, 21-30, 31-40, 41-50, 51-60, 61-70 and 71-80), whereas UVA can perform age translation at arbitrary ages. In addition, CAAE only models the inner face without hair, while UVA achieves age translation on the whole face. As shown in Fig. 14, the hairline gradually recedes as age increases.

6.5 Ablation Study

Figure 15: Model comparison: age translation results of UVA and its variants. For each subject, the first row shows the synthesis results of the full UVA, while the second, third and fourth rows show the results without the introspective adversarial loss, the age preserving loss and the age regularization term, respectively.

In this section, we conduct an ablation study on Morph to evaluate the effectiveness of the introspective adversarial loss, the age preserving loss and the age regularization term. We report qualitative visualization results for a comprehensive comparison. Fig. 15 shows visual comparisons between UVA and its three incomplete variants. We observe that the proposed UVA is visually better than its variants across all ages. Without the age preserving loss or the age regularization term, the aging images lack detailed texture information. Besides, without the introspective adversarial loss, the aging faces are obviously blurry. The visualization results demonstrate that all three components are essential for UVA.

6.6 Age Generation

Benefiting from the disentangled age-related and age-irrelevant distributions in the latent space, the proposed UVA is able to realize age generation. On the one hand, given a facial image, by fixing the age-related representation and sampling the age-irrelevant code from noise, UVA can generate various faces at the specific age of the input. Fig. 18 shows such age generation results on Morph, CACD2000, MegaAge-Asian and UTKFace. On the other hand, by manipulating the mean of the age-related distribution and sampling from noise, UVA can generate faces at various ages. Fig. 16 shows the age generation results from 10 to 70 years old on MegaAge-Asian.

Figure 16: Age generation results on MegaAge-Asian. Each row has the same age-irrelevant representation and each column has the same age-related representation. The images in MegaAge-Asian are of low resolution and limited quality, leading to blurred synthesized results.

6.7 Age Translation

Figure 17: Age distribution in UTKFace.
Figure 18: Age generation results with the age feature extracted from the given image on Morph, CACD2000, MegaAge-Asian and UTKFace. For each subject, the first column is the input face and the remaining three columns are the generated faces.

As shown in Fig. 17, UTKFace exhibits a long-tailed age distribution. The age translation results on UTKFace are presented in Figs. 19, 20 and 21. The ages range from 0 to 119 years old with a 1-year aging interval. Since the images in UTKFace only range from 0 to 116 years old, it is remarkable that UVA can synthesize photo-realistic aging images even at 117, 118 and 119 years old. These results suggest that UVA can efficiently and accurately estimate the age-related distribution from facial images, even if the training dataset exhibits a long-tailed age distribution. Considering that generating high-resolution images is important for enlarging the application field of face aging, in the future we will build a new facial age dataset to support age-related studies.

Figure 19: Age translation results on UTKFace from 0 to 119 years old with 1-year-old interval. The image in the red box is the input (18 years old). The other images are the age translation results from 0 to 119 years old.
Figure 20: Age translation results on UTKFace from 0 to 119 years old with 1-year-old interval. The image in the red box is the input (44 years old). The other images are the age translation results from 0 to 119 years old.
Figure 21: Age translation results on UTKFace from 0 to 119 years old with 1-year-old interval. The image in the red box is the input (85 years old). The other images are the age translation results from 0 to 119 years old.