Face aging [fu2010age], also called face age progression, aims to synthesize faces of a person at different ages. It is a key technique for a variety of applications, including finding missing children, cross-age face analysis, and movie entertainment.
Considerable research effort [yang2018learning, shu2015personalized, wang2016recurrent, he2019s2gan, liu2019attribute, zhang2017age] has been devoted to generating realistic aged faces, and deep generative adversarial networks (GANs) have become one of the leading approaches to face age progression. For example, Zhang et al. [zhang2017age] proposed the conditional adversarial autoencoder (CAAE) framework for age progression, and Wang et al. [wang2018face] proposed an identity-preserved conditional generative adversarial network, in which a perceptual loss is introduced to keep identity information. Facial attributes are very important for synthesizing authentic and plausible face images [lu2018attribute]. Changeable attributes may vary under different conditions: for example, working conditions, chronic medical conditions, and lifestyle habits may affect individual aging processes. For attributes that should remain unchanged, some researchers have found that unpaired training data may lead to unnatural changes of facial attributes [liu2019attribute]. Motivated by this, we propose a controllable face aging method that synthesizes face images with the desired attributes (e.g., race and gender).
In parallel, style-based generative adversarial networks [karras2019style, gatys2016image], which render a content image in the style of another image, have drawn much interest. A style-based generator can not only generate impressively high-quality images but also offer control over the style of the synthesized images [liu2019few, huang2017arbitrary]. However, the style-based GAN has two limitations for face aging. First, it transfers the style of one image to another, rather than the common age pattern [wang2018face]. Second, the style images may contain other unwanted facial attributes, such as eyeglasses and smiling. If we use these style images, the style-based GAN may generate face images that contain these unwanted attributes (please refer to Subsection 4.3 for more details). Hence, it is still hard to control the style of the synthesized face image without perfect style images that have only the desired attributes.
In this paper, we propose an attribute disentanglement GAN for controllable face aging. Our framework is based on the style-based GAN, as shown in Figure 1. We first encode an image with the desired attributes into a latent individual embedding. Instead of directly using that individual embedding, we learn an attribute disentanglement network, which disentangles and distils the knowledge of the individual embedding. This process aims to learn a common embedding that contains information about the age pattern shared by all images with the given facial attributes. Finally, we feed the learned common embedding to the decoder via adaptive instance normalization (AdaIN) [huang2017arbitrary] blocks, which control the aging process with the desired attributes in an explicit manner.
The main contributions of our work are threefold. First, we propose an attribute disentanglement GAN for controllable face aging. Compared to existing GAN-based face aging models, our method offers more control over the attributes of the generated face images at different levels. Second, we propose an attribute disentanglement method that obtains a more reliable age pattern by removing the effect of other unwanted attributes. Finally, we evaluate the proposed method both qualitatively and quantitatively.
2 Related Work
Face Aging Traditional face aging approaches [fu2010age, shu2015personalized] can be divided into two categories: physical model-based approaches and prototype-based methods. The physical model-based methods [suo2012concatenational] focus on modelling the aging mechanisms, e.g., the skin's anatomical structure and muscle changes. The prototype-based methods [kemelmacher2014illumination] use the differences between the average faces of age groups as the aging pattern. Recently, methods based on generative adversarial networks [goodfellow2014generative] have also been widely studied for age progression. For example, Wang et al. [wang2016recurrent] introduced a recurrent neural network for face aging. Antipov et al. [antipov2017face] proposed applying a conditional GAN to face aging, and Zhang et al. [zhang2017age] used an auto-encoder conditional GAN. Song et al. [song2018dual] presented dual conditional GANs for simultaneously rendering a series of age-changed images of a person. Yang et al. [yang2018learning] proposed a pyramid architecture of GANs for aging accuracy and identity permanence. Although tremendous progress has been made by GAN-based methods, limited attention has been paid to face aging with controllable attributes. In this paper, we propose a user-controllable approach for face aging.
Style Transfer Closely related work is style transfer [karras2019style, johnson2016perceptual, gatys2016image], which synthesizes an image whose content comes from one input image and whose style comes from another, artistic style image. Gatys et al. [gatys2016image] first used convolutional neural networks to obtain impressive style transfer results. Later, Johnson et al. [johnson2016perceptual] proposed using perceptual loss functions to train a feed-forward network. The adaptive instance normalization (AdaIN) layer [huang2017arbitrary], which simply adjusts the mean and variance of the content input to match those of the style input, enables arbitrary style transfer in real time. Inspired by that, Karras et al. [karras2019style] proposed a style-based GAN to control the image synthesis process. However, style transfer focuses on transferring the style of one image to another, while face aging requires transferring the age pattern to the input face [wang2018face]. Hence, it is hard to apply style transfer directly to face aging. In this paper, we propose a style disentanglement module, which removes the unwanted attributes and learns the common age pattern, making style transfer applicable to face aging.
3 Proposed Method
In this paper, we propose an attribute disentanglement GAN for controllable face aging. Similar to style transfer, our method has two inputs: an input face image and the desired attributes, i.e., the target age together with the desired facial attributes. For ease of presentation, we consider only the two most common attributes: race and gender. Note that our method can easily be extended to other attributes for more control over the aging process. The desired attributes are represented by a label; for example, a label may denote a white male of a given age. The face dataset consists of the training face images, each paired with its corresponding label.
At test time, we use the generator and the attribute disentanglement network to render the face image with aging effects in the target attributes, as shown on the right side of Fig. 1.
During training, we additionally introduce an individual attribute encoder and a discriminator to train the generator and the attribute disentanglement network. Note that the attribute encoder extracts the individual age pattern, while the attribute disentanglement network learns the common age pattern. To train the attribute encoder, we also introduce style images, which contain the desired attributes. The attribute encoder maps a style image into an individual embedding. The discriminator distinguishes the generated face images from the real face images. We describe each module in detail below.
Generator Fig. 1 shows the architecture of the generator. Similar to other GAN-based methods for face aging, it first uses multiple down-sampling convolutional layers to learn high-level feature maps, which then pass through multiple up-sampling convolutional layers to generate the output image. The main difference is that we learn an affine transformation before the up-sampling layers to inject the attribute embedding (i.e., the output of either the attribute encoder or the attribute disentanglement network) into the generator via AdaIN [huang2017arbitrary] operations. More precisely, consider the feature maps before an up-sampling layer, described by their height, width, and number of channels. AdaIN operates on each channel separately: it normalizes the channel with its mean and standard deviation and then modulates it with parameters obtained from the attribute embedding. For ease of presentation, we write the generator as a function that takes an input image and an attribute embedding and returns the synthesized face image; at test time, the embedding is the common embedding of the age pattern, into which the attribute disentanglement network encodes the attributes.
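The per-channel operation can be sketched as follows. This is a minimal illustration of the standard AdaIN form from [huang2017arbitrary], not the implementation used in this work: the function name `adain`, the nested-list tensor layout, the `eps` term, and the per-channel `scale` and `shift` values (standing in for the output of the learned affine transformation of the attribute embedding) are all assumptions.

```python
import math

def adain(feature_maps, scale, shift, eps=1e-5):
    """Apply AdaIN channel by channel.

    feature_maps: list of C channels, each a list of H rows of W floats.
    scale, shift: lists of C floats, assumed to come from an affine
    transformation of the attribute embedding (hypothetical here).
    """
    out = []
    for channel, gamma, beta in zip(feature_maps, scale, shift):
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)
        # Normalize the channel, then re-scale and shift it.
        out.append([[gamma * (v - mean) / std + beta for v in row]
                    for row in channel])
    return out

# One 2x2 channel; after AdaIN its mean equals `shift` and its
# standard deviation is approximately `scale`.
x = [[[1.0, 2.0], [3.0, 4.0]]]
y = adain(x, scale=[2.0], shift=[0.5])
```

In this reading, the content of the input face survives in the normalized channel, while the attribute embedding only sets the channel-wise statistics.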
Attribute Encoder and Attribute Disentanglement The attribute encoder takes an individual style image as input and encodes it into an individual embedding. Note that it is used only in the learning stage and can be removed once training is finished. The attribute disentanglement network learns the age pattern directly from the attributes. Its input is an attribute code with the same width and height as the input images and one feature map per attribute, so the number of feature maps is determined by the numbers of age groups, genders, and races. We use a one-hot code for the attributes, in which only one feature map is filled with ones and the others are filled with zeros. To generate more diverse images, we also add a noise channel to the attribute code. The attribute disentanglement network takes the resulting code as input and obtains the common embedding with the help of the attribute encoder.
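A possible construction of this input code is sketched below. It assumes one feature map per joint (age group, gender, race) combination, which is one reading of the statement that only a single feature map is filled with ones; the function name `attribute_code`, the index ordering, the Gaussian noise, and the example sizes (4 age groups, 2 genders, 3 races) are hypothetical.

```python
import random

def attribute_code(age, gender, race, n_age, n_gender, n_race,
                   height, width, rng):
    """Build a (K + 1) x H x W code: K one-hot feature maps plus a noise map."""
    k = n_age * n_gender * n_race  # assumed: one map per attribute combination
    index = (age * n_gender + gender) * n_race + race
    # Exactly one map is filled with ones; all others are filled with zeros.
    maps = [[[1.0 if c == index else 0.0] * width for _ in range(height)]
            for c in range(k)]
    # Extra channel of random noise to diversify the generated images.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
    return maps + [noise]

code = attribute_code(age=2, gender=1, race=0,
                      n_age=4, n_gender=2, n_race=3,
                      height=2, width=2, rng=random.Random(0))
```

Changing only the `age` argument while keeping `gender` and `race` fixed selects a different feature map, which is how a target age could be requested without altering the other attributes.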
Discriminators Following [liu2019few], we use multiple binary classifications (one binary classification per attribute) instead of a single multi-class classification. Liu et al. [liu2019few] showed that multiple binary classifications perform better than a hard multi-class classification. Each binary adversarial classifier is associated with one attribute and distinguishes the generated face images from the real images for that attribute.
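One way to picture this design is a discriminator that emits one real/fake score per attribute class, of which only the score for the relevant class enters the loss. The sketch below assumes a binary cross-entropy loss and a hypothetical function name `binary_adversarial_loss`; the actual adversarial loss in [liu2019few] or in this work may take a different form.

```python
import math

def binary_adversarial_loss(logits, class_index, is_real):
    """Binary cross-entropy on the single logit selected by the class label.

    logits: one real/fake score per attribute class (hypothetical output
    of the discriminator for one image).
    """
    p_real = 1.0 / (1.0 + math.exp(-logits[class_index]))
    return -math.log(p_real) if is_real else -math.log(1.0 - p_real)

logits = [3.0, -1.0, 0.0]  # scores for three attribute classes
loss_real = binary_adversarial_loss(logits, class_index=2, is_real=True)
loss_fake = binary_adversarial_loss(logits, class_index=2, is_real=False)
```

Because only the selected logit is used, the scores of the other classes do not affect the loss, unlike a softmax over all classes.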
We propose a two-stage method to train the four modules. In the first stage, we learn the generator, the attribute encoder, and the discriminator, which together learn to render the input face with the facial attributes of another face image. In the second stage, we further learn the attribute disentanglement network, which removes the unwanted attributes of the individual face images and obtains the common age pattern.
Updating the generator, attribute encoder, and discriminator via individual attribute translation. There are two input images, an input face and a style image, each with its corresponding label. In this stage, we learn to render the input face image in the style of the other face image, similar to existing style-based GANs. We follow [liu2019few] and use an objective that combines three terms: a GAN loss, a reconstruction loss, and a feature matching loss. The GAN loss is the adversarial loss between the generator and the discriminator. The reconstruction loss requires the generator to reconstruct the input image when the input face image and the style image are the same image.
The feature matching loss minimizes the distance between two sets of features, extracted from the output image and from the style image, respectively. The features are taken from the second-to-last layer of the discriminator. This loss is used to regularize the training [liu2019few].
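The three terms of the first-stage objective can be written down as follows. This is only a sketch patterned on the description above and on the general form used in [liu2019few]: the symbols $G$ (generator), $E$ (attribute encoder), $D_f$ (second-to-last layer of the discriminator), $x$ (input face), $y$ (style image), the weights $\lambda_R$ and $\lambda_F$, and the choice of the $\ell_1$ norm are all assumptions rather than notation confirmed by the text.

```latex
\begin{align}
\mathcal{L}_{\text{stage 1}} &= \mathcal{L}_{GAN} + \lambda_R \, \mathcal{L}_{R} + \lambda_F \, \mathcal{L}_{FM}, \\
\mathcal{L}_{R} &= \mathbb{E}_{x} \big[ \, \| x - G(x, E(x)) \|_1 \, \big], \\
\mathcal{L}_{FM} &= \mathbb{E}_{x, y} \big[ \, \| D_f(G(x, E(y))) - D_f(y) \|_1 \, \big].
\end{align}
```

Under this notation, $\mathcal{L}_{R}$ is the case where the input face and the style image coincide, and $\mathcal{L}_{FM}$ compares the output image with the style image in the discriminator's feature space.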
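Updating the attribute disentanglement network via common attribute translation. We fix the parameters of the modules trained in the first stage and update the attribute disentanglement network. The optimization objective contains the same three terms as the first stage, except that the common embedding replaces the individual embedding as the attribute embedding: in the first stage the generator is conditioned on the output of the attribute encoder, whereas now it is conditioned on the output of the attribute disentanglement network. To transfer knowledge from the attribute encoder to the attribute disentanglement network, we add a new disentanglement loss, defined on the individual embedding of the style image and the common embedding of its attribute code.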
Note that there are many style images but only one attribute condition; thus, the attribute disentanglement network can learn the common age pattern of all style face images. This process suppresses the unwanted attributes and yields a more reliable common embedding.
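A toy calculation illustrates why pairing many style images with a single attribute condition favors a common pattern. It assumes, purely for illustration, that the disentanglement loss is the mean squared distance between the common embedding and each individual embedding; the actual loss may differ. Under that assumption the best common embedding is the mean of the individual embeddings, so image-specific deviations cancel out. The function names and the two-dimensional embeddings are hypothetical.

```python
def mean_embedding(embeddings):
    """Component-wise mean of a list of equal-length embeddings."""
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n
            for d in range(len(embeddings[0]))]

def squared_distance_loss(common, embeddings):
    """Mean squared distance between one common and many individual embeddings."""
    return sum(sum((c - v) ** 2 for c, v in zip(common, e))
               for e in embeddings) / len(embeddings)

# First component: shared pattern; second component: image-specific deviation.
individual = [[1.0, 0.2], [1.0, -0.2], [1.0, 0.0]]
common = mean_embedding(individual)
```

Here the shared first component is kept while the deviations in the second component average to zero, and no individual embedding achieves a lower loss than the mean.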
4.1 Datasets and Implementation Details
MORPH [ricanek2006morph] is a large-scale face dataset consisting of 55,134 images of more than 13,000 individuals. There are on average 4 images per individual, and the ages range from 16 to 77. Following the settings in [liu2019attribute], we divide the images of MORPH into four age groups, i.e., 30-, 31-40, 41-50, and 51+.
UTKFace [zhang2017age] (https://susanqq.github.io/UTKFace/) consists of over 20,000 face images with annotations of age, ethnicity, and gender. The ages range from 0 to 116 years. We follow the setting in [zhang2017age] and divide the face images into ten age groups, i.e., 0-5, 6-10, 11-15, 16-20, 21-30, 31-40, 41-50, 51-60, 61-70, and the rest.
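The two groupings above can be expressed as a small lookup. The sketch below is illustrative only; the constant and function names are hypothetical, and the group indices are zero-based by assumption.

```python
import bisect

# Inclusive upper bounds of every age group except the last, open-ended one.
MORPH_BOUNDS = [30, 40, 50]                           # 30-, 31-40, 41-50, 51+
UTKFACE_BOUNDS = [5, 10, 15, 20, 30, 40, 50, 60, 70]  # ten groups in total

def age_group(age, bounds):
    """Return the zero-based index of the age group that contains `age`."""
    return bisect.bisect_left(bounds, age)

morph_example = age_group(45, MORPH_BOUNDS)      # falls in the 41-50 group
utkface_example = age_group(45, UTKFACE_BOUNDS)  # falls in the 41-50 group
```

`bisect_left` places an age equal to a bound in the group that ends at that bound, so 30 maps to the first MORPH group and 31 to the second.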
In our experiments, all images are resized to the same size, and RMSProp is used as the optimizer with a preset learning rate and batch size. The maximum numbers of iterations in the two stages are set to 100,000 and 50,000, respectively. All parameters of the proposed architecture are initialized with Kaiming initialization. Due to space limitations, more implementation details can be found in the supplementary material, in which we also provide the source code of our implementation.
4.2 Comparison with Prior Work
In this set of experiments, we compare the performance of the proposed method with several state-of-the-art methods: HFA [yang2016face], GLCA-GAN [li2018global], CAAE [zhang2017age], IPCGAN [wang2018face], PAG-GAN [yang2018learning], and Attribute-aware GAN [liu2019attribute]. Some baselines aim to keep the unchanged facial attributes in order to preserve identity information, e.g., the attribute-aware GAN [liu2019attribute]. To make a fair comparison, we show that the proposed method can generate high-quality face images while also preserving the unchanged facial attributes. For example, given an input face with known race and gender and a target age, we synthesize the aged face by feeding the generator the input image together with the common embedding obtained from the attribute code that combines the target age with the race and gender of the input, as shown in Fig. 1.
|Wang et al. [wang2018face]||1|
|Li et al. [li2018global]||1|
|Liu et al. [liu2019attribute]|
|Yang et al. [yang2018learning]|
The comparison results are shown in Fig. 2 and Fig. 3. Our method outperforms some baselines, e.g., HFA, CAAE, and GLCA-GAN, and achieves results comparable to those of PAG-GAN and Attribute-aware GAN. These results show that the proposed method can also generate high-quality face images. Table 1 shows the complexity analysis of the different methods. Our method needs only one model to preserve the facial attributes across all age groups, while PAG-GAN and Attribute-aware GAN need to train multiple models. From the above results, we can see that our method achieves comparable results while gaining more flexibility for attribute control.
Facial Attribute Consistency We also evaluate the preservation of facial attributes, following the settings of [liu2019attribute]. We randomly sample 2,000 images to compute the preservation rate. For a fair comparison, the results of the competitors are cited directly from [liu2019attribute]. The comparison results are shown in Table 2 and Table 3. Our method performs better than the baselines at preserving the facial attributes.
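As a small illustration of the metric, the sketch below assumes that the preservation rate is the percentage of generated faces whose estimated attribute equals that of the corresponding input face. The function name `preservation_rate` and the toy labels are hypothetical, and how the attributes are estimated is left open.

```python
def preservation_rate(input_attrs, generated_attrs):
    """Percentage of generated faces whose attribute matches the input's."""
    matches = sum(a == b for a, b in zip(input_attrs, generated_attrs))
    return 100.0 * matches / len(input_attrs)

# Toy example with four faces, one of which changes gender after aging.
rate = preservation_rate(["male", "female", "male", "male"],
                         ["male", "female", "female", "male"])
```

In the evaluation described above, the two lists would each hold 2,000 entries, one per sampled image.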
We also show some results in Fig. 6. Besides, to demonstrate the controllability and flexibility of our method, we generate face images with different attributes, as shown in Fig. 4. In our method, face images with different attributes can be synthesized by changing only the conditional attribute code, e.g., European to African, male to female, young to old, and so on. As can be seen, our model is able to conditionally synthesize face images of different races, ages, and genders.
|Yang et al. [yang2018learning]||95.96||93.77||92.47|
|Liu et al. [liu2019attribute]||97.37||97.21||96.07|
|Yang et al. [yang2018learning]||95.83||88.51||87.98|
|Liu et al. [liu2019attribute]||95.86||94.10||93.22|
4.3 Ablation Study
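In this set of experiments, we conduct an ablation study to clarify the impact of the proposed attribute disentanglement on the final performance. Without the attribute disentanglement network, the model requires two inputs, an input image and a style image, and generates the output by feeding the generator the input image together with the individual embedding that the attribute encoder extracts from the style image.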
Fig. 5 shows some examples, where the second row contains the style images and the third row contains the images generated from the test face and the corresponding style image. Two observations can be made. 1) The face images generated without the attribute disentanglement network always look like their style images, e.g., the second column and the last column. In contrast, our proposed attribute disentanglement learns the common age pattern, e.g., the first row, which makes style transfer applicable to face aging. 2) The face images generated without the attribute disentanglement network may contain other unwanted facial attributes. For example, in the penultimate column, even though the input image and the style image belong to the same race and gender, the skin color of the generated face does not match the desired attribute. Our proposed method solves this problem and provides fine control over the generated face images.
In this paper, we proposed a controllable face aging method based on an attribute disentanglement generative adversarial network. In the proposed aging architecture, a face image and the desired attributes are used as inputs. We then proposed two modules, an attribute encoder and an attribute disentanglement network, to learn the latent embedding that contains the common age pattern of the desired facial attributes. Finally, we used adaptive instance normalization layers to render the input image in the style of the common embedding. The experimental results showed that our proposed method achieves comparable performance with more flexibility for attribute control.