High Resolution Face Age Editing

05/09/2020 ∙ by Xu Yao, et al.

Face age editing has become a crucial task in film post-production, and is also becoming popular for general purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to create both a latent space containing the face identity, and a feature modulation layer corresponding to the age of the individual. We then combine these two elements to produce an output image of the person with a desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for continuous age editing on high resolution images in a single unified model.

1 Introduction

Figure 1: Age editing results on images, shown for target ages 25, 35, 45, 55 and 65. We propose a single deep age transformer network able to perform both face aging and de-aging, producing high quality images that are sharp and with few artifacts. Using the face images indicated by a yellow frame as input, our network can output a photo-realistic image of the same person at any required target age in the range {20, …, 69}.

Learning to manipulate face age is an important topic both in industry and academia. In the movie post-production industry, many actors are retouched in some way, either for beautification or texture editing. More specifically, synthetic aging or de-aging effects are usually generated by makeup or special visual effects. Although impressive results can be obtained digitally, as in Martin Scorsese's recent movie The Irishman, the underlying processes are extremely time consuming. Thus, robust, high-quality algorithms for performing automatic age modification are highly desirable. Nevertheless, editing faces is an intrinsically difficult task. Indeed, the human brain is particularly good at perceiving facial attributes in order to detect, recognize or analyze faces, for instance to infer identity or emotions. Consequently, even small artifacts are immediately perceived and ruin the perception of results. For this reason, our goal is to produce artifact-free, sharp and photorealistic results on high-resolution face images.

With the success of Generative Adversarial Networks (GANs) [Goodfellow2014] in high quality image generation, GAN-based models have been widely used for image-to-image translation [wang2018pix2pixHD, zhu2017unpaired]. Despite having set new standards for natural image synthesis, GANs are known to suffer from two major flaws: an abundance of small artifacts and strong instability of the training process. The latest face aging studies [he2019s2gan, Liu_2019_CVPR, song2018dual, wang2018face, zhang2017age] also adopt GAN-based models. Specifically, they divide face datasets into different age groups, feed young images into the generator, and rely on the discriminator to map output images to older age distributions. There are multiple limitations to this approach. Firstly, as can be expected, these approaches inherit the drawbacks of GAN-based methods: blurry backgrounds, small parasitic structures and unstable training. Secondly, as the aging effect is generated by matching the output image distribution to the target group, these methods are limited to coarse aging/de-aging. To achieve fine-grained transformation, a separate model needs to be trained between each pair of ages.

In this work, we propose an encoder-decoder architecture for the problem of face age editing with high visual quality on high resolution images. In order to address the aforementioned limitations, namely the tendency to produce visual artifacts and training instability, we endeavour to keep the architecture as simple as possible. Firstly, we use a single network for both aging and de-aging. This is reasonable since the encoder part of our model is assumed to encode identity, emotion or details in the input image that are not related to age, so that the same latent space can be used for both tasks of aging and de-aging. Secondly, we rely on a feature modulation layer that is compact, acts directly on the latent space and allows for continuous age transitions. Thirdly, unlike in competing methods where the discriminator used during adversarial training is conditioned on the target age, we use a discriminator which is not conditioned and concentrates solely on the photorealism of the output images to reduce editing artifacts. The discriminator can be considered as a regularizer which imposes photorealism, rather than a traditional discriminator trying to match two distributions. Thanks to this design, our model achieves efficient disentanglement of age attributes and face identity. We present experimental results on high resolution images with qualitative and quantitative evaluations. In particular, these experiments provide clear evidence that the visual quality of our results exceeds that of state-of-the-art methods. Experiments on alternative datasets further illustrate the generalization capacity of the method.

2 Related Works

Face aging  The survey work [fu2010age] gives an exhaustive overview of the traditional age synthesis algorithms. In this work, we are more interested in deep learning based methods, which have made impressive progress on face aging tasks during the last few years. A conditional GAN [mirza2014conditional] model was first introduced for the face aging task by [antipov2017face, zhang2017age]. They encode the face image to the latent space, manipulate the latent code, and decode it to an aged face with the generator. However, the identity information is damaged during this process. This is further improved by [wang2018face, yang2018learning], by adding an identity preserving term to the objective. Despite the improvement, their results are over-smoothed compared with the input images. To capture texture details, wavelet-based generative models are introduced by [li2019global, Liu_2019_CVPR]. Their complex models increase the training difficulty and still yield strong artifacts. All the aforementioned models only enable face aging from one age group to another, e.g., from 20s to 40s, and thus lack flexibility. Recently, [he2019s2gan] proposed an encoder-decoder network, in which a personalized aging basis is synthesized and an age-specific transform is applied. Their model also relies on a conditional discriminator to distinguish aging patterns between age groups. Different from other methods, our model is designed for age editing with an arbitrary target age. Moreover, our approach produces far fewer artifacts, making age editing on high-resolution images possible.

Image-to-image translation  Face aging can be considered as an image-to-image translation problem, i.e., translating images between young age and old age domains. An optimization based method is proposed by [upchurch2017deep], showing the possibility to use linear interpolation of deep features from pretrained convnets to transform images. GAN based methods [pix2pix2016, zhu2017unpaired, huang2018multimodal] further enable real-time translation, by training a feed forward generator. Existing image-to-image translation studies [chen2019semantic, choi2018stargan, lample2017fader, pumarola2018ganimation, qian2019make, xiao2018elegant] on face images also yield impressive results in manipulating facial attributes. Lample et al. [lample2017fader] design an autoencoder architecture to reconstruct images, and isolate single image characteristics in a latent component via a discriminator. These characteristics can then be modified directly in the latent space. Choi et al. [choi2018stargan] propose a method to perform image-to-image translation for multiple domains using only a single model. Pumarola et al. [pumarola2018ganimation] introduce an attention based model, which enables face animation by simple interpolation.

High-resolution image synthesis  In spite of the considerable progress of recent methods, manipulating/editing natural images of high resolution has not yet been achieved. Nevertheless, in another task, image generation, high quality results at high resolution are now available. High-resolution image generation was first achieved by [karras2018progressive], with a progressive growing of GAN architectures. The quality of their results is further improved by StyleGAN [karras2019style, karras2019analyzing], which learns a separation of high-level attributes automatically during training. Based on this work, Shen et al. [shen2019interpreting] propose an effective way to interpret the latent space learned by the generator and achieve high visual fidelity face manipulation on synthesized images. However, according to our experiments, only a fraction of natural images can be accurately reconstructed from a latent code, which makes this type of method impractical. In contrast, our proposed method achieves face age editing on high-resolution images, with great simplicity of architecture and loss design. The age editing is achieved only by an auxiliary modulating network, which could potentially be generalized to other face manipulation tasks.

3 Method

Figure 2: Training process: each input image is edited by the age transformer using the initial age (reconstruction task) and the target age (editing task). The reconstructed image should be identical to the input image. The edited image is further passed to a discriminator that ensures photorealism of the transformed image, and to an age classifier that ensures age-accurate transformation. The age transformer contains three sub-networks: an encoder, a modulating network and a decoder. The encoder maps the input image to an age-invariant deep feature space. The modulating network maps a target age to a modulating vector with one entry per channel of the encoded features. This vector is used to modulate each channel of the encoded features, hence applying the desired age transformation. The modulated features are finally passed to the decoder to obtain the transformed image. Two skip connections link the encoder and the decoder in order to better preserve the age-irrelevant details.

In this section, we state the face age editing problem and describe our proposed model in detail. Figure 2 illustrates our proposed age transformer and training procedure.

3.1 Overview

Let $x_0$ be an image drawn randomly from a face dataset. We denote by $\alpha_0$ the age of the person in $x_0$. Our goal is to transform $x_0$ so that the person in this image looks like someone at $\alpha_1$ years old. We want the aged version of $x_0$ to share many age-unrelated characteristics with $x_0$: identity, emotion, haircut, background, etc. That is to say, the facial attributes not relevant to age, as well as the background, need to be preserved during age transformation. Therefore, we assume that a face aging model and a face de-aging model can share most of their parameters. In this setting, we consider a single age transformer $G$ and assume that $G$ can transform any face image to any target age. The inputs of our model are the face image $x_0$ and the target age $\alpha_1$. The output is denoted by $G(x_0, \alpha_1)$, which depicts $x_0$ at the target age $\alpha_1$.

3.2 Age transformer

The proposed age transformer shown in Figure 2 employs an auto-encoder architecture and is made of an encoder, a feature modulation block and a decoder. The encoder consists of three strided convolutional layers (the first one of stride 1, the other two of stride 2) and four residual blocks [he_resnet16], while the decoder contains two nearest-neighbour upsampling layers and three convolutional layers, similar to the architecture used in [johnson2016perceptual, zhu2017unpaired]. The main difference compared to these works is our feature modulation block, in which the output features of the encoder are modulated by an age-specific vector (see details below). This idea is inspired by recent works on style transfer [dumoulin17, huang17] which show the possibility to represent different styles using the parameters of normalization layers.

  • Encoder  The face image $x_0$ is the input of the encoder $E$. The output features are denoted by $C = E(x_0) \in \mathbb{R}^{n \times c}$, where $c$ is the number of channels and $n$ is the product of the two spatial dimensions.

  • Feature modulation for age selection  The target age $\alpha_1$ is encoded as a one-hot vector, denoted by $z_1$, and passed to the modulating network. This network consists of a single fully connected layer with a sigmoid activation. It outputs a modulation vector $w \in (0, 1)^c$, which is used to re-weight the features before passing them into the decoder and obtaining the face image at the desired age. The modulated features are $C \, \mathrm{diag}(w)$, where $\mathrm{diag}(w)$ is the diagonal matrix with diagonal $w$.

  • Decoder  The decoder takes as input the modulated features and two skip connections, which are used to preserve the finer details of the input image. The final output is denoted by $G(x_0, \alpha_1)$.
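The feature modulation step described in the list above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the toy sizes, the weight values and the function names `modulation_vector` and `modulate` are assumptions chosen for readability. Only the structure (one-hot age, a single fully connected layer, a sigmoid, then channel-wise re-weighting) follows the text.

```python
import math

def modulation_vector(age, ages, weight, bias):
    """One-hot encode `age`, apply one fully connected layer, then a sigmoid."""
    one_hot = [1.0 if a == age else 0.0 for a in ages]
    logits = [
        sum(w_ij * z for w_ij, z in zip(row, one_hot)) + b
        for row, b in zip(weight, bias)
    ]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]

def modulate(features, w):
    """Scale channel j at every spatial position by w[j], i.e. C @ diag(w)."""
    return [[value * w_j for value, w_j in zip(position, w)] for position in features]

# Toy sizes (assumptions): 3 candidate ages, c = 2 channels, n = 2 positions.
ages = [20, 21, 22]
weight = [[0.0, 1.0, -1.0],   # row j holds the weights of output channel j
          [2.0, 0.0, 0.5]]
bias = [0.0, 0.0]
features = [[1.0, 2.0],       # one row per spatial position, one column per channel
            [3.0, 4.0]]

w = modulation_vector(21, ages, weight, bias)
modulated = modulate(features, w)
```

Because the sigmoid keeps every entry of `w` strictly between 0 and 1, the modulation can only attenuate channels, never amplify them or flip their sign.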

3.3 Training

As illustrated in Figure 2, we train our age transformer with an age classifier that ensures age-accurate transformation and a discriminator that preserves photorealism.

The initial age $\alpha_0$ of $x_0$ is easy to estimate using a pretrained age classifier, e.g., [rothe2015dex]. We thus do not use an age-annotated dataset for training. The original age range of the training dataset is denoted by $\{\alpha_{\min}, \dots, \alpha_{\max}\}$. At test time, the target age $\alpha_1$ can be chosen as any age in $\{\alpha_{\min}, \dots, \alpha_{\max}\}$. At training time, it would seem reasonable to choose any value in $\{\alpha_{\min}, \dots, \alpha_{\max}\}$ uniformly at random. However, we noticed that the artifacts appearing during large age transformations were better corrected when selecting a target age far enough from $\alpha_0$ during training. We propose to sample $\alpha_1$ from the set $\{\alpha \in \{\alpha_{\min}, \dots, \alpha_{\max}\} : |\alpha - \alpha_0| \geq \alpha^*\}$ at training time, where $\alpha^*$ is a predefined constant representing the minimum age transformation interval. We denote by $q(\alpha \mid \alpha_0)$ the uniform distribution over this set.
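The sampling rule for the training-time target age can be sketched as follows. The age range 20 to 69 is taken from the caption of Figure 1; the minimum gap of 25 years and the function names are illustrative assumptions, not values reported here.

```python
import random

def candidate_target_ages(source_age, age_min, age_max, min_gap):
    """All ages in [age_min, age_max] at least `min_gap` years from `source_age`."""
    return [a for a in range(age_min, age_max + 1) if abs(a - source_age) >= min_gap]

def sample_target_age(source_age, age_min, age_max, min_gap, rng=random):
    """Draw a target age uniformly from the admissible set."""
    candidates = candidate_target_ages(source_age, age_min, age_max, min_gap)
    if not candidates:
        raise ValueError("no target age satisfies the minimum gap")
    return rng.choice(candidates)

# Hypothetical setting: source age 30, range {20, ..., 69}, minimum gap 25.
candidates = candidate_target_ages(30, 20, 69, 25)
target = sample_target_age(30, 20, 69, 25, rng=random.Random(0))
```

For a source age of 30 the admissible set is 55 to 69, so only aging is sampled; a source age near the middle of the range leaves very few candidates, which is one reason the gap cannot be chosen too large.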

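Classification loss  To measure the age of $G(x_0, \alpha_1)$, we use the same age classifier as the one used to estimate $\alpha_0$. During training, we freeze the weights of this classifier. The classifier, denoted by $V$, takes $G(x_0, \alpha_1)$ as input and generates a discrete probability distribution over the set of ages. The classification loss satisfies

$$\mathcal{L}_{class}(G) = \mathbb{E}_{x_0 \sim p(x),\, \alpha_1 \sim q(\alpha \mid \alpha_0)} \big[ \ell(z_1, V(G(x_0, \alpha_1))) \big] \quad (1)$$

where $p(x)$ denotes the training image distribution, $q(\alpha \mid \alpha_0)$ denotes the distribution from which the target age $\alpha_1$ is sampled, $\ell$ denotes the categorical cross-entropy loss, and $z_1$ is the one-hot vector encoding $\alpha_1$.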
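As a minimal illustration of the classification term, the categorical cross-entropy between a one-hot target age and a predicted age distribution can be computed as below. The three-age label set and the probabilities are made-up values; in the paper the distribution comes from the frozen pretrained age classifier.

```python
import math

def one_hot(age, ages):
    """One-hot encoding of `age` over the ordered list `ages`."""
    return [1.0 if a == age else 0.0 for a in ages]

def cross_entropy(target, probs, eps=1e-12):
    """Categorical cross-entropy; `eps` guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

ages = [20, 21, 22]                 # toy label set (assumption)
predicted = [0.1, 0.7, 0.2]         # hypothetical classifier output
loss_match = cross_entropy(one_hot(21, ages), predicted)
loss_miss = cross_entropy(one_hot(22, ages), predicted)
```

With a one-hot target the loss reduces to the negative log-probability assigned to the target age, so it is smaller when the classifier agrees with the requested age.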

Adversarial loss  To enforce better photorealism of the modified images $G(x_0, \alpha_1)$, we adopt an adversarial loss built using PatchGAN [pix2pix2016] with the LSGAN objective [mao2017least]. Unlike the latest works on face aging [he2019s2gan, Liu_2019_CVPR, song2018dual, wang2018face, zhang2017age], our discriminator is used to distinguish between real and manipulated images without taking the age information into account. In our work, the aging and de-aging effects are obtained solely with the age classification loss.

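The discriminator is denoted by $D$. The architecture of $D$ is the same as proposed in [pix2pix2016]. We use a patch size for images. The modified image $G(x_0, \alpha_1)$ should be indistinguishable from real samples. Therefore, the losses we use are

$$\mathcal{L}_{GAN}(G) = \mathbb{E}_{x_0 \sim p(x),\, \alpha_1 \sim q(\alpha \mid \alpha_0)} \big[ (D(G(x_0, \alpha_1)) - 1)^2 \big] \quad (2)$$

when training $G$, and

$$\mathcal{L}_{GAN}(D) = \mathbb{E}_{x_0 \sim p(x),\, \alpha_1 \sim q(\alpha \mid \alpha_0)} \big[ D(G(x_0, \alpha_1))^2 \big] + \mathbb{E}_{x_0 \sim p(x)} \big[ (D(x_0) - 1)^2 \big] \quad (3)$$

when training $D$. We apply regularization [Mescheder2018ICML] with on the discriminator.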
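The least-squares adversarial terms can be sketched on plain lists of patch scores. This sketch assumes the common LSGAN target coding (1 for real, 0 for manipulated), which may differ from the exact constants used by the authors; the score values are invented for illustration.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def lsgan_generator_loss(fake_scores):
    """Push discriminator scores on manipulated images towards 1."""
    return mean((s - 1.0) ** 2 for s in fake_scores)

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push scores on real images towards 1 and on manipulated images towards 0."""
    return mean((s - 1.0) ** 2 for s in real_scores) + mean(s ** 2 for s in fake_scores)

real_scores = [0.9, 1.1]   # hypothetical per-patch outputs on real images
fake_scores = [0.2, 0.0]   # hypothetical per-patch outputs on edited images
g_loss = lsgan_generator_loss(fake_scores)
d_loss = lsgan_discriminator_loss(real_scores, fake_scores)
```

Note that neither function takes an age as input: in this setting the discriminator only judges realism, and the age signal comes from the classifier.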

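Reconstruction loss  When the age transformer receives $x_0$ and $\alpha_0$ as inputs, the generated output image should be identical to the input image. Hence, we minimize the following reconstruction loss:

$$\mathcal{L}_{recon}(G) = \mathbb{E}_{x_0 \sim p(x)} \big[ \| G(x_0, \alpha_0) - x_0 \|_1 \big] \quad (4)$$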

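Full loss  We train the age transformer and the discriminator by minimizing the full objective

$$\mathcal{L}(G, D) = \lambda_{recon} \mathcal{L}_{recon}(G) + \lambda_{class} \mathcal{L}_{class}(G) + \mathcal{L}_{GAN}(G, D) \quad (5)$$

where $\lambda_{recon}$ and $\lambda_{class}$ are weights balancing the influence of each loss, and $\mathcal{L}_{GAN}(G, D)$ stands for the adversarial term, given by (2) when updating the age transformer and by (3) when updating the discriminator.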
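The way the three terms combine for one update of the age transformer can be sketched as follows. The use of a mean absolute difference for the reconstruction term, the weight values and all the numbers are assumptions made for illustration; they are not the settings reported in the paper.

```python
def l1_distance(image_a, image_b):
    """Mean absolute difference between two flattened images (assumed norm)."""
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)

def generator_objective(recon, classification, adversarial, lambda_recon, lambda_class):
    """Weighted sum of the reconstruction, classification and adversarial terms."""
    return lambda_recon * recon + lambda_class * classification + adversarial

# Hypothetical flattened images and loss values.
input_image = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.0, 0.25, 1.0, 0.75]
recon = l1_distance(reconstructed, input_image)
total = generator_objective(recon, classification=2.0, adversarial=0.5,
                            lambda_recon=10.0, lambda_class=0.1)
```

The adversarial term enters with unit weight, so the two weights set how strongly identity preservation and age accuracy are traded off against photorealism.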

4 Experiments

In this section, we introduce our training setup and present the experimental results. We further evaluate the quality of our results using quantitative metrics.

4.1 Data augmentation with synthetic images

Our training dataset is built upon FFHQ [karras2019style], a high resolution dataset which contains 70,000 face images at 1024×1024 resolution. The dataset includes large variations in age, ethnicity, pose, lighting, and image background. However, the dataset contains only unlabeled raw images collected from Flickr.

To obtain the age information, we use an age classifier pretrained on IMDB-WIKI [rothe2015dex]. We observe that FFHQ contains many more samples of young faces than of old ones. This data imbalance is challenging since the aging and de-aging tasks would not be treated equally during training: since most faces are young, the age transformer would be trained to perform aging much more often than de-aging, and would fail to yield satisfying de-aging results. To compensate for this imbalance in the age distribution, we propose to perform data augmentation using StyleGAN, a state-of-the-art high resolution image generation model [karras2019style]. We use the StyleGAN model pretrained on FFHQ to generate synthetic images. A quick visual inspection shows that most of the generated images have no significant artifacts and are nearly indistinguishable from real images by a human. Therefore, we use them for data augmentation to obtain a quasi-uniform age distribution over : for any age bin with fewer than samples in the original FFHQ dataset, we complete this bin with some of the generated synthetic face images; for any age bin with more than samples, we randomly select face images from the original FFHQ dataset. The age-equalized dataset contains images over the range .
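The age-equalization rule can be sketched as a small function over per-age lists of image identifiers. The function name, the dictionary layout and the toy bin size of 2 are assumptions for illustration; the actual bin size used by the authors is not recoverable from the text above.

```python
import random

def equalize_age_bins(real_by_age, synthetic_by_age, per_bin, rng):
    """Subsample over-full bins and top up under-full bins with synthetic images."""
    balanced = {}
    for age, real in real_by_age.items():
        if len(real) >= per_bin:
            # Enough real images: keep a random subset.
            balanced[age] = rng.sample(real, per_bin)
        else:
            # Too few real images: complete the bin with synthetic ones.
            missing = per_bin - len(real)
            balanced[age] = list(real) + synthetic_by_age.get(age, [])[:missing]
    return balanced

# Hypothetical identifiers: many young faces, few old ones.
real_by_age = {25: ["r0", "r1", "r2", "r3"], 65: ["r4"]}
synthetic_by_age = {65: ["s0", "s1", "s2"]}
balanced = equalize_age_bins(real_by_age, synthetic_by_age, per_bin=2,
                             rng=random.Random(0))
```

A bin can still end up short if there are not enough synthetic images of that age, which is consistent with the distribution being described as quasi-uniform rather than exactly uniform.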

4.2 Implementation details

Our model is implemented in PyTorch [paszke2017automatic]. We take of the equalized dataset as our training set and the rest as test set. For the age transformer and the discriminator, spectral normalisation [miyato2018spectral] is applied on all the convolution layers except the last one of the age transformer. All the activation layers use Leaky ReLU [maas2013rectifier] with a negative slope of .

We consider age transformation only in the age range . The constant is set to . We have observed that the most significant artifacts appear when the gap between the source and target age is large. By choosing large enough, we force the discriminator to suppress these artifacts during adversarial training. The weights and are set to and , respectively. We use Adam optimizer with a learning rate of . The age transformer is updated once after each discriminator update. Our model is trained for epochs to achieve face age editing on high resolution images. The first epochs are trained on images with a batch size of . The next epochs are trained on images, for which we reduce the batch size to , learning rate to and to .

4.3 Qualitative evaluation

Figure 3: Age editing results on FFHQ images [karras2019style]. On each row, the yellow frame indicates the original image. Each column corresponds to a target age of 25, 35, 45, 55 and 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Figure 4: Continuous face age editing results on FFHQ [karras2019style]. As can be observed, the difference between two adjacent results is nearly invisible, which demonstrates the smoothness of the aging process.

Figure 9 presents age editing results on input images in different age groups. Our approach yields visually satisfying results with sharp details (best viewed when zooming in on the results) and without introducing significant artifacts. Only the age relevant facial features are modified, while the identity, haircut, emotion and background are well preserved. This is all the more satisfying given that no mask has been used to isolate the face from the rest of the image. Figure 4 presents age editing results with a smooth evolution of the target age. The difference between two adjacent results is nearly invisible, which illustrates the smoothness of the aging process.

Figure 5: Comparison with IPCGAN [wang2018face] and PAGGAN [yang2018learning] on CACD [chen2014cross]. (a) Comparison with IPCGAN; columns: input, 31-40, 41-50, 51+. (b) Comparison with PAGGAN; columns alternate between input and 51+. In each subfigure, the top row shows the aging results of the competing method (IPCGAN in (a), PAGGAN in (b)) and the second row shows the images generated by our method.

We compare our method to the two most recent state-of-the-art methods on face aging for which the official code has been released: IPCGAN [wang2018face] and PAGGAN [yang2018learning]. We also compare our results to those obtained with FaderNet [lample2017fader], which allows one to manipulate several facial attributes including the age.

Figure 5 presents the face aging results of IPCGAN, PAGGAN and our method on CACD [chen2014cross]. The output size of each method is: for IPCGAN, for PAGGAN, for our method. IPCGAN generates satisfying aging results and preserves the identity of input images well. However, as can be seen, e.g., in Figure 5(a), row 1, column 4, the generated image presents noticeable artifacts. PAGGAN generates impressive aging effects but also introduces colored artifacts, as shown in Figure 5(b), row 1, column 2. IPCGAN and PAGGAN both degrade the quality of input images. Our method is able to generate consistent aging effects and preserves the fine details of the input images well.

Figure 6: Comparison of face aging results on CelebA-HQ [karras2018progressive]. The first column shows the input images. The second to fifth columns show the outputs of Fader Network [lample2017fader], PAGGAN [yang2018learning], IPCGAN [wang2018face] and our method. Our results reach the highest resolution without introducing significant artifacts. Our method preserves the background better than the other techniques; see for instance the letters on the third row. In addition, unlike the other techniques, our method leads to a result without artifacts or blur.

Generalisation capacity for images in unseen dataset  For fair comparison and also to reduce the possible effect of overfitting on the training data, we evaluate all methods on a dataset not seen at training time by any of the methods. We chose CelebA-HQ [karras2018progressive], a high resolution version of the CelebA dataset. The input images are at resolution, and are further downsampled to the resolution at which each method was trained using their official code. The output size of each method is: for PAGGAN, for IPCGAN, for FaderNet, and for our method. We compare only the face aging results from the young age group to the old age group, since PAGGAN and IPCGAN are trained only for aging. Figure 6 shows the results obtained with the different methods. FaderNet [lample2017fader] introduces few modifications. PAGGAN [yang2018learning] generates satisfying age progression effects. However, noticeable artifacts are present on the face edges and hair. IPCGAN [wang2018face] is limited to low resolution and thus introduces a strong degradation of the image quality. In comparison to these results, our approach introduces far fewer artifacts and better preserves the fine details of the face and the background.

4.4 Quantitative evaluation

Method                        Predicted Age    Blur   Gender            Smiling           Emotion Preservation (%)
                                                      Preservation (%)  Preservation (%)  Neutral   Happiness
FaderNet [lample2017fader]    44.34 ± 11.40    9.15   97.60             95.20             90.60     92.40
PAGGAN [yang2018learning]     49.07 ± 11.22    3.68   95.10             93.10             90.20     91.70
IPCGAN [wang2018face]         49.72 ± 10.95    9.73   96.70             93.60             89.50     91.10
Ours                          54.77 ± 8.40     2.15   97.10             96.30             91.30     92.70

Table 1: Quantitative evaluation using the online face recognition API [megvii2013face++]. We compare our method against three methods: Fader Network [lample2017fader], PAGGAN [yang2018learning] and IPCGAN [wang2018face]. Images are transferred to the oldest age group for all the methods. The second column presents the average predicted age. The third column indicates the blurriness of the results (lower value means less blurry). The fourth column is the gender preservation rate, i.e., the percentage of images for which the original gender is preserved. The fifth column refers to expression preservation, measured as the smiling preservation rate. The last two columns indicate the emotion preservation rate.

Quantitative evaluation of image-to-image translation tasks is still an open question and there is no universal metric to measure photorealism or quantify artifacts in an image. The recent works [he2019s2gan, Liu_2019_CVPR, yang2018learning] on face aging use an online face recognition API to estimate the age and the identity preservation accuracy of the modified images. We thus employ a similar evaluation process.

In our evaluation, the first 1000 images with a true “Young” label in the CelebA-HQ dataset are extracted as test images. Using this test set, we make a quantitative comparison with FaderNet [lample2017fader], IPCGAN [wang2018face] and PAGGAN [yang2018learning]. Each image is transferred to the oldest age group using the officially released models. For IPCGAN and PAGGAN, the oldest age groups refer to and respectively. For FaderNet, the old attribute is set to the default largest value for aging in their official code. To have a fair comparison with groupwise methods, and since is considered as the oldest age group, we choose a target age of (the mean of the age range ) for our age transformer.

Thus we get 1000 modified images for each method. We further evaluate these output images using the online face recognition API of Face++ [megvii2013face++]. From the detect API, we obtain the following metrics of interest: age, gender, blurriness (whether the face is blurry or not; larger values mean blurrier), smiling and emotion estimation. The emotion estimation covers a series of emotions: sadness, neutral, disgust, anger, surprise, fear and happiness. A preliminary analysis of the results shows that of the input images are classified as neutral or happiness. Thus we keep only these two terms for the emotion preservation comparison. We have also compared the identity preservation rate using the API to compare the modified images with the original inputs. However, since all methods achieve a nearly accuracy, this metric is not reported here.

Table 1 shows the quantitative evaluation results. All the methods are given the oldest age group as aging target, and we notice that our method has the highest average predicted age. The gender preservation rate is calculated by comparing the estimated gender with the original CelebA annotations. Using this metric, FaderNet achieves the best performance, followed by our method. For expression preservation (smiling) and emotion preservation (neutral, happiness), our approach yields the best results. It is to be noted however that all methods have similar results. For the blur evaluation, results are much more contrasted. Our method performs much better in generating sharper images, which is in agreement with the visual comparisons.
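A preservation rate of the kind reported in Table 1 can be computed as the percentage of images whose attribute is unchanged after editing. The sketch below uses made-up labels; the function name is hypothetical, and the real evaluation obtains the predicted labels from the Face++ API.

```python
def preservation_rate(original_labels, predicted_labels):
    """Percentage of images whose attribute label is unchanged after editing."""
    if len(original_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(o == p for o, p in zip(original_labels, predicted_labels))
    return 100.0 * matches / len(original_labels)

# Hypothetical gender labels before and after aging four images.
original = ["female", "male", "male", "female"]
after_edit = ["female", "male", "female", "female"]
rate = preservation_rate(original, after_edit)
```

With 1000 test images, a gap of 0.5 percentage points between two methods corresponds to five images, which helps put the small differences in the preservation columns in perspective.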

4.5 Discussion

Figure 7: Face age editing results with different types of discriminator. Columns: input, target age 25, target age 65. (a) Conditional discriminator. (b) Two separate discriminators. One receives images only from old age groups, the other receives images from young age groups. (c) Our proposed method, using one single discriminator. Compared with the results in (a) and (b), the proposed method (c) generates reliable face aging/de-aging effects with the fewest artifacts.

Ablation study on discriminator  We have explored three different types of discriminators to train the age transformer. Figure 7 presents the face age editing results corresponding to the different settings.

  • Conditional discriminator. We adopt a patch discriminator [pix2pix2016] with a label projection applied on the features before the last convolutional layer, similar to the settings in [miyato2018cgans]. The discriminator is conditioned on four age groups: -, -, -, -. At the training stage we find it essential to give the same number of real and fake images from each class to the discriminator to make the training successful. If we sample a target age from the set at training time, the discriminator will receive more manipulated images in the youngest and oldest group. Thus it tends to classify all the images in these two groups as fake. The conditional discriminator is very sensitive to the original data distribution and needs much more hyper-parameter fine-tuning to converge. Figure 7(a) presents the age editing results with conditional discriminator. Strong artifacts can be observed in the aging results.

  • Two separate discriminators. One discriminator receives manipulated and real images whose age lies in the old age group (-), while the other one takes manipulated and real images in the young age group (-). With this setting, the task of generating aging/de-aging effects is shared among the classifier and the discriminators. Although the results in Figure 7(b) are better than those in Figure 7(a), over-smoothing artifacts are perceived in the de-aging results and colored artifacts appear in the aging results.

  • One single discriminator. This is our proposed method. The discriminator can be considered as a regularizer which imposes photorealism, as it takes all the manipulated and real images as input. The generation of aging/de-aging effects is solely dictated by the age classifier. We are able to achieve high resolution results only with this last setting.

Figure 8: Images reconstructed from a latent code optimization, shown as pairs of input and reconstructed images. We analyze the possibility of encoding natural images to the latent space of StyleGAN [karras2019style], through optimization in the latent space minimizing the distance between the generated image and the input image. Each image is then reconstructed from this optimized latent code. The relatively low quality of the reconstruction strongly suggests that editing performed in the latent space cannot lead to a sharp and artifact-free result.

Image reconstructed from a latent code optimization  As mentioned in Section 2, the recent work of Shen et al. [shen2019interpreting] proposes an effective way to manipulate the latent code of an image generator to achieve high visual quality manipulation of synthesized images. It is therefore tempting to manipulate the latent code directly to produce face manipulation (and thus age editing) on natural images with this approach. However, finding such a latent code for an arbitrary face image is still a challenging problem. According to our experiments using StyleGAN [karras2019style], only a fraction of natural face images can be accurately reconstructed from the latent code obtained with [stylegan-encoder] (the latent code is obtained through optimization in the latent space, by finding a latent code that minimizes the distance between the generated image and the input image). Consequently, this type of method is impractical until a better StyleGAN encoder is made available. Figure 8 supports this claim by showing reconstruction results for natural face images. We notice that the reconstructed images have painting-like artifacts, blurry backgrounds, and sometimes fail to preserve the identity of the person in the input image. Indeed, StyleGAN is much more efficient at sampling random faces from the latent space than at approximating a given face image. This is due to the fact that a GAN is not necessarily invertible. Hence, an editing method based on this latent code reconstruction will struggle to handle natural images correctly and to achieve the high visual quality of our method.
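The latent code optimization discussed above can be illustrated on a deliberately tiny example. The linear `generate` function below is a toy stand-in, not StyleGAN, and the step size and iteration count are arbitrary assumptions. The point is only that gradient descent recovers an image exactly when it lies in the range of the generator, and leaves a residual otherwise.

```python
def generate(z):
    """Toy generator (assumption): maps a 2-D latent code to a 3-pixel 'image'."""
    return [z[0] + z[1], z[0] - z[1], 2.0 * z[0]]

def squared_distance(a, b):
    """Sum of squared pixel differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def invert(target, steps=500, lr=0.05):
    """Gradient descent on the latent code to minimise the image distance."""
    z = [0.0, 0.0]
    for _ in range(steps):
        r = [g - t for g, t in zip(generate(z), target)]
        # Gradient of the squared distance: 2 * J^T r, with J the Jacobian of generate.
        grad = [2.0 * (r[0] + r[1] + 2.0 * r[2]), 2.0 * (r[0] - r[1])]
        z = [z_i - lr * g_i for z_i, g_i in zip(z, grad)]
    return z

in_range = generate([1.0, 0.5])      # an image the generator can produce
out_of_range = [1.5, 0.5, 3.0]       # an image outside the generator's range
err_in = squared_distance(generate(invert(in_range)), in_range)
err_out = squared_distance(generate(invert(out_of_range)), out_of_range)
```

Here `err_in` converges to zero while `err_out` settles at a strictly positive value, which mirrors, in a very simplified way, why natural photographs that the generator cannot produce are only approximately reconstructed.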

Weakly supervised training  To the best of our knowledge, among recent face aging studies [he2019s2gan, Liu_2019_CVPR, song2018dual, wang2018face, zhang2017age], ours is the first to train on unlabeled data. A classifier pretrained on IMDB-WIKI [rothe2015dex], a low resolution face dataset, is used to provide age information. Moreover, the discriminator in our method is used only to distinguish real images from manipulated ones. Relying solely on the classifier, we successfully extract age-specific features and perform age transformation on high resolution images. This reveals the capacity of the classifier, even when trained on low quality images. Our method could potentially be generalized to other face attribute manipulation tasks by using a separate pair of modulating network and classifier for each attribute.
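As a rough illustration of how a pretrained classifier can supply age labels for unlabeled images, the sketch below turns per-age class scores into a single age estimate by taking the expectation of the softmax distribution, in the spirit of the DEX classifier [rothe2015dex]. The logits are made up for the example, the age bins 0 to 100 are an assumption, and `expected_age` is a hypothetical helper, not part of any released code.

```python
import math

def expected_age(logits, ages):
    """Softmax over per-age logits, then the expected value of the age."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return sum(a * e / total for a, e in zip(ages, exps))

ages = list(range(0, 101))                     # assumed age bins: 0, 1, ..., 100
# Made-up classifier output, peaked around 30 years.
logits = [-((a - 30) ** 2) / 50.0 for a in ages]

pseudo_label = expected_age(logits, ages)
print(round(pseudo_label, 2))
```

An estimate obtained this way can then serve as the age label of a training image, which is what makes the training weakly supervised.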

5 Conclusion

In this paper, we have proposed an age transformer architecture, enabling continuous face age editing with a single network, which we have endeavoured to keep as simple as possible. We believe that this approach, combined with an encoder-decoder architecture, rather than relying on a complex GAN, is the best path towards high quality, high resolution face editing results. We have demonstrated the capacity of our model to produce photorealistic and sharp results, without introducing significant artifacts, on high resolution images. The proposed feature modulation block appears to achieve efficient separation of age and identity information. Given the performance achieved, this design could potentially be useful for other face attribute manipulation tasks.

References

Appendix 0.A Network architecture

Table 2 presents the hyperparameters of the proposed network architecture. The discriminator is a patch discriminator: each element of its output feature map corresponds to a fixed-size receptive field on the original input image.
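The receptive field of a patch discriminator follows directly from the kernel sizes and strides of its convolutions. The sketch below applies the standard recurrence (each layer enlarges the receptive field by `(kernel - 1)` times the product of the preceding strides). The layer list is an assumption borrowed from the well-known pix2pix-style patch discriminator, not the configuration of this paper, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output element.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent features
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Assumed example: five 4x4 convolutions with strides 2, 2, 2, 1, 1.
example_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(example_layers))
```

Plugging in the kernel sizes and strides of the discriminator from Table 2 gives the patch size that each output element judges.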

Appendix 0.B Age classifier

To obtain age information for the FFHQ dataset [karras2019style], we use the age classifier of [rothe2015dex], which has been pretrained on IMDB-WIKI. This dataset contains face images of celebrities collected from the IMDB and Wikipedia websites. It mostly covers the middle of the age range and has very few samples for the youngest and oldest age intervals. Consequently, the age classifier might yield less accurate age estimates for very young or very old faces. We therefore restrict training to images whose estimated age lies within a limited age range. Passing the FFHQ images through the age classifier, we observe that FFHQ contains many more young faces than old ones. We then augment the dataset with synthetic images generated by StyleGAN [karras2019style] to achieve a quasi-uniform age distribution over this age range, as described in the main paper.
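One simple way to plan such an augmentation is to histogram the estimated ages and count how many synthetic images each age would need to reach the level of the most populated one. The sketch below does exactly that. The age values and the range 20 to 23 are toy assumptions chosen for readability, and `synthetic_needed` is a hypothetical helper; the actual selection procedure may differ.

```python
from collections import Counter

def synthetic_needed(estimated_ages, age_min, age_max):
    """Per-age number of extra images needed for a uniform age histogram."""
    # Keep only images whose estimated age falls inside the training range.
    counts = Counter(a for a in estimated_ages if age_min <= a <= age_max)
    target = max(counts.values(), default=0)   # level of the fullest age bin
    return {age: target - counts.get(age, 0)
            for age in range(age_min, age_max + 1)}

# Toy classifier outputs: skewed towards young faces, two out of range.
estimated = [20, 20, 20, 21, 22, 22, 19, 30]
plan = synthetic_needed(estimated, age_min=20, age_max=23)
print(plan)
```

Synthetic images whose estimated age falls in an under-represented bin would then be added until each bin's deficit is filled.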

Appendix 0.C Additional results

In this section, we present supplementary results on images.

0.C.1 Results on FFHQ dataset

More age transformation results on images of the FFHQ dataset are presented in Figures 9 and 10.

0.C.2 Comparison with other methods

In Figure 13, we show an additional comparison of face aging results on CelebA-HQ [karras2018progressive]. As mentioned in the paper, we compare our method against the two most recent state-of-the-art face aging methods for which official code has been released: PAGGAN [yang2018learning] and IPCGAN [wang2018face]. We also compare our results to those obtained with Fader Network [lample2017fader], which can manipulate several facial attributes, including age. Each input image is transformed to the oldest age group using the officially released models. For IPCGAN and PAGGAN, this is the oldest age group defined by each method. For Fader Network, the age attribute is set to the default largest aging value in the official code. For a fair comparison with these group-wise methods, we set the target age of our age transformer to the mean of the age range covered by the oldest age group.

Columns: Operation, Kernel size, Stride, Channel

Age transformer
  Encoder
    Convolution
    Convolution
    Skip connection 1
    Convolution
    Skip connection 2
    Residual block
    Residual block
    Residual block
    Residual block
  Modulation layer
  Decoder
    Concatenation with skip connection 2
    Upsampling
    Convolution
    Concatenation with skip connection 1
    Upsampling
    Convolution
    Convolution
Discriminator
  Convolution
  Convolution
  Convolution
  Convolution
  Convolution
  Convolution

Upsampling mode: Nearest
Padding mode: Reflection
Normalization: InstanceNorm for age transformer, BatchNorm for discriminator
Activation: LeakyReLU

Table 2: Hyperparameters of the proposed network architecture. For the age transformer, except the last one, each convolution is followed by an instance normalization and a LeakyReLU activation. For the discriminator, except the first and the last one, each convolution is followed by a batch normalization and a LeakyReLU activation.
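The operations that follow each convolution of the age transformer (instance normalization, then LeakyReLU) and the nearest-neighbour upsampling of the decoder can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the negative slope of 0.2 and the scale factor of 2 are assumed values, the learnable affine parameters that normalization layers often carry are omitted, and the function names are hypothetical.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalise each channel of a (channels, height, width) array on its own."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, negative_slope=0.2):       # slope is an assumed value
    """Identity for non-negative inputs, scaled by negative_slope otherwise."""
    return np.where(x >= 0, x, negative_slope * x)

def upsample_nearest(x, scale=2):            # scale factor is an assumed value
    """Nearest-neighbour upsampling: repeat every pixel along both spatial axes."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(3, 4, 4))  # stand-in conv output

activated = leaky_relu(instance_norm(features))
upsampled = upsample_nearest(activated)
print(activated.shape, upsampled.shape)
```

Because instance normalization computes its statistics per image and per channel, it behaves identically at training and test time, unlike the batch normalization used in the discriminator.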

Figure 9: Age transformation on images. On each row, the yellow frame indicates the original image. The columns correspond to target ages of 25, 35, 45, 55 and 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Figure 10: Age transformation on images. On each row, the yellow frame indicates the original image. The columns correspond to target ages of 25, 35, 45, 55 and 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Figure 11: Age transformation on images. On each row, the yellow frame indicates the original image. The columns correspond to target ages of 25, 35, 45, 55 and 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Figure 12: Age transformation on images. On each row, the yellow frame indicates the original image. The columns correspond to target ages of 25, 35, 45, 55 and 65. Our approach yields visually satisfying results without introducing significant artifacts. Only age relevant features are modified, while the identity, haircut, emotion and background are perfectly preserved.
Figure 13: Comparison of face aging results on CelebA-HQ [karras2018progressive]. The first column shows the input images. The second to fifth columns show the outputs of Fader Network [lample2017fader], PAGGAN [yang2018learning], IPCGAN [wang2018face] and our method. Our results reach the highest resolution without introducing significant artifacts. Our method preserves the background better than the other techniques; see for instance the letters on the third row. In addition, compared to the other techniques, our method leads to results that are free of artifacts and blur.
Figure 14: Comparison of face aging results on CelebA-HQ [karras2018progressive]. The first column shows the input images. The second to fifth columns show the outputs of Fader Network [lample2017fader], PAGGAN [yang2018learning], IPCGAN [wang2018face] and our method. Our results reach the highest resolution without introducing significant artifacts. Our method preserves the background better than the other techniques; see for instance the letters on the third row. In addition, compared to the other techniques, our method leads to results that are free of artifacts and blur.