Personalized and Occupational-aware Age Progression by Generative Adversarial Networks

11/26/2017 ∙ by Siyu Zhou, et al. ∙ 0

Face age progression, which aims to predict a person's future looks, is important for various applications and has received considerable attention. Existing methods and datasets rarely explore the effects of occupation, which may influence personal appearance. In this paper, we first introduce an occupational face aging dataset for studying the influence of occupation on appearance. It includes five occupations, which enables the development of new algorithms for age progression and facilitates future research. Second, we propose a new occupational-aware adversarial face aging network, which learns the human aging process under different occupations. Two factors are taken into consideration in our aging process: personality preservation and visually plausible texture change for different occupations. We propose a personalized network with a personalized loss in a deep autoencoder network for keeping personalized facial characteristics, and an occupational-aware adversarial network with an occupational-aware adversarial loss for obtaining more realistic texture changes. Experimental results demonstrate the advantages of the proposed method in comparison with other state-of-the-art age progression methods.

1 Introduction

Face age progression [27, 5], also called face aging, aims to predict the future looks of a person. It is a key technique for a variety of applications, including searching for missing persons and cross-age face analysis [23]. Recently, many research efforts have been devoted to generating realistic aged faces. These can be roughly divided into two categories: physical model-based age progression [29, 30] and prototype-based age progression [1, 13]. The physical model-based methods model the facial patterns and physical mechanisms of aging, while the prototype-based methods transfer the differences between prototypes (e.g., the average face of each age group) onto individual faces. Deep learning methods have also been applied to face age progression owing to their powerful feature representations. Wang et al. [34] proposed a recurrent face aging framework, and Zhang et al. [35] proposed a conditional adversarial autoencoder (CAAE) framework for age progression.

However, existing works always generate only one future look for a person and ignore that the future look of a person may change with occupation. For example, suppose a 20-year-old man chooses to become either an actor or a farmer. When he is 50 years old, his look as a farmer and his look as an actor may differ, even though he is the same person. Different occupations may lead to different appearances [18], as shown in Figure 1. In this paper, we explore the impact of occupation on age progression. Note that, across occupations, the most perceptible difference is skin texture. Hence, we only focus on skin aging in this work. (The face aging process can be divided into two stages [30, 5]: child growth and adult aging. Shape change is the most prominent factor during child growth, while the greatest change during adult aging is skin aging, i.e., texture change. We only focus on adult aging.)

We first introduce the occupational face aging dataset (OFAD). Figure 1 shows some example faces. OFAD is a comprehensively annotated dataset that contains five kinds of occupations: actor, singer, doctor, teacher and farmer. Each occupation includes two age ranges, i.e., 30-50 and 50-80. To the best of our knowledge, it is the first face aging dataset with occupational information. We believe this dataset will benefit research on face age progression under different occupations.

Figure 1: Examples of OFAD. According to the occupational information, we collect images in two age groups (30-50, 50-80). The occupations from left to right are actor, singer, doctor, teacher and farmer. From left to right, the facial wrinkles and skin color become progressively deeper. This contrast is more obvious in the old age group, because the face has been affected by the working environment for longer.

We further present realistic image generation for age progression under different occupations via the proposed occupational-aware adversarial face aging network, which is referred to as OAFA. Unlike previous approaches, which produce only one output for each age group, our OAFA can generate several outputs for different occupations.

OAFA has three major components: 1) the generator/encoder aims to generate different future looks of a young face for different occupations, 2) the decoder maps the future looks back to the young face, and 3) the discriminator encourages the generator/encoder to generate high-quality images of different occupations. These three components make up two networks: the personalized network and the occupational-aware adversarial network. The personalized network is an autoencoder network formed by the generator/encoder and the decoder. We propose a personalized loss that requires the original face to be recoverable from the future looks, which keeps personalized facial characteristics. The occupational-aware adversarial network consists of the generator/encoder and the discriminator. The proposed occupational loss includes two terms, a conditional adversarial loss [6, 20] and a triplet rank loss [15, 21], which aim to obtain visually plausible texture changes, i.e., skin aging, for different occupations.

Our contributions can be summarized as follows:

  • We introduce an occupational face aging dataset which includes several occupations. This helps to explore the effects of occupation in the face age progression problem.

  • We propose an occupational-aware adversarial face aging network that generates multiple outputs for the occupational-aware age progression problem and can model the personalities and occupational characteristics of persons in the face aging process.

  • The empirical results demonstrate the superiority of the proposed method over state-of-the-art baseline methods: it generates more aging details (textures) and more realistic face images.

2 Related Work

Many face age progression approaches have been proposed to model the dynamic aging process. They can be mainly divided into two categories: physical model-based [5] and prototype-based methods [31]. The physical model-based methods [16, 25, 26, 29, 30] simulate face aging by modelling the aging mechanisms, e.g., skin, muscles, wrinkles, etc., with a parametric model. However, these methods are computationally expensive and require images of each individual over a long age span. Unfortunately, collecting images of the same person over a wide range of ages is very difficult, and few face aging datasets satisfy this requirement.

The prototype-based methods [1, 13] use a non-parametric model. They first divide faces into groups by age and then compute the average face of each age group. The average face is referred to as the prototype, and the differences between prototypes are viewed as the aging pattern. The main problem of prototype-based methods is that they may ignore personalized information, e.g., wrinkles. To preserve the personality, Shu et al. [28] presented an age progression method based on dictionaries. Each group has one dictionary, and two neighbouring groups are linked together for learning the aging pattern. Moreover, a personalized layer is proposed to keep the personalized information.
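The prototype idea described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: faces are flat lists of pixel intensities that are already aligned, and the function names (`mean_face`, `aging_pattern`, `age_face`) are hypothetical, not taken from the cited methods.

```python
# Illustrative sketch of prototype-based aging.
# Assumption: each face is a flat list of aligned pixel intensities.

def mean_face(faces):
    """Average a list of equally sized faces pixel by pixel (the prototype)."""
    n = len(faces)
    return [sum(pixels) / n for pixels in zip(*faces)]

def aging_pattern(young_group, old_group):
    """Difference between the prototypes of two age groups."""
    young_proto = mean_face(young_group)
    old_proto = mean_face(old_group)
    return [o - y for o, y in zip(old_proto, young_proto)]

def age_face(face, pattern):
    """Transfer the prototype difference onto an individual face."""
    return [p + d for p, d in zip(face, pattern)]

young_group = [[0.2, 0.4], [0.4, 0.6]]
old_group = [[0.5, 0.5], [0.7, 0.9]]
pattern = aging_pattern(young_group, old_group)
aged = age_face([0.1, 0.3], pattern)
```

Because the same pattern is added to every face, individual details such as a particular person's wrinkles are not modelled, which is the limitation noted above.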

Deep learning methods have also been proposed for solving the age progression problem. Wang et al. [34] proposed a recurrent face aging framework based on a recurrent neural network, which ages the face gradually and keeps the personalized information by memorizing the previous faces. Zhang et al. [35] presented a conditional adversarial autoencoder network (CAAE) for learning a face manifold. Their method is based on conditional generative adversarial networks (CGAN) [6, 20], which show impressive results in image generation.

However, almost all existing works do not consider that a person's appearance may differ under different occupations. To facilitate research in this direction, we introduce an occupational face aging dataset for exploring the effects of occupation. We also propose a new occupational-aware adversarial face aging network for age progression. The work most similar to ours is CAAE [35]. Both CAAE and our method utilize an autoencoder and CGAN to generate high-quality images. The main difference is that we propose a personalized network and an occupational-aware adversarial network to explicitly pursue the common age pattern across different ages and occupations. As a result, we obtain more aging details, e.g., the wrinkles and blemishes are more obvious. In contrast to our method, CAAE assumes that face images lie on a manifold and uses the autoencoder network to learn that manifold.

Figure 2: OAFA network for age progression. a) is the overview of OAFA. The generator/encoder network predicts the future look of an input face under a given occupational condition. The decoder then maps the generated face back to the input face to preserve personality. The discriminator encourages the generator to produce old faces that are 1) indistinguishable from real old images and 2) distinguishable from generated faces of other occupations. The architectures of the three networks, with the corresponding kernel size (k), feature maps (n), stride (s) and residual block (R), are shown in b), c) and d), respectively.

3 The Occupational Face Aging Dataset

3.1 Data Statistics

We collect a dataset of people in five different occupations for analysing occupational effects, which is referred to as the occupational face aging dataset (OFAD). OFAD consists of over 2,000 diverse face images divided into five occupations (actor, singer, doctor, teacher, and farmer). Each occupation contains over 200 male images and 200 female images of different races, and all images have obvious texture information. The ages in these occupational images range from 30 to 80. We divide them into two groups: the middle age group (30-50) and the old age group (50-80). Some example images are shown in Figure 1. We also collect about 200 images of persons without occupational information, aged 15 to 45, as input for training.

3.2 Image Collection

Image search engines such as Google and Bing are common sources for constructing face aging datasets. In addition to these sources, we also collect face images from two available databases, CACD [2] and FGNET [16].

Collecting Images from Image Search Engines. We download face images from two representative image search engines, Google and Bing, each of which indexes a great number of high-quality face images. In order to collect images with accurate occupational information, we use keywords that combine descriptive words carrying age information with the occupation name. For example, we use “retired doctor” as the keyword to download doctors’ faces for the old age group (50-80).

Collecting Images from CACD. The CACD dataset contains more than 160,000 images of 2,000 celebrities aged 16 to 62. According to CACD, most of the celebrities’ names are crawled from IMDb.com, which is one of the largest online movie databases and contains profiles of millions of movies and celebrities. We therefore take the face images aged 30-50 and 50-62 as the training set for the actor occupation.

Collecting Images from FGNET. To evaluate the performance, we use the faces in FGNET as a test set. FGNET contains 1,002 images of 82 people with ages ranging from 0 to 69, and it provides ground-truth images for evaluation.

4 Our Approach

In this section, we introduce Occupational-aware Adversarial Face Aging network (OAFA), which learns the human aging process under different occupations.

We first introduce some notation. We consider three sets of images: the young persons’ images, a set of middle-aged face images, and a set of elderly face images. The middle-aged and elderly sets carry occupational information. In the following, we only discuss how to generate the looks of elderly people; the generation for the middle-aged group is similar. The elderly set is the union of per-occupation subsets, each of which contains the images of the persons who have a given occupation. The young images and the elderly images of each occupation are each assumed to follow their own data distribution, and there are five occupations in our paper.

As illustrated in Figure 2, our architecture contains three components: the generator/encoder network, the decoder network, and the discriminator network. A young face image passes through several convolutional layers and is encoded into high-level feature maps. These feature maps are then conditioned on a chosen occupation by combining them with the one-hot label of that occupation. Finally, the conditioned feature maps are transformed into a future look for that occupation. Note that by changing only the occupational label, we can generate multiple outputs for different occupations. For ease of presentation, we treat the whole mapping from a young face and an occupational label to the generated face as a single generator function. In addition, an adversarial discriminator aims to distinguish the generated images from the real elderly images, and a decoder function reconstructs the original input from the generated face.

4.1 Loss Function

Our objective contains two types of terms: 1) a personalized loss for keeping human identity information and 2) an occupational-aware adversarial loss for obtaining the skin changes under different occupations.

4.1.1 Personalized Loss

The primary principle of face age progression is to preserve the personality of the input faces. For example, given a young face image and the old face image generated from it for a given occupation, the generated face and the young face should be recognized as the same person.

To achieve this goal, we utilize the autoencoder approach [33, 10]. It includes an encoder and a decoder: the encoder learns a representation of the input data, and the decoder reconstructs the input from that representation. As in CycleGAN [36], we require that the generated face image can be reconstructed back to the original image, i.e., decoding the generated face should return the input young face. With this constraint, the generated face image is one representation of the input young face image, which helps preserve the features of the young image and keep the personality of the young face. Thus, the personalized loss can be formulated as

(1)

The personalized loss limits the space of possible mapping functions, because the generated images must be reconstructible back to the original images. Hence, the generated images cannot be far away from the source domain.
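The reconstruction constraint behind the personalized loss can be illustrated with a minimal sketch. It assumes an L1 (mean absolute) reconstruction error, as in CycleGAN; the exact norm used here is not recoverable from the text. The toy generator and decoder are hypothetical stand-ins for the real networks, and faces are flat lists of pixel values.

```python
# Sketch of a cycle-style personalized (reconstruction) loss.
# Assumption: L1 distance between the young face and its reconstruction.

def l1_distance(a, b):
    """Mean absolute difference between two equally sized faces."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def personalized_loss(young_faces, labels, generator, decoder):
    """Average error of decoding each generated face back to its input."""
    total = 0.0
    for face, label in zip(young_faces, labels):
        aged = generator(face, label)      # young face -> future look
        reconstructed = decoder(aged)      # future look -> young face
        total += l1_distance(reconstructed, face)
    return total / len(young_faces)

# Hypothetical toy networks: the decoder receives no occupation label,
# so it can only undo the label-independent part of the generator.
def toy_generator(face, label):
    return [0.5 * p + 0.1 * label for p in face]

def toy_decoder(aged):
    return [2.0 * p for p in aged]

faces = [[0.2, 0.4], [0.2, 0.4]]
loss_same = personalized_loss(faces, [0, 0], toy_generator, toy_decoder)
loss_mixed = personalized_loss(faces, [0, 1], toy_generator, toy_decoder)
```

In the toy example the loss is zero when the decoder inverts the generator exactly and grows when the generated face drifts from what the decoder can map back, which is how the constraint keeps outputs close to the source face.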

4.1.2 Occupational-aware Adversarial Loss

The second principle is to preserve the common age pattern under different occupations, e.g., the generated face image for farmer should be recognized as a farmer’s face. We propose occupational-aware adversarial loss to address this problem.

(2)

Inspired by the impressive results in image generation of the conditional generative adversarial network (CGAN) [6, 20], we adopt it for our human aging process under different occupations. The objective can be expressed as:

(3)

Here the generator tries to generate images that look similar to real images of the target occupation, while the discriminator tries to distinguish the real occupational images from the generated images. The generator minimizes this objective while the discriminator aims to maximize it. Note that the objective also contains a term that lets the discriminator distinguish a generated face from the generated faces of other, different occupations.
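As a rough illustration of the conditional adversarial objective described above, the sketch below evaluates the usual negative log-likelihood value function from discriminator outputs. The third, "mismatched" term is only one possible reading of the extra term for faces generated under other occupations; its exact form is an assumption, as are the function and argument names.

```python
import math

# Sketch of a conditional GAN value function (log-likelihood form).
# Inputs are discriminator outputs in (0, 1): the probability that an
# (image, occupation label) pair is a real pair.

def adversarial_value(d_real, d_fake, d_mismatched=()):
    """Value the discriminator maximizes and the generator minimizes.

    d_real:       scores for real images with their true occupation label
    d_fake:       scores for generated images with their target label
    d_mismatched: scores for images generated under a different occupation
                  but paired with this label (assumed form of the extra term)
    """
    value = sum(math.log(p) for p in d_real) / len(d_real)
    value += sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    if d_mismatched:
        value += sum(math.log(1.0 - p) for p in d_mismatched) / len(d_mismatched)
    return value

confident = adversarial_value([0.9], [0.1], [0.1])
confused = adversarial_value([0.5], [0.5], [0.5])
```

A discriminator that separates the three cases well obtains a higher value than one that outputs 0.5 everywhere, so the generator is pushed to produce faces that score as real for the intended occupation.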

To further make the generated images of different occupations look different to each other, we add a triplet rank loss [15, 21] that is defined as

(4)

where the threshold is the margin. We explicitly require that the generated image for a given occupation be closer to the real images of that occupation than to the images of other occupations. This makes the generated image of each occupation look similar to its target domain and helps to distinguish the multiple output images from each other.
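The ranking requirement above corresponds to a standard hinge-style triplet loss, sketched below. The squared Euclidean distance, the use of plain feature vectors, and the margin value are all assumptions made for illustration; the text does not specify them.

```python
# Sketch of a triplet rank loss with a margin.
# anchor:   features of the face generated for the target occupation
# positive: features of a real face from the same occupation
# negative: features of a real face from a different occupation

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_rank_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by at least margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, margin + gap)

generated = [0.0, 0.0]
same_occupation = [0.0, 0.1]
other_occupation = [2.0, 0.0]

satisfied = triplet_rank_loss(generated, same_occupation, other_occupation)
violated = triplet_rank_loss(generated, other_occupation, same_occupation)
```

The loss vanishes when the ranking holds with the required margin and is positive otherwise, so minimizing it pulls each generated face toward its own occupation and away from the others.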

4.1.3 Full Objective

Our full objective is

(5)

where three weighting coefficients control the relative importance of the three objectives. The final optimization problem can be formulated as

(6)

4.2 Network Architecture

Generator/Encoder. Our generator network follows the architectures proposed by Johnson et al. [12] and SRGAN [17]. It consists of two parts. The first part is a small network with three convolutional layers, which learns feature maps that facilitate the subsequent image generation. The kernel sizes of these layers are given in Figure 2. Each convolutional layer is followed by one instance-normalization layer [32] and one ReLU layer. We use a stride of two in the last two convolutional layers, so the feature maps output by the first part are spatially smaller than the input image. The occupational information is a one-hot vector with one entry per occupation, e.g., one such vector indicates actor. We expand the one-hot vector into a cube with one channel per occupation, in which the values in the channel of the selected occupation are all one and the other channels are zero. The cube is then concatenated to the output of the first part and used as the condition.
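The conditioning step described above can be sketched as follows. Feature maps are represented as a list of channels, each a height-by-width nested list. The occupation order and the function names are assumptions for illustration; the text only states that the selected channel is all ones and the others are zero.

```python
# Sketch of expanding a one-hot occupation label into a cube and
# concatenating it to the feature maps along the channel axis.

OCCUPATIONS = ["actor", "singer", "doctor", "teacher", "farmer"]  # assumed order

def occupation_cube(index, num_occupations, height, width):
    """One channel per occupation; the selected channel is all ones."""
    return [
        [[1.0 if channel == index else 0.0 for _ in range(width)]
         for _ in range(height)]
        for channel in range(num_occupations)
    ]

def concat_channels(feature_maps, cube):
    """Channel-wise concatenation of feature maps and the condition cube."""
    return feature_maps + cube

height, width = 2, 3
features = [[[0.5] * width for _ in range(height)] for _ in range(4)]  # 4 toy channels
cube = occupation_cube(OCCUPATIONS.index("doctor"), len(OCCUPATIONS), height, width)
conditioned = concat_channels(features, cube)
```

Changing only the index passed to `occupation_cube` yields a different condition for the same feature maps, which is what allows several occupational outputs from one input face.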

Given the high-level feature maps and the occupational label as input, the second part is a residual network [8] that generates the realistic face image. Residual connections are a powerful technique [9] that makes very deep networks easier to train. Following the design of [8, 12], each residual block consists of two convolutional layers with 128 feature maps (kernel sizes as listed in Figure 2), and each convolutional layer is followed by one instance-normalization layer and one ReLU activation function. There are 12 residual blocks in total.

After the residual network, we use bilinear interpolation [7] to upsample the feature maps instead of deconvolutions, since deconvolutions tend to introduce characteristic artifacts [22, 3]. Bilinear interpolation is one of the basic resampling techniques and produces reasonably realistic images. Two bilinear interpolations are used to increase the size of the feature maps, each followed by a convolutional layer, one instance-normalization layer and one ReLU layer. The last convolutional layer (kernel size as listed in Figure 2) is followed by one instance-normalization layer and one Tanh layer.
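For readers unfamiliar with bilinear interpolation, the following sketch resizes a single-channel map by blending the four nearest input values. It uses a corner-aligned sampling grid, which is an assumption; deep learning frameworks offer several conventions and the text does not say which one is used.

```python
# Sketch of bilinear resizing for one channel (nested list of floats).
# Assumption: corner-aligned grid, i.e. the corner pixels map to each other.

def bilinear_resize(image, out_h, out_w):
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Fractional source column for this output column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

small = [[0.0, 1.0], [2.0, 3.0]]
large = bilinear_resize(small, 3, 3)
```

Every output value is a smooth blend of its neighbours, and the learned convolution that follows each upsampling step can then add detail.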

Decoder. The decoder is an inverse generator. The only difference is that we remove the occupational conditions.

Discriminator. Our discriminator adopts PatchGAN [11] as its basic framework. The input of the discriminator consists of an old image and a condition vector. All leaky ReLU (LReLU) layers use a slope of 0.2.

5 Experiments

In this section, we compare the proposed OAFA against several baselines both qualitatively and quantitatively.

5.1 Implementation of OAFA

As in previous works, we normalize the value of each pixel of the input images into the range [-1, 1], because normalized pixel values make training easier and convergence faster. Similarly, when we concatenate the one-hot vector, we also map its values into the range from -1 to 1: a value of 0 in the one-hot vector corresponds to -1 and a value of 1 corresponds to 1. The output of the proposed architecture also lies in [-1, 1] because of the Tanh layer.
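The two rescaling steps can be written down directly. The sketch assumes 8-bit input pixels in the range 0 to 255, which the text does not state; the function names are illustrative.

```python
# Sketch of the input normalization.
# Assumption: pixels are 8-bit intensities in [0, 255].

def normalize_pixels(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def scale_one_hot(one_hot):
    """Map one-hot entries so that 0 becomes -1 and 1 stays 1."""
    return [2.0 * v - 1.0 for v in one_hot]

pixels = normalize_pixels([0, 127.5, 255])
condition = scale_one_hot([1, 0, 0, 0, 0])
```

Keeping inputs, conditions, and the Tanh output on the same scale means the generated image can be compared with real images without any further rescaling.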

In training, the weights of the three loss terms are fixed hyper-parameters. The three networks are updated alternately with a mini-batch size of 1 using the ADAM stochastic gradient solver [14]. After nearly 200 epochs, high-quality results can be obtained. During testing, only the generator is active: given a young face and an occupational condition, it generates the corresponding aged face.

Generally speaking, training GANs is difficult in practice because GAN learning is unstable [24]. The least-squares loss behaves more stably during training [19], so we use it to replace the negative log-likelihood objective in Eq. (3).
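The least-squares replacement can be sketched as below, using the common target coding in which real samples are pushed toward 1 and generated samples toward 0. These targets and the factor of one half follow the usual least-squares GAN formulation and are assumptions about how it is applied here.

```python
# Sketch of least-squares GAN losses computed from raw discriminator scores.
# Assumption: target 1 for real pairs and target 0 for generated pairs.

def lsgan_discriminator_loss(d_real, d_fake):
    """Penalize real scores far from 1 and fake scores far from 0."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)

def lsgan_generator_loss(d_fake):
    """Penalize generated samples whose scores are far from 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)

perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
fooled_d = lsgan_discriminator_loss([1.0], [1.0])
weak_g = lsgan_generator_loss([0.0])
```

Unlike the logarithmic loss, the quadratic penalty stays finite and keeps providing gradient for samples that are far from the decision boundary, which is the usual explanation for its more stable training.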

Figure 3: Comparison to the real occupations. The first column shows the input faces. The second to sixth columns show our results for the five occupations. The last column shows the ground-truth faces. (Because it is hard to find young and old faces of the same farmer, we replace the farmer with an agricultural scientist.)
Figure 4: Comparison to the ground truth. The first column shows the input faces. The second to sixth columns show our results for the five occupations (actor, singer, doctor, teacher, and farmer). The last column shows the ground-truth faces. Our method performs well on the details of aging (facial wrinkles, white hair or beard, and skin color) and preserves personal characteristics.
Figure 5: Examples from the quantitative comparison to the ground truth. In the first row, participants indicated that the two images show the same person; in the second row they were not sure; in the third row they judged the two images not to be similar.
Figure 6: Comparison to CAAE. The first column shows the input faces. The second to fifth columns show our results for the (30-50) and (50-80) age groups; each age group has two pictures, for teachers and doctors, because these two occupations have the aging characteristics of ordinary people. The sixth to ninth columns show the results of CAAE (these pictures are taken directly from the original paper; please ignore the red boxes). The faces generated by CAAE look smooth and young even for 71-80 years old, while the details of skin aging can be clearly seen in our generated images.
Figure 7: Comparison to RFA. The first column shows the input faces. The second to sixth columns show our results for the five occupations (actor, singer, doctor, teacher, and farmer). The last column shows the result of RFA.
Figure 8: Examples from the quantitative comparison to prior work. In the first row, participants judged our method to be better than prior works; in the second row they were not sure; in the third row they judged it to be worse.
Figure 9: Comparison of results with and without the triplet rank loss in the occupational-aware adversarial loss. Without the triplet loss, the occupational information cannot be captured accurately.
Figure 10: Effects of the personalized loss. As the coefficient increases, the aging effect and occupational information become less obvious.

5.2 Qualitative and Quantitative Comparison

In this subsection, we evaluate the performance of the proposed method. Following CAAE [35], we compare both qualitatively and quantitatively with the ground truth and with the best results from prior works [35, 34].

5.2.1 Comparison with Ground Truth

To qualitatively evaluate the performance, we compare our results with real face images with occupational information in Figure 3. We also compare with the face images in FGNET [16, 4] in Figure 4. We can see that OAFA captures the different textures of different occupations, e.g., wrinkles, hair, blemishes, etc.

For the quantitative comparison, we recruited 100 volunteers to do the test. Each participant was shown a sequence of image groups: the original image, our generated image (randomly selected from the five results), and the ground truth. Participants were asked whether the last two images show the same person, with “not sure” as a third option. There are 240 paired images of 48 subjects from FGNET in total. First, we randomly selected 40 pairs of images to let participants understand the test process. Then we randomly selected 100 paired images from the rest for testing. We received 95 valid test results, with 60.32% indicating that the generated face image is the same person as the ground truth, 30.57% indicating that it is not, and 9.11% not sure. Some test results are shown in Figure 5. The qualitative and quantitative comparisons show that our method can obtain realistic images.

5.2.2 Comparison with State-of-the-arts

We select CAAE [35] and RFA [34] as our baselines. For a fair comparison, we use the same inputs without pre-processing to generate images, and all baseline results are taken directly from the original papers. Figure 6 and Figure 7 show the comparison results. We can see that our method generates more realistic, older-looking images with more skin texture. For example, even at 80 years old, the faces generated by CAAE look very young, whereas the details of skin aging can be clearly seen in the faces generated by our method.

For the quantitative comparison, we again recruited 100 volunteers to do the test. Each participant was shown a sequence of image groups consisting of the original image, our generated image (again randomly selected from the five results), and the image generated by prior work (we use the CAAE code, https://github.com/ZZUTK/Face-Aging-CAAE, to generate these images). Participants were asked which method performs better, with “not sure” as a third option. The test set consists of 740 paired images of 148 subjects from FGNET, and some examples are shown in Figure 8. We first randomly selected 40 paired images to let participants understand the test process. Then, from the rest of the images, we randomly selected 100 paired images for testing. We received 92 valid test results, with 61.35% indicating that our method is better, 13.06% indicating that our method is worse, and 25.59% not sure.

Note that no pre-processing is applied to our input images.

5.3 Effect of the Occupational-aware Adversarial Loss

To explore the effect of the triplet rank loss, we compare examples generated with and without it. The other two losses are kept fixed. Figure 9 shows the results. We observe that using the triplet rank loss gives better results.

5.4 Effect of the Personalized Loss

To explore the effect of the personalized loss, we also show some example results obtained with different values of its weight. Figure 10 shows the results over a range of weights. We can see that the generated face images look more like the input young faces as the weight of the personalized loss increases.

6 Conclusion

In this paper, we proposed an occupation-aware face age progression method based on the conditional generative adversarial network. In the proposed deep adversarial architecture, an input young face image conditioned on an occupation passes through the generator network to produce the future look. We presented a personalized network and an occupational-aware adversarial network for preserving personality and for generating more realistic skin changes, respectively. Both the qualitative and the quantitative comparisons demonstrate the appealing performance of our method.

In future work, we plan to study shape change, e.g., child growth, to obtain more realistic face images.

References