ELEGANT: Exchanging Latent Encodings with GAN for Transferring Multiple Face Attributes

03/28/2018 · Taihong Xiao et al. · Peking University

Recent studies on face attribute transfer have achieved great success, especially since the emergence of generative adversarial networks (GANs). Many image-to-image translation models can transfer face attributes given an input image. However, they suffer from three limitations: (1) inability to generate images by exemplars; (2) inability to handle multiple face attributes simultaneously; (3) low quality of generated images, such as low resolution and noticeable artifacts. To address these limitations, we propose a novel model that receives two images with different attributes as inputs. Our model can transfer exactly the same style of attributes from one image to another by exchanging certain parts of their latent encodings. All attributes are encoded in the latent space in a disentangled manner, which enables us to manipulate several attributes simultaneously. Besides, the model learns residual images, which facilitates training on higher-resolution images. With multi-scale discriminators for adversarial training, it generates high-quality images with finer details and fewer artifacts. We demonstrate the effectiveness of our model in overcoming the above three limitations by comparing it with other methods on the CelebA face database.


1 Introduction

Figure 3: Results of ELEGANT in transferring the bangs attribute: (a) removing bangs; (b) adding bangs. Of the four images in each row, the bangs style of the first image is transferred to the last one.

Figure 6: Results of ELEGANT in transferring the gender attribute: (a) feminizing; (b) virilizing.

Figure 9: Results of ELEGANT in transferring the eyeglasses attribute: (a) removing eyeglasses; (b) adding eyeglasses. In each row, the type of eyeglasses in the first image is transferred to the last one.

Figure 12: Results of ELEGANT in transferring the smiling attribute: (a) removing smile; (b) adding smile. In each row, the style of smiling of the first image is transplanted into the last one.

Figure 15: Results of ELEGANT in transferring the black hair attribute: (a) black hair to non-black; (b) non-black hair to black. In each row, the hair color of the first image is changed to that of the third image, in addition to turning the hair of the third image black.

The task of transferring face attributes is a type of conditional image generation. A source face image is modified to contain the target attribute while the person identity is preserved. In the example shown in Fig. 3, the bangs attribute is manipulated (added or removed) without changing the person identity. For each pair of images, the right image is generated purely from the left one, without any corresponding ground-truth image in the training set.

Many methods have been proposed to accomplish this task, but they still suffer from various limitations.

Gardner et al. [3] proposed a method called Deep Manifold Traversal that approximates the natural image manifold and computes the attribute vector from the source domain to the target domain using maximum mean discrepancy (MMD) [6]. In this method, the attribute vector is a linear combination of the feature representations of training images extracted from the VGG-19 [22] network. However, it suffers from prohibitive time and memory costs and is thus impractical.

Under the Linear Feature Space assumption [1], one can transfer a face attribute in a much simpler manner [24]: add an attribute vector to the original image in the feature space, and then map the resulting feature back to the image space. For example, transferring a no-bangs image $x$ to a bangs image could be formulated as $x' = f^{-1}(f(x) + v_{\text{bangs}})$, where $f$ is a mapping (usually a deep neural network) from the image space to the feature space, and the attribute vector $v_{\text{bangs}}$ can be computed as the difference between the cluster centers of the features of bangs images and no-bangs images. Such a universal attribute vector is applied to a variety of faces, leading to the same style of bangs in all generated face images. But there are many styles of bangs, as Fig. 3 illustrates: some bangs are thick enough to cover the entire forehead, some sweep to the left or right side, exposing the other half of the forehead, and others part in the middle.
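The attribute-vector transfer just described can be summarized in a few lines. Below is a minimal NumPy sketch, not the implementation of [24]: the linear encoder f, its inverse f_inv, and all toy dimensions are placeholders for a pretrained deep network and an (approximate) inversion back to image space.

```python
import numpy as np

# Toy stand-ins for a pretrained encoder f and an (approximate) inverse f_inv.
# In practice f would be a deep network (e.g. VGG-19 features) and f_inv an
# optimization- or decoder-based reconstruction from feature space.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
f = lambda x: x @ W                      # image space -> feature space
f_inv = lambda z: z @ np.linalg.inv(W)   # feature space -> image space

# Attribute vector: difference between the cluster centers of the two groups.
bangs_imgs = rng.standard_normal((100, 64))     # images labeled "bangs"
no_bangs_imgs = rng.standard_normal((100, 64))  # images labeled "no bangs"
v_bangs = f(bangs_imgs).mean(axis=0) - f(no_bangs_imgs).mean(axis=0)

# Transfer: add the universal attribute vector in feature space, then map back.
x = no_bangs_imgs[0]
x_with_bangs = f_inv(f(x) + v_bangs)
```

Because the same v_bangs is added to every face, every generated face receives the same style of bangs, which is exactly the diversity problem discussed above.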

To address this diversity issue, Visual Analogy-Making [19] uses a pair of reference images to specify the attribute vector. Such a pair consists of two images of the same person, one with the attribute and one without. This method increases the richness and diversity of generated images; however, it is usually hard to obtain a large number of such paired images. For example, when transferring the gender attribute over face images, we would need both a male and a female image of the same person, which is practically impossible to collect (see Fig. 6).

Recently, more and more GAN-based methods [5] have been proposed to overcome this difficulty [18, 31, 10]. The task of face attribute transfer can be viewed as a kind of image-to-image translation problem: images with and without a certain attribute lie in different image domains. Dual learning approaches [7, 21, 11, 28, 32] have been further exploited to map between the source and target image domains. The maps between the two domains are continuous and inverse to each other under the cycle consistency loss. According to the invariance of domain theorem (https://en.wikipedia.org/wiki/Invariance_of_domain), the intrinsic dimensions of the two image domains should then be the same. This leads to a contradiction, because the intrinsic dimensions of the two image domains are not always equal. Taking eyeglasses transfer (Fig. 9) as an example, domain $\mathcal{A}$ contains face images wearing eyeglasses and domain $\mathcal{B}$ contains face images without eyeglasses; the intrinsic dimension of $\mathcal{A}$ is larger than that of $\mathcal{B}$ due to the variety of eyeglasses.

Some other methods [23, 15, 30] are essentially variants that combine GANs and VAEs. These models employ an autoencoder structure for image generation instead of two maps interconnecting two image domains, and thus successfully bypass the problem of unequal intrinsic dimensions. However, most of these models can manipulate only one face attribute at a time.

To control multiple attributes simultaneously, many conditional image generation methods [18, 13, 2, 29] receive image labels as conditions. Admittedly, these models can transfer several attributes at the same time, but they fail to generate images by exemplars, that is, to generate images with exactly the same style of attributes as a reference image. Consequently, the styles of attributes in the generated images tend to be similar, lacking richness and diversity.

BicycleGAN [33] introduces a noise term to increase diversity, but fails to generate images with specified attributes. TD-GAN [25] and DNA-GAN [27] can generate images by exemplars. However, TD-GAN requires explicit identity information in the label to preserve the person identity, which limits its application to datasets without labeled identity information, and DNA-GAN suffers from training difficulties on high-resolution images. Many other methods exist [14], but their results are not visually satisfying, showing either low resolution or many artifacts in the generated images.

2 Purpose and Intuition

As discussed above, there are many approaches to transferring face attributes. However, most of them suffer from one or more of the following limitations:

  1. Inability to generate images by exemplars;

  2. Inability to transfer multiple face attributes simultaneously;

  3. Low quality of generated images, such as low resolution or artifacts.

To overcome these three limitations, we propose a novel model that integrates several advantages for multiple face attribute transfer.

To generate images by exemplars, a model must receive a reference for conditional image generation. Most previous methods [18, 17, 13, 2] use labels directly to guide conditional image generation. But the information provided by a label is very limited and not commensurate with the diversity of images carrying that label: many kinds of smiling face images are all classified as smiling, yet they cannot be recovered from that single label. We therefore use the latent encodings of images as the reference, since given the encoder, the encodings of an image can be viewed as its unique identifier. The encodings of reference images are added to the inputs to guide the generation process. In this way, the generated image has exactly the same style of attributes as the reference images.

For manipulating multiple attributes simultaneously, the latent encodings of an image can be divided into different parts, where each part encodes the information of a single attribute [27]. In this way, multiple attributes are encoded in a disentangled manner. When transferring certain face attributes, only the encoding parts corresponding to those attributes are changed.

To improve the quality of generated images, we adopt the ideas of residual learning [8, 21] and multi-scale discriminators [26]. The local property of face attributes is unique to the task of face attribute transfer, in contrast to image style transfer [4], where the style is a holistic property. This property allows us to modify only a local part of the image to transfer face attributes, which helps alleviate the training difficulty. The multi-scale discriminators capture different levels of information useful for generating both holistic content and local details.

3 Our Method

In this section, we formally propose our method ELEGANT, the abbreviation of Exchanging Latent Encodings with GAN for Transferring multiple face attributes.

3.1 The ELEGANT Model

The ELEGANT model receives two sets of training images as inputs: a positive set and a negative set. In our convention, images from the positive set have the attribute, whereas images from the negative set do not. As shown in Fig. 16, image $A$ has the attribute smiling and image $B$ does not. The positive and negative sets need not be paired: the person in the positive set need not be the same as the one in the negative set.

All transferred attributes are predefined. It is not automatically guaranteed that each attribute is encoded into a different part; such disentangled representations have to be learned. We adopt an iterative training strategy: in each iteration, the model is trained with respect to a particular attribute by feeding it a pair of images with opposite values of that attribute, and we go over all attributes repeatedly.

When training ELEGANT on the $i$-th attribute, say smiling, at a given iteration, a set of smiling images $A$ and another set of non-smiling images $B$ are collected as inputs. Formally, the attribute labels of $A$ and $B$ are required to satisfy $y_A^{(i)} = 1$ and $y_B^{(i)} = 0$, respectively.

An encoder Enc is then used to obtain the latent encodings of images $A$ and $B$, denoted by $z_A$ and $z_B$, respectively:

$$z_A = \mathrm{Enc}(A) = [a_1, \ldots, a_i, \ldots, a_n], \quad z_B = \mathrm{Enc}(B) = [b_1, \ldots, b_i, \ldots, b_n], \qquad (1)$$

where $a_i$ (or $b_i$) is the feature tensor that encodes the smiling information of image $A$ (or $B$). In practice, we split the tensor $z_A$ (or $z_B$) into $n$ parts along its channel dimension. Having obtained $z_A$ and $z_B$, we exchange the $i$-th parts of their latent encodings to obtain the novel encodings $z_C$ and $z_D$:

$$z_C = [a_1, \ldots, b_i, \ldots, a_n], \quad z_D = [b_1, \ldots, a_i, \ldots, b_n]. \qquad (2)$$

We expect $z_C$ to be the encoding of the non-smiling version of image $A$, and $z_D$ the encoding of the smiling version of image $B$. As shown in Fig. 16, $A$ and $B$ serve as reference images for each other, and the novel images are generated by swapping the latent encodings.
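To make the exchange in Eq. (2) concrete, here is a minimal NumPy sketch. The array shapes and the helper name exchange_latents are illustrative assumptions, not part of the paper's released code.

```python
import numpy as np

def exchange_latents(z_A, z_B, i):
    """Swap the i-th attribute part of two latent encodings (Eq. (2)).

    z_A, z_B: encodings split into n parts along the channel dimension,
    represented here as arrays of shape (n_parts, c, h, w).
    Returns the novel encodings z_C and z_D.
    """
    z_C, z_D = z_A.copy(), z_B.copy()
    z_C[i], z_D[i] = z_B[i], z_A[i]   # b_i goes into z_C, a_i goes into z_D
    return z_C, z_D

# Toy example: 3 attribute parts, each a 2x4x4 feature tensor.
z_A = np.random.randn(3, 2, 4, 4)
z_B = np.random.randn(3, 2, 4, 4)
z_C, z_D = exchange_latents(z_A, z_B, i=1)
assert np.allclose(z_C[1], z_B[1]) and np.allclose(z_D[1], z_A[1])
```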

We then need a reasonable structure to decode the latent encodings back into images. As discussed in Sec. 2, it is much better to learn residual images rather than whole images, so we recombine the latent encodings and employ a decoder Dec to do this job:

$$R_A = \mathrm{Dec}([z_A, z_A]), \quad R_C = \mathrm{Dec}([z_C, z_A]), \quad A' = A + R_A, \quad C = A + R_C, \qquad (3)$$
$$R_B = \mathrm{Dec}([z_B, z_B]), \quad R_D = \mathrm{Dec}([z_D, z_B]), \quad B' = B + R_B, \quad D = B + R_D, \qquad (4)$$

where $R_A$, $R_B$, $R_C$ and $R_D$ are residual images, $A'$ and $B'$ are reconstructed images, $C$ and $D$ are images with novel attributes, and $[z_C, z_A]$ denotes the concatenation of the encodings $z_C$ and $z_A$. The concatenation could be replaced by the difference of the two encodings, but we keep the concatenation form because the subtraction operation can be learned by Dec.
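Continuing the sketch above, the data flow of Eqs. (3)-(4) looks roughly as follows. The function dec is a dummy stand-in for the real deconvolutional decoder; only the concatenation, residual decoding, and addition steps are meant to be illustrative.

```python
import numpy as np

def dec(z_pair):
    """Dummy decoder: maps a concatenated encoding to a residual image.
    In ELEGANT this is a deconvolutional network; here it is a fixed
    reduction just to make the data flow of Eqs. (3)-(4) explicit."""
    return z_pair.mean(axis=(0, 1))              # -> an (h, w) "residual image"

A = np.random.randn(4, 4)                        # toy input image
z_A = np.random.randn(3, 2, 4, 4)                # its encoding, split into 3 parts
z_C = np.random.randn(3, 2, 4, 4)                # encoding with the i-th part swapped in

R_A = dec(np.concatenate([z_A, z_A], axis=1))    # residual for reconstruction
R_C = dec(np.concatenate([z_C, z_A], axis=1))    # residual for the novel attribute
A_rec = A + R_A                                  # reconstructed image A'
C = A + R_C                                      # image with the novel attribute
```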

Figure 16: The ELEGANT model architecture.

Besides, we use the U-Net [20] structure for better visual results. The structures of Enc and Dec are symmetrical, and their intermediate layers are connected by shortcuts, as displayed in Fig. 16. These shortcuts bring in the original images as a context condition, helping generate seamless novel attributes.

Enc and Dec together act as the generator. We also need discriminators for adversarial training. However, the receptive field of a single discriminator is limited when the input image size becomes large. To address this issue, we adopt multi-scale discriminators [26]: two discriminators with identical network structures that operate at different image scales. We denote the discriminator operating at the larger scale by $D_1$ and the other by $D_2$. Relative to the whole image, $D_1$ has a smaller receptive field than $D_2$; therefore, $D_1$ is specialized in guiding Enc and Dec to produce finer details, whereas $D_2$ is adept at handling the holistic image content so as to avoid generating grimaces.
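The following sketch shows one way the same generated image could be fed to two discriminators at different scales. The average-pooling downsample, the factor of 2, and the discriminator stubs are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def downsample(img, factor=2):
    """Average-pool an (h, w, c) image by an integer factor."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def d1(img):
    # Stub for the full-scale discriminator D_1 (guides finer details).
    return float(img.mean())

def d2(img):
    # Stub for the coarse-scale discriminator D_2 (guides holistic content).
    return float(img.mean())

x_fake = np.random.rand(256, 256, 3)     # a generated image
score_fine = d1(x_fake)                  # D_1 sees the full-resolution image
score_holistic = d2(downsample(x_fake))  # D_2 sees a downsampled version
```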

The discriminators also receive image labels as conditional inputs. There are $n$ attributes in total. The discriminator outputs in each iteration reflect how realistic the generated images are with respect to one attribute, so it is necessary to let the discriminators know which attribute they are dealing with in each iteration. Mathematically, this takes a conditional form: for example, $D_1(A|y_A)$ denotes the output score of $D_1$ for image $A$ given its label $y_A$. Particular attention must be paid to the attribute labels of $C$ and $D$, since they carry novel attributes:

$$y_C = (y_A^{(1)}, \ldots, y_B^{(i)}, \ldots, y_A^{(n)}), \qquad (5)$$
$$y_D = (y_B^{(1)}, \ldots, y_A^{(i)}, \ldots, y_B^{(n)}), \qquad (6)$$

where $y_C$ differs from $y_A$ only in the $i$-th element, replacing $y_A^{(i)} = 1$ with $y_B^{(i)} = 0$, since we do not expect $C$ to have the $i$-th attribute. The same applies to $y_D$ and $y_B$.

3.2 Loss Functions

The multi-scale discriminators $D_1$ and $D_2$ receive the standard adversarial loss:

$$\mathcal{L}_{D_1} = -\mathbb{E}\big[\log D_1(A|y_A)\big] - \mathbb{E}\big[\log D_1(B|y_B)\big] - \mathbb{E}\big[\log(1 - D_1(C|y_C))\big] - \mathbb{E}\big[\log(1 - D_1(D|y_D))\big], \qquad (7)$$
$$\mathcal{L}_{D_2} = -\mathbb{E}\big[\log D_2(A|y_A)\big] - \mathbb{E}\big[\log D_2(B|y_B)\big] - \mathbb{E}\big[\log(1 - D_2(C|y_C))\big] - \mathbb{E}\big[\log(1 - D_2(D|y_D))\big], \qquad (8)$$
$$\mathcal{L}_D = \mathcal{L}_{D_1} + \mathcal{L}_{D_2}. \qquad (9)$$

When minimizing $\mathcal{L}_D$, we are simultaneously maximizing the scores for real images and minimizing the scores for fake images, which drives $D_1$ and $D_2$ to discriminate fake images from real ones.

As for Enc and Dec, there are two types of losses. The first is the reconstruction loss,

$$\mathcal{L}_{\mathrm{reconstruction}} = \lVert A - A' \rVert + \lVert B - B' \rVert, \qquad (10)$$

which measures how well the original inputs are reconstructed after a sequence of encoding and decoding. The second is the standard adversarial loss,

$$\mathcal{L}_{\mathrm{adv}} = -\mathbb{E}\big[\log D_1(C|y_C)\big] - \mathbb{E}\big[\log D_1(D|y_D)\big] - \mathbb{E}\big[\log D_2(C|y_C)\big] - \mathbb{E}\big[\log D_2(D|y_D)\big], \qquad (11)$$

which measures how realistic the generated images are. The total loss for the generator is

$$\mathcal{L}_G = \mathcal{L}_{\mathrm{reconstruction}} + \mathcal{L}_{\mathrm{adv}}. \qquad (12)$$
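To show how the terms in Eqs. (7)-(12) fit together, here is a hedged NumPy sketch operating directly on scalar discriminator scores. The log-loss form follows the "standard adversarial loss" wording, and the L1 reconstruction distance is an assumption; the real training code works on network outputs and back-propagates through them.

```python
import numpy as np

EPS = 1e-8

def d_loss(real_scores, fake_scores):
    """One discriminator's adversarial loss (Eqs. (7)-(8)):
    real_scores are D(.|y) for real images A, B; fake_scores for C, D."""
    real_term = -np.mean([np.log(s + EPS) for s in real_scores])
    fake_term = -np.mean([np.log(1.0 - s + EPS) for s in fake_scores])
    return real_term + fake_term

def g_loss(fake_scores, A, A_rec, B, B_rec):
    """Generator loss (Eqs. (10)-(12)): reconstruction + adversarial terms."""
    l_rec = np.abs(A - A_rec).mean() + np.abs(B - B_rec).mean()   # Eq. (10), L1 assumed
    l_adv = -np.mean([np.log(s + EPS) for s in fake_scores])      # Eq. (11)
    return l_rec + l_adv                                          # Eq. (12)

# Toy usage with made-up scores and images.
A, B = np.random.rand(4, 4), np.random.rand(4, 4)
L_D = d_loss([0.9, 0.8], [0.2, 0.1]) + d_loss([0.85, 0.75], [0.3, 0.2])  # Eq. (9)
L_G = g_loss([0.2, 0.1, 0.3, 0.2], A, A + 0.01, B, B + 0.01)
```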

4 Experiments

Figure 17: Interpolation results of different bangs. The top-left image is the original, and those at the other three corners are reference images with different styles of bangs. The remaining 16 images in the center are interpolation results.

In this section, we carry out different types of experiments to validate the effectiveness of our method in overcoming the three limitations. First, we introduce the dataset and our model in detail.

The CelebA [16] dataset is a large-scale face database containing 202,599 face images of 10,177 identities, each with 40 attribute annotations and 5 landmark locations. We use the 5-point landmarks to align all face images and crop them to a fixed size; all of the following experiments are performed at this scale.

The encoder consists of 5 Conv-Norm-LeakyReLU blocks, and the decoder of 5 Deconv-Norm-LeakyReLU blocks. Each of the multi-scale discriminators uses 5 Conv-Norm-LeakyReLU blocks followed by a fully connected layer. All networks are trained using Adam [12] with a learning rate of 2e-4. All input images are normalized into the range [-1, 1], and the output of the decoder's last layer is constrained to the range [-2, 2], since the maximum difference between the input image and the output image is 2. After adding the residual to the input image, we clip the output image values into [-1, 1] to avoid out-of-range errors.
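A short sketch of the value ranges described above: images live in [-1, 1], so residuals are kept within [-2, 2] and the final output is clipped back to the valid image range. The scaled-tanh activation is an assumption about how the [-2, 2] constraint might be realized.

```python
import numpy as np

def bound_residual(pre_activation):
    """Constrain the decoder's raw output to [-2, 2] via a scaled tanh
    (assumed); the maximum input/output difference is 2 because images
    are normalized to [-1, 1]."""
    return 2.0 * np.tanh(pre_activation)

def apply_residual(image, residual):
    """Add the residual and clip back to the valid image range [-1, 1]."""
    return np.clip(image + residual, -1.0, 1.0)

img = np.random.uniform(-1.0, 1.0, size=(256, 256, 3))   # normalized input image
raw = np.random.randn(256, 256, 3)                        # decoder pre-activation
out = apply_residual(img, bound_residual(raw))
```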

It is worth mentioning that Batch-Normalization (BN) layers should be avoided. ELEGANT receives two batches of images with opposite attributes as inputs, so the moving means and variances of the two batches can differ greatly in each layer. With BN, these running statistics would keep oscillating. To overcome this issue, we replace BN with a normalization whose scale and shift are learnable parameters and which does not compute moving statistics. Without such running statistics, ELEGANT converges stably and swaps face attributes effectively.
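Given the description above, the replacement layer amounts to a per-channel learnable scale and shift with no running statistics. The following is one plausible drop-in sketch (the paper's exact formulation may differ).

```python
import numpy as np

class ScaleShift:
    """Per-channel learnable affine layer: y = gamma * x + beta.
    Unlike BatchNorm it keeps no moving mean/variance, so the two
    opposite-attribute batches cannot make running statistics oscillate."""

    def __init__(self, num_channels):
        self.gamma = np.ones(num_channels)   # learnable scale
        self.beta = np.zeros(num_channels)   # learnable shift

    def __call__(self, x):
        # x has shape (batch, channels, height, width)
        return x * self.gamma[None, :, None, None] + self.beta[None, :, None, None]

layer = ScaleShift(64)
y = layer(np.random.randn(2, 64, 32, 32))
```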

4.1 Face Image Generation by Exemplars

Figure 20: Face image generation by exemplars: (a) bangs; (b) smiling. The yellow and green boxes contain the input images (from outside the training data) and the reference images, respectively. Images in the red and blue boxes are the results of ELEGANT and of other models, respectively.

To demonstrate that our model can generate face images by exemplars, we choose UNIT [15], CycleGAN [32] and StarGAN [2] for comparison. As shown in Fig. 20, ELEGANT generates different face images with exactly the same style of attribute as the reference images, whereas the other methods can only generate a common style of attribute for any input image (the style of bangs is the same within each column in the blue box).

An important drawback of StarGAN should be pointed out here. StarGAN can be trained to transfer multiple attributes, but when transferring only one attribute, it may change others as well. For example, in the last column of Fig. 20(a), Fei-Fei Li and Andrew Ng become younger when bangs are added. This is because StarGAN requires an unambiguous label for the input image, and these two images are both labeled as 1 for the attribute young. However, both of them are middle-aged and cannot simply be labeled as either young or old.

The mechanism of exchanging latent encodings in ELEGANT effectively addresses this issue. ELEGANT focuses only on the attribute being transferred and does not require labels for the input images at test time. Moreover, ELEGANT can learn the subtle differences between the bangs styles of different reference images, as displayed in Fig. 17.

4.2 Dealing with Multiple Attributes Simultaneously

We compare ELEGANT with DNA-GAN [27], because both can manipulate multiple face attributes and generate images by exemplars. The two models are run on the same face images and reference images with respect to three attributes. As shown in Fig. 24, ELEGANT is visually much better than DNA-GAN, particularly in producing finer details (zoom in for a closer look). The improvement over DNA-GAN is mainly the result of residual learning and the multi-scale discriminators.

Figure 24: Multiple attributes interpolation: (a) Bangs and Smiling; (b) Smiling and Mustache; (c) Bangs and Mustache. The left and right columns show the results of ELEGANT and DNA-GAN, respectively. In each picture, the top-left image is the original, and the bottom-left and top-right images are the reference images for the first and second attributes. The original image gradually acquires the two attributes of the reference images along the two directions.

Residual learning reduces training difficulty. DNA-GAN suffers from unstable training, especially on high-resolution images. On one hand, this difficulty comes from an imbalance between the generator and the discriminator. At the early stage of DNA-GAN training, the generator outputs nonsense, so the discriminator can easily learn to tell generated images from real ones, which quickly breaks the balance. ELEGANT, however, adopts residual learning, so the outputs of the generator are almost identical to the original images at the early stage. The discriminator therefore cannot be well trained too fast, which helps stabilize the training process. On the other hand, the burden on the generator grows heavier than that on the discriminator as the image size increases, because the generator's output space grows with the image resolution, whereas the discriminator still only needs to output a single number. ELEGANT effectively reduces the dimension of the generator's output space by learning residual images, in which only a small number of pixels need to be modified.

Multi-scale discriminators improve the quality of generated images. One discriminator operating at the smaller input scale can guide the overall image content generation, and the other operating at the larger input scale can help the generator to produce finer details. (Already discussed in Sec. 3.1)

Moreover, DNA-GAN utilizes an additional part of the encoding to represent face identity and background information. This is a good idea, but it brings the problem of trivial solutions: the two input images can simply be swapped to satisfy the loss constraints. Xiao et al. [27] proposed the so-called annihilating operation to address this issue, but this operation distorts the parameter space and adds further training difficulty. ELEGANT learns residual images that account only for the changes, so the face identity and background information are automatically preserved. It also removes the annihilating operation and the additional part of the latent encodings, which makes the whole framework more elegant and easier to understand.

4.3 High-quality Generated Images

As displayed in Figs. 3, 6, 9, 12 and 15, we present the results of ELEGANT with respect to different attributes at a large size for close inspection. Moreover, we use the Fréchet Inception Distance (FID) [9] to measure the quality of generated images. FID measures the distance between two distributions by

$$d^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big), \qquad (13)$$

where $\mu_1, \mu_2$ and $\Sigma_1, \Sigma_2$ are the means and covariance matrices of the two distributions. As shown in Table 1, we compute the FID between the distributions of real images and generated images with respect to different attributes. ELEGANT achieves competitive results compared with the other methods.

The FID score is only for reference, for two reasons. First, ELEGANT and DNA-GAN generate images by exemplars, which is a more general and more difficult task than other types of image translation, so any single quantitative measure is still somewhat unfair to them. Second, there is no consensus on a reliable quantitative measure for GANs.
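For completeness, a small sketch of how the FID in Eq. (13) can be computed from two sets of feature vectors. The standard FID uses Inception activations; the random features below are only placeholders.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets (Eq. (13)):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^(1/2))."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):       # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

feats_real = np.random.randn(500, 64)           # e.g. features of real images
feats_fake = np.random.randn(500, 64) + 0.1     # features of generated images
print(fid(feats_real, feats_fake))
```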

Table 1: FID of different methods with respect to five attributes. The + (−) denotes generated images obtained by adding (removing) the attribute.

FID        bangs            smiling          mustache         eyeglasses       male
UNIT       135.41  137.94   120.25  125.04   119.32  131.33   111.49  139.43   152.16  154.59
CycleGAN    27.81   33.22    23.23   22.74    43.58   55.49    36.87   48.82    60.25   46.25
StarGAN     59.68   71.07    51.36   78.87    99.03  176.18    70.40  142.35    70.14  206.21
DNA-GAN     79.27   76.89    77.04   72.35   126.33  127.66    75.02   75.96   121.04  118.67
ELEGANT     30.71   31.12    25.71   24.88    37.51   49.13    47.35   60.71    59.37   56.80

5 Conclusions

We have established a novel model, ELEGANT, for transferring multiple face attributes. The model encodes different attributes into disentangled parts of the latent representation and generates images with novel attributes by exchanging certain parts of the latent encodings. Based on the observation that only a local part of the image needs to be modified to transfer a face attribute, we adopt residual learning to facilitate training on high-resolution images. A U-Net structure and multi-scale discriminators further improve the image quality. Experimental results on the CelebA face database demonstrate that ELEGANT successfully overcomes three common limitations of most existing methods.

Acknowledgement. This work was supported by High-performance Computing Platform of Peking University.

References
