Controllable Generative Adversarial Network

08/02/2017 ∙ by Minhyeok Lee, et al. ∙ Korea University

Although introduced only recently, the generative adversarial network (GAN) has shown many promising results in generating realistic samples. However, the generated samples can hardly be controlled, since the input variables of the generator are drawn from a random distribution. Some attempts have been made to control the samples generated by a GAN, but they have not performed well on difficult problems. Furthermore, it is hardly possible to make the generator concentrate on either reality or distinctness. For example, with existing models, a generator for face images cannot be set to concentrate on one of two objectives, i.e. generating realistic faces or generating different faces according to the input labels. Here, we propose the controllable GAN (CGAN). CGAN performs strongly in controlling generated samples; in addition, it can make the generator concentrate on reality or distinctness. In this paper, CGAN is evaluated on the CelebA dataset. We believe that CGAN can contribute to research on generative neural network models.


1 Introduction

Generative Adversarial Network (GAN) is a neural network structure introduced for generating realistic samples. A GAN consists of two modules, a generator and a discriminator. The generator produces fake samples from random noise, while the discriminator attempts to distinguish these fake samples from real samples. The generator tries to deceive the discriminator by learning from the errors that the discriminator outputs for fake samples. Through this adversarial, competitive learning, the latent variables of the samples are mapped onto the random variables that form the input of the generator. After enough iterations of this process, the generator can generate realistic samples from random noise.

Although introduced only recently, GAN has shown many promising results not only for generating realistic samples (r1, ; r2, ; r3, ), but also for machine translation (r4, ) and image super-resolution (r5, ).

However, when generating realistic samples, we can hardly control the GAN because the input variables of the generator are drawn from a random distribution. While a vanilla GAN can generate realistic samples from random noise, the relationship between the inputs of the generator and the features of the generated samples is not obvious. In the last few years, there have been several attempts to control the samples generated by a GAN (r6, ; r7, ; r8, ; r9, ; r10, ). One of the most popular methods is the conditional GAN (r10, ), which feeds labels into both the generator and the discriminator so that they operate under the given conditions.

The current conditional GAN mainly focuses on generating realistic samples, rather than on making the generated samples differ according to the input labels. As a result, conditional GAN succeeds with major features but has difficulty generating samples with detailed features. For example, with the CelebA dataset (r11, ), which consists of 202,599 celebrity face images labeled with 40 different features, conditional GAN only works with major features, such as smiling or bangs. To make the generator work with detailed features, such as a pointy nose or arched eyebrows, we have to make the generator focus more on generating different samples according to the input labels.

In this paper, we propose a novel generative model architecture for controlling generated samples, called Controllable Generative Adversarial Network (ControlGAN). ControlGAN is composed of three players: a generator/decoder, a discriminator and a classifier/encoder. The generator plays games with the discriminator and the classifier simultaneously; it aims to deceive the discriminator and, at the same time, to be classified correctly by the classifier.

ControlGAN has two main advantages over existing models. First, ControlGAN can be trained to focus more on the input labels, so that it can generate samples with detailed features that conditional GAN can hardly generate. Second, ControlGAN uses an independent network to map the features onto the corresponding input labels, whereas the discriminator performs this task in conditional GAN and other conditional variants of GAN (acgan, ). Consequently, the discriminator can concentrate more on its own objective, which is discriminating between fake samples and original samples, so that the quality of the generated samples can be enhanced.

Figure 1: Comparison between ControlGAN and conditional GAN for generating face images with detailed features. Each row is generated with the same input noise $z$. The images in the rightmost column are generated with , and the images in the leftmost column are generated with . The intermediate images are generated with interpolated and extrapolated label values.

In this paper, ControlGAN is applied to the CelebA dataset and the LSUN dataset (lsun, ). As shown in Figure 1, ControlGAN can effectively generate face images according to the input labels. Furthermore, we demonstrate that ControlGAN also works with extrapolated label values. To evaluate this zero-shot behavior, we test ControlGAN with untrained label values in Section 4.3.

2 Background

2.1 A brief review of generative adversarial networks and its conditional variants

GAN is a neural network structure for learning to generate samples and to map the latent variables of a dataset. Given a dataset where , if the samples are not orthogonal to each other, latent variables exist. For example, for a face image dataset, the latent variables can be attributes of human faces, such as the shape of the face, the sharpness of the nose, and the color of the eyes.

An auto-encoder (AE) is a neural network structure that encodes samples into latent variables to decrease the dimensionality of a dataset, i.e. . The objective of generative models is the inverse function of the AE (), which means that, given latent variables, the models aim to generate samples. Therefore, the structure of generative models based on neural networks is similar to the decoder of an AE. The problem for generative models is how to find these latent variables and learn to generate samples.

GAN solves this problem through a competitive learning process between the generator and the discriminator. First, the generator generates samples from randomly initialized variables. Then, the discriminator learns to distinguish between the generated samples and real samples. Simultaneously, the generator learns to deceive the discriminator from the losses of the generated samples. By repeating this process, the generator learns to generate realistic samples and embeds the latent variables into its input variables.
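The competitive process above can be sketched with the usual binary cross-entropy losses. This is a minimal illustration rather than the implementation used in this work: the function names `bce`, `discriminator_loss` and `generator_loss` are hypothetical, and the discriminator outputs are assumed to be probabilities in (0, 1).

```python
import math


def bce(target, prob, eps=1e-12):
    """Binary cross-entropy for a single predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """The discriminator should score real samples as 1 and fake samples as 0."""
    return bce(1.0, d_real) + bce(0.0, d_fake)


def generator_loss(d_fake):
    """The generator is rewarded when its samples are scored as real."""
    return bce(1.0, d_fake)


# A discriminator that separates the two kinds of samples has a lower loss
# than one that outputs 0.5 for everything.
print(discriminator_loss(0.9, 0.1) < discriminator_loss(0.5, 0.5))
# A generator whose samples fool the discriminator has a lower loss.
print(generator_loss(0.9) < generator_loss(0.1))
```

The two losses pull the discriminator output for generated samples in opposite directions, which is the source of the competition.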

However, since the relation between generated samples and input variables is not obvious, the generated samples cannot be controlled as desired. For example, we cannot direct a GAN to generate face images of a smiling old woman with blond hair; with a vanilla GAN, we have to select such images from randomly generated samples. To address this problem, conditional variants of GAN have been studied.

Conditional GAN is the most popular GAN structure for controlling the samples generated by a generator. It feeds label inputs to both the generator and the discriminator so that the generator operates under the condition of the input labels.

Several studies have used a classifier to address the problem. Auxiliary Classifier GAN (AC-GAN) (acgan, ) uses a classifier as the discriminator of the GAN structure. Triple-GAN (triplegan, ) uses the classification results as an input for the discriminator. However, these methods commonly use a classifier that is attached to a discriminator. Since the discriminator still decides the condition of the samples, they can hardly overcome the limitation of conditional GAN.

2.2 The limitation of conditional GAN

While conditional GAN is the most popular GAN structure for generating conditional samples, it frequently fails to generate detailed features. For example, it can generate a face image under a condition that is easily distinguishable, such as ’Blond Hair’, but it can hardly generate face images with detailed labels such as ’Arched Eyebrows’, ’Big Lips’, ’Mouth Slightly Open’, ’Wearing Earrings’ and ’Wearing Lipstick’, as shown in Figure 2.

Figure 2: Generated face images by conditional GAN with ten different labels.

This failure occurs because the discriminator decides whether the label/condition is correct. The main objective of the discriminator is to discriminate between fake and real samples. Therefore, if a condition (or a label) is very rare in a dataset, or is far from the center of the sample distribution where samples are dense, the discriminator becomes more likely to judge samples with that condition as fake.

3 Methods

3.1 Controllable generative adversarial networks

ControlGAN is composed of three neural network structures: a generator/decoder, a discriminator and a classifier/encoder. Figure 3 illustrates the architecture of ControlGAN. A three-player game is played in ControlGAN, in which the generator tries to deceive the discriminator, as in vanilla GAN, and simultaneously aims to be classified into the corresponding class by the classifier. The generator and the classifier can be interpreted as a decoder-encoder structure because the labels serve both as inputs for the generator and as outputs for the classifier.

Figure 3: The concept of ControlGAN. (a) The architecture of ControlGAN. (b) An illustration of the concept of ControlGAN. The green dashed line denotes the classifier and the orange dashed line denotes the discriminator. The grey figures denote samples labeled with different classes. The generator (blue region) tries to learn the sample distribution and, simultaneously, to be classified with the correct labels.

ControlGAN minimizes the following equations:

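$$
\begin{aligned}
\theta_D &= \arg\min \{ \alpha \cdot L_D(t_D, D(x;\theta_D)) + (1-\alpha) \cdot L_D((1-t_D), D(G(z,l;\theta_G);\theta_D)) \}, \\
\theta_G &= \arg\min \{ \gamma_t \cdot L_C(l, G(z,l;\theta_G)) + L_D(t_D, D(G(z,l;\theta_G);\theta_D)) \}, \\
\theta_C &= \arg\min \{ L_C(l, x;\theta_C) \},
\end{aligned}
$$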

where $l$ is the binary representation of the labels of sample $x$ and also the input data for the generator, $t_D$ is the label for the discriminator, which we set to one in this work, and $\alpha$ denotes a parameter for the discriminator.

ControlGAN forces features to be mapped onto the corresponding $l$ fed into the generator. The parameter $\gamma_t$ decides how much the generator focuses on its input labels.
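As a minimal sketch, the three objectives can be written as plain functions of the network outputs. The text does not state the form of $L_D$ and $L_C$, so binary cross-entropy is assumed here; the function names and the use of scalar probabilities in place of network outputs are illustrative only.

```python
import math


def bce(target, prob, eps=1e-12):
    """Binary cross-entropy for a single predicted probability (assumed loss)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean_bce(labels, probs):
    """Average binary cross-entropy over a multi-label vector."""
    return sum(bce(l, p) for l, p in zip(labels, probs)) / len(labels)


def discriminator_objective(d_real, d_fake, alpha=0.5, t_d=1.0):
    """alpha * L_D(t_D, D(x)) + (1 - alpha) * L_D(1 - t_D, D(G(z, l)))."""
    return alpha * bce(t_d, d_real) + (1.0 - alpha) * bce(1.0 - t_d, d_fake)


def generator_objective(labels, c_fake, d_fake, gamma_t, t_d=1.0):
    """gamma_t * L_C(l, G(z, l)) + L_D(t_D, D(G(z, l)))."""
    return gamma_t * mean_bce(labels, c_fake) + bce(t_d, d_fake)


def classifier_objective(labels, c_real):
    """L_C(l, x): the classifier is trained on real samples only."""
    return mean_bce(labels, c_real)


# Hypothetical outputs for one generated sample with two binary labels.
labels = [1.0, 0.0]
print(generator_objective(labels, c_fake=[0.8, 0.3], d_fake=0.4, gamma_t=1.0))
```

Note that the classifier objective depends only on real samples, while the classification loss of generated samples enters the generator objective alone, weighted by $\gamma_t$.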

It is important to maintain the equilibrium between the two objectives of the generator, since ControlGAN aims to optimize a decoder-encoder structure and a GAN structure simultaneously. Suppose there exists a well-trained conditional generator that has perfectly learned the true distribution; then a set of samples generated by it has the same classification loss as the original dataset:

$$
E = \frac{L_C(l, G(z,l;\theta_G))}{L_C(l, x)} = 1 \quad \text{if} \quad G(z,l) = P(X) \tag{3}
$$

If a generator is trained to concentrate on the input labels, the value in (3) would be less than one, and otherwise the value would be more than one.
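As a worked example with hypothetical loss values (not taken from the experiments), suppose the classifier has a loss of 0.40 on real samples and 0.20 on generated samples:

```latex
E = \frac{L_C(l, G(z,l;\theta_G))}{L_C(l, x)} = \frac{0.20}{0.40} = 0.5 < 1
```

Here the generated samples are easier to classify than the real ones, i.e. the generator concentrates on the input labels. With a generated-sample loss of 0.60 instead, the ratio would be 1.5, indicating that the generator concentrates less on the labels than the data does.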

ControlGAN controls whether to concentrate on learning the distribution of a dataset or on learning to generate samples according to the input labels through the parameter $E$, which keeps the classification loss of the generated samples constant. $\gamma_t$ is a learning parameter that maintains $E$; it changes with the time step $t$, i.e. the training iteration, and is calculated as follows:

$$
\gamma_t = \gamma_{t-1} + r \cdot \{ L_C(l, G(z,l;\theta_G)) - E \cdot L_C(l, x) \},
$$

where $r$ is a learning rate parameter for $\gamma_t$.
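The update rule can be sketched as a one-line function. The name `update_gamma` and the numeric loss values below are hypothetical; only the rule itself and the setting of the learning rate parameter to 0.01 come from the text.

```python
def update_gamma(gamma_prev, loss_c_fake, loss_c_real, target_e, r=0.01):
    """One step of gamma_t = gamma_{t-1} + r * (L_C(l, G(z, l)) - E * L_C(l, x)).

    loss_c_fake: classification loss of generated samples
    loss_c_real: classification loss of real samples
    target_e:    the equilibrium parameter E
    r:           learning rate parameter for gamma_t
    """
    return gamma_prev + r * (loss_c_fake - target_e * loss_c_real)


# Generated samples are classified worse than the target ratio allows,
# so gamma grows and the generator is pushed toward the labels.
print(update_gamma(1.0, loss_c_fake=0.8, loss_c_real=0.4, target_e=0.5))

# Generated samples are classified better than the target, so gamma shrinks.
print(update_gamma(1.0, loss_c_fake=0.1, loss_c_real=0.4, target_e=0.5))
```

The update vanishes exactly when the classification loss of generated samples equals the target ratio times that of real samples, which is the fixed point the rule drives toward.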

Such a concept of the equilibrium parameter is similar to that of Boundary Equilibrium GAN (BEGAN) (r3, ). BEGAN employs an equilibrium parameter to maintain the balance between the generator and the discriminator. In this work, the equilibrium parameter is used for the balance between the learning of the GAN structure and the decoder-encoder structure.

3.2 ControlDCGAN structure for the applications

In this work, we used the ControlDCGAN structure, which combines the architecture of ControlGAN with that of the Deep Convolutional Generative Adversarial Network (DCGAN) (dcgan, ). We used residual modules for the generator, the discriminator and the classifier. Neither batch normalization nor dropout is used, in order to evaluate the vanilla ControlGAN. The generator, the discriminator and the classifier consist of 19, 22 and 22 hidden layers, respectively. The size of generated samples from the generator is . The architecture we used for the application is summarized in Table 1.

| Description | Generator | Discriminator | Classifier |
|---|---|---|---|
| Input | Concatenate($z$, $l$) | $x$ or $G(z, l)$ | $x$ or $G(z, l)$ |
| Fully connected | FC() | None | None |
| Convolutional | None | Conv(5, 5, 64) | Conv(5, 5, 64) |
| Residual module 1 | Deconv(3, 3, 64) | Conv(3, 3, 64) | Conv(3, 3, 64) |
| | Deconv(3, 3, 64) | Conv(3, 3, 64) | Conv(3, 3, 64) |
| Deconvolutional | Deconv(5, 5, 64) | None | None |
| Pooling | None | AveragePool(2, 2) | AveragePool(2, 2) |
| Residual module 2 | Deconv(3, 3, 64) | Conv(3, 3, 64) | Conv(3, 3, 64) |
| | Deconv(3, 3, 64) | Conv(3, 3, 64) | Conv(3, 3, 64) |
| Deconvolutional | Deconv(5, 5, 64) | None | None |
| Pooling | None | AveragePool(2, 2) | AveragePool(2, 2) |
| Residual module 3 | Deconv(3, 3, 64) | Conv(3, 3, 64) | Conv(3, 3, 64) |
| | Deconv(3, 3, 64) | Conv(3, 3, 64) | Conv(3, 3, 64) |
| Deconvolutional | Deconv(5, 5, 64) | None | None |
| Pooling | None | AveragePool(2, 2) | AveragePool(2, 2) |
| Fully connected | None | FC(128) | FC(128) |
| Fully connected | None | FC(1) | FC(Num. of classes) |

Table 1: The architecture of each module of ControlDCGAN for the applications.

denotes the stride for convolutional or deconvolutional layers.

4 Results and Discussion

4.1 Generating multi-label image samples of celebrity face using CelebA dataset

In this section, ControlGAN was trained over the CelebA dataset (r11, ). The CelebA dataset contains celebrity face images with multiple labels for each image. For example, a sample can have multiple labels of ‘Attractive’, ‘Blond Hair’, ‘Mouth Slightly Open’ and ‘Smiling’.

We used the Adam optimizer with a learning rate of and to train the model. The learning rate decreases after 30 epochs, and the model was trained for 20 more epochs with the decreased learning rate. The equilibrium parameter $E$ was set to 0.05, 0.5 and 1.0. The learning rate parameter $r$ was set to 0.01, and $\alpha$ was set to 0.5. As inputs for the generator, the 500-dimensional uniform distribution, i.e. , and the binary encoded labels were employed. We used the leaky ReLU activation function for the generator, the discriminator and the classifier.
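The generator input described above, a 500-dimensional uniform noise vector together with binary encoded labels, can be sketched as follows. The range of the uniform distribution is not recoverable from the text, so the interval [-1, 1] is an assumption, as are the function name `sample_generator_input` and the plain-list representation.

```python
import random

Z_DIM = 500  # dimensionality of the noise vector, as stated in the text


def sample_generator_input(labels, rng, low=-1.0, high=1.0):
    """Return the concatenation of a noise vector z and binary labels l.

    The bounds `low` and `high` are assumptions; the source only says
    that z is drawn from a 500-dimensional uniform distribution.
    """
    z = [rng.uniform(low, high) for _ in range(Z_DIM)]
    l = [float(bit) for bit in labels]
    return z + l


rng = random.Random(0)
# Three hypothetical binary attributes, e.g. present, absent, present.
vector = sample_generator_input([1, 0, 1], rng)
print(len(vector))  # 503
```

Keeping `z` fixed while changing only the label part of the vector is what allows the row-wise comparisons in the figures, where each row shares one noise input.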

The classifier was pre-trained rather than trained simultaneously with the GAN structure and the decoder-encoder structure. After two epochs of pre-training, the classifier was fixed and received no further training during the training of the GAN structure.

Figure 4: Comparison between conditional GAN and ControlGAN. Images in each row are generated with the same input noise $z$. Each column denotes the corresponding labels. Two labels are used together in the last two columns. All images are the size of .

Figure 4 compares ControlGAN with conditional GAN. As shown in Figure 4, there is very little difference between labels/columns for conditional GAN. In general, the face images generated by ControlGAN follow the input labels well compared to those of conditional GAN. For example, with the ‘Arched Eyebrows’ label, all image samples generated by ControlGAN follow the label, while the face images generated by conditional GAN hardly show any difference.

As described in the previous section, ControlGAN can generate label-focused samples when a low value of $E$ is chosen. Selecting a low value of $E$, which means that the generated samples have a low classification error, makes the generator focus more on the input labels. Figure 4 shows the comparison between ControlGAN with different $E$. As shown in Figure 4, with low $E$, the generated images are significantly label-focused. In , the images generated with the condition of ’Pale Skin’ reach an unreal level for which similar genuine samples are rare or absent in the CelebA dataset.

This property is the main advantage of ControlGAN, since it shows that ControlGAN can generate samples beyond the training set. We describe this zero-shot property of ControlGAN further in Section 4.3.

4.2 Room image generation with LSUN dataset

In this section, in order to demonstrate generalizability of ControlGAN, ControlGAN was trained with a different dataset, which is a large scale scene dataset called LSUN (lsun, ). The dataset consists of ten different places, such as a bedroom and a restaurant. Among the places, we selected four different labels corresponding to indoor house rooms, i.e. ’Bedroom’, ’Dining room’, ’Kitchen’ and ’Living room’.

Figure 5: size of conditional room images generated by ControlGAN.

The architecture we used for this application is the same as in the previous section. We used a learning rate of , and $E$ was set to 0.05, 0.5 and 1.0. A pre-trained classifier was also used; it was trained for 0.1 epoch. The generator and the discriminator were trained for one epoch.

As shown in Figure 5, ControlGAN can learn the features of each room and successfully generate room images according to the labels. As expected, the equilibrium parameter $E$ decides the degree of concentration on the input labels, as in the previous application. In addition, we found that ControlGAN generates much clearer images with a low $E$. We conjecture that this occurs because the classifier assists the GAN training by forcibly mapping the features of labels onto the input $l$, as described in Section 2 and Figure 3.

4.3 Interpolation & extrapolation of labels

To demonstrate that ControlGAN learns the features and does not merely memorize the training set, the input labels are interpolated and extrapolated in this section. Note that the labels were one-hot or binary encoded during training, so the interpolated values were never seen in training.

To further demonstrate the effectiveness of ControlGAN, we used extrapolated values in the range [-1.0, 3.0]. Since the input labels were binary or one-hot encoded, and training was conducted with only the two values 0.0 and 1.0, we expected that, if ControlGAN learns the features well, the value 0.5 for the ’Smiling’ label would yield a half-smiling face image and the value 3.0 a fully smiling face image.
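The label sweep used for interpolation and extrapolation can be sketched as below. The function name `label_sweep`, the number of steps, and the three-label example are hypothetical; only the range [-1.0, 3.0] and the training values 0.0 and 1.0 come from the text.

```python
def label_sweep(base_labels, index, start=-1.0, stop=3.0, steps=9):
    """Return copies of `base_labels` with one label swept from start to stop.

    Values strictly between 0.0 and 1.0 are interpolations; values below
    0.0 or above 1.0 are extrapolations never seen during training.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    values = [start + i * (stop - start) / (steps - 1) for i in range(steps)]
    return [base_labels[:index] + [v] + base_labels[index + 1:] for v in values]


# Sweep the second of three hypothetical labels while the others stay fixed.
for labels in label_sweep([0.0, 1.0, 0.0], index=1):
    print(labels)
```

Each resulting label vector would be concatenated with the same noise input and passed to the generator, so that differences between the generated images are attributable to the swept label alone.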

Figure 6: Interpolation and extrapolation of labels. Note that the model was trained only with and . The values of , and in this figure were never used in the training process. $E$ is set to 0.05.

As shown in Figure 6, ControlGAN performs well with interpolated and extrapolated label values. Interestingly, the face image generated with label corresponds to a frowning or angry face, a feature that was never trained. Likewise, corresponds to dark skin, which is not among the labels of the CelebA dataset.

This result implies that ControlGAN can learn the attributes of the input labels. One can easily conjecture that the opposite of a smiling face might be a frowning or angry face; ControlGAN makes the same conjecture even though it has never been trained to do so.

5 Conclusion

In this paper, we proposed a generative model, called ControlGAN, which can effectively control generated samples. ControlGAN consists of three modules, i.e. a generator/decoder, a discriminator and a classifier/encoder. By mapping the corresponding features onto the input labels, the generated samples can be controlled according to the labels.

Although ControlGAN is a simple architecture that combines the vanilla GAN with a decoder-encoder structure, we demonstrated that it works well for generating conditional samples. We employed the DCGAN architecture to demonstrate the applications; however, the quality of the samples could be enhanced with a state-of-the-art GAN structure, such as StackGAN (r1, ), WGAN (wgan, ) or BEGAN (r3, ).

Furthermore, we demonstrated that ControlGAN works with zero-shot values by feeding interpolated and extrapolated values to the generator. Since the proposed architecture performs strongly in controlling generated samples, we expect that ControlGAN can contribute to research on generative models.

References

Appendix A List of Appendix

Figure 7: Comparison between conditional GAN and ControlGAN with , and . Images in each row are generated with the same input noise $z$. Each column denotes the corresponding labels. Two labels are used together in the last two columns. All images are the size of .

Figure 8: size of conditional room images generated by ControlGAN. $E$ is set to 0.05. Note that the model was trained for one epoch.

Figure 9: Interpolation of the ’Smiling’, ’Young’, ’Pale Skin’ and ’Male’ labels. Note that the model was trained only with and . The values of , and in this figure were never used in the training process. $E$ is set to 0.05.

Figure 10: Interpolation of the ’Attractive’, ’Big Lips’, ’Big Nose’ and ’Bushy Eyebrows’ labels. Note that the model was trained only with and . The values of , and in this figure were never used in the training process. $E$ is set to 0.05.

Figure 11: Interpolation of the ’Chubby’, ’Heavy Makeup’, ’Mouth Slightly Open’ and ’Pointy Nose’ labels. Note that the model was trained only with and . The values of , and in this figure were never used in the training process. $E$ is set to 0.05.