Human Annotations Improve GAN Performances

11/15/2019 ∙ by Juanyong Duan, et al. ∙ University of Minnesota ∙ National University of Singapore

Generative Adversarial Networks (GANs) have shown great success in many applications. In this work, we present a novel method that leverages human annotations to improve the quality of generated images. Unlike previous paradigms that directly ask annotators to distinguish between real and fake data in a straightforward way, we propose and annotate a set of carefully designed attributes that encode important image information at various levels, to understand the differences between fake and real images. Specifically, we have collected an annotated dataset that contains 600 fake images and 400 real images. These images are evaluated by 10 workers from the Amazon Mechanical Turk (AMT) based on eight carefully defined attributes. Statistical analyses have revealed different distributions of the proposed attributes between real and fake images. These attributes are shown to be useful in discriminating fake images from real ones, and deep neural networks are developed to automatically predict the attributes. We further utilize the information by integrating the attributes into GANs to generate better images. Experimental results evaluated by multiple metrics show performance improvement of the proposed model.


1 Introduction

Recently, generative adversarial networks (GANs) [13] have achieved impressive success in various applications [49, 46, 43, 28, 29]. A vanilla GAN contains a generator that maps a low-dimensional latent code into the target space, for example, the image space. Instead of estimating the likelihood of a generated sample, it employs a discriminator that judges how well the sample can be distinguished from real samples. The generator and the discriminator are optimized jointly in an adversarial manner until an equilibrium state is reached.
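As a point of reference, a minimal PyTorch-style sketch of one adversarial training step is given below; the `generator`, `discriminator`, optimizers, and `latent_dim` are placeholders rather than the specific networks used in this paper, and the non-saturating cross-entropy loss shown is only one common instantiation of the adversarial game.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=128):
    """One adversarial update: the discriminator learns to separate real from
    fake samples, then the generator learns to fool the discriminator."""
    batch_size = real_images.size(0)
    z = torch.randn(batch_size, latent_dim, device=real_images.device)

    # Discriminator update: push D(real) toward 1 and D(G(z)) toward 0.
    fake_images = generator(z).detach()
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: non-saturating loss, push D(G(z)) toward 1.
    fake_images = generator(z)
    d_fake = discriminator(fake_images)
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```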

However, the training dynamics of GANs are usually unstable, and the generator may output images that collapse to limited modes or are of low quality. In this work, we aim to study whether human annotations can improve the quality of the generated images. To achieve this, we have collected human annotations from the Amazon Mechanical Turk (AMT) on 1000 images, including 600 generated images (fake images) and, for reference, 400 natural images (real images). Figure 1 shows sample images from our dataset. We have defined eight attributes, namely, “color”, “illuminance”, “object”, “people”, “scene”, “texture”, “realism”, and “weirdness”. Each image has been annotated by 10 workers on a scale of 1 to 5 for each of the eight attributes. We have further analyzed the annotations and identified key features that contribute to image quality evaluation. Furthermore, we have constructed deep neural networks that predict the attributes of images to mimic human annotations. We find that integrating these attributes improves the quality of generated images as evaluated by multiple metrics.

(a) Real images
(b) Fake images
Figure 1: Sample images from our dataset.

Our core insight is that human annotations can encode image samples as prior knowledge that helps the discriminator identify fake images. Concretely, our contributions are twofold:

  • We collect a new dataset of 1000 images with human annotations to study the differences between real and fake images, together with a thorough analysis. In addition, we train an attribute net that mimics human subjects in annotating the attributes of new images.

  • We propose a new paradigm that incorporates the attribute net into the adversarial networks to improve the quality of generated images. Our paradigm can be applied to different GAN architectures, and experiments show improvements in multiple metrics over baseline models.

2 Related Work

Generative Adversarial Networks (GANs)

GANs are a class of generative models that transform a known distribution (e.g., a normal distribution) into an unknown distribution (e.g., the image distribution). The two components, a generator and a discriminator, serve different purposes. Informally, the generator is trained to fool the discriminator, i.e., to increase its error rate during training, while the discriminator is trained to tell generated samples apart from real samples. GANs have achieved impressive success in various applications, such as image generation [38, 47], image editing [48], representation learning [31], super-resolution [26], and domain transfer [49, 39, 18].

Variants of GAN

The vanilla GAN has many drawbacks, such as easily collapsing to a single mode and generating nonsensical outputs. Radford et al. [38] propose Deep Convolutional GANs (DCGANs) to produce more meaningful results. Earlier attempts to scale up GANs with convolutional nets had been unsuccessful; they summarize several architectural guidelines for stable deep convolutional GANs, and experiments on several benchmark datasets demonstrate the effectiveness of the proposed guidelines. Another line of variants focuses on the training objective. Mao et al. [30] propose a least-squares loss that mitigates the vanishing gradient problem when updating the generator.

Inspired by theories in optimal transport, Arjovsky et al. [2] propose the Wasserstein GAN, which casts GAN optimization as minimizing the Wasserstein distance between the distributions of real and fake images. The loss function requires the critic (discriminator) to be 1-Lipschitz. To achieve this, all the weights of the model are clipped to a fixed range, but this causes most weights to fall on the boundary of the clipping interval. To make the weights distribute more naturally, Gulrajani et al. [14] use a gradient penalty to regularize the discriminator toward the 1-Lipschitz property. Experiments have shown that Wasserstein GAN outperforms previous GAN variants and that the gradient penalty performs better than weight clipping.

Recently, two large-scale models, BigGAN [3] and StyleGAN [22], have been proposed to synthesize images with high fidelity. BigGAN scales up the traditional GAN and applies orthogonal regularization to the generator for finer control over the trade-off between sample fidelity and variety. StyleGAN borrows from style transfer, manipulating the style of latent codes to control the strength of image features at different scales. However, both models require substantial computational resources.

3 Dataset

In this section, we introduce the method used to generate the fake stimuli, including training details. We also describe the protocol for collecting data from AMT.

3.1 Network Architecture

The generator’s architecture is shown in Figure 2(a). We followed the implementation in [14] to generate decent samples. The network uses 4 residual modules. Each residual module has an upsample convolution layer that uses a sub-pixel CNN [41] for feature map upscaling. The shortcut path of each residual module is another upsample convolution layer with the same output dimension as the one in the main path. At the end of the network, an additional convolution layer and an output activation function produce the final image. The discriminator’s architecture is shown in Figure 2(b). The discriminator also has 4 residual modules. It uses average pooling to downscale the feature maps by a factor of 2 in the width and height dimensions. The shortcut path in the discriminator contains an average pooling layer and a convolution layer; the average pooling layer downscales the feature maps by a factor of 2, while the convolution layer outputs feature maps with the same dimension as the main path. At the end of the network, a linear transformation maps the output to a single scalar that represents the probability that the sample came from the data distribution rather than the generator distribution.

(a) Generator’s architecture.
(b) Discriminator’s architecture.
Figure 2: Generator and discriminator architectures. k3n512s1 means that the kernel size is 3, the output dimension is 512, and the stride is 1, for both the upsample convolution layers and the convolution layers.
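As one possible reading of the residual upsampling blocks described above, the following PyTorch sketch pairs a sub-pixel (PixelShuffle) convolution on the main path with a matching upsample convolution on the shortcut; channel counts, normalization, and layer ordering are illustrative assumptions rather than the exact configuration of Figure 2.

```python
import torch.nn as nn

class UpsampleConv(nn.Module):
    """Sub-pixel upsampling: a convolution produces 4x the channels,
    then PixelShuffle rearranges them into a 2x larger feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return self.shuffle(self.conv(x))

class ResBlockUp(nn.Module):
    """Residual module whose main path and shortcut both upsample by 2,
    so the two branches can be summed."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(),
            UpsampleConv(in_ch, out_ch),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        self.shortcut = UpsampleConv(in_ch, out_ch)

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```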

The vanilla GAN may produce poor results due to unstable training, which may occur quite often [33, 36, 40, 2]. Therefore, we use the Wasserstein GAN (WGAN) with gradient penalty paradigm [14]. The objective function is defined by

L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]   (1)

where \tilde{x} \sim \mathbb{P}_g is drawn from the generator distribution, x \sim \mathbb{P}_r is drawn from the real image distribution, and \hat{x} = \epsilon x + (1 - \epsilon)\tilde{x} is a linear interpolation with a random number \epsilon \sim U[0, 1]. \lambda is the gradient penalty coefficient.
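A standard PyTorch implementation of the gradient penalty term in Equation (1), sketched here under the usual WGAN-GP conventions (λ = 10, ε uniform on [0, 1]), looks roughly as follows.

```python
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: E[(||grad_D(x_hat)||_2 - 1)^2] on random interpolates."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_hat = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```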

3.2 Training Details

The training procedure follows [14]. We initialized the weights with He’s normal initialization [15]. We set the gradient penalty coefficient to 10 and updated the discriminator once every 5 iterations. All the models were optimized with the Adam optimizer. We used a batch size of 64 and trained the models for 200,000 iterations. Real images were sampled from the small ImageNet dataset at a resolution of 64 by 64 pixels [43, 8].

3.3 Attributes to Annotate

To accurately compare the quality of fake stimuli with real stimuli, we define a set of 8 attributes from different aspects. The attributes are listed in Table 1.

Pixel-level image attributes, such as color, intensity, and orientation, are low-level features for saliency detection and are biologically plausible [19]. These features may influence the perceived quality of fake images. However, the stimuli are very small (64 by 64 pixels), so detailed information cannot be perceived easily; we therefore only include color here. Illuminance, on the other hand, is an important attribute that differs between real and computer-generated images [34, 32], so we asked annotators to describe their impression of the illuminance of the stimuli.

Human attention tends to be drawn to objects related to humans, such as faces [21, 4], emotion [1], and crowds [20], and is also easily attracted by moving objects [24, 45]. In addition, a key criterion is whether GANs can generate human-recognizable objects in the image. Hence, we include object and people in the attribute list.

The classification of indoor/outdoor scenes is an important problem in computer vision [35]. It has a well-defined constraint that an indoor scene is inside a man-made structure [6]. The same semantic object may not help decide whether an image is indoor or outdoor, such as an indoor versus an outdoor swimming pool. In addition, some outdoor images, such as sky, ocean, and grassland, are easier to generate. As a result, we include scene in the attribute list and ask subjects to identify whether the image is perceived as an indoor or an outdoor scene.

In some previous image synthesis models, repeated patterns are frequently observed in synthesized images [37]. It is possible for GANs to generate such patterns, so we define this feature as texture and include it in our attribute list.

Realism is included as well. We also add weirdness to the set, defined as any unnatural feature or object, which may be common in generated images. Perception may be affected by the size of the image [9]; real images may also contain objects that could be perceived as strange when the image is small.

Attribute Description
Color colorfulness, pixel value distribution
Illuminance light effect, shadows, brightness
Object objects in the image excluding humans, like car, animal, or furniture
People humans in the image
Scene outdoor rather than indoor scene
Texture repeated pattern
Realism overall naturalness, real or computer generated
Weirdness any unnatural feature, such as strange objects
Table 1: List of attributes and descriptions

3.4 Stimuli and Data Collection

Figure 3: User interface in the AMT task.

The dataset contains 1000 images. We used the generator trained above to produce 600 fake images with a resolution of 64 by 64 pixels. Another 400 real images were randomly selected from the small ImageNet dataset. All the images from this dataset are downsampled from the original ImageNet dataset. Images from small ImageNet are similar to CIFAR-10 images [42], but have greater variety [43, 8]. We requested workers from AMT to annotate the images. The user interface for the task is shown in Figure 3.

Each worker was asked to annotate a total of 20 images in one assignment. Participants were asked to judge whether the keyword (attribute) described the stimulus. There were five choices: “definitely yes” (5), “probably yes” (4), “not sure” (3), “probably no” (2), and “definitely no” (1). Each choice was then converted to the numerical score in parentheses. To ensure annotation quality, we set up two requirements. First, each participant had to have an overall approval rate above a threshold. Second, at the end of the annotation, participants annotated five extra images selected from the images they had annotated in the current assignment, without being allowed to see their previous selections. If the score difference between the two annotations was large, the assignment was rejected. For each image, we collected annotations from 10 participants, all of whom met the requirements. We computed the score vector for each image by averaging the annotations from the 10 participants.
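For reference, mapping the five choices to scores and averaging the 10 worker responses per image is a simple aggregation; the pandas sketch below uses hypothetical column names (`image_id`, `attribute`, `choice`).

```python
import pandas as pd

CHOICE_TO_SCORE = {"definitely yes": 5, "probably yes": 4, "not sure": 3,
                   "probably no": 2, "definitely no": 1}

def aggregate_annotations(raw):
    """raw: DataFrame with columns [image_id, attribute, choice], one row per
    worker response. Returns one averaged score per (image, attribute)."""
    raw = raw.assign(score=raw["choice"].map(CHOICE_TO_SCORE))
    # Average over the 10 workers, then pivot into an image x attribute matrix.
    return raw.pivot_table(index="image_id", columns="attribute",
                           values="score", aggfunc="mean")
```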

4 Data Analysis

In this section, we summarize the statistical and factor analyses of the annotated data.

4.1 Statistical Analysis

4.1.1 Group means

We examined the sample mean of each attribute for real and fake images and applied a z-test to compare the two groups. Figure 4 and Table 2 summarize the mean and standard deviation of each attribute.

Real images have higher scores in illuminance, object, people, and realism features, while fake images have higher scores in texture and weirdness features. There are no significant differences in color and scene features between real and fake images.
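The group comparison can be reproduced with a two-sample z-test on the per-image scores; in the sketch below (plain NumPy/SciPy, assuming one score array per group), a z of 0.87 gives a two-sided p of about 0.39, matching the Color row of Table 2.

```python
import numpy as np
from scipy.stats import norm

def two_sample_ztest(real_scores, fake_scores):
    """Two-sided z-test for a difference in means between two independent samples."""
    m1, m2 = real_scores.mean(), fake_scores.mean()
    se = np.sqrt(real_scores.var(ddof=1) / len(real_scores) +
                 fake_scores.var(ddof=1) / len(fake_scores))
    z = (m1 - m2) / se
    p = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p
```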

Figure 4: Bar plot of group means. ‘*’ indicates a statistically significant difference between real and fake images.

This observation shows that WGANs can generate colorful images with color spectra similar to those of real images. However, WGANs are less capable of generating meaningful objects or humans, and fake images tend to be perceived as repeated patterns. Not surprisingly, fake images are rated weirder and less real.

Attribute | z-value | p-value
Color | 0.87 | 0.39
Illuminance | 17.38 |
Object | 13.44 |
People | 3.59 |
Realism | 28.85 |
Scene | -0.21 | 0.83
Texture | -9.28 |
Weirdness | -30.00 |
Table 2: Summary of statistical results: z-values and p-values of the group comparison for each attribute. Group means and standard deviations are plotted in Figure 4.
(a) All images.
(b) Real images.
(c) Fake images.
Figure 5: Correlation matrices between attributes.

4.1.2 Correlation

We analyzed relationships between attributes by computing Pearson correlation coefficients. We first computed correlations over all images, and then separately for real images and fake images. Figure 5 shows the results; a blank cell indicates an insignificant correlation. We observe that illuminance is highly correlated with realism over all images, and the two attributes are also moderately correlated within each group. This shows that illuminance is an important factor for realism, which is consistent with previous findings [34, 32, 10, 11]. Object and realism are also correlated across all images, in line with previous findings that images with more objects are perceived as more realistic [11, 25, 7]. For real images, object and people are negatively correlated: the stimuli are very small, so a single image is unlikely to contain both objects and people. For fake images, however, the correlation between them is insignificant, which indicates that WGANs are not likely to generate meaningful objects or humans. As expected, realism is negatively correlated with weirdness.
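The correlation matrices of Figure 5 can be recomputed by taking pairwise Pearson coefficients and blanking out insignificant entries; the sketch below assumes a 0.05 significance threshold and an images-by-attributes score matrix.

```python
import numpy as np
from scipy.stats import pearsonr

def masked_correlation(scores, alpha=0.05):
    """scores: (n_images, n_attributes) array of averaged annotation scores.
    Returns an n x n correlation matrix with insignificant cells set to NaN."""
    n = scores.shape[1]
    corr = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            r, p = pearsonr(scores[:, i], scores[:, j])
            if p < alpha:
                corr[i, j] = r
    return corr
```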

4.2 Factor Analysis

Factor analysis (FA) is a statistical method that describes observed variables in terms of latent, unobserved factors. FA is similar to principal component analysis (PCA) in that both are dimensionality reduction methods; however, the components in PCA must be orthogonal to maximize the total variance, whereas the factors in FA are not necessarily orthogonal and may correlate with each other. We applied exploratory factor analysis (EFA) followed by confirmatory factor analysis (CFA) on the whole dataset. EFA identifies latent factors as linear combinations of the observed variables, while CFA tests how well the model fits the data. Attributes with poor fits (low loadings) are eliminated.
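A two-factor exploratory model of the eight attribute scores can be fitted with standard tooling; the sketch below uses scikit-learn's FactorAnalysis as a stand-in for the EFA step (the CFA step and the fit indices reported below would require a dedicated structural equation modeling package).

```python
from sklearn.decomposition import FactorAnalysis

ATTRIBUTES = ["color", "illuminance", "object", "people",
              "scene", "texture", "realism", "weirdness"]

def exploratory_factors(scores, n_factors=2):
    """scores: (n_images, 8) matrix of averaged attribute annotations.
    Returns the factor loadings with shape (n_factors, 8)."""
    scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # standardize
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    fa.fit(scores)
    return fa.components_  # rows = latent factors, columns = attribute loadings
```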

The model parameters and structure are presented in Figure 6. To estimate the fit of the model, two common indices are used. The first is the Comparative Fit Index (CFI), which compares the chi-square of the target model to the chi-square of an independence model, i.e., one in which the variables are uncorrelated. Higher CFI values indicate better model fit; values approaching 0.90 indicate acceptable fit [11, 23]. The second is the Root Mean Square Error of Approximation (RMSEA), which estimates the amount of approximation error per model degree of freedom and takes sample size into account. Smaller RMSEA values suggest better model fit; a value of 0.10 or less indicates acceptable fit [11, 23]. Our CFA model has acceptable fit, with CFI = 0.97 and RMSEA = 0.117.

As indicated in Figure 6, we identified two latent factors. “Latent factor 1” is measured by realism, illuminance, and object, while “latent factor 2” is measured by weirdness and texture. The result is consistent with the analysis in Section 4.1: realism, illuminance, and object are important for perceived reality, while weirdness and texture are characteristic of fake images. Hence, “latent factor 1” is associated with realness while “latent factor 2” is associated with fakeness.

Figure 6: Results for factor analysis. The numbers are parameters estimated in the model.

5 Improving GAN with Attributes

In this section, we show that the annotated attributes can be used to improve the quality of generated images. We first train an attribute net that mimics human subjects in annotating the attributes of images, and then describe the structure of our model and explain how it utilizes these attributes. Quantitative results show that our model outperforms baseline models. We also provide some qualitative examples and conclude with a brief discussion.

5.1 Models

Figure 7: Proposed model. The input is a random vector of dimension 128. The generator takes it as input and outputs a sample in the image space. The sample is then fed into the attribute net and the discriminator simultaneously, and their outputs are concatenated. A linear layer then computes the final scalar output, which is used to compute the loss. The detailed architectures of the generator, the discriminator, and the attribute net may vary and are discussed in Section 5.

Our model is shown in Figure 7. It consists of a generator, a discriminator, and an attribute net. The generator accepts a random sample drawn from a prior, e.g., a normal distribution, and outputs an image sample, which is then fed into the discriminator and the attribute net simultaneously. The attribute net takes an image as input and outputs the attributes. The outputs of the discriminator and the attribute net are concatenated and fed into a fully connected layer to compute the final output. All the components of our model can be implemented in different ways.
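A minimal PyTorch sketch of this combination might look as follows; the module names are illustrative, and freezing the attribute net follows the training setup described in Section 7.2.

```python
import torch
import torch.nn as nn

class AttributeAugmentedDiscriminator(nn.Module):
    """Wraps a base discriminator and a pretrained attribute net; their outputs
    are concatenated and reduced to a single scalar by a linear layer."""
    def __init__(self, base_discriminator, attribute_net, n_attributes=8):
        super().__init__()
        self.base = base_discriminator          # outputs shape (batch, 1)
        self.attr = attribute_net               # outputs shape (batch, n_attributes)
        for p in self.attr.parameters():        # attribute net stays fixed
            p.requires_grad = False
        self.fuse = nn.Linear(1 + n_attributes, 1)

    def forward(self, images):
        d_out = self.base(images)
        a_out = self.attr(images)
        return self.fuse(torch.cat([d_out, a_out], dim=1))
```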

Attribute Net

We implement the attribute net with three models: VGG-16, ResNet-50, and DenseNet-169. In addition, we conduct an experiment that substitutes the attribute net’s output with random noise. The three architectures let us check the effectiveness of different networks, while the random noise baseline tests whether only semantically meaningful vectors can improve the quality of generated images.

We train all the attribute nets on the annotated data. The dataset is randomly split into training and validation sets, each containing 500 images. All the images are resized to a fixed resolution and normalized. The loss function is the mean squared error between predicted and annotated values. We train each model for a maximum of 300 epochs; training is stopped either when the maximum epoch is reached or when the loss plateaus on the validation set. All the models are optimized with stochastic gradient descent with a mini-batch size of 16.
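A training loop along these lines, regressing the eight averaged scores with an MSE loss and stopping when the validation loss plateaus, might be sketched as follows; the data loaders, learning rate, and patience value are assumptions.

```python
import torch
import torch.nn as nn

def train_attribute_net(model, train_loader, val_loader, max_epochs=300, patience=10):
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    best_val, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, targets in train_loader:          # targets: (batch, 8) scores
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        # Early stopping when the validation loss stops improving.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val_loss < best_val - 1e-4:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
```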

Adversarial Net

We test the attribute net on three GAN variants, WGAN, DCGAN, and LSGAN. For a fair comparison, we use the original loss function and network architecture for each GAN.

5.2 Datasets and Evaluation Metrics

As our annotations are obtained on the down-sampled ImageNet dataset, we first evaluate our model on it. We also evaluate our model on the CIFAR-10 dataset, which consists of 60,000 tiny images with a resolution of 32 by 32 pixels and is widely used for GAN studies. Although we did not collect annotations on CIFAR-10, we wish to evaluate whether the annotated information is transferable.

We adopt three evaluation metrics commonly used in previous works, namely the Inception Score [40], the Mode Score [5], and the Fréchet Inception Distance (FID) [16]. The Inception Score evaluates the KL divergence between the conditional label distribution computed by an Inception model pretrained on ImageNet and the marginal distribution of category labels. The Mode Score is an improved version of the Inception Score; it has an additional term that computes the KL divergence between the marginal label distribution of generated samples and the data label distribution. Finally, the FID is the distance between two Gaussian approximations of the real and generated feature distributions under a predefined feature function (an Inception embedding). Let \mu_r and \mu_g be the empirical means, and \Sigma_r and \Sigma_g be the empirical covariances of the real and generated features, respectively. Then the Fréchet distance is defined as

FID = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right).
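Given Inception features for real and generated samples, the Fréchet distance above can be computed in a few lines of NumPy/SciPy, as in the sketch below.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """feats_*: (n_samples, feat_dim) arrays of Inception features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) +
                 np.trace(sigma_r + sigma_g - 2 * covmean))
```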

5.3 Quantitative Results

GAN Type Inception Score Mode Score FID
WGAN-WC [2]
WGAN-GP [14]
LSGAN [30]
DCGAN [38]
CTGAN (as reported in [44]) - -
WGAN+VGG
WGAN+ResNet
WGAN+DenseNet
WGAN+Random Noise
DCGAN+VGG
LSGAN+VGG
Table 3: Quantitative results of unsupervised training on ImageNet. Best results are shown in bold. For inception score and mode score, higher score represents better quality. Lower FID indicates the fake distribution is closer to the real distribution.

Quantitative results for models trained on ImageNet are summarized in Table 3.

5.3.1 Training with Different Attribute Nets

Figure 8: MSE loss curve of the attribute net. VGG predicts the attributes more accurately than the other two models.

We observe that VGG achieves the best performance among the different attribute net architectures. VGG learns feature representations in a hierarchical way, meaning that higher layers learn ensembles of features from lower layers, whereas ResNet and DenseNet are designed to learn residuals, so successive layers refine previous layers. The latter paradigm may quickly learn representations for object classification but may perform worse in transfer learning or fine-tuning tasks. In other tasks such as style transfer [27, 17, 12], VGG is more popular than ResNet for extracting features because it requires fewer parameter-tuning tricks and converges faster than ResNet and DenseNet. In our case, we are fine-tuning a pretrained model on a small dataset, so VGG may perform better than the other two models. Figure 8 shows the loss of each model at its stopping point; VGG-16 has the lowest loss, which indicates that it predicts the attributes more accurately.

5.3.2 Training with Different GAN Variants

We examine whether our proposed attributes are useful for different types of GANs. The results show that for all three types, WGAN, DCGAN, and LSGAN, integrating the attributes leads to better performance. This is possibly because the proposed attributes represent higher-level semantics that might not be learned directly from images; the additional information may make it easier for the discriminator to identify fake samples.

To verify that only vectors with semantic meaning can improve the performance of GANs, we assign a random vector drawn from a normal distribution to each image. As shown in Table 3, the inception score drops significantly in this case. This indicates that the discriminator performs better only when the input vector carries semantic meaning; a random vector holds little information about the input sample and may interfere with the discriminator’s prediction, which causes the drop in the evaluation metrics.

5.3.3 Training on Different Datasets

To examine whether the attributes generalize to other datasets, we train the model on CIFAR-10. However, as shown in Table 4, the inception score is lower than that of the WGAN-GP baseline, which indicates that the attributes may be distributed inconsistently across datasets. One possible reason is that the image resolutions of the two datasets differ, so the attribute net fails to compute correct attribute scores for real CIFAR-10 images and the corresponding generated images. Consequently, the attribute scores may interfere with the discriminator’s prediction, much like a random vector.

Model Inception Score
WGAN-GP
WGAN+VGG
Table 4: CIFAR-10 Results

5.4 Qualitative Results

We show some images generated by the WGAN+VGG combination (Figure 9). More samples are shown in the supplementary materials. We analyze the generated images qualitatively from three aspects: pixels, diversity, and reality.

Pixels

We examine pixel values in terms of two factors, color and sharpness. The colors look natural and diverse in all images, and the color distribution is consistent with natural images. Sharpness means that the differences between adjacent pixels are large; a sharper image looks less blurry, and edges or boundaries can be identified more easily, so images containing recognizable objects are usually sharp. As we can see from the sample images, almost all of them look sharp, which means that our model captures this property successfully.

Diversity

A common failure of GANs is mode collapse, where the same image is generated for different latent vectors. Hence, diversity is an important factor in evaluating GANs. The generated samples are quite diverse, and we can hardly find identical images among them.

Reality

Reality indicates whether the image contains recognizable objects. Unfortunately, many images contain meaningless color blotches, but a few images do contain recognizable, if distorted, dogs and cats, such as the second and the seventh images in the first row of Figure 9.

Figure 9: Qualitative Results.

6 Conclusion and Future Work

In this project, we built a new dataset of annotated images to characterize generated images. Comprehensive analyses show that real images contain more semantic objects, have better illuminance, and are perceived as more real than fake images, while fake images tend to be perceived as weirder and more like repeated patterns. Further, a deep neural network is trained to predict the attributes automatically, and we integrate the trained attribute net into the discriminator of a GAN to improve its performance. For future studies, a larger dataset could be built with more structured attributes for a more comprehensive analysis. Moreover, the annotated attributes may be used for conditional training or disentangled feature learning.

References

7 Supplementary Materials

7.1 Training Different Attribute Nets

We implement the attribute net with three settings: VGG-16, ResNet-50, and DenseNet-169. In addition, we conduct an experiment that substitutes the attribute net’s output with random noise. The three settings let us check the effectiveness of different network architectures, while the random noise baseline tests whether only semantically meaningful vectors can improve the quality of generated images.

We train all the attribute nets on the annotated data. The dataset is randomly split into training and validation sets, each containing 500 images. All the images are resized to a fixed resolution and normalized. The loss function is the mean squared error between predicted and annotated values. We train each model for a maximum of 300 epochs; training is stopped either when the maximum epoch is reached or when the loss plateaus on the validation set. All the models are optimized with stochastic gradient descent.

VGG-16.

The base model is a VGG net with 16 layers pretrained on ImageNet. The fully connected classifier is replaced with a two-layer feed-forward neural network; the output dimensions of the two layers are 512 and 8, respectively. We use an initial learning rate of 0.02, a learning rate decay of 0.0001, and a momentum of 0.9. The learning rate is halved every 50 epochs. The mini-batch size for each iteration is 8. The model is trained on an NVIDIA Titan Black GPU with 6 GB of memory; training takes about 12 hours to complete.
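The head replacement and schedule described above could be set up roughly as follows with torchvision's VGG-16; treating the quoted "learning rate decay of 0.0001" as weight decay is an assumption of this sketch.

```python
import torch.nn as nn
import torchvision.models as models
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

def build_vgg_attribute_net():
    """VGG-16 backbone with the classifier replaced by a 512 -> 8 regression head."""
    vgg = models.vgg16(pretrained=True)
    in_features = vgg.classifier[0].in_features      # 25088 for 224x224 inputs
    vgg.classifier = nn.Sequential(
        nn.Linear(in_features, 512), nn.ReLU(),
        nn.Linear(512, 8),
    )
    return vgg

model = build_vgg_attribute_net()
# "learning rate decay" is interpreted here as weight decay (an assumption).
optimizer = SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=0.0001)
scheduler = StepLR(optimizer, step_size=50, gamma=0.5)   # halve the LR every 50 epochs
```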

ResNet-50. The base model is a residual net with 50 layers pretrained on ImageNet. We change the output dimension of the last layer to 8. We use an initial learning rate of 0.01 and a momentum of 0.9. The learning rate is halved every 50 epochs. The mini-batch size for each iteration is 8. The model is trained on an NVIDIA Titan Black GPU with 6 GB of memory; training takes about 10 hours to complete.

DenseNet-169. The base model is a dense net with 169 layers pretrained on ImageNet. We change the output dimension of the last layer to 8. We use an initial learning rate of 0.01 and a momentum of 0.9. The learning rate is halved every 50 epochs. The mini-batch size for each iteration is 8. The model is trained on an NVIDIA Titan Black GPU with 6 GB of memory; training takes about 10 hours to complete.

Random Noise. Finally, we disable the attribute net and replace its output with a random noise vector drawn from a normal distribution.

7.2 Training with Different Types of GANs

We test the attribute net on three GAN variants: WGAN, DCGAN, and LSGAN. For a fair comparison, we use the original loss function and network architecture for each GAN. We train all the models on the small ImageNet dataset, using the training set of 1.28 million images. To monitor overfitting, we compute convergence curves of the discriminator’s value on both the training set and a test set that contains 50 thousand images. All the images have a resolution of 64 by 64 pixels and are normalized without resizing before being fed to the discriminator. We call the original GAN variant the vanilla model and the GAN variant with the attribute net the modified model.

WGAN. The architectures of the generator and discriminator are the same as shown in Figure 2. The training objective is Equation (1) with a gradient penalty coefficient of 10. We use the gradient penalty to regularize the generator and discriminator while keeping the attribute net fixed. We train the model for a fixed number of iterations. We use the Adam optimizer to train the model. The initial learning rate is 0.0001 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini-batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12 GB of memory; training takes about 3 days for the vanilla model and 4 days for the modified model.

DCGAN. Let k3n256s2 denote a convolutional block with a kernel size of 3, 256 filters, and a stride of 2 (the same notation as in Figure 2). d256 denotes a convolutional layer with 256 filters and stride 2, and fc4×4×512 denotes a fully connected layer whose output is reshaped into a 4×4×512 tensor. The architecture of DCGAN is defined as follows.

  • Generator: fc4×4×512, d256, d128, d64, d3, tanh.

  • Discriminator: k5n64s2, k5n128s2, k5n256s2, k5n512s2, fc1

The objective function to optimize is

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].

We use the Adam optimizer to train the model. Momentum terms are set to 0.5 and 0.999 respectively. We train the model for 200000 iterations. The initial learning rate is 0.0002 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12GB memory. The training takes about 2.5 days for the vanilla model and 3.5 days for the modified model.

LSGAN. LSGAN uses the same network architecture as DCGAN but with different objective functions to optimize:

\min_D \; \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[(D(x) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[D(G(z))^2\big]

and

\min_G \; \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)) - 1)^2\big].

We use RMSProp to train the model for 200,000 iterations. The initial learning rate is 0.0001 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini-batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12 GB of memory; training takes about 2.5 days for the vanilla model and 3 days for the modified model.
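For completeness, the least-squares objectives above translate directly into code; the sketch below assumes the discriminator outputs raw (unbounded) scores on batches of real and generated samples.

```python
def lsgan_losses(d_real, d_fake):
    """Least-squares GAN losses with the 0/1 coding used above:
    real targets 1 and fake targets 0 for D; target 1 for G."""
    d_loss = 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()
    g_loss = 0.5 * ((d_fake - 1.0) ** 2).mean()
    return d_loss, g_loss
```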

7.3 Training on Different Datasets

We also evaluate the attribute net on the CIFAR-10 dataset. CIFAR-10 contains 60,000 images with a size of 32 by 32 pixels. We use the training set of 50,000 images to train the discriminator and the remaining 10,000 images for validation. We evaluate WGAN with VGG as the attribute net on CIFAR-10, training the model for 200,000 iterations. We use the gradient penalty to regularize the gradient norms, with a regularization coefficient of 10. The initial learning rate is 0.0001 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini-batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12 GB of memory; training takes about 2 days.

Figure 10: More samples generated by our model (WGAN+VGG16).
Figure 11: More samples generated by our model (WGAN+VGG16).
(a) WGAN
(b) DCGAN
(c) LSGAN
Figure 12: Training curves. 12(a) shows that the model converges fastest when the attribute net is VGG; when the attribute vector is replaced by a random vector, the inception score drops significantly. 12(b) and 12(c) show that adding an attribute net also improves the performance of the other GAN variants.