A vanilla GAN contains a generator that maps a low-dimensional latent code into the target space, for example, the image space. Instead of estimating the likelihood of a generated sample, it employs a discriminator to judge how difficult the sample is to distinguish from real samples. The generator and the discriminator are optimized jointly in an adversarial manner until an equilibrium is reached.
However, the training dynamics of GAN are usually unstable and the generator may output images that collapse to limited modes or with low quality. In this work, we aim to study whether human annotations will improve the quality of the generated images. To achieve this, we have collected human annotated data from the Amazon Mechanical Turk (AMT) on 1000 images, including 600 generated images (fake images), and for reference, 400 realistic images (real images). Figure 1 shows sample images from our dataset. We have defined eight attributes, namely, “color”, “illuminance”, “object”, “people”, “scene”, “texture”, “realism”, and “weirdness”. Each image has been annotated by 10 workers with a scale of 1 to 5 on the eight attributes. We have further analyzed the annotations and identified key features that contribute to image quality evaluation. Furthermore, we have also constructed deep neural networks to predict attributes of images to mimic human annotations. We find that integrating these attributes can improve the quality of generated images as evaluated by multiple metrics.
Our core insight is that we can leverage human annotations to encode image samples by treating them as prior knowledge to help the discriminator figure out fake images. Concretely, our contributions are twofold:
We collect a new dataset of 1000 images with human annotations and thoroughly analyze the differences between real and fake images. In addition, we train an attribute net that mimics human subjects in annotating the attributes of new images.
We propose a new paradigm that includes the attribute net in the adversarial networks to improve the quality of generated images. Our paradigm can be applied to different GAN architectures and experiments have shown improvements in multiple metrics against baseline models.
2 Related Work
Generative Adversarial Networks (GANs)
GANs are a class of generative models that transform a known distribution (e.g. the normal distribution) into an unknown distribution (e.g. the image distribution). The two components, a generator and a discriminator, serve different purposes. Informally, the generator is trained to fool the discriminator, that is, to increase the error rate of the discriminator during training, while the discriminator is trained to distinguish generated samples from real samples. GANs have achieved impressive success in various applications, such as image generation [38, 47], image editing, representation learning [26], and domain transfer [49, 39, 18].
Variants of GAN
The vanilla GAN has many drawbacks, such as easily collapsing to a single point and generating nonsensical outputs. Radford et al. propose Deep Convolutional GANs (DCGANs) to produce more meaningful results. Although many earlier attempts to scale up GANs with convolutional nets were unsuccessful, Radford et al. summarize several architectural guidelines for stable deep convolutional GANs. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed guidelines. Another variant of the vanilla GAN focuses on the training objective. Mao et al. propose a least squares loss that mitigates the vanishing gradient problem when updating the generator.
Inspired by theories in optimal transport, Arjovsky et al. propose the Wasserstein GAN, which treats GAN optimization as minimizing the Wasserstein distance between the distributions of real and fake images. The loss function requires the discriminator to be 1-Lipschitz. To achieve this, all the weights of the model are clipped to a fixed range, but this method makes most of the weights fall on the boundary of the clipping interval. To make the weights distribute more naturally, Gulrajani et al. use a gradient penalty on the discriminator to enforce the 1-Lipschitz property. Experiments have shown that Wasserstein GAN outperforms previous GAN variants and that the gradient penalty performs better than weight clipping.
Recently, two large-scale models, BigGAN and StyleGAN, have been proposed to synthesize images with high fidelity. BigGAN scales up the traditional GAN and applies orthogonal regularization to the generator for finer control over the trade-off between sample fidelity and variety. StyleGAN adopts style transfer methods, changing the styles of latent codes to improve control over the strength of image features at different scales. However, both models require substantial computational resources.
In this section, we introduce the method used to generate the fake stimuli, including training details. We also describe the protocol for collecting data from AMT.
3.1 Network Architecture
The generator’s architecture is shown in Figure 2(a). We followed the implementation in  to generate decent samples. The network uses 4 blocks of residual modules. Each residual module has an upsample convolution layer that uses a sub-pixel CNN  for feature map upscaling. The shortcut path for each residual module is another upsample convolution layer that has the same output dimension as the one in the main path. At the end of the network, we use an additional layer of convolution and a function to output the final image. The discriminator’s architecture is shown in Figure 2(b). The discriminator also has 4 residual modules. It uses average pooling to downscale the feature maps by a factor of 2 in the width and height dimensions. The shortcut path in the discriminator contains an average pooling layer and a convolution layer. The average pooling layer downscales feature maps by a factor of 2, while the convolution layer outputs feature maps that have the same dimension as the main path. At the end of the net, a linear transformation transforms the output to a single scalar that represents the probability that the sample came from the data distribution rather than the noise distribution.
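Figure 2: Network architectures. (a) Generator’s architecture. (b) Discriminator’s architecture. The label attached to each layer gives its kernel size, output dimension, and stride (for example, a kernel size of 3, an output dimension of 512, and a stride of 1), for both the upsample convolution layers and the convolution layers.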
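The vanilla GAN may produce poor results due to unstable training, which occurs quite often [33, 36, 40, 2]. Therefore, we use the Wasserstein GAN (WGAN) with gradient penalty paradigm. The objective function is defined by

$$L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],$$

where $\tilde{x}$ is drawn from the generator distribution $\mathbb{P}_g$, $x$ is drawn from the real image distribution $\mathbb{P}_r$, and $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ is a linear interpolation with a random number $\epsilon \in [0, 1]$. $\lambda$ is the gradient penalty coefficient.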
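To make the structure of the WGAN with gradient penalty objective concrete, the following is a minimal sketch in plain Python. It is not the training code used in this work: the critic is any scalar function of a list of floats, the gradient is approximated by central differences rather than automatic differentiation, and the names `numerical_gradient` and `wgan_gp_critic_loss` are hypothetical. It assumes equally sized real and fake batches.

```python
import math
import random


def numerical_gradient(critic, x, h=1e-5):
    """Central-difference estimate of the gradient of `critic` at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2 * h))
    return grad


def wgan_gp_critic_loss(critic, real_batch, fake_batch, lam=10.0, rng=random):
    """E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)||_2 - 1)^2]."""
    n = len(real_batch)  # assumption: len(fake_batch) == len(real_batch)
    wasserstein = (sum(critic(f) for f in fake_batch)
                   - sum(critic(r) for r in real_batch)) / n
    penalty = 0.0
    for real, fake in zip(real_batch, fake_batch):
        eps = rng.random()  # random interpolation weight in [0, 1)
        x_hat = [eps * r + (1 - eps) * f for r, f in zip(real, fake)]
        grad = numerical_gradient(critic, x_hat)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalty += (grad_norm - 1.0) ** 2
    return wasserstein + lam * penalty / n
```

For a linear critic whose weight vector has unit norm, the penalty term vanishes and only the Wasserstein term remains; a weight vector of norm 2 adds a penalty of exactly the coefficient, which is a quick sanity check on the sign conventions.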
3.2 Training Details
The training procedure follows . We initialized the weights with He’s normal initialization . We set the gradient penalty coefficient to 10 and updated the discriminator once every 5 iterations. All the models were optimized with an Adam optimizer with and . We set the initial learning rate
and batch size 64. We trained the models for 200,000 iterations. We sampled real images from the small ImageNet dataset with a resolution of 64 by 64 pixels [43, 8].
3.3 Attributes to Annotate
To accurately compare the quality of fake stimuli with real stimuli, we define a set of 8 attributes from different aspects. The attributes are listed in Table 1.
Pixel-level image attributes, such as color, intensity, and orientation, are low-level features for saliency detection and are biologically plausible. These features may strongly influence the perceived quality of fake images. However, the stimuli are very small (64 by 64 pixels), so detailed information cannot be perceived easily, and we only include color here. On the other hand, illuminance is an important attribute that differs between real images and computer-generated images [34, 32]. We asked annotators to describe their feelings about the illuminance of the stimuli.
Human attention tends to be drawn to objects relating to humans, such as faces [21, 4], emotion, and crowds. It is also easily attracted by moving objects [24, 45]. In addition, a key criterion is whether GANs can generate human-recognizable objects in the image. Hence, we include object and people in the attribute list.
The classification of indoor/outdoor scenes is an important problem in computer vision. It has a well-defined constraint that an indoor scene is inside a man-made structure. The same semantic object may not help to classify whether an image is indoor or outdoor; a swimming pool, for example, may be either. In addition, some outdoor images are easier to generate, such as sky, ocean, and grassland. As a result, we include scene in the attribute list and ask subjects to identify whether the image is perceived as an indoor or an outdoor image.
In some previous image synthesis models, repeated patterns may be observed frequently in synthesized images . It is possible for GANs to generate such patterns, so we define such features as texture and include it in our attribute list.
Realism is included as well. We also add weirdness to the set. Weirdness is defined as any unnatural features or objects, which might be common in generated images. Perception may be affected by the size of the image. Real images may also contain objects that could be perceived as strange if the image is small.
|Attribute||Description|
|Color||colorfulness, pixel value distribution|
|Illuminance||light effect, shadows, brightness|
|Object||objects in the image excluding humans, like car, animal, or furniture|
|People||humans in the image|
|Scene||outdoor rather than indoor scene|
|Realism||overall naturalness, real or computer generated|
|Weirdness||any unnatural feature, such as strange objects|
3.4 Stimuli and Data Collection
The dataset contains 1000 images. We used the generator trained above to produce 600 fake images with a resolution of 64 by 64 pixels. Another 400 real images were randomly selected from the small ImageNet dataset. All the images from this dataset are downsampled from the original ImageNet dataset. Images from small ImageNet are similar to CIFAR-10 images , but have greater variety [43, 8]. We requested workers from AMT to annotate the images. The user interface for the task is shown in Figure 3.
Each worker was asked to annotate a total of 20 images in one assignment. Participants were asked to judge whether the keyword (attribute) best describes the stimuli. There are five choices, “definitely yes” (5), “probably yes” (4), “not sure” (3), “probably no” (2), and “definitely no” (1). Each choice was then converted to the numerical score in the parentheses. To ensure the quality of annotation, we have set up two standards. Firstly, each participant must have an overall approval rate better than . Secondly, at the end of the annotation, participants needed to annotate five extra images that were selected from the images they had annotated in current assignment. In addition, participants were not allowed to see their previous selections. If the score difference between two annotations is large (i.e.
), the current assignment would be rejected. For each image, we collected annotations from 10 participants which all met the requirements. We calculated the score vector by averaging the annotations from the 10 participants for each image.
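The scoring and quality-control steps described above can be sketched as follows. This is an illustrative sketch only: the rejection threshold is passed in as the hypothetical parameter `max_diff`, and the function names `is_consistent` and `mean_score_vector` are assumptions, not part of our annotation pipeline.

```python
# Mapping from the five answer choices to numerical scores.
CHOICE_SCORES = {
    "definitely yes": 5,
    "probably yes": 4,
    "not sure": 3,
    "probably no": 2,
    "definitely no": 1,
}

ATTRIBUTES = ["color", "illuminance", "object", "people",
              "scene", "texture", "realism", "weirdness"]


def is_consistent(first_pass, second_pass, max_diff):
    """Accept an assignment only if each repeated answer stays within
    `max_diff` score points of the original answer (threshold is hypothetical)."""
    return all(
        abs(CHOICE_SCORES[a] - CHOICE_SCORES[b]) <= max_diff
        for a, b in zip(first_pass, second_pass)
    )


def mean_score_vector(annotations):
    """Average the per-attribute scores over all accepted annotators of one image.

    `annotations` is a list of dicts mapping attribute name to answer choice.
    """
    n = len(annotations)
    return [
        sum(CHOICE_SCORES[a[attr]] for a in annotations) / n
        for attr in ATTRIBUTES
    ]
```

With 10 accepted annotators per image, `mean_score_vector` yields the 8-dimensional score vector used in the analyses that follow.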
4 Data Analysis
In this section, we summarize the statistical and factor analyses of the annotated data.
4.1 Statistical Analysis
4.1.1 Group means
summarize the mean and standard deviation for each attribute.
Real images have higher scores in illuminance, object, people, and realism features, while fake images have higher scores in texture and weirdness features. There are no significant differences in color and scene features between real and fake images.
This observation shows that WGANs can generate colorful images that have similar color spectra as real images. However, WGANs are less capable of generating meaningful objects or humans. Fake images tend to be perceived as repeated patterns. Not surprisingly, fake images are rated more weird and less real.
We analyzed relationships between attributes by computing Pearson correlation coefficients. We first computed correlations over all the images, and then separately for real images and fake images. Figure 5 shows the results. A blank cell indicates insignificant correlation (). We observe that illuminance is highly correlated with realism (). Within each group, these two attributes are also moderately correlated ( for real images, for fake images). This suggests that illuminance is an important factor for realism, which is consistent with previous findings [34, 32, 10, 11]. Object and realism are also correlated ( for all images). This accords with previous findings that images with more objects are perceived to be more realistic [11, 25, 7]. For real images, object and human are negatively correlated (). This is likely because the stimuli are very small, so a single image is unlikely to contain both objects and humans. For fake images, however, the correlation between them is insignificant, which indicates that WGANs are not likely to generate meaningful objects or humans. As expected, realism is negatively correlated with weirdness ().
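As a reference for how such pairwise coefficients are obtained, the following is a minimal sketch of the Pearson correlation between two attribute score columns. The function names are hypothetical, and the significance test used to blank out cells is omitted.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping attribute name to score list."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}
```

Calling `correlation_matrix` once on all images and once on each of the real and fake subsets mirrors the three settings described above.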
4.2 Factor Analysis
Factor analysis (FA) is a statistical method that describes observed variables by latent, unobserved factors. Factor analysis is similar to principal component analysis (PCA) in that both are feature dimension reduction methods. However, components in PCA must be orthogonal to maximize the total variance, but the factors in FA are not necessarily orthogonal so that they can correlate with each other. We applied exploratory factor analysis (EFA) followed by confirmatory factor analysis (CFA) on the whole dataset. EFA identifies latent factors as linear combination of observed variables, while CFA tests how well the model fits the data. Attributes with poor fits, or loadings, are eliminated.
The model parameters and structure are presented in Figure 6. To estimate the fit of the model, two common indices are used. The first is the Comparative Fit Index (CFI), which compares the chi-square for the fit of a target model to the chi-square for the fit of an independence model, i.e., one in which the variables are uncorrelated. Higher CFIs indicate better model fit, and values that approach 0.90 indicate acceptable fit [11, 23]. The second is the Root Mean Square Error of Approximation (RMSEA), which estimates the amount of error of approximation per model degree of freedom and takes sample size into account. Smaller RMSEA values suggest better model fit, and a value of 0.10 or less is indicative of acceptable fit [11, 23]. Our CFA model has CFI = 0.97, which indicates acceptable fit, and RMSEA = 0.117, which is slightly above the 0.10 threshold.
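For readers unfamiliar with these two indices, the sketch below computes them from chi-square statistics using their commonly stated textbook definitions. The function names and the input values in the usage note are hypothetical; they are not the statistics of our CFA model. Some software divides by the sample size N instead of N - 1 in the RMSEA.

```python
import math


def comparative_fit_index(chi2_model, df_model, chi2_null, df_null):
    """CFI: 1 minus the ratio of model misfit to independence-model misfit."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model, 0.0)
    if d_null == 0.0:
        return 1.0
    return 1.0 - d_model / d_null


def rmsea(chi2_model, df_model, n_samples):
    """RMSEA: misfit per degree of freedom, scaled by sample size (N - 1)."""
    misfit = max(chi2_model - df_model, 0.0)
    return math.sqrt(misfit / (df_model * (n_samples - 1)))
```

For example, a hypothetical model with chi-square 20 on 10 degrees of freedom, compared against an independence model with chi-square 210 on 10 degrees of freedom, gives a CFI of 0.95.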
As indicated in Figure 6, we identified 2 latent factors. “Latent factor 1” is measured by realism, illuminance, and object. “Latent factor 2” is measured by weirdness and texture. The result is consistent with the analysis in 4.1. Realism, illuminance, and object are important factors for realism, while weirdness and texture are characteristics for fake images. Hence, “latent factor 1” is associated with reality while “latent factor 2” is associated with fakeness.
5 Improving GAN with Attributes
In this section, we show that the annotated attributes can be used to improve the quality of generated images. We first train an attribute net to mimic human subjects to annotate the attributes of images and then describe the structure of our model and explain how it utilizes these attributes. Quantitative results show that our model outperforms baseline models. We also provide some qualitative examples and conclude with a brief discussion.
Our model is shown in Figure 7. It consists of a generator, a discriminator, and an attribute net. The generator accepts a random sample drawn from a prior, e.g. a normal distribution, and outputs an image sample, which is then fed into the discriminator and the attribute net simultaneously. The attribute net takes an image as input and outputs the attributes. The outputs of the discriminator and the attribute net are concatenated and fed into a fully connected layer to compute the final output. All the components in our model can be implemented in different ways.
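The fusion step can be illustrated with a small sketch. It assumes that the fully connected layer produces a single scalar and represents it as a plain weight list plus a bias; the function name and the parameters `weights` and `bias` are hypothetical and stand in for learned values.

```python
def fused_discriminator_output(disc_output, attribute_scores, weights, bias):
    """Concatenate discriminator output and attribute scores, then apply
    one fully connected layer with a single output unit."""
    fused = list(disc_output) + list(attribute_scores)
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated input length")
    return sum(w * v for w, v in zip(weights, fused)) + bias
```

With the eight annotated attributes and a scalar discriminator output, the concatenated input would have nine entries under this assumption.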
We implement the attribute net with three models, VGG-16, ResNet50, and DenseNet169. In addition, we conduct an experiment that substitutes the attribute net with random noise. The purpose of the first comparison is to check the effectiveness of different network architectures. The purpose of the random noise experiment is to test whether only semantically meaningful vectors can improve the quality of generated images.
We train all the attribute nets on the annotated data. The dataset is randomly split into training and validation sets, each containing 500 images. All the images are resized and normalized. The loss function is the mean squared error between predicted values and annotated values. We train each model for a maximum of 300 epochs. Training is stopped when either the maximum epoch is reached or the loss plateaus on the validation set. All the models are optimized with stochastic gradient descent with a mini-batch size of 16.
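The loss and the stopping rule can be sketched as follows. The plateau criterion is not specified in detail above, so the `patience` and `min_delta` parameters are assumptions introduced only for illustration, as are the function names.

```python
def mse(predicted, target):
    """Mean squared error between predicted and annotated attribute scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def stopping_epoch(val_losses, max_epochs=300, patience=10, min_delta=1e-4):
    """Return the 1-based epoch at which training stops.

    Training stops when the validation loss has not improved by more than
    `min_delta` for `patience` consecutive epochs (hypothetical plateau rule),
    or when `max_epochs` is reached.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)
```

In practice `val_losses` would be produced one epoch at a time by evaluating `mse` on the 500 validation images; passing a precomputed list keeps the sketch self-contained.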
We test the attribute net on three GAN variants, WGAN, DCGAN, and LSGAN. For a fair comparison, we use the original loss function and network architecture for each GAN.
5.2 Datasets and Evaluation Metrics
As our annotations are obtained on the down-sampled ImageNet dataset, we first evaluate our model on it. We also evaluate our model on the CIFAR-10 dataset, which consists of 60000 tiny images with a resolution of 32 by 32 pixels. This dataset is widely used for GAN studies. Although we did not collect annotations for CIFAR-10, we wish to evaluate whether the annotated information is transferable.
We adopt three evaluation metrics that are commonly used in previous works, namely the Inception Score, the Mode Score, and the Fréchet Inception Distance (FID). The Inception Score evaluates the KL divergence between the conditional label distribution computed by the Inception model pretrained on the ImageNet dataset and the marginal distribution of category labels over generated samples. The Mode Score is an improved version of the Inception Score. It has an additional term which computes the KL divergence between the marginal label distribution from generated samples and the data label distribution. Finally, the FID is the distance between two Gaussian random variables $f(x_r)$ and $f(x_g)$, where $f$ is a predefined feature function and $x_r$ and $x_g$ denote real and generated samples. Let $\mu_r$ and $\mu_g$ be the empirical means, and $\Sigma_r$ and $\Sigma_g$ be the empirical covariances of $f(x_r)$ and $f(x_g)$ respectively. Then the Fréchet distance is defined as

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right).$$
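The following sketch illustrates two of these metrics in plain Python. The Inception Score function takes precomputed class-probability rows rather than running the Inception model, and the Fréchet distance is shown only for the simplified case of diagonal covariance matrices, where the matrix square root reduces to elementwise square roots. Both simplifications and the function names are assumptions made for illustration; the full FID requires a matrix square root of the covariance product.

```python
import math


def inception_score(probs):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    `probs` is a list of rows, one per generated sample, each a list of
    class probabilities that sums to 1.
    """
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        total_kl += sum(
            p * math.log(p / marginal[j]) for j, p in enumerate(row) if p > 0
        )
    return math.exp(total_kl / n)


def frechet_distance_diagonal(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term
```

Uniform predictions for every sample give the minimum Inception Score of 1, while confident predictions spread evenly over K classes give a score of K, which matches the intuition that the metric rewards both sample quality and diversity.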
5.3 Quantitative Results
|GAN Type||Inception Score||Mode Score||FID|
|CTGAN (as reported in )||-||-|
Quantitative results for models trained on ImageNet are summarized in Table 3.
5.3.1 Training with Different Attribute Nets
We observe that VGG achieves the best performance among the attribute net architectures. VGG learns feature representations in a hierarchical way, meaning that higher layers learn an ensemble of features from lower layers. ResNet and DenseNet are designed to learn residuals, so that successive layers refine previous layers. This paradigm may quickly learn representations for object classification, but may perform poorly in transfer learning or fine-tuning tasks. In other tasks such as style transfer [27, 17, 12], VGG is more popular than ResNet for extracting features because VGG requires fewer parameter tuning tricks and converges faster than ResNet and DenseNet. In our case, we fine-tune the pretrained model on a small dataset, so VGG may perform better than the other two models. Figure 8 shows the RMSE loss of each model at its stopping time. VGG-16 has the lowest RMSE loss, which indicates that it predicts attributes more accurately.
5.3.2 Training with Different GAN Variants
We examine whether our proposed attributes are useful for different types of GANs. The results show that for all three types of GANs, WGAN, DCGAN, and LSGAN, integrating attributes leads to better performance. This is possibly because the proposed attributes represent higher-level semantics that might not be learned directly from images. The additional information might make it easier for the discriminator to identify fake samples.
To test whether only vectors with semantic meaning improve the performance of GANs, we assign a random vector drawn from a normal distribution to each image. As shown in Table 3, the inception score drops significantly in this case.
This phenomenon suggests that the discriminator performs better only when the input vector carries some semantic meaning. A random vector has little information about the input sample and thus may interfere with the prediction of the discriminator, which causes the drop in evaluation metrics.
5.3.3 Training on Different Datasets
To examine whether the attributes generalize to other datasets, we train the model on CIFAR-10. However, as shown in Table 4, the inception score is lower than that of the WGAN-GP model, which indicates that the attributes may be distributed inconsistently across datasets.
One possible reason is that the image resolutions of the two datasets are different. Therefore, the attribute net may fail to compute correct attribute scores for real images from CIFAR-10 and fake images generated by the GAN. Consequently, the attribute scores may interfere with the prediction of the discriminator, much as a random vector does.
5.4 Qualitative Results
We show some images generated by the WGAN+VGG combination (Figure 9). More samples are shown in the supplemental materials. We analyze the generated images qualitatively from three aspects, pixels, diversity, and reality.
We examine pixel values by two factors, color and sharpness. We observe that the color looks natural and diverse in all images. The color distribution is consistent with natural images. Sharpness indicates that the difference between adjacent pixels is large. A sharper image looks less blurry and edges or boundaries can be figured out more easily, so images containing recognizable objects are usually sharp. As we can see from the sample images, almost all of them look sharp, which means that our model captures this feature successfully.
A common failure of GANs is mode collapse, meaning that the same image is generated for different latent vectors. Hence, diversity is an important factor in evaluating the performance of GANs. The generated samples are quite diverse, and we can hardly find identical images.
Reality indicates whether the image contains recognizable objects. Unfortunately, we find that many images contain meaningless color blotches. However, a few images contain distorted dogs and cats, such as the second and the seventh images in the first row of Figure 9.
6 Conclusion and Future Works
In this project, we built a new dataset of annotated images to characterize generated images. Comprehensive analyses show that real images contain more semantic objects, have better illuminance, and are perceived as more real than fake images. Fake images tend to be perceived as weirder and more like repeated patterns. Further, a DNN is trained to predict attributes automatically. We integrate the trained attribute net into the discriminator of a GAN to improve its performance. For future studies, a larger dataset could be built with more structured attributes for a more comprehensive study. Moreover, annotated attributes may be used for conditioned training or disentangled feature learning.
-  R. Adolphs. What does the amygdala contribute to social cognition? Annals of the New York Academy of Sciences, 1191(1):42–61, 2010.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  M. Cerf, E. P. Frady, and C. Koch. Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of vision, 9(12):10–10, 2009.
-  T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
-  C. Chen, Y. Ren, and C.-C. J. Kuo. Indoor/outdoor classification with multiple experts. In Big Visual Data Analysis, pages 23–63. Springer, 2016.
-  S. Y. Choi, M. Luo, M. Pointer, and P. Rhodes. Investigation of large display color image appearance–iii: Modeling image naturalness. Journal of Imaging Science and Technology, 53(3):31104–1, 2009.
-  P. Chrabaszcz, I. Loshchilov, and F. Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
-  W.-T. Chu, Y.-K. Chen, and K.-T. Chen. Size does matter: How image size affects aesthetic perception? In Proceedings of the 21st ACM international conference on Multimedia, pages 53–62. ACM, 2013.
-  S. Fan, T.-T. Ng, J. S. Herberg, B. L. Koenig, and S. Xin. Real or fake?: human judgments about photographs and computer-generated images of faces. In SIGGRAPH Asia 2012 technical briefs, page 17. ACM, 2012.
-  S. Fan, T.-T. Ng, B. L. Koenig, J. S. Herberg, M. Jiang, Z. Shen, and Q. Zhao. Image visual realism: From human perception to machine computation. IEEE transactions on pattern analysis and machine intelligence, 2017.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style, 2015.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
-  L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
-  M. Jiang, J. Xu, and Q. Zhao. Saliency in crowd. In European Conference on Computer Vision, pages 17–32. Springer, 2014.
-  N. Kanwisher, J. McDermott, and M. M. Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of neuroscience, 17(11):4302–4311, 1997.
-  T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
-  R. B. Kline and D. A. Santor. Principles & practice of structural equation modelling. Canadian Psychology, 40(4):381, 1999.
-  Z. Kourtzi and N. Kanwisher. Activation in human mt/mst by static images with implied motion. Journal of cognitive neuroscience, 12(1):48–55, 2000.
-  J.-F. Lalonde and A. A. Efros. Using color compatibility for assessing image realism. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
-  Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. Lecture Notes in Computer Science, page 468–483, 2018.
-  M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
-  P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408, 2016.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.
-  M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
-  A. McNamara et al. Exploring perceptual equivalence between real and simulated imagery. In Proceedings of the 2nd symposium on Applied Perception in Graphics and Visualization, pages 123–128. ACM, 2005.
-  L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
-  G. W. Meyer, H. E. Rushmeier, M. F. Cohen, D. P. Greenberg, and K. E. Torrance. An experimental evaluation of computer graphics imagery. ACM Transactions on Graphics (TOG), 5(1):30–50, 1986.
-  A. Payne and S. Singh. Indoor vs. outdoor scene classification in digital photographs. Pattern Recognition, 38(10):1533–1545, 2005.
-  B. Poole, A. A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for gans. arXiv preprint arXiv:1612.02780, 2016.
-  J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision, 40(1):49–70, 2000.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
-  A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958–1970, 2008.
-  A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016.
-  X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. Improving the improved training of wasserstein gans: A consistency term and its dual effect. arXiv preprint arXiv:1803.01541, 2018.
-  J. Winawer, A. C. Huk, and L. Boroditsky. A motion aftereffect from still photographs depicting motion. Psychological Science, 19(3):276–283, 2008.
-  M. Zhang, K. T. Ma, J. H. Lim, Q. Zhao, and J. Feng. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4372–4381, 2017.
-  J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
-  J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
7 Supplementary Materials
7.1 Training Different Attribute Nets
We implement the attribute net with three architectures, VGG-16, ResNet-50, and DenseNet-169, to check how the choice of network architecture affects the results. In addition, we conduct a control experiment that substitutes the output of the attribute net with random noise. Its purpose is to verify that the improvement in the quality of generated images comes from semantically meaningful attribute vectors rather than from any additional input.
We train all the attribute nets on the annotated data. The dataset is randomly split into training and validation sets, each containing 500 images. All the images are resized to pixels and normalized. The loss function is the mean squared error between the predicted and the annotated values. We train each model for a maximum of 300 epochs. Training stops when either the maximum epoch is reached or the loss plateaus on the validation set. All the models are optimized with stochastic gradient descent with a mini-batch size of 16.
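The loss and stopping rule described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the `patience` window, and the `min_delta` threshold used to decide that the validation loss has plateaued are all assumptions, since the text does not specify how a plateau is detected.

```python
def mse(predicted, annotated):
    """Mean squared error between predicted and annotated attribute scores."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, annotated)) / len(predicted)


def stopping_epoch(val_losses, max_epochs=300, patience=10, min_delta=1e-4):
    """Return the 1-based epoch at which training would stop.

    Training stops at max_epochs, or earlier once the validation loss has
    failed to improve by more than min_delta for `patience` epochs in a row.
    `patience` and `min_delta` are hypothetical plateau criteria.
    """
    best = float("inf")
    stale = 0
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best - min_delta:
            best, stale = loss, 0      # validation loss improved
        else:
            stale += 1                 # no meaningful improvement this epoch
            if stale >= patience:
                return epoch           # plateau reached
    return epoch                       # ran out of epochs or recorded losses


# Example: the loss improves for three epochs and then stays flat.
losses = [1.0, 0.5, 0.4] + [0.4] * 20
stop = stopping_epoch(losses, patience=5)
```

In this example the last improvement happens at epoch 3, so with a patience of 5 the run ends at epoch 8.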
VGG-16. The base model is a VGG net with 16 layers pretrained on ImageNet. The fully connected layer is replaced with a two-layer feed-forward neural network whose layers have output dimensions of 512 and 8, respectively. We use an initial learning rate of 0.02 and a learning rate decay of 0.0001. The momentum is 0.9. The learning rate is halved every 50 epochs. The mini-batch size for each iteration is 8. The model is trained on an NVIDIA Titan Black GPU with 6 GB memory. The training takes about 12 hours to complete.
ResNet-50. The base model is a residual net with 50 layers pretrained on ImageNet. We change the output dimension of the last layer to 8. We use an initial learning rate of 0.01. The momentum is 0.9. The learning rate is halved every 50 epochs. The mini-batch size for each iteration is 8. The model is trained on an NVIDIA Titan Black GPU with 6 GB memory. The training takes about 10 hours to complete.
DenseNet-169. The base model is a dense net with 169 layers pretrained on ImageNet. We change the output dimension of the last layer to 8. We use an initial learning rate of 0.01. The momentum is 0.9. The learning rate is halved every 50 epochs. The mini-batch size for each iteration is 8. The model is trained on an NVIDIA Titan Black GPU with 6 GB memory. The training takes about 10 hours to complete.
Random Noise. Finally, we disable the attribute net and replace its output with a random noise vector drawn from .
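The step schedule shared by the three attribute nets above (learning rate halved every 50 epochs) can be written as a one-line function. The sketch below is illustrative only: the function name `step_lr` is hypothetical, and it assumes 0-based epochs with the first halving taking effect at epoch 50.

```python
def step_lr(epoch, initial_lr, step=50, factor=0.5):
    """Learning rate at a 0-based epoch when it is scaled by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


# Initial learning rates as stated for each attribute net.
initial_rates = {"VGG-16": 0.02, "ResNet-50": 0.01, "DenseNet-169": 0.01}

# Learning rate at a few checkpoints of a 300-epoch run.
schedule = {
    name: [step_lr(epoch, lr0) for epoch in (0, 49, 50, 100, 299)]
    for name, lr0 in initial_rates.items()
}
```

Under these assumptions a full 300-epoch run sees five halvings, so the final learning rate is 1/32 of the initial one.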
7.2 Training with Different Types of GANs
We test the attribute net on three GAN variants: WGAN, DCGAN, and LSGAN. For a fair comparison, we use the original loss function and network architecture of each GAN. We train all the models on the tiny ImageNet dataset, using the training set that consists of 1.28 million images. To monitor overfitting, we compute convergence curves of the discriminator’s value on both the training set and a test set that contains 50 thousand images. All the images have a resolution of pixels. They are normalized without resizing before being fed to the discriminator. We call the original GAN variant the vanilla model and the GAN variant with the attribute net the modified model.
WGAN. The architectures of the generator and discriminator are the same as shown in Figure 2. The objective functions for training are defined as Equation (1) with gradient penalty term 10. We use the gradient penalty to regularize the generator and discriminator while keeping the attribute net fixed. We train the model for iterations. We use the Adam method with momentum terms and to optimize the model. The initial learning rate is 0.0001 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12GB memory. The training takes about 3 days for the vanilla model and 4 days for the modified model.
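To make the role of the coefficient 10 concrete, the following sketch computes a gradient penalty in the usual WGAN-GP form, coefficient times the batch mean of (‖∇D‖₂ − 1)². It is a toy illustration under stated assumptions: the critic is a hypothetical linear function D(x) = w · x, chosen because its gradient with respect to the input is simply w, whereas a real implementation would obtain the gradients by automatic differentiation at points interpolated between real and generated samples.

```python
import math


def gradient_penalty(grad_vectors, coefficient=10.0):
    """Batch mean of coefficient * (||g||_2 - 1)^2 over critic input-gradients."""
    squared_deviations = []
    for g in grad_vectors:
        norm = math.sqrt(sum(v * v for v in g))
        squared_deviations.append((norm - 1.0) ** 2)
    return coefficient * sum(squared_deviations) / len(squared_deviations)


# Toy linear critic D(x) = w . x: the gradient w.r.t. x equals w for every input.
w = [3.0, 4.0]            # ||w||_2 = 5, far from the target norm of 1
batch_grads = [w, w]      # gradients for a batch of two samples
penalty = gradient_penalty(batch_grads)
```

The penalty vanishes only when the gradient norm equals 1, which is how this term pushes the critic toward the 1-Lipschitz constraint.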
DCGAN. Let k3n256s2 denote a convolutional block with 256 filters of size 3×3 and stride 2. d256 denotes a convolutional layer with 256 filters and stride 2. fc4×4×512 denotes a fully connected layer with 4×4×512 output units, whose output is reshaped to a 4×4×512 tensor. The architecture of DCGAN is defined as follows.
Generator: fc4×4×512, d256, d128, d64, d3, tanh.
Discriminator: k5n64s2, k5n128s2, k5n256s2, k5n512s2, fc1.
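The layer notation above can be checked by tracing tensor shapes through both networks. The sketch below rests on several assumptions: the fully connected output is reshaped to 4×4×512, every stride-2 `d` block doubles the spatial size, every stride-2 `k` block halves it (as with "same" padding), and the input image is 64×64×3. The helper names are hypothetical.

```python
import re


def generator_shapes(blocks, start=(4, 4, 512)):
    """Trace (height, width, channels) through the 'dN' blocks of the generator."""
    h, w, c = start
    shapes = [(h, w, c)]
    for token in blocks:
        match = re.fullmatch(r"d(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognised generator block: {token}")
        # Assumed: each stride-2 block doubles height and width.
        h, w, c = h * 2, w * 2, int(match.group(1))
        shapes.append((h, w, c))
    return shapes


def discriminator_shapes(blocks, image=(64, 64, 3)):
    """Trace (height, width, channels) through the 'kKnNsS' blocks of the discriminator."""
    h, w, c = image
    shapes = [(h, w, c)]
    for token in blocks:
        match = re.fullmatch(r"k(\d+)n(\d+)s(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognised discriminator block: {token}")
        filters, stride = int(match.group(2)), int(match.group(3))
        # Assumed "same" padding: output size is ceil(input size / stride).
        h, w, c = -(-h // stride), -(-w // stride), filters
        shapes.append((h, w, c))
    return shapes


gen = generator_shapes(["d256", "d128", "d64", "d3"])
disc = discriminator_shapes(["k5n64s2", "k5n128s2", "k5n256s2", "k5n512s2"])
```

Under these assumptions the generator grows 4×4×512 into a 64×64×3 image, and the discriminator shrinks a 64×64×3 image back to 4×4×512 before the final fully connected layer.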
The objective function to optimize is
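As a point of reference, and assuming the vanilla model keeps the original GAN formulation unchanged, the standard minimax objective can be written as below. The symbols $D$, $G$, $p_{\text{data}}$, and $p_z$ are generic notation introduced here for illustration, and this form does not capture how the attribute net's output enters the modified model.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}} \big[ \log \big( 1 - D(G(z)) \big) \big]
```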
We use the Adam optimizer to train the model. Momentum terms are set to 0.5 and 0.999 respectively. We train the model for 200000 iterations. The initial learning rate is 0.0002 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12GB memory. The training takes about 2.5 days for the vanilla model and 3.5 days for the modified model.
LSGAN. LSGAN uses the same network architecture as DCGAN, but with different objective functions to optimize:
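For reference, the least-squares objectives in the original LSGAN formulation take the form below. The notation is generic and introduced here for illustration: $a$ and $b$ are the discriminator's target labels for fake and real samples, and $c$ is the value the generator wants the discriminator to output for fake samples. A common choice is $a = 0$ and $b = c = 1$; treating these as the values used in these experiments is an assumption.

```latex
\min_{D} \;
  \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}} \big[ (D(x) - b)^{2} \big]
  + \tfrac{1}{2}\, \mathbb{E}_{z \sim p_{z}} \big[ (D(G(z)) - a)^{2} \big]

\min_{G} \;
  \tfrac{1}{2}\, \mathbb{E}_{z \sim p_{z}} \big[ (D(G(z)) - c)^{2} \big]
```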
We use RMSProp to train the model. We train the model for 200000 iterations. The initial learning rate is 0.0001 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12GB memory. The training takes about 2.5 days for the vanilla model and 3 days for the modified model.
7.3 Training on Different Datasets
We also evaluate the attribute net on the CIFAR-10 dataset. CIFAR-10 contains 60,000 images with a size of 32×32 pixels. We use the training set, which contains 50,000 images, to train the discriminator, and the remaining 10,000 images for validation. We evaluate WGAN with VGG as the attribute net on this dataset. We train the model for 200000 iterations. We use a gradient penalty to regularize the norms of the gradients, with a regularization coefficient of 10. The initial learning rate is 0.0001 and is kept constant during training. We update the discriminator once for every generator iteration. We use a mini-batch size of 32 for each iteration. The model is trained on an NVIDIA Titan X GPU with 12GB memory. The training takes about 2 days.