1 Introduction
Generative adversarial nets (GANs) (Goodfellow et al., 2014), a new way of learning generative models, have recently shown promising results in various challenging tasks, such as realistic image generation (Nguyen et al., 2016b; Zhang et al., 2016; Gulrajani et al., 2017), conditional image generation (Huang et al., 2016b; Cao et al., 2017; Isola et al., 2016), image manipulation (Zhu et al., 2016) and text generation (Yu et al., 2016).

Despite this success, it is still challenging for current GAN models to produce convincing samples when trained on datasets with high variability, even for low-resolution image generation, e.g., CIFAR10. Meanwhile, it has been found empirically that taking advantage of class labels can significantly improve the sample quality.
There are three typical GAN models that make use of the label information: CatGAN (Springenberg, 2015) builds the discriminator as a multi-class classifier; LabelGAN (Salimans et al., 2016) extends the discriminator with one extra class for the generated samples; ACGAN (Odena et al., 2016) jointly trains the real-fake discriminator and an auxiliary classifier for the specific real classes. By taking the class labels into account, these GAN models show improved generation quality and stability. However, the mechanisms behind them have not been fully explored (Goodfellow, 2016).
In this paper, we mathematically study GAN models that take class labels into consideration. We derive the gradient of the generator's loss w.r.t. the class logits in the discriminator, which we name the class-aware gradient, for LabelGAN (Salimans et al., 2016), and further show that this gradient tends to guide each generated sample towards being one of the specific real classes. Moreover, we show that ACGAN (Odena et al., 2016) can be viewed as a GAN model with a hierarchical class discriminator. Based on the analysis, we reveal some potential issues in the previous methods and accordingly propose a new method to resolve these issues.

Specifically, we argue that a model with an explicit target class would provide clearer gradient guidance to the generator than an implicit target class model like that in (Salimans et al., 2016). Compared with (Odena et al., 2016), we show that introducing the specific real class logits by replacing the overall real class logit in the discriminator usually works better than simply training an auxiliary classifier. We argue that, in (Odena et al., 2016), adversarial training is missing in the auxiliary classifier, which makes the model more likely to suffer mode collapse and produce low-quality samples. We also find experimentally that predefined labels tend to result in intra-class mode collapse, and correspondingly propose dynamic labeling as a solution. The proposed model is named Activation Maximization Generative Adversarial Networks (AMGAN). We empirically study the effectiveness of AMGAN with a set of controlled experiments, and the results are consistent with our analysis. Notably, AMGAN achieves the state-of-the-art Inception Score (8.91) on CIFAR10.
In addition, through the experiments, we find that the commonly used metric needs further investigation. We therefore conduct a further study on the widely-used evaluation metric Inception Score (Salimans et al., 2016) and its extended metrics. We show that, with the Inception Model, Inception Score mainly tracks the diversity of the generator, while there is no reliable evidence that it can measure the true sample quality. We thus propose a new metric, called AM Score, to complement it with a more accurate estimate of the sample quality. In terms of AM Score, our proposed method also outperforms other strong baseline methods.
The rest of this paper is organized as follows. In Section 2, we introduce the notation and formulate LabelGAN (Salimans et al., 2016) and ACGAN (Odena et al., 2016) as our baselines. In Section 3, we derive the class-aware gradient for LabelGAN to reveal how class labels help its training. In Section 4, we reveal the overlaid-gradient problem of LabelGAN and propose AMGAN as a new solution, where we also analyze the properties of AMGAN and build its connections to related work. In Section 5, we introduce several important extensions, including dynamic labeling as an alternative to predefined labeling (i.e., class condition), the activation maximization view and a technique for enhancing ACGAN. We study Inception Score in Section 6 and accordingly propose a new metric, AM Score. In Section 7, we empirically study AMGAN and compare it to the baseline models with different metrics. Finally, we conclude the paper and discuss future work in Section 8.
2 Preliminaries
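In the original GAN formulation (Goodfellow et al., 2014), the loss functions of the generator $G$ and the discriminator $D$ are given as:

$$
\begin{aligned}
L_G^{\text{ori}} &= -\mathbb{E}_{z \sim p_z(z)}\big[\log D_r(G(z))\big] \triangleq -\mathbb{E}_{x \sim G}\big[\log D_r(x)\big], \\
L_D^{\text{ori}} &= -\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_r(x)\big] - \mathbb{E}_{x \sim G}\big[\log(1 - D_r(x))\big],
\end{aligned}
\tag{1}
$$

where $D$ performs binary classification between the real and the generated samples and $D_r(x)$ represents the probability of the sample $x$ coming from the real data.

2.1 LabelGAN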
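The framework (see Eq. (1)) has been generalized to the multi-class case where each sample $x$ has its associated class label $y \in \{1, \dots, K, K+1\}$, and the $(K+1)$-th label corresponds to the generated samples (Salimans et al., 2016). Its loss functions are defined as:

$$
L_G^{\text{lab}} = -\mathbb{E}_{x \sim G}\Big[\log \sum_{i=1}^{K} D_i(x)\Big] \triangleq -\mathbb{E}_{x \sim G}\big[\log D_r(x)\big],
\tag{2}
$$

$$
L_D^{\text{lab}} = -\mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[\log D_y(x)\big] - \mathbb{E}_{x \sim G}\big[\log D_{K+1}(x)\big],
\tag{3}
$$

where $D_i(x)$ denotes the probability of the sample $x$ being class $i$. The loss can be written in the form of cross-entropy, which will simplify our later analysis:

$$
L_G^{\text{lab}} = \mathbb{E}_{x \sim G}\big[H\big([1, 0], [D_r(x), D_{K+1}(x)]\big)\big],
\tag{4}
$$

$$
L_D^{\text{lab}} = \mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[H\big(v(y), D(x)\big)\big] + \mathbb{E}_{x \sim G}\big[H\big(v(K+1), D(x)\big)\big],
\tag{5}
$$

where $D(x) = [D_1(x), D_2(x), \dots, D_{K+1}(x)]$ and $v(y) = [v_1(y), \dots, v_{K+1}(y)]$ with $v_i(y) = 1$ if $i = y$ and $v_i(y) = 0$ otherwise. $H$ is the cross-entropy, defined as $H(p, q) = -\sum_i p_i \log q_i$. We refer to the above model as LabelGAN (using class labels) throughout this paper.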
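As an illustration of the cross-entropy form, the following minimal sketch evaluates the per-sample LabelGAN losses for a discriminator output given as a plain list of $K+1$ probabilities with the generated class last. The function names and the example vector are hypothetical and chosen only for this sketch; they do not come from the authors' code.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); zero-weight terms are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def one_hot(y, n):
    """Vectorizing operator v(y): 1 at the 1-based position y, 0 elsewhere."""
    return [1.0 if i == y else 0.0 for i in range(1, n + 1)]


def labelgan_generator_loss(d):
    """Per-sample generator loss in the form of Eq. (4).

    d is the (K+1)-way probability vector, with the generated class last.
    """
    d_r = sum(d[:-1])  # overall probability of being real
    return cross_entropy([1.0, 0.0], [d_r, d[-1]])


def labelgan_discriminator_loss(d, y):
    """Per-sample discriminator loss: y is the true class for real samples,
    or K+1 for generated samples."""
    return cross_entropy(one_hot(y, len(d)), d)


# Hypothetical discriminator output with K = 3 real classes.
example_d = [0.5, 0.2, 0.1, 0.2]
generator_loss = labelgan_generator_loss(example_d)  # equals -log(0.8)
```

The generator loss depends on the real classes only through their sum, which is why the per-class structure only shows up once gradients with respect to the individual logits are considered.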
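2.2 ACGAN*

Besides extending the original two-class discriminator as discussed in the above section, Odena et al. (2016) proposed an alternative approach, ACGAN, to incorporate class label information, which introduces an auxiliary classifier for real classes into the original GAN framework. With the core idea unchanged, we define a variant of ACGAN as follows and refer to it as ACGAN*:

$$
L_G^{\text{ac}}(x, y) = \mathbb{E}_{(x,y) \sim G}\big[H\big([1, 0], [D_r(x), D_f(x)]\big)\big]
\tag{6}
$$

$$
\qquad\qquad + \; \mathbb{E}_{(x,y) \sim G}\big[H\big(u(y), C(x)\big)\big],
\tag{7}
$$

$$
L_D^{\text{ac}}(x, y) = \mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[H\big([1, 0], [D_r(x), D_f(x)]\big)\big] + \mathbb{E}_{(x,y) \sim G}\big[H\big([0, 1], [D_r(x), D_f(x)]\big)\big]
\tag{8}
$$

$$
\qquad\qquad + \; \mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[H\big(u(y), C(x)\big)\big],
\tag{9}
$$

where $D_r(x)$ and $D_f(x)$ are outputs of the binary discriminator, the same as in vanilla GAN, $u(\cdot)$ is the vectorizing operator that is similar to $v(\cdot)$ but defined with $K$ classes, and $C(x)$ is the probability distribution over the $K$ real classes given by the auxiliary classifier.

In ACGAN, each sample has a coupled target class $y$, and a loss on the auxiliary classifier w.r.t. $y$ is added to the generator to leverage the class label information. We refer to the losses on the auxiliary classifier, i.e., Eq. (7) and (9), as the auxiliary classifier losses.

The above formulation is a modified version of the original ACGAN. Specifically, we omit from the discriminator's objective the auxiliary classifier loss $\mathbb{E}_{(x,y) \sim G}[H(u(y), C(x))]$, which encourages the auxiliary classifier $C$ to classify the fake sample $x$ to its target class $y$. Further discussions are provided in Section 5.3. Note that we also adopt the $-\log(D_r(x))$ loss in the generator.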
3 Class-Aware Gradient
In this section, we introduce the class-aware gradient, i.e., the gradient of the generator's loss w.r.t. class logits in the discriminator. By analyzing the class-aware gradient of LabelGAN, we find that the gradient tends to refine each sample towards being one of the classes, which sheds some light on how the class label information helps the generator to improve the generation quality. Before delving into the details, we first introduce the following lemma on the gradient properties of the cross-entropy loss to make our analysis clearer.
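Lemma 1.
With $l$ being the logits vector and $\sigma$ being the softmax function, let $\sigma(l)$ be the current softmax probability distribution and $\hat{p}$ denote the target probability distribution, then

$$
-\frac{\partial H\big(\hat{p}, \sigma(l)\big)}{\partial l} = \hat{p} - \sigma(l).
\tag{10}
$$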
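Lemma 1 is the standard softmax cross-entropy gradient identity: the negative gradient of the cross-entropy with respect to the logits equals the target distribution minus the current softmax output. A quick numerical check with central finite differences can be sketched as follows; the helper names, the step size and the example values are assumptions made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); zero-weight terms are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def analytic_neg_grad(target, logits):
    """Closed form of Lemma 1: target minus softmax(logits)."""
    return [t - s for t, s in zip(target, softmax(logits))]


def numeric_neg_grad(target, logits, eps=1e-6):
    """Central finite-difference estimate of -dH(target, softmax(l))/dl."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += eps
        down = list(logits)
        down[k] -= eps
        diff = cross_entropy(target, softmax(up)) - cross_entropy(target, softmax(down))
        grads.append(-diff / (2 * eps))
    return grads
```

The identity requires the target to be a probability distribution (non-negative entries summing to one), which holds for the one-hot and two-class targets used in this paper.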
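For a generated sample $x$, the loss in LabelGAN is $L_G^{\text{lab}}(x) = H([1, 0], [D_r(x), D_{K+1}(x)])$, as defined in Eq. (4). Let $l(x)$ denote the logits vector and write $l_r(x) = \log \sum_{k=1}^{K} e^{l_k(x)}$ for the overall real-class logit, so that $[D_r(x), D_{K+1}(x)] = \sigma([l_r(x), l_{K+1}(x)])$. With Lemma 1, the gradient of $L_G^{\text{lab}}(x)$ w.r.t. the logits vector $l(x)$ is given as:

$$
\begin{aligned}
-\frac{\partial L_G^{\text{lab}}(x)}{\partial l_k(x)} &= -\frac{\partial H\big([1, 0], [D_r(x), D_{K+1}(x)]\big)}{\partial l_r(x)} \frac{\partial l_r(x)}{\partial l_k(x)} = \big(1 - D_r(x)\big) \frac{D_k(x)}{D_r(x)}, \quad k \in \{1, \dots, K\}, \\
-\frac{\partial L_G^{\text{lab}}(x)}{\partial l_{K+1}(x)} &= -\frac{\partial H\big([1, 0], [D_r(x), D_{K+1}(x)]\big)}{\partial l_{K+1}(x)} = 0 - D_{K+1}(x) = -\big(1 - D_r(x)\big).
\end{aligned}
\tag{11}
$$

With the above equations, the gradient of $L_G^{\text{lab}}(x)$ w.r.t. $x$ is:

$$
\begin{aligned}
-\frac{\partial L_G^{\text{lab}}(x)}{\partial x} &= \sum_{k=1}^{K} -\frac{\partial L_G^{\text{lab}}(x)}{\partial l_k(x)} \frac{\partial l_k(x)}{\partial x} - \frac{\partial L_G^{\text{lab}}(x)}{\partial l_{K+1}(x)} \frac{\partial l_{K+1}(x)}{\partial x} \\
&= \big(1 - D_r(x)\big) \Big( \sum_{k=1}^{K} \frac{D_k(x)}{D_r(x)} \frac{\partial l_k(x)}{\partial x} - \frac{\partial l_{K+1}(x)}{\partial x} \Big) \\
&= \big(1 - D_r(x)\big) \sum_{k=1}^{K+1} \alpha_k^{\text{lab}}(x) \frac{\partial l_k(x)}{\partial x},
\end{aligned}
\tag{12}
$$

where

$$
\alpha_k^{\text{lab}}(x) =
\begin{cases}
\dfrac{D_k(x)}{D_r(x)}, & k \in \{1, \dots, K\}, \\
-1, & k = K+1.
\end{cases}
\tag{13}
$$

From the formulation, we find that the overall gradient w.r.t. a generated example $x$ is $1 - D_r(x)$, which is the same as that in vanilla GAN (Goodfellow et al., 2014). The gradient on the real classes is further distributed to each specific real class logit $l_k(x)$ according to its current probability ratio $D_k(x) / D_r(x)$.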
As such, the gradient naturally takes the label information into consideration: for a generated sample, higher probability of a certain class will lead to a larger step towards the direction of increasing the corresponding confidence for the class. Hence, individually, the gradient from the discriminator for each sample tends to refine it towards being one of the classes in a probabilistic sense.
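The distribution of the gradient over class logits can be made concrete with a small sketch. It assumes the discriminator is a softmax over $K+1$ logits with the generated class last, and that the per-sample LabelGAN generator loss is the negative log of the summed real-class probabilities; all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def labelgan_loss_from_logits(logits):
    """Per-sample generator loss: -log of the total real-class probability."""
    d = softmax(logits)
    return -math.log(sum(d[:-1]))


def class_aware_gradient(logits):
    """Negative gradient of the loss w.r.t. each logit.

    Real class k receives (1 - D_r) * D_k / D_r; the generated class
    receives -(1 - D_r).
    """
    d = softmax(logits)
    d_r = sum(d[:-1])
    real = [(1.0 - d_r) * dk / d_r for dk in d[:-1]]
    return real + [-(1.0 - d_r)]
```

Every real-class logit receives a positive push, weighted by its current probability share, so the class that is currently most likely gets the largest step. This is the sense in which the gradient guides a sample towards one of the classes only probabilistically.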
That is, each sample in LabelGAN is optimized to be one of the real classes, rather than simply to be real as in the vanilla GAN. We thus regard LabelGAN as an implicit target class model. Refining each generated sample towards one of the specific classes would help improve the sample quality. Recall that there are similar inspirations in related work. Denton et al. (2015) showed that the result could be significantly better if GAN is trained with separated classes. And ACGAN (Odena et al., 2016) introduces an extra loss that forces each sample to fit one class and achieves a better result.
4 The Proposed Method
In LabelGAN, the generator gets its gradients from the specific real class logits in the discriminator and tends to refine each sample towards being one of the classes. However, LabelGAN actually suffers from the overlaid-gradient problem: all real class logits are encouraged at the same time. Though it tends to make each sample be one of these classes during training, the gradient of each sample is a weighted average over multiple label predictors. As illustrated in Figure 1, the averaged gradient may point towards none of these classes.
In the setting of mutually exclusive classes, each valid sample should be classified by the discriminator into exactly one of the classes with high confidence. One way to resolve the above problem is to explicitly assign each generated sample a single specific class as its target.
4.1 AMGAN
Assigning each sample a specific target class $y$, the loss functions of the revised-version LabelGAN can be formulated as:

$$
L_G^{\text{am}} = \mathbb{E}_{(x,y) \sim G}\big[H\big(v(y), D(x)\big)\big],
\tag{14}
$$

$$
L_D^{\text{am}} = \mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[H\big(v(y), D(x)\big)\big] + \mathbb{E}_{x \sim G}\big[H\big(v(K+1), D(x)\big)\big],
\tag{15}
$$

where $v(y)$ has the same definition as in Section 2.1. The model with the aforementioned formulation is named Activation Maximization Generative Adversarial Networks (AMGAN) in our paper. A further interpretation of the name is given in Section 5.2. The only difference between AMGAN and LabelGAN lies in the generator's loss function. Each sample in AMGAN has a specific target class, which resolves the overlaid-gradient problem.
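To contrast the two generator losses, the sketch below applies Lemma 1 to a softmax discriminator with $K+1$ logits (generated class last) and returns the negative gradient with respect to each logit. With an explicit target class only the target logit is encouraged, whereas under the LabelGAN loss every real-class logit is. Function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def amgan_neg_gradient(logits, y):
    """-dH(v(y), D)/dl = v(y) - D, with y a 1-based real class index."""
    d = softmax(logits)
    return [(1.0 if i == y else 0.0) - d[i - 1] for i in range(1, len(d) + 1)]


def labelgan_neg_gradient(logits):
    """-dH([1, 0], [D_r, D_{K+1}])/dl: every real logit is pushed up."""
    d = softmax(logits)
    d_r = sum(d[:-1])
    return [(1.0 - d_r) * dk / d_r for dk in d[:-1]] + [-(1.0 - d_r)]


# Hypothetical logits with K = 3 real classes and target class y = 2.
example_logits = [1.0, 0.2, -0.5, 0.3]
am_grad = amgan_neg_gradient(example_logits, 2)       # only entry 2 is positive
lab_grad = labelgan_neg_gradient(example_logits)      # entries 1..3 are positive
```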
ACGAN (Odena et al., 2016) also assigns each sample a specific target class, but we will show that the AMGAN and ACGAN are substantially different in the following part of this section.
4.2 LabelGAN + Auxiliary Classifier
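Both LabelGAN and AMGAN are GAN models with $K+1$ classes. We introduce the following cross-entropy decomposition lemma to build their connections to GAN models with two classes and to the $K$-class models (i.e., the auxiliary classifiers).
Lemma 2.
Given $v = [v_1, \dots, v_{K+1}]$, $v_{1:K} \triangleq [v_1, \dots, v_K]$, $v_r \triangleq \sum_{k=1}^{K} v_k$, $R(v) \triangleq v_{1:K} / v_r$ and $F(v) \triangleq [v_r, v_{K+1}]$, let $\hat{p} = [\hat{p}_1, \dots, \hat{p}_{K+1}]$, $p = [p_1, \dots, p_{K+1}]$, then we have

$$
H(\hat{p}, p) = \hat{p}_r \, H\big(R(\hat{p}), R(p)\big) + H\big(F(\hat{p}), F(p)\big).
\tag{16}
$$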
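The decomposition in Lemma 2 can be checked numerically. The sketch below follows the lemma's notation for $R$ and $F$, takes distributions as plain lists with the generated class last, and assumes the real part of each distribution has non-zero mass so that $R$ is well defined; the example distributions used to exercise it are arbitrary.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); zero-weight terms are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def R(v):
    """Real-class part of v, renormalized to sum to one (requires v_r > 0)."""
    v_r = sum(v[:-1])
    return [x / v_r for x in v[:-1]]


def F(v):
    """Two-class view of v: [total real mass, generated-class mass]."""
    return [sum(v[:-1]), v[-1]]


def decomposed_cross_entropy(p_hat, p):
    """Right-hand side of Eq. (16)."""
    p_hat_r = sum(p_hat[:-1])
    return p_hat_r * cross_entropy(R(p_hat), R(p)) + cross_entropy(F(p_hat), F(p))
```

The first term is a $K$-class cross-entropy among the real classes and the second is the two-class real-versus-generated cross-entropy, which is what allows the $K+1$ class losses to be compared with models built from a binary discriminator plus an auxiliary classifier.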
With Lemma 2, the loss function of the generator in AMGAN can be decomposed as follows:

$$
L_G^{\text{am}}(x) = H\big(v(x), D(x)\big) = v_r(x) \cdot H\big(R(v(x)), R(D(x))\big) + H\big(F(v(x)), F(D(x))\big).
\tag{17}
$$

The second term of Eq. (17) actually equals the loss function of the generator in LabelGAN:

$$
H\big(F(v(x)), F(D(x))\big) = H\big([1, 0], [D_r(x), D_{K+1}(x)]\big) = L_G^{\text{lab}}(x).
\tag{18}
$$

A similar analysis can be applied to the first term and to the discriminator. Note that $v_r(x)$ equals one. Interestingly, by decomposing the AMGAN losses, we find that AMGAN can be viewed as a combination of LabelGAN and an auxiliary classifier (defined in Section 2.2). From the decomposition perspective, in contrast to AMGAN, ACGAN is a combination of vanilla GAN and the auxiliary classifier.
The auxiliary classifier loss in Eq. (17) can also be viewed as the cross-entropy version of the generator loss in CatGAN: the generator of CatGAN directly optimizes entropy to make each sample have a high confidence of being one of the classes, while AMGAN achieves this by the first term of its decomposed loss in terms of cross-entropy with a given target distribution. That is, AMGAN is the combination of the cross-entropy version of CatGAN and LabelGAN. We extend the discussion of AMGAN and CatGAN in Appendix B.
4.3 Non-Hierarchical Model
With Lemma 2, we can also reformulate ACGAN as a $K+1$ classes model. Take the generator's loss function as an example:

$$
\begin{aligned}
L_G^{\text{ac}}(x, y) &= \mathbb{E}_{(x,y) \sim G}\big[H\big([1, 0], [D_r(x), D_f(x)]\big) + H\big(u(y), C(x)\big)\big] \\
&= \mathbb{E}_{(x,y) \sim G}\big[H\big(v(y), [D_r(x) \cdot C(x), D_f(x)]\big)\big].
\end{aligned}
\tag{19}
$$

In the $K+1$ classes model, the $K+1$ classes distribution is formulated as $[D_r(x) \cdot C(x), D_f(x)]$. ACGAN introduces the auxiliary classifier in order to leverage the side information of class labels. It turns out that the formulation of ACGAN can be viewed as a hierarchical $K+1$ classes model that consists of a two-class discriminator and a $K$-class auxiliary classifier, as illustrated in Figure 2. Conversely, AMGAN is a non-hierarchical model: all $K+1$ classes stay at the same level of the discriminator in AMGAN.
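The hierarchical reading of ACGAN can be illustrated numerically: for a one-hot target, the sum of the two-class loss and the auxiliary classifier loss equals a single cross-entropy against the $K+1$ class distribution obtained by multiplying the real probability by the classifier output. In the sketch below, `d_r` and `d_f` stand for the binary discriminator outputs and `c` for the auxiliary classifier distribution; these argument names and the example values are hypothetical.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); zero-weight terms are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def one_hot(y, n):
    """1 at the 1-based position y, 0 elsewhere."""
    return [1.0 if i == y else 0.0 for i in range(1, n + 1)]


def acgan_generator_loss(d_r, d_f, c, y):
    """Two-class GAN loss plus auxiliary classifier loss for target class y."""
    return cross_entropy([1.0, 0.0], [d_r, d_f]) + cross_entropy(one_hot(y, len(c)), c)


def hierarchical_generator_loss(d_r, d_f, c, y):
    """Same loss written as one (K+1)-class cross-entropy."""
    joint = [d_r * ck for ck in c] + [d_f]
    return cross_entropy(one_hot(y, len(joint)), joint)


# Hypothetical outputs with K = 3 real classes.
flat = acgan_generator_loss(0.7, 0.3, [0.2, 0.5, 0.3], 2)
nested = hierarchical_generator_loss(0.7, 0.3, [0.2, 0.5, 0.3], 2)
```

Both values equal the negative log of `d_r * c[y - 1]`, so the two views describe the same loss; what differs from AMGAN is how the $K+1$ class distribution is produced and trained.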
In the hierarchical model ACGAN, adversarial training is only conducted at the real-fake two-class level, while it is missing in the auxiliary classifier. Adversarial training is the key to the theoretical guarantee of global convergence $p_G = p_{\text{data}}$. Taking the original GAN formulation as an instance, if generated samples collapse to a certain point $x$, i.e., $p_G(x) \gg p_{\text{data}}(x)$, then there must exist another point $x'$ with $p_G(x') \ll p_{\text{data}}(x')$. Given the optimal $D(x) = \frac{p_{\text{data}}(x)}{p_G(x) + p_{\text{data}}(x)}$, the collapsed point $x$ will get a relatively lower score. With the existence of higher-score points (e.g., $x'$), maximizing the generator's expected score has, in theory, the strength to recover from the mode-collapsed state. In practice, $p_G$ and $p_{\text{data}}$ are usually disjoint (Arjovsky & Bottou, 2017); nevertheless, the general behavior stays the same: when samples collapse to a certain point, they are more likely to get a relatively lower score from the adversarial network.
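The argument about the optimal discriminator can be illustrated with a two-point toy example. The density values below are made up for the illustration: the generator over-produces one point and under-produces another.

```python
def optimal_discriminator(p_data, p_g):
    """Optimal discriminator score p_data / (p_g + p_data) at a single point."""
    return p_data / (p_g + p_data)


# Hypothetical densities: the generator has collapsed onto point x,
# while point x_prime is under-covered.
score_collapsed = optimal_discriminator(p_data=0.1, p_g=0.9)
score_missing = optimal_discriminator(p_data=0.9, p_g=0.1)
```

The collapsed point scores below 0.5 and the under-covered point above 0.5, so a generator that maximizes its expected score is pulled away from the collapsed point. A classifier trained only on real samples provides no comparable signal.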
Without adversarial training in the auxiliary classifier, a mode-collapsed generator would not get any penalty from the auxiliary classifier loss. In our experiments, we find that ACGAN is more likely to become mode-collapsed, and it has been found empirically that reducing the weight of the auxiliary classifier losses (such as the 0.1 used in Gulrajani et al. (2017)) helps. In Section 5.3, we introduce extra adversarial training in the auxiliary classifier, with which we improve ACGAN's training stability and sample quality in experiments. In contrast, AMGAN, as a non-hierarchical model, can naturally conduct adversarial training among all the class logits.
5 Extensions
5.1 Dynamic Labeling
In the above section, we simply assume that each generated sample has a target class. One possible solution is, as in ACGAN (Odena et al., 2016), to predefine a class label for each sample, which essentially results in a conditional GAN. Alternatively, we could assign each sample a target class according to its current probability estimated by the discriminator. A natural choice is the class that currently has the maximal probability: $y(x) \triangleq \arg\max_{i \in \{1, \dots, K\}} D_i(x)$ for each generated sample $x$. We name this dynamic labeling.
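Dynamic labeling amounts to an argmax over the real-class probabilities only. A minimal sketch, assuming the discriminator output is a list of $K+1$ probabilities with the generated class last and using 1-based class indices as in the text (the function name is hypothetical):

```python
def dynamic_label(d):
    """Return the 1-based index of the most probable real class.

    d is the (K+1)-way discriminator output; the last entry (generated
    class) is ignored even if it has the largest probability.
    """
    real = d[:-1]
    best = max(range(len(real)), key=lambda i: real[i])
    return best + 1
```

For example, `dynamic_label([0.1, 0.5, 0.1, 0.3])` selects class 2, and so does `dynamic_label([0.1, 0.2, 0.1, 0.6])`, because the generated class is never a valid target. The label is recomputed from the current discriminator, so the target of a given noise vector may change during training.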
According to our experiments, dynamic labeling brings important improvements to AMGAN, and is applicable to other models that require a target class for each generated sample, e.g., ACGAN, as an alternative to predefined labeling.
We find experimentally that GAN models with pre-assigned class labels tend to encounter intra-class mode collapse. In addition, with dynamic labeling, the GAN model still generates from pure random noise, which has potential benefits, e.g., making smooth interpolation across classes in the latent space practicable.
5.2 The Activation Maximization View
Activation maximization is a technique which is traditionally applied to visualize the neuron(s) of pretrained neural networks (Nguyen et al., 2016a, b; Erhan et al., 2009).

GAN training can be viewed as an Adversarial Activation Maximization Process. To be more specific, the generator is trained to perform activation maximization for each generated sample on the neuron that represents the log probability of its target class, while the discriminator is trained to distinguish generated samples and prevent them from getting their desired high activation.
It is worth mentioning that the sample that maximizes the activation of one neuron is not necessarily of high quality. Traditionally, various priors are introduced to counter this phenomenon (Nguyen et al., 2016a, b). In GAN, the adversarial process of training can detect unrealistic samples and thus ensures that high activation is achieved by high-quality samples that strongly confuse the discriminator.
We thus name our model the Activation Maximization Generative Adversarial Network (AMGAN).
5.3 ACGAN*+
Experimentally, we find that ACGAN easily becomes mode-collapsed and that a relatively low weight for the auxiliary classifier term in the generator's loss function helps. In Section 4.3, we attribute mode collapse to the absence of adversarial training in the auxiliary classifier. From the adversarial activation maximization view: without adversarial training, the auxiliary classifier loss, which requires high activation on a certain class, cannot ensure the sample quality.
That is, in ACGAN, the vanilla GAN loss plays the role of ensuring sample quality and avoiding mode collapse. Here we introduce an extra loss to the auxiliary classifier in ACGAN* to enforce adversarial training, and we find experimentally that it consistently improves the performance:

$$
L_D^{\text{ac+}}(x, y) = L_D^{\text{ac}}(x, y) + \mathbb{E}_{(x,y) \sim G}\big[H\big(u(\cdot), C(x)\big)\big],
\tag{20}
$$

where $u(\cdot)$ represents the uniform distribution, which in spirit is the same as CatGAN (Springenberg, 2015). We refer to the resulting model as ACGAN*+.

Recall that we omit the auxiliary classifier loss $\mathbb{E}_{(x,y) \sim G}[H(u(y), C(x))]$ in ACGAN*. According to our experiments, this loss does improve the stability of ACGAN*+ and makes it less likely to become mode-collapsed, but it also leads to a worse Inception Score. We report the detailed results in Section 7. Our understanding of this phenomenon is that, by encouraging the auxiliary classifier to also classify fake samples to their target classes, the loss reduces the auxiliary classifier's ability to provide gradient guidance towards the real classes, and thus also alleviates the conflict between the GAN loss and the auxiliary classifier loss.
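The extra discriminator-side term in Eq. (20) is a cross-entropy between the uniform distribution over the real classes and the auxiliary classifier's output on a generated sample. A minimal per-sample sketch, with a hypothetical function name and classifier outputs given as plain lists of strictly positive probabilities:

```python
import math


def uniform_cross_entropy(c):
    """H(uniform, c) = -(1/K) * sum_k log(c_k) for a K-class distribution c."""
    k = len(c)
    return -sum(math.log(ck) for ck in c) / k


# Hypothetical classifier outputs on a generated sample, K = 4.
confident = [0.97, 0.01, 0.01, 0.01]
undecided = [0.25, 0.25, 0.25, 0.25]
penalty_confident = uniform_cross_entropy(confident)
penalty_undecided = uniform_cross_entropy(undecided)  # equals log(4)
```

The term is smallest, at $\log K$, when the classifier is maximally undecided, so minimizing it on generated samples trains the classifier to deny them a confident class. The generator, which seeks high confidence on its target class, then faces an adversary at the class level as well.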
6 Evaluation Metrics
One of the difficulties in generative models is the evaluation methodology (Theis et al., 2015). In this section, we conduct both mathematical and empirical analysis of the widely-used evaluation metric Inception Score (Salimans et al., 2016) and other relevant metrics. We will show that Inception Score mainly works as a diversity measurement, and we propose the AM Score as a complement to Inception Score for estimating the generated sample quality.
6.1 Inception Score
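As a recently proposed metric for evaluating the performance of generative models, Inception Score has been found to be well correlated with human evaluation (Salimans et al., 2016), where a publicly-available Inception model $C$ pretrained on ImageNet is introduced. By applying the Inception model to each generated sample $x$ and getting the corresponding class probability distribution $C(x)$, Inception Score is calculated via

$$
\text{Inception Score} = \exp\Big(\mathbb{E}_x\big[\mathrm{KL}\big(C(x) \,\|\, \bar{C}^G\big)\big]\Big),
\tag{21}
$$

where $\mathbb{E}_x$ is short for $\mathbb{E}_{x \sim G}$, $\bar{C}^G = \mathbb{E}_x[C(x)]$ is the overall probability distribution of the generated samples over classes, as judged by $C$, and KL denotes the Kullback-Leibler divergence. As proved in Appendix D, $\mathbb{E}_x[\mathrm{KL}(C(x) \,\|\, \bar{C}^G)]$ can be decomposed into two entropy terms:

$$
\mathbb{E}_x\big[\mathrm{KL}\big(C(x) \,\|\, \bar{C}^G\big)\big] = H\big(\bar{C}^G\big) + \big(-\mathbb{E}_x\big[H\big(C(x)\big)\big]\big).
\tag{22}
$$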
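The score and its entropy decomposition can be sketched on a handful of made-up classifier outputs. The sketch assumes each prediction is a list of class probabilities produced by some classifier; no actual Inception model is involved, and all function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_distribution(preds):
    """Average of the per-sample class distributions."""
    n = len(preds)
    return [sum(column) / n for column in zip(*preds)]


def inception_score(preds):
    """exp of the mean KL between each prediction and the mean prediction."""
    c_bar = mean_distribution(preds)
    mean_kl = sum(kl_divergence(c, c_bar) for c in preds) / len(preds)
    return math.exp(mean_kl)


def decomposed_log_score(preds):
    """Entropy of the mean prediction minus the mean per-sample entropy."""
    c_bar = mean_distribution(preds)
    mean_entropy = sum(entropy(c) for c in preds) / len(preds)
    return entropy(c_bar) - mean_entropy
```

For any set of predictions, `math.log(inception_score(preds))` agrees with `decomposed_log_score(preds)` up to floating-point error, which is the two-term decomposition stated above.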
6.2 The Properties of Inception Model
A common understanding of how Inception Score works lies in that a high score in the first term indicates the generated samples have high diversity (the overall class probability distribution evenly distributed), and a high score in the second term indicates that each individual sample has high quality (each generated sample’s class probability distribution is sharp, i.e., it can be classified into one of the real classes with high confidence) (Salimans et al., 2016).
However, taking CIFAR10 as an illustration, the data are not evenly distributed over the classes under the Inception model trained on ImageNet, as presented in Figure 3(a). This makes Inception Score problematic in view of the decomposed scores, i.e., $H(\bar{C}^G)$ and $-\mathbb{E}_x[H(C(x))]$. For example, one would ask whether a higher $H(\bar{C}^G)$ indicates better mode coverage and whether a smaller $H(C(x))$ indicates better sample quality.
We find experimentally that, as in Figure 2(b), the value of $H(\bar{C}^G)$ usually goes down during the training process, although it is expected to increase. When we delve into the detail of $H(C(x))$ for each specific sample in the training data, we find that its value also varies, as illustrated in Figure 3(b), which means that, even on real data, the score would still strongly prefer some samples over others. The $\exp$ operator in Inception Score and the large variance of the value of $H(C(x))$ aggravate the phenomenon. We also observe a preference on the class level in Figure 3(b), e.g., the value of $\mathbb{E}_x[H(C(x))]$ for trucks differs from that for birds.

It seems that, for an ImageNet classifier, neither of the two indicators of Inception Score can work correctly. Next we will show that Inception Score actually works as a diversity measurement.
6.3 Inception Score as a Diversity Measurement
Since the two individual indicators are strongly correlated, here we go back to Inception Score's original formulation $\exp(\mathbb{E}_x[\mathrm{KL}(C(x) \,\|\, \bar{C}^G)])$. In this form, we can interpret Inception Score as requiring each sample's distribution $C(x)$ to be highly different from the overall distribution of the generator $\bar{C}^G$, which indicates good diversity over the generated samples.
As is empirically observed, a mode-collapsed generator usually gets a low Inception Score. In an extreme case, assuming all the generated samples collapse to a single point, then $C(x) = \bar{C}^G$ and we would get the minimal Inception Score $1.0$, which is the $\exp$ result of zero. To simulate mode collapse in a more complicated case, we design synthetic experiments as follows: given a set of $N$ points $\{x_0, x_1, \dots, x_{N-1}\}$, with each point $x_i$ adopting the distribution $C(x_i) = v(i)$ and representing class $i$, where $v(i)$ is the vectorization operator of length $N$, as defined in Section 2.1, we randomly drop $m$ points, evaluate $\exp(\mathbb{E}_x[\mathrm{KL}(C(x) \,\|\, \bar{C}^G)])$ and draw the curve. As shown in Figure 5, when $N - m$ increases, the value of $\exp(\mathbb{E}_x[\mathrm{KL}(C(x) \,\|\, \bar{C}^G)])$ in general increases monotonically, which means that it can capture mode dropping and the diversity of the generated distributions well.
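The synthetic experiment can be reproduced in a few lines under its idealized assumption that every retained point has a one-hot class distribution. In that case each per-sample KL term equals the log of the number of retained points, so the score is exactly that number. The function names, the `seed` parameter and the use of `random.Random` are choices made for this sketch.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def inception_score(preds):
    """exp of the mean KL between each prediction and the mean prediction."""
    n = len(preds)
    c_bar = [sum(column) / n for column in zip(*preds)]
    return math.exp(sum(kl_divergence(c, c_bar) for c in preds) / n)


def mode_drop_score(n_points, n_dropped, seed=0):
    """Score after randomly dropping n_dropped of n_points one-hot modes."""
    if not 0 <= n_dropped < n_points:
        raise ValueError("need 0 <= n_dropped < n_points")
    rng = random.Random(seed)
    kept = rng.sample(range(n_points), n_points - n_dropped)
    preds = [[1.0 if j == i else 0.0 for j in range(n_points)] for i in kept]
    return inception_score(preds)


# Score as a function of the number of dropped modes, for N = 10.
curve = [mode_drop_score(10, m) for m in range(10)]
```

In this idealized setting the score simply counts the surviving modes, which makes its role as a diversity measure explicit. It also shows that the score is unaffected by which modes are dropped.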
One remaining question is that whether good mode coverage and sample diversity mean high quality of the generated samples. From the above analysis, we do not find any evidence. A possible explanation is that, in practice, sample diversity is usually well correlated with the sample quality.
6.4 AM Score with Accordingly Pretrained Classifier
Note that if each point $x_i$ has multiple variants such as $x_i^1$, $x_i^2$, $x_i^3$, the situation where $x_i^2$ and $x_i^3$ are missing and only $x_i^1$ is generated cannot be detected by the $\exp(\mathbb{E}_x[\mathrm{KL}(C(x) \,\|\, \bar{C}^G)])$ score. This means that, with an accordingly pretrained classifier, the $\exp(\mathbb{E}_x[\mathrm{KL}(C(x) \,\|\, \bar{C}^G)])$ score cannot detect intra-class level mode collapse. This also explains why the Inception Network on ImageNet could be a good candidate $C$ for CIFAR10. Exploring the optimal $C$ is a challenging problem and we leave it as future work.
However, there is no evidence that using an Inception Network trained on ImageNet can accurately measure the sample quality, as shown in Section 6.2. To complement Inception Score, we propose to introduce an extra assessment using an accordingly pretrained classifier. With the accordingly pretrained classifier, most real samples share similar $H(C(x))$, and 99.6% of the samples have scores less than 0.05, as shown in Figure 3(c), which demonstrates that $H(C(x))$ of the classifier can be used as an indicator of sample quality.
The entropy term on $\bar{C}^G$ is actually problematic when the training data is not evenly distributed over classes, since $H(\bar{C}^G)$ is maximized by a uniform distribution. To take $\bar{C}^{\text{train}}$ into account, we replace $H(\bar{C}^G)$ with a KL divergence between $\bar{C}^{\text{train}}$ and $\bar{C}^G$, so that

$$
\text{AM Score} \triangleq \mathrm{KL}\big(\bar{C}^{\text{train}}, \bar{C}^G\big) + \mathbb{E}_x\big[H\big(C(x)\big)\big],
\tag{23}
$$

which requires $\bar{C}^G$ to be close to $\bar{C}^{\text{train}}$ and each sample $x$ to have a low entropy $H(C(x))$. The minimal value of AM Score is zero, and the smaller the value, the better. A sample training curve of AM Score is shown in Figure 6, where all indicators in AM Score work as expected. (Note that Inception Score and AM Score measure the diversity and quality of generated samples, while FID (Heusel et al., 2017) measures the distance between the generated distribution and the real distribution.)
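A minimal sketch of the AM Score on made-up classifier outputs follows. It assumes `preds` holds the per-sample class distributions of generated samples under the accordingly pretrained classifier and `train_mean` is the average class distribution of the training data under the same classifier; both argument names are hypothetical. Returning infinity when a class present in the training data is never predicted is a convention chosen for this sketch.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q); infinite if q misses a class that p covers."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def am_score(preds, train_mean):
    """KL(train_mean || mean prediction) plus the mean per-sample entropy."""
    n = len(preds)
    c_bar = [sum(column) / n for column in zip(*preds)]
    mean_entropy = sum(entropy(c) for c in preds) / n
    return kl_divergence(train_mean, c_bar) + mean_entropy
```

Confident predictions whose class proportions match the training data give a score of zero, while ambiguous samples raise the entropy term and a mismatched class balance raises the KL term.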
7 Experiments
To empirically validate our analysis and the effectiveness of the proposed method, we conduct experiments on image benchmark datasets including CIFAR10 and TinyImageNet (https://tinyimagenet.herokuapp.com/), which comprises 200 classes with 500 training images per class. For evaluation, several metrics are used throughout our experiments, including Inception Score with the ImageNet classifier and AM Score with a corresponding pretrained classifier for each dataset, which is a DenseNet (Huang et al., 2016a) model. We also follow Odena et al. (2016) and use the mean MS-SSIM (Wang et al., 2004) of randomly chosen pairs of images within a given class as a coarse detector of intra-class mode collapse.
A modified DCGAN structure, as listed in Appendix F, is used in the experiments. Considering the page limit, visual results of the various models are provided in the Appendix, e.g., Figure 9. The repeatable experiment code is published for further research (link for anonymous experiment code: https://github.com/ZhimingZhou/AMGAN).
7.1 Experiments on CIFAR10
7.1.1 GAN with Auxiliary Classifier
The first question is whether training an auxiliary classifier without introducing correlated losses to the generator would help improve the sample quality. In other words, the generator is trained with only the GAN loss in the ACGAN* setting (we refer to this model as GAN*).
As shown in Table 1, it improves GAN's sample quality, but the improvement is limited compared to the other methods. This indicates that the introduction of the correlated loss plays an essential role in the remarkable improvement of GAN training.
7.1.2 Comparison Among Different Models
Using predefined labels turns the GAN model into its conditional version, which is substantially different from generating samples from pure random noise. In this experiment, we use dynamic labeling for ACGAN*, ACGAN*+ and AMGAN to seek a fair comparison among different discriminator models, including LabelGAN and GAN. We keep the network structure and hyperparameters the same for the different models; the only difference lies in the output layer of the discriminator, i.e., the number of class logits, which is necessarily different across models.
Table 1: Inception Score and AM Score (mean ± std.) on CIFAR10 and Tiny ImageNet, with dynamic and predefined labeling.

Inception Score
Model      CIFAR10                        Tiny ImageNet
           dynamic        predefined      dynamic        predefined
GAN        7.04 ± 0.06    7.27 ± 0.07     -              -
GAN*       7.25 ± 0.07    7.31 ± 0.10     -              -
ACGAN*     7.41 ± 0.09    7.79 ± 0.08     7.28 ± 0.07    7.89 ± 0.11
ACGAN*+    8.56 ± 0.11    8.01 ± 0.09     10.25 ± 0.14   8.23 ± 0.10
LabelGAN   8.63 ± 0.08    7.88 ± 0.07     10.82 ± 0.16   8.62 ± 0.11
AMGAN      8.83 ± 0.09    8.35 ± 0.12     11.45 ± 0.15   9.55 ± 0.11

AM Score
Model      CIFAR10                        Tiny ImageNet
           dynamic        predefined      dynamic        predefined
GAN        0.45 ± 0.00    0.43 ± 0.00     -              -
GAN*       0.40 ± 0.00    0.41 ± 0.00     -              -
ACGAN*     0.17 ± 0.00    0.16 ± 0.00     1.64 ± 0.02    1.01 ± 0.01
ACGAN*+    0.10 ± 0.00    0.14 ± 0.00     1.04 ± 0.01    1.20 ± 0.01
LabelGAN   0.13 ± 0.00    0.25 ± 0.00     1.11 ± 0.01    1.37 ± 0.01
AMGAN      0.08 ± 0.00    0.05 ± 0.00     0.88 ± 0.01    0.61 ± 0.01

Table 2: MS-SSIM on CIFAR10.

             ACGAN*   ACGAN*+   LabelGAN   AMGAN
dynamic      0.61     0.39      0.35       0.36
predefined   0.35     0.36      0.32       0.36

Table 3: Inception Score comparison with related work on CIFAR10.

Model                                      Score ± Std.
DFM (Warde-Farley & Bengio, 2017)          7.72 ± 0.13
Improved GAN (Salimans et al., 2016)       8.09 ± 0.07
ACGAN (Odena et al., 2016)                 8.25 ± 0.07
WGANGP + AC (Gulrajani et al., 2017)       8.42 ± 0.10
SGAN (Huang et al., 2016b)                 8.59 ± 0.12
AMGAN (our work)                           8.91 ± 0.11
Splitting GAN (Guillermo et al., 2017)     8.87 ± 0.09
Real data
As shown in Table 1, ACGAN* achieves improved sample quality over vanilla GAN, but suffers from mode collapse, as indicated by the value 0.61 in MS-SSIM in Table 2. By introducing adversarial training in the auxiliary classifier, ACGAN*+ outperforms ACGAN*. As an implicit target class model, LabelGAN suffers from the overlaid-gradient problem and has a relatively higher per-sample entropy (0.124) in the AM Score, compared to the explicit target class models AMGAN (0.079) and ACGAN*+ (0.102). In the table, our proposed AMGAN model reaches the best scores against these baselines.
We also test ACGAN* with a decreased weight on the auxiliary classifier losses in the generator, relative to the GAN loss. It achieves 7.19 in Inception Score, 0.23 in AM Score and 0.35 in MS-SSIM. The 0.35 in MS-SSIM indicates there is no obvious mode collapse, which also conforms with our above analysis.
7.1.3 Inception Score Comparing with Related Work
AMGAN achieves an Inception Score of 8.83 in the previous experiments, which significantly outperforms the baseline models, both in our implementation and in terms of their reported scores, as listed in Table 3. By further enhancing the discriminator with more filters in each layer, AMGAN also outperforms the orthogonal work (Guillermo et al., 2017) that enhances the class label information via class splitting. As a result, AMGAN achieves the state-of-the-art Inception Score of 8.91 on CIFAR10.
7.1.4 Dynamic Labeling and Class Condition
We find in our experiments that GAN models with class condition (predefined labeling) tend to encounter intra-class mode collapse (ignoring the noise), which is obvious at the very beginning of GAN training and is exacerbated during the process.
In GAN training, it is important to ensure a balance between the generator and the discriminator. With the same generator network structure and switching from dynamic labeling to class condition, we find it hard to hold a good balance between the generator and the discriminator: to avoid the initial intra-class mode collapse, the discriminator needs to be very powerful; however, it usually turns out that the discriminator is then too powerful to provide suitable gradients for the generator, which results in poor sample quality.
Nevertheless, we find a suitable discriminator and conduct a set of comparisons with it. The results can be found in Table 1. The general conclusion is similar to the above: ACGAN*+ still outperforms ACGAN*, and our AMGAN reaches the best performance. It is worth noting that ACGAN* does not suffer from mode collapse in this setting.
In the class conditional version, although with finetuned parameters, Inception Score is still relatively low. The explanation could be that, in the class conditional version, the sample diversity still tends to decrease, even with a relatively powerful discriminator. With slight intraclass mode collapse, the persamplequality tends to improve, which results in a lower AM Score. A supplementary evidence, not very strict, of partial mode collapse in the experiments is that: the is around 45.0 in dynamic labeling setting, while it is 25.0 in the conditional version.
LabelGAN does not need explicit labels, and the model is the same in the two experiment settings. Note, however, that both Inception Score and AM Score get worse in the conditional version. The only difference is that the discriminator becomes more powerful with an extended layer, which attests that the balance between the generator and the discriminator is crucial. We find that, without the concern of intra-class mode collapse, using dynamic labeling makes the balance between generator and discriminator much easier to achieve.
7.1.5 The $\mathbb{E}_{(x,y) \sim G}[H(u(y), C(x))]$ Loss
Note that we report results of the modified version of ACGAN, i.e., ACGAN*, in Table 1. If we add the omitted loss $\mathbb{E}_{(x,y) \sim G}[H(u(y), C(x))]$ back to ACGAN*, which leads to the original ACGAN (see Section 2.2), it turns out to achieve worse results on both Inception Score and AM Score on CIFAR10, though it avoids mode collapse. Specifically, relative to the ACGAN* results in Table 1, Inception Score decreases and AM Score increases in both the dynamic labeling setting and the predefined class setting.
This performance drop might be because we use different network architectures and hyperparameters from ACGAN (Odena et al., 2016). However, we still fail to achieve its reported Inception Score on CIFAR10, i.e., 8.25 ± 0.07 (Table 3), when using the hyperparameters reported in the original paper. Since the code is not publicly available, we suppose there might be some unreported details that result in the performance gap. We leave further study to future work.
7.1.6 The Learning Property
We plot the training curves in terms of Inception Score and AM Score in Figure 7. Inception Score and AM Score are evaluated with the same number of samples as in Salimans et al. (2016). Compared with Inception Score, AM Score is more stable in general. With more samples, Inception Score would be more stable; however, the evaluation of Inception Score is relatively costly. A better alternative to the Inception Model could help solve this problem.
The ACGAN curves show stronger jitter relative to the others. This might relate to the counteraction between the auxiliary classifier loss and the GAN loss in the generator. Another observation is that AMGAN is comparable with LabelGAN and ACGAN at the beginning in terms of Inception Score, while in terms of AM Score they are quite distinguishable from each other.
7.2 Experiments on TinyImageNet
In the CIFAR10 experiments, the results are consistent with our analysis and the proposed method outperforms these strong baselines. To show that the conclusions generalize, we run experiments on another dataset, TinyImageNet.
TinyImageNet has more classes and fewer samples per class than CIFAR10, which should make it more challenging. We downsize TinyImageNet samples from to and simply reuse the same network structure that is used for CIFAR10; the experimental results are also shown in Table 1. In this comparison, AMGAN still clearly outperforms the other methods. And the ACGAN gains better performance than ACGAN.
8 Conclusion
In this paper, we analyze current GAN models that incorporate class label information. Our analysis shows that LabelGAN works as an implicit target class model but suffers from the overlaid-gradient problem at the same time, and that an explicit target class solves this problem. We demonstrate that introducing the class logits in a non-hierarchical way, i.e., replacing the overall real class logit in the discriminator with the specific real class logits, usually works better than simply adding an auxiliary classifier; to support this, we provide an activation maximization view of GAN training and highlight the importance of adversarial training. In addition, according to our experiments, predefined labeling tends to lead to intra-class mode collapse, and we propose dynamic labeling as an alternative. Our extensive experiments on benchmark datasets validate our analysis and demonstrate the superior performance of the proposed AMGAN against strong baselines. Moreover, we examine the widely used evaluation metric Inception Score and reveal that it mainly works as a diversity measurement. We also propose the AM Score as a complement to estimate sample quality more accurately.
In this paper, we focus on the generator and its sample quality, while some related work focuses on the discriminator and semi-supervised learning. For future work, we would like to conduct empirical studies on discriminator learning and semi-supervised learning. We extend AMGAN to unlabeled data in Appendix C, where both unsupervised and semi-supervised learning are accessible within the AMGAN framework. Classifier-based evaluation metrics might encounter problems related to adversarial samples, which requires further study. Combining AMGAN with Integral Probability Metric based GAN models such as Wasserstein GAN (Arjovsky et al., 2017) could also be a promising direction, since it is orthogonal to our work.
References
 Arjovsky & Bottou (2017) Arjovsky, Martin and Bottou, Léon. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
 Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

 Cao et al. (2017) Cao, Yun, Zhou, Zhiming, Zhang, Weinan, and Yu, Yong. Unsupervised diverse colorization via generative adversarial networks. arXiv preprint, 2017.
 Che et al. (2016) Che, Tong, Li, Yanran, Jacob, Athul Paul, Bengio, Yoshua, and Li, Wenjie. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
 Denton et al. (2015) Denton, Emily L, Chintala, Soumith, Fergus, Rob, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pp. 1486–1494, 2015.
 Erhan et al. (2009) Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higher-layer features of a deep network. University of Montreal, 1341:3, 2009.
 Goodfellow (2016) Goodfellow, Ian. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Guillermo et al. (2017) Guillermo, L. Grinblat, Lucas, C. Uzal, and Pablo, M. Granitto. Class-splitting generative adversarial networks. arXiv preprint arXiv:1709.07359, 2017.
 Gulrajani et al. (2017) Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
 Heusel et al. (2017) Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Klambauer, Günter, and Hochreiter, Sepp. Gans trained by a two timescale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
 Huang et al. (2016a) Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016a.
 Huang et al. (2016b) Huang, Xun, Li, Yixuan, Poursaeed, Omid, Hopcroft, John, and Belongie, Serge. Stacked generative adversarial networks. arXiv preprint arXiv:1612.04357, 2016b.
 Isola et al. (2016) Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
 Karras et al. (2017) Karras, Tero, Aila, Timo, Laine, Samuli, and Lehtinen, Jaakko. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 Mao et al. (2016) Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond YK, Wang, Zhen, and Smolley, Stephen Paul. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
 Nguyen et al. (2016a) Nguyen, Anh, Dosovitskiy, Alexey, Yosinski, Jason, Brox, Thomas, and Clune, Jeff. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pp. 3387–3395, 2016a.
 Nguyen et al. (2016b) Nguyen, Anh, Yosinski, Jason, Bengio, Yoshua, Dosovitskiy, Alexey, and Clune, Jeff. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016b.
 Odena et al. (2016) Odena, Augustus, Olah, Christopher, and Shlens, Jonathon. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
 Salimans et al. (2016) Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2226–2234, 2016.
 Springenberg (2015) Springenberg, Jost Tobias. Unsupervised and semisupervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

 Szegedy et al. (2016) Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
 Theis et al. (2015) Theis, Lucas, Oord, Aäron van den, and Bethge, Matthias. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 Wang et al. (2004) Wang, Zhou, Simoncelli, Eero P, and Bovik, Alan C. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pp. 1398–1402. IEEE, 2004.
 WardeFarley & Bengio (2017) WardeFarley, D. and Bengio, Y. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.
 Yu et al. (2016) Yu, Lantao, Zhang, Weinan, Wang, Jun, and Yu, Yong. Seqgan: sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.
 Zhang et al. (2016) Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Huang, Xiaolei, Wang, Xiaogang, and Metaxas, Dimitris. Stackgan: Text to photorealistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
 Zhu et al. (2016) Zhu, Jun-Yan, Krähenbühl, Philipp, Shechtman, Eli, and Efros, Alexei A. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Springer, 2016.
Appendix A Gradient Vanishing & & Label Smoothing
A.1 Label Smoothing
Label smoothing, which avoids extreme logit values, was shown to be a good regularizer (Szegedy et al., 2016). A general version of label smoothing modifies the target probability of the discriminator:
(24) 
Salimans et al. (2016) proposed using only one-sided label smoothing, that is, applying label smoothing only to real samples: and . The stated reasoning for one-sided label smoothing is that applying label smoothing to fake samples would lead to fake modes in the data distribution, an explanation that is rather obscure.
We next show the exact problems that arise when applying label smoothing to fake samples along with the generator loss, from the viewpoint of the gradient w.r.t. the class logit, i.e., the class-aware gradient. We also show that the problem does not exist when using the generator loss.
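To make the difference between two-sided and one-sided smoothing concrete, the following minimal sketch builds binary discriminator targets. It is an illustration only: the function name `smoothed_targets` and the parameters `delta_real` and `delta_fake` are hypothetical, and the binary real/fake setting is a simplification of the multi-class discriminator considered in this paper.

```python
import numpy as np

def smoothed_targets(is_real, delta_real=0.1, delta_fake=0.0):
    """Target probability of the 'real' class for each sample.

    is_real    : boolean array, True for real samples, False for generated ones
    delta_real : smoothing on real samples (target becomes 1 - delta_real)
    delta_fake : smoothing on fake samples (target becomes delta_fake)
    All names here are assumptions made for illustration.
    """
    is_real = np.asarray(is_real, dtype=bool)
    return np.where(is_real, 1.0 - delta_real, delta_fake)

labels = np.array([True, True, False, False])
# One-sided: only real targets are softened, fake targets stay at 0.
one_sided = smoothed_targets(labels, delta_real=0.1, delta_fake=0.0)
# Two-sided: fake targets are also moved away from 0.
two_sided = smoothed_targets(labels, delta_real=0.1, delta_fake=0.1)
```

With `delta_fake = 0` the fake targets are left untouched, which corresponds to the one-sided variant described above.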
A.2 The generator loss
The generator loss with label smoothing in terms of cross-entropy is
(25) 
With Lemma 1, its negative gradient is
(26) 
(27) 
Gradient vanishing is a well-known training problem of GANs. Optimizing towards 0 or 1 is also not desired, because the discriminator maps real samples to the distribution with .
A.3 The generator loss
The generator loss with target in terms of cross-entropy is
(28) 
the negative gradient of which is
(29) 
(30) 
Without label smoothing , the always preserves the same gradient direction as , though with a different gradient scale. Note that a nonzero gradient does not mean that the gradient is efficient or valid.
The two-sided label-smoothed version has a strong connection to Least Squares GAN (Mao et al., 2016): with the fake logit fixed to zero, the discriminator maps real samples to on the real logit and maps fake samples to on the real logit, while the generator in contrast tries to map fake samples to . Their gradients on the logit are also similar.
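The remark on gradient direction and scale can be illustrated in the simplest binary case. The sketch below assumes, since the loss names are not recoverable from the text here, that the two generator losses being compared are the saturating loss log(1 - D) and the non-saturating loss -log(D), with D the sigmoid of a single real logit and no label smoothing. The function names are hypothetical.

```python
import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

def neg_grad_saturating(l):
    # Generator minimizes log(1 - D) with D = sigmoid(l).
    # d/dl log(1 - D) = -D, so the negative gradient is D.
    return sigmoid(l)

def neg_grad_non_saturating(l):
    # Generator minimizes -log(D).
    # d/dl [-log D] = -(1 - D), so the negative gradient is 1 - D.
    return 1.0 - sigmoid(l)

# Real-class logits of generated samples, from confidently rejected to accepted.
logits = np.array([-6.0, -2.0, 0.0, 2.0])
sat = neg_grad_saturating(logits)
non_sat = neg_grad_non_saturating(logits)
```

Both negative gradients are positive for every logit, so both losses push the real logit upwards; they differ only in scale. When the discriminator confidently rejects a sample (very negative logit), the saturating variant's gradient is close to zero while the non-saturating one stays close to one.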
Appendix B CatGAN
The auxiliary classifier loss of AMGAN can also be viewed as the cross-entropy version of CatGAN: the generator of CatGAN directly optimizes the entropy to make each sample belong to one class, while AMGAN achieves this through the first term of its decomposed loss in terms of cross-entropy with a given target distribution. That is, AMGAN is the cross-entropy version of CatGAN combined with LabelGAN by introducing an additional fake class.
B.1 Discriminator loss on fake samples
The discriminator of CatGAN maximizes the prediction entropy of each fake sample:
(31) 
Since AMGAN has an extra class for fake samples, we can achieve this in a simpler manner by minimizing the probability on the real logits:
(32) 
If is not zero, that is, when we apply negative label smoothing (Salimans et al., 2016), we could define to be a uniform distribution.
(33) 
As a result, the probability mass assigned by label smoothing is required to be uniformly distributed, similar to CatGAN.
Appendix C Unlabeled Data
In this section, we extend AMGAN to unlabeled data. Our solution is analogous to CatGAN (Springenberg, 2015).
C.1 Semi-supervised setting
Under the semi-supervised setting, we can add the following loss to the original solution to integrate the unlabeled data (with the distribution denoted as ):
(34) 
C.2 Unsupervised setting
Under the unsupervised setting, we need to introduce one extra loss, analogous to categorical GAN (Springenberg, 2015):
(35) 
where is a reference label distribution for the prediction on unsupervised data. For example, could be set to a uniform distribution, which requires the unlabeled data to make use of all the candidate class logits.
This loss can optionally be added in the semi-supervised setting, where could be defined as the predicted label distribution on the labeled training data .
Appendix D Inception Score
As a recently proposed metric for evaluating the performance of generative models, the Inception Score has been found to correlate well with human evaluation (Salimans et al., 2016), where a pretrained, publicly available Inception model is introduced. By applying the Inception model to each generated sample and obtaining the corresponding class probability distribution , the Inception Score is calculated via
(36) 
where is short for and is the overall probability distribution of the generated samples over classes, as judged by , and KL denotes the Kullback-Leibler divergence, which is defined as
(37) 
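For reference, the computation can be sketched as follows, using the standard form of the Inception Score: the exponential of the average KL divergence between each sample's class distribution and the overall class distribution. The sketch assumes the per-sample class probabilities are already available as the rows of an array; the function names are hypothetical and no Inception model is invoked.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * log(p_i / q_i), computed along the last axis.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def inception_score(probs):
    """probs: array of shape (num_samples, num_classes), rows sum to 1."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # overall class distribution of the samples
    return float(np.exp(np.mean(kl_divergence(probs, marginal))))
```

Two limiting cases help interpret the value: if every sample receives the same class distribution the score is 1, and if the samples are confidently and evenly spread over K classes the score approaches K.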
An extended metric, the Mode Score, is proposed in Che et al. (2016) to take the prior distribution of the labels into account, which is calculated via
(38) 
where the overall class distribution from the training data has been added as a reference. We show in the following that, in fact, Mode Score and Inception Score are equivalent.
Lemma 3.
Let be the class probability distribution of the sample , and denote another probability distribution, then
(39) 
With Lemma 3, we have
(40)  
(41) 
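The claimed equivalence can be checked numerically. The sketch below rests on two assumptions that are not recoverable from the text here: that the Mode Score has the form exp(E_x[KL(C(x) || ref)] - KL(mean || ref)), where ref is the training label distribution and mean is the overall class distribution of the generated samples, and that Lemma 3 expresses the linearity of cross-entropy in its first argument. The data are random placeholders, not experimental results.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(np.clip(q, eps, 1.0)), axis=-1)

def kl(p, q):
    # KL(p || q) = H(p, q) - H(p)
    return cross_entropy(p, q) - entropy(p)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)  # placeholder per-sample class distributions
ref = rng.dirichlet(np.ones(10))               # placeholder training label distribution
mean_p = probs.mean(axis=0)                    # overall class distribution of the samples

inception = np.exp(np.mean(kl(probs, mean_p)))
mode = np.exp(np.mean(kl(probs, ref)) - kl(mean_p, ref))

# Assumed content of Lemma 3: E_x[H(p(x), q)] = H(E_x[p(x)], q).
lemma3_lhs = np.mean(cross_entropy(probs, ref))
lemma3_rhs = cross_entropy(mean_p, ref)
```

Under these assumptions the two cross-entropy terms involving the reference distribution cancel, so `inception` and `mode` agree up to floating-point error regardless of the choice of `ref`.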
Appendix E The Lemma and Proofs
Lemma 1.
With being the logits vector and being the softmax function, let be the current softmax probability distribution and denote any target probability distribution, then:
(42) 
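A quick finite-difference check of this kind of identity is sketched below. It assumes the lemma is the standard result that the negative gradient of the cross-entropy between a target distribution and the softmax of the logits, taken w.r.t. the logits, equals the target minus the current softmax distribution. The function names and the random inputs are hypothetical.

```python
import numpy as np

def softmax(l):
    z = np.exp(l - np.max(l))  # subtract the max for numerical stability
    return z / z.sum()

def cross_entropy(target, l):
    # H(target, softmax(l))
    return -np.sum(target * np.log(softmax(l)))

def numerical_neg_grad(target, l, h=1e-6):
    # Central finite differences of the cross-entropy w.r.t. each logit.
    grad = np.zeros_like(l)
    for i in range(l.size):
        step = np.zeros_like(l)
        step[i] = h
        grad[i] = (cross_entropy(target, l + step)
                   - cross_entropy(target, l - step)) / (2 * h)
    return -grad

rng = np.random.default_rng(0)
logits = rng.normal(size=5)
target = rng.dirichlet(np.ones(5))

analytic = target - softmax(logits)  # assumed closed form of the negative gradient
numeric = numerical_neg_grad(target, logits)
```

The two vectors agree to within finite-difference error, and the analytic form sums to zero because both distributions sum to one.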
Proof.
Lemma 2.
Given , , , and , let , , then we have: