# Softmax GAN

Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1/N, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1/(M+N). While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We further demonstrate with experiments that this simple change stabilizes GAN training.

## Code Repositories

### chainer_SoftmaxGAN

reproduction of SoftmaxGAN

## 1 Introduction

Generative Adversarial Networks (GAN) have achieved great success due to their ability to generate realistic samples. A GAN is composed of one discriminator and one generator. The discriminator tries to distinguish real samples from generated samples, while the generator counterfeits real samples using information from the discriminator. GAN differs from many other generative models. Instead of explicitly sampling from a probability distribution, GAN uses a deep neural network as a direct generator that generates samples from random noise. GAN has been shown to work well on several realistic tasks, e.g. image inpainting, deblurring and imitation learning.

Despite its success in many applications, GAN is highly unstable in training. Careful selection of hyperparameters is often necessary to make the training process converge [DCGAN]. It is often believed that this instability is caused by unbalanced discriminator and generator training. As the discriminator utilizes a logistic loss, it saturates quickly and its gradient vanishes if the generated samples are easy to separate from the real ones. When the discriminator fails to provide gradient, the generator stops updating. Softmax GAN overcomes this problem by utilizing the softmax cross-entropy loss, whose gradient is always non-zero unless the softmaxed distribution matches the target distribution.
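The saturation argument can be illustrated numerically. The following is a minimal sketch, not the authors' code: the scores are hypothetical discriminator logits for two real and two generated samples that are already easy to separate, and the helper names are chosen only for this illustration. It compares the derivative of the original generator loss, ln(1 - sigmoid(s)), with the gradient of a softmax cross-entropy loss whose target is uniform over the batch.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(logits):
    # Subtract the maximum for numerical stability.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Hypothetical logits: two real samples, then two generated samples,
# confidently separated by the discriminator.
scores = np.array([10.0, 10.0, -10.0, -10.0])

# Original GAN generator loss on a generated sample: ln(1 - sigmoid(s)).
# Its derivative with respect to s is -sigmoid(s), which vanishes here.
logistic_grad_fake = -sigmoid(scores[2:])

# Softmax cross-entropy with a uniform target over the batch.
# Its gradient with respect to the logits is p - t.
p = softmax(scores)
t_uniform = np.full(scores.size, 1.0 / scores.size)
softmax_grad = p - t_uniform

print("logistic gradient on generated samples:", logistic_grad_fake)
print("softmax cross-entropy gradient:", softmax_grad)
```

In this setting the logistic gradient on the generated samples is close to zero, while the softmax cross-entropy gradient stays large because the softmaxed distribution is far from the uniform target. It becomes zero only when the two distributions coincide.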

## 2 Related Works

There are many works related to improving the stability of GAN training. DCGAN, proposed by Radford et al. [DCGAN], comes up with several empirical techniques that work well, including how to apply batch normalization, how the input should be normalized, and which activation function to use. Some more techniques are proposed by Salimans et al. [salimans2016improved]. One of them is minibatch discrimination. The idea is to introduce a layer that operates across samples to introduce coordination between gradients from different samples in a minibatch. In this work, we achieve a similar effect using softmax across the samples. We argue that softmax is more natural and explainable, and yet does not require extra parameters.

Nowozin et al. [fgan] generalize the GAN training loss from Jensen-Shannon divergence to any f-divergence function. Wasserstein distance is mentioned in that paper as a member of another class of probability metrics but is not implemented. Under the f-GAN framework, training objectives with more stable gradients can be developed. For example, the Least Square GAN [mao2016least] uses a least-squares ($L_2$) loss function as the objective, which achieves faster training and improves generation quality.

Arjovsky et al. managed to use Wasserstein distance as the objective in their Wasserstein GAN (WGAN) [WGAN] work. This new objective has non-zero gradients everywhere. The implementation is as simple as removing the sigmoid function in the objective and adding weight clipping to the discriminator network. WGAN is shown to be free of many of the problems in the original GAN, such as mode collapse and an unstable training process. A related work to WGAN is Loss-Sensitive GAN [LSGAN], whose objective is to minimize the loss for real data and maximize it for the fake ones. The common property of Least Square GAN, WGAN, Loss-Sensitive GAN and this work is the usage of objective functions with non-vanishing gradients.

## 3 Softmax GAN

We denote the minibatches sampled from the training data and the generated data as $B_+$ and $B_-$ respectively. $B = B_+ \cup B_-$ is the union of $B_+$ and $B_-$. The output of the discriminator is represented by $\mu_\theta(x)$, parameterized by $\theta$. $Z_B = \sum_{x \in B} e^{-\mu_\theta(x)}$ is the partition function of the softmax within batch $B$. We use $x$ for samples from $B_+$ and $x'$ for generated samples in $B_-$. As in GAN, generated samples are not directly sampled from a distribution. Instead, they are generated from a random variable $z$ with a trainable generator $G$.

We softmax-normalize the energy of all data points within $B$, and use the cross-entropy loss for both the discriminator and the generator. The target of the discriminator is to assign the probability mass equally to all samples in $B_+$, leaving samples in $B_-$ with zero probability.

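$$
t_D(x) =
\begin{cases}
\dfrac{1}{|B_+|}, & \text{if } x \in B_+ \\
0, & \text{if } x \in B_-
\end{cases}
\tag{1}
$$

$$
\begin{aligned}
L_D &= -\sum_{x \in B} t_D(x) \ln \frac{e^{-\mu_\theta(x)}}{Z_B} \\
&= -\sum_{x \in B_+} \frac{1}{|B_+|} \ln \frac{e^{-\mu_\theta(x)}}{Z_B} - \sum_{x' \in B_-} 0 \cdot \ln \frac{e^{-\mu_\theta(x')}}{Z_B} \\
&= \sum_{x \in B_+} \frac{1}{|B_+|} \mu_\theta(x) + \ln Z_B
\end{aligned}
\tag{2}
$$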

For the generator, the target is to assign the probability mass equally to all the samples in $B$.

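$$
t_G(x) = \frac{1}{|B|}
\tag{3}
$$

$$
\begin{aligned}
L_G &= -\sum_{x \in B} t_G(x) \ln \frac{e^{-\mu_\theta(x)}}{Z_B} \\
&= -\sum_{x \in B_+} \frac{1}{|B|} \ln \frac{e^{-\mu_\theta(x)}}{Z_B} - \sum_{x' \in B_-} \frac{1}{|B|} \ln \frac{e^{-\mu_\theta(x')}}{Z_B} \\
&= \sum_{x \in B_+} \frac{1}{|B|} \mu_\theta(x) + \sum_{x' \in B_-} \frac{1}{|B|} \mu_\theta(x') + \ln Z_B
\end{aligned}
\tag{4}
$$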
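The two losses in equations 2 and 4 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `mu_real` and `mu_fake` are hypothetical arrays holding the discriminator energies for the real and generated halves of one batch, and the function names are not taken from the authors' implementation.

```python
import numpy as np

def log_partition(mu):
    """Return ln Z_B = ln(sum(exp(-mu))) using a numerically stable shift."""
    neg = -mu
    shift = np.max(neg)
    return shift + np.log(np.sum(np.exp(neg - shift)))

def softmax_gan_losses(mu_real, mu_fake):
    """Discriminator loss (equation 2) and generator loss (equation 4)."""
    mu_all = np.concatenate([mu_real, mu_fake])
    ln_z = log_partition(mu_all)
    # Equation 2: average energy of the real samples plus ln Z_B.
    loss_d = np.mean(mu_real) + ln_z
    # Equation 4: average energy over the whole batch plus ln Z_B.
    loss_g = np.mean(mu_all) + ln_z
    return loss_d, loss_g

# Hypothetical energies: real samples have low energy, generated ones high.
mu_real = np.array([0.1, -0.3, 0.2, 0.0])
mu_fake = np.array([2.0, 1.5, 2.5, 1.8])
print(softmax_gan_losses(mu_real, mu_fake))
```

When all energies in the batch are equal, the softmaxed distribution is uniform and the generator loss reaches its minimum value of ln|B|, the entropy of the uniform target.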

## 4 Relationship to Importance Sampling

It has been pointed out in the original GAN paper that GAN is similar to NCE [NCE] in that both of them use a binary logistic classification loss as the surrogate function for training a generative model. GAN improves over NCE by using a trained generator for noise samples instead of fixing the noise distribution. In the same sense that GAN is related to NCE [goodfellow2014distinguishability], this work can be seen as the Importance Sampling version of GAN [IS]. We prove it as follows.

### 4.1 Discriminator

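We use $\xi_\phi(x)$ to represent the energy function, and $Z = \int_{x'} e^{-\xi_\phi(x')} dx'$ is the partition function. The probability density function is then $p_\phi(x) = e^{-\xi_\phi(x)} / Z$. With $O$ as the observed training examples, the maximum likelihood estimation loss function and its gradient are as follows:

$$
J = \frac{1}{|O|} \sum_{x \in O} \xi_\phi(x) + \log \int_{x'} e^{-\xi_\phi(x')} dx'
\tag{5}
$$

$$
\nabla_\phi J = \frac{1}{|O|} \sum_{x \in O} \nabla_\phi \xi_\phi(x) - \mathbb{E}_{x' \sim p_\phi} \left[ \nabla_\phi \xi_\phi(x') \right]
\tag{6}
$$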

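As it is usually difficult to sample from the model distribution $p_\phi$, Importance Sampling instead introduces a known distribution $q$ to sample from, resulting in:

$$
\nabla_\phi J = \frac{1}{|O|} \sum_{x \in O} \nabla_\phi \xi_\phi(x) - \mathbb{E}_{x' \sim q} \left[ \frac{p_\phi(x')}{q(x')} \nabla_\phi \xi_\phi(x') \right]
\tag{7}
$$

In the biased Importance Sampling [bengio2008adaptive], the above is converted to the following biased estimation, which can be calculated without knowing the partition function:

$$
\hat{\nabla}_\phi J = \frac{1}{|B_+|} \sum_{x \in B_+} \nabla_\phi \xi_\phi(x) - \frac{1}{R} \sum_{x' \in Q} r(x') \nabla_\phi \xi_\phi(x')
\tag{8}
$$

where $r(x') = e^{-\xi_\phi(x')} / q(x')$ and $R = \sum_{x' \in Q} r(x')$, and $Q$ is a batch of data sampled from $q$. At this point, we reparameterize $e^{-\mu_\theta(x)} = e^{-\xi_\phi(x)} / q(x)$.

$$
\hat{\nabla}_\theta J = \frac{1}{|B_+|} \sum_{x \in B_+} \nabla_\theta \mu_\theta(x) - \frac{1}{\sum_{y \in Q} e^{-\mu_\theta(y)}} \sum_{x' \in Q} e^{-\mu_\theta(x')} \nabla_\theta \mu_\theta(x')
\tag{9}
$$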

Without loss of generality, we assume $|B_+| = |B_-|$ and replace $Q$ with $B$ in equation 9, namely $q(x) = \frac{p_D(x) + p_G(x)}{2}$, and compare the above with equation 2. It is easy to see that the above is the gradient of $L_D$. In other words, the discriminator loss function in Softmax GAN performs maximum likelihood on the observed real data, with Importance Sampling used to estimate the partition function.
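The claim that equation 9 with $Q = B$ is the gradient of the discriminator loss can be checked numerically. The sketch below is an assumption-laden illustration rather than part of the original derivation: it differentiates with respect to the per-sample energies instead of $\theta$ (the chain rule then gives equation 9), stores the real samples first in a hypothetical energy vector `mu`, and compares the analytic form, target minus softmax weight, against central finite differences.

```python
import numpy as np

def loss_d(mu, n_real):
    """Equation 2 for a batch whose first n_real entries are real samples."""
    neg = -mu
    shift = neg.max()
    ln_z = shift + np.log(np.exp(neg - shift).sum())
    return mu[:n_real].mean() + ln_z

def analytic_grad(mu, n_real):
    """Per-energy gradient implied by equation 9 with Q = B."""
    neg = -mu
    w = np.exp(neg - neg.max())
    p = w / w.sum()                 # e^{-mu(x')} / sum_y e^{-mu(y)}
    t = np.zeros_like(mu)
    t[:n_real] = 1.0 / n_real       # 1/|B+| on real samples, 0 elsewhere
    return t - p

def numeric_grad(mu, n_real, eps=1e-6):
    """Central finite differences of loss_d with respect to each energy."""
    grad = np.zeros_like(mu)
    for i in range(mu.size):
        step = np.zeros_like(mu)
        step[i] = eps
        grad[i] = (loss_d(mu + step, n_real) - loss_d(mu - step, n_real)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
mu = rng.normal(size=6)             # hypothetical energies: 3 real, 3 generated
print(np.max(np.abs(analytic_grad(mu, 3) - numeric_grad(mu, 3))))
```

The printed value is the largest discrepancy between the two gradients, which should be at the level of finite-difference error.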

With an infinite number of real samples, the optimal solution is

$$
e^{-\mu_\theta(x)} = C \, \frac{p_D}{\frac{p_D + p_G}{2}}
\tag{10}
$$

where $C$ is a constant.

### 4.2 Generator

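We substitute equation 10 into equation 4. The left-hand terms of equation 4 (the sums over $B_+$ and $B_-$) give

$$
-\sum_{x \in B} \frac{1}{|B|} \ln \frac{p_D}{\frac{p_D + p_G}{2}} - \ln C = KL\left( \frac{p_D + p_G}{2} \,\Big\|\, p_D \right)
\tag{11}
$$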

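The gradient of the rhs, $\ln Z_B$, can be seen as biased Importance Sampling as well,

$$
\nabla \ln Z_B = -\frac{1}{Z_B} \sum_{x \in B} e^{-\mu_\theta(x)} \nabla \mu_\theta(x)
\tag{12}
$$

which optimizes $KL\left( p_D \,\big\|\, \frac{p_D + p_G}{2} \right)$. After removing the constants, we get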

$$
L_G = KL\left( \frac{p_D + p_G}{2} \,\Big\|\, p_D \right) + KL\left( p_D \,\Big\|\, \frac{p_D + p_G}{2} \right)
\tag{13}
$$

Thus optimizing the objective of the generator is equivalent to minimizing the Jensen-Shannon divergence between $p_D$ and $p_G$ with Importance Sampling.

### 4.3 Importance Sampling’s link to NCE

Note that Importance Sampling itself is strongly connected to NCE. As pointed out by [Link_NCE_IS] and [Link_NCE_IS_web], both Importance Sampling and NCE train the underlying generative model with a classification surrogate. The difference is that in NCE, a binary classification task is defined between true and noise samples with a logistic loss, whereas Importance Sampling replaces the logistic loss with a multiclass softmax and cross-entropy loss. We show the relationship between NCE, Importance Sampling, GAN and this work in Figure 1. Softmax GAN fills in the missing entry of the table.

### 4.4 Infinite batch size

As pointed out by bengio2008adaptive , biased Importance Sampling estimation converges to the real partition function when the number of samples in one batch goes to infinity. In practice, we found that setting is enough for generating images that are visually realistic.
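The convergence behaviour of the biased (self-normalized) estimator can be illustrated on a toy problem. The sketch below is not taken from the paper's experiments. It assumes a one-dimensional energy $\xi(x) = x^2/2$, so the model is a standard normal with $\mathbb{E}[x^2] = 1$, and a wider normal proposal $q$ with a hypothetical scale of 2. The weights follow $r(x') = e^{-\xi(x')}/q(x')$ from equation 8.

```python
import numpy as np

def energy(x):
    # Unnormalized standard normal: p(x) is proportional to exp(-x^2 / 2).
    return 0.5 * x ** 2

def proposal_pdf(x, scale):
    # Density of the proposal q = N(0, scale^2).
    return np.exp(-0.5 * (x / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

def biased_is_mean(f, n, scale=2.0, seed=0):
    """Self-normalized importance sampling estimate of E_p[f(x)]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, scale, size=n)
    r = np.exp(-energy(x)) / proposal_pdf(x, scale)  # unnormalized weights
    return np.sum(r * f(x)) / np.sum(r)              # divide by R = sum of r

# The exact value of E_p[x^2] is 1; the estimate approaches it as n grows.
for n in (10, 1000, 100000):
    print(n, biased_is_mean(np.square, n))
```

The estimator never needs the partition function of the model, only the unnormalized weights, which is the property the discriminator loss relies on.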

## 5 Experiments

We run experiments on image generation with the CelebA dataset. We show that although Softmax GAN minimizes the Jensen-Shannon divergence between the generated data and the real data, it is more stable than the original GAN and is less prone to mode collapse.

We implement Softmax GAN by modifying the loss function of the DCGAN code (https://github.com/carpedm20/DCGAN-tensorflow). As DCGAN is quite stable, we remove the empirical techniques applied in DCGAN and observe instability in its training. In contrast, Softmax GAN remains stable under these changes.

### 5.1 Stabilized training

We follow the WGAN paper by removing the batch normalization layers and using a constant number of filters in the generator network. We compare the results from GAN and Softmax GAN. The results are shown in Figure 2.

### 5.2 Mode collapse

When GAN training is not stable, the generator may generate samples similar to each other. This lack of diversity is called mode collapse because the generator can be seen as sampling only from some of the modes of the data distribution.

In the DCGAN paper, the pixels of the input images are normalized to . We remove this constraint and instead normalize the pixels to

. At the same time, we replace the leaky ReLU with ReLU, which makes the gradients sparser (this is unfavorable for GAN training according to the DCGAN authors). Under this setting, the original GAN suffers from a significant degree of mode collapse and low image quality. In contrast, Softmax GAN is robust to this change. Exemplars are shown in Figure 3.

### 5.3 Balance of generator and discriminator

It is claimed in DCGAN that manually balancing the number of iterations of the discriminator and generator is a bad idea. The WGAN paper, however, gives the discriminator phase more iterations to get a better discriminator for the training of the generator. We train with two different discriminator-to-generator iteration ratios and explore the effect of this ratio on DCGAN and Softmax GAN. The results are in Figures 4 and 5, respectively.

## 6 Conclusions and future work

We propose a variant of GAN which applies softmax across the samples in a minibatch and uses the cross-entropy loss for both the discriminator and the generator. The target is to assign all probability to real data for the discriminator and to assign probability equally to all samples for the generator. We proved that this objective approximately optimizes the JS-divergence using Importance Sampling. We further establish the links between GAN, NCE, Importance Sampling and Softmax GAN. We demonstrate with experiments that Softmax GAN consistently gets good results where the original GAN fails once the empirical techniques are removed.

In our future work, we will perform a more systematic comparison between Softmax GAN and other GAN variants and verify whether it works on tasks other than image generation.