Generative adversarial networks (GAN) have become popular for generating data that mimic observations by learning a suitable variable transformation from a random variable. However, empirically, GAN is known to suffer from instability. Also, the theory based on the minimax optimization formulation of GAN cannot explain the widely used practical procedure that employs the so-called log d trick. This paper provides a different theoretical foundation for generative adversarial methods that does not rely on the minimax formulation. We show that with a strong discriminator, it is possible to learn a good variable transformation via functional gradient learning, which updates the functional definition of a generator model, instead of updating only the model parameters as in GAN. The theory guarantees that the learned generator improves the KL-divergence between the probability distributions of real data and generated data after each functional gradient step, until the KL-divergence converges to zero. This new point of view leads to stable procedures for training generative models that can utilize arbitrary learning algorithms. It also gives a new theoretical insight into the original GAN procedure both with and without the log d trick. Empirical results on image generation illustrate the effectiveness of our new method.

## 1 Introduction

We consider observed real data from an unknown distribution $p_*$ on $\mathbb{R}^d$. Moreover, assume that we are given a random variable $Z$ with a known distribution such as a Gaussian. We are interested in learning a random variable transformation $G$ so that the generated data $G(Z)$ has a probability density function $p$ that is close to the real density $p_*$. This is the setting considered in generative adversarial networks (GAN) (Goodfellow et al., 2014), and the transformation $G$ is often referred to as a generator. While GAN has been widely used, it is also known that GAN is difficult to train due to its instability, which has led to numerous studies, e.g., Wasserstein GAN (WGAN) and its extensions that pursue a different minimax objective (Arjovsky et al., 2017; Gulrajani et al., 2017; Mroueh & Sercu, 2017), mode-regularized GAN to tackle the issue of mode collapse (Che et al., 2017), unrolled GAN (Metz et al., 2017), AdaGAN (Tolstikhin et al., 2017), MMD GAN (Li et al., 2017), and references therein.

An important concept introduced by GAN is the idea of an adversarial learner, denoted here by $d$, which tries to discriminate real data from generated data. Mathematically, GAN solves the following minimax optimization problem:

$$\max_d \min_G \left[\sum_{x_* \in \text{real data}} \ln d(x_*) + \sum_{G(z) \in \text{fake data}} \ln\big(1 - d(G(z))\big)\right]. \tag{1}$$

Parameterizing $d$ and $G$, (1) can be viewed as a saddle point problem in optimization, which can be solved using a stochastic gradient method, where one alternately takes a gradient step with respect to the parameters of $d$ and of $G$ (see Algorithm 4 below). However, the practical procedure suggested by the original work (Goodfellow et al., 2014) replaces minimization of $\ln(1 - d(G(z)))$ with respect to $G$ in (1) with maximization of $\ln d(G(z))$ with respect to $G$, called the log d trick. Thus, GAN with the log d trick, though often more effective, cannot directly be explained by the theory based on the minimax formulation (1).

This paper presents a new theory for generative adversarial methods which does not rely on the minimax formulation (1). Our theory shows that one can learn a good generator, where 'goodness' is measured by the KL-divergence between the distributions of real data and generated data, by using functional gradient learning (Friedman, 2001). However, unlike standard gradient boosting, which uses additive models, we consider functional compositions of the following form:

$$G_t(Z) = G_{t-1}(Z) + \eta_t\, g_t(G_{t-1}(Z)), \qquad (t = 1, \ldots, T) \tag{2}$$

to obtain $G_T$. Here $\eta_t$ is a small learning rate, and each $g_t$ is a function to be estimated from data. An initial generator $G_0$ is assumed to be given. We learn $g_t$ from data greedily from $t = 1$ to $T$ so that improvement (in terms of the KL-divergence) is guaranteed.
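The composition in (2) can be sketched directly in code. The following is a toy illustration, not the paper's implementation: the update functions $g_t$ and the learning rates here are hypothetical placeholders, whereas in the actual method each $g_t$ is derived from a learned discriminator.

```python
import numpy as np

def compose_generator(G0, gs, etas):
    """Build G_T from (2): G_t(z) = G_{t-1}(z) + eta_t * g_t(G_{t-1}(z))."""
    def G(z):
        x = G0(z)
        for g, eta in zip(gs, etas):
            x = x + eta * g(x)  # one composition step of (2)
        return x
    return G

# Toy usage: start from the identity and apply two shrinking steps.
G0 = lambda z: z
g = lambda x: -x                  # hypothetical update direction
G = compose_generator(G0, [g, g], [0.1, 0.1])
# G(z) = 0.9 * 0.9 * z = 0.81 * z
```

Note that the steps are applied in sequence to the output of the previous step, which is what distinguishes this composition from an additive (boosting-style) ensemble.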

Our theory leads to a new stable generative adversarial method. It also provides a new theoretical insight into the original GAN both with and without the log d trick. The experiments show the effectiveness of our new method on image generation in comparison with GAN variants.

##### Notation

Throughout the paper, we use $x$ to denote data in $\mathbb{R}^d$, and in particular, we use $x_*$ to denote real data. The probability density function of real data is denoted by $p_*$. We use $\|\cdot\|$ to denote the vector 2-norm and $\nabla f(x)$ to denote the gradient w.r.t. $x$ of a scalar function $f$.

## 2 Theory

To present our theory, we start by stating our assumptions. We then analyze one step of the random variable transformation in (2) (i.e., transforming $G_{t-1}(Z)$ to $G_t(Z)$) and examine an algorithm suggested by this analysis.

### 2.1 Assumptions and definitions

##### A strong discriminator

Given a set of real data and a set of generated data, assume that we can obtain a strong discriminator $D$ using logistic regression so that $d(x) = \big(1 + e^{-D(x)}\big)^{-1}$ tries to discriminate the real data from the generated data. Here we use the logistic model, and so $d$ in (1) corresponds to $\big(1 + e^{-D(x)}\big)^{-1}$. Define a quantity $D_*(x)$ by

$$D_*(x) := \ln \frac{p_*(x)}{p(x)}$$

where $p_*$ and $p$ are the probability density functions of real data and generated data, respectively, and assume that $D_*$ is well-defined; $p_*$ and $p$ are thus assumed to be nonzero everywhere. When the number of given examples is sufficiently large, the standard statistical consistency theory of logistic regression (see e.g., (Zhang, 2004)) implies that $D \approx D_*$. Therefore, assume that the following $\epsilon$-approximation condition is satisfied for a small $\epsilon \ge 0$:

$$\int q_*(x)\Big(\big|D(x) - D_*(x)\big| + \big|e^{D(x)} - e^{D_*(x)}\big|\Big)\, dx \le \epsilon, \qquad q_*(x) = p_*(x)\max\big(1, \|\nabla \ln p_*(x)\|\big).$$

Note that the assumption of an optimal discriminator has been commonly used; we slightly relax it to a strong discriminator by quantifying the deviation from the optimum by $\epsilon$.
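The consistency claim above can be sanity-checked on a toy case (an illustration, not the paper's setup): for real data $p_* = N(1,1)$ and generated data $p = N(0,1)$, the true log density ratio is $\ln p_*(x)/p(x) = x - 1/2$, so the logit of a logistic-regression discriminator should approach slope $1$ and intercept $-1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(1.0, 1.0, 4000)   # samples from p_* = N(1,1)
fake = rng.normal(0.0, 1.0, 4000)   # samples from p   = N(0,1)
X = np.concatenate([real, fake])
y = np.concatenate([np.ones(4000), np.zeros(4000)])

# Plain logistic regression by gradient descent; the logit is D(x) = w*x + b
w, b = 0.0, 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))
    w -= 0.1 * np.mean((p - y) * X)
    b -= 0.1 * np.mean(p - y)
# w approaches 1 and b approaches -0.5, the true log ratio x - 1/2
```

With enough samples, the learned logit $D$ is close to $D_*$ in exactly the sense the $\epsilon$-approximation condition formalizes.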

##### Smooth and light-tailed p∗

Assume that $p_*$, the density of real data, is smooth with light tails; we use a constant $c$ that depends on the shape of $p_*$. Due to the space limit, the precise statements are deferred to the Appendix. Common exponential-family distributions such as Gaussian distributions and mixtures of Gaussians all satisfy the assumption, and an arbitrary distribution can always be approximated to an arbitrary precision by a mixture of Gaussians.

### 2.2 Analyzing one step of random variable transformation

The goal is to approximate the true density $p_*$ on $\mathbb{R}^d$ through (2). Our analysis here focuses on one step of (2) at time $t$, namely, the random variable transformation of $G_{t-1}(Z)$ to $G_t(Z)$. To simplify notation, we assume that we are given a random variable $X$ with a probability density $p$ on $\mathbb{R}^d$. We are interested in finding a function $g$ so that the transformed variable $X' = X + \eta\, g(X)$ for small $\eta > 0$ has a distribution closer to $p_*$. We show that this can be achieved with a gradient-like step in the function space.

To measure the distance of a density $p$ from the true density $p_*$, we will keep track of the KL-divergence

$$L(p) = \int p_*(x) \ln \frac{p_*(x)}{p(x)}\, dx \tag{3}$$

before and after the transformation. We know that $L(p) \ge 0$ for all $p$, and $L(p) = 0$ if and only if $p = p_*$.

The following theorem shows that with an appropriately chosen $g$, the transformation $X' = X + \eta\, g(X)$ can always reduce the KL-divergence $L(p)$. This means that the transformation $X'$ is an improvement over $X$. The proof is given in the Appendix.

###### Theorem 2.1

Under the assumptions in Section 2.1, let $g$ be a continuously differentiable transformation such that $\|g(x)\|$ and $\|\nabla g(x)\|$ are bounded. Let $p$ be the probability density of a random variable $X$, and let $p'$ be the probability density of the random variable $X'$ such that $X' = X + \eta\, g(X)$, where $\eta > 0$. Then there exists a positive constant $c$ such that for all $\eta$:

$$L(p') \le L(p) - \eta \int p_*(x)\, g(x)^\top \nabla D(x)\, dx + c\eta^2 + c\eta\epsilon.$$

The consequences of the theorem become clear when we choose $g(x) = s(x)\nabla D(x)$ (where $s(x) > 0$ is an arbitrary scaling factor). By doing so and letting $\epsilon \to 0$, we have:

$$L(p') \le L(p) - \eta \int p_*(x)\, s(x)\, \|\nabla D(x)\|_2^2\, dx + O(\eta^2). \tag{4}$$

This means that by letting $X' = X + \eta\, s(X)\nabla D(X)$, the objective value $L(p)$ will be reduced for a sufficiently small $\eta$ unless $\int p_*(x) s(x)\|\nabla D(x)\|_2^2\, dx$ vanishes. The vanishing condition implies that $D(x)$ is approximately a constant when $p_*$ has full support on $\mathbb{R}^d$. In this case, the discriminator is unable to differentiate the real data from the generated data. Thus, it is implied that letting $X' = X + \eta\, s(X)\nabla D(X)$ makes the probability density of generated data closer to that of real data until the discriminator becomes unable to distinguish the real data from the generated data.

We note that taking $X' = X + \eta\, s(X)\nabla D(X)$ is analogous to a gradient descent step of $L(p)$ in the function space, so that a step is taken to modify the function instead of the model parameters. Therefore, our theory leads to a functional gradient view of variable transformation that can always improve the quality of the generator, when quality is measured by the KL-divergence between the distributions of the true data and the generated data.
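A one-dimensional example makes the improvement explicit. Take $p_* = N(0,1)$ and $p = N(m,1)$; then $D(x) = \ln p_*(x)/p(x) = m^2/2 - mx$, so $\nabla D(x) = -m$, and the step $X' = X + \eta\nabla D(X)$ (with $s(x) = 1$) moves the generated mean from $m$ to $m(1-\eta)$, shrinking $L(p) = m^2/2$:

```python
def L(m):
    """L(p) in (3) for p_* = N(0,1), p = N(m,1): KL(p_* || p) = m^2 / 2."""
    return m * m / 2.0

m, eta = 2.0, 0.1
grad_D = -m                 # ∇D(x) = -m (constant in x for this pair)
m_after = m + eta * grad_D  # mean of X' = X + η ∇D(X), i.e. m(1 - η)
# L drops from 2.0 to 1.62 after one functional gradient step
```

Iterating the step drives $m$ geometrically toward $0$, matching the claim that the KL-divergence keeps decreasing until the two densities coincide.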

If we repeat the process described above, Algorithm 1 is obtained. We call it composite functional gradient learning of GAN (CFG-GAN), or simply CFG. CFG forms $G_t$ by directly using the functional gradient $s(x)\nabla D_t(x)$, as suggested by our theory. If (4) holds, and if we choose a constant learning rate $\eta_t = \eta$, then by cascading (4) from $t = 1$ to $T$ and using $L(p_T) \ge 0$, we obtain a bound of the form

$$\frac{1}{T}\sum_{t=1}^{T} \int p_*(x)\, s(x)\, \|\nabla D_t(x)\|_2^2\, dx \le \frac{L(p_0)}{\eta T} + O(\eta),$$

where $p_t$ is the density of $G_t(Z)$. This means that $\int p_*(x) s(x)\|\nabla D_t(x)\|_2^2\, dx \to 0$ as $t$ increases, and thus $D_t$ approaches a constant, assuming $p_*$ has full support on $\mathbb{R}^d$. That is, in the limit, the discriminator is unable to differentiate the real data from the generated data.
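Algorithm 1 is referenced but not reproduced in this excerpt; the following is a minimal sketch of the CFG loop under our reading of the text. The `fit_grad_D` routine is a hypothetical stand-in for fitting a discriminator to (real, generated) data and returning the gradient of its logit.

```python
import numpy as np

def cfg(z, G0, fit_grad_D, s, etas):
    """Sketch of CFG: repeatedly fit a discriminator against the current
    generated data, then grow the generator by one functional step (2)."""
    steps = []                                 # (eta_t, grad_D_t) pairs
    def G(zz):
        x = G0(zz)
        for eta, grad_D in steps:
            x = x + eta * s(x) * grad_D(x)     # x -> x + eta s(x) ∇D_t(x)
        return x
    for eta in etas:
        steps.append((eta, fit_grad_D(G(z))))  # D_t fit on current fakes
    return G

# Toy run: real data ~ N(0,1); fakes start at N(2,1). For unit-variance
# Gaussians the ideal gradient of D is minus the generated mean.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, 1000)
fit_grad_D = lambda fake: (lambda x, m=fake.mean(): -m)
G = cfg(z, lambda zz: zz + 2.0, fit_grad_D, lambda x: 1.0, [0.5] * 4)
# the mean of G(z) shrinks toward the real mean 0
```

Note how the generator is never parameterized: it is the accumulated list of discriminator-derived steps, which is exactly why its size grows with the number of iterations.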

We note that (Lazarow et al., 2017) describes a related but different cascading process motivated by Langevin dynamics sampling. The Langevin theory requires repeated noise addition in the generation process. Our generation is simpler as there is no need for noise addition.

## 3 Composite Functional Gradient Algorithms

Starting from the CFG algorithm above, we empirically explored algorithms for image generation. In this section we parameterize the models and denote the discriminator parameters by $\theta_D$ and the generator (or approximator) parameters by $\theta_G$.

### 3.1 ICFG: Incremental CFG

The first change made to CFG is to make it incremental, similar to GAN, resulting in incremental CFG (ICFG, Algorithm 2). CFG (Algorithm 1) optimizes the discriminator to convergence in every iteration. On image generation, this tends to cause the discriminator to overfit, which turned out to be much more harmful than underfitting, as an overfit discriminator grossly violates the $\epsilon$-approximation condition (Section 2.1), thus grossly pushing the generator in a wrong direction. Moreover, updating the discriminator to convergence is computationally impractical, regardless of whether doing so is helpful or harmful. ICFG instead updates the discriminator little by little, interleaved with the generator updates, so that the generator keeps providing new and more challenging examples that prevent the discriminator from overfitting.

Note that here we broadly use the term "overfit" for failure to generalize to unseen data (after fitting the observed data). This includes failure on unseen data in the low-density region according to our assumption, which is outside of the manifolds according to the view of disjoint low-dimensional data manifolds of (Arjovsky & Bottou, 2017); our observation is thus in line with (Arjovsky & Bottou, 2017) in that the shape of the distributions can be problematic. However, we handle the issue differently. Instead of changing the loss (leading to WGAN), we deal with it by early stopping of discriminator training (like GAN) and functional gradient learning in the generator update (unlike GAN). The latter ensures improvement of the generator so that it keeps challenging the discriminator; we will revisit this point later.

ICFG shares nice properties with CFG. The generator is guaranteed to improve with each update to the extent that the assumptions hold; therefore, it is expected to be stable. There is no need to design a complex generator model. The generator model is automatically and implicitly derived from the discriminator and grows as training proceeds (see Figure 1). A shortcoming, however, is that the implicit generator network can become very large. At time $t$, computation of $G_t(z)$ starting from scratch takes $O(t)$ time; therefore, performing $T$ iterations of training could take $O(T^2)$ time. We found that image generation requires a $T$ so large that it becomes computationally problematic, causing slow training and slow generation.

Nitanda & Suzuki (2018) recently proposed a related method for fine-tuning WGAN based on different motivations. We note that their method would suffer from the same issue if used for image generation from scratch.

A partial remedy, which speeds up training (but not generation), is to maintain an input pool of a fixed size. That is, we restrict the input distribution to a finite set of pooled inputs, and for every input $z$ in the pool, we maintain $G_\tau(z)$ for the latest $\tau$ for which it was computed. By doing so, when $G_t(z)$ needs to be computed, one can start from $G_\tau(z)$ instead of starting over from $G_0(z)$, which saves computation time. However, this remedy solves only part of the problem.
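The pool remedy is essentially memoization over a fixed set of inputs; a sketch, with hypothetical names:

```python
class InputPool:
    """Fixed pool of inputs; cache[i] = (tau, G_tau(z_i)), so computing
    G_t(z_i) resumes from layer tau instead of restarting at G_0."""
    def __init__(self, zs, G0):
        self.cache = [(0, G0(z)) for z in zs]

    def value(self, i, steps):
        tau, x = self.cache[i]
        for step in steps[tau:]:      # apply only the not-yet-applied layers
            x = step(x)
        self.cache[i] = (len(steps), x)
        return x

pool = InputPool([3.0], lambda z: z)
steps = [lambda x: x + 1.0]           # layer 1: x -> x + 1
v1 = pool.value(0, steps)             # computes G_1(3) = 4 from scratch
steps = steps + [lambda x: x * 2.0]   # layer 2 added later by training
v2 = pool.value(0, steps)             # resumes from G_1(3): (3 + 1) * 2 = 8
```

Each pooled input pays for each generator layer only once during training; generation for fresh inputs, however, still has to replay every layer, which is the part of the problem this remedy does not solve.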

### 3.2 xICFG: Approximate incremental CFG

As a complete solution to the issue of large generators, we propose approximate ICFG (xICFG, Algorithm 3). xICFG periodically compresses the generator obtained by ICFG by training an approximator $\tilde{G}$ of a fixed size that approximates the behavior of that generator. That is, given a definition of an approximator and its initial state, xICFG repeatedly alternates the following:

• Using the approximator $\tilde{G}$ as the initial generator, perform $T$ iterations of ICFG to obtain generator $G$.

• Update the approximator $\tilde{G}$ so that $\tilde{G}(z) \approx G(z)$.

The generator size is again in $O(T)$, but unlike in ICFG, $T$ for xICFG can be small (e.g., $T = 25$), and thus xICFG is efficient. We use the idea of an input pool above for speeding up training, and instead of keeping the same pool to the end, we refresh the pool in every iteration of xICFG (Lines 2&3 of Algorithm 3). For speed, the values $G_t(z)$ for the latest $t$ should be kept, not only for use in ICFG but also for preparing the training data for the approximator (Line 6).

A small pool size and a small $T$ would reduce the runtime of one iteration, but they would increase the number of required iterations, as they reduce the amount of improvement achieved by one iteration of xICFG, and so a trade-off should be found empirically. In particular, approximation typically causes some degradation, and so it is important to set $T$ and the pool size to sufficiently large values so that the amount of generator improvement exceeds the amount of degradation caused by approximation. In our experiments, however, tuning of the meta-parameters turned out to be relatively easy; essentially one set of meta-parameters achieved stable training in all the tested settings across datasets and network architectures, as described later.
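The compression step is ordinary regression of the approximator onto the grown generator's outputs. A sketch, with a linear approximator standing in (hypothetically) for the approximator network:

```python
import numpy as np

def distill(G, z_batch):
    """Fit a(z) = w*z + b to G's outputs by least squares -- the xICFG
    compression step, with a linear model in place of the network."""
    A = np.stack([z_batch, np.ones_like(z_batch)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, G(z_batch), rcond=None)
    return lambda z: w * z + b

G = lambda z: 3.0 * z + 1.0           # stand-in for a generator grown by ICFG
approx = distill(G, np.linspace(-1.0, 1.0, 50))
```

When the approximator family can represent the generator well, as in this toy case, the compression is nearly lossless; in general it is lossy, which is why $T$ must be large enough for ICFG's improvement to outweigh the approximation error.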

### 3.3 Relation to GAN

We show that GAN (Algorithm 4) with the logistic model (as is typically done) is closely related to a special case of xICFG that uses an extreme setting. This viewpoint leads to a new insight into GAN’s instability.

We start with the fact that GAN with the logistic model (and so $d(x) = (1 + e^{-D(x)})^{-1}$) and ICFG share the discriminator update procedure, as both minimize the logistic loss. This fact becomes more obvious when we plug $d(x) = (1 + e^{-D(x)})^{-1}$ into Line 5 of Algorithm 4.

Next, we show that the generator update of GAN is equivalent to coarsely approximating a generator produced by ICFG with $T = 1$. First note that GAN's generator update (Line 8 of Algorithm 4) requires the gradient $\nabla_{\theta_G} \ln(1 - d(G(z)))$. Using $d(x) = (1 + e^{-D(x)})^{-1}$ again, and writing $[v]_k$ for the $k$-th component of vector $v$, the $k$-th component of this gradient can be written as:

$$\big[\nabla_{\theta_G} \ln(1 - d(G(z)))\big]_k = \left[\nabla_{\theta_G} \ln \frac{\exp(-D(G(z)))}{1 + \exp(-D(G(z)))}\right]_k = -s_0(G(z)) \sum_j [\nabla D(G(z))]_j\, \frac{\partial [G(z)]_j}{\partial [\theta_G]_k} \tag{5}$$

where $s_0(x) := (1 + \exp(-D(x)))^{-1}$, resulting from differentiating $\ln(1 - d(x))$ with respect to $D(x)$. Now suppose that we apply ICFG with $T = 1$ to a generator $G$ to obtain a new generator $G'$ and then we update $G$ to approximate $G'$ so that the squared error $\|G(z) - G'(z)\|_2^2$ is minimized as in Line 6 of xICFG. To take one step of gradient descent for this approximation, we need the gradient of the squared error with respect to $\theta_G$, and its $k$-th component is $-2\eta\, s(G(z)) \sum_j [\nabla D(G(z))]_j\, \partial [G(z)]_j / \partial [\theta_G]_k$. By choosing the scaling factor $s(x)$ proportional to $s_0(x)$, this is exactly the gradient (5) required for the GAN generator update, up to a constant factor. (Recall that our theory and algorithms accommodate an arbitrary data-dependent scaling factor $s(x)$.)
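The chain-rule identity in (5) can be verified numerically with scalar stand-ins (a toy $G$ and $D$, not the paper's networks): the derivative of $\ln(1 - d(G(\theta)))$ with respect to $\theta$ equals $-s_0(G)\cdot D'(G)\cdot \partial G/\partial\theta$ with $s_0 = (1 + e^{-D})^{-1}$.

```python
import math

z = 1.5
G = lambda th: th * z                        # toy generator with parameter th
D = lambda x: 2.0 * x - 1.0                  # toy discriminator logit
d = lambda x: 1.0 / (1.0 + math.exp(-D(x)))  # logistic model

theta, h = 0.3, 1e-6
loss = lambda th: math.log(1.0 - d(G(th)))
numeric = (loss(theta + h) - loss(theta - h)) / (2.0 * h)

s0 = 1.0 / (1.0 + math.exp(-D(G(theta))))    # s0(x) = (1 + e^{-D(x)})^{-1}
analytic = -s0 * 2.0 * z                     # -s0(G) * D'(G) * dG/dtheta
# numeric and analytic agree to finite-difference precision
```

The central difference matches the closed form, confirming that GAN's generator gradient is the functional gradient $\nabla D$ reweighted by $s_0$ and pushed through $\partial G/\partial\theta$.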

Thus, algorithmically GAN is closely related to a special case of xICFG that does the following:

• Set $T = 1$ so that ICFG updates the generator just once.

• To update the approximator, take only one gradient descent step with only one mini-batch, instead of optimizing to convergence with many examples. Therefore, the approximation can be poor.

The same argument applies to the log d trick variant of GAN by replacing $s_0(x) = (1 + \exp(-D(x)))^{-1}$ with $s_1(x) = (1 + \exp(D(x)))^{-1}$. When the generated data is very far from the real data and so $D(G(z)) \ll 0$, we have $s_0(G(z)) \approx 0$ (without the log d trick), which would make the gradient (required for updating the GAN generator) vanish, as noted in (Goodfellow et al., 2014), even though the generator is poor and so requires updating. In contrast, we have $s_1(G(z)) \approx 1$ (with the log d trick) in this poor-generator situation, which is more sensible as well as more similar to our choice ($s(x) = 1$) in the xICFG experiments.
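The vanishing-gradient contrast is easy to quantify: for a confidently rejected sample, say $D(G(z)) = -10$, the weight $s_0$ is negligible while $s_1$ is nearly $1$.

```python
import math

s0 = lambda D: 1.0 / (1.0 + math.exp(-D))  # weight without the log d trick
s1 = lambda D: 1.0 / (1.0 + math.exp(D))   # weight with the log d trick

D_far = -10.0                              # discriminator is sure it's fake
# s0(D_far) is about 4.5e-5: the GAN generator gradient nearly vanishes
# s1(D_far) is about 1.0:    the log d trick keeps a usable gradient
```

At $D = 0$ (an undecided discriminator) both weights equal $1/2$, so the two variants only diverge when the discriminator is confident.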

##### Why is GAN unstable?

In spite of their connection, GAN is unstable, while xICFG with appropriate meta-parameters is stable (shown later). Thus, we conclude that GAN's instability derives from what is unique to GAN, namely the two bullets above: an extremely small $T$ and coarse approximation. Either can cause degradation of the generator, leading to instability.

We have contrasted GAN's generator update with xICFG's approximator update. Now we compare it with ICFG's generator update to consider the algorithmic merits of our functional gradient approach. The short-term goal of the generator update can be regarded as increasing the discriminator output on generated data, i.e., to have $D(G'(z)) > D(G(z))$ for any $z$. ICFG updates the generator by $G'(z) = G(z) + \eta\, s(G(z))\nabla D(G(z))$, and so with small $\eta$, $D(G'(z))$ is guaranteed to increase for any $z$. This is because by definition $\nabla D(G(z))$ is the direction that increases the discriminator output for $G(z)$, and it is precisely obtained on the fly for every $z$ at the time of generation. By contrast, GAN stochastically and approximately updates $\theta_G$ using a small sample (one mini-batch SGD step backpropagating through the discriminator), and so GAN's update can be noisy, which can lead to instability through generator degradation.

## 4 Experiments

We tested xICFG on the image generation task.

### 4.1 Experimental setup

##### Baseline methods

For comparison, we also tested the following three methods: the original GAN without the log d trick (GAN0 in short), GAN with the log d trick (GAN1), and WGAN with the gradient penalty (WGANgp) (Gulrajani et al., 2017). GAN0 and GAN1 were chosen due to their relation to xICFG, as analyzed above. WGANgp was chosen as a representative of state-of-the-art methods, as it was shown to rival or outperform a number of previous methods such as the original WGAN with weight clipping (Arjovsky et al., 2017), Least Squares GAN (Mao et al., 2017), Boundary Equilibrium GAN (Berthelot et al., 2017), GAN with denoising feature matching (Warde-Farley & Bengio, 2017), and Fisher GAN (Mroueh & Sercu, 2017).

##### Evaluation metrics

Making reliable likelihood estimates with generative adversarial models is known to be challenging (Theis et al., 2016), so we instead focused on evaluating the visual quality of generated images, using datasets that come with labels for classification. We measured the inception score (Salimans et al., 2016). The intuition behind this score is that high-quality images should lead to high confidence in classification. It is defined as $\exp\big(\mathbb{E}_x\, \mathrm{KL}(p(y|x)\,\|\,p(y))\big)$, where $p(y|x)$ is the label distribution conditioned on generated data $x$ and $p(y)$ is the label distribution over the generated data. Following previous work, e.g., (Yang et al., 2017; Che et al., 2017), the probabilities were estimated by a classifier trained with the labels provided with the datasets (instead of the ImageNet-trained inception model used in (Salimans et al., 2016)) so that the image classes of interest were well represented in the classifier. We nevertheless call this score the 'inception score', following custom. We also compared the label distributions over generated data and real data, but we found that in our settings this measure roughly correlates with the inception score (generally, a very good match when the methods produce decent inception scores), and so we do not report it, to avoid redundancy. We note that these metrics are limited; e.g., they would not detect mode collapse or missing modes within a class. Apart from that, we found the inception score to generally correspond well to human perception.

##### Data

We used MNIST, the Street View House Numbers dataset (SVHN) (Netzer et al., 2011), and the large-scale scene understanding (LSUN) dataset. These datasets are provided with class labels (digits '0'–'9' for MNIST and SVHN, and 10 scene types for LSUN). A number of studies have used only one LSUN class ('bedroom'). Since a single-class dataset would preclude evaluation using class labels, we instead generated a balanced two-class dataset using the same number of images from the 'bedroom' class and the 'living room' class (LSUN BR+LR). Similarly, we generated a balanced dataset from 'tower' and 'bridge' (LSUN T+B). The number of real images used for training was 60K (MNIST), 521K (SVHN), 2.6 million (LSUN BR+LR), and 1.4 million (LSUN T+B). The LSUN images were shrunk and cropped to 64×64 as in previous studies. The pixel values were scaled into $[-1, 1]$.

##### Network architectures

The tested methods require as input a network architecture of a discriminator and that of an approximator or a generator. Among the numerous network architectures we could experiment with, we focused on two types with two distinct merits – good results and simplicity.

The first type (convolutional; stronger) aims at a complexity appropriate for the dataset so that good results can be obtained. On MNIST and SVHN, we used an extension of DCGAN (Radford et al., 2015) with 1×1 convolution layers added. The larger (64×64) images of LSUN were found to benefit from more complex networks, and so we used a residual network (ResNet) (He et al., 2015) of four residual blocks, a simplification of the one in the WGANgp code release, for both the discriminator and the approximator/generator. Details are given in the Appendix.

These networks include batch normalization layers (Ioffe & Szegedy, 2015). The original WGANgp study states that WGANgp does not work well with a discriminator that has batch normalization. Although it would be ideal to use exactly the same networks for all the methods, it would be unfair to the other methods if we always removed batch normalization. Therefore, in each setting, we tested WGANgp with the options of removing batch normalization either only from the discriminator or from both the discriminator and the generator, and picked the best. (We also tried other normalizations such as layer normalization but did not see any merit.) In addition, we tested some cases without batch normalization anywhere for all the methods.

The second type (fully-connected; weaker) uses a minimally simple approximator/generator, consisting of two 512-dim fully-connected layers with ReLU, followed by an output layer with tanh; this has the merit of simplicity, requiring less design effort. We combined it with a convolutional discriminator, the DCGAN extension above.

##### xICFG implementation details

To speed up training, we limited the number of epochs of approximator training in xICFG to 10, while reducing the learning rate by a factor of 0.1 whenever the training loss stopped going down. The scaling function $s(x)$ in ICFG was set to 1. To initialize the approximator for xICFG, we first created a simple generator consisting of a projection layer with random weights (Gaussian with 0 mean and 0.01 stddev) to produce the desired dimensionality, and then trained the approximator to approximate this generator. The training time reported below includes the time spent on this initialization.

##### Other details

In all cases, the prior was set to generate 100-dimensional Gaussian vectors with zero mean and standard deviation 1. All experiments were done using a single NVIDIA Tesla P100.

The meta-parameter values for xICFG were fixed to those in Table 1, except when we pursued smaller values of $T$ for practical advantages (described below). For GAN, we used the same mini-batch size as xICFG, and we set the discriminator update frequency to 1, as other values led to poorer results. The SGD update was done with rmsprop (Tieleman & Hinton, 2012) for xICFG and GAN. The learning rate for rmsprop was fixed for xICFG, but we tried several values for GAN, as it turned out to be critical. Similarly, for xICFG, we found it important to set the step size $\eta$ for the generator update in ICFG to an appropriate value. The SGD update for WGANgp was done with Adam (Kingma & Ba, 2015) as in the original study. We set the meta-parameters for WGANgp to the suggested values, except that we tried several values for the learning rate. Thus, the amount of tuning effort was about the same for all methods but WGANgp, which required an additional search over the normalization options. Tuning was done based on the inception score on a validation set of 10K input vectors (i.e., 10K 100-dim Gaussian vectors), and we report inception scores on a test set of 10K input vectors, disjoint from the validation set.

### 4.2 Results

First, we report the inception score results. The scores of the real data (a held-out set of 10K images) are 9.91 (MNIST), 9.13 (SVHN), 1.84 (LSUN BR+LR), and 1.90 (LSUN T+B), which roughly set the upper bounds achievable by generated images. Figure 2 shows the score of generated images (in relation to training time) with the convolutional networks, including the two cases without batch normalization anywhere for all the methods (upper right and lower right). Recall that a smaller $T$ has the practical advantages of a smaller generator, resulting in faster generation and a smaller footprint, while a larger $T$ stabilizes xICFG training by ensuring that training makes enough progress to overcome the degradation caused by approximation. With convolutional approximators, we explored values of $T$ in decrements of 5 starting from $T = 25$ (which works well for all) and found that $T$ can be reduced to 5 (SVHN), 10 (MNIST), and 15 (both LSUN datasets) without negative consequences. The results in Figure 2 were obtained using these smaller values of $T$. xICFG generally outperforms the others. Although on the LSUN datasets GAN1 occasionally exceeds xICFG, inspection of generated images reveals that it suffers from severe mode collapse. The results with the simple but weak fully-connected approximator/generator are shown in Figure 3. Among the baseline methods, only WGANgp succeeded in this setting, but its score fell behind xICFG. These results show that xICFG is effective and efficient.

Examples of generated images are shown in Figures 7–9. Note, however, that a small set of images may not represent the entire population well due to variability. Looking through larger sets of generated images, we found that, roughly, when the inception score is higher, the images are sharper and/or there are fewer images whose content is hard to make out, and that when the score fluctuates violently (as GAN1's does on LSUN), severe mode collapse is observed. Overall, we feel that the images generated by xICFG are better than or at least as good as those of the best-performing baseline, WGANgp, one of the state-of-the-art methods.

##### Discriminator output values

Successful training should make it harder and harder for the discriminator to distinguish real images from generated images, which should manifest as the discriminator output values for real images and generated images becoming closer and closer. In Figure 11, one axis is the inception score, and the other is the difference between the discriminator output values for real images and generated images ($\delta$ in short), averaged over time intervals of a fixed length, obtained as a by-product of the forward propagation for updating the discriminator. The arrows indicate the direction of time flow. When training is going well (indicated by blue solid arrows), $\delta$ decreases and the inception score improves as training proceeds. When training is failing, $\delta$ goes up rapidly and the inception score degrades rapidly (red dotted arrows in Figure 11, right). Here, the discriminator is overfitting to the small set of real data (1000 MNIST examples), violating the $\epsilon$-approximation assumption. That slows down and eventually stops the progress of the generator, resulting in the increase of $\delta$. In practice, training should be stopped before the rapid growth of $\delta$. Thus, the decrease/increase of $\delta$ (which can be obtained at almost no cost during training) can be used as an indicator of the status of xICFG training, similar to WGAN and in contrast to GAN.

##### Use of the approximator as a generator

As noted above, a smaller generator resulting from a smaller $T$ has practical advantages, but a larger $T$ stabilizes training. One way to reduce the generator size without reducing $T$ is to use the final approximator (after completing regular xICFG training) as a generator, at the expense of some performance degradation. We show below the inception scores of the final approximator (the input to the final call of ICFG) in comparison with the final generator (the output of the final call of ICFG) in the settings of Figure 2 (convolutional).

Although the final approximator underperforms the final generator as expected, it rivals and sometimes exceeds the best baseline method, whose generator has the same size as the approximator. Thus, use of the final approximator may be a viable option. The results also indicate that the good performance of xICFG is due not only to having a larger (and therefore more complex) generator but also to stable and efficient training, which makes it possible to refine the approximator network to the degree that it can outperform the baseline methods.

##### Not memorization

Finally, in Figure 10 we show examples indicating that xICFG does more than memorization.

## 5 Conclusion

In the generative adversarial learning setting, we considered a generator that can be obtained using composite functional gradient learning. Our theoretical results led to the new stable algorithm xICFG. The experimental results showed that xICFG generated equally good or better images than GAN and WGAN variants in a stable manner.

## References

• Arjovsky & Bottou (2017) Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
• Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.
• Berthelot et al. (2017) Berthelot, D., Schumm, T., and Metz, L. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
• Che et al. (2017) Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
• Friedman (2001) Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Statist., 29(5):1189–1232, 2001. ISSN 0090-5364.
• Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.
• Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
• He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
• Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning (ICML), 2015.
• Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR), 2015.
• Lazarow et al. (2017) Lazarow, J., Jin, L., and Tu, Z. Introspective neural networks for generative modeling. In Proceedings of International Conference on Computer Vision (ICCV), 2017.
• Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
• Mao et al. (2017) Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Smolley, S. P. Least squares generative adversarial networks. arXiv:1611.04076, 2017.
• Metz et al. (2017) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
• Mroueh & Sercu (2017) Mroueh, Y. and Sercu, T. Fisher GAN. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
• Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
• Nitanda & Suzuki (2018) Nitanda, A. and Suzuki, T. Gradient layer: Enhancing the convergence of adversarial training for generative models. arXiv:1801.02227, 2018.
• Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
• Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.
• Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In Proceedings of International Conference on Learning Representations (ICLR), 2016.
• Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
• Tolstikhin et al. (2017) Tolstikhin, I., Gelly, S., Bousquet, O., Simon-Gabriel, C.-J., and Schölkopf, B. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
• Warde-Farley & Bengio (2017) Warde-Farley, D. and Bengio, Y. Improving generative adversarial networks with denoising feature matching. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
• Yang et al. (2017) Yang, J., Kannan, A., Batra, D., and Parikh, D. LR-GAN: Layered recursive generative adversarial networks for image generation. In Proceedings of International Conference on Learning Representations (ICLR), 2017.
• Zhang (2004) Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004.

## Appendix A Main theorem and its proof

Theorem A.1 below, our main theorem, analyzes the extended KL-divergence $L_\beta$ for some $\beta$, defined as follows:

$$ L_\beta(p) := \int \big(\beta p_*(x) + (1-\beta) p(x)\big) \ln \frac{\beta p_*(x) + (1-\beta) p(x)}{(1-\beta) p_*(x) + \beta p(x)}\, dx\,. $$
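As a numeric sanity check of this definition (with hypothetical discrete distributions standing in for the densities): at $\beta = 1$ the extended divergence reduces to the ordinary KL-divergence $\mathrm{KL}(p_* \| p)$, and at $\beta = 1/2$ the integrand is $\ln 1 = 0$ everywhere. A minimal sketch:

```python
import numpy as np

def L_beta(p_star, p, beta):
    """Extended KL-divergence L_beta between two discrete distributions."""
    num = beta * p_star + (1.0 - beta) * p
    den = (1.0 - beta) * p_star + beta * p
    return np.sum(num * np.log(num / den))

# two strictly positive distributions on a common support (illustrative values)
p_star = np.array([0.1, 0.2, 0.3, 0.4])
p      = np.array([0.25, 0.25, 0.25, 0.25])

kl = np.sum(p_star * np.log(p_star / p))   # ordinary KL(p_* || p)

print(L_beta(p_star, p, 1.0))   # equals KL(p_* || p)
print(L_beta(p_star, p, 0.5))   # integrand is ln(1) = 0 everywhere
```

Note that for any $\beta$, both the numerator and denominator mixtures are themselves probability distributions, so $L_\beta$ is a KL-divergence between two mixtures and hence nonnegative.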

Theorem 2.1 above is a simplified version of Theorem A.1, and it can be obtained by setting $\beta = 1$ in Theorem A.1, in which case $L_1(p) = \int p_*(x) \ln\frac{p_*(x)}{p(x)}\,dx$ is the KL-divergence. The assumption of a smooth and light-tailed $p_*$ in Section 2.1 is stated precisely in Assumption A.1 below.

We first state the assumptions and then present Theorem A.1 and its proof.

### A.1 Assumptions of Theorem A.1

##### A strong discriminator

Given a set $S_*$ of real data and a set $S$ of generated data, with $\beta$ indicating that the probability of an example being real is $\beta$ for $x_i \in S_*$ and $1-\beta$ for $x_i \in S$, assume that we can obtain a strong discriminator $D_\beta$ that solves the following weighted logistic regression problem:

$$ D_\beta \approx \arg\min_{D'} \sum_{i:\, x_i \in S_* \cup S} w_i \left[ \beta_i \ln\!\left(1 + e^{-D'(x_i)}\right) + (1-\beta_i) \ln\!\left(1 + e^{D'(x_i)}\right) \right], $$
$$ (\beta_i, w_i) = \begin{cases} (\beta,\ 1/|S_*|) & \text{for } x_i \in S_* \text{ (real data)} \\ (1-\beta,\ 1/|S|) & \text{for } x_i \in S \text{ (generated data)} \end{cases} $$
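To make the weighted objective concrete, here is a minimal numerical sketch with a linear discriminator $D'(x) = ax + c$ on 1-D data and $\beta = 1$ (i.e., standard real-vs-fake labels with per-set weights); the data distributions, step size, and iteration count are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
S_star = rng.normal(+2.0, 1.0, size=400)   # "real" samples (illustrative)
S      = rng.normal(-2.0, 1.0, size=300)   # "generated" samples (illustrative)
beta = 1.0                                  # beta = 1: plain real-vs-fake labels

x = np.concatenate([S_star, S])
b = np.concatenate([np.full(S_star.size, beta), np.full(S.size, 1.0 - beta)])
w = np.concatenate([np.full(S_star.size, 1.0 / S_star.size),
                    np.full(S.size, 1.0 / S.size)])

a, c = 0.0, 0.0                             # linear discriminator D'(x) = a*x + c
lr = 0.1
for _ in range(5000):
    sig = 1.0 / (1.0 + np.exp(-(a * x + c)))  # sigmoid of D'(x_i)
    g = w * (sig - b)                          # d(loss)/dD'(x_i) of the weighted logistic loss
    a -= lr * np.sum(g * x)
    c -= lr * np.sum(g)

# D' approximates ln(p_*(x)/p(x)): positive near the real mode, negative near the generated one
print(a * 2.0 + c, a * (-2.0) + c)
```

For these two unit-variance Gaussians the true log-density ratio is linear in $x$, so a linear $D'$ suffices; in the paper's setting $D'$ is of course a neural network.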

Define a quantity $D^*_\beta$ by

$$ D^*_\beta(x) := \ln \frac{\beta p_*(x) + (1-\beta) p(x)}{(1-\beta) p_*(x) + \beta p(x)}\,, $$

where $p_*$ and $p$ are the probability density functions of real data and generated data, respectively, and assume that there exists a positive number $c_\beta$ such that $e^{|D^*_\beta(x)|} \le c_\beta$. Note that if we choose $\beta < 1$, then we can take $c_\beta = \beta/(1-\beta)$, and the assumption on $p_*$ and $p$ can be relaxed from being both nonzero (as required with $\beta = 1$) to not being zero at the same time. However, in practice, one can simply take $\beta = 1$, and that is what was done in our experiments, because the practical behavior of choosing $\beta$ close to 1 is similar to that of $\beta = 1$.

When the number of given examples is sufficiently large, the standard statistical consistency theory of logistic regression implies that

$$ D_\beta(x) \approx D^*_\beta(x)\,. $$

Therefore, we assume that the following $\epsilon$-approximation condition is satisfied for a small $\epsilon > 0$:

$$ \int p_*(x) \max\big(1, \|\nabla \ln p_*(x)\|\big) \left( \left|D_\beta(x) - D^*_\beta(x)\right| + \left|e^{D_\beta(x)} - e^{D^*_\beta(x)}\right| \right) dx \le \epsilon\,. \qquad (6) $$
##### Smooth, light-tailed $p_*$

For convenience, we impose the following assumption.

###### Assumption A.1

There are constants $c_0, h_0 > 0$ such that when $0 < h \le h_0$, we have

$$ \int \sup_{\|g\| \le h} \left| p_*(x) + \nabla p_*(x)^\top g - p_*(x+g) \right| dx \le c_0 h^2, $$
$$ \int \sup_{\|g\| \le h} \frac{\left| p_*(x+g) - p_*(x) \right|^2}{p_*(x)}\, dx \le c_0 h^2, $$
$$ \int \|\nabla p_*(x)\|\, dx \le c_0. $$

The assumption says that $p_*$ is a smooth density function with light tails. For example, common exponential-type distributions such as Gaussian distributions and mixtures of Gaussians all satisfy the assumption. It is worth mentioning that the assumption is not truly needed for the algorithm, because an arbitrary distribution can always be approximated to arbitrary precision by a mixture of Gaussians. Nevertheless, the assumption simplifies the statement of our analysis (Theorem A.1 below), because we do not have to deal with such approximations.

Also, to meet the assumption in practice, one can add small Gaussian noise to every observed data point, as also noted by Arjovsky & Bottou (2017). However, we empirically found that our method works on image generation without adding noise, which is convenient because it leaves one fewer meta-parameter to tune.

### A.2 Theorem A.1 and its proof

###### Theorem A.1

Under the assumptions in Appendix A.1, let $g$ be a continuously differentiable transformation such that $\sup_x \|g(x)\| < \infty$ and $\sup_x \|\nabla g(x)\| < \infty$. Let $p$ be the probability density of a random variable $Z$, and let $p'$ be the probability density of the random variable $Z' = f(Z)$, where $f(x) = x + \eta\, u(x)\, g(x)$. Then there exists a positive constant $c$ such that for all sufficiently small $\eta > 0$:

$$ L_\beta(p') \le L_\beta(p) - \eta \int p_*(x)\, u(x)\, g(x)^\top \nabla D^*_\beta(x)\, dx + c\eta^2 + c\eta\epsilon\,, $$

where $\epsilon$ is the approximation quality defined in (6).

##### Notation

We use $\|\cdot\|$ to denote the vector 2-norm and the matrix spectral norm (the largest singular value of a matrix). Given a differentiable scalar function $f(x)$ with $x \in \mathbb{R}^r$, we use $\nabla f$ to denote its gradient, which is an $r$-dimensional vector function. Given a differentiable function $g: \mathbb{R}^r \to \mathbb{R}^r$, we use $\nabla g$ to denote its Jacobian matrix, and we use $\nabla \cdot g$ to denote the divergence of $g$, defined as

$$ \nabla \cdot g(x) := \sum_{j=1}^{r} \frac{\partial [g(x)]_j}{\partial [x]_j}\,, $$

where we use $[x]_j$ to denote the $j$-th component of $x$ (and similarly $[g(x)]_j$). We know that

$$ \int \nabla \cdot w(x)\, dx = 0 \qquad (7) $$

for all vector functions $w$ that vanish as $\|x\| \to \infty$.
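Identity (7) is just integration by parts in each coordinate; it can be checked numerically for a rapidly vanishing field, e.g. $w(x, y) = (x, y)\, e^{-x^2 - y^2}$ in two dimensions (the grid and field are illustrative choices):

```python
import numpy as np

# grid over [-6, 6]^2; the field decays like exp(-r^2), so boundary terms are negligible
n = 601
xs = np.linspace(-6.0, 6.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
E = np.exp(-(X**2 + Y**2))

# divergence of w(x, y) = (x*E, y*E), computed analytically:
#   d/dx (x*E) + d/dy (y*E) = E * (2 - 2x^2 - 2y^2)
div_w = E * (2.0 - 2.0 * X**2 - 2.0 * Y**2)

dx = xs[1] - xs[0]
total = div_w.sum() * dx * dx    # Riemann-sum approximation of the integral in (7)
print(total)                     # close to 0
```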

###### Lemma A.1

Assume that $f(x) = x + \eta g(x)$ is a continuously differentiable transformation. Assume that $\sup_x \|g(x)\| < \infty$ and $\sup_x \|\nabla g(x)\| < \infty$; then when $\eta \sup_x \|\nabla g(x)\| < 1$, the inverse transformation $f^{-1}$ of $f$ exists and is unique.

Moreover, consider the transformation of random variables by $f^{-1}$. Define $\tilde p_*$ to be the associated probability density function after this transformation when the density before the transformation is $p_*$; that is, $\tilde p_*$ is the density of $f^{-1}(X)$ for $X \sim p_*$. Then for any $x$, we have:

$$ \tilde p_*(x) = p_*(f(x))\, \left| \det(\nabla f(x)) \right|. \qquad (8) $$

Similarly, we have

$$ p(x) = p'(f(x))\, \left| \det(\nabla f(x)) \right|, \qquad (9) $$

where $p$ and $p'$ are defined in Theorem A.1.

Proof  Given $z$, define the map $\phi$ as $\phi(x) := z - \eta g(x)$; then the assumption implies that $\phi$ is a contraction when $\eta \sup_x \|\nabla g(x)\| < 1$: for any $x, x'$, $\|\phi(x) - \phi(x')\| \le \eta \sup_x \|\nabla g(x)\| \, \|x - x'\| < \|x - x'\|$. Therefore $\phi$ has a unique fixed point $x_z$, which satisfies $f(x_z) = z$ and thus leads to the inverse transformation $f^{-1}(z) = x_z$.

(8) and (9) follow from the standard density formula under transformation of variables.
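The contraction argument above is constructive: iterating $x \leftarrow z - \eta g(x)$ converges to $f^{-1}(z)$. A small numerical sketch, with the illustrative choice $g = \tanh$ and $\eta = 0.1$ (so that $\eta \sup_x \|\nabla g(x)\| \le 0.1 < 1$):

```python
import numpy as np

eta = 0.1
g = np.tanh                     # smooth, with |g'| <= 1, so eta * sup|g'| <= 0.1 < 1

def f(x):
    return x + eta * g(x)

def f_inverse(z, iters=100):
    # fixed-point iteration for phi(x) = z - eta*g(x), a contraction
    x = z
    for _ in range(iters):
        x = z - eta * g(x)
    return x

z = np.array([-2.0, 0.3, 1.7])
x = f_inverse(z)
print(f(x) - z)                 # ~0: x is f^{-1}(z)
```

Because the contraction factor is 0.1, each iteration shrinks the error tenfold, so convergence to machine precision is essentially immediate.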

###### Lemma A.2

Under the assumptions of Lemma A.1, there exists a constant $c$ such that

$$ \left| \det(\nabla f(x)) - \left(1 + \eta \nabla \cdot g(x)\right) \right| \le c\eta^2. \qquad (10) $$

Proof

We note that

$$ \nabla f(x) = I + \eta \nabla g(x). $$

Therefore

$$ \det(\nabla f(x)) = 1 + \eta \nabla \cdot g(x) + \sum_{j \ge 2} \eta^j m_j(g(x)), $$

where each $m_j(g(x))$ is a function of $\nabla g(x)$. Since $\nabla g(x)$ is bounded, we obtain the desired formula.
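Equation (10) reflects the expansion $\det(I + \eta A) = 1 + \eta\, \mathrm{tr}(A) + O(\eta^2)$ (the trace of the Jacobian is the divergence); numerically, halving $\eta$ should shrink the residual by about a factor of four. A quick check with an illustrative $2 \times 2$ Jacobian:

```python
import numpy as np

J = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # stands in for the Jacobian grad g(x)

def residual(eta):
    # | det(I + eta*J) - (1 + eta*tr(J)) |, the left-hand side of (10)
    return abs(np.linalg.det(np.eye(2) + eta * J) - (1.0 + eta * np.trace(J)))

r1, r2 = residual(1e-2), residual(5e-3)
print(r1 / r2)                    # close to 4: the residual is O(eta^2)
```

For a $2 \times 2$ matrix the expansion terminates at $\eta^2$, so the ratio is exactly 4 up to floating-point error; for larger matrices it approaches 4 as $\eta \to 0$.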

###### Lemma A.3

Under the assumptions of Lemma A.1, and assuming that Assumption A.1 holds, there exists a constant $c$ such that

$$ \int \left| \tilde p_*(x) - \left( p_*(x) + \eta p_*(x) \nabla \cdot g(x) + \eta \nabla p_*(x)^\top g(x) \right) \right| dx \le c\eta^2 \qquad (11) $$

and

$$ \int \frac{\left( \tilde p_*(x) - p_*(x) \right)^2}{p_*(x)}\, dx \le c\eta^2. \qquad (12) $$

Proof  Using the algebraic inequality

$$ \begin{aligned} &\left| p_*(f(x)) \left|\det(\nabla f(x))\right| - \left( p_*(x) + \eta p_*(x) \nabla\cdot g(x) + \eta \nabla p_*(x)^\top g(x) \right) \right| \\ &\quad \le \left| p_*(f(x)) - \left( p_*(x) + \eta \nabla p_*(x)^\top g(x) \right) \right| \left| \det(\nabla f(x)) \right| \\ &\qquad + \left| p_*(x) + \eta \nabla p_*(x)^\top g(x) \right| \left| \left(1 + \eta \nabla\cdot g(x)\right) - \left|\det(\nabla f(x))\right| \right| \\ &\qquad + \eta^2 \left| \nabla\cdot g(x)\, \nabla p_*(x)^\top g(x) \right|, \end{aligned} $$

and using $\tilde p_*(x) = p_*(f(x)) \left|\det(\nabla f(x))\right|$ from (8), we obtain

$$ \begin{aligned} &\int \left| \tilde p_*(x) - \left( p_*(x) + \eta p_*(x) \nabla\cdot g(x) + \eta \nabla p_*(x)^\top g(x) \right) \right| dx \\ &\quad \le \underbrace{\int \left| p_*(f(x)) - \left( p_*(x) + \eta \nabla p_*(x)^\top g(x) \right) \right| \left| \det(\nabla f(x)) \right| dx}_{A_0} \\ &\qquad + \underbrace{\int \left| p_*(x) + \eta \nabla p_*(x)^\top g(x) \right| \left| \left(1 + \eta \nabla\cdot g(x)\right) - \left|\det(\nabla f(x))\right| \right| dx}_{B_0} \\ &\qquad + \eta^2 \underbrace{\int \left| \nabla\cdot g(x)\, \nabla p_*(x)^\top g(x) \right| dx}_{C_0} \ \le\ c\eta^2 \end{aligned} $$

for some constant $c$, which proves (11). The last inequality uses the following facts.

$$ A_0 = \int \left| p_*(f(x)) - \left( p_*(x) + \eta \nabla p_*(x)^\top g(x) \right) \right| O(1)\, dx = O(\eta^2), $$

where the first equality follows from the boundedness of $\nabla g(x)$, and hence of $\det(\nabla f(x))$, and the second equality follows from the first inequality of Assumption A.1 (applied with $h = \eta \sup_x \|g(x)\|$).

$$ B_0 = \int \left| p_*(x) + \eta \nabla p_*(x)^\top g(x) \right| O(\eta^2)\, dx = O(\eta^2), $$

where the first equality follows from (10), and the second equality follows from the third inequality of Assumption A.1 (together with $\int p_*(x)\, dx = 1$).

$$ C_0 = \int \|\nabla p_*(x)\|\, O(1)\, dx = O(1), $$

where the first equality follows from the boundedness of $g(x)$ and $\nabla \cdot g(x)$, and the second equality follows from the third inequality of Assumption A.1.

Moreover, using (8), we obtain

$$ \left| \tilde p_*(x) - p_*(x) \right| \le \left| p_*(f(x)) - p_*(x) \right| \left| \det(\nabla f(x)) \right| + p_*(x) \left| \left|\det(\nabla f(x))\right| - 1 \right|. $$

Therefore

$$ \int \frac{\left( \tilde p_*(x) - p_*(x) \right)^2}{p_*(x)}\, dx \le 2 \int \frac{\left( p_*(f(x)) - p_*(x) \right)^2 \left|\det(\nabla f(x))\right|^2 + p_*(x)^2 \left( \left|\det(\nabla f(x))\right| - 1 \right)^2}{p_*(x)}\, dx \le c\eta^2 $$

for some constant $c$, which proves (12). The second inequality follows from the second inequality of Assumption A.1, the boundedness of $\det(\nabla f(x))$, and the fact that $\left|\det(\nabla f(x))\right| - 1 = O(\eta)$ from (10).
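The $O(\eta^2)$ bound (12) can also be observed numerically in one dimension: taking $p_*$ standard normal and the illustrative transformation $f(x) = x + \eta \tanh(x)$, the density $\tilde p_*(x) = p_*(f(x)) |f'(x)|$ follows from (8), and the chi-square-type integral shrinks roughly fourfold when $\eta$ is halved (the density, $g$, and grid are assumptions made for this sketch):

```python
import numpy as np

xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]
p_star = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

def chi2(eta):
    # tilde p_*(x) = p_*(f(x)) |f'(x)|  with  f(x) = x + eta*tanh(x)  -- see (8)
    f = xs + eta * np.tanh(xs)
    fprime = 1.0 + eta / np.cosh(xs) ** 2
    p_tilde = np.exp(-f**2 / 2) / np.sqrt(2 * np.pi) * np.abs(fprime)
    # Riemann-sum approximation of the left-hand side of (12)
    return np.sum((p_tilde - p_star) ** 2 / p_star) * dx

r = chi2(0.1) / chi2(0.05)
print(r)    # close to 4: the integral is O(eta^2)
```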

### Proof of Theorem A.1

In the following integration, with a change of variable from to