Online Kernel based Generative Adversarial Networks

One of the major breakthroughs in deep learning over the past five years has been the Generative Adversarial Network (GAN), a neural network-based generative model which aims to mimic some underlying distribution given a dataset of samples. In contrast to many supervised problems, where one tries to minimize a simple objective function of the parameters, GAN training is formulated as a min-max problem over a pair of network parameters. While empirically GANs have shown impressive success in several domains, researchers have been puzzled by unusual training behavior, including cycling and so-called mode collapse. In this paper, we begin by providing a quantitative method to explore some of the challenges in GAN training, and we show empirically how this relates fundamentally to the parametric nature of the discriminator network. We propose a novel approach that resolves many of these issues by relying on a kernel-based non-parametric discriminator that is highly amenable to online training; we call this the Online Kernel-based Generative Adversarial Network (OKGAN). We show empirically that OKGANs mitigate a number of training issues, including mode collapse and cycling, and are much more amenable to theoretical guarantees. OKGANs empirically perform dramatically better, with respect to reverse KL-divergence, than other GAN formulations on synthetic data; on classical vision datasets such as MNIST, SVHN, and CelebA, they show comparable performance.




1 Introduction

Generative Adversarial Networks (GANs) Goodfellow et al. (2014) frame the task of estimating a generative model as solving a particular two-player zero-sum game. One player, the generator G, seeks to produce samples which are indistinguishable from those drawn from some true distribution p_real, and the other player, the discriminator D, aims to distinguish between such samples. In the classical setting each of these players is given by a neural network, parameterized by θ and φ, respectively. For a random input seed vector z ∼ p_z, the generator outputs a synthetic example G_θ(z), and the discriminator returns a probability (or score) D_φ(x) according to whether the sample x is genuine or not. In its original formulation, a GAN is trained by solving a min-max problem over the parameters θ and φ:

min_θ max_φ E_{x∼p_real}[log D_φ(x)] + E_{z∼p_z}[log(1 − D_φ(G_θ(z)))].   (1)

From a theoretical perspective, this framework for learning generative models has two very appealing qualities. First, in a setting where the objective is convex in θ and concave in φ, a true equilibrium point of this game could be readily obtained using various descent-style methods or regret minimization Kodali et al. (2017). Second, it was observed in Goodfellow et al. (2014) that, given an infinite amount of training data, and optimizing over the space of all possible discriminators and all possible generative models, the equilibrium solution of (1) would indeed return a generative model that captures the true distribution p_real. Zhao et al. (2016); Mao et al. (2017) also claim that their GANs can learn the true distribution under this strong assumption.

The challenge, in practice, is that none of these assumptions hold, at least for the way that the most popular GANs are implemented. The standard protocol for GAN training is to find an equilibrium of (1) by alternately updating θ and φ via stochastic gradient descent/ascent, using samples drawn from both the true distribution (dataset) and the generator. It has been observed that simultaneous descent/ascent procedures can fail to find equilibria even in convex settings Mescheder et al. (2017, 2018); Daskalakis et al. (2017), as the equilibria are "unstable" when traversal through the parameter landscape is viewed as a dynamical system; one might even expect to see cycling around the min-max point without progress.

Arora et al. (2017a) raise another issue involving the capacity of the discriminator network. In the thought experiment from Goodfellow et al. (2014) summarized above, the discriminator must have "enough capacity" to produce a suitably complex function. In practice, however, GANs are trained with discriminators from a parametric family of neural networks with a fixed number of parameters. Arora et al. (2017a) observe that if the generator is allowed to produce distributions from mixtures of sufficiently many base measures, then the generator can fool any such discriminator (up to a small error) with an appropriate choice of mixture that differs substantially from the true distribution. The authors suggest that this issue of finite capacity explains GAN mode collapse, a phenomenon observed by practitioners whereby a GAN generator, when trained to estimate multi-modal distributions, ends up dropping several modes.

We would argue that mode collapse and "cycling" are strongly related, and arise fundamentally from the parametric nature of the discriminator class of most GANs. Here is an intuitive explanation for this relationship: one can view the interaction between the two opponents as the discriminator D playing a game of whack-a-mole against the generator G. As G is pushed around by the discriminating D, modes can appear inadvertently in the training process of G that D had not previously considered. With limited capacity, D will have to drop its discrimination focus on previously "hot" regions of the input space and attend to these new flaws in the generator. But now that D has lost focus on such regions, G can return to producing new modes there, and D will again have to reverse course.

We have found a useful way to visualize the dance performed between G and D. While it is hard to observe cycling behavior in the high-dimensional parameter space of D, we can take a natural projection of the discriminator as follows. Let φ_1, ..., φ_T be the sequence of parameters produced in training the GAN via the standard sequential update steps on the objective (1), and let x_1, ..., x_n be a random sample of examples from the training set. Consider the matrix M ∈ R^{T×n}, where each row (D_{φ_t}(x_1), ..., D_{φ_t}(x_n)) represents the function values of the discriminator on this set of key points at a fixed training time t. We can perform a Principal Component Analysis of this data, with two components, and obtain a projected version M' ∈ R^{T×2}, so each row is a 2-dimensional projection of the discriminator at a given time. We display this data in Figure 1 (left graph), where light pink nodes are earlier training time points and dark red are later rounds of training. The observation here is quite stark: the discriminator does indeed cycle in a relatively consistent way, suggesting that training may not always be highly productive.
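This projection is straightforward to compute from discriminator snapshots. The sketch below is our own illustration (the function name, the snapshot matrix `M`, and the toy oscillating trajectory are all hypothetical); it performs the two-component PCA via an SVD in NumPy:

```python
import numpy as np

def project_discriminator_trajectory(M):
    """Project T discriminator snapshots onto their top-2 principal components.

    M: (T, n) array; row t holds the discriminator's scores on n fixed
    "key point" examples at training step t. Returns a (T, 2) array whose
    rows, plotted in time order, reveal whether the discriminator cycles.
    """
    Mc = M - M.mean(axis=0, keepdims=True)        # center each key point's scores
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:2].T                          # projection onto top-2 directions

# Toy check: a discriminator whose scores literally oscillate traces a loop in 2D.
t = np.linspace(0, 4 * np.pi, 200)
phases = np.linspace(0, 1, 50)
M = np.stack([np.sin(t + p) for p in phases], axis=1)   # (200, 50) snapshot matrix
P = project_discriminator_trajectory(M)
```

In the toy example the snapshot matrix has rank 2, so the two-component projection is lossless; for a real discriminator the projection is only approximate, but cycling still shows up as a loop.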

In the present paper, we propose a GAN formulation that relies on a non-parametric family of discriminators, kernel-based classifiers, which can be efficiently trained online; we call this OKGAN (the Online Kernel-based Generative Adversarial Network). OKGANs exhibit several benefits over previous GAN formulations, both theoretical and empirical. On the theory side, the non-parametric nature of the family of functions helps avoid the two negative results we highlighted above: the limited-capacity challenge raised by Arora et al. (2017a) and the numerical training issues described by Mescheder et al. (2017). Given that kernel classifiers can grow in complexity with additional data, the discriminator is able to adapt to the increasing complexity of the generator and of the underlying distribution to be learned. (Our kernel classifiers do require a limited budget size for the number of examples to store, but this is mostly for computational reasons, and the budget size can be scaled up as needed.) Furthermore, the discriminator learning problem is now convex, and we have a wealth of results on both the computational and the statistical properties of kernel-based predictors. Kernel classifiers also allow additional flexibility in data representation through the choice of kernel, or combination of kernels. For example, we have found that mixtures of Gaussian kernels with different radii perform best on complicated image datasets such as CelebA.

What is quite clear, empirically, is that OKGANs do not suffer from mode collapse, at least not on any of the low-dimensional synthetic examples that we have tested. On simple problems where the target exhibits a density function, we show OKGAN dramatically outperforms prior methods when it comes to capturing the target distribution, as measured by reverse KL-divergence. Using quantitative metrics, we show that OKGANs achieve the highest diversity on synthetic datasets. Additionally, when we include the use of an "encoder," we qualitatively demonstrate that OKGANs work well on classical image datasets, including MNIST, SVHN, and CelebA. We observe that the discriminator of OKGANs adapts in a more aggressive fashion and does not appear to exhibit cycling behavior, a common phenomenon for other GANs with neural-net-based discriminators.

Figure 1: Qualitative comparison of Vanilla GAN (left) and OKGAN (right, proposed) on cycling behavior on the 2D-grid dataset (see Section 3.1). Vanilla GAN shows cycling behavior: the parameters of the discriminator cycle around the equilibrium, slowing convergence to the optimum and preventing the generator from effectively learning the real distribution. In contrast, OKGAN does not appear to suffer from cycling behavior.

1.1 Related work

Many theoretical works have aimed at understanding GANs. Arora et al. (2017a); Zhang et al. (2017) study the generalization properties of GANs under neural distance; Liu et al. (2017) studies the convergence properties of GANs via "adversarial" divergence. Bai et al. (2018) assert that the diversity of GANs can be improved by giving the discriminator class strong distinguishing power against a certain generator class. Nagarajan and Kolter (2017); Mescheder et al. (2017, 2018); Li et al. (2017b); Liang and Stokes (2018); Nie and Patel (2019) consider a range of questions around GAN dynamics.

Mode collapse and cycling behavior are two of the main issues raised about GAN training. Goodfellow (2016) observes cycling behavior of the generator's output in an experiment with a 2D synthetic dataset when mode collapse occurs. Berard et al. (2019) propose a new visualization technique called path-angle to study the game vector field of GANs and show cycling behavior empirically with this technique. Daskalakis et al. (2017) try to improve GAN training via so-called "optimistic" mirror descent. Metz et al. (2016) use an approximately optimal discriminator for the generator update by formulating the generator objective with an unrolled optimization of the discriminator. Srivastava et al. (2017) add a reconstructor network that maps the data distribution to Gaussian noise, providing more useful feature vectors in training. Arora et al. (2017b) explore the limitations of encoder-decoder architectures for preventing mode collapse. Lin et al. (2018) propose a mathematical definition of mode collapse and give an information-theoretic analysis. Xiao et al. (2018) use the Bourgain embedding theorem and metric embeddings to construct a latent Gaussian mixture, a direct approach to solving mode collapse.

Our work is not the first to bring kernel learning ideas into GANs. Perhaps the earliest such idea comes from Gretton et al. (2007), a statistical hypothesis testing framework called Maximum Mean Discrepancy (MMD), which aims to distinguish between real and fake distributions rather than explicitly constructing a discriminator. Li et al. (2015); Dziugaite et al. (2015) propose the generative moment matching network (GMMN), which uses a fixed Gaussian kernel for the MMD statistical test. Li et al. (2017a) introduce MMD GANs, which improve GMMN by composing the kernel with an injective function and making the kernel trainable. Bińkowski et al. (2018) demonstrate the superiority of MMD GANs in terms of gradient bias, and Wang et al. (2018) improve MMD GANs with a repulsive loss function and a bounded Gaussian kernel.

2 Online kernel GANs

2.1 Online kernel classifier

In the classical formulation of a GAN, the discriminator can generally be regarded as a classifier that aims to distinguish between data in the training set, sampled from p_real, and so-called "fake" data produced by the generator. In the original GAN formulation of Goodfellow et al. (2014), and in nearly every other generative model inspired by this work Salimans et al. (2016); Nowozin et al. (2016); Chen et al. (2016); Gulrajani et al. (2017); Berthelot et al. (2017); Karras et al. (2017); Miyato et al. (2018), the discriminator is a finitely-parameterized neural network with parameters φ. It has generally been believed that the discriminator model family should be suitably complex in order to guide the generator to accurately mimic a complex distribution, and thus a deep neural network was the obvious choice for D. What we argue in this paper is that a more classical choice of discriminator model, a function class based on a Reproducing Kernel Hilbert Space (RKHS) Bottou and Lin (2007), possesses suitable capacity and has a number of benefits over deep networks. First, the learning task is a convex problem, which provides guaranteed convergence with well-understood rates. Second, using margin theory and the RKHS norm to measure function size, we have an efficient way to measure the generalization ability of classifiers selected from an RKHS, and thus to regularize appropriately. Third, such classifiers are well suited to fast online training, with regret-based guarantees.

An overview of kernel learning methods.

We now review the basics of kernel-based learning algorithms and online learning with kernels; see Gretton et al. (2007); Kivinen et al. (2004); Cortes and Vapnik (1995); Dekel et al. (2008); Scholkopf and Smola (2001) for further exposition. Let X be some abstract space of data, which typically, although not necessarily, is a finite-dimensional real vector space. A kernel k : X × X → R is called positive semi-definite (PSD) if it is a symmetric function on pairs of examples from X and, for every positive integer n and every set of examples x_1, ..., x_n ∈ X, the matrix [k(x_i, x_j)]_{i,j=1}^n is positive semi-definite. Typically we view a PSD kernel as a dot product in some high-dimensional space, and indeed a classic result is that for any PSD k there is an associated feature map Φ : X → H, mapping points in X to some (possibly infinite-dimensional) Hilbert space H, for which k(x, x') = ⟨Φ(x), Φ(x')⟩ Scholkopf and Smola (2001). Given a kernel k, we can consider functions of the form f(·) = Σ_i α_i k(x_i, ·), where the α_i's are arbitrary real coefficients and the x_i's are arbitrary points in X. The set of functions of this form can be viewed as a pre-Hilbert space, using the norm ||f||² = Σ_{i,j} α_i α_j k(x_i, x_j), and when we complete this set of functions we obtain the Reproducing Kernel Hilbert Space H_k. Again, this is a very brief survey, but more can be found in the excellent book of Schölkopf and Smola Scholkopf and Smola (2001).
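The PSD property is easy to check numerically. The following sketch (our own, with illustrative function names) builds the Gram matrix of a Gaussian kernel on random points and verifies that its eigenvalues are non-negative up to floating-point error:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """PSD Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram_matrix(kernel, points):
    """K[i, j] = k(x_i, x_j); PSD for any PSD kernel and any point set."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

X = np.random.default_rng(0).normal(size=(20, 2))
K = gram_matrix(gaussian_kernel, list(X))
eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigenvalues are real
```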

Let us give an overview of learning in an RKHS H_k associated to some kernel k. First, imagine we have a sequence of examples (x_1, y_1), ..., (x_T, y_T) sampled from some distribution on X × Y, where Y = {−1, +1}. Our goal is to estimate a classifier f in H_k. Assume we have some convex loss function ℓ(f(x), y), giving the cost of predicting f(x) when the true label is y; typically we will use the hinge loss or the logistic loss. In a batch setting, we may estimate f by minimizing the regularized risk functional defined as follows:

f = argmin_{f ∈ H_k} (1/T) Σ_{t=1}^T ℓ(f(x_t), y_t) + (λ/2) ||f||²_{H_k}.   (2)
Assuming that the loss function satisfies a simple monotonicity property, as a result of the celebrated representer theorem Scholkopf and Smola (2001) we may conclude that a solution to the above problem always exists in the linear span of the set {k(x_1, ·), ..., k(x_T, ·)}. In other words, estimating a function in an infinite-dimensional space reduces to finding the T coefficients α_1, ..., α_T which parameterize the resulting solution f(·) = Σ_t α_t k(x_t, ·).

Online training.

Researchers have known for some time that training kernel-based learning algorithms can be prohibitively expensive when the dataset size is large; the problem is worse when the dataset grows over time. Solving (2) naively can lead to computational cost that is at least cubic in T. A more scalable training procedure involves online updates to a current function estimate. A more thorough description of online kernel learning can be found in Kivinen et al. (2004) and Dekel et al. (2008), but we give a rough outline here. Let f_t be the function reached at round t of an iterative process, and define the instantaneous regularized risk on a single example as

R_inst(f, x, y) := ℓ(f(x), y) + (λ/2) ||f||²_{H_k}.   (3)

A simple gradient update with step size η, using the instantaneous regularized risk with respect to a single example (x_t, y_t), leads to the following iterative procedure:

f_{t+1} = f_t − η ∂_f R_inst(f_t, x_t, y_t)   (4)
        = (1 − ηλ) f_t − η ℓ'(f_t(x_t), y_t) k(x_t, ·).   (5)

In short: the algorithm at time t maintains a set of points x_1, ..., x_t with corresponding coefficients α_1, ..., α_t and offset b, and when a new example (x_{t+1}, y_{t+1}) arrives, due to (5), the coefficient

α_{t+1} = −η ℓ'(f_t(x_{t+1}), y_{t+1})   (6)

is created, and the other α_i's are scaled down:

α_i ← (1 − ηλ) α_i,   i = 1, ..., t.   (7)
In our implementation, we add multiple examples at once as a minibatch at every round. For example, at round t the input is not a single example but a minibatch of m examples (x_{t,1}, y_{t,1}), ..., (x_{t,m}, y_{t,m}); we change (6) and (7) accordingly, creating one coefficient per minibatch example and scaling the existing coefficients down once per round. The offset b is updated as an average of the new coefficients.

Limiting the budget.

One may note that the above algorithm scales quadratically with t, since after t rounds it must compute k(x_i, ·) for all i ≤ t. But this can be alleviated with a careful budgeting policy, where only a limited cache of x_i's and α_i's is stored. This is natural for a number of reasons, but especially because the α_i's decay exponentially, so each falls below numerical relevance after a bounded number of updates. The issue of budgeting and its relation to performance and computational cost was thoroughly explored by Dekel et al. (2008), and we refer the reader to their excellent work. In our experiments we relied on the "Remove-Oldest" method, akin to first-in-first-out (FIFO) caching. Concretely, if the fixed budget size of our online kernel classifier is B, then at round t only the B most recent key examples x_{t,1}, ..., x_{t,B} are saved in the budget. As a result, after we finish training on the minibatch at round t, we obtain a classifier function of the following form:

f_t(·) = Σ_{i=1}^{B} α_{t,i} k(x_{t,i}, ·) + b_t.   (8)
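Putting the pieces together, a minimal sketch of a NORMA-style budgeted online kernel classifier (Kivinen et al., 2004) with hinge loss and a Remove-Oldest budget might look as follows. The class name, hyperparameter values, and the exact handling of the bias term are our illustrative choices, not the paper's implementation:

```python
import numpy as np
from collections import deque

class BudgetedOnlineKernelClassifier:
    """Online kernel classifier with a FIFO ("Remove-Oldest") budget."""

    def __init__(self, kernel, eta=0.1, lam=0.01, budget=256):
        self.kernel, self.eta, self.lam = kernel, eta, lam
        self.points = deque(maxlen=budget)   # stored examples x_i
        self.alphas = deque(maxlen=budget)   # their coefficients alpha_i
        self.b = 0.0                         # offset

    def __call__(self, x):
        """f(x) = sum_i alpha_i k(x_i, x) + b."""
        return sum(a * self.kernel(p, x)
                   for p, a in zip(self.points, self.alphas)) + self.b

    def update(self, x, y):
        """One gradient step on the instantaneous regularized hinge risk."""
        violated = y * self(x) < 1.0         # evaluate f_t before shrinking
        decay = 1.0 - self.eta * self.lam
        for i in range(len(self.alphas)):    # scale down old coefficients
            self.alphas[i] *= decay
        if violated:                         # hinge loss has nonzero gradient
            self.points.append(x)            # deque drops the oldest if full
            self.alphas.append(self.eta * y)
            self.b += self.eta * y

# Usage sketch on a separable 1D stream: class y lives near x = 2y.
k = lambda a, b: np.exp(-(a - b) ** 2 / 2.0)
clf = BudgetedOnlineKernelClassifier(k, eta=0.5, lam=0.01, budget=64)
rng = np.random.default_rng(1)
for _ in range(200):
    y = rng.choice([-1.0, 1.0])
    clf.update(y * 2.0 + rng.normal(scale=0.3), y)
```

Because points and coefficients are appended together into bounded deques, exceeding the budget evicts the oldest example and its coefficient in one step, which is exactly the FIFO policy described above.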
2.2 Objective function and training

In OKGAN, the discriminator in the original GAN formulation (1) is the online kernel classifier: it is obtained not from a parametric family of neural networks but from the RKHS H_k. The goal of the online kernel classifier is to separate the real data from the fake data, so that after each batch we obtain a classifier function f whose value is positive on real data and negative on fake data. This is why we use a hinge loss when formulating the min-max objective of OKGAN. If the generator is parameterized by θ, the objective of OKGAN is

min_θ max_{f ∈ H_k} E_{x∼p_real}[min(0, −1 + f(x))] + E_{z∼p_z}[min(0, −1 − f(G_θ(z)))].   (9)

We obtain the online kernel classifier f through the process in Section 2.1 after training on one batch of the dataset. Then we use the following objective function for the generator:

min_θ −E_{z∼p_z}[f(G_θ(z))].   (10)

We use (10) as the loss function for the generator rather than E_{z∼p_z}[min(0, −1 − f(G_θ(z)))], the second term in (9) with the opposite sign, for the same reason that the non-saturating loss is preferred over the minimax loss Fedus et al. (2017).

Objective function of OKGAN with encoder

OKGAN has superior performance on low-dimensional data such as 2D synthetic datasets (see Table 1). But without additional representation power, it struggles to generate the high-quality images that have been the hallmark of other GAN architectures. We find that this is remedied by adding an encoder E, which also enables us to compute the kernel on high-dimensional data such as complicated image datasets. The encoder E is itself a neural network, trained to separate real data E(x) from fake data E(G(z)), because the online kernel classifier should recognize E(x) as real and E(G(z)) as fake. From the perspective of OKGAN with the encoder, the combination of the encoder and the online kernel classifier together constitutes the discriminator. Thus, when G is parameterized by θ and E is parameterized by ψ, we obtain a min-max objective of OKGAN as:

min_θ max_ψ max_{f ∈ H_k} E_{x∼p_real}[min(0, −1 + f(E_ψ(x)))] + E_{z∼p_z}[min(0, −1 − f(E_ψ(G_θ(z))))].   (11)
There are three steps to train OKGAN with the encoder. First, 2N samples, of which N are real examples x_1, ..., x_N and N are fake examples G(z_1), ..., G(z_N), are passed through the encoder to become E(x_i) and E(G(z_j)), and from these we obtain an online kernel classifier f. Second, the generator G is updated based on a generator objective function with the updated f and the existing E. Finally, the encoder E is updated based on an encoder objective function with the updated f and G. The objective functions of G and E are shown below.


2.3 Flexibility on data representation through kernels

OKGAN successfully generates classical image datasets (see Section 3) by achieving flexibility in data representation through the choice of kernels. We implement commonly used kernels, and mixtures of them, in the online kernel classifier: the Gaussian kernel k(x, y) = exp(−||x − y||² / (2σ²)), the linear kernel k(x, y) = x⊤y, the polynomial kernel k(x, y) = (x⊤y + c)^d, the rational quadratic (RQ) kernel k_α(x, y) = (1 + ||x − y||² / (2α))^{−α}, the mixed Gaussian kernel (a sum of Gaussian kernels with different bandwidths σ_i), and the mixed RQ-linear kernel (a sum of RQ kernels with different α plus a linear kernel) Bińkowski et al. (2018).
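Assuming the standard forms of these kernels (the paper's exact bandwidths and mixture coefficients are not reproduced here; the parameter defaults below are illustrative), vectorized NumPy versions might look like:

```python
import numpy as np

def _sq_dists(X, Y):
    """Pairwise squared Euclidean distances between rows of X and Y."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def gaussian(X, Y, sigma=1.0):
    return np.exp(-_sq_dists(X, Y) / (2 * sigma ** 2))

def linear(X, Y):
    return X @ Y.T

def polynomial(X, Y, c=1.0, d=2):
    return (X @ Y.T + c) ** d

def rational_quadratic(X, Y, alpha=1.0):
    return (1.0 + _sq_dists(X, Y) / (2 * alpha)) ** (-alpha)

def mixed_gaussian(X, Y, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Sum of Gaussian kernels over several bandwidths."""
    return sum(gaussian(X, Y, s) for s in sigmas)

def mixed_rq_linear(X, Y, alphas=(0.2, 0.5, 1.0, 2.0, 5.0)):
    """Sum of RQ kernels over several alphas plus a linear kernel."""
    return sum(rational_quadratic(X, Y, a) for a in alphas) + linear(X, Y)
```

Each function returns the full Gram matrix between two sample batches, which is the quantity the online kernel classifier needs at every round.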

3 Experiments

In this section, we provide experimental results of OKGANs in both quantitative and qualitative ways. We quantitatively compare OKGANs with other GANs on 2D synthetic datasets and show how well OKGANs solve the mode collapse problem, using quantitative metrics proposed earlier in Srivastava et al. (2017); Lin et al. (2018); Xiao et al. (2018). Moreover, we analyze OKGANs qualitatively on classical image datasets and observe that OKGANs do not suffer from cycling behavior on 2D synthetic datasets, shown through our novel visualization technique.

3.1 Experimental setup


We use 2D synthetic datasets for the quantitative analysis of mode collapse; specifically, we use 2D-grid, 2D-ring, and 2D-circle Srivastava et al. (2017); Lin et al. (2018); Xiao et al. (2018). The 2D-grid and 2D-ring datasets are Gaussian mixtures with 25 and 8 modes, organized in a grid shape and a ring shape respectively. The specific setup of these two datasets is the same as in Lin et al. (2018). The 2D-circle dataset, proposed in Xiao et al. (2018), consists of a continuous circle surrounding a Gaussian located at the center, and we follow the setup of Xiao et al. (2018). Additionally, MNIST, SVHN (Street View House Numbers), and CelebA are all used for the qualitative analysis. More details on the datasets are in Appx. C.

Generator & encoder

For the quantitative analysis and the cycling-behavior experiments on 2D synthetic datasets with OKGAN, we need a neural network architecture only for the generator, since the discriminator is formed by the online kernel classifier. The generator architecture of OKGAN is the same as that of PacGAN Lin et al. (2018), and we use the online kernel classifier instead of the discriminator of PacGAN. In experiments with classical image datasets for the qualitative analysis, we use the DCGAN Radford et al. (2015) architecture for OKGAN: the generator of OKGAN is the same as that of DCGAN, and the encoder of OKGAN is the reverse architecture of the generator. The output dimension of the encoder is 100 for MNIST, SVHN, and CelebA.

Kernel choice

The appropriate choice of kernel is an essential part of the online kernel classifier. Since all 2D synthetic datasets are Gaussian mixture distributions, we choose the Gaussian kernel for experiments on 2D synthetic datasets. When it comes to learning the real distribution of the 2D synthetic datasets, it is important to control the bandwidth σ of the kernel during OKGAN training. A small σ enables the generator to explore all the different modes by smoothing the landscape of the kernel function. Conversely, a large σ helps fake points that were previously located between modes move to one of the nearest modes. Therefore, the initial σ is small, and we increase σ by a fixed ratio so that it is large by the end of training; the initial σ differs between 2D-grid/2D-circle and 2D-ring, while the rate of increase is the same for all 2D synthetic datasets.

For the qualitative analysis on classical image datasets, the Gaussian kernel works well on MNIST, and the polynomial kernel works well on SVHN. For CelebA, we use the mixed Gaussian kernel Bińkowski et al. (2018). All coefficients of the kernels in these experiments are held constant during training.

Other hyperparameters in online kernel classifier

We use a different budget size B for each dataset. The budget size is 4096 for the 2D synthetic datasets; it is 700 for MNIST, 2000 for SVHN, and 1000 for CelebA. The budget size for the 2D synthetic datasets is the largest because in this case OKGAN relies on the online kernel classifier alone, without the encoder. There are several other hyperparameters, such as the regularization parameter λ and the step size η, and they are all constant during training. More technical details are given in Appx. B.

Evaluation Metrics

The evaluation metrics for 2D synthetic datasets were proposed previously in Srivastava et al. (2017); Lin et al. (2018): the number of modes recovered, the percentage of high-quality samples, and the reverse Kullback-Leibler (KL) divergence. Let σ be the standard deviation of each Gaussian mode, let S be the set of generated samples, and let M be the set of all modes. A generated sample is counted as high quality if it lies within 3σ of its nearest mode, and a mode is counted as recovered if at least one high-quality sample has it as its nearest mode; the number of modes and the percentage of high-quality samples follow from these counts. We calculate the reverse KL divergence Kullback and Leibler (1951) by treating the real and fake distributions as discrete distributions over the modes. A GAN with a high number of modes, a high percentage of high-quality samples, and a low reverse KL divergence is regarded as successfully avoiding mode collapse and learning the real distribution well.
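These metrics can be sketched in a few lines for a Gaussian-mixture grid. The function below is our own illustration (names and the uniform-mixture assumption are ours), using the common convention that a sample is high quality when it lies within three standard deviations of its nearest mode:

```python
import numpy as np

def grid_metrics(samples, modes, sigma=0.05):
    """Mode-coverage metrics for 2D Gaussian-mixture data.

    Returns (number of recovered modes, high-quality %, reverse KL between
    the empirical mode histogram and the uniform mixture over modes)."""
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)                    # nearest mode per sample
    hq = d.min(axis=1) <= 3 * sigma               # high-quality mask
    n_modes = len(np.unique(nearest[hq]))
    hq_pct = 100.0 * hq.mean()
    counts = np.bincount(nearest[hq], minlength=len(modes)).astype(float)
    q = counts / counts.sum()                     # generated distribution over modes
    p = np.full(len(modes), 1.0 / len(modes))     # true (uniform) mixture weights
    mask = q > 0
    reverse_kl = float((q[mask] * np.log(q[mask] / p[mask])).sum())
    return n_modes, hq_pct, reverse_kl

# Sanity check on a perfect generator: 4 exact samples at each of 25 grid modes.
xs = np.linspace(-4, 4, 5)
modes = np.array([(a, b) for a in xs for b in xs])
samples = np.repeat(modes, 4, axis=0)
n, pct, rkl = grid_metrics(samples, modes, sigma=0.05)
```

A perfect generator recovers all 25 modes with 100% high-quality samples and zero reverse KL; any dropped or over-represented mode inflates the reverse KL term.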

3.2 Quantitative analysis

For 2D synthetic datasets, we compare OKGAN with the two strongest unconditional GANs for mitigating mode collapse, PacGAN Lin et al. (2018) and BourGAN Xiao et al. (2018). Since BourGAN uses different architectures, we apply the neural network architectures of PacGAN to the BourGAN framework; we therefore quote the performance of PacGAN from Lin et al. (2018) and measure the performance of this re-architected BourGAN ourselves. The quantitative performance of these three GANs on 2D-grid, 2D-ring, and 2D-circle is summarized in Table 1. For 2D-circle, we only compare OKGAN with BourGAN, which first proposed this dataset. Our results are averaged over 10 trials. Further experiments on 2D synthetic datasets with varying numbers of modes are shown in Appx. C.1.

2D-grid 2D-ring 2D-circle
#modes high reverse #modes high reverse center high reverse
(max 25) quality(%) KL (max 8) quality(%) KL captured quality(%) KL
PacGAN 23.8 91.3 0.13 7.9 95.6 0.07 - - -
BourGAN 24.8 95.1 0.036 7.9 100.0 0.019 0.5 99.9 0.015
OKGAN 25.0 86.2 0.006 8.0 95.3 0.002 1.0 98.1 0.0003
Table 1: Quantitative results on 2D synthetic datasets. For 2D-circle, at every trial, "center captured" is 1 if the center is captured and 0 otherwise.

As shown in Table 1 and Appx. C.1, OKGAN shows the best overall performance in terms of mitigating mode collapse. In contrast with the other GANs, OKGAN captures all modes on all 2D synthetic datasets. A remarkable point is that the reverse KL divergence of OKGAN is the lowest on all three datasets: OKGAN not only produces all modes but also generates each mode in a proportion similar to that of the real distribution. Furthermore, the fake distribution of OKGAN converges to the real distribution faster than that of BourGAN, and OKGAN trains more stably (see Figure 2). We therefore conclude that OKGAN increases the diversity of generated samples by taking advantage of a kernel-based non-parametric discriminator.

Figure 2: Reverse KL divergence on 2D-grid (left), 2D-ring (middle), and 2D-circle (right).

3.3 Qualitative analysis

Classical image datasets

We qualitatively compare OKGAN with DCGAN on the CelebA dataset (see Figure 3). Both DCGAN and OKGAN successfully generate fake images of celebrities. Further qualitative comparison on MNIST and SVHN are provided in Appx. C.2.

Figure 3: Qualitative comparison on CelebA dataset

Cycling behavior

In Figure 1 (left), we can clearly observe that the Vanilla GAN (VGAN) shows cycling behavior during training, meaning the discriminator cycles and fails to give meaningful information to the generator. In a parameter-based alternating-update framework such as VGAN, it is challenging for the discriminator to chase the transitions of the generator with a slow pace of parameter updates. In the case of OKGAN, by contrast, obtaining a closed-form discriminator with a non-parametric kernel method means the discriminator is updated in a more aggressive fashion and separates real and fake data more effectively at every update. As shown in Figure 1 (right), the discriminator of OKGAN tends toward the optimal discriminator with no apparent cycling behavior, which in turn helps resolve mode collapse.

4 Discussion

In this work, we propose OKGAN, a new type of GAN whose discriminator is built on an online kernel classifier. We provide a novel method for visualizing the cycling behavior of GANs and empirically show that OKGAN does not suffer from this issue. Moreover, with a kernel-based non-parametric discriminator, OKGAN learns 2D synthetic data with no mode collapse and generates high-quality samples on image datasets. In future work, a deeper theoretical understanding of the dynamics of GANs with non-parametric discriminators remains to be developed; applying ideas that combine kernel methods with neural networks Wilson et al. (2016) to GANs is another interesting direction.


  • S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017a) Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, pp. 224–232. Cited by: §1.1, §1, §1.
  • S. Arora, A. Risteski, and Y. Zhang (2017b) Theoretical limitations of encoder-decoder gan architectures. arXiv preprint arXiv:1711.02651. Cited by: §1.1.
  • Y. Bai, T. Ma, and A. Risteski (2018) Approximability of discriminators implies diversity in gans. arXiv preprint arXiv:1806.10586. Cited by: §1.1.
  • H. Berard, G. Gidel, A. Almahairi, P. Vincent, and S. Lacoste-Julien (2019) A closer look at the optimization landscapes of generative adversarial networks. arXiv preprint arXiv:1906.04848. Cited by: §1.1.
  • D. Berthelot, T. Schumm, and L. Metz (2017) Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §2.1.
  • M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. arXiv preprint arXiv:1801.01401. Cited by: §1.1, §2.3, §3.1.
  • L. Bottou and C. Lin (2007) Support vector machine solvers. Large scale kernel machines 3 (1), pp. 301–320. Cited by: §2.1.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.1.
  • C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning 20 (3), pp. 273–297. Cited by: §2.1.
  • C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng (2017) Training gans with optimism. arXiv preprint arXiv:1711.00141. Cited by: §1.1, §1.
  • O. Dekel, S. Shalev-Shwartz, and Y. Singer (2008) The forgetron: a kernel-based perceptron on a budget. SIAM J. Comput. 37 (5), pp. 1342–1372. Cited by: §2.1.
  • G. K. Dziugaite, D. M. Roy, and Z. Ghahramani (2015) Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906. Cited by: §1.1.
  • W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow (2017) Many paths to equilibrium: gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446. Cited by: §2.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §1, §2.1.
  • I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §1.1.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola (2007) A kernel method for the two-sample-problem. In Advances in neural information processing systems, pp. 513–520. Cited by: §1.1, §2.1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §2.1.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.1.
  • J. Kivinen, A. J. Smola, and R. C. Williamson (2004) Online learning with kernels. IEEE transactions on signal processing 52 (8), pp. 2165–2176. Cited by: §2.1, §2.1.
  • N. Kodali, J. Abernethy, J. Hays, and Z. Kira (2017) On convergence and stability of gans. arXiv preprint arXiv:1705.07215. Cited by: §1.
  • S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §3.1.
  • C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos (2017a) Mmd gan: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203–2213. Cited by: §C.3, §1.1.
  • J. Li, A. Madry, J. Peebles, and L. Schmidt (2017b) On the limitations of first-order approximation in gan dynamics. arXiv preprint arXiv:1706.09884. Cited by: §1.1.
  • Y. Li, K. Swersky, and R. Zemel (2015) Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. Cited by: §1.1.
  • T. Liang and J. Stokes (2018) Interaction matters: a note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132. Cited by: §1.1.
  • Z. Lin, A. Khetan, G. Fanti, and S. Oh (2018) Pacgan: the power of two samples in generative adversarial networks. In Advances in neural information processing systems, pp. 1498–1507. Cited by: §B.1, §C.1, §1.1, §3.1, §3.1, §3.1, §3.2, §3.
  • S. Liu, O. Bousquet, and K. Chaudhuri (2017) Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545–5553. Cited by: §1.1.
  • X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §1.
  • L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §1.1, §1.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §1.1, §1, §1.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2016) Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163. Cited by: §1.1.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §2.1.
  • V. Nagarajan and J. Z. Kolter (2017) Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pp. 5585–5595. Cited by: §1.1.
  • W. Nie and A. Patel (2019) Towards a better understanding and regularization of gan training dynamics. arXiv preprint arxiv:1806.09235. Cited by: §1.1.
  • S. Nowozin, B. Cseke, and R. Tomioka (2016) F-gan: training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279. Cited by: §2.1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §B.1, §3.1.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §2.1.
  • B. Scholkopf and A. J. Smola (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge, MA, USA. Cited by: §2.1, §2.1.
  • A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton (2017) Veegan: reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318. Cited by: §1.1, §3.1, §3.1, §3.
  • W. Wang, Y. Sun, and S. Halgamuge (2018) Improving mmd-gan training with repulsive loss function. arXiv preprint arXiv:1812.09916. Cited by: §1.1.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378. Cited by: §4.
  • C. Xiao, P. Zhong, and C. Zheng (2018) Bourgan: generative networks with metric embeddings. In Advances in Neural Information Processing Systems, pp. 2269–2280. Cited by: Figure 4, §C.1, §1.1, §3.1, §3.2, §3.
  • P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He (2017) On the discrimination-generalization tradeoff in gans. arXiv preprint arXiv:1711.02771. Cited by: §1.1.
  • J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §1.

Appendix A Algorithm & training details

Input: learning rate, learning rate decay, batch size, mini-batch size for the online kernel classifier, budget size, number of generator iterations per discriminator update.
while not converged do
     Sample a batch of real data and a batch of generator noise.
     if not the first iteration then
     end if
     The inputs for the online kernel classifier are the (encoded) real and generated samples; from the classifier's perspective, the GAN batch size serves as the total data size and the classifier's mini-batch size as its batch size.
     Update the classifier following the process in Section 2.1, where the coefficients and the key examples are those saved in the current budget.
     for each of the generator iterations per discriminator update do
          Update the generator against the current classifier.
     end for
end while
Algorithm 1 OKGAN with encoder
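The classifier update referenced above follows the online-learning-with-kernels template of Kivinen et al. (2004), with a budget in the spirit of the forgetron (Dekel et al., 2008). The sketch below is an illustrative reconstruction, not the authors' released code: the Gaussian kernel, the drop-oldest budget rule, and all names are assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

class OnlineKernelClassifier:
    """Budgeted online kernel classifier: hinge-loss SGD in an RKHS,
    keeping at most `budget` (coefficient, key example) pairs."""

    def __init__(self, sigma=1.0, step=0.05, reg=0.1, budget=256):
        self.sigma, self.step, self.reg, self.budget = sigma, step, reg, budget
        self.keys, self.coeffs = [], []  # key examples and their coefficients

    def score(self, x):
        return sum(a * gaussian_kernel(k, x, self.sigma)
                   for k, a in zip(self.keys, self.coeffs))

    def update(self, x, y):
        # shrink existing coefficients (the regularization step) ...
        self.coeffs = [(1.0 - self.step * self.reg) * a for a in self.coeffs]
        # ... then take a hinge-loss subgradient step if the margin is violated
        if y * self.score(x) < 1.0:
            self.keys.append(np.asarray(x, dtype=float))
            self.coeffs.append(self.step * y)
        # budget maintenance: drop the oldest key example
        if len(self.keys) > self.budget:
            self.keys.pop(0)
            self.coeffs.pop(0)
```

In OKGAN, the (encoded) real samples carry label +1 and the generated samples label -1, and the generator is trained to raise the classifier's score on its own outputs.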

The above algorithm applies to OKGAN with an encoder, which we use on datasets such as MNIST, SVHN, and CelebA. For the experiments on 2D synthetic datasets, we use an OKGAN containing only the generator and the online kernel classifier. For OKGAN on 2D synthetic datasets, the learning rate and the learning rate decay are, respectively, and ; for the classical image datasets, they are, respectively, and . The Adam optimizer with is used for the 2D synthetic datasets, and the Adam optimizer with is used for MNIST, SVHN, and CelebA. The mini-batch size for the online kernel classifier is for all datasets. The batch sizes for the 2D synthetic datasets, MNIST, SVHN, and CelebA are 500, 200, 128, and 128, respectively.

We update the generator several times per discriminator update because the generator needs many updates to fool the discriminator, whose non-parametric kernel learning gives it strong discriminative power. The number of generator iterations per discriminator update is 5, 10, 1, and 3 for the 2D synthetic datasets, MNIST, SVHN, and CelebA, respectively.

Appendix B Architectures & hyperparameters

B.1 Further details of neural network architectures

As mentioned in Section 3.1, in the experiments with 2D synthetic datasets we apply the neural network architectures of PacGAN Lin et al. (2018) to BourGAN and to our proposed OKGAN. Specifically, the generator has four hidden layers with batch normalization and 400 ReLU units per hidden layer. The input noise for the generator is a two-dimensional Gaussian with zero mean and identity covariance. The discriminator of PacGAN and BourGAN has three hidden layers of LinearMaxout with 5 maxout pieces and 200 units per hidden layer; batch normalization is not used in the discriminator. When using PacGAN, we set the packing number to two. OKGAN does not need the encoder architecture for the experiments with 2D synthetic datasets.

To train OKGAN on classical image datasets such as MNIST, SVHN, and CelebA, we need the encoder architecture, which is the reverse of the generator. We apply the DCGAN Radford et al. (2015) architectures to OKGAN. The generator of OKGAN is a series of strided two-dimensional convolutional transpose layers, each paired with 2D batch normalization and ReLU activation. Each convolutional layer and convolutional transpose layer is specified by (# of input channels, # of output channels, kernel size, stride, padding). For the MNIST and CelebA datasets, we use 5 convolutional transpose layers, sequentially (100, 512, 4, 1, 0), (512, 256, 4, 2, 1), (256, 128, 4, 2, 1), (128, 64, 4, 2, 1), (64, C, 4, 2, 1), where the final channel count C is 1 for MNIST and 3 for CelebA. Additionally, for MNIST, we find that (100, 1024, 4, 1, 0), (1024, 512, 4, 2, 1), (512, 256, 4, 2, 1), (256, 128, 4, 2, 1), (128, 1, 4, 2, 1) also works well. For the SVHN dataset, we first apply a fully connected layer before the convolutional transpose layers, then sequentially use 3 convolutional transpose layers: (128, 64, 4, 2, 1), (64, 32, 4, 2, 1), (32, 3, 4, 2, 1). Furthermore, the encoder of OKGAN is a series of strided two-dimensional convolutional layers, each paired with 2D batch normalization and LeakyReLU activation; its (# of input channels, # of output channels, kernel size, stride, padding) specifications follow directly by reversing the generator architecture.
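As a quick sanity check on the layer lists above, the spatial output size of each transposed-convolution layer follows the standard formula out = (in − 1) · stride − 2 · padding + kernel (assuming no output padding); the snippet below traces the five-layer stack from a 1×1 latent.

```python
def deconv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# The five-layer stack used for MNIST/CelebA, tracking spatial size from a
# 1x1 latent vector: (kernel size, stride, padding) per layer.
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
size, sizes = 1, []
for k, s, p in layers:
    size = deconv_out(size, k, s, p)
    sizes.append(size)
# sizes == [4, 8, 16, 32, 64], i.e. the generator emits 64x64 images
```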

B.2 Hyperparameters in the online kernel classifier

The hyperparameters such as the kernel type, budget size, regularization term, and step size are discussed in Section 3.1. The regularization term is 0.1, and the step size is 0.05 for all experiments. Moreover, the online kernel classifier supports two loss functions, hinge loss and logistic loss; we fix the margin at 1.0 when using hinge loss. Also, we fix the degree at 3 and the coef0 at 0.0 when using the polynomial kernel.
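For concreteness, the polynomial kernel and the two supported losses can be written out as follows. This is a sketch: the `gamma` scaling factor is an assumption, since the text fixes only the degree and coef0.

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, coef0=0.0, gamma=1.0):
    """k(x, y) = (gamma * <x, y> + coef0) ** degree, with the degree and
    coef0 fixed as in the text; gamma is an assumed scaling factor."""
    return (gamma * np.dot(x, y) + coef0) ** degree

def hinge_loss(score, label, margin=1.0):
    """Hinge loss with the fixed margin of 1.0 used for the classifier."""
    return max(0.0, margin - label * score)

def logistic_loss(score, label):
    """Logistic loss, the alternative supported by the classifier."""
    return float(np.log1p(np.exp(-label * score)))
```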

Appendix C Experiment details

C.1 Experiment details on 2D synthetic datasets & further experiments

For 2D-grid and 2D-ring, we follow the experimental setup used in Lin et al. (2018). The standard deviation of each Gaussian in 2D-grid is 0.05, and the four corners of the grid are (-4, -4), (-4, 4), (4, -4), and (4, 4). For 2D-ring, the standard deviation of each Gaussian is 0.01, and the radius of the ring is 1. In addition, for 2D-circle, we follow the experimental setup in Xiao et al. (2018): 100 Gaussian distributions lie on a circle of radius 2, and three identical Gaussians are located at the center of the circle; the standard deviation of each Gaussian is 0.05. We generate 2500 samples from the trained generator to quantitatively compare GANs on the evaluation metrics described in Section 3.1. We train the models for 4000, 5000, and 3000 epochs for 2D-grid, 2D-ring, and 2D-circle, respectively.
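The 2D-grid and 2D-ring setups described above can be sampled as follows. The standard deviations, grid corners, and ring radius are those stated in the text; the 25-mode grid and 8-mode ring counts are the common configuration from Lin et al. (2018) and are assumptions here.

```python
import numpy as np

def sample_2d_grid(n, rng, std=0.05, grid_max=4.0, modes_per_side=5):
    """2D-grid: Gaussians on a square grid with corners at (+-4, +-4)."""
    xs = np.linspace(-grid_max, grid_max, modes_per_side)
    centers = np.array([(x, y) for x in xs for y in xs])
    idx = rng.integers(len(centers), size=n)  # pick a mode uniformly
    return centers[idx] + rng.normal(scale=std, size=(n, 2))

def sample_2d_ring(n, rng, std=0.01, radius=1.0, modes=8):
    """2D-ring: Gaussians equally spaced on a circle of radius 1."""
    angles = 2 * np.pi * np.arange(modes) / modes
    centers = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    idx = rng.integers(modes, size=n)
    return centers[idx] + rng.normal(scale=std, size=(n, 2))
```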

Figure 4: Qualitative comparison on 2D synthetic datasets (OKGAN vs. BourGAN). The advantage of BourGAN is that it generates high-quality samples while avoiding unwanted samples between modes Xiao et al. (2018). In contrast, OKGAN is better at capturing the modes in proportions similar to the real distribution.

In Figure 4, we provide a qualitative analysis of BourGAN and OKGAN on 2D-grid, 2D-ring, and 2D-circle. Moreover, we perform an additional experiment on a 2D-grid dataset with 49 modes, which shows the superiority of OKGAN in achieving diversity. For this new dataset, the standard deviation of each Gaussian is 0.05, and the four corners of the grid are (-4, -4), (-4, 4), (4, -4), and (4, 4). In this case, we set the initial value of the Gaussian kernel parameter to 0.5. As Table 2 shows, only OKGAN successfully generates all 49 modes, with the lowest reverse KL divergence. Every evaluation metric is averaged over 5 trials.

          #modes (max 49)   high quality (%)   reverse KL
PacGAN    45.0              83.8               0.364
BourGAN   42.0              68.2               0.253
OKGAN     49.0              74.3               0.033
Table 2: Quantitative results on the 2D-grid dataset with 49 modes.
Figure 5: Qualitative comparison on 2D-grid dataset with 49 modes.
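The mode-coverage metrics reported in Table 2 follow Lin et al. (2018) and can be computed as sketched below. The 3-standard-deviation "high quality" threshold is that paper's convention, and a uniform real distribution over modes is assumed when computing the reverse KL divergence.

```python
import numpy as np

def mode_metrics(samples, centers, std):
    """Assign each sample to its nearest mode; a sample is 'high quality'
    if it lies within 3 standard deviations of that mode."""
    d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    hq = d[np.arange(len(samples)), nearest] <= 3 * std
    counts = np.bincount(nearest[hq], minlength=len(centers))
    modes_covered = int((counts > 0).sum())
    hq_fraction = float(hq.mean())
    # reverse KL between the empirical mode histogram of generated samples
    # and the uniform distribution over modes (zero-mass terms contribute 0)
    p = counts / max(counts.sum(), 1)
    q = 1.0 / len(centers)
    nz = p > 0
    reverse_kl = float(np.sum(p[nz] * np.log(p[nz] / q)))
    return modes_covered, hq_fraction, reverse_kl
```

A generator that covers all modes uniformly drives the reverse KL toward zero, which is why this metric penalizes mode collapse so sharply.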

C.2 More qualitative results

In addition to the experiment on the CelebA dataset in Section 3.3, we provide more qualitative results on other image datasets, MNIST and SVHN. For MNIST, we use the 5 convolutional transpose layers (100, 512, 4, 1, 0), (512, 256, 4, 2, 1), (256, 128, 4, 2, 1), (128, 64, 4, 2, 1), (64, 1, 4, 2, 1) for the generators of both DCGAN and OKGAN. We then check how well OKGAN works on an even more complicated dataset, SVHN, which contains random street numbers in various colors. We observe that OKGAN successfully generates high-quality samples on both MNIST and SVHN.

Figure 6: Qualitative comparison on MNIST dataset
Figure 7: Qualitative comparison on SVHN dataset

C.3 Computational complexity analysis

We analyze the computational complexity of DCGAN and OKGAN with respect to the batch size. We use the CelebA dataset in this experiment and measure the time per discriminator update as the batch size B increases. Here, the discriminator of OKGAN is regarded as the combination of the encoder and the online kernel classifier. When the batch size is B, the time complexity of each update of typical GANs is linear in B Li et al. (2017a); the time complexity of training the encoder of OKGAN is linear in B as well. For the online kernel classifier, since the budget size and the classifier's mini-batch size are fixed, the computation time of training the classifier on one mini-batch also increases linearly with the batch size B. Therefore, the time complexity of the discriminator update of OKGAN is O(B) (see Figure 8). Even though we add the online kernel classifier to achieve diversity of the generated samples, OKGAN is not substantially more computationally expensive than typical GANs.

Figure 8: Computation time graph of DCGAN and OKGAN