Implicit Kernel Learning

02/26/2019 · Chun-Liang Li, et al. · Carnegie Mellon University, IBM

Kernels are powerful and versatile tools in machine learning and statistics. Although the notions of universal kernels and characteristic kernels have been studied, kernel selection still greatly influences empirical performance. While learning the kernel in a data-driven way has been investigated, in this paper we explore learning the kernel spectral distribution via implicit generative models parametrized by deep neural networks. We call our method Implicit Kernel Learning (IKL). The proposed framework is simple to train, and inference is performed via sampling random Fourier features. We investigate two applications of the proposed IKL as examples: generative adversarial networks with MMD (MMD GAN) and standard supervised learning. Empirically, MMD GAN with IKL outperforms vanilla predefined kernels on both image and text generation benchmarks; using IKL with Random Kitchen Sinks also leads to substantial improvement over existing state-of-the-art kernel learning algorithms on popular supervised learning benchmarks. We also study the theory and conditions for using IKL in both applications, as well as connections to previous state-of-the-art methods.


1 Introduction

Kernel methods are among the essential foundations of machine learning and have been extensively studied in the past decades. In supervised learning, kernel methods allow us to learn non-linear hypotheses. They also play a crucial role in statistics: kernel maximum mean discrepancy (MMD) (Gretton et al., 2012) is a powerful two-sample test based on a statistic computed via kernel functions. Even with the recent surge of deep learning, several successes have been achieved by combining kernel methods with deep feature extraction. Wilson et al. (2016) demonstrate state-of-the-art performance by incorporating deep learning, kernels and Gaussian processes. Li et al. (2015); Dziugaite et al. (2015) use MMD to train deep generative models for complex datasets.

In practice, however, kernel selection is always an important step. Instead of choosing kernels heuristically, several works have studied kernel learning. Multiple kernel learning (MKL) (Bach et al., 2004; Lanckriet et al., 2004; Bach, 2009; Gönen and Alpaydın, 2011; Duvenaud et al., 2013) is one of the pioneering frameworks, which combines predefined kernels. One recent development is to learn kernels by learning their spectral distributions (the Fourier transform of the kernel). Wilson and Adams (2013) model spectral distributions via a mixture of Gaussians, which can also be treated as an extension of linear combinations of kernels (Bach et al., 2004). Oliva et al. (2016) extend this to Bayesian non-parametric models. In addition to modeling spectral distributions with the aforementioned explicit density models, many works optimize the sampled random features or their weights (e.g. Băzăvan et al. (2012); Yang et al. (2015); Sinha and Duchi (2016); Chang et al. (2017); Bullins et al. (2018)). An orthogonal approach to modeling spectral distributions is learning feature maps for standard kernels (e.g. Gaussian). Feature maps learned by deep learning lead to state-of-the-art performance on different tasks (Hinton and Salakhutdinov, 2008; Wilson et al., 2016; Li et al., 2017).

In addition to learning effective features, implicit generative models via deep learning also lead to promising performance in learning distributions of complex data (Goodfellow et al., 2014). Inspired by this recent success, we propose to model kernel spectral distributions with implicit generative models in a data-driven fashion, which we call Implicit Kernel Learning (IKL). IKL provides a new route to modeling spectral distributions by learning the sampling process of the spectral density, which is underexplored by the aforementioned works.

In this paper, we start from the generic problem formulation of IKL, and propose an easily implemented, trained, and evaluated neural network parametrization which satisfies Bochner's theorem (Section 2). We then demonstrate two example applications of the proposed IKL. Firstly, we explore MMD GAN (Li et al., 2017) with IKL for learning to generate images and text (Section 3). Secondly, we consider a standard two-stage supervised learning task with Random Kitchen Sinks (Sinha and Duchi, 2016) (Section 4). The conditions required for training IKL and its theoretical guarantees in both tasks are also studied. In both tasks, we show that IKL leads to competitive or better performance than heuristic kernel selection and existing approaches that model kernel spectral densities, which demonstrates the potential of learning more powerful kernels via deep generative models. Finally, we discuss connections to existing works in Section 5.

2 Kernel Learning

Kernels have been used with success in several applications, including supervised learning, unsupervised learning, and hypothesis testing. They have also been combined with deep learning in different applications (Mairal et al., 2014; Li et al., 2015; Dziugaite et al., 2015; Wilson et al., 2016; Mairal, 2016). Given data x, x' ∈ R^d, kernel methods compute the inner product of feature transformations φ(·) in a high-dimensional Hilbert space H via a kernel function k, which is defined as k(x, x') = ⟨φ(x), φ(x')⟩_H, where φ is usually high or even infinite dimensional. If k is shift-invariant (i.e. k(x, x') = k(x − x')), we can represent k as an expectation with respect to a spectral distribution P_k(ω).

Bochner’s theorem (Rudin, 2011)

A continuous, real-valued, symmetric and shift-invariant function k on R^d is a positive definite kernel if and only if there is a positive finite measure P_k(ω) such that

k(x − x') = ∫_{R^d} e^{iω^⊤(x−x')} dP_k(ω) = E_{ω∼P_k}[cos(ω^⊤(x − x'))],

where the last equality holds because k is real-valued and symmetric.
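As a concrete illustration of Bochner's theorem and the random Fourier feature construction of Rahimi and Recht (2007), the short sketch below (not part of the paper; the bandwidth and sample sizes are arbitrary) approximates a Gaussian kernel by sampling from its spectral distribution:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Closed-form Gaussian (RBF) kernel value.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def rff_kernel(x, y, sigma=1.0, m=4096, seed=0):
    # Monte Carlo estimate via Bochner's theorem: the spectral distribution
    # of the Gaussian kernel with bandwidth sigma is N(0, I / sigma^2).
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / sigma, size=(m, x.shape[0]))
    return np.cos(omega @ (x - y)).mean()

x, y = np.random.randn(5), np.random.randn(5)
print(gaussian_kernel(x, y), rff_kernel(x, y))  # the two values should be close
```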

2.1 Implicit Kernel Learning

We restrict ourselves to learning shift-invariant kernels. Accordingly, learning a kernel is equivalent to learning a spectral distribution by optimizing

argmax_{k ∈ K} F(k),   (1)

where F is a task-specific objective function and K is a set of kernels. (1) covers many popular objectives, such as kernel alignment (Gönen and Alpaydın, 2011) and the MMD distance (Gretton et al., 2012). Existing works (Wilson and Adams, 2013; Oliva et al., 2016) learn the spectral density P_k(ω) in explicit form via parametric or non-parametric models. When we learn kernels via (1), however, it may not be necessary to model the density of P_k(ω), as long as we are able to estimate the kernel evaluations k(x, x') via sampling from P_k(ω) (Rahimi and Recht, 2007). Alternatively, implicit probabilistic (generative) models define a stochastic procedure that can generate (sample) data from P_k(ω) without modeling the density itself. Recently, neural implicit generative models (MacKay, 1995) regained attention with promising results (Goodfellow et al., 2014) and simple sampling procedures. We first sample ν from a known base distribution P(ν) (e.g. a Gaussian distribution), then use a deterministic function h_ψ, parametrized by ψ, to transform ν into ω = h_ψ(ν), where ω follows the complex target distribution P_k(ω). Inspired by the success of deep implicit generative models (Goodfellow et al., 2014), we propose an Implicit Kernel Learning (IKL) method that models ω via the implicit generative model ω = h_ψ(ν) with ν ∼ P(ν), which results in the kernel

k_ψ(x, x') = E_{ν∼P(ν)}[cos(h_ψ(ν)^⊤(x − x'))],   (2)

and reduces (1) to solving

argmax_ψ F(k_ψ).   (3)

The gradient of (3) can be represented through the reparametrization ω = h_ψ(ν): the gradient of the expectation in (2) with respect to ψ can be estimated by differentiating through samples of ν. Thus, (3) can be optimized via SGD by sampling minibatches of data and fresh samples of ν from the base distribution to estimate the gradient in every iteration. Next, we discuss the parametrization of h_ψ that satisfies Bochner's theorem, and describe how to evaluate the IKL kernel in practice.
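A minimal sketch (not the authors' implementation) of one such stochastic gradient step is shown below: h_psi is a small PyTorch MLP, and a placeholder objective F, here simply the mean kernel value between two batches, is differentiated through freshly sampled ν:

```python
import torch
import torch.nn as nn

dim = 16                                    # data dimension, assumed for illustration
h_psi = nn.Sequential(                      # implicit generative model for frequencies
    nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, dim))
opt = torch.optim.Adam(h_psi.parameters(), lr=1e-3)

def ikl_kernel(x, y, m=128):
    # k_psi(x, y) ~ mean_i cos(h_psi(nu_i)^T (x - y)), with nu_i ~ N(0, I).
    nu = torch.randn(m, dim)
    omega = h_psi(nu)                       # (m, dim) sampled frequencies
    return torch.cos((x - y) @ omega.t()).mean(dim=1)

x, y = torch.randn(64, dim), torch.randn(64, dim)
F = ikl_kernel(x, y).mean()                 # placeholder objective; task-specific in practice
(-F).backward()                             # one ascent step on psi via Adam
opt.step(); opt.zero_grad()
```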

Symmetric

To result in real-valued kernels, the spectral density has to be symmetric, i.e. P(ω) = P(−ω). Thus, we parametrize h_ψ via a Hadamard product with the symmetric base sample, so that the remaining network h̃_ψ can be any unconstrained function, provided the base distribution is symmetric (i.e. P(ν) = P(−ν)), such as a standard normal distribution.

Kernel Evaluation

Although there is usually no closed form for the kernel evaluation in (2) with a fairly complicated h_ψ, we can evaluate (approximate) k_ψ(x, x') by sampling a finite number of random Fourier features, k̂_ψ(x, x') = φ̂_ψ(x)^⊤ φ̂_ψ(x'), where φ̂_ψ(x) is the evaluation of the Fourier basis at x on the sampled frequencies {h_ψ(ν_i)}_{i=1}^m with ν_i ∼ P(ν) (Rahimi and Recht, 2007).
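A corresponding sketch of this finite-sample feature map (assuming the h_psi module from the previous snippet; the cos/sin pair is one standard choice of Fourier basis) is:

```python
import math
import torch

def ikl_feature_map(x, h_psi, m=1024):
    # phi_hat_psi(x) in R^{2m}: random Fourier features under frequencies h_psi(nu_i).
    nu = torch.randn(m, x.shape[1])
    omega = h_psi(nu)                                     # (m, d) sampled frequencies
    proj = x @ omega.t()                                  # (batch, m)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=1) / math.sqrt(m)

# Rows produced by the same call share frequencies, so inner products of rows
# within one batch approximate k_psi(x, x').
```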

Next, we demonstrate two example applications covered by (3), where we can apply IKL, including kernel alignment and maximum mean discrepancy (MMD).

3 MMD GAN with IKL

Given data {x_i}_{i=1}^n ∼ P_X, instead of estimating the density of P_X, a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is an implicit generative model which learns a generative network g_θ (generator). The generator g_θ transforms a base distribution P_Z over z into g_θ(z) to approximate P_X, where P_θ denotes the distribution of g_θ(z) with z ∼ P_Z. During training, GAN alternately estimates a distance D(P_X, P_θ) between P_X and P_θ, and updates g_θ to minimize D(P_X, P_θ). Different probability metrics have been studied (Goodfellow et al., 2014; Li et al., 2015; Dziugaite et al., 2015; Nowozin et al., 2016; Arjovsky et al., 2017; Mroueh et al., 2017; Li et al., 2017; Mroueh and Sercu, 2017; Gulrajani et al., 2017; Mroueh et al., 2018; Arbel et al., 2018) for training GANs.

Kernel maximum mean discrepancy (MMD) is a probability metric commonly used in two-sample tests to distinguish two distributions from finite samples (Gretton et al., 2012). Given a kernel k, the (squared) MMD between P and Q is defined as

M_k(P, Q) = E_{x,x'∼P}[k(x, x')] − 2 E_{x∼P, y∼Q}[k(x, y)] + E_{y,y'∼Q}[k(y, y')].   (4)

For characteristic kernels, M_k(P, Q) = 0 iff P = Q. Li et al. (2015); Dziugaite et al. (2015) train the generator g_θ by optimizing min_θ M_k(P_X, P_θ) with a Gaussian kernel k. Li et al. (2017) propose MMD GAN, which trains g_θ via min_θ max_{k∈K} M_k(P_X, P_θ), where K is a pre-defined set of kernels. The intuition is to learn a kernel that gives a stronger signal (i.e. a larger distance when P_θ ≠ P_X) for training g_θ. Specifically, Li et al. (2017) consider a composition kernel k ∘ f_φ which combines a Gaussian kernel k and a neural network f_φ as

k ∘ f_φ(x, x') = k(f_φ(x), f_φ(x')) = exp(−‖f_φ(x) − f_φ(x')‖² / (2σ²)).   (5)

The MMD GAN objective then becomes min_θ max_φ M_{k∘f_φ}(P_X, P_θ).
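For reference, a biased finite-sample estimate of (4) under the composition kernel (5) can be sketched as below (a toy PyTorch illustration, not the authors' code; f_phi and the bandwidth sigma are placeholders):

```python
import torch

def gaussian_gram(a, b, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of a and b.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def mmd2(x, y, f_phi, sigma=1.0):
    # Biased estimator of (4) for samples x ~ P_X and y ~ P_theta
    # under the composition kernel k(f_phi(.), f_phi(.)).
    fx, fy = f_phi(x), f_phi(y)
    return (gaussian_gram(fx, fx, sigma).mean()
            - 2 * gaussian_gram(fx, fy, sigma).mean()
            + gaussian_gram(fy, fy, sigma).mean())
```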

3.1 Training MMD GAN with IKL

Although the composition kernel with a learned feature embedding f_φ is powerful, choosing a good base kernel k is still crucial in practice (Bińkowski et al., 2018). Different base kernels for MMD GAN, such as the rational quadratic (RQ) kernel (Bińkowski et al., 2018) and the distance kernel (Bellemare et al., 2017), have been studied. Instead of choosing the base kernel by hand, we propose to learn it by IKL, which extends (5) to the form

k_ψ ∘ f_φ(x, x') = E_{ν∼P(ν)}[cos(h_ψ(ν)^⊤(f_φ(x) − f_φ(x')))].   (6)

We then extend the MMD GAN objective to be

min_θ max_{φ, ψ} M_{k_ψ∘f_φ}(P_X, P_θ),   (7)

where M_{k_ψ∘f_φ} is the MMD distance (4) with the IKL kernel (6). Clearly, for a given θ, the maximization over (φ, ψ) in (7) can be represented as (1) by letting F be the MMD objective and K the set of IKL composition kernels. In what follows, for convenience we use k_ψ ∘ f_φ, k_ψ and k ∘ f_φ to denote the kernels defined in (6), (2) and (5), respectively.

3.2 Property of MMD GAN with IKL

As proven by Arjovsky and Bottou (2017), some probability distances adopted by existing works (e.g. Goodfellow et al. (2014)) are not weak (i.e. convergence of P_θ to P_X in distribution does not imply that the distance converges to zero), so they cannot provide a good signal to train g_θ. They also usually suffer from discontinuity, and hence cannot be trained via gradient descent at certain points. We prove that (7) is a continuous and differentiable objective in θ and is weak under mild assumptions, as used in (Arjovsky et al., 2017; Li et al., 2017): g_θ(z) is locally Lipschitz and differentiable in θ; f_φ is Lipschitz in its input and the parameter space of φ is compact; g_θ(z) is differentiable in θ and there are local Lipschitz constants, independent of the kernel parameters, whose expectation over z is finite. The above assumptions are adopted by Arjovsky et al. (2017). Lastly, assume that ψ lies in a compact set and that h_ψ(ν) is differentiable and Lipschitz in ψ with a uniform upper bound on the Lipschitz constants over different ν.

Assume the generator g_θ and the kernel k_ψ ∘ f_φ satisfy Assumption 3.2. Then the objective max_{φ,ψ} M_{k_ψ∘f_φ}(P_X, P_θ) is weak, that is, it converges to zero if and only if P_θ converges to P_X in distribution. Also, it is continuous everywhere and differentiable almost everywhere in θ.

Assume f_φ is bounded. Then the objective is Lipschitz in θ if E_ν[‖h_ψ(ν)‖²] is bounded, which is the variance of the spectral distribution since the symmetric parametrization gives E_ν[h_ψ(ν)] = 0.

In practice, we penalize the variance E_ν[‖h_ψ(ν)‖²] when it exceeds an upper bound, as an approximation of Lemma 3.2, to ensure that the assumptions of Theorem 3.2 are satisfied. The algorithm with IKL and gradient penalty (Bińkowski et al., 2018) is shown in Algorithm 1.
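As a side note, the sketch below shows one way such a variance penalty could enter the critic loss (hypothetical names; the threshold and coefficient are illustrative only):

```python
import torch

def variance_penalty(h_psi, dim, m=1024, upper=1.0, lam=10.0):
    # Penalize E_nu[||h_psi(nu)||^2] when it exceeds the bound `upper`,
    # approximating the variance constraint of Lemma 3.2.
    nu = torch.randn(m, dim)
    var = (h_psi(nu) ** 2).sum(dim=1).mean()
    return lam * torch.clamp(var - upper, min=0.0)

# critic_loss = -mmd2_estimate + grad_penalty + variance_penalty(h_psi, dim)
```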

  Input: learning rate η, batch size B, number of critic updates n_c per generator update, number of random features (basis) m, coefficient of the gradient penalty, coefficient of the variance constraint.
  Initialize parameters θ for the generator g_θ, φ for f_φ, and ψ for h_ψ.
  Define the critic loss as the MMD estimate (4) under the IKL kernel (6), plus the gradient penalty and the variance penalty.
  while θ has not converged do
     for t = 1, …, n_c do
        Sample a minibatch of data, noise z ∼ P_Z, and random features ν ∼ P(ν).
        Update (φ, ψ) by ascending the critic loss.
     end for
     Sample a minibatch of noise z ∼ P_Z and random features ν ∼ P(ν).
     Update θ by descending the MMD estimate.
  end while
Algorithm 1 MMD GAN with IKL
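Putting the pieces together, a heavily simplified loop in the spirit of Algorithm 1 might look as follows. This is a sketch only, not the authors' code: g_theta, f_phi, h_psi, the data iterator and embed_dim are placeholders, the gradient penalty is omitted for brevity, and variance_penalty refers to the sketch above.

```python
import torch

def mmd2_ikl(x, y, f_phi, h_psi, m=128):
    # MMD^2 estimate (4) under the IKL composition kernel (6),
    # using m random features sampled through h_psi.
    fx, fy = f_phi(x), f_phi(y)
    omega = h_psi(torch.randn(m, fx.shape[1]))               # (m, embed_dim)
    def gram(a, b):
        diff = a.unsqueeze(1) - b.unsqueeze(0)               # (n, n', embed_dim)
        return torch.cos(diff @ omega.t()).mean(dim=-1)      # (n, n') kernel values
    return gram(fx, fx).mean() - 2 * gram(fx, fy).mean() + gram(fy, fy).mean()

def train_mmd_gan_ikl(g_theta, f_phi, h_psi, data_iter, dim_z, embed_dim,
                      n_iters=1000, n_c=5):
    opt_c = torch.optim.Adam(list(f_phi.parameters()) + list(h_psi.parameters()), lr=1e-4)
    opt_g = torch.optim.Adam(g_theta.parameters(), lr=1e-4)
    for _ in range(n_iters):
        for _ in range(n_c):                                 # critic / kernel updates
            x = next(data_iter)
            z = torch.randn(x.shape[0], dim_z)
            loss_c = (-mmd2_ikl(x, g_theta(z).detach(), f_phi, h_psi)
                      + variance_penalty(h_psi, embed_dim))  # from the sketch above
            opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        z = torch.randn(64, dim_z)                           # generator update
        loss_g = mmd2_ikl(next(data_iter), g_theta(z), f_phi, h_psi)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```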

3.3 Empirical Study

We consider image and text generation tasks for quantitative evaluation. For image generation, we evaluate the inception score (Salimans et al., 2016) and FID score (Heusel et al., 2017) on CIFAR-10 (Krizhevsky and Hinton, 2009). We use DCGAN (Radford et al., 2016) and expand the output of f_φ to be 16-dimensional, as in Bińkowski et al. (2018). For text generation, we consider a length-32 character-level generation task on the Google Billion Words dataset. The evaluation is based on the Jensen-Shannon divergence between the empirical 4-gram probabilities (JS-4) of the generated sequences and the validation data, as used by Gulrajani et al. (2017); Heusel et al. (2017); Mroueh et al. (2018). The model architecture follows Gulrajani et al. (2017) in using a ResNet with 1D convolutions. We train every algorithm for the same number of iterations for comparison.

For MMD GAN with fixed base kernels, we consider the mixture of Gaussian kernels (Li et al., 2017) and the mixture of RQ kernels. We tuned the hyperparameters of each kernel as reported in Appendix F.1.

Lastly, for learning base kernels, we compare IKL with the SM kernel (Wilson and Adams, 2013), which learns a mixture of Gaussians to model the kernel spectral density. It can also be treated as the explicit generative model counterpart of the proposed IKL.

In both tasks, P(ν), the base distribution of IKL, is a standard normal distribution and h_ψ is a 3-layer MLP. Similar to the aforementioned mixture kernels, we consider a mixture of IKL kernels with variance constraints tied to the bandwidths used for the mixture of Gaussian kernels. Note that if h_ψ is an identity map, we recover the mixture of Gaussian kernels. We fix the number of random features and resample them for IKL in every iteration. For other settings, we follow Bińkowski et al. (2018); the hyperparameters can be found in Appendix F.1.

3.3.1 Results and Discussion

We compare MMD GAN with the proposed IKL and different fixed kernels. We repeat the experiments several times and report the average result with standard error in Table 1. Note that for the inception score, larger is better, while for JS-4, smaller is better. We also report WGAN-GP results as a reference. Since the FID score results (Heusel et al., 2017) are consistent with the inception score and do not change our discussion, we defer them to Appendix C.1 due to the space limit. Sampled images on larger datasets are shown in Figure 1.

Method Inception Scores JS-4
Gaussian
RQ
SM
IKL
WGAN-GP
Table 1: Inception scores and JS-4 divergence results.
Figure 1: Samples generated by MMDGAN-IKL on the CIFAR-10, CELEBA and LSUN datasets.
Pre-defined Kernels

Bińkowski et al. (2018) show that RQ kernels outperform Gaussian and energy distance kernels on image generation. Our empirical results agree with this finding: RQ kernels achieve a higher inception score than the Gaussian kernel, as shown in the left column of Table 1. In text generation, nonetheless, RQ kernels achieve a worse JS-4 score[1] and are not on par with Gaussian kernels, which are themselves still slightly worse than WGAN-GP. These results imply that kernel selection is task-specific. On the other hand, the proposed IKL learns kernels in a data-driven way, which results in the best performance on both tasks. On CIFAR-10, although the Gaussian kernel is worse than RQ, IKL is still able to transform P(ν), which is Gaussian, into a powerful kernel and outperforms RQ on inception scores. For text generation, from Table 1 and Figure 2, we observe that IKL can further boost Gaussian kernels into better kernels with substantial improvement. Also, we note that the difference between IKL and pre-defined kernels in Table 1 is significant under the t-test at the 95% confidence level.

[1] For RQ kernels, we searched over possible hyperparameter settings and reported the best one in the Appendix, to ensure the unsatisfactory performance is not caused by improper parameters.

Figure 2: Convergence of MMD GANs with different kernels on text generation.
Learned Kernels

The SM kernel (Wilson and Adams, 2013), which learns the spectral density via a mixture of Gaussians, does not significantly outperform the Gaussian kernel, as shown in Table 1, since Li et al. (2017) already use an equal-weighted mixture-of-Gaussians formulation. This suggests that the proposed IKL can learn more complicated and effective spectral distributions than simple mixture models.

Study of Variance Constraints

In Lemma 3.2, we prove that bounding the variance guarantees the kernel to be Lipschitz, as required in Theorem 3.2. We investigate the importance of this constraint. In Figure 3, we show the training objective (MMD) and the JS-4 divergence when training MMD GAN (IKL) without the variance constraint. We observe that the variance keeps growing without the constraint, which leads to exploding MMD values. Also, as the explosion becomes severe, the JS-4 divergence starts increasing, which implies that the MMD cannot provide a meaningful signal to train g_θ. This study justifies the validity of Theorem 3.2 and Lemma 3.2.

Figure 3: Learning MMD GAN (IKL) without the variance constraint on the Google Billion Words dataset for text generation.
Other Studies

One concern with the proposed IKL is the computational overhead introduced by sampling random features, as well as the additional parameters used to model h_ψ. Since we only use a small network to model h_ψ, the increased computational overhead is almost negligible under GPU parallel computation. A detailed comparison can be found in Appendix C.2. We also compare IKL with Bullins et al. (2018), which can be seen as a variant of IKL without the transformation network h_ψ, and further study the variance constraint. These additional discussions can be found in Appendix C.1.

4 Random Kitchen Sinks with IKL

Rahimi and Recht (2009) propose Random Kitchen Sinks (RKS) as follows. We sample random frequencies from a distribution P(ω) and transform the data x into random features φ̂(x) built from the Fourier basis evaluated at those frequencies. We then learn a classifier on the transformed features φ̂(x). Kernel methods with random features (Rahimi and Recht, 2007) are an example of RKS, where P(ω) is the spectral distribution of the kernel and k(x, x') ≈ φ̂(x)^⊤ φ̂(x'). We usually learn a model by solving

min_w (1/n) Σ_{i=1}^n ℓ(w^⊤ φ̂(x_i), y_i),   (8)

possibly subject to a constraint on w.

If ℓ is a convex loss function, the objective (8) can be solved efficiently to a global optimum.

The spectral distribution P(ω) is usually set to a parametrized form, such as a Gaussian distribution, but the selection of P(ω) is important in practice. If we consider RKS as a kernel method with random features, then selecting P(ω) is equivalent to the well-known kernel selection (learning) problem for supervised learning (Gönen and Alpaydın, 2011).

Two-Stage Approach

We follow Sinha and Duchi (2016) and consider kernel learning for RKS with a two-stage approach. In stage 1, we consider kernel alignment (Cristianini et al., 2002) of the form argmax_k E_{(x,y),(x',y')}[y y' k(x, x')]. By parametrizing k via the implicit generative model h_ψ as in Section 2, we obtain the problem

argmax_ψ E_{(x,y),(x',y')}[y y' k_ψ(x, x')],   (9)

which can be treated as (1) with F being the alignment objective. After solving (9), we obtain a learned sampler h_ψ from which we can easily sample frequencies. Thus, in stage 2, we have the advantage of solving a convex problem (8) in RKS with IKL. The algorithm is shown in Algorithm 2.

  Stage 1: Kernel Learning
  Input: data {(x_i, y_i)}_{i=1}^n, the batch size for data, the number m of random features, learning rate η.
  Initialize parameter ψ for h_ψ.
  while ψ has not converged or the maximum number of iterations is reached do
     Sample a minibatch of data and fresh random features ν_1, …, ν_m ∼ P(ν).
     Update ψ by ascending the empirical kernel alignment (9).
  end while
  Stage 2: Random Kitchen Sinks
  Sample M random features ν_1, …, ν_M ∼ P(ν); note that M is not necessarily equal to m.
  Transform the data x into random features φ̂_ψ(x) via the learned h_ψ and the sampled ν.
  Learn a linear classifier on the transformed features {(φ̂_ψ(x_i), y_i)}.
Algorithm 2 Random Kitchen Sinks with IKL
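A minimal sketch of one stage-1 alignment update (not the authors' code; labels y in {−1, +1} and the shapes are assumed for illustration) is:

```python
import torch

def alignment_loss(x, y, h_psi, m=64):
    # Negative empirical kernel alignment E[y y' k_psi(x, x')] under the IKL
    # kernel, estimated with m freshly sampled random features.
    nu = torch.randn(m, x.shape[1])
    omega = h_psi(nu)                                   # (m, d)
    diff = x.unsqueeze(1) - x.unsqueeze(0)              # (n, n, d) pairwise differences
    gram = torch.cos(diff @ omega.t()).mean(dim=-1)     # (n, n) kernel matrix
    return -(torch.outer(y.float(), y.float()) * gram).mean()

# In stage 1, minimize alignment_loss(x_batch, y_batch, h_psi) over psi with Adam/SGD.
```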

Note that in stage 1, we resample ν in every iteration to train the implicit generative model h_ψ. The advantage of Algorithm 2 is that the random features used in kernel learning and in RKS can be different, which allows us to use fewer random features in kernel learning (stage 1) and to sample more features for RKS (stage 2), as illustrated in the sketch below.
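The stage-2 sketch below illustrates this flexibility: the learned sampler draws M frequencies that need not match the m used in stage 1, and a convex classifier is then fit on the resulting features (scikit-learn is used here purely for illustration):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def sample_frequencies(h_psi, dim, M=4096):
    # Stage 2: draw M frequencies through the learned sampler h_psi
    # (M need not equal the number of features m used in stage 1).
    with torch.no_grad():
        return h_psi(torch.randn(M, dim)).numpy()       # (M, dim)

def rks_features(x, omega):
    # Random Fourier features; the same omega must be reused for train and test.
    proj = x @ omega.T                                  # (n, M)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omega.shape[0])

# omega = sample_frequencies(h_psi, x_train.shape[1])
# clf = LogisticRegression(max_iter=1000).fit(rks_features(x_train, omega), y_train)
# print(clf.score(rks_features(x_test, omega), y_test))
```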

One can also jointly train both the feature mapping and the model parameters, as in neural networks. We remark that our intent is not to show state-of-the-art results on supervised learning, on which deep neural networks dominate (Krizhevsky et al., 2012; He et al., 2016). We use RKS as a protocol to study kernel learning and the proposed IKL, which still achieves competitive performance with neural networks on some tasks (Rahimi and Recht, 2009; Sinha and Duchi, 2016). Also, the simple procedure of RKS with IKL allows us to provide some theoretical guarantees on the performance, which is still challenging for deep learning models.

Comparison with Existing Works

Sinha and Duchi (2016) learn non-uniform weights for random features via kernel alignment in stage 1, then use these optimized features in RKS in stage 2. Note that in their approach the random features used in stage 1 have to be the same as the ones in stage 2. Jointly training the feature mapping and the classifier can be treated as a 2-layer neural network (Băzăvan et al., 2012; Alber et al., 2017; Bullins et al., 2018). Learning kernels with the aforementioned works becomes more costly if we want to use a large number of random features for training classifiers. In contrast to implicit generative models, Oliva et al. (2016) learn an explicit Bayesian nonparametric generative model for spectral distributions, which requires specifically designed inference algorithms. Learning kernels for (8) in the dual form, without random features, has also been proposed; it usually requires costly steps, such as an eigendecomposition of the Gram matrix (Gönen and Alpaydın, 2011).

4.1 Empirical Study

We evaluate the proposed IKL on both synthetic and benchmark binary classification tasks. For IKL, P(ν) is a standard normal and h_ψ is a small MLP in all experiments. The number of random features used to train h_ψ in Algorithm 2 is fixed across experiments. Other experimental details are described in Appendix F.2.

Figure 4: The left figure shows training examples in two dimensions. The right figure shows classification error versus data dimension.
Kernel learning with a poor choice of P(ω)

We generate synthetic data, where d denotes the data dimension. A two-dimensional example is shown in Figure 4. Competitive baselines include random Fourier features (RFF) (Rahimi and Recht, 2007) as well as OPT-KL (Sinha and Duchi, 2016). In the experiments, we fix the number of random features used in RKS for all algorithms. Since a Gaussian kernel with the chosen bandwidth is known to be ill-suited for this task (Sinha and Duchi, 2016), we directly use random features from it for RFF and OPT-KL. Similarly, we set the base distribution P(ν) of IKL to be a standard normal distribution as well.

The test error for different data dimensions is shown in Figure 4. Note that RFF is competitive with OPT-KL and IKL when d is small, while its performance degrades rapidly as d increases, which is consistent with the observation in Sinha and Duchi (2016). More discussion of the reason for this failure can be found in Sinha and Duchi (2016). On the other hand, although using a standard normal as the spectral distribution is ill-suited for this task, both OPT-KL and IKL can adapt to the data, learning to transform it into effective kernels, which results in slower degradation with d.

Note that OPT-KL learns sparse weights on the sampled random features. However, the sampled random features can fail to contain informative ones, especially in high dimensions (Bullins et al., 2018). Thus, when using a limited number of random features, OPT-KL may result in worse performance than IKL in the high-dimensional regime in Figure 4.

Performance on benchmark datasets

Next, we evaluate our IKL framework on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky and Hinton, 2009) datasets; each task consists of the corresponding training and test examples. For all datasets, raw pixels are used as the feature representation. We set the bandwidth of the RBF kernel by the median heuristic. We also compare with the spectral mixture (SM) kernel (Wilson and Adams, 2013), which uses a Gaussian mixture to learn the spectral density and can be seen as the explicit generative model counterpart of IKL; the SM kernel is also an MKL variant with a linear combination of kernels (Gönen and Alpaydın, 2011). In addition, we consider joint training of random features and model parameters, which can be treated as a two-layer neural network (NN) and serves as a lower bound on the error for comparing different kernel learning algorithms.

The test error versus the number of basis functions used in the second stage is shown in Figure 5. First, in light of computational efficiency, SM and the proposed IKL only sample a small number of random features in each iteration of the first stage, and then draw a different number of basis functions from the learned spectral distribution for the second stage. For OPT-KL, on the contrary, the random features used in training and testing have to be the same; therefore, OPT-KL needs to handle all the random features during training, which raises computational concerns when their number is large. In addition, IKL demonstrates improvement over the representative kernel learning method OPT-KL, and the gains are especially significant on challenging datasets such as CIFAR-10. In some cases, such as MNIST, IKL almost reaches the performance of NN, while OPT-KL degrades to RFF except for a small number of basis functions. This illustrates the effectiveness of learning the kernel spectral distribution via the implicit generative model h_ψ. Also, IKL outperforms SM, which is consistent with the finding in Section 3 that IKL can learn more complicated spectral distributions than simple mixture models (SM).

(a) MNIST (4-9)
(b) MNIST (5-6)
(c) CIFAR-10 (auto-truck)
(d) CIFAR-10 (plane-bird)
Figure 5: Test error rate versus the number of basis functions in the second stage on benchmark binary classification tasks. We report the mean and standard deviation over five runs. Our method (IKL) is compared with RFF (Rahimi and Recht, 2009), OPT-KL (Sinha and Duchi, 2016), SM (Wilson and Adams, 2013) and an end-to-end trained MLP (NN).

4.2 Consistency and Generalization

The simple two-stage approach, IKL with RKS, allows us to provide consistency and generalization guarantees. The consistency result guarantees that the solution of the finite-sample approximation of (9) approaches the optimum of (9) (the population optimum) as the number of training data and the number of random features increase. We first define the necessary notation and then state the theorem.

Let y(x, x') be a label similarity function with |y(x, x')| ≤ 1; we write y_{ij} for y(x_i, x_j) interchangeably. Given a kernel k, we define the true alignment function T(k) as the expectation of y(x, x') k(x, x') over pairs drawn from the data distribution, and the empirical alignment function T̂(k) as its empirical average over the training sample.

In the following, we slightly abuse notation for ease of illustration. Recall the definitions of the random feature map φ̂_ψ and the IKL kernel k_ψ. We define two hypothesis sets, induced by the IKL random features and by the IKL kernels, respectively.

(Rademacher Complexity) Given a hypothesis set H of functions mapping the data to [−1, 1], and a fixed sample S = {x_1, …, x_n}, the empirical Rademacher complexity of H is defined as

R̂_S(H) = E_σ[ sup_{h ∈ H} (1/n) Σ_{i=1}^n σ_i h(x_i) ],

where σ_1, …, σ_n are i.i.d. Rademacher random variables.

We then have the following theorem, showing that the consistency guarantee depends on the complexity of the function class induced by IKL as well as on the number of random features. The proof can be found in Appendix D. (Consistency) Let ψ̂ maximize the empirical alignment, estimated with i.i.d. samples drawn from the data distribution and from P(ν). Then, with probability at least 1 − δ, the population alignment of ψ̂ is close to its optimum, with an error that decreases with the number of samples and random features and increases with the Rademacher complexity defined above.

Applying Cortes et al. (2010), we also obtain a generalization bound, which depends on the number of training data, the number of random features, and the Rademacher complexity of the IKL kernel class, as shown in Appendix E. The Rademacher complexity can, for example, have different rates for kernels satisfying different boundedness conditions (Cortes et al., 2013). We would expect worse rates for more powerful kernels, which suggests a trade-off between consistency/generalization and the use of powerful kernels parametrized by neural networks.

5 Discussion

We propose a generic kernel learning algorithm, IKL, which learns the sampling process of a kernel spectral distribution by transforming samples from a base distribution into samples from the target spectral density. We compare IKL with other algorithms for learning MMD GANs and for supervised learning with Random Kitchen Sinks (RKS). For these two tasks, the conditions and guarantees for using IKL are studied. Empirical studies show that IKL is better than or competitive with state-of-the-art kernel learning algorithms, demonstrating that IKL can learn to transform P(ν) into effective kernels even if P(ν) is less favorable to the task.

We note that the preliminary idea of IKL was mentioned in Băzăvan et al. (2012), but they ended up with an algorithm that directly optimizes the sampled random features (RF), which has many follow-up works (e.g. Sinha and Duchi (2016); Bullins et al. (2018)). The major difference is that, by learning the transformation function h_ψ, the random features used in training and evaluation can be different. This flexibility allows a simple training algorithm (SGD) and does not require keeping the learned features. In our studies on GAN training and RKS, we show that using a simple MLP can already achieve performance better than or competitive with those works, which suggests that IKL can be a new direction for kernel learning and is worth further study.

We highlight that IKL does not conflict with existing works but can be combined with them. In Section 3, we show how to combine IKL with kernel learning via embeddings (Wilson et al., 2016) and with mixtures of spectral distributions (Wilson and Adams, 2013). Therefore, in addition to the examples shown in Section 3 and Section 4, IKL is directly applicable to many existing works on kernel learning via embeddings (e.g. Dai et al. (2014); Li and Póczos (2016); Wilson et al. (2016); Al-Shedivat et al. (2016); Arbel et al. (2018); Jean et al. (2018); Chang et al. (2019)). A possible extension is combining IKL with Bayesian inference (Oliva et al., 2016) under a framework similar to Saatchi and Wilson (2017). The learned sampler from IKL could provide an easier way to perform Bayesian inference via sampling.

References

  • Al-Shedivat et al. (2016) Al-Shedivat, M., Wilson, A. G., Saatchi, Y., Hu, Z., and Xing, E. P. (2016). Learning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936.
  • Alber et al. (2017) Alber, M., Kindermans, P.-J., Schütt, K., Müller, K.-R., and Sha, F. (2017). An empirical study on the properties of random bases for kernel methods. In NIPS.
  • Arbel et al. (2018) Arbel, M., Sutherland, D. J., Bińkowski, M., and Gretton, A. (2018). On gradient regularizers for mmd gans. In NIPS.
  • Arjovsky and Bottou (2017) Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In ICLR.
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. In ICML.
  • Bach (2009) Bach, F. R. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS.
  • Bach et al. (2004) Bach, F. R., Lanckriet, G. R., and Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the smo algorithm. In ICML.
  • Băzăvan et al. (2012) Băzăvan, E. G., Li, F., and Sminchisescu, C. (2012). Fourier kernel learning. In ECCV.
  • Bellemare et al. (2017) Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743.
  • Bińkowski et al. (2018) Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying mmd gans. In ICLR.
  • Borisenko and Minchenko (1992) Borisenko, O. and Minchenko, L. (1992). Directional derivatives of the maximum function. Cybernetics and Systems Analysis, 28(2):309–312.
  • Bullins et al. (2018) Bullins, B., Zhang, C., and Zhang, Y. (2018). Not-so-random features. In ICLR.
  • Chang et al. (2017) Chang, W.-C., Li, C.-L., Yang, Y., and Poczos, B. (2017). Data-driven random fourier features using stein effect. In IJCAI.
  • Chang et al. (2019) Chang, W.-C., Li, C.-L., Yang, Y., and Póczos, B. (2019). Kernel change-point detection with auxiliary deep generative models. arXiv preprint arXiv:1901.06077.
  • Cortes et al. (2013) Cortes, C., Kloft, M., and Mohri, M. (2013). Learning kernels using local rademacher complexity. In NIPS.
  • Cortes et al. (2010) Cortes, C., Mohri, M., and Rostamizadeh, A. (2010). Generalization bounds for learning kernels. In ICML.
  • Cristianini et al. (2002) Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. S. (2002). On kernel-target alignment. In ICML.
  • Dai et al. (2014) Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M.-F. F., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In NIPS.
  • Dudley (2018) Dudley, R. M. (2018). Real Analysis and Probability. Chapman and Hall/CRC.
  • Duvenaud et al. (2013) Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2013). Structure discovery in nonparametric regression through compositional kernel search. In ICML.
  • Dziugaite et al. (2015) Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. In UAI.
  • Fan et al. (2008) Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). Liblinear: A library for large linear classification. JMLR.
  • Gönen and Alpaydın (2011) Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. JMLR.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. (2014). Generative adversarial nets. In NIPS.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. JMLR.
  • Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of wasserstein gans. In NIPS.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS.
  • Hinton and Salakhutdinov (2008) Hinton, G. E. and Salakhutdinov, R. R. (2008). Using deep belief nets to learn covariance kernels for gaussian processes. In NIPS.
  • Jean et al. (2018) Jean, N., Xie, S. M., and Ermon, S. (2018). Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance. In Advances in Neural Information Processing Systems, pages 5327–5338.
  • Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.
  • Lanckriet et al. (2004) Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. JMLR.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
  • Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Poczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In NIPS.
  • Li and Póczos (2016) Li, C.-L. and Póczos, B. (2016). Utilize old coordinates: Faster doubly stochastic gradients for kernel methods. In UAI.
  • Li et al. (2015) Li, Y., Swersky, K., and Zemel, R. (2015). Generative moment matching networks. In ICML.
  • MacKay (1995) MacKay, D. J. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment.
  • Mairal (2016) Mairal, J. (2016). End-to-end kernel learning with supervised convolutional kernel networks. In NIPS.
  • Mairal et al. (2014) Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. (2014). Convolutional kernel networks. In NIPS.
  • Mroueh et al. (2018) Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. (2018). Sobolev gan. In ICLR.
  • Mroueh and Sercu (2017) Mroueh, Y. and Sercu, T. (2017). Fisher gan. In NIPS.
  • Mroueh et al. (2017) Mroueh, Y., Sercu, T., and Goel, V. (2017). Mcgan: Mean and covariance feature matching gan. In ICML.
  • Nowozin et al. (2016) Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-gan: Training generative neural samplers using variational divergence minimization. In NIPS.
  • Oliva et al. (2016) Oliva, J. B., Dubey, A., Wilson, A. G., Póczos, B., Schneider, J., and Xing, E. P. (2016). Bayesian nonparametric kernel-learning. In AISTATS.
  • Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.
  • Rahimi and Recht (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.
  • Rahimi and Recht (2009) Rahimi, A. and Recht, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS.
  • Rudin (2011) Rudin, W. (2011). Fourier analysis on groups. John Wiley & Sons.
  • Saatchi and Wilson (2017) Saatchi, Y. and Wilson, A. G. (2017). Bayesian gan. In NIPS, pages 3625–3634.
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training gans. In NIPS.
  • Sinha and Duchi (2016) Sinha, A. and Duchi, J. C. (2016). Learning kernels with random features. In NIPS.
  • Wilson and Adams (2013) Wilson, A. and Adams, R. (2013). Gaussian process kernels for pattern discovery and extrapolation. In ICML.
  • Wilson et al. (2016) Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016). Deep kernel learning. In AISTATS.
  • Yang et al. (2015) Yang, Z., Wilson, A., Smola, A., and Song, L. (2015). A la carte–learning fast kernels. In AISTATS.
  • Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. (2017). A hitting time analysis of stochastic gradient langevin dynamics. In COLT.

Appendix A Proof of Theorem 3.2

We first show that convergence of P_θ to P_X in distribution implies that the objective converges to zero. The result is based on Arbel et al. (2018), which leverages Corollary 11.3.4 of Dudley (2018). Following the sketch of Arbel et al. (2018), the only thing that remains is to show that the IKL kernel is Lipschitz. This follows from the definition of the kernel, the Lipschitz assumptions on f_φ and h_ψ, and the fact that cos is a Lipschitz function with Lipschitz constant 1.

The other direction, that convergence of the objective to zero implies convergence of P_θ to P_X in distribution, is relatively simple. Without loss of generality, we assume there exist φ and ψ such that f_φ and h_ψ are identity functions (up to scaling), which recover a Gaussian kernel. Therefore, the maximized objective converging to zero implies that the MMD with this Gaussian kernel converges to zero, which completes the proof because the MMD with any Gaussian kernel is weak (Gretton et al., 2012).

A.1 Continuity

(Borisenko and Minchenko (1992)) Define v(θ) = max_{u ∈ U} f(θ, u). If f is locally Lipschitz in θ, U is compact, and the gradient ∇_θ f(θ, u*) exists for maximizers u* ∈ argmax_{u ∈ U} f(θ, u), then v(θ) is differentiable almost everywhere.

We are going to show that the objective

max_{φ, ψ} M_{k_ψ∘f_φ}(P_X, P_θ)   (10)

is differentiable with respect to θ almost everywhere by using the auxiliary Lemma A.1. We first show that the function inside the maximum in (10) is locally Lipschitz in θ. Expanding the MMD by definition and using the assumption that g_θ is locally Lipschitz in θ with an upper bound on its Lipschitz constants, together with Assumption 3.2, we prove that each term is locally Lipschitz; a similar argument applies to the other terms in (10), so (10) is locally Lipschitz in θ. Lastly, with the compactness assumptions on the parameter spaces of φ and ψ, and the differentiability assumption on g_θ, applying Lemma A.1 proves Theorem 3.2.

Appendix B Proof of Lemma 3.2

Without loss of generality, we can rewrite the kernel function in terms of the sampled frequencies h_ψ(ν) applied to the bounded embedding f_φ. The last inequality in the derivation follows from this boundedness: since f_φ is bounded, there exists a constant bounding the pairwise differences of the embeddings. By the mean value theorem, for any pair of inputs there exists an intermediate point at which the difference of kernel values is controlled by the gradient, whose norm is in turn bounded in terms of E_ν[‖h_ψ(ν)‖²]. Combining these bounds, we prove the lemma.

Appendix C Additional Studies of MMD GAN with IKL

C.1 Additional Quantitative Results

We show the full quantitative results of MMD GANs with different kernels, with mean and standard error, in Table 2. In every task, IKL is the best among the predefined base kernels (Gaussian, RQ) and the competitive kernel learning algorithm (SM). The difference in FID is less significant than in inception score and JS-4, but we note that the FID score is a biased evaluation metric, as discussed in Bińkowski et al. (2018).

Method Inception Scores FID Scores JS-4
Gaussian
RQ
SM
IKL
WGAN-GP
Table 2: Inception scores, FID scores, and JS-4 divergence results.

C.2 Computational Issues of GAN Training with IKL

Model Capacity

For f_φ, DCGAN has on the order of millions of parameters for the image sizes considered, and the ResNet architecture used in Gulrajani et al. (2017) likewise has millions of parameters. In contrast, in all experiments we use a simple three-layer MLP as h_ψ for IKL, where the input and output dimensions are 16 and the hidden layer size is 32; the total number of parameters is only around 2,000. Compared with f_φ, the additional number of parameters used for h_ψ is almost negligible.

Computational Time

The potential concern with IKL is sampling random features for every example. In our experiments, we use a fixed number of random features in each iteration. We measure the time per iteration of updating the critic, using the respective critic update schedules for WGAN-GP, MMD GAN with a Gaussian kernel, and IKL, with different batch sizes on a Titan X. The differences between WGAN-GP, MMD GAN and IKL are not significant. The reason is that computing the MMD and the random features is highly parallelizable, while other computation, such as evaluating f_φ and its gradient penalty, dominates the cost because f_φ has many more parameters, as mentioned above. Therefore, we believe the proposed IKL is still cost-effective in practice.

C.3 Detailed Discussion of Variance Constraints

In Section 3, we propose to constrain the variance E_ν[‖h_ψ(ν)‖²]. There are other alternatives, such as different penalty formulations or the use of Lagrange multipliers. In practice, we do not observe significant differences.

Although we show the necessity of the variance constraint for language generation in Figure 3, we remark that the proposed constraint is a sufficient condition. For CIFAR-10, without the constraint, we observe that the variance still fluctuates within a bounded range rather than exploding as in Figure 3. Therefore, the training still leads to a satisfactory inception score, though slightly worse than IKL with the constraint in Table 1. Necessary or weaker sufficient conditions are worth further study as future work.

C.4 IKL with and without Neural Networks for GAN Training

Instead of learning a transformation function h_ψ for the spectral distribution as proposed in Section 2 (IKL-NN), another realization of IKL is to keep a pool of a finite number of learned random features and approximate the kernel evaluation with the corresponding feature map, directly optimizing the features in the pool during learning. Many existing works study this idea for supervised learning, such as Băzăvan et al. (2012); Yang et al. (2015); Sinha and Duchi (2016); Chang et al. (2017); Bullins et al. (2018). We call this latter realization IKL-RFF. Next, we discuss and compare the differences between IKL-NN and IKL-RFF.
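For concreteness, an IKL-RFF kernel can be sketched by making the pool of frequencies itself the learnable parameter (a hypothetical PyTorch module; shapes assumed for illustration):

```python
import torch
import torch.nn as nn

class IKLRFFKernel(nn.Module):
    # IKL-RFF: directly optimize a finite pool of M frequencies,
    # instead of learning a sampler h_psi as in IKL-NN.
    def __init__(self, dim, M=1024):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(M, dim))    # learned frequency pool

    def forward(self, x, y):
        diff = x.unsqueeze(1) - y.unsqueeze(0)            # (n, n', dim)
        return torch.cos(diff @ self.omega.t()).mean(-1)  # (n, n') kernel values
```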

The crucial difference between IKL-NN and IKL-RFF is that IKL-NN can sample an arbitrary number of random features by first sampling ν and transforming it via h_ψ, while IKL-RFF is restricted by the pool size. If the application needs more random features, IKL-RFF becomes memory inefficient. Specifically, we compare IKL-NN and IKL-RFF with different numbers of random features in Figure 6. With the same number of parameters, IKL-NN outperforms IKL-RFF on inception scores. For IKL-RFF to achieve the same or better inception scores than IKL-NN, the number of random features has to be increased considerably, which is less memory efficient than the IKL-NN realization. In particular, h_ψ in IKL-NN is a three-layer MLP with a fixed number of parameters, while the number of parameters of IKL-RFF grows linearly with the pool size.

Figure 6: Comparison between IKL-NN and IKL-RFF on CIFAR-10 under different numbers of random features.
Table 3: Comparison (JS-4) between IKL-NN, IKL-RFF, and IKL-RFF with additional updates (+2, +4, +8) on Google Billion Word.

On the other hand, using a large pool for IKL-RFF not only increases the number of parameters but may also increase the optimization difficulty. Zhang et al. (2017) discuss the difficulty of directly optimizing RFF on different tasks. Here we compare IKL-NN and IKL-RFF on the challenging Google Billion Word dataset. We train IKL-RFF with the same settings as in Section 3.3 and Appendix F.1, matching the pool size and the update schedule between critic and generator, but tuning the Adam optimization parameters for IKL-RFF for a fair comparison. As discussed above, note that IKL-RFF uses considerably more parameters than h_ψ. The results are shown in Table 3. Even though IKL-RFF uses more parameters, its performance is not competitive with IKL-NN.

In Algorithm 1, we update φ and ψ several times per generator update. We keep the number of updates for φ fixed but increase the number of updates for ψ (the IKL-RFF pool) in each iteration. The results are shown in Table 3 with the symbols +2, +4 and +8, respectively. Clearly, IKL-RFF needs more updates to achieve performance competitive with IKL-NN. These results may imply that IKL-RFF, with more parameters, poses a more difficult optimization problem than IKL-NN. They also confirm the effectiveness of learning implicit generative models with deep neural networks (Goodfellow et al., 2014), though the underlying theory is still an open research question. A better optimization algorithm (Zhang et al., 2017) may close the performance gap between IKL-NN and IKL-RFF, which is worth further study as future work.

Appendix D Proof of Theorem 4.2

We first prove two Lemmas.

(Consistency with respect to data) With probability at least 1 − δ, we have