1 Introduction
Kernels are important tools in machine learning, and find applications in a wide range of models, from one of the earliest models, the support vector machine (SVM), to the latest work on statistical testing with the Maximum Mean Discrepancy (MMD) and various applications in deep generative models, including generative adversarial networks (GANs) and variational autoencoders (VAEs)
[DBLP:conf/nips/LiCCYP17, DBLP:journals/corr/ZhaoSE17b]. These models are built either by restricting the solution space to a Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, or by adopting the MMD metric as an objective function, which requires a specified kernel. Other kernel-based methods, such as those proposed by DBLP:conf/icml/YinZ18 and FengWL:UAI17, rely on special kernels for estimating quantities such as gradients.
An undesirable issue with the aforementioned kernel-based methods, however, is the need to select appropriate hyperparameters, e.g., the bandwidth of an RBF kernel. These hyperparameters are critical for obtaining good performance, and manual selection often leads to suboptimal solutions. Preliminary works have been proposed to mitigate this problem. For example, DBLP:journals/jmlr/GonenA11 proposed to learn a combination of predefined kernels. Alternatively, some recent works focus on learning kernels based on random features, or on the corresponding spectral distributions obtained by taking the Fourier transform of the kernel [RahimiR:NIPS07, DBLP:conf/eccv/BazavanLS12, DBLP:conf/icml/WilsonA13] (see Section 3.1 for a more detailed description).
As a principled approach, in this paper we propose a new kernel-learning paradigm: we define a more expressive yet easy-to-handle distribution of the corresponding kernel in the spectral domain, and parameterize it as a data-dependent implicit distribution with deep neural networks (DNNs). We call the resulting network KernelNet, and call the kernel corresponding to the data-dependent distribution a data-dependent kernel. Specifically, based on related kernel theory, we formulate a kernel in the spectral domain as an expectation of a function w.r.t. some random features, whose distribution is called the spectral distribution. The particular function and the spectral distribution then allow us to impose expressive DNN structures to parameterize the kernel. We propose a novel and expressive way of parameterizing the spectral distribution as a data-dependent implicit distribution. This is distinct from a recent work [DBLP:journals/corr/abs190210214], which modeled the spectral distribution as a data-independent, and thus less expressive, distribution. Our method is more general than that of DBLP:journals/corr/abs190210214, which is a special case of ours. Moreover, the data-dependent kernel component in KernelNet can lead to performance improvements over data-independent parameterizations, as evidenced by our experiments.
The proposed KernelNet is flexible enough to be applied to existing models. We apply it to Deep Generative Models (DGMs) based on two representative DGM frameworks, the Maximum Mean Discrepancy GAN (MMD-GAN) [DBLP:conf/nips/LiCCYP17] and the InfoVAE [DBLP:journals/corr/ZhaoSE17b]. Specifically, we apply our kernel parameterization to several state-of-the-art MMD-GAN variants, including the spectrally normalized MMD-GAN (SN-MMD-GAN) [DBLP:conf/nips/ArbelSBG18, DBLP:conf/iclr/MiyatoKKY18] and the scaled MMD-GAN (SMMD-GAN) [DBLP:conf/nips/ArbelSBG18], which allows us to learn the models and the KernelNet simultaneously. We show theoretically that our proposed models are continuous in the weak topology, an important property for ensuring the robustness of optimization procedures. We also propose an implicit VAE model, where an MMD regularizer is incorporated into the VAE objective, following the InfoVAE framework. Our model is implicit in the sense that the variational distribution is parameterized as an implicit distribution implemented by a conditional generator, in contrast to the typical Gaussian assumption in the standard VAE. This enables us to model a much more flexible latent space. Learning is done by adopting the Stein gradient estimator to optimize the implicit distribution [LiT:ICLR18]. Extensive experiments are performed to evaluate our proposed framework; the results suggest that it achieves significant improvements over existing ones.
2 Preliminaries
We review MMD-GAN and InfoVAE, two DGMs to which our proposed data-dependent kernels apply.
2.1 MMD-GAN
Generative Adversarial Networks (GANs) are among the most popular and powerful generative models in deep learning. A GAN consists of a generator and a discriminator. The generator generates samples by transforming a simple noise distribution into an implicit data distribution, i.e., one can easily generate samples from the distribution, but without knowledge of its density function. The discriminator is trained to distinguish the true training data distribution from the implicit distribution induced by the generator. The generator, on the other hand, is trained to fool the discriminator. At equilibrium, the generator is able to generate samples distributed as the true data distribution. MMD-GAN achieves this by minimizing the maximum mean discrepancy (MMD) between two probability measures, the data and model distributions.

The Maximum Mean Discrepancy (MMD) between two probability measures $P$ and $Q$ is defined as:
$$\mathrm{MMD}(P, Q) = \sup_{f \in \mathcal{H},\, \|f\|_{\mathcal{H}} \le 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right),$$
where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) and $f$ is a function in this RKHS.
For an RKHS induced by a kernel $k$, the squared MMD can be computed using the following equation:
$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')].$$
For a characteristic kernel, $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$. Thus, MMD can be used as a way of measuring the similarity of distributions or as a training objective. DBLP:conf/nips/LiCCYP17 proposes MMD-GAN, where the kernel is defined as the composition of an injective function $f_\phi$ for feature extraction and a kernel function $k$ for kernel evaluation, i.e., $\tilde{k}(x, y) = k(f_\phi(x), f_\phi(y))$. Note that $\tilde{k}$ is also a valid kernel function; for example, if $k$ is the Gaussian kernel, $\tilde{k}$ is also a kernel. In addition, a generator $G_\theta$ parameterized by $\theta$ is introduced for data generation, whose generated data are expected to match the training data in distribution by minimizing the corresponding MMD. Specifically, let $P_{\mathrm{data}}$ represent the training data distribution and $P_\theta$ the implicit output distribution induced by the generator. The objective of MMD-GAN is formulated as:
$$\min_\theta \max_\phi \mathrm{MMD}^2_{k \circ f_\phi}(P_{\mathrm{data}}, P_\theta).$$
In MMD-GAN, both the feature extractor and the generator are parameterized by neural networks. Because of the min-max adversarial training, the generator distribution will eventually match the data distribution in theory. Several improvements have also been proposed. One particular variant we focus on is called the Scaled Maximum Mean Discrepancy.
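The squared MMD above can be estimated directly from finite samples. Below is a minimal NumPy sketch of the (biased, V-statistic) estimator with a Gaussian kernel; the function names and bandwidth choice are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution give a near-zero estimate;
# shifting one distribution makes the estimate clearly positive.
same = mmd2(rng.normal(0, 1, (500, 2)), rng.normal(0, 1, (500, 2)))
diff = mmd2(rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2)))
```

In practice MMD-GAN uses the unbiased U-statistic variant, but the V-statistic form above is the shortest way to see the three-term structure of the estimator.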

The Scaled Maximum Mean Discrepancy (SMMD) rescales the MMD by a data-dependent factor:
$$\mathrm{SMMD}_{k,\lambda}(P, Q) = \sigma_{k,\lambda}(P)\cdot \mathrm{MMD}_k(P, Q),$$
where $\sigma_{k,\lambda}(P)$ is a scale determined by the kernel, the distribution $P$, and a parameter $\lambda$.
Replacing the objective function of MMD-GAN by SMMD leads to SMMD-GAN [DBLP:conf/nips/ArbelSBG18]. One useful and important method for training MMD-GAN is spectral normalization [DBLP:conf/iclr/MiyatoKKY18], which controls the Lipschitz constant of the injective function and leads to stable training and better performance. Spectral normalization normalizes a weight matrix $W$ by its spectral norm $\sigma(W)$, i.e., $\bar{W} = W / \sigma(W)$. The Lipschitz constant of each layer is thus bounded from above by 1. MMD-GAN with spectral normalization is called SN-MMD-GAN, whereas SMMD-GAN with spectral parameterization [DBLP:conf/nips/ArbelSBG18, DBLP:conf/iclr/MiyatoKKY18], defined as $W = \gamma \bar{W}$ with $\gamma$ a learnable parameter, is called SN-SMMD-GAN.
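Spectral normalization divides a weight matrix by its largest singular value, which is typically approximated by power iteration. A minimal NumPy sketch (the one-step-per-update scheme of Miyato et al. is simplified here to run the iteration to convergence; names are illustrative):

```python
import numpy as np

def spectral_normalize(W, n_iters=100):
    """Return W / sigma(W), where sigma(W) is the largest singular value,
    estimated by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # Rayleigh-quotient estimate of the spectral norm
    return W / sigma

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalize(W)
# The normalized matrix has spectral norm ~1, so the linear map is ~1-Lipschitz.
```

In the GAN setting, the vector `u` is kept between updates and only one power-iteration step is run per training step, since the weights change slowly.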
2.2 InfoVAE
VAE and its variants are another family of DGMs where latent spaces are defined with posterior distributions. Specifically, define a generative process for an observation , starting from the corresponding latent variable , as: with and the parameter of the generative model. For efficient inference of , VAE [KingmaW:ICLR14, RezendeMW:ICML14] defines an inference network (or encoder) to generate from , with the corresponding generation distribution being parameterized by (also called variational distribution).
InfoVAE [DBLP:journals/corr/ZhaoSE17b] is a generalization of VAE by introducing an informationtheoretic regularizer into the VAE framework. The objective is:
(1) 
where is the mutual information between and that is defined as:
Note that both the KL divergence and MMD describe the difference between distributions. Thus, in our approach, we propose to use an MMD regularizer instead of the mutual information term; some discussion can be found in Appendix E. Furthermore, our model considers an implicit setting where the variational distribution is defined as an implicit distribution, making our model more expressive.
3 KernelNet for Learning Deep Generative Models
3.1 The Proposed KernelNet
To alleviate the difficulties with predefined kernels, such as hyperparameter selection, we propose KernelNet, a principled way to parameterize a kernel with a deep neural network (DNN). Our method improves on the recent work of DBLP:journals/corr/abs190210214 by making the kernel data-dependent and applying it to different DGMs, which is shown to boost model performance.
We start with a classic result on positive definite functions [Rudin:94], stated in Lemma 1.
Lemma 1 (Rudin:94)
A continuous function on $\mathbb{R}^d$ is positive definite if and only if it is the Fourier transform of a non-negative measure.
Let $\tau = x - y$. By Lemma 1 and [RahimiR:NIPS07], a kernel $k$ such that $k(x, y) = k(\tau)$ can be represented as:
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega)}\left[\varphi_\omega(x)\,\varphi_\omega(y)^*\right], \qquad (2)$$
where $\varphi_\omega(x) = e^{i\omega^\top x}$, and "$*$" denotes the conjugate transpose. The kernel representation (2) directly allows us to construct an unbiased estimator for $k(x, y)$ by introducing any valid distribution $p(\omega)$ for the augmented variable $\omega$, called the spectral distribution. This distribution will be parameterized by an implicit distribution with a DNN, as described later. In the following, we first reformulate (2) into two equivalent forms for the purposes of analysis and algorithm design, respectively. Because the probability density function and the kernel function are real-valued, by Euler's formula, we have:
Proposition 2
Let $\omega$ be drawn from some spectral distribution $p(\omega)$, and $b$ be drawn uniformly from $[0, 2\pi]$. The real-valued kernel in (2) can be reformulated into the following two forms:
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega)}\left[\cos\left(\omega^\top (x - y)\right)\right] \qquad (3)$$
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega),\, b \sim \mathcal{U}[0, 2\pi]}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top y + b\right)\right] \qquad (4)$$
Detailed derivations can be found in Appendix A. Of the above two representations, (3) is more convenient for analysis, while (4) is found to be more stable in implementation.
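For a concrete instance of (3) and (4): the Gaussian kernel $\exp(-\|x-y\|^2/2)$ has spectral distribution $p(\omega) = \mathcal{N}(0, I)$, so Monte Carlo samples of $\omega$ (and $b$) give an unbiased estimate of the kernel value. A small NumPy check of both forms (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 3, 200_000
x, y = rng.normal(size=d), rng.normal(size=d)

# Spectral distribution of the Gaussian kernel exp(-||x - y||^2 / 2) is N(0, I).
omega = rng.normal(size=(n_samples, d))
b = rng.uniform(0, 2 * np.pi, size=n_samples)

k_true = np.exp(-np.sum((x - y) ** 2) / 2)
k_form3 = np.mean(np.cos(omega @ (x - y)))                            # Eq. (3)
k_form4 = np.mean(2 * np.cos(omega @ x + b) * np.cos(omega @ y + b))  # Eq. (4)
```

Both estimates converge to the true kernel value at the usual $O(1/\sqrt{n})$ Monte Carlo rate; form (4) has slightly higher variance but avoids computing the difference $x - y$, which matters once the spectral samples become data-dependent.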
To enhance the expressive power, we can make the distribution $p(\omega)$ complex enough and learnable by parameterizing it with a DNN that induces an implicit distribution, as in generative adversarial networks (GANs) [GoodfellowAMXWOCB:NIPS14]. Specifically, we rewrite $p(\omega)$ as $p_\psi(\omega)$ with parameter $\psi$. A sample from $p_\psi(\omega)$ is modeled by the following generating process:
$$\omega = g_\psi(z), \quad z \sim q(z), \qquad (5)$$
where $g_\psi(z)$ denotes the output of a DNN parameterized by $\psi$, with the input $z$ drawn from some simple distribution such as a standard Gaussian or uniform distribution.
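The generating process (5) can be sketched as a small network that maps simple noise to spectral samples, with the resulting (data-independent) kernel estimated via form (4). The architecture and names below are illustrative assumptions; in the paper the network is trained jointly with the downstream model.

```python
import numpy as np

rng = np.random.default_rng(0)

class SpectralGenerator:
    """Implicit spectral distribution: omega = g_psi(z), z ~ N(0, I) (Eq. 5).
    A toy 2-layer MLP with random fixed weights stands in for the DNN g_psi."""
    def __init__(self, z_dim, out_dim, hidden=16):
        self.W1 = rng.normal(size=(z_dim, hidden)) * 0.5
        self.W2 = rng.normal(size=(hidden, out_dim)) * 0.5
    def sample(self, n):
        z = rng.normal(size=(n, self.W1.shape[0]))
        return np.tanh(z @ self.W1) @ self.W2  # omega samples

def implicit_kernel(x, y, gen, n=4096):
    """Monte Carlo kernel estimate via Eq. (4) under the implicit p_psi(omega)."""
    omega = gen.sample(n)
    b = rng.uniform(0, 2 * np.pi, size=n)
    return np.mean(2 * np.cos(omega @ x + b) * np.cos(omega @ y + b))

gen = SpectralGenerator(z_dim=4, out_dim=3)
# Sanity check: k(x, x) = E[2 cos^2(.)] = 1 for any spectral distribution.
k_xx = implicit_kernel(np.ones(3), np.ones(3), gen)
```

Because the kernel value is a differentiable function of the network weights (the sampling noise enters only through the inputs), gradients w.r.t. $\psi$ flow through standard backpropagation.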
Extension to data-dependent kernels
Although the above kernel parameterization is flexible enough to represent a rich family of implicit spectral distributions, it can be further extended by introducing a data-dependent spectral distribution. By data-dependent spectral distribution, we mean that there are some kernels satisfying (2) whose spectral distributions depend on the data pair $(x, y)$, i.e., there exists a conditional distribution for each pair and a marginal distribution across the whole dataset.
We use the term data-dependent for two reasons: 1) the marginal distribution depends on the specific dataset, which could induce different forms on different datasets; 2) for a given input pair $(x, y)$, the conditional distribution and the kernel value depend on the input pair. Thus equal Euclidean differences between two input pairs do not necessarily imply equal kernel values. Proposition 3 shows that this setting is possible.
Proposition 3
There exist some positive definite kernels which can be expressed as:
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega \mid x, y),\, b \sim \mathcal{U}[0, 2\pi]}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top y + b\right)\right].$$
The proof is provided in Appendix B. One example of such a kernel is a symmetric positive definite kernel defined on a manifold, whose value depends on the geodesic distance between two points rather than the Euclidean distance.
Some practical tricks can also be understood in the sense of data-dependent kernels. For example, using the median-heuristic bandwidth can be seen as dividing the two inputs by a data-dependent quantity. Substituting the rescaled inputs into (2), one ends up with:
(6)
where the rescaled frequency absorbs the data-dependent scale. The kernel definition in (6) is thus similar to a kernel with a data-dependent spectral distribution.
Constructing a data-dependent kernel network
To construct a data-dependent kernel network, we extend (5) to a generative process conditioned on the input pair. Note that such an implicit construction requires multiple noise samples to approximate the distribution of $\omega$ for each pair, which could be time-consuming when minibatch sizes are large. To tackle this issue, we propose to decompose the kernel into an explicit data-independent component and an implicit data-dependent component. Note that such a decomposition still guarantees that the overall kernel is implicit, and thus does not lose generality. Specifically, we use the reparameterization trick to define a data-dependent sampling process, i.e., a sample from the data-dependent spectral distribution, denoted by $\tilde{\omega}$ as it depends on $(x, y)$:
$$\tilde{\omega} = \mu_\eta(x, y) + \sigma_\eta(x, y) \odot \omega, \qquad (7)$$
where $\mu_\eta$ and $\sigma_\eta$ are outputs from a DNN parameterized by $\eta$, and $\odot$ denotes element-wise multiplication. Because this may lead to an asymmetric kernel, we can further construct a symmetric kernel by making the DNN outputs invariant to the ordering of $(x, y)$. For the data-independent component, we adopt the implicit representation defined in (5). Consequently, the overall KernelNet is constructed as:
(8)  
(9) 
The network structure is illustrated in Figure 1. In implementation, the expectations are approximated by samples, e.g.:
where the $\omega$'s are samples from the spectral distributions through (5). In addition, the $b$'s are drawn from $\mathcal{U}[0, 2\pi]$. Since the construction is implicit with no stochastic intermediate nodes, standard backpropagation can be applied for efficient end-to-end training. Lemma 4 below guarantees that the output of the KernelNet (8) is still a legitimate kernel.
Lemma 4 (DBLP:books/daglib/0026002)
Let $k_1$ and $k_2$ be two kernels over $\mathcal{X} \times \mathcal{X}$; then $k_1 \cdot k_2$ is also a kernel.
It is worth noting that if we remove the data-dependent component, the kernel reduces to the one in DBLP:journals/corr/abs190210214. We will show in the experiments that the data-dependent component plays an important role and can lead to performance improvement.
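One possible end-to-end sketch of the construction above, combining a data-independent component with a reparameterized data-dependent component as the product allowed by Lemma 4. All network details (the toy stand-in for the DNN producing $\mu$ and $\sigma$, the symmetric pairing of inputs, the fixed base spectral distribution) are our illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_sigma(x, y, W):
    """Toy stand-in for the DNN producing mu(x, y) and sigma(x, y).
    Inputs are combined symmetrically so that the kernel is symmetric."""
    pair = np.concatenate([x + y, x * y])
    h = np.tanh(pair @ W)
    d = len(x)
    return h[:d], np.abs(h[d:2 * d]) + 0.1  # mu, sigma > 0

def kernelnet(x, y, W, n=4096):
    """Product of a data-independent and a data-dependent component (Lemma 4)."""
    b = rng.uniform(0, 2 * np.pi, size=n)
    # Data-independent part: omega from a fixed base spectral distribution.
    om = rng.normal(size=(n, len(x)))
    k_ind = np.mean(2 * np.cos(om @ x + b) * np.cos(om @ y + b))
    # Data-dependent part: reparameterized omega~ = mu(x,y) + sigma(x,y) * eps (Eq. 7).
    mu, sig = mu_sigma(x, y, W)
    om_dep = mu + sig * rng.normal(size=(n, len(x)))
    k_dep = np.mean(2 * np.cos(om_dep @ x + b) * np.cos(om_dep @ y + b))
    return k_ind * k_dep

d = 2
W = rng.normal(size=(2 * d, 2 * d)) * 0.3
# Sanity check: each component satisfies k(x, x) ~ 1, so the product does too.
k_xx = kernelnet(np.array([1.0, 0.0]), np.array([1.0, 0.0]), W)
```

Because both components are Monte Carlo averages of differentiable functions of the network outputs, the whole kernel value can be backpropagated through, which is what allows joint training with the generator in the next section.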
3.2 KernelNet for MMDGAN
In this section, we incorporate the proposed KernelNet into learning the MMD-GAN model. We seek to develop an algorithm that jointly optimizes both the KernelNet and the MMD-GAN model. A straightforward way is to replace the standard kernel in MMD-GAN with the proposed data-dependent kernel (8). However, as the standard MMD-GAN fails to satisfy continuity in the weak topology, a property that ensures metric continuity in a topological space, it is unclear whether the variant with our KernelNet would satisfy this property. To this end, we first define continuity in the weak topology.

A divergence $d$ is said to be continuous in the weak topology if $P_n \Rightarrow P$ implies $d(P_n, P) \to 0$, where $\Rightarrow$ denotes convergence in distribution.
Continuity in the weak topology is important for MMD-GAN because, as discovered by DBLP:conf/nips/ArbelSBG18, it makes the loss provide a better signal to the generator as the model distribution approaches the data distribution. DBLP:conf/nips/ArbelSBG18 also found that the optimized MMD distance in MMD-GAN is not continuous in the weak topology, leading to training instability and poor performance. To alleviate this problem, a number of methods have been introduced, e.g., weight-clipping [DBLP:conf/nips/LiCCYP17], spectral normalization [DBLP:conf/iclr/MiyatoKKY18], gradient penalty [GulrajaniAADC:NIPS17], and a scaled objective (SMMD-GAN) [DBLP:conf/nips/ArbelSBG18]. Fortunately, we can prove that adopting our KernelNet in MMD-GAN leads to continuity in the weak topology, based on the recent work of DBLP:conf/nips/ArbelSBG18.
Proposition 5
By parameterizing the kernel with our KernelNet, the resulting MMD is continuous in the weak topology if the following conditions are satisfied:
(10)
where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $f_\phi$ is the injective function in MMD-GAN, and $\mathrm{Lip}(f_\phi)$ denotes its Lipschitz constant.
The proof can be found in Appendix C. Based on Proposition 5, we propose two variants of the MMD-GAN model, corresponding respectively to SN-MMD-GAN [DBLP:conf/nips/ArbelSBG18, DBLP:conf/iclr/MiyatoKKY18] and SMMD-GAN [DBLP:conf/nips/ArbelSBG18], by incorporating the conditions in Proposition 5 into the objective functions.
MMD-GAN with the KernelNet
By adopting spectral normalization and the method of Lagrange multipliers to enforce the conditions in Proposition 5, we propose SN-MMD-GAN-DK. Note that the Lipschitz condition is satisfied because of the spectral normalization operation, which normalizes the weight matrices during training. The objective is defined as:
(11)
where the "$\pm$" in (11) means we use "$+$" when minimizing and "$-$" when maximizing, according to Proposition 5.
Scaled MMD-GAN with the KernelNet
Similarly, based on the SMMD-GAN model of DBLP:conf/nips/ArbelSBG18, we propose our variant by incorporating the conditions in Proposition 5 into the SMMD framework of Definition 2.1. First, we have:
Proposition 6
The derivation can be found in Appendix D. Consequently, by incorporating the conditions in Proposition 5, the objective for SN-SMMD-GAN-DK is defined as:
(12)
where the "$\pm$" has the same meaning as in (11). Note that SMMD already satisfies continuity in the weak topology; nevertheless, we found that for SMMD-DK, adding a regularizer as in (12) leads to better results.
3.3 KernelNet for Implicit InfoVAE
In this section, we describe how to incorporate our KernelNet into the InfoVAE framework. First, to increase the power of InfoVAE, we make the encoder implicit. That is, instead of adopting a particular distribution family such as the Gaussian for the encoder, we add random noise at each layer of the encoder (including the input data) and transform these simple noise distributions into a complex implicit distribution via the encoder network, as done in GANs. One problem with such a formulation is the need to evaluate the density of the implicit encoder distribution for model training, as seen in the original ELBO (1). To deal with this issue, we adopt the Stein gradient estimator (SGE) [LiT:ICLR18] to approximate the gradient of the log-density used in training.
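The Stein gradient estimator of [LiT:ICLR18] recovers $\nabla \log q$ from samples of $q$ alone by inverting a kernelized Stein identity, $G = -(K + \eta I)^{-1}\langle\nabla, K\rangle$. A minimal NumPy sketch with an RBF kernel; the ridge coefficient and the median-heuristic bandwidth are our assumptions:

```python
import numpy as np

def stein_gradient_estimator(X, eta=0.1):
    """Estimate g_i = grad log q(x_i) from samples X (M x d) drawn from q,
    via G = -(K + eta*I)^{-1} <grad, K> with an RBF kernel."""
    M, d = X.shape
    diffs = X[:, None, :] - X[None, :, :]          # (M, M, d), diffs[m, j] = x_m - x_j
    sq = np.sum(diffs ** 2, axis=-1)
    h = np.median(sq) + 1e-8                        # median-heuristic bandwidth
    K = np.exp(-sq / h)
    # <grad, K>_j = sum_m grad_{x_m} k(x_m, x_j) = sum_m (-2/h) (x_m - x_j) k(x_m, x_j)
    grad_K = np.einsum('mj,mjd->jd', K, -2.0 / h * diffs)
    return -np.linalg.solve(K + eta * np.eye(M), grad_K)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))        # q = N(0, 1), so grad log q(x) = -x
G = stein_gradient_estimator(X)
```

On the Gaussian example the estimated gradients track $-x$ closely, which is what allows the implicit encoder's entropy-gradient term to be approximated without an explicit density.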
To introduce our KernelNet into the framework, we notice that mutual information in (1) is essentially a divergence metric between the two distributions and . Consequently, we propose to simply replace the mutual information by the MMD parameterized by our KernelNet. The final objective is thus:
which can be reformulated as:
(13) 
Note that the discarded term is independent of the model and thus can be dropped. For appropriate settings of the regularization weights, (13) reduces to the vanilla VAE.
4 Experiments
We conduct experiments to test the performance of our proposed KernelNet applied to MMD-GAN and the implicit VAE, and compare them with related methods, including MMD-based and non-MMD-based GANs, and parametric and nonparametric implicit VAE models. Our code will be available online.
4.1 MMD-GAN
We evaluated our MMD-GAN variants on the CIFAR-10, CelebA [liu2015faceattributes], and ImageNet [DBLP:journals/ijcv/RussakovskyDSKS15] datasets. Following DBLP:conf/nips/ArbelSBG18, the output dimension of the discriminator is set to 1, and inputs of the generator are sampled from a uniform distribution. For a fair comparison, all models are compared under the same architecture in each experiment. Our model architecture follows DBLP:conf/nips/ArbelSBG18. For CIFAR-10, we use an architecture with a 7-layer discriminator and a 4-layer generator. For CelebA, we use a 5-layer discriminator and a 10-layer ResNet generator. For ImageNet, the generator and discriminator are both 10-layer ResNets. We use two 3-layer fully-connected neural networks to parameterize the data-independent and data-dependent spectral components; each hidden layer has 16 neurons. Spectral normalization [DBLP:conf/iclr/MiyatoKKY18] is used in SN-MMD-GAN-DK, and spectral parameterization [DBLP:conf/nips/ArbelSBG18] is used in SN-SMMD-GAN-DK. We use the Adam optimizer [kingma2014adam] with a learning rate of 0.0001 for CIFAR-10 and CelebA, and 0.0002 for ImageNet. The ratio of the kernel parameters' learning rate to the generator's learning rate is set to 0.005. The batch size is 64 for all models. Models are trained for 150,000 generator update steps on CIFAR-10 and CelebA, and 200,000 generator update steps on ImageNet. For every generator step, we update the discriminator and kernel parameters for 5 steps. At every step, 1,024 spectral samples are used to compute the kernel values. We set the regularization weights in (11) and (12) to fixed values for all experiments.
For evaluation, the Fréchet Inception Distance (FID) [DBLP:journals/corr/HeuselRUNKH17], Inception Score (IS) [SalimansGZCRC:NIPS16], and Kernel Inception Distance (KID) [DBLP:journals/corr/abs180101401] are computed using 100,000 samples. Every 2,000 generator steps, we compare the model with the one from 10,000 steps earlier using the relative KID test [DBLP:journals/corr/BounliphoneBBAG15], and decay the learning rate by a factor of 0.8 if no improvement is found in 3 consecutive comparisons.
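KID is itself an MMD: it is the unbiased MMD$^2$ estimate between real and generated Inception features under the cubic polynomial kernel $k(x, y) = (x^\top y / d + 1)^3$ [DBLP:journals/corr/abs180101401]. A minimal sketch (feature extraction omitted; the inputs here are stand-in feature vectors, not real Inception activations):

```python
import numpy as np

def kid(feats_real, feats_fake):
    """Unbiased MMD^2 with the cubic polynomial kernel used by KID."""
    d = feats_real.shape[1]
    k = lambda A, B: (A @ B.T / d + 1) ** 3
    Kxx, Kyy, Kxy = (k(feats_real, feats_real), k(feats_fake, feats_fake),
                     k(feats_real, feats_fake))
    m, n = len(feats_real), len(feats_fake)
    # Drop the diagonal of within-set Gram matrices for the unbiased U-statistic.
    sum_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    sum_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return sum_xx + sum_yy - 2 * Kxy.mean()

rng = np.random.default_rng(0)
real = rng.normal(0, 1, (1000, 64))
close = rng.normal(0, 1, (1000, 64))    # same distribution -> KID near zero
far = rng.normal(0.5, 1, (1000, 64))    # shifted distribution -> KID positive
```

Unlike FID, this estimator is unbiased, which is what makes the relative KID test used above statistically meaningful on modest sample sizes.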
The main results are summarized in Tables 1, 2, and 3, with generated images shown in Figures 2, 4(a), and 4(b). More results can be found in Appendix E. Our proposed methods with MMD-based and SMMD-based objectives are denoted SN-MMD-GAN-DK and SN-SMMD-GAN-DK, respectively. For comparison, we also include models using kernels constructed without the data-dependent part, which is the same as the IKL method proposed by DBLP:journals/corr/abs190210214; these models are denoted SN-MMD-GAN-IKL and SN-SMMD-GAN-IKL. As we can see, our KernelNet-based models obtain better results than the other methods, including the non-MMD-based GANs. The results indicate the effectiveness of modeling data-dependent kernels with KernelNet. For a clearer illustration, Figure 3 plots the learning curves of the score evaluations for the different models during training on CIFAR-10. Example generated images are also shown in Figures 4(a) and 4(b).
[Table 1: IS and FID scores on CIFAR-10 for MMD-GAN, WGAN-GP, SobolevGAN, SN-GAN, SN-SW-GAN, SN-MMD-GAN, SN-MMD-GAN-IKL, SN-MMD-GAN-DK, SN-SMMD-GAN, SN-SMMD-GAN-IKL, and SN-SMMD-GAN-DK.]
[Table 2: IS and FID scores on CelebA for WGAN-GP, SobolevGAN, SN-GAN, SN-SW-GAN, SN-SMMD-GAN, and SN-SMMD-GAN-DK.]
[Table 3: IS and FID scores on ImageNet for BGAN, SN-GAN, SMMD-GAN, SN-SMMD-GAN, and SN-SMMD-GAN-DK.]
4.2 Implicit VAE
Multi-modal distribution sampling
We first illustrate that the implicit encoder can learn latent variables with multi-modal distributions. This is done by removing the decoder and training only the encoder, which essentially learns a parametric sampler. We use a 3-layer fully-connected neural network with 20 hidden units as the encoder, whose inputs are Gaussian noises. Figure 4 plots the learned distributions estimated by samples for two target distributions; the encoder generates multi-modal samples well.
Implicit VAE
Next, we test our proposed Implicit InfoVAE model on the MNIST dataset [DBLP:conf/icml/SalakhutdinovM08]. We use a fully-connected neural network with 1 hidden layer for both the encoder and the decoder, with 400 hidden units each. The kernel networks are parameterized by DNNs consisting of 2 fully-connected hidden layers with 32 hidden units. Bernoulli noises are injected into the encoder via dropout with a rate of 0.3. The latent dimension is 32. The models are trained for 300 epochs. Stochastic gradient descent (SGD) with a momentum of 0.9 is used with a batch size of 32. We sample 32 latent samples for every data point. The learning rate for the encoder and decoder is 0.002, while it is 0.001 for kernel learning. At every step, 512 spectral samples are drawn. For evaluation, we follow DBLP:journals/corr/WuBSG16 and use Annealed Importance Sampling (AIS) to approximate the negative log-likelihood (NLL). Ten independent AIS chains are used, each with 1,000 intermediate distributions. The final results are computed using 5,000 randomly sampled test data points. The results are shown in Table 4, where we compare with related models including: VAE (vanilla VAE); SteinVAE (amortized SVGD [FengWL:UAI17]); SIVI (Semi-Implicit VAE [DBLP:conf/icml/YinZ18]); Spectral (implicit VAE with the spectral method for gradient estimation [shi2018spectral]); and InfoVAE. We denote our Implicit InfoVAE with the Stein gradient estimator and objective (1) as InfoIVAE, and models with objective (13) as InfoIVAE-RBF (MMD with an RBF kernel), InfoIVAE-IK (MMD with an implicit kernel without the data-dependent component), and InfoIVAE-DK (MMD with the data-dependent KernelNet), respectively. Note that these models also reported NLL-related scores in their original papers, which are not directly comparable to ours; for a fair comparison, we use the same encoder-decoder structure and rerun all the models. Our model obtains the best NLL score among all the models. Reconstructed and generated samples are shown in Appendix E.
We also plot the t-SNE visualization of the latent variables learned by InfoIVAE-IK and InfoIVAE-DK in Figure 5. From the figure we can see that the latent variables learned using the data-dependent kernel look more separable than those from the implicit kernel without the data-dependent part. More discussion of the InfoVAE experiments can be found in Appendix E. An extra semi-supervised experiment is also presented in Appendix E, where we follow [DBLP:conf/nips/KingmaMRW14] to evaluate the quality of the learned latent variables and obtain better results.
5 Conclusion
We propose KernelNet, a novel way of parameterizing learnable data-dependent kernels using implicit spectral distributions parameterized by DNNs. We analyze how the proposed KernelNet can be applied to deep generative models, including state-of-the-art variants of MMD-GAN and InfoVAE. Experiments show that the proposed data-dependent KernelNet leads to performance improvements over related models, demonstrating the effectiveness of data-dependent kernels in practice.
References
Appendix A Details on Proposition 2
By Euler's formula, we have:
$$e^{i\omega^\top(x-y)} = \cos\left(\omega^\top(x-y)\right) + i\sin\left(\omega^\top(x-y)\right).$$
Since the kernel is real-valued, we remove the imaginary part and obtain:
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega)}\left[\cos\left(\omega^\top(x-y)\right)\right],$$
which is form (3). Now we show $\mathbb{E}_b\left[2\cos(\omega^\top x + b)\cos(\omega^\top y + b)\right] = \cos\left(\omega^\top(x-y)\right)$, where $b$ follows a uniform distribution $\mathcal{U}[0, 2\pi]$. By the product-to-sum identity,
$$2\cos(\omega^\top x + b)\cos(\omega^\top y + b) = \cos\left(\omega^\top(x-y)\right) + \cos\left(\omega^\top(x+y) + 2b\right),$$
and the second term vanishes in expectation since $b$ is uniform over a full period. This yields form (4), and the proposition is proved.
Appendix B Existence of data-dependent kernels
Consider a symmetric positive definite kernel defined on a manifold (e.g., the heat kernel), whose value depends on the geodesic distance between two points rather than the Euclidean distance.
For two pairs of points with equal Euclidean differences, the geodesic distances, and hence the kernel values, may differ. With an abuse of notation, we may write the kernel as a function of the concatenation of its two inputs. The corresponding positive definite function is then a mapping defined on the concatenated input space, rather than on the difference of the inputs alone.
According to Lemma 1, we can write the kernel as
where . If there exists such that
has to be data-dependent; otherwise we would obtain equal kernel values in the previous example, which contradicts the fact above.
Now we show why a suitable frequency vector always exists for any input pair. One can regard the requirement as a system of linear equations in which the frequency vector is unknown.
Regarding the corresponding coefficient matrix and augmented matrix: if their ranks are equal, then there must be one or infinitely many solutions, hence a solution exists; otherwise, simply setting the frequency to the concatenation of the two vectors ensures the required identity always holds for any input pair. Thus we can always write the kernel in this form.
Two additional notes: (1). Note that