Kernels are important tools in machine learning, with applications in a wide range of models, from one of the earliest, the support vector machine (SVM), to recent work on statistical testing with the Maximum Mean Discrepancy (MMD) and various deep generative models, including generative adversarial networks (GANs) and variational autoencoders (VAEs) [DBLP:conf/nips/LiCCYP17, DBLP:journals/corr/ZhaoSE17b]. These models are built either by restricting the solution space to a Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, or by adopting as the objective an MMD metric that requires a specified kernel. Other kernel-based methods, such as those proposed by DBLP:conf/icml/YinZ18 and FengWL:UAI17, rely on special kernels for estimating quantities such as gradients.
An undesirable issue of the aforementioned kernel-based methods, however, is the need to select appropriate hyperparameters, e.g., the bandwidth of an RBF kernel. These hyperparameters are critical for obtaining good performance, and manual selection often leads to sub-optimal solutions. Preliminary works have been proposed to mitigate this problem. For example, DBLP:journals/jmlr/GonenA11 proposed to learn a combination of predefined kernels. Alternatively, some recent works focus on learning kernels based on random features, or equivalently on the spectral distributions obtained by taking the Fourier transform of the kernel [RahimiR:NIPS07, DBLP:conf/eccv/BazavanLS12, DBLP:conf/icml/WilsonA13] (see Section 3.1 for a more detailed description).
As a principled approach, in this paper we propose a new kernel-learning paradigm: we define a more expressive yet easy-to-handle distribution of the corresponding kernel in the spectral domain, and parameterize it as a data-dependent implicit distribution with deep neural networks (DNNs). We call the resulting network KernelNet, and call the kernel corresponding to the data-dependent distribution a data-dependent kernel. Specifically, based on related kernel theory, we formulate a kernel in the spectral domain as an expectation of a function w.r.t. some random features, whose distribution is called the spectral distribution. The particular function and the spectral distribution then allow us to impose expressive DNN structures to parameterize the kernel. We propose a novel and expressive way of parameterizing the spectral distribution as a data-dependent implicit distribution. This is distinct from a recent work [DBLP:journals/corr/abs-1902-10214], which modeled the spectral distribution as a data-independent distribution and is thus less expressive; indeed, that method is a special case of ours. Moreover, the data-dependent kernel component in KernelNet can lead to performance improvement over data-independent parameterizations, as evidenced by our experiments.
The proposed KernelNet is flexible and can readily be applied to existing models. We apply it to Deep Generative Models (DGMs) based on two representative DGM frameworks, the Maximum Mean Discrepancy GAN (MMD-GAN) [DBLP:conf/nips/LiCCYP17] and the Info-VAE [DBLP:journals/corr/ZhaoSE17b]. Specifically, we apply our kernel parameterization to several state-of-the-art MMD-GAN variants, including the spectrally normalized MMD-GAN (SN-MMD-GAN) [DBLP:conf/nips/ArbelSBG18, DBLP:conf/iclr/MiyatoKKY18] and the scaled MMD-GAN (SMMD-GAN) [DBLP:conf/nips/ArbelSBG18], which allows us to learn the models and the KernelNet simultaneously. We show theoretically that our proposed models are continuous in the weak topology, an important property for ensuring the robustness of optimization procedures. We also propose an implicit VAE model, where an MMD regularizer is incorporated into the VAE objective, following the Info-VAE framework. Our model is implicit in the sense that the variational distribution is parameterized as an implicit distribution implemented by a conditional generator, in contrast to the typical Gaussian assumption in the standard VAE. This enables us to model a much more flexible latent space. Learning is done by adopting the Stein gradient estimator to optimize the implicit distribution [LiT:ICLR18]. Extensive experiments are performed to evaluate our proposed framework. The results suggest that our framework achieves significant improvement over existing ones.
We review MMD-GAN and Info-VAE, two DGMs where our proposed data-dependent kernels apply.
The Generative Adversarial Network (GAN) is one of the most popular and powerful generative models in deep learning. It consists of a generator and a discriminator. The generator generates samples by transforming a simple noise distribution into an implicit data distribution, i.e., one from which samples can easily be generated but whose density function is unknown. The discriminator is trained to distinguish between the true training-data distribution and the implicit distribution induced by the generator. The generator, on the other hand, is trained to fool the discriminator. At equilibrium, the generator generates samples distributed as the true data distribution.
MMD-GAN achieves this by minimizing the maximum mean discrepancy (MMD) between two probability measures, the data and model distributions.
The Maximum Mean Discrepancy (MMD) between two probability measures $\mathbb{P}$ and $\mathbb{Q}$ is defined as:
$$\text{MMD}(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathcal{H},\, \|f\|_{\mathcal{H}} \le 1} \left( \mathbb{E}_{x \sim \mathbb{P}}[f(x)] - \mathbb{E}_{y \sim \mathbb{Q}}[f(y)] \right),$$
where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) and $f$ is a function in this RKHS.
For an RKHS induced by a kernel $k$, the MMD can be computed using the following equation:
$$\text{MMD}^2(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{x, x' \sim \mathbb{P}}[k(x, x')] - 2\,\mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[k(x, y)] + \mathbb{E}_{y, y' \sim \mathbb{Q}}[k(y, y')].$$
For a characteristic kernel, $\text{MMD}(\mathbb{P}, \mathbb{Q}) = 0$ if and only if $\mathbb{P} = \mathbb{Q}$. Thus, MMD can be used to measure the similarity of distributions or as a training objective. DBLP:conf/nips/LiCCYP17 propose MMD-GAN, where the kernel is defined as a composition of an injective function $f_\phi$ for feature extraction and a kernel function $k_0$ for kernel evaluation, e.g., $k(x, y) = k_0(f_\phi(x), f_\phi(y))$. Note that this composition is also a valid kernel function; for example, if $k_0$ is the Gaussian kernel, then $k_0(f_\phi(\cdot), f_\phi(\cdot))$ is also a kernel.
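For concreteness, the kernel-based MMD estimator above can be sketched in a few lines of NumPy. This is an illustrative, self-contained implementation of the standard unbiased $\text{MMD}^2$ estimator with a plain Gaussian kernel (bandwidth fixed to 1 here), not the learned kernel proposed later:

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    # Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    # Unbiased estimator of MMD^2(P, Q) from samples X ~ P, Y ~ Q:
    # the diagonal terms of the within-sample Gram matrices are dropped.
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(3.0, 1.0, size=(500, 2)))
```

With both sample sets drawn from the same Gaussian the estimate is close to zero, while a mean shift yields a clearly positive value.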
In addition, a generator $g_\theta$ parameterized by $\theta$ is introduced for data generation, whose generated data are expected to match the training data in distribution by minimizing the corresponding MMD. Specifically, let $\mathbb{P}_r$ represent the training-data distribution and $\mathbb{P}_\theta$ the implicit output distribution induced by the generator. The objective of MMD-GAN is formulated as:
$$\min_{\theta} \max_{\phi}\; \text{MMD}^2(\mathbb{P}_r, \mathbb{P}_\theta).$$
In MMD-GAN, both the generator and the feature extractor are parameterized by neural networks. Because of the min-max adversarial training, $\mathbb{P}_\theta$ will eventually match $\mathbb{P}_r$ in theory. Several improvements have also been proposed; one particular variant we focus on is the Scaled Maximum Mean Discrepancy (SMMD).
The Scaled Maximum Mean Discrepancy is defined as:
Replacing the objective function of MMD-GAN by the SMMD leads to SMMD-GAN [DBLP:conf/nips/ArbelSBG18]. One useful and important technique in training MMD-GANs is spectral normalization [DBLP:conf/iclr/MiyatoKKY18], which controls the Lipschitz constant of the injective function and leads to stable training and better performance. Spectral normalization normalizes each weight matrix $W$ by its spectral norm $\sigma(W)$, i.e., $\bar{W} = W / \sigma(W)$. The Lipschitz constant of the corresponding linear map is thus bounded from above by 1. MMD-GAN with spectral normalization is called SN-MMD-GAN, whereas SMMD-GAN with spectral parameterization [DBLP:conf/nips/ArbelSBG18, DBLP:conf/iclr/MiyatoKKY18], defined as $W = \gamma \cdot \bar{W}$ with $\gamma$ a learnable parameter, is called SN-SMMD-GAN.
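Spectral normalization only requires the largest singular value of each weight matrix, which is typically estimated by power iteration. The following sketch (with `n_iters` far larger than the single iteration used per training step in practice) illustrates the idea:

```python
import numpy as np

def spectral_norm(W, n_iters=100):
    # Estimate the largest singular value of W by power iteration.
    # In practice (Miyato et al.), one iteration per training step suffices
    # because the weights change slowly; we run many iterations here.
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))
W_bar = W / spectral_norm(W)  # normalized weight: spectral norm close to 1
```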
The VAE and its variants are another family of DGMs, where latent spaces are defined through posterior distributions. Specifically, define a generative process for an observation $x$, starting from the corresponding latent variable $z$, as $z \sim p(z)$, $x \sim p_\theta(x \mid z)$, with $p(z)$ a simple prior and $\theta$ the parameter of the generative model. For efficient inference of $z$, the VAE [KingmaW:ICLR14, RezendeMW:ICML14] defines an inference network (or encoder) to generate $z$ from $x$, with the corresponding distribution denoted $q_\phi(z \mid x)$ (also called the variational distribution).
Info-VAE [DBLP:journals/corr/ZhaoSE17b] is a generalization of VAE by introducing an information-theoretic regularizer into the VAE framework. The objective is:
where $I(x; z)$ is the mutual information between $x$ and $z$, defined as:
$$I(x; z) = \mathbb{E}_{q_\phi(x, z)}\left[\log \frac{q_\phi(z \mid x)}{q_\phi(z)}\right].$$
Note that both the KL divergence and the MMD measure differences between distributions. Thus, in our approach, we propose to use an MMD regularizer in place of the mutual information; some discussion can be found in Appendix E. Furthermore, our model considers an implicit setting where the variational distribution is defined as an implicit distribution, making our model more expressive.
3 KernelNet for Learning Deep Generative Models
3.1 The Proposed KernelNet
To alleviate the difficulties with pre-defined kernels, such as hyperparameter selection, we propose KernelNet, a principled way to parameterize a kernel with a deep neural network (DNN). Our method improves on the recent work DBLP:journals/corr/abs-1902-10214 by making the kernel data-dependent and applying it to different DGMs, which is shown to boost model performance.
We start with a classic result on positive definite functions [Rudin:94], stated in Lemma 1.
Lemma 1 (Rudin:94)
A continuous function in $\mathbb{R}^d$ is positive definite if and only if it is the Fourier transform of a non-negative measure.
Let $\varphi_\omega(x) = e^{i\omega^\top x}$. By Lemma 1 and [RahimiR:NIPS07], a kernel $k$ such that $k(x, y) = k(x - y)$ can be represented as:
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega)}\left[\varphi_\omega(x)\,\varphi_\omega(y)^*\right], \qquad (2)$$
where $\omega \in \mathbb{R}^d$ is a random frequency and “*” denotes the conjugate transpose. The kernel representation (2) directly allows us to construct an unbiased estimator for $k(x, y)$ by introducing any valid distribution $p(\omega)$ for the augmented variable $\omega$, called the spectral distribution. This distribution will be parameterized by an implicit distribution with a DNN, described later. In the following, we first reformulate (2) into two equivalent forms for the purposes of analysis and algorithm design, respectively. Because the probability density function and the kernel function are real-valued, by Euler’s formula, we have:
Let $\omega$ be drawn from some spectral distribution $p(\omega)$, and $b$ be drawn uniformly from $[0, 2\pi]$. The real-valued kernel in (2) can be reformulated into the following two forms:
$$k(x, y) = \mathbb{E}_{\omega}\left[\cos\left(\omega^\top (x - y)\right)\right] = \mathbb{E}_{\omega, b}\left[2\cos(\omega^\top x + b)\cos(\omega^\top y + b)\right].$$
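The two expectations above are exactly the random Fourier feature construction of Rahimi and Recht; for the Gaussian kernel the spectral distribution is itself a standard Gaussian, which makes the identity easy to verify by Monte Carlo. A small NumPy check (sample sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 20000

# For the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2), the spectral
# distribution p(omega) is the standard Gaussian (cf. Lemma 1 / Bochner).
omega = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

x, y = rng.normal(size=d), rng.normal(size=d)

# Form 1: E_omega[cos(omega^T (x - y))]
k_form1 = np.cos(omega @ (x - y)).mean()
# Form 2: E_{omega, b}[2 cos(omega^T x + b) cos(omega^T y + b)]
k_form2 = (2.0 * np.cos(omega @ x + b) * np.cos(omega @ y + b)).mean()
k_exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
```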
To enhance the expressive power, we can make the spectral distribution sufficiently complex and learnable by parameterizing it with a DNN that induces an implicit distribution, as in generative adversarial networks (GANs) [GoodfellowAMXWOCB:NIPS14]. Specifically, we rewrite the spectral distribution as $p_\psi(\omega)$ with parameter $\psi$. A sample from $p_\psi(\omega)$ is modeled by the following generating process:
$$\omega = h_\psi(\nu), \qquad (5)$$
where $h_\psi(\nu)$ denotes the output of a DNN parameterized by $\psi$, with the input $\nu$ drawn from some simple distribution such as a standard Gaussian or uniform distribution.
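A minimal sketch of such an implicit spectral sampler follows; the two-layer tanh network and all sizes are illustrative assumptions, not the architecture used in the experiments:

```python
import numpy as np

class ImplicitSpectralSampler:
    """Sketch of the implicit spectral distribution p_psi(omega):
    a small MLP h_psi pushes simple Gaussian noise nu forward to
    frequency samples omega. Architecture and sizes are illustrative."""

    def __init__(self, noise_dim=8, out_dim=3, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        # Parameters psi of the sampler network h_psi.
        self.W1 = rng.normal(scale=0.5, size=(noise_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.5, size=(hidden, out_dim))
        self.b2 = np.zeros(out_dim)
        self.rng = rng

    def sample(self, n):
        nu = self.rng.normal(size=(n, self.W1.shape[0]))  # nu ~ N(0, I)
        h = np.tanh(nu @ self.W1 + self.b1)
        return h @ self.W2 + self.b2                      # omega = h_psi(nu)

sampler = ImplicitSpectralSampler()
omega = sampler.sample(1000)
```

In training, the parameters of this network would be updated jointly with the rest of the model by back-propagating through the sampled frequencies.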
Extension to data-dependent kernels
Although the above kernel parameterization is flexible enough to represent a rich family of implicit spectral distributions, it can be further extended by introducing a data-dependent spectral distribution. By this we mean that there are kernels satisfying (2) whose spectral distributions depend on the data pair $(x, y)$, i.e., there exists a conditional spectral distribution for each pair and a marginal distribution across the whole dataset.
We use the term data-dependent for two reasons: 1) the marginal distribution depends on the specific dataset, which could induce different forms on different datasets; 2) for a given input pair $(x, y)$, the conditional spectral distribution and the kernel value depend on the input pair itself, so equal differences $x - y$ do not necessarily imply equal kernel values. Proposition 3 shows that this setting is possible.
There exist some positive definite kernels which can be expressed as:
$$k(x, y) = \mathbb{E}_{\omega \sim p(\omega \mid x, y)}\left[\varphi_\omega(x)\,\varphi_\omega(y)^*\right].$$
The proof is provided in Appendix B. One example of such a kernel is a symmetric positive definite kernel defined on a manifold, whose value depends on the geodesic distance between two points rather than the Euclidean distance.
Some practical tricks can also be understood in terms of data-dependent kernels. For example, using the median-heuristic bandwidth can be seen as dividing $x$ and $y$ by some data-dependent quantity, denoted $c$. That is, if one defines $\tilde{x} = x / c$ and $\tilde{y} = y / c$, and substitutes them into (2), one ends up with:
$$k(\tilde{x}, \tilde{y}) = \mathbb{E}_{\omega \sim p(\omega)}\left[\varphi_\omega(x / c)\,\varphi_\omega(y / c)^*\right] = \mathbb{E}_{\tilde{\omega} \sim \tilde{p}(\tilde{\omega})}\left[\varphi_{\tilde{\omega}}(x)\,\varphi_{\tilde{\omega}}(y)^*\right], \qquad (6)$$
where $\tilde{\omega} = \omega / c$ and $\tilde{p}$ is the correspondingly rescaled spectral distribution. The kernel definition in (6) thus behaves like a kernel with a data-dependent spectral distribution $\tilde{p}(\tilde{\omega})$.
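As a concrete instance of such a data-dependent quantity, the median heuristic can be computed as follows (a standard trick, sketched here in NumPy):

```python
import numpy as np

def median_heuristic_bandwidth(X):
    # Median heuristic: set the bandwidth to the median pairwise distance,
    # a data-dependent quantity c computed from the sample itself.
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return float(np.median(dists[np.triu_indices(len(X), k=1)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
c = median_heuristic_bandwidth(X)
# Rescaling inputs by c is equivalent to rescaling the frequencies:
# cos(omega^T (x - y) / c) = cos((omega / c)^T (x - y)).
```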
Constructing a data-dependent kernel network
To construct a data-dependent kernel network, we extend (5) to a generative process in which the DNN also takes the data pair $(x, y)$ as input. Note that such an implicit construction requires multiple noise samples to approximate the distribution of $\omega$ for each pair, which can be time-consuming when minibatch sizes are large. To tackle this issue, we propose to decompose the kernel into an explicit data-independent component and an implicit data-dependent component. Such a decomposition still guarantees that the overall kernel is implicit, and thus does not lose generality. Specifically, we use the reparameterization trick to define a data-dependent sampling process, i.e., a sample from the data-dependent spectral distribution, which depends on the pair $(x, y)$:
$$\tilde{\omega} = \mu(x, y) + \sigma(x, y) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (7)$$
where $\mu(x, y)$ and $\sigma(x, y)$ are outputs of a DNN, and $\odot$ denotes element-wise multiplication. Because this may lead to an asymmetric kernel, we can further construct a symmetric kernel by a simple symmetrization. For the data-independent component, we adopt the implicit representation defined in (5). Consequently, the overall KernelNet is constructed as the product of the two components:
$$k_{\text{DK}}(x, y) = k_\psi(x, y)\, \tilde{k}(x, y), \qquad (8)$$
where $k_\psi$ denotes the data-independent implicit component and $\tilde{k}$ the data-dependent component.
The network structure is illustrated in Figure 1. In implementation, the expectations are approximated by samples, e.g.:
where the $\omega$’s are samples from the spectral distributions through (5), and the $b$’s are drawn from $\text{Uniform}[0, 2\pi]$. Since the construction is implicit, with no stochastic intermediate nodes, standard back-propagation can be applied for efficient end-to-end training. Lemma 4 below guarantees that the output of the KernelNet (8) is still a legitimate kernel.
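Putting the pieces together, a Monte Carlo evaluation of the decomposed kernel might look as follows. The helper `dd_params` is a hypothetical stand-in for the learned DNN producing $\mu(x, y)$ and $\sigma(x, y)$, and the data-independent frequencies are drawn from a fixed Gaussian rather than a learned implicit sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5000

def dd_params(x, y):
    # Hypothetical stand-in for the DNN mapping a pair (x, y) to the mean
    # and scale of the data-dependent spectral distribution; a real
    # KernelNet learns these. Symmetric in (x, y) by construction.
    mu = 0.1 * (x + y)
    sigma = np.full(d, 0.5)
    return mu, sigma

def kernelnet_estimate(x, y, omega_ind):
    # Data-independent component: Monte Carlo average over implicit samples.
    b = rng.uniform(0.0, 2.0 * np.pi, size=len(omega_ind))
    k_ind = (2.0 * np.cos(omega_ind @ x + b) * np.cos(omega_ind @ y + b)).mean()
    # Data-dependent component: reparameterized frequencies
    # omega = mu(x, y) + sigma(x, y) * eps, with eps ~ N(0, I).
    mu, sigma = dd_params(x, y)
    eps = rng.normal(size=(len(omega_ind), d))
    omega_dep = mu + sigma * eps
    k_dep = np.cos(omega_dep @ (x - y)).mean()
    # The product of two kernels is a kernel (Lemma 4).
    return k_ind * k_dep

omega_ind = rng.normal(size=(m, d))  # stands in for samples from (5)
x, y = rng.normal(size=d), rng.normal(size=d)
k_xy = kernelnet_estimate(x, y, omega_ind)
k_xx = kernelnet_estimate(x, x, omega_ind)
```

Since both components evaluate to 1 in expectation at $x = y$, the self-similarity `k_xx` is close to 1.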
Lemma 4 (DBLP:books/daglib/0026002)
Let $k_1$ and $k_2$ be two kernels over $\mathcal{X} \times \mathcal{X}$; then $k_1 \cdot k_2$ is also a kernel.
It is worth noting that if we remove the data-dependent component, the kernel reduces to the one in DBLP:journals/corr/abs-1902-10214. We will show in the experiments that the data-dependent component actually plays an important role, and can lead to performance improvement.
3.2 KernelNet for MMD-GAN
In this section, we incorporate the proposed KernelNet into learning the MMD-GAN model. We seek to develop an algorithm that jointly optimizes both the KernelNet and the MMD-GAN model. A straightforward way is to replace the standard kernel in MMD-GAN with the proposed data-dependent kernel (8). However, as the standard MMD-GAN fails to be continuous in the weak topology, a property that ensures the metric varies continuously under weak convergence, it is unclear whether the variant with our KernelNet would satisfy this property. To this end, we first define continuity in the weak topology.
A metric $d(\cdot, \cdot)$ over probability measures is said to be continuous in the weak topology if $\mathbb{P}_n \xrightarrow{D} \mathbb{P}$ implies $d(\mathbb{P}_n, \mathbb{P}) \rightarrow 0$, where $\xrightarrow{D}$ denotes convergence in distribution.
Continuity in the weak topology is important for MMD-GAN because, as discovered by DBLP:conf/nips/ArbelSBG18, it makes the loss provide a better signal to the generator as $\mathbb{P}_\theta$ approaches $\mathbb{P}_r$. DBLP:conf/nips/ArbelSBG18 also find that the optimized MMD distance in MMD-GAN is not continuous in the weak topology, leading to training instability and poor performance. To alleviate this problem, a number of methods have been introduced, e.g., weight clipping [DBLP:conf/nips/LiCCYP17], spectral normalization [DBLP:conf/iclr/MiyatoKKY18], gradient penalties [GulrajaniAADC:NIPS17], and a scaled objective (SMMD-GAN) [DBLP:conf/nips/ArbelSBG18]. Fortunately, building on the recent work DBLP:conf/nips/ArbelSBG18, we can prove that adopting our KernelNet in MMD-GAN leads to continuity in the weak topology.
By parameterizing the kernel with our KernelNet , is continuous in the weak topology if the following are satisfied:
where denotes the Frobenius norm of a matrix, is the injective function in MMD-GAN, i.e., , and denotes its Lipschitz constant.
The proof can be found in Appendix C. Based on Proposition 5, we propose two variants of the MMD-GAN model, respectively corresponding to the SN-MMD-GAN [DBLP:conf/nips/ArbelSBG18, DBLP:conf/iclr/MiyatoKKY18] and the SMMD-GAN [DBLP:conf/nips/ArbelSBG18], by incorporating the conditions in Proposition 5 into the objective functions.
MMD-GAN with the KernelNet
By adopting spectral normalization and the method of Lagrange multipliers to enforce the conditions in Proposition 5, we propose SN-MMD-GAN-DK. Note that the Lipschitz condition is satisfied because of the spectral normalization operation, which normalizes the weight matrices during training. The objective is defined as:
Scaled MMD-GAN with the KernelNet
where the remaining notation has the same meaning as in (11). Note that the SMMD already satisfies continuity in the weak topology; nevertheless, we found that adding a regularizer like (12) to SMMD-DK leads to better results.
3.3 KernelNet for Implicit Info-VAE
In this section, we describe how to incorporate our KernelNet into the Info-VAE framework. First, to increase the power of Info-VAE, we make the encoder implicit: instead of adopting a particular distribution family such as the Gaussian for the encoder, we add random noise at each layer of the encoder (including the input data) and transform these simple noise distributions into a complex implicit distribution via the encoder network, as done in GANs. One problem with this formulation is the need to evaluate the density of the implicit encoder distribution for model training, as seen in the original ELBO (1). To deal with this issue, we adopt the Stein gradient estimator (SGE) [LiT:ICLR18] to approximate the gradient of the log-density used in training.
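The Stein gradient estimator admits a compact matrix form, $\hat{G} = -(K + \eta I)^{-1}\langle \nabla, K\rangle$. The sketch below implements it with an RBF kernel and a median-heuristic bandwidth; the ridge coefficient `eta` and the sample size are illustrative choices, not those of the original paper:

```python
import numpy as np

def stein_gradient_estimator(X, eta=0.1):
    # Stein gradient estimator (Li & Turner): approximate the score
    # grad_x log q(x) at the samples X ~ q, via G_hat = -(K + eta I)^{-1} B,
    # where B_{j,:} = sum_i grad_{x_i} k(x_i, x_j).
    n, d = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    h2 = np.median(sq[np.triu_indices(n, k=1)])     # median heuristic (squared)
    K = np.exp(-sq / h2)
    # For the RBF kernel: grad_{x_i} k(x_i, x_j) = -(2 / h2) (x_i - x_j) K_ij.
    B = -(2.0 / h2) * (K[:, :, None] * (X[:, None, :] - X[None, :, :])).sum(0)
    return -np.linalg.solve(K + eta * np.eye(n), B)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))            # q = N(0, 1), whose true score is -x
score = stein_gradient_estimator(X)
```

On this toy Gaussian the estimated scores track the true score $-x$ closely, which is what makes the estimator usable inside the implicit-encoder training loop.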
To introduce our KernelNet into the framework, we note that the mutual information in (1) is essentially a divergence between two distributions. Consequently, we propose to simply replace the mutual information with the MMD parameterized by our KernelNet. The final objective is thus:
which can be reformulated as:
Note that this term is independent of the model and thus can be discarded. With appropriate choices of the hyperparameters, (3.3) reduces to the vanilla VAE.
We conduct experiments to test the performance of our proposed KernelNet applied to MMD-GAN and implicit VAE, and compare them with related methods, including MMD and non-MMD based GANs, parametric and non-parametric implicit VAE models. Our code will be available online.
We evaluated our MMD-GAN variants on the CIFAR-10, CelebA [liu2015faceattributes], and ImageNet [DBLP:journals/ijcv/RussakovskyDSKS15] datasets. Following DBLP:conf/nips/ArbelSBG18, the output dimension of the discriminator is set to 1, and inputs of the generator are sampled from a uniform distribution. For a fair comparison, all models are compared under the same architecture in each experiment. Our model architecture follows DBLP:conf/nips/ArbelSBG18. For CIFAR-10, we use an architecture with a 7-layer discriminator and a 4-layer generator. For CelebA, we use a 5-layer discriminator and a 10-layer ResNet generator. For ImageNet, the generator and discriminator are both 10-layer ResNets. We use two 3-layer fully-connected neural networks to parameterize the data-independent and data-dependent spectral components. For each of these networks, every hidden layer has 16 neurons. Spectral normalization [DBLP:conf/iclr/MiyatoKKY18] is used in SN-MMD-GAN-DK, and spectral parameterization [DBLP:conf/nips/ArbelSBG18] is used in SN-SMMD-GAN-DK.
We use the Adam optimizer [kingma2014adam] with a learning rate of 0.0001 for CIFAR-10 and CelebA, and 0.0002 for ImageNet. The ratio of the learning rate of the kernel parameters to that of the generator is set to 0.005. The batch size is 64 for all models. Models are trained for 150,000 generator update steps on CIFAR-10 and CelebA, and 200,000 on ImageNet. For every generator step, we update the discriminator and kernel parameters for 5 steps. At every step, 1,024 samples of the spectral variables are used to compute the kernel function values. The regularization coefficients in (11) and (12) are fixed across all experiments.
For evaluation, the Fréchet Inception Distance (FID) [DBLP:journals/corr/HeuselRUNKH17], Inception Score (IS) [SalimansGZCRC:NIPS16], and Kernel Inception Distance (KID) [DBLP:journals/corr/abs-1801-01401] are computed using 100,000 samples. Every 2,000 generator steps, we compare the model with the one from 10,000 steps earlier using the relative KID test [DBLP:journals/corr/BounliphoneBBAG15], and decay the learning rate by a factor of 0.8 if no improvement is found three consecutive times.
The main results are summarized in Table 1, Table 2, and Table 3, with generated images shown in Figure 2, Figure 4(a), and Figure 4(b). More results can be found in Appendix E. Our proposed methods with MMD-based and SMMD-based objectives are denoted SN-MMD-GAN-DK and SN-SMMD-GAN-DK, respectively. For comparison, we also include models using kernels constructed without the data-dependent part, which coincide with the IKL method proposed by DBLP:journals/corr/abs-1902-10214; these are denoted SN-MMD-GAN-IKL and SN-SMMD-GAN-IKL. As we can see, our KernelNet-based models obtain better results than the other methods, including the non-MMD-based GANs. The results indicate the effectiveness of modeling data-dependent kernels with KernelNet. For a clearer illustration, Figure 3 plots the learning curves of the score evaluations for different models over the whole training process on CIFAR-10. Example generated images are also illustrated in Figures 4(a) and 4(b).
4.2 Implicit VAE
Multi-modal distribution sampling
We first illustrate that the implicit encoder can learn latent variables with multi-modal distributions. This is done by removing the decoder and training only the encoder, which essentially learns a parametric sampler. We use a 3-layer fully-connected neural network with 20 hidden units as the encoder, whose inputs are Gaussian noise. Figure 4 plots the learned distributions, estimated from samples, on two target distributions; the encoder faithfully generates multi-modal samples.
Next, we test our proposed Implicit Info-VAE model on the MNIST dataset [DBLP:conf/icml/SalakhutdinovM08]. We use a fully-connected neural network with one hidden layer of 400 units for both the encoder and the decoder. The spectral networks are parameterized by DNNs with 2 fully-connected hidden layers of 32 units each. Bernoulli noise is injected into the encoder via dropout with a rate of 0.3. The latent dimension is 32. The models are trained for 300 epochs using stochastic gradient descent (SGD) with a momentum of 0.9 and a batch size of 32. We draw 32 latent samples for every observation. The learning rate is 0.002 for the encoder and decoder, and 0.001 for kernel learning. At every step, 512 samples of the spectral variables are drawn. For evaluation, we follow DBLP:journals/corr/WuBSG16 and use Annealed Importance Sampling (AIS) to approximate the negative log-likelihood (NLL). Ten independent AIS chains are used, each with 1,000 intermediate distributions. The final results are computed using 5,000 randomly sampled test data.
The results are shown in Table 4, where we compare with related models including: VAE: vanilla VAE; Stein-VAE: amortized SVGD [FengWL:UAI17]; SIVI: semi-implicit VAE [DBLP:conf/icml/YinZ18]; Spectral: implicit VAE with the spectral method for gradient estimation [shi2018spectral]; and Info-VAE. We denote our Implicit Info-VAE with the Stein gradient estimator and objective (1) as Info-IVAE, and the models with objective (3.3) as Info-IVAE-RBF (MMD with an RBF kernel), Info-IVAE-IK (MMD with an implicit kernel without the data-dependent component), and Info-IVAE-DK (MMD with the data-dependent KernelNet), respectively. Note that these models have also reported NLL-related scores in their original papers, which are not directly comparable to ours. For fair comparison, we use the same encoder-decoder structure and rerun all the models. Our model obtains the best NLL score among all the models. Reconstructed and generated samples are shown in Appendix E.
We also plot the t-SNE visualization of the latent variables learned by Info-IVAE-IK and Info-IVAE-DK in Figure 5. From the figure we can see that the latent variables learned using the data-dependent kernel look more separable than those learned with the implicit kernel without the data-dependent part. More discussion of the Info-VAE experiments can be found in Appendix E. An extra semi-supervised experiment is also presented in Appendix E, where we follow [DBLP:conf/nips/KingmaMRW14] to evaluate the quality of the learned latent variables and obtain better results.
We propose KernelNet, a novel way of parameterizing learnable data-dependent kernels using implicit spectral distributions parameterized by DNNs. We analyze how the proposed KernelNet can be applied to deep generative models, including the state-of-the-art variants of MMD-GAN and InfoVAE. Experiments show that the proposed data-dependent KernelNet leads to performance improvement over related models, demonstrating the effectiveness of data-dependent kernels in practice.
Appendix A Details on Proposition 2
By Euler’s formula, we have:
$$\mathbb{E}_{\omega}\left[e^{i\omega^\top (x - y)}\right] = \mathbb{E}_{\omega}\left[\cos\left(\omega^\top (x - y)\right)\right] + i\,\mathbb{E}_{\omega}\left[\sin\left(\omega^\top (x - y)\right)\right].$$
For a real-valued kernel, we remove the imaginary part, obtaining:
$$k(x, y) = \mathbb{E}_{\omega}\left[\cos\left(\omega^\top (x - y)\right)\right].$$
Now we show that $\mathbb{E}_{b}\left[2\cos(\omega^\top x + b)\cos(\omega^\top y + b)\right] = \cos\left(\omega^\top (x - y)\right)$, where $b$ follows a uniform distribution on $[0, 2\pi]$. By the product-to-sum identity,
$$2\cos(\omega^\top x + b)\cos(\omega^\top y + b) = \cos\left(\omega^\top (x - y)\right) + \cos\left(\omega^\top (x + y) + 2b\right),$$
and the second term vanishes in expectation, since $\mathbb{E}_{b}\left[\cos\left(\omega^\top (x + y) + 2b\right)\right] = 0$ over a full period. The proposition is thus proved.
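The vanishing of the cross term can also be checked numerically by averaging over an evenly spaced grid covering one full period of $b$ (the values of $a = \omega^\top x$ and $c = \omega^\top y$ below are arbitrary):

```python
import numpy as np

# Numerical check of the identity used above:
# E_{b ~ U[0, 2pi]}[2 cos(a + b) cos(c + b)] = cos(a - c),
# since 2 cos(A) cos(B) = cos(A - B) + cos(A + B) and the
# cos(a + c + 2b) term averages to zero over a full period.
a, c = 1.3, -0.4
b = np.linspace(0.0, 2.0 * np.pi, 200001)[:-1]  # uniform grid over one period
lhs = (2.0 * np.cos(a + b) * np.cos(c + b)).mean()
rhs = np.cos(a - c)
```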
Appendix B Existence of data-dependent kernel
Consider a symmetric positive definite kernel defined on a manifold (e.g., the heat kernel), whose value depends on the geodesic distance between two points rather than the Euclidean distance.
For two pairs $(x, y)$ and $(x', y')$ with $x - y = x' - y'$, the geodesic distances may differ, so we may have $k(x, y) \neq k(x', y')$. With an abuse of notation, we may write the kernel as $k(z)$, where we use $z$ to denote the concatenation of $x$ and $y$. The corresponding positive definite function is then a mapping from $\mathbb{R}^{2d}$ to $\mathbb{R}$, rather than from $\mathbb{R}^d$ to $\mathbb{R}$.
According to Lemma 1, we can write the kernel as
where . If there exists such that
has to be data-dependent, otherwise we would have equality in the previous example, which contradicts the fact.
Now we show why there always exists such that for any . One can regard this as a system of linear equations , where is unknown.
Denote ; we can regard as the coefficient matrix and as the augmented matrix. Because . If , then there must be one or infinitely many solutions, hence a solution exists; if , we can conclude that . Simply setting , i.e., the concatenation of the two vectors, then always holds for any . Thus we can always write .
Two additional notes: (1). Note that