1 Introduction
In Gaussian process (GP) regression, two types of inference are commonly considered: exact inference and variational inference. Exact inference (Wang et al., 2019) directly optimizes the marginal likelihood of the data, while variational inference based on the Kullback-Leibler (KL) divergence (Burt et al., 2019) optimizes a tractable lower bound of the marginal likelihood in which the exact covariance matrix is replaced by a Nyström approximation. This lower bound significantly reduces the computational burden and mitigates overfitting, as it enforces regularization on the likelihood. In this article we introduce an alternative closed-form lower bound on the likelihood based on the Rényi $\alpha$-divergence (Rényi, 1961).
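For concreteness, we recall the standard forms of these two objectives in the usual notation of the sparse GP literature (used here only for illustration): $\mathbf{K}_{nn}$ is the exact kernel matrix, $\mathbf{Q}_{nn}=\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$ its Nyström approximation built from $m$ inducing points, and $\sigma^2$ the noise variance:
\[
\log p(\mathbf{y}) \;=\; \log \mathcal{N}\!\left(\mathbf{y}\mid \mathbf{0},\, \mathbf{K}_{nn}+\sigma^2\mathbf{I}\right),
\qquad
\log p(\mathbf{y}) \;\ge\; \log \mathcal{N}\!\left(\mathbf{y}\mid \mathbf{0},\, \mathbf{Q}_{nn}+\sigma^2\mathbf{I}\right) \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\!\left(\mathbf{K}_{nn}-\mathbf{Q}_{nn}\right).
\]
The left expression is the exact log marginal likelihood; the right is the collapsed KL-based sparse variational lower bound analyzed by Burt et al. (2019).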
Main Results. The new lower bound admits a closed-form expression, which is derived in Sec. 2.2.
This new lower bound can be viewed as a convex combination of the sparse GP objective and the exact GP objective. The key advantage of this bound is its capability to control and tune the regularization enforced on the model; it is thus a generalization of traditional sparse variational GP regression. Please refer to Sec. 2 for more information.
From the theoretical perspective, we show that, with high probability, the Rényi $\alpha$-divergence between the variational distribution and the true posterior converges to 0 as the number of data points increases. The order $\alpha$ plays an important role in controlling the rate of convergence. The role of $\alpha$ and the rates of convergence for different kernels are given in Sec. 3 and Sec. 4.
2 The Rényi Gaussian Processes and Variational Inference
Traditional variational inference seeks to minimize the Kullback-Leibler (KL) divergence between the variational density and the intractable posterior over the latent function values, given the model parameters and the observed dataset. This minimization problem in turn yields a tractable evidence lower bound (ELBO) of the marginal log-likelihood of the data. The Rényi $\alpha$-divergence is a more general distance measure than the KL divergence. In this work, we explore Rényi-divergence-based GP inference.
2.1 Rényi Divergence
The Rényi $\alpha$-divergence between two distributions $p$ and $q$ on a random variable $x$ is defined as
\[
D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha-1}\,\log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx, \qquad \alpha > 0,\ \alpha \neq 1.
\]
This divergence contains a rich family of distance measures, including the KL divergence as a limiting case. Moreover, the domain of $\alpha$ can be extended by continuity to the limiting values $\alpha \in \{0, 1, +\infty\}$.
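Since GP posteriors and their variational approximations are Gaussian, it is useful to recall the well-known closed form of the Rényi divergence between two Gaussians (stated here in generic notation, for illustration only):
\[
D_{\alpha}\!\big(\mathcal{N}(\boldsymbol{\mu}_p,\boldsymbol{\Sigma}_p)\,\|\,\mathcal{N}(\boldsymbol{\mu}_q,\boldsymbol{\Sigma}_q)\big)
= \frac{\alpha}{2}\,(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_{\alpha}^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)
\;-\; \frac{1}{2(\alpha-1)}\,\log\frac{\det \boldsymbol{\Sigma}_{\alpha}}{(\det\boldsymbol{\Sigma}_p)^{1-\alpha}\,(\det\boldsymbol{\Sigma}_q)^{\alpha}},
\]
where $\boldsymbol{\Sigma}_{\alpha}=\alpha\boldsymbol{\Sigma}_q+(1-\alpha)\boldsymbol{\Sigma}_p$ is assumed positive definite. Closed forms of this type are what make the divergence tractable between Gaussian posteriors.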
Claim 1.
$\lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)$.
Therefore, the KL divergence is a special case of the $\alpha$-divergence. It is well known that the KL divergence yields the popular ELBO, so it is natural to derive a similar bound using the $\alpha$-divergence. Starting from the log marginal likelihood of the data, we arrive at the variational Rényi (VR) bound (Li and Turner, 2016). This bound is defined in terms of the GP latent function values, the pseudo-inputs (inducing points), and the variational distribution over them.
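For reference, the generic VR bound of Li and Turner (2016), written for a latent variable $\boldsymbol{\theta}$ and data $\mathcal{D}$ (the symbols are generic placeholders rather than this paper's notation), reads
\[
\mathcal{L}_{\alpha}(q;\mathcal{D}) \;=\; \frac{1}{1-\alpha}\,\log\, \mathbb{E}_{q(\boldsymbol{\theta})}\!\left[\left(\frac{p(\boldsymbol{\theta},\mathcal{D})}{q(\boldsymbol{\theta})}\right)^{1-\alpha}\right].
\]
In the GP setting, $\boldsymbol{\theta}$ is replaced by the latent function values together with the inducing variables, and $q$ by the sparse variational family.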
Claim 2.
.
Denoting the limit of the VR bound as $\alpha \to 1$ by the ELBO, we have the following claim.
Claim 3.
.
The proof is given in the appendix.
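For context, we recall the standard properties of the VR bound established by Li and Turner (2016), which Claims 2 and 3 build on (stated in the generic notation above):
\[
\mathcal{L}_{0}(q;\mathcal{D}) = \log p(\mathcal{D}) \ \ \text{(when the support of $q$ covers that of the posterior)},
\qquad
\lim_{\alpha\to 1}\mathcal{L}_{\alpha}(q;\mathcal{D}) = \mathrm{ELBO}(q;\mathcal{D}),
\]
and $\mathcal{L}_{\alpha}$ is non-increasing in $\alpha$, so that $\mathrm{ELBO} \le \mathcal{L}_{\alpha} \le \log p(\mathcal{D})$ for $\alpha \in [0,1]$.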
2.2 The Variational Rényi Lower Bound
Our bound is stated in terms of the exact covariance matrix and its Nyström approximation. It can be seen that the new lower bound is a convex combination, governed by $\alpha$, of components from the sparse GP and components from the exact GP. We can also see that $\alpha$ plays an important role in model regularization.
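Although the closed form of the bound is not reproduced here, the convex-combination reading can be motivated by the following elementary observation (an illustration of the interpretation, not the derivation of the bound itself). Writing $\mathcal{L}_{\mathrm{sparse}}$ for the sparse ELBO and $\lambda \in [0,1]$ for a generic weight,
\[
\mathcal{L}_{\mathrm{sparse}} \;\le\; \log p(\mathbf{y})
\quad\Longrightarrow\quad
\lambda\,\mathcal{L}_{\mathrm{sparse}} + (1-\lambda)\,\log p(\mathbf{y}) \;\le\; \log p(\mathbf{y}),
\]
so any convex combination of a lower bound with the exact log marginal likelihood is again a lower bound. In the bound of this section, the interpolation weight is governed by the divergence order $\alpha$.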
We also derive a data-dependent upper bound similar to Titsias (2014). See appendix for details.
3 Convergence Analysis
In this section, we derive convergence results that build on the recent works of Titsias (2014), Huggins et al. (2018), and Burt et al. (2019), and we provide extensions of those results. Due to space limits, all proofs are deferred to the appendix.
Theorem 4.
Suppose the data points are drawn i.i.d. from the input distribution. Sample the inducing points from the training data, with the probability assigned to any set of a given size equal to the probability assigned to the corresponding subset by a k-Determinantal Point Process (k-DPP) (Belabbas and Wolfe, 2009). If the response vector is distributed according to a sample from the prior generative model, then, with high probability, the Rényi divergence between the variational distribution and the exact posterior is bounded in terms of the eigenvalues of the integral operator associated with the kernel and the input density.
As $\alpha \to 1$, we obtain the corresponding bound for the KL divergence.
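The integral operator referred to in these results is the standard one from the kernel eigenvalue literature; in generic notation (introduced here for illustration), for a kernel $k$ and input density $p(\mathbf{x})$ it is
\[
(\mathcal{T}_k f)(\mathbf{x}) \;=\; \int k(\mathbf{x},\mathbf{x}')\, f(\mathbf{x}')\, p(\mathbf{x}')\, d\mathbf{x}',
\]
with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. The expected eigenvalues of the kernel matrix built from $n$ i.i.d. inputs scale like $n\lambda_m$, which is how the operator spectrum enters bounds of this type.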
Theorem 5.
Suppose the data points are drawn i.i.d. from the input distribution. Sample the inducing points from the training data, with the probability assigned to any set of a given size equal to the probability assigned to the corresponding subset by a k-Determinantal Point Process (k-DPP) (Belabbas and Wolfe, 2009). Then, with high probability, the Rényi divergence is bounded in terms of the eigenvalues of the integral operator associated with the kernel, together with additional data-dependent quantities.
As $\alpha \to 1$, we reach the corresponding bound for the KL divergence.
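A useful fact for interpreting the role of $\alpha$ in these bounds is that the Rényi divergence is non-decreasing in its order, so a guarantee at a larger $\alpha$ is at least as strong as one at a smaller $\alpha$:
\[
0 < \alpha \le \alpha' \;\Longrightarrow\; D_{\alpha}(p \,\|\, q) \;\le\; D_{\alpha'}(p \,\|\, q).
\]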
4 Consequences
4.1 Smooth Kernel
We now provide a convergence result for the squared exponential (SE) kernel. For the SE kernel with a Gaussian input distribution, the eigenvalues of the associated integral operator decay geometrically; they are determined by the lengthscale, the signal variance, and the scale of the input distribution, while the likelihood involves an additional noise parameter. Using this spectrum, the required eigenvalue tail sums can be obtained as in Burt et al. (2019).
Corollary 6.
Suppose the noise variance is bounded below by a positive constant. Fix the failure probability and let the number of inducing points grow polylogarithmically with the number of data points. Assume the input data are normally distributed and regression is performed with an SE kernel. Then, with high probability, the stated bound on the Rényi divergence holds when inference is performed with this many inducing points.
As the number of data points grows, the bound tends to zero. As $\alpha \to 1$, we obtain the corresponding KL-divergence guarantee.
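To see why so few inducing points suffice here, note that if the operator eigenvalues decay geometrically, say $\lambda_k \le c\,B^{k}$ for some $0 < B < 1$ (the standard behavior for the SE kernel with Gaussian inputs; the constants $c$ and $B$ are generic placeholders), then the tail sum that drives the bounds of Sec. 3 satisfies
\[
\sum_{k \ge M} \lambda_k \;\le\; \frac{c\,B^{M}}{1-B},
\]
so choosing $M \propto \log n$ with a large enough constant makes $n\sum_{k\ge M}\lambda_k$ decay as an inverse polynomial in $n$.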
4.2 Non-smooth Kernel
For the Matérn kernel, the eigenvalues of the integral operator decay only polynomially in the index. We can bound the resulting eigenvalue sums via the following claim.
Claim 7.
.
Proof.
It is easy to see that the eigenvalue sum can be written in terms of the Riemann zeta function $\zeta(s)$. By the Euler-Maclaurin sum formula, the generalized harmonic number $H_{n,s} = \sum_{k=1}^{n} k^{-s}$ satisfies (Woon, 1998)
\[
H_{n,s} \;=\; \zeta(s) \;-\; \frac{n^{1-s}}{s-1} \;+\; \frac{n^{-s}}{2} \;+\; \mathcal{O}\!\left(n^{-s-1}\right), \qquad s > 1.
\]
Therefore, the tail $\sum_{k > n} k^{-s} = \zeta(s) - H_{n,s}$ is of order $n^{1-s}$ (see also the numerical check after this proof).
∎
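As a quick numerical sanity check of the expansion above, the following short Python snippet (an illustration with our own naming, not code from this paper) compares the generalized harmonic number with its Euler-Maclaurin approximation:

import numpy as np
from scipy.special import zeta

def harmonic(n, s):
    """Generalized harmonic number H_{n,s} = sum_{k=1}^{n} k^(-s)."""
    k = np.arange(1, n + 1, dtype=float)
    return float(np.sum(k ** (-s)))

def euler_maclaurin(n, s):
    """Leading terms of the Euler-Maclaurin expansion of H_{n,s}, valid for s > 1."""
    return float(zeta(s) - n ** (1.0 - s) / (s - 1.0) + 0.5 * n ** (-s))

# Compare the exact partial sum with the expansion for a few polynomial
# decay rates s (Matern-type spectra) and sample sizes n.
for s in (1.5, 2.0, 3.0):
    for n in (10, 100, 1000):
        exact = harmonic(n, s)
        approx = euler_maclaurin(n, s)
        print(f"s={s:3.1f}  n={n:5d}  H={exact:.8f}  EM approx={approx:.8f}  "
              f"abs err={abs(exact - approx):.1e}")

The absolute error shrinks rapidly with $n$, consistent with the $\mathcal{O}(n^{-s-1})$ remainder.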
Let the number of inducing points grow suitably with the number of data points. Then, by Theorem 19, we obtain a bound expressed through the corresponding eigenvalue tail sums. In order for this bound to vanish, we require the number of inducing points to grow sufficiently fast; with such a choice, the bound tends to zero as the number of data points increases. The remaining term in the bound can be simplified in the same way.
Appendix
Proof of Claims in Sec. 2.1
Claim 8.
$\lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)$.
Proof.
Applying L'Hôpital's rule to the definition of $D_{\alpha}$ as $\alpha \to 1$ (both the numerator and the denominator vanish in the limit), we differentiate with respect to $\alpha$.
By Leibniz's rule for differentiation under the integral sign, the limit equals $\int p(x)\log\frac{p(x)}{q(x)}\,dx = \mathrm{KL}(p \,\|\, q)$.
∎
Claim 9.
.
Proof.
This is trivial: it follows by substituting the corresponding value of $\alpha$ directly into the definition. ∎
Claim 10.
.
Proof.
The first equality follows from Claim 1. The left inequality can be obtained by Jensen's inequality. ∎
The Variational Rényi Lower Bound
When we apply the VR bound to the GP regression model and use the stated assumption, we can further obtain
It has been shown that , where . Besides, we have . Therefore,
Instead of treating these as a pool of free parameters, it is desirable to find the optimal values that maximize the lower bound. This can be achieved using a special case of Hölder's inequality (i.e., the Lyapunov inequality).
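For completeness, the special case referred to here is the Lyapunov (power-mean) inequality, which follows from Hölder's inequality: for a non-negative random variable $X$ and exponents $0 < r < s$,
\[
\left(\mathbb{E}\!\left[X^{r}\right]\right)^{1/r} \;\le\; \left(\mathbb{E}\!\left[X^{s}\right]\right)^{1/s}.
\]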
Then we have,
The optimal is
Specifically,
It can be shown that
where . Since , we have
where
The last equality comes from a variation of Jacobi's formula. The approximation is accurate only when the relevant quantity is “small”. Therefore, when this condition holds, the lower bound admits a simplified closed-form expression. While this form is attractive, it is not practically useful, since the approximation breaks down when the quantity is “large”. In the analysis section, we will therefore not rely on this approximation when proving the convergence results.
The Data-dependent Upper Bound
Lemma 11.
Suppose we have two positive semi-definite (PSD) matrices and such that is also a PSD matrix, then . Furthermore, if and are positive definite (PD), then .
This lemma has been proved in (Horn and Johnson, 2012). Based on this lemma, we can compute a data-dependent upper bound on the log-marginal likelihood (Titsias, 2014).
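The Loewner-order facts we take the lemma to refer to are standard (Horn and Johnson, 2012); in generic notation (introduced here for illustration), for positive semi-definite matrices $\mathbf{A} \succeq \mathbf{B} \succeq 0$,
\[
\det(\mathbf{A}) \;\ge\; \det(\mathbf{B}),
\qquad\text{and, if } \mathbf{A},\mathbf{B} \succ 0,\quad
\mathbf{A}^{-1} \;\preceq\; \mathbf{B}^{-1}.
\]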
Claim 12.
.
Proof.
Since
where the symbol denotes the Loewner (positive semi-definite) ordering. We can then obtain the claimed ordering, since both matrices are PSD. Therefore,
let the eigen-decomposition of the matrix in question be given; this decomposition exists since the matrix is PD. Then,
comparing the corresponding eigenvalues, the required inequality is immediate, and we therefore obtain the desired comparison.
Based on this inequality, it is easy to show that
Finally, we obtain
∎
We will use this upper bound to prove our main theorem.
Detailed Proof of Convergence Result
Let and .