## 1 Introduction

In Gaussian process (GP) regression, two types of inference are commonly used: exact inference and variational inference. Exact inference (Wang et al., 2019) directly optimizes the log marginal likelihood of the data

$$\log p(\mathbf{y}) = \log \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{0},\, \mathbf{K}_{ff} + \sigma^2 \mathbf{I}\right),$$

while variational inference based on the Kullback-Leibler (KL) divergence (Burt et al., 2019) optimizes a tractable lower bound on it,

$$\mathcal{L}_{\mathrm{KL}} = \log \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{0},\, \mathbf{Q}_{ff} + \sigma^2 \mathbf{I}\right) - \frac{1}{2\sigma^2}\operatorname{tr}\!\left(\mathbf{K}_{ff} - \mathbf{Q}_{ff}\right),$$

where $\mathbf{Q}_{ff} = \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{K}_{uf}$ is a Nyström approximation of the exact covariance matrix $\mathbf{K}_{ff}$. This lower bound significantly reduces the computational burden and mitigates overfitting, as it enforces regularization on the likelihood. In this article we introduce an alternative closed-form lower bound on the likelihood based on the Rényi $\alpha$-divergence (Rényi, 1961)

$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx.$$
**Main Results.** The new lower bound can be expressed in closed form as follows:

This new lower bound can be viewed as a convex combination of the sparse GP and the exact GP. The key advantage of this bound is its capability to control and tune the regularization enforced on the model; it thus generalizes traditional sparse variational GP regression. Please refer to Sec. 2 for more information.
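To make the two ingredients of this combination concrete, the following NumPy sketch (ours, for illustration only; the SE kernel, data, and inducing-point choices are arbitrary) computes both the exact log marginal likelihood and the sparse KL-based ELBO, and confirms that the sparse bound indeed lies below the exact objective:

```python
import numpy as np

def se_kernel(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix k(A, B)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def exact_log_marginal(X, y, noise=0.1):
    """log N(y | 0, K_ff + noise*I): the exact GP objective."""
    n = len(y)
    C = se_kernel(X, X) + noise * np.eye(n)
    L = np.linalg.cholesky(C)
    sol = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ sol - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2 * np.pi))

def sparse_elbo(X, y, Z, noise=0.1):
    """Collapsed sparse ELBO: log N(y|0, Q_ff + noise*I) - tr(K_ff - Q_ff)/(2*noise)."""
    n = len(y)
    Kuu = se_kernel(Z, Z) + 1e-8 * np.eye(len(Z))    # jitter for stability
    Kuf = se_kernel(Z, X)
    Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)          # Nystrom approximation
    C = Qff + noise * np.eye(n)
    L = np.linalg.cholesky(C)
    sol = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit_term = (-0.5 * y @ sol - np.log(np.diag(L)).sum()
                - 0.5 * n * np.log(2 * np.pi))
    trace_term = np.trace(se_kernel(X, X) - Qff) / (2 * noise)
    return fit_term - trace_term

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
Z = X[:10]                                           # inducing inputs
assert sparse_elbo(X, y, Z) <= exact_log_marginal(X, y)
```

The gap between the two quantities is exactly the regularization that the sparse bound enforces; the bound proposed here interpolates between such components.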

From the theoretical perspective, we show that, with probability at least $1-\delta$, the Rényi $\alpha$-divergence between the variational distribution and the true posterior converges to 0 as the number of data points increases. Specifically,

As shown in this equation, $\alpha$ plays an important role in controlling the rate of convergence. The role of $\alpha$ and the rates of convergence for different kernels are given in Secs. 3 and 4.

## 2 The Rényi Gaussian Processes and Variational Inference

Traditional variational inference seeks to minimize the Kullback-Leibler (KL) divergence between the variational density $q$ and the intractable posterior $p(\cdot \mid \mathcal{D})$, where $\theta$ is a vector of parameters and $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is the dataset. This minimization problem in turn yields a tractable evidence lower bound (ELBO) on the marginal log-likelihood of the data. The Rényi $\alpha$-divergence is a more general distance measure than the KL divergence. In this work, we explore Rényi-divergence-based GP inference.

### 2.1 Rényi Divergence

The Rényi $\alpha$-divergence between two distributions $p$ and $q$ on a random variable $x$ is defined as

$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx, \qquad \alpha > 0,\ \alpha \neq 1.$$

This divergence contains a rich family of distance measures, such as the KL divergence. Moreover, the domain of $\alpha$ can be extended to $\alpha = 0$ and $\alpha \to +\infty$.
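For two univariate Gaussians the divergence has a well-known closed form, which also illustrates the KL limit stated in Claim 1 below. The following script (ours, purely illustrative; function names are not from the paper) evaluates it and checks the $\alpha \to 1$ behaviour numerically:

```python
import numpy as np

def renyi_gauss(alpha, m1, s1, m2, s2):
    """Closed-form Renyi alpha-divergence D_alpha(N(m1,s1^2) || N(m2,s2^2)).

    Valid when (1 - alpha) * s1**2 + alpha * s2**2 > 0 and alpha != 1.
    """
    var_star = (1 - alpha) * s1**2 + alpha * s2**2
    return (np.log(s2 / s1)
            + np.log(s2**2 / var_star) / (2 * (alpha - 1))
            + alpha * (m1 - m2)**2 / (2 * var_star))

def kl_gauss(m1, s1, m2, s2):
    """KL divergence between two univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# As alpha -> 1, the Renyi divergence approaches the KL divergence.
d_renyi = renyi_gauss(0.999, 0.0, 1.0, 1.0, 2.0)
d_kl = kl_gauss(0.0, 1.0, 1.0, 2.0)
assert abs(d_renyi - d_kl) < 1e-3
```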

###### Claim 1.

$\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)$.

Therefore, the KL divergence is a special case of the $\alpha$-divergence. It is well known that the KL divergence yields the popular ELBO; it is therefore interesting to derive a similar bound using the $\alpha$-divergence. Let $\mathbf{y}$ denote our data. Starting from $\log p(\mathbf{y})$, we arrive at the variational Rényi (VR) bound (Li and Turner, 2016). This form is defined as

where $f$ is a Gaussian process, $\mathbf{Z}$ is the set of pseudo-inputs, and $\mathbf{u} = f(\mathbf{Z})$ is the corresponding latent variable.

###### Claim 2.

.

Denoting the bound at $\alpha = 1$ as the ELBO, we have the following claim.

###### Claim 3.

.

The proof is given in the appendix.

### 2.2 The Variational Rényi Lower Bound

Our bound is

where

is the covariance matrix. It can be seen that the new lower bound is a convex combination of components from the sparse GP and components from the exact GP. We can also see that $\alpha$ plays an important role in model regularization.

We also derive a data-dependent upper bound similar to that of Titsias (2014). See the appendix for details.

## 3 Convergence Analysis

In this section, we derive convergence results that build on, and extend, recent work by Titsias (2014), Huggins et al. (2018), and Burt et al. (2019). Due to space limits, all proofs are deferred to the appendix.

###### Theorem 4.

Suppose $n$ data points are drawn i.i.d. from the input distribution. Sample $m$ inducing points from the training data, with the probability assigned to any set of size $m$ equal to the probability assigned to the corresponding subset by a $k$-determinantal point process ($k$-DPP) (Belabbas and Wolfe, 2009) with $k = m$. If $\mathbf{y}$ is distributed according to a sample from the prior generative model, then with probability at least $1-\delta$,

where $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of the integral operator associated with the kernel. As $\alpha \to 1$, we obtain the bound for the KL divergence.

###### Theorem 5.

Suppose $n$ data points are drawn i.i.d. from the input distribution. Sample $m$ inducing points from the training data, with the probability assigned to any set of size $m$ equal to the probability assigned to the corresponding subset by a $k$-determinantal point process ($k$-DPP) (Belabbas and Wolfe, 2009) with $k = m$. With probability at least $1-\delta$,

where $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of the integral operator associated with the kernel.

As $\alpha \to 1$, we recover the bound for the KL divergence.

## 4 Consequences

### 4.1 Smooth Kernel

We now provide a convergence result for the squared-exponential (SE) kernel. For the SE kernel, the eigenvalues of the associated integral operator decay geometrically (Burt et al., 2019); here $\ell$ is the lengthscale parameter, $\sigma_f^2$ is the signal variance, and $\sigma^2$ is the noise parameter.

###### Corollary 6.

Suppose the stated growth condition holds with a constant factor. Fix $\delta$ and take the number of inducing points accordingly. Assume the input data is normally distributed and regression is performed with an SE kernel. Then, with probability at least $1-\delta$, the stated bound holds when inference is performed with the VR bound.

As $n \to \infty$, the bound vanishes. As $\alpha \to 1$, we obtain the corresponding result for the KL divergence.

### 4.2 Non-smooth Kernel

For the Matérn kernel with smoothness parameter $\nu$, the eigenvalues of the associated integral operator decay polynomially. We can obtain the resulting rate from the following claim.

###### Claim 7.

.

###### Proof.

It is easy to see that $\sum_{i=m+1}^{\infty} i^{-s} = \zeta(s) - H_m^{(s)}$, where $\zeta(s)$ is the Riemann zeta function and $H_m^{(s)}$ is the generalized harmonic number. By the Euler-Maclaurin sum formula, we have (Woon, 1998)

$$H_m^{(s)} = \zeta(s) - \frac{m^{1-s}}{s-1} + O(m^{-s}).$$

Therefore,

$$\sum_{i=m+1}^{\infty} i^{-s} = \frac{m^{1-s}}{s-1} + O(m^{-s}).$$

∎
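This tail estimate is easy to check numerically. The script below (ours, for illustration) truncates the tail sum and compares it against the Euler-Maclaurin leading term:

```python
import numpy as np

def tail_sum(s, m, terms=2_000_000):
    """Numerically approximate the tail sum_{i=m+1}^{inf} i^{-s} by truncation."""
    i = np.arange(m + 1, m + 1 + terms, dtype=np.float64)
    return np.sum(i ** (-s))

s, m = 2.5, 100
leading = m ** (1 - s) / (s - 1)   # Euler-Maclaurin leading term m^{1-s}/(s-1)
# The remainder is O(m^{-s}), so the difference should be below m^{-s}.
assert abs(tail_sum(s, m) - leading) < m ** (-s)
```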

Let $m$ be chosen as above. Then, by Theorem 5, we have

In order for this term to vanish, we require the stated condition on $m$. Therefore,

Let , then . Therefore, we have

Another term in the bound can also be simplified as

## Appendix

## Proof of Claims in Sec. 2.1

###### Claim 8.

$\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)$.

###### Proof.

Applying L'Hôpital's rule, we have

$$\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = \lim_{\alpha \to 1} \frac{d}{d\alpha} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx.$$

By Leibniz's rule for differentiation under the integral sign, we have

$$\frac{d}{d\alpha} \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx = \int p(x)^{\alpha}\, q(x)^{1-\alpha} \log \frac{p(x)}{q(x)}\, dx \;\longrightarrow\; \int p(x) \log \frac{p(x)}{q(x)}\, dx = \mathrm{KL}(p \,\|\, q) \quad \text{as } \alpha \to 1.$$

∎

###### Claim 9.

.

###### Proof.

This is trivial; it follows by direct substitution. ∎

###### Claim 10.

.

###### Proof.

The first equality follows from Claim 1. The left inequality follows from Jensen's inequality. ∎

## The Variational Rényi Lower Bound

When we apply the VR bound to the GP model under the stated assumption, we can further obtain

It has been shown that , where . Besides, we have . Therefore,

Instead of treating these as a pool of free parameters, it is desirable to find the optimal values that maximize the lower bound. This can be achieved via a special case of Hölder's inequality (i.e., the Lyapunov inequality).
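Lyapunov's inequality states that $r \mapsto \left(\mathbb{E}\lvert X\rvert^{r}\right)^{1/r}$ is non-decreasing in $r$. A quick Monte Carlo illustration (this script is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.abs(rng.standard_normal(100_000))  # samples of |X| for X ~ N(0, 1)

# Lyapunov's inequality: the r-th moment norm (E|X|^r)^(1/r) grows with r.
norms = [np.mean(x ** r) ** (1.0 / r) for r in (0.5, 1.0, 2.0, 4.0)]
assert all(a <= b + 1e-12 for a, b in zip(norms, norms[1:]))
```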

Then we have,

The optimal is

Specifically,

It can be shown that

where . Since , we have

where

The last equality follows from a variation of Jacobi's formula. The approximation is accurate only when the relevant term is “small”. Therefore, the lower bound can be expressed as

given the stated condition. While this form is attractive, it is not practically useful, since the approximation does not work well when the term is “large”. In the analysis section, we will instead use the exact form to prove the convergence result.

## The Data-dependent Upper Bound

###### Lemma 11.

Suppose we have two positive semi-definite (PSD) matrices $\mathbf{A}$ and $\mathbf{B}$ such that $\mathbf{B} - \mathbf{A}$ is also a PSD matrix; then $\det(\mathbf{A}) \le \det(\mathbf{B})$. Furthermore, if $\mathbf{A}$ and $\mathbf{B}$ are positive definite (PD), then $\log\det(\mathbf{A}) \le \log\det(\mathbf{B})$.

This lemma is proved in Horn and Johnson (2012). Based on it, we can compute a data-dependent upper bound on the log marginal likelihood (Titsias, 2014).
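The lemma is also easy to verify numerically; a small NumPy check on randomly generated matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    """Random symmetric positive semi-definite matrix M = G @ G.T."""
    G = rng.standard_normal((n, n))
    return G @ G.T

# If B - A is PSD, then det(A) <= det(B); compare via log-determinants.
for _ in range(100):
    A = random_psd(5)
    B = A + random_psd(5)            # B - A is PSD by construction
    logdet_a = np.linalg.slogdet(A)[1]
    logdet_b = np.linalg.slogdet(B)[1]
    assert logdet_a <= logdet_b + 1e-9
```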

###### Claim 12.

.

###### Proof.

Since

where $\mathbf{A} \preceq \mathbf{B}$ means that $\mathbf{B} - \mathbf{A}$ is PSD. Then, by Lemma 11, we can obtain the determinant inequality since both are PSD matrices. Therefore,

Let be the eigen-decomposition of . This decomposition exists since the matrix is PD. Then

where the $\lambda_i$ are the eigenvalues of the corresponding matrix. The desired inequality then clearly follows, and we can obtain

Based on this inequality, it is easy to show that

Finally, we obtain

∎

We will use this upper bound to prove our main theorem.

## Detailed Proof of Convergence Result

Let and .
