# The Rényi Gaussian Process

In this article we introduce an alternative closed-form lower bound on the Gaussian process (GP) likelihood based on the Rényi α-divergence. This new lower bound can be viewed as a convex combination of the Nyström approximation and the exact GP. The key advantage of this bound is its capability to control and tune the enforced regularization on the model; it is therefore a generalization of traditional sparse variational GP regression. From the theoretical perspective, we show that with probability at least 1−δ, the Rényi α-divergence between the variational distribution and the true posterior becomes arbitrarily small as the number of data points increases.


## 1 Introduction

In Gaussian process (GP) regression, people mainly focus on two types of inference: exact inference and variational inference. Exact inference (Wang et al., 2019) directly optimizes the data likelihood

$$\mathcal{L}_{\mathrm{exact}}=\log\mathcal{N}\!\left(0,\;\sigma_\epsilon^2 I+K_{f,f}\right),$$

while variational inference based on the Kullback-Leibler (KL) divergence (Burt et al., 2019) optimizes a tractable lower bound of the data likelihood

$$\mathcal{L}_{\mathrm{VI}}=\log\mathcal{N}\!\left(0,\;\sigma_\epsilon^2 I+Q\right)-\frac{1}{2\sigma_\epsilon^2}\mathrm{Tr}\!\left(K_{f,f}-Q\right),$$

where $Q=K_{f,U}K_{U,U}^{-1}K_{U,f}$ is a Nyström approximation of the exact covariance matrix $K_{f,f}$. This lower bound significantly reduces the computational burden and avoids model overfitting, as it enforces regularization on the likelihood. In this article we introduce an alternative closed-form lower bound on the likelihood based on the Rényi α-divergence (Rényi and others, 1961).

**Main Results.** The new lower bound can be expressed in closed form as follows:

$$\mathcal{L}_\alpha(q;y)=\log\mathcal{N}\!\left(0,\;\sigma_\epsilon^2 I+(1-\alpha)K_{f,f}+\alpha Q\right)+\log\left|I+\frac{1-\alpha}{\sigma_\epsilon^2}\left(K_{f,f}-Q\right)\right|^{-\frac{\alpha}{2(1-\alpha)}}.$$

This new lower bound can be viewed as a convex combination of the sparse GP and the exact GP. The key advantage of this bound is its capability to control and tune the enforced regularization on the model; it is therefore a generalization of traditional sparse variational GP regression. Please refer to Sec. 2 for more information.
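As a sanity check on the interpolation property, here is a minimal NumPy sketch with an assumed toy dataset, an RBF kernel and illustrative values of $N$, $M$ and $\sigma_\epsilon^2$ (none taken from the paper). It evaluates $\mathcal{L}_{\mathrm{exact}}$, $\mathcal{L}_{\mathrm{VI}}$ and $\mathcal{L}_\alpha$; numerically, $\mathcal{L}_\alpha$ approaches the exact likelihood as $\alpha\to0$ and the sparse VI bound as $\alpha\to1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values are illustrative assumptions, not from the paper):
# N data points, M pseudo-inputs, noise = sigma_eps^2, RBF kernel.
N, M, noise = 40, 6, 0.1
X = np.sort(rng.uniform(-3, 3, N))
Z = np.linspace(-3, 3, M)
y = rng.normal(0, 1, N)

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

Kff = rbf(X, X) + 1e-8 * np.eye(N)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(M)
Kfu = rbf(X, Z)
Q = Kfu @ np.linalg.solve(Kuu, Kfu.T)   # Nystrom approximation of Kff

def log_gauss(S):
    """log N(y; 0, S)."""
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (y @ np.linalg.solve(S, y) + logdet + N * np.log(2 * np.pi))

L_exact = log_gauss(noise * np.eye(N) + Kff)
L_vi = log_gauss(noise * np.eye(N) + Q) - np.trace(Kff - Q) / (2 * noise)

def L_alpha(a):
    _, logdet = np.linalg.slogdet(np.eye(N) + (1 - a) / noise * (Kff - Q))
    return log_gauss(noise * np.eye(N) + (1 - a) * Kff + a * Q) \
        - a / (2 * (1 - a)) * logdet

# alpha -> 0 recovers the exact GP; alpha -> 1 recovers the sparse VI bound.
print(L_exact, L_alpha(1e-5), L_alpha(1 - 1e-5), L_vi)
```

Since the VI bound is a lower bound on the exact likelihood, `L_vi <= L_exact` should also hold for any draw of `y`.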

From the theoretical perspective, we show that with probability at least $1-\delta$, the Rényi α-divergence between the variational distribution and the true posterior converges to 0 as the number of data points increases. Specifically,

$$D_\alpha[q\,\|\,p]\le\frac{1}{\delta}\frac{\alpha}{2(1-\alpha)}\log\left[1+\frac{1-\alpha}{\sigma_\epsilon^2}\cdot\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{N}\right]^{N}+\alpha\,\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{2\delta\sigma_\epsilon^2}\cdot\frac{\|y\|^2}{\sigma_\epsilon^2}.$$

As shown in this equation, $\alpha$ plays an important role in controlling the rate of convergence. The role of $\alpha$ and the rates of convergence for different kernels are given in Secs. 3 and 4.

## 2 The Rényi Gaussian Processes and Variational Inference

Traditional variational inference seeks to minimize the Kullback-Leibler (KL) divergence between the variational density $q(\theta)$ and the intractable posterior $p(\theta\,|\,y)$, where $\theta$ is a vector of parameters and $y$ is the dataset. This minimization problem in turn yields a tractable evidence lower bound (ELBO) of the marginal log-likelihood of the data. The Rényi α-divergence is a more general distance measure than the KL divergence. In this work, we explore the Rényi-divergence-based GP.

### 2.1 Rényi Divergence

The Rényi α-divergence between two distributions $p$ and $q$ on a random variable $\theta$ is defined as

$$D_\alpha[p\,\|\,q]=\frac{1}{\alpha-1}\log\int p(\theta)^{\alpha}q(\theta)^{1-\alpha}\,d\theta,\qquad\alpha\in(0,1).$$

This divergence contains a rich family of distance measures, such as the KL divergence. Besides, the domain of $\alpha$ can be extended beyond $(0,1)$.
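The definition can be checked numerically. The sketch below uses two hypothetical 1-D Gaussians (assumed parameters, chosen only for illustration) and grid quadrature to evaluate $D_\alpha$ directly from the integral, then compares it with the closed-form KL divergence as $\alpha\to1$:

```python
import numpy as np

# Two overlapping densities on a fine grid: p = N(0, 1), q = N(1, 1.5^2)
# (hypothetical choices; any pair of overlapping densities would do).
x = np.linspace(-20, 20, 200_001)
dx = x[1] - x[0]
s = 1.5
p = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
q = np.exp(-0.5 * ((x - 1) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def renyi(alpha):
    """D_alpha[p||q] = 1/(alpha - 1) * log integral p^alpha q^(1 - alpha)."""
    return np.log(np.sum(p ** alpha * q ** (1 - alpha)) * dx) / (alpha - 1)

# Closed-form KL[p||q] between the two Gaussians, for comparison.
kl = np.log(s) + (1 + 1) / (2 * s ** 2) - 0.5

print(renyi(0.5), renyi(0.999), kl)
```

The Rényi divergence is nondecreasing in $\alpha$, so `renyi(0.5)` should sit below `renyi(0.999)`, which in turn should be close to `kl`.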

###### Claim 1.

$\lim_{\alpha\to1}D_\alpha[p\,\|\,q]=\mathrm{KL}[p\,\|\,q]$.

Therefore, the KL divergence is a special case of the α-divergence. It is well known that the KL divergence yields the popular ELBO, so it is interesting to derive a similar bound using the α-divergence. Let $y$ denote our data. Starting from the marginal likelihood, we arrive at the variational Rényi (VR) bound (Li and Turner, 2016). This bound is defined as

$$\mathcal{L}_\alpha(q;y)\coloneqq\frac{1}{1-\alpha}\log\mathbb{E}_{q}\!\left[\left(\frac{p(f,U,y\,|\,Z)}{q(f,U\,|\,Z)}\right)^{1-\alpha}\right]=\mathrm{VR}(q\,\|\,p),$$

where $f$ is the Gaussian process, $Z$ is the pseudo-input and $U$ is the latent variable.

###### Claim 2.

.

Denoting the KL-based lower bound as the ELBO, we have the following claim.

###### Claim 3.

.

The proof is given in the appendix.

### 2.2 The Variational Rényi Lower Bound

Our bound is

$$\mathcal{L}_\alpha(q;y)=\log\mathcal{N}\!\left(0,\;\sigma_\epsilon^2 I+(1-\alpha)K_{f,f}+\alpha Q\right)+\log C_x,$$

where

$$C_x=\left|I+\frac{1-\alpha}{\sigma_\epsilon^2}\left(K_{f,f}-Q\right)\right|^{-\frac{\alpha}{2(1-\alpha)}}\approx\left\{1+\frac{1-\alpha}{\sigma_\epsilon^2}\mathrm{Tr}\!\left(K_{f,f}-Q\right)+O\!\left(\frac{(1-\alpha)^2}{\sigma_\epsilon^4}\right)\right\}^{-\frac{\alpha}{2(1-\alpha)}},$$

$K_{f,f}$ is the covariance matrix and $Q=K_{f,U}K_{U,U}^{-1}K_{U,f}$. It can be seen that the new lower bound is a convex combination of components from the sparse GP ($Q$) and components from the exact GP ($K_{f,f}$). We can also see that $\alpha$ plays an important role in model regularization.
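The trace approximation of $C_x$ can be probed numerically. The sketch below substitutes a random toy PSD matrix for $K_{f,f}-Q$ and uses an $\alpha$ close to 1 (both are assumptions, not values from the paper), then compares $\log C_x$ computed from the determinant with its trace approximation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy PSD matrix standing in for K_ff - Q, plus assumed sigma_eps^2 and alpha.
n = 20
A = rng.normal(size=(n, n))
R = A @ A.T / 400.0                     # PSD with trace around 1
noise, alpha = 1.0, 0.99                # (1 - alpha) / noise is small here
c = -alpha / (2 * (1 - alpha))

eps = (1 - alpha) / noise
_, logdet = np.linalg.slogdet(np.eye(n) + eps * R)
log_Cx_exact = c * logdet                          # determinant form
log_Cx_approx = c * np.log(1 + eps * np.trace(R))  # trace approximation

print(log_Cx_exact, log_Cx_approx)
```

The two values should agree closely here precisely because $(1-\alpha)\,\mathrm{Tr}(R)/\sigma_\epsilon^2$ is small; increasing `eps` makes the approximation degrade, consistent with the caveat in the appendix.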

We also derive a data-dependent upper bound similar to Titsias (2014). See appendix for details.

## 3 Convergence Analysis

In this section, we derive convergence results based on recent works by Titsias (2014), Huggins et al. (2018) and Burt et al. (2019), and provide some extensions of those works. Due to the space limit, we move all proofs to the appendix.

###### Theorem 4.

Suppose $N$ data points are drawn i.i.d. from the input distribution. Sample $M$ inducing points from the training data, with the probability assigned to any set of size $M$ equal to the probability assigned to the corresponding subset by a k-Determinantal Point Process (k-DPP) (Belabbas and Wolfe, 2009) with $k=M$. If $y$ is distributed according to a sample from the prior generative model, then with probability at least $1-\delta$,

$$\mathrm{VR}[q\,\|\,p]\le\alpha\,\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{2\delta\sigma_\epsilon^2}+\frac{1}{\delta}\frac{\alpha}{2(1-\alpha)}\log\left[1+\frac{1-\alpha}{\sigma_\epsilon^2}\cdot\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{N}\right]^{N},$$

where $\lambda_1\ge\lambda_2\ge\cdots$ are the eigenvalues of the integral operator associated with the kernel.

As $\alpha\to1$, we obtain the bound for the KL divergence.

###### Theorem 5.

Suppose $N$ data points are drawn i.i.d. from the input distribution. Sample $M$ inducing points from the training data, with the probability assigned to any set of size $M$ equal to the probability assigned to the corresponding subset by a k-Determinantal Point Process (k-DPP) (Belabbas and Wolfe, 2009) with $k=M$. With probability at least $1-\delta$,

$$D_\alpha[q\,\|\,p]\le\frac{1}{\delta}\frac{\alpha}{2(1-\alpha)}\log\left[1+\frac{1-\alpha}{\sigma_\epsilon^2}\cdot\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{N}\right]^{N}+\alpha\,\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{2\delta\sigma_\epsilon^2}\cdot\frac{\|y\|^2}{\sigma_\epsilon^2},$$

where $\lambda_1\ge\lambda_2\ge\cdots$ are the eigenvalues of the integral operator associated with the kernel.

As $\alpha\to1$, we reach the bound for the KL divergence.

## 4 Consequences

### 4.1 Smooth Kernel

We provide a convergence result for the squared-exponential (SE) kernel, whose operator eigenvalues admit a closed form involving the length-scale parameter, the signal variance and the noise parameter; the eigenvalue tail sum can then be bounded explicitly (Burt et al., 2019).

###### Corollary 6.

Suppose $\|y\|^2\le RN$, where $R$ is a constant. Fix $\gamma>0$ and take $M=O(\log N)$ inducing points. Assume the input data is normally distributed and regression is performed with an SE kernel. With probability at least $1-\delta$,

$$\mathrm{VR}[q\,\|\,p]\le\frac{2\alpha R}{\sigma_\epsilon^2}\frac{1}{N^{\gamma}}+\frac{1}{\delta}\frac{\alpha}{2(1-\alpha)}\log\left[1+(1-\alpha)\frac{4}{\delta N^{\gamma+2}}\right]^{N},$$

when inference is performed with this choice of $M$.

###### Proof.

From Burt et al. (2019), we have the required bound on the eigenvalue tail sum for the SE kernel. By Theorem 19, we obtain

$$\mathrm{VR}[q\,\|\,p]\le\frac{2\alpha R}{\sigma_\epsilon^2}\frac{1}{N^{\gamma}}+\frac{1}{\delta}\frac{\alpha}{2(1-\alpha)}\log\left[1+(1-\alpha)\left(\frac{2}{\delta N^{\gamma+2}}+\frac{2}{\delta N^{\gamma+2}}\right)\right]^{N}<\frac{2\alpha R}{\sigma_\epsilon^2}\frac{1}{N^{\gamma}}+\alpha\frac{2}{N^{\gamma+1}}=\frac{\alpha}{N^{\gamma}}\left(\frac{2R}{\sigma_\epsilon^2}+\frac{2}{N}\right).$$

As $N\to\infty$, the bound goes to $0$; as $\alpha\to1$, we obtain the corresponding bound for the KL divergence.

### 4.2 Non-smooth Kernel

For the Matérn kernel, $\lambda_m=O(m^{-2k-2})$. We can obtain $\sum_{m=M+1}^{\infty}\lambda_m=O(M^{-2k-1})$ by the following claim.

###### Proof.

It is easy to see that $\sum_{m=1}^{\infty}m^{-2k-2}=\zeta(2k+2)$, where $\zeta$ is the Riemann zeta function. By the Euler-Maclaurin sum formula, we have, for the generalized harmonic number (Woon, 1998),

$$\sum_{m=1}^{M}\left(\frac{1}{m}\right)^{2k+2}=\zeta(2k+2)-\frac{1}{2k+1}M^{-2k-1}+\frac{1}{2}M^{-2k-2}-\frac{2k+2}{12}M^{-2k-3}+O\!\left(M^{-2k-4}\right).$$

Therefore,

$$\sum_{m=M+1}^{\infty}\lambda_m=\frac{1}{2k+1}M^{-2k-1}-\frac{1}{2}M^{-2k-2}+\frac{2k+2}{12}M^{-2k-3}-O\!\left(M^{-2k-4}\right)=O\!\left(\frac{1}{M^{2k+1}}\right).$$

Let $M=N^{p}$. Then by Theorem 19, we have

$$\alpha\,\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{2\delta\sigma_\epsilon^2}\cdot\frac{\|y\|^2}{\sigma_\epsilon^2}\le\alpha\,\frac{(M+1)N\frac{A}{M^{2k+1}}+2Nv\epsilon}{2\delta\sigma_\epsilon^2}\cdot\frac{RN}{\sigma_\epsilon^2}=\frac{\alpha R}{2\delta\sigma_\epsilon^4}\left(\frac{(M+1)N^{2}A}{M^{2k+1}}+2N^{2}v\epsilon\right).$$

In order to let this term vanish, we require $p>1/k$. Therefore,

$$\frac{(M+1)N^{2}A}{M^{2k+1}}=\frac{(N^{p}+1)N^{2}A}{N^{(2k+1)p}}\le\frac{A}{N^{2kp-2}}.$$

Let $\gamma=2kp-2$; then $\gamma>0$. Therefore, we have

$$\frac{\alpha R}{2\delta\sigma_\epsilon^4}\left(\frac{(M+1)N^{2}A}{M^{2k+1}}+2N^{2}v\epsilon\right)\le\frac{\alpha R}{N^{\gamma}\sigma_\epsilon^2}+\frac{\alpha RA}{2\delta\sigma_\epsilon^4 N^{\gamma}}.$$

Another term in the bound can also be simplified as

$$\frac{\alpha}{2(1-\alpha)}\log\left[1+\frac{1-\alpha}{\sigma_\epsilon^2}\cdot\frac{(M+1)N\sum_{m=M+1}^{\infty}\lambda_m+2Nv\epsilon}{N}\right]^{N}\le\frac{\alpha N}{2(1-\alpha)}\log\left[1+(1-\alpha)\left(\frac{A}{\sigma_\epsilon^2N^{\gamma+2}}+\frac{2}{\delta\sigma_\epsilon^2N^{\gamma+2}}\right)\right].$$
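The claimed tail rate can be checked numerically. The sketch below uses toy values $k=1$ and $M=200$ (assumptions chosen for illustration) and compares the tail sum $\sum_{m>M}m^{-2k-2}$ against the leading Euler-Maclaurin term $\frac{1}{2k+1}M^{-2k-1}$:

```python
import numpy as np

# Toy values: Matern index k and truncation M (assumed for illustration).
k, M = 1, 200
m = np.arange(M + 1, 2_000_000, dtype=np.float64)
tail = np.sum(m ** (-(2 * k + 2)))           # sum_{m > M} m^{-2k-2}
leading = M ** (-(2 * k + 1)) / (2 * k + 1)  # predicted leading term

print(tail, leading, tail / leading)
```

The ratio should be close to 1, confirming the $O(M^{-2k-1})$ rate; the small deficit matches the negative second-order term $-\frac{1}{2}M^{-2k-2}$ in the expansion.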

## Proof of Claims in Sec. 2.1

$\lim_{\alpha\to1}D_\alpha[p\,\|\,q]=\mathrm{KL}[p\,\|\,q]$.

###### Proof.

Applying L'Hôpital's rule, we have

$$\lim_{\alpha\to1}D_\alpha[p\,\|\,q]=\lim_{\alpha\to1}\frac{1}{\alpha-1}\log\int p(\theta)^{\alpha}q(\theta)^{1-\alpha}\,d\theta=\lim_{\alpha\to1}\frac{\frac{d}{d\alpha}\log\int p(\theta)^{\alpha}q(\theta)^{1-\alpha}\,d\theta}{\frac{d}{d\alpha}(\alpha-1)}=\lim_{\alpha\to1}\frac{d}{d\alpha}\log\int p(\theta)^{\alpha}q(\theta)^{1-\alpha}\,d\theta.$$

By Leibniz's rule, we have

$$\lim_{\alpha\to1}D_\alpha[p\,\|\,q]=\lim_{\alpha\to1}\frac{\int p(\theta)^{\alpha}q(\theta)^{1-\alpha}\left[\log p(\theta)-\log q(\theta)\right]d\theta}{\int p(\theta)^{\alpha}q(\theta)^{1-\alpha}\,d\theta}=\frac{\int p(\theta)\left[\log p(\theta)-\log q(\theta)\right]d\theta}{\int p(\theta)\,d\theta}=\int p(\theta)\log\frac{p(\theta)}{q(\theta)}\,d\theta=\mathrm{KL}[p\,\|\,q].$$

.

###### Proof.

This is trivial, just let . ∎

.

###### Proof.

The first equality follows from Claim 1. The left inequality can be obtained by Jensen's inequality. ∎

## The Variational Rényi Lower Bound

When we apply the VR bound to the sparse GP model and assume that $q(f,U\,|\,Z)=p(f\,|\,U,Z)\,q(U)$, we can further simplify the bound. It has been shown that $p(f\,|\,U,Z)=\mathcal{N}\!\left(K_{f,U}K_{U,U}^{-1}U,\;K_{f,f}-Q\right)$, where $Q=K_{f,U}K_{U,U}^{-1}K_{U,f}$. Besides, we have $p(y\,|\,f)=\mathcal{N}(f,\sigma_\epsilon^2I)$. Therefore,

$$\int p(f\,|\,U,Z)\,p(y\,|\,f)^{1-\alpha}\,df=\int p(f\,|\,U,Z)\left(|2\pi\sigma_\epsilon^2I|^{-\frac12}e^{-\frac12(y-f)^{T}(\sigma_\epsilon^2I)^{-1}(y-f)}\right)^{1-\alpha}df=\frac{|2\pi\sigma_\epsilon^2I|^{-\frac{1-\alpha}{2}}}{|2\pi\sigma_\epsilon^2I/(1-\alpha)|^{-\frac12}}\int p(f\,|\,U,Z)\,\mathcal{N}\!\left(f,\;\frac{\sigma_\epsilon^2I}{1-\alpha}\right)df=\frac{|2\pi\sigma_\epsilon^2I|^{-\frac{1-\alpha}{2}}}{|2\pi\sigma_\epsilon^2I/(1-\alpha)|^{-\frac12}}\,\mathcal{N}\!\left(K_{f,U}K_{U,U}^{-1}U,\;\frac{\sigma_\epsilon^2}{1-\alpha}I+K_{f,f}-Q\right)=(2\pi\sigma_\epsilon^2)^{\frac{\alpha N}{2}}\left(\frac{1}{1-\alpha}\right)^{\frac N2}\mathcal{N}\!\left(K_{f,U}K_{U,U}^{-1}U,\;\frac{\sigma_\epsilon^2}{1-\alpha}I+K_{f,f}-Q\right)=p(y\,|\,U,Z).$$

Instead of treating $q(U)$ as a pool of free parameters, it is desirable to find the optimal $q(U)$ that maximizes the lower bound. This can be achieved via a special case of the Hölder inequality (i.e., the Lyapunov inequality).

Then we have,

$$\mathcal{L}_\alpha(q;y)=\frac{1}{1-\alpha}\log\int p(y\,|\,U,Z)\,q(U)^{\alpha}p(U\,|\,Z)^{1-\alpha}\,dU=\frac{1}{1-\alpha}\log\int q(U)\left(\frac{p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)}{q(U)}\right)^{1-\alpha}dU=\frac{1}{1-\alpha}\log\mathbb{E}_{q}\!\left(\frac{p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)}{q(U)}\right)^{1-\alpha}\le\frac{1}{1-\alpha}\log\left[\mathbb{E}_{q}\!\left(\frac{p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)}{q(U)}\right)\right]^{1-\alpha}=\log\mathbb{E}_{q}\!\left(\frac{p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)}{q(U)}\right)=\log\int p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)\,dU.$$

The optimal $q(U)$ is

$$q^{*}(U)\propto p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z).$$

Specifically,

$$q^{*}(U)=\frac{p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)}{\int p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)\,dU}.$$
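The tilting that defines $q^{*}(U)$ can be illustrated in a 1-D conjugate toy model (not the paper's GP setting; all numbers below are assumptions): raising a Gaussian likelihood to the power $1/(1-\alpha)$ shrinks its variance by a factor $1-\alpha$, so the tilted optimum stays Gaussian:

```python
import numpy as np

# 1-D conjugate toy (not the paper's GP setting; all numbers are assumptions):
# prior N(0, 1) on u, "likelihood" N(y_obs; u, s2), Renyi order alpha.
y_obs, s2, alpha = 2.0, 0.5, 0.3
u = np.linspace(-10, 10, 200_001)
du = u[1] - u[0]

prior = np.exp(-0.5 * u ** 2)
lik = np.exp(-0.5 * (y_obs - u) ** 2 / s2)

# Tilted optimal density q*(u) proportional to lik^{1/(1-alpha)} * prior,
# normalized numerically on the grid.
q = lik ** (1 / (1 - alpha)) * prior
q /= np.sum(q) * du
mean = np.sum(u * q) * du
var = np.sum((u - mean) ** 2 * q) * du

# Conjugacy: tempering shrinks the likelihood variance to s2*(1-alpha),
# so q* is Gaussian with precision 1 + 1/(s2*(1-alpha)).
prec = 1 + 1 / (s2 * (1 - alpha))
print(mean, y_obs / (s2 * (1 - alpha)) / prec, var, 1 / prec)
```

The grid-based mean and variance should match the closed-form conjugate values, which is the same mechanism that makes $q^{*}(U)$ Gaussian in the derivation below.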

It can be shown that

$$p(y\,|\,U,Z)^{\frac{1}{1-\alpha}}=\left[(2\pi\sigma_\epsilon^2)^{\frac{\alpha N}{2}}\left(\frac{1}{1-\alpha}\right)^{\frac N2}\right]^{\frac{1}{1-\alpha}}\mathcal{N}\!\left(K_{f,U}K_{U,U}^{-1}U,\;\frac{\sigma_\epsilon^2}{1-\alpha}I+K_{f,f}-Q\right)^{\frac{1}{1-\alpha}}=\left[(2\pi\sigma_\epsilon^2)^{\frac{\alpha N}{2(1-\alpha)}}\left(\frac{1}{1-\alpha}\right)^{\frac{N}{2(1-\alpha)}}\right]C\,\mathcal{N}\!\left(K_{f,U}K_{U,U}^{-1}U,\;\sigma_\epsilon^2I+(1-\alpha)\left[K_{f,f}-Q\right]\right),$$

where $C$ is the constant produced by raising the Gaussian density to the power $1/(1-\alpha)$. Since $p(U\,|\,Z)=\mathcal{N}(0,K_{U,U})$, we have

$$\mathcal{L}_\alpha(q;y)=\log\int p(y\,|\,U,Z)^{1/(1-\alpha)}\,p(U\,|\,Z)\,dU=\log C_x\,\mathcal{N}\!\left(0,\;\sigma_\epsilon^2I+(1-\alpha)\left[K_{f,f}-Q\right]+K_{f,U}K_{U,U}^{-1}K_{U,f}\right)=\log C_x\,\mathcal{N}\!\left(0,\;\sigma_\epsilon^2I+(1-\alpha)\left[K_{f,f}-Q\right]+Q\right)=\log\mathcal{N}\!\left(0,\;\sigma_\epsilon^2I+(1-\alpha)K_{f,f}+\alpha Q\right)+\log C_x,$$

where

$$C_x=\left[(2\pi\sigma_\epsilon^2)^{\frac{\alpha N}{2(1-\alpha)}}\left(\frac{1}{1-\alpha}\right)^{\frac{N}{2(1-\alpha)}}\right]\left[\left|2\pi\left(\frac{\sigma_\epsilon^2}{1-\alpha}I+K_{f,f}-Q\right)\right|^{-\frac{\alpha}{2(1-\alpha)}}(1-\alpha)^{N/2}\right]=(2\pi\sigma_\epsilon^2)^{\frac{\alpha N}{2(1-\alpha)}}(1-\alpha)^{-\frac{\alpha N}{2(1-\alpha)}}\left|2\pi\left(\frac{\sigma_\epsilon^2}{1-\alpha}I+K_{f,f}-Q\right)\right|^{-\frac{\alpha}{2(1-\alpha)}}=\left|I+\frac{1-\alpha}{\sigma_\epsilon^2}\left(K_{f,f}-Q\right)\right|^{-\frac{\alpha}{2(1-\alpha)}}\approx\left\{1+\frac{1-\alpha}{\sigma_\epsilon^2}\mathrm{Tr}\!\left(K_{f,f}-Q\right)+O\!\left(\frac{(1-\alpha)^2}{\sigma_\epsilon^4}\right)\right\}^{-\frac{\alpha}{2(1-\alpha)}}.$$

The last equality comes from a variation of Jacobi's formula. The approximation is accurate only when $\frac{1-\alpha}{\sigma_\epsilon^2}\mathrm{Tr}(K_{f,f}-Q)$ is "small". Therefore, the lower bound can be expressed as

$$\mathcal{L}_\alpha(q;y)\approx\log\mathcal{N}\!\left(0,\;\sigma_\epsilon^2I+(1-\alpha)K_{f,f}+\alpha Q\right)+\log\left\{1+\frac{1-\alpha}{\sigma_\epsilon^2}\mathrm{Tr}\!\left(K_{f,f}-Q\right)+O\!\left(\frac{(1-\alpha)^2}{\sigma_\epsilon^4}\right)\right\}^{-\frac{\alpha}{2(1-\alpha)}}.$$

While this form is attractive, it is not practically useful, since the approximation breaks down when the trace term is "large". In the analysis section, we will instead use the exact determinant form of $C_x$ to prove the convergence result.

## The Data-dependent Upper Bound

###### Lemma 11.

Suppose we have two positive semi-definite (PSD) matrices $A$ and $B$ such that $A-B$ is also a PSD matrix; then $|A|\ge|B|$. Furthermore, if $A$ and $B$ are positive definite (PD), then $B^{-1}-A^{-1}$ is PSD.

This lemma has been proved in (Horn and Johnson, 2012). Based on this lemma, we can compute a data-dependent upper bound on the log-marginal likelihood (Titsias, 2014).

.

###### Proof.

Since

$$K_{f,f}+\sigma_\epsilon^2I=(1-\alpha)K_{f,f}+\alpha K_{f,f}+\sigma_\epsilon^2I\succeq(1-\alpha)K_{f,f}+\alpha Q+\sigma_\epsilon^2I\succeq0,$$

where $A\succeq B$ means that $A-B$ is PSD. Then we can obtain $|K_{f,f}+\sigma_\epsilon^2I|\ge|(1-\alpha)K_{f,f}+\alpha Q+\sigma_\epsilon^2I|$, since both are PSD matrices. Therefore,

$$\frac{1}{|2\pi(K_{f,f}+\sigma_\epsilon^2I)|^{\frac12}}\le\frac{1}{|2\pi((1-\alpha)K_{f,f}+\alpha Q+\sigma_\epsilon^2I)|^{\frac12}}.$$

Let $V\Lambda V^{T}$ be the eigen-decomposition of $K_{f,f}-Q$; this decomposition exists since the matrix is symmetric PSD. Then $y^{T}(K_{f,f}-Q)\,y\le\lambda_{\max}\,y^{T}y$, where $\lambda_{\max}$ is the largest eigenvalue of $K_{f,f}-Q$. Apparently, $\lambda_{\max}\le\mathrm{Tr}(K_{f,f}-Q)$. Therefore, we can obtain

$$y^{T}(K_{f,f}+\sigma_\epsilon^2I)\,y\le y^{T}\left((1-\alpha)K_{f,f}+\alpha Q+\sigma_\epsilon^2I\right)y+\alpha\,\mathrm{Tr}(K_{f,f}-Q)\,y^{T}y=y^{T}\left((1-\alpha)K_{f,f}+\alpha Q+\alpha\,\mathrm{Tr}(K_{f,f}-Q)\,I+\sigma_\epsilon^2I\right)y.$$

Based on this inequality, it is easy to show that

$$e^{-\frac12y^{T}(K_{f,f}+\sigma_\epsilon^2I)^{-1}y}\le e^{-\frac12y^{T}\left((1-\alpha)K_{f,f}+\alpha Q+\alpha\,\mathrm{Tr}(K_{f,f}-Q)\,I+\sigma_\epsilon^2I\right)^{-1}y}.$$

Finally, we obtain

$$\frac{e^{-\frac12y^{T}(K_{f,f}+\sigma_\epsilon^2I)^{-1}y}}{|2\pi(K_{f,f}+\sigma_\epsilon^2I)|^{\frac12}}\le\frac{e^{-\frac12y^{T}\left((1-\alpha)K_{f,f}+\alpha Q+\alpha\,\mathrm{Tr}(K_{f,f}-Q)\,I+\sigma_\epsilon^2I\right)^{-1}y}}{|2\pi((1-\alpha)K_{f,f}+\alpha Q+\sigma_\epsilon^2I)|^{\frac12}}.$$

We will use this upper bound to prove our main theorem.
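The final inequality can be verified numerically. The sketch below builds a toy RBF kernel matrix, a Nyström approximation $Q$ from a random subset of columns (so that $K_{f,f}-Q$ is a PSD Schur complement), and checks the log-density inequality; all sizes and hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy check of the data-dependent upper bound (sizes and values assumed).
N, Mi, noise, alpha = 25, 5, 0.1, 0.4
X = np.sort(rng.uniform(-3, 3, N))
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(N)
idx = rng.choice(N, Mi, replace=False)
# Nystrom approximation from a column subset; K - Q is PSD (Schur complement).
Q = K[:, idx] @ np.linalg.solve(K[np.ix_(idx, idx)], K[idx, :])
y = rng.normal(0, 1, N)

def log_density(S_det, S_quad):
    """log of |2 pi S_det|^{-1/2} * exp(-0.5 y^T S_quad^{-1} y)."""
    _, logdet = np.linalg.slogdet(2 * np.pi * S_det)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(S_quad, y)

A = K + noise * np.eye(N)
B = (1 - alpha) * K + alpha * Q + noise * np.eye(N)
B_quad = B + alpha * np.trace(K - Q) * np.eye(N)

lhs = log_density(A, A)        # exact log-marginal density
rhs = log_density(B, B_quad)   # data-dependent upper bound
print(lhs, rhs)
```

Both ingredients of the proof appear separately: the determinant comparison uses $A\succeq B$, and the quadratic-form comparison uses $B_{\mathrm{quad}}\succeq A$, so `lhs <= rhs` should hold for any draw of `y`.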

Let and .