# Recursive Estimation for Sparse Gaussian Process Regression

Gaussian Processes (GPs) are powerful kernelized methods for non-parameteric regression used in many applications. However, their plain usage is limited to a few thousand of training samples due to their cubic time complexity. In order to scale GPs to larger datasets, several sparse approximations based on so-called inducing points have been proposed in the literature. The majority of previous work has focused on the batch setting, whereas in this work we focusing on the training with mini-batches. In particular, we investigate the connection between a general class of sparse inducing point GP regression methods and Bayesian recursive estimation which enables Kalman Filter and Information Filter like updating for online learning. Moreover, exploiting ideas from distributed estimation, we show how our approach can be distributed. For unknown parameters, we propose a novel approach that relies on recursively propagating the analytical gradients of the posterior over mini-batches of the data. Compared to state of the art methods, we have analytic updates for the mean and covariance of the posterior, thus reducing drastically the size of the optimization problem. We show that our method achieves faster convergence and superior performance compared to state of the art sequential Gaussian Process regression on synthetic GP as well as real-world data with up to a million of data samples.

## Authors

• 1 publication
• 3 publications
• 8 publications
• 16 publications
• ### Ensemble Kalman Filtering for Online Gaussian Process Regression and Learning

Gaussian process regression is a machine learning approach which has bee...
07/09/2018 ∙ by Danil Kuzin, et al. ∙ 0

• ### Sparse-posterior Gaussian Processes for general likelihoods

Gaussian processes (GPs) provide a probabilistic nonparametric represent...
03/15/2012 ∙ by Yuan, et al. ∙ 0

• ### Efficient Spatio-Temporal Gaussian Regression via Kalman Filtering

In this work we study the non-parametric reconstruction of spatio-tempor...
05/03/2017 ∙ by Marco Todescato, et al. ∙ 0

• ### Constant-Time Predictive Distributions for Gaussian Processes

One of the most compelling features of Gaussian process (GP) regression ...
03/16/2018 ∙ by Geoff Pleiss, et al. ∙ 0

• ### Streaming Sparse Gaussian Process Approximations

Sparse pseudo-point approximations for Gaussian process (GP) models prov...
05/19/2017 ∙ by Thang D. Bui, et al. ∙ 0

• ### Efficient Gaussian Process Classification Using Polya-Gamma Data Augmentation

We propose an efficient stochastic variational approach to GP classifica...
02/18/2018 ∙ by Florian Wenzel, et al. ∙ 0

• ### Large-scale Heteroscedastic Regression via Gaussian Process

Heteroscedastic regression which considers varying noises across input d...
11/03/2018 ∙ by Haitao Liu, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Regression methods based on Gaussian processes

(GPs) are used in many machine learning applications due to their modelling flexibility, robustness to overfitting and availability of well-calibrated predictive uncertainty estimates. The areas of applications range from social and natural science through engineering. In the area of control engineering, GPs have been used in system identification for impulse response estimation

[24, 7, 25, 23], nonlinear ARX models [17, 3], learning of ODEs [1, 19] and latent force modelling [37]. The problem of learning the state space of a nonlinear dynamical system using GPs is discussed in [9, 21, 33]. Besides the many benefits, GPs have one main drawback: they are not suitable for large data sets due to their memory requirements and computational cost, where is the number of training samples.
Over the past decade, many different approximations were developed to reduce this cost. The most common and successful methodology to approximate GPs is based on so-called inducing point methods, where the unknown function is represented by its values at a set of pseudo-inputs (called inducing points), with . Early attempts to sparse GP regression were based on the Subset of Regressors (SoR/DIC) approximation due to [30, 36, 31]. However, these methods produce overconfident predictions when leaving the training data. In order to produce sensible uncertainty estimates, Deterministic Training Conditional (DTC) [8, 29], Fully Independent Training Conditional (FITC) [32], Fully Independent Conditional (FIC) [26] and Partially Independent Training Conditional (PITC) [26] were suggested where in each model the joint prior over the latent function and test values is modified differently. Titsias [34] instead proposed to retain the exact prior but to perform approximate (variational) inference, leading to the Variational Free Energy (VFE) method which converges to full GP as increases. Based on the minimization of an -divergence, Bui et al. [5] presented Power Expectation Propagation (PEP), which unifies several of the previous mentioned models. Inference in such models can typically be done in time and

space. In order to find good parameters (inducing input points and kernel hyperparameters), either the marginal log-likelihood of the sparse models or a lower bound are used as an objective function for numerical optimization.

The majority of previous work in GP approximation has focused on the batch setting, that is, all data is available at once and can be processed together. However, for big data, where the number of samples can be many millions, keeping all data in memory is not possible and the data might even arrive sequentially.
In a streaming setup, Bui et al. [4] developed an algorithm to update hyperparameters in an online fashion. While this method is promising in this setting, its accuracy is limited by considering each sample only once. In this work we focus on the common setting where hyperparameters are learned by reconsidering mini-batches several times. In order to speed up the optimization, we would like to update the parameters more frequently for a subset of data and update the posterior in a sequential way.
In this setting, Hensman et al. [12] applied Stochastic Variational Inference (SVI, [14]) to an uncollapsed lower bound of the marginal likelihood. Compared to the collapsed bound of the VFE model, the (variational) posterior distribution is not eliminated analytically and is an explicit part of the objective function. The resulting Stochastic Variational Gaussian Process (SVGP) method allows to optimize the parameters with mini-batches of the data. One limitation of SVGP is the large number of parameters (, where is the input space dimension) to be numerically estimated since the update of the posterior distribution is not given analytically which leads to a high dimensional optimization problem.
Inspired by the work of [12] , the authors in [22] demonstrated the high scalability of this method by exploiting distributed machine learning platforms. The work of [12] was extended by [13] to the DTC, FI(T)C and PITC models for a variational anytime framework based on SVI with the corresponding uncollapsed bounds. However, it was assumed that the parameters are given in advance, that is, only the (variational) posterior distribution is learnt via SVI.
Although showing high scalability and desirable approximation, all these methods based on SVI have two main drawbacks: i) the (variational) posterior is not given analytically, which leads to additional many parameters; ii) the uncollapsed bounds are in practice often less tight than the corresponding collapsed batch bounds because the (variational) posterior is not optimally eliminated. Overall, the huge number of parameters leads to an optimization problem that is hard to tune and relies particularly on appropriately decaying learning rates. Even for fixed parameters, when no non-linearity is involved, there is still the need of reconsidering each sample many times.
An orthogonal direction was pursued by the authors in [11] and [28], where a connection between GPs and State Space models for particular kernels was established for spatio-temporal regression problems, which allows to apply sequential algorithm such as the Kalman Filter, see also [6, 2]. Inspired by this line of research, the authors in [35] focused on efficient implementation and extended the methodology to varying sampling locations over time. Although these approaches can deal with sequential data and solve the problem of temporal time complexity, the space complexity is still cubic in the number of measurements. In addition, the hyperparameters are assumed to be fixed in advance.

In order to improve these shortcomings, we show that sparse inducing point models can be seen as a Bayesian kernelized linear regression model with input dependent observation noise with a particular choice of basis and noise covariance function, respectively. Given these insights, we show how to apply recursive estimation algorithms like the Kalman Filter (KF) and Information Filter (IF)

[15], which allows to train sparse GP methods analytically and exactly in an online or distributed setting (considering each sample only once, as opposed to the work in [12] and [13]) for given parameters. After processing all data, the obtained posterior distribution is equivalent to the corresponding batch version. This might constitute an interesting method on its own for many applications where there is prior knowledge about the parameters or they can be estimated offline.
For unknown parameters, we propose a recursive collapsed

lower bound to the marginal log likelihood, which can be used for stochastic optimization with mini-batches. Our approach is based on recursively exploiting the chain rule for derivatives by recursively propagating the analytical gradients of the posterior which enables to compute the derivatives of the lower bound sequentially. When computing the gradients of the recursive collapsed bound in a non-stochastic way, they exactly match the corresponding batch ones. This approach constitutes an efficient method to train a very general class of sparse GP regression models with much fewer parameters to be estimated numerically (

) than state of the art sequential GP regression methods (). Since the number of inducing points determines the quality of the approximation to full GP, this reduction in number of parameters from to is crucial and results in more accurate and faster convergence than the state of the art approach (SVGP), which we demonstrate in several experiments.

In Section 2, we briefly review exact and sparse GP for regression in the batch case as well as the state of the art for sequential sparse GP learning. In Section 3, we establish the connection between sparse GP regression and recursive estimation by introducing a Bayesian kernelized linear regression model with input dependent observation noise for which we discuss the batch as well as its recursive solutions for given parameters. For unknown parameters, we propose in Section 4 a novel learning procedures - Stochastic Recursive Gradient Propagation (SRGP) 111Code is available on https://github.com/manuelIDSIA/SRGP. - based on recursively propagating the analytic gradients of the posterior, which allows to optimize the parameters with mini-batches. We demonstrate in Section 5 the resulting faster convergence and the superior performance of our proposed algorithm for synthetic as well real-world data with up to a million of samples. Section 6 concludes the work and presents future research directions.

## 2 GP Regression

Suppose we are given a training set of pairs of inputs and noisy scalar outputs generated by adding independent Gaussian noise to a latent function , that is , where . We denote with

the vector of observations and with

.
We can model with a Gaussian Process (GP), which defines a prior over functions and can be converted into a posterior over functions once we have observed some data (see e.g. [27]). To describe a GP, we only need to specify a mean and a covariance function . The latter is a positive definite kernel function [27], for instance the squared exponential (SE) kernel with individual lengthscales for each dimension, that is with . We assume that the mean function is zero for the sake of simplicity and we use the SE kernel throughout this paper even thought all methods introduced here work with any positive definite kernel. Since the joint prior on the training values and a test latent function value at an unknown test point

is multivariate Gaussian, it is straightforward to update this knowledge with observed data and thus doing inference analytically by manipulating multivariate Gaussian distributions. Combining it with the likelihood

yields the posterior predictive distribution

with

 μ∗=K∗X(KXX+σ2nI)−1yΣ∗=K∗∗−K∗X(KXX+σ2nI)−1KX∗. (1)

We define for any and with the corresponding rows . For brevity, we use indicating . This posterior GP depends via the kernel matrices on the hyperparameters which can be estimated by maximizing the log marginal likelihood

 logp(y|θ)=logN(y|0,KXX+σ2nI). (2)

Although this is an elegant approach for probabilistic regression, it is intractable for large datasets since the computations for inference require the inversion of the matrix in Eq. (1) which scales as in time and for memory (for given ).

### 2.1 Batch Sparse GP Regression

In order to scale GPs to large data, several approaches have been proposed based on so-called inducing point methods, where the sparsity is achieved via inducing points to optimally summarize the dependency of the whole training data. The inducing inputs are in the -dimensional input data space and the inducing outputs are the corresponding GP-function values (consider also Figure 1). The GP prior over and is augmented with the inducing outputs , leading to a joint and marginal prior, respectively. By marginalizing out the inducing points, the original prior is recovered. The fundamental approximation in all sparse GP models is that given the inducing outputs , is independent for any . Consequently, inference in these models can be done in time and space [32]. From an abstract point of view, inducing point methods differ from each other in 3 main aspects:

1. training: inferring inducing outputs given inducing inputs and kernel hyperparameters

2. prediction: performing prediction at a new test point given , and

3. optimization: finding optimal inducing inputs and kernel hyperparameters

In the following, we briefly report a general sparse GP model due to [5] which unifies several previous sparse inducing point GP models in the batch case. It is based on the minimization of an -divergence or can be equivalently seen as applying Power Expectation Propagation (PEP). We report the sparse predictive distribution which summarizes the training and prediction part as well as the bound to the marginal log-likelihood, which can be used for optimizing the parameters . In the following, we denote and for any . In the PEP model, the predictive distribution is given by

 μ∗=Q∗X(¯¯¯¯¯KXX+σ2nI)−1y,Σ∗=K∗∗−Q∗X(¯¯¯¯¯KXX+σ2nI)−1QX∗, (3)

where . For this model, a lower bound to the sparse log marginal likelihood is analytically available

 LPEP(Θ)=logN(y|0,¯¯¯¯¯KXX+σ2nI)−1−α2αN∑i=1log(1+ασ2n[DX]ii), (4)

where we omit the explicit dependency on via and for the sake of brevity. Similar as with the marginal likelihood (2) for full GP, this bound can be used to learn the parameters . This model unifies several previous sparse GP inference approaches including FITC () [32] and VFE () [34].

We want to highlight the special case , since it was the first approach where the approximating process was explicitly linked to full GP and it serves as starting point for the sequential case discussed below. Instead of modifying the prior and applying exact inference, the idea is to retain the exact prior and to perform approximate (variational) inference. In its original formulation, [34] proposed to maximize a variational lower bound to the true GP marginal likelihood, obtaining the Variational Free Energy (VFE) or the collapsed lower bound

 LVFE(Θ)=logN(y|0,QXX+σ2nI)−Tr[DX]2σ2n, (5)

where the variational distribution over the inducing points is optimally eliminated and analytically available. The novelty of this bound is that the rightmost term in (5) acts as a regularizer that prevents overfitting and has the effect that the sparse GP predictive distribution (3) converges rigorously [34] to the exact GP predictive distribution (1) for increasing number of inducing points when optimizing with (5).

A more thorough overview of several sparse inducing point GP models is given in the Appendix B and for further details we refer to [5, 18, 26, 27] for recent reviews on the subject.

### 2.2 Sequential Sparse GP Regression

Focusing on the VFE model, the optimization for of the collapsed lower bound (5) requires to process the whole dataset in a batch sense, which is very inefficient and not feasible for large . We would like to update the parameters more frequently, therefore, we split the data into mini-batches of size and denote the corresponding sparse GP value. In order to achieve faster convergence in the optimization, Hensman et al. [12] introduced Stochastic Variational Gaussian Process (SVGP) where they applied stochastic optimization to an uncollapsed lower bound to the log marginal likelihood

 (6)

where the variational distribution is part of the bound and explicitly parametrized as . This uncollapsed bound satisfies with equality when inserting the optimal mean and covariance of the variational distribution of VFE. The key property of this bound is that it can be written as a sum of terms, which allows Stochastic Variational Inference (SVI, [14]). Unfortunately, collapsing the bound, i.e. inserting the optimal distribution, reintroduces dependencies between the observations, and eliminates the global parameter which is needed for SVI. Therefore, all variational parameters are numerically estimated by following the noisy gradients of a stochastic estimate of the lower bound . By passing through the training data a sufficient number of times, the variational distribution converges to the batch solution of VFE method. However, the disadvantage of this approach is the large number of parameters, since in addition to the parameters , all entries in the mean vector and the covariance matrix have to be estimated numerically, which is in order .

## 3 Recursive Sparse GP Regression

In this section we establish the connection between Bayesian recursive estimation and sparse inducing point GP models. In a first step, we explicitly report for a large class of sparse inducing point GP models their weight-space view (see [27], Ch. 2.1) which can also be seen as a particular kernelized version of a Bayesian linear regression model. In particular, we discuss the explicit choices of basis functions, the model specific observation noise, and the predictive transition distribution in a Bayesian kernelized linear regression model with additional input dependent observation noise for several sparse GP methods discussed by [27] for the batch case. Using these insights, we can exploit in a next step the recursive estimation algorithms known as the update equations of the Kalman Filter (KF) and Information Filter (IF) which allows to train many sparse methods analytically either in a online or distributed setting for given parameters. For the purpose of parameter estimation (which we will discuss in the next section), we discuss the recursive marginal log-likelihood with a model specific regularization term.

### 3.1 Weight-Space View of Generic Sparse GP

For any , consider the generic sparse GP model

 f(X) =H(X)u+γ(X) (7)

where the sparse GP value is modeled by a linear combination of basis-functions , (stochastic) weights with a prior and an input dependent error term that takes into account the sparse approximation. For , the noisy observations are obtained by adding independent noise to , yielding the model

 yk =fk+εk; (8) fk=Hku+γk and   f∗=H∗u+γ∗, (9)

where we explicitly distinguish the training and test cases depending on the input and . Assuming , and are independent, by linearity and Gaussianity we can compactly write

 p(yk|fk) =N(yk|fk,σ2nI); (10) p(u) =N(0,Σ0); (11) p(fk|u) =N(fk|Hku,¯¯¯¯Vk); (12) p(f∗|u) =N(f∗|H∗u,V∗), (13)

where and denote different covariances. Combining (10) and (12) by integrating out (40) gives rise to the likelihood where . Thus, the generic sparse GP model given by the prior together with this likelihood can be seen as a Bayesian non-linear regression model with additional input dependent observation noise with a particular choice of basis functions and covariance structure. For inducing inputs , we explicit , and . By different choices of the quantities and , a range of different sparse GP models are obtained. For fixed , these models differ from each other only by the choice of the input dependent observation noise covariance and the prediction covariance . More concretely, for instance choosing , and , the sparse models FITC, VFE and PEP (see also Appendix B), respectively, are recovered. The details of different choices for more sparse models are summarized in the bottom table in Figure 2.

### 3.2 Training

After the prior and the likelihood are specified in a Bayesian regression model, the posterior over the weights conditioned on the data can be computed either in a batch or in a recursive manner.

#### 3.2.1 Batch Estimation

The batch likelihood is with and a block-diagonal matrix with blocks . The posterior over given the data can be obtained by Bayes’ rule, i.e.

 p(u|y)∝p(y|u)p(u) ∝N(u|μK,ΣK), (14)

with and where we used the standard linear Gaussian identity in (41).

#### 3.2.2 Recursive Estimation

An equivalent solution can be obtained by propagating recursively . By interpreting this previous posterior as the prior, the updated posterior can be recursively computed by

 p(u|y1:k)∝p(yk|u)p(u|y1:k−1)∝N(u|μk,Σk), (15)

where and .

Kalman Filter like updating
The Kalman Filter [15] constitutes an efficient way to update the mean and covariance of . Applying the matrix inversion lemma (38) to in Eq. (15) and introducing temporary variables yields

 rk=yk−Hkμk−1;Sk=HkΣk−1HTk+Vk;Gk=Σk−1HTkS−1k;μk=μk−1+Gkrk;Σk=Σk−1−GkSkGTk. (16)

Starting the recursion with and , the posterior distribution at step is equivalent to (14) independent of the order of the data. We want to emphasize that the only difference in the estimation part between the sparse GP models is in the additional noise .

Information Filter like updating
Using the natural parameter representation of a Gaussian , it directly follows from (15) that the posterior can be recursively computed

 ηk=ηk−1+HTkV−1kyk;Λk=Λk−1+HTkV−1kHk, (17)

with and . For any , it holds and We can also write

 ηk=K∑k=1Δηk and Λk=Λ0+K∑k=1ΔΛk, (18)

where and . It is interesting to observe that and are independent of any other or .

Whether using the KF or IF for the recursive computation of the sparse posterior depends on the the number of inducing points and the size of the mini-batches . If the number of inducing points is larger than the size of the mini-batches, i.e. , the KF is cheaper, since only the matrix of size has to be inverted, which is inexpensive regarding to the number of inducing points . On the other hand, if (and assuming diagonal), the recursive IF approach is computationally cheaper since the inversion of the posterior precision is inexpensive regarding to the size of the mini-batch.

Transformation
Instead of running a KF with , and , an equivalent predictive distribution is also obtained when using and together with . For any we then propagate a transformed posterior distribution , and , , respectively. For IF we apply an equivalent transformation, using , and which results in propagating , .
This parametrization constitutes a computational shortcut, since the basis functions are very easy to interpret and do not include any matrix multiplication. Note that also the log marginal likelihood discussed below is not affected by this transformation.

### 3.3 Prediction

Given a new , the predictive distribution after seeing of the sparse GP methods can be obtained by

 p(f∗|y1:k)=∫p(f∗|u)p(u|y1:k)du=N(f∗|H∗μk,H∗ΣkHT∗+V∗) (19)

using with and the model specific prediction covariance. The predictions for are obtained by adding to the covariance of . At step , by applying (38) to the batch covariance in (14), we get for the predictive distribution in (19)

 μ∗K =H∗Σ0HTΣy, (20) Σ∗K =H∗Σ0HT∗−H∗Σ0HTΣHΣ0HT∗+V∗,

where Inserting the particular choices for , and yields the usual formulation for the sparse predictive distribution

 (21)

Depending on the choice of the covariances and , we obtain for instance (3), (42) or (45) for PEP, VFE and FITC, respectively.

### 3.4 Marginal Likelihood

For model selection (or hyperparameters estimation) in the batch setting, the log marginal likelihood of a Bayesian regression model can be computed by marginalizing out , that is

 logp(y)=log∫p(y|u)p(u)du=logN(y|0,HΣ0HT+V). (22)

For the recursive setting, can be factorized into where the relative marginal likelihood is given by

 p(yk|y1:k−1) =N(yk|Hkμk−1,HkΣk−1HTk+Vk) =N(rk|0,Sk). (23)

The log of the joint marginal likelihood involving all terms of (3.4) can be explicitly written as

 logK∏k=1p(yk|y1:k−1)=K∑k=1logN(rk|0,Sk)=−N2log2π−12K∑k=1log|Sk|+rTkS−1krk. (24)

By maximizing iteratively a lower bound of the recursive factorized marginal likelihood in (3.4) leads to the recursive KF updates in (16) for the posterior and the resulting lower bound

 ψ(Θ)=K∑k=1logN(rΘk|0,SΘk)−ak(Θ) (25)

includes a model specific regularization term (consider the corresponding column in Figure 2). We refer to this as the recursive collapsed bound and a detailed derivation for the VFE model is given in Appendix D. Using the model specific quantities and , this recursive computation of the lower bound of the marginal likelihood are equivalent to the batch counterparts for all sparse models, for instance (5) and (4) for VFE and PEP, respectively.

Compared to the collapsed bound in (5), our recursive collapsed bound in (25) decomposes into a recursive sum over the mini-batches which allows to optimize the hyperparameters sequentially. The advantage, compared to the uncollapsed bound in (6), is that the variational distribution is recursively and analytically eliminated, thus reducing the number of parameters to be numerically estimated drastically from to .

### 3.5 Online and Distributed Learning

The connection between sparse GP models and recursive estimation established in the previous sections allows us to train the sparse GP models analytically either online for streaming data or in a distributed setting for fixed . In the first case, we propagate either the mean and covariance (16) or the natural mean and precision (17) (depending on and ) of the current posterior for each recursively. In the second case we compute and (18) distributed among computational nodes, and aggregate afterwards centralized the joint full posterior . At step , after seeing each sample once, we get exactly the same distribution for the inducing outputs and the same predictive distribution as the corresponding batch counterparts. In particular, exploiting the transformation from the Section 3.2.2 allows efficient online and distributed inference.

Suppose we have data samples in (only for illustration purposes, it works as well as for more dimensions) and we want to use the PEP model with with inducing points. We show on a toy example the above proposed online and distributed learning procedures for sparse GPs when is available in advance.

#### 3.5.1 Online Setting

Assuming the data samples arrive sequentially in a stream. Thus we have and . Using the transformation, we have and and we set to . For each data sample we compute the residual

, the innovation variance

and the Kalman gain . The (transformed) posterior distribution over the inducing points can then be updated by

 ~μk=~μk−1+~gkrk   and   ~Σk=~Σk−1−sk~gk~gTk. (26)

In order to make predictions for new data , we use and and apply (19). Note that there is no need to transform back the posterior over the inducing points, since it is already taken into account in the prediction step. After processing all samples, the predictive distribution and the cumulative bound of marginal log-likelihood correspond to the batch version, which is depicted in Figure 3.

#### 3.5.2 Distributed Setting

Assuming the data is available in a batch, but we want to distribute the computation among computational nodes, thus we have . Using the transformation, we have , , where . Each node now computes

 Δ~ηk =~HTkV−1kyk=KRXkDiag[v−1k]yk, Δ~Λk =~HTkV−1k~Hk=KRXkDiag[v−1k]KXkR.

Afterwards, we sum up at a central node the (transformed) posterior distribution over the inducing points with and followed by and . Prediction can be done with and as explained in the online setting. This procedure is illustrated in Figure 4. Note that the samples are only sorted for illustration purposes.

## 4 Hyperparameters Estimation

In the previous section we discussed online and distributed procedures for analytical computation of sparse GP models for fixed where we recovered the batch solution after seeing each sample once. In this section we show how to exploit the connection to recursive estimation when optimizing in a sequential or distributed way. Note that we only estimate these parameters and not all entries in the posterior mean vector and covariance matrix compared to SVGP where no analytic updates of these quantities are available. In particular, we explain how to use the recursive collapsed bound in (28) by using recursive gradient propagation which enables the application of stochastic optimization. Moreover, we present an idea based on distributed gradient cumulation of the natural mean and precision with respect to the parameters which allows to alternatively distribute the computations. Both these techniques can be applied to all sparse GP models discussed in Section 2.1.

Finding a maximizer for of an objective function can be achieved by applying Stochastic Gradient Descent(SGD), where the update can be written as

 θ(t)=θ(t−1)+γ(t−1)∂φk(Θ)∂θ|θ=θ(t−1), (27)

where might be a sophisticated function of (for instance using ADAM [16], where also a bias correction term is included). We call one pass over the

mini-batches an epoch. We denote

the estimate of in epoch for mini-batch .

### 4.1 Recursive Gradient Propagation (RGP)

We recall the recursive collapsed bound in (25)

 ψ(K)(Θ)=K∑k=1dk(Θ)−ak(Θ)=K∑k=1ψk(Θ) (28)

where and the model specific regularization term. Since decomposes into a (recursive) sum over the mini-batches, we directly compute the derivative of w.r.t. . The derivative of is straightforward, for the former we can write

 ∂dk(Θ)∂θ=−12∂log∣∣SΘk∣∣∂θ−12∂(rΘk)T(SΘk)−1rΘk∂θ (29)

with and . It is important to note that ignoring naively the dependency of through and completely forgets the past and thus results in overfitting the current mini-batch. In order to compute the derivatives of and , we exploit the chain rule for derivatives and recursively propagate the gradients of the natural mean and the precision over time, that is

 ∂ηΘk∂θ=∂ηΘk−1∂θ+∂ΔηΘk∂θ and ∂ΛΘk∂θ=∂ΛΘk−1∂θ+∂ΔΛΘk∂θ. (30)

For simplicity, we show here the gradient propagation of the IF, depending on the and , it might be cheaper to use the KF to propagate recursively the gradients of the computations of and in (16), that is

 ∂uk∂θ =∂uk−1∂θ+∂Gk∂θrk+Gk∂rk∂θ (31) ∂Σk∂θ =∂Σk−1∂θ−∂Gk∂θSkGTk−Gk∂Sk∂θGk−GkSk∂GTk∂θ,

where , and are computed recursively according to (16). Computing the derivatives of as explained in (29) and (30)/(31), the stochastic gradient

 ∂ψk(Θ)∂θ=∂dk(Θ)∂θ−∂ak(Θ)∂θ (32)

can be computed for each mini-batch . Suppose keeping constant, the cumulative derivative at step for the recursive collapsed bound is equal to the corresponding derivatives of the batch sparse bound, which can be verified for the toy example in Figure 3. The numbers in the bottom left and right corners show the cumulative recursive collapsed bound and its cumulative derivative (abbreviated as ) with respect to the lengthscale, respectively. We get for the lower bound of the marginal likelihood as well as its derivatives exactly the same value as the corresponding batch counterpart in Figure 1.

For Stochastic Recursive Gradient Propagation (SRGP), in each epoch and mini-batch , we interleave the update step of the inducing points in Equation (15) with the SGD update (27) of the parameters , i.e.

 (33)

More concretely, we update after each mini-batch the parameters with (32),(27) and propagate recursively the posterior with (18) and its derivative (30). In order to compute all the derivatives with respect to , we can exploit several matrix derivative rules which simplifies the computation significantly. A detailed algorithm is provided in the Appendix C.

In the following, we assume that the batch size is larger than the number of inducing points . For one mini-batch, the time complexity to update the posterior is dominated by matrix multiplications of size and , thus . In order to propagate the gradients of the posterior and to compute the derivative of the bound needs for a mini-batch and a parameter . Thus, updating a mini-batch including all parameters costs for the SRGP method. Since SRGP stores the gradients of the posterior, it requires storage.
On the other hand, SVGP needs storage and time per mini-batch, where the latter can be broken down into once and for each of the parameters. This means, for moderate dimensions, our algorithm has the same time complexity as state of the art method SVGP. However, due to the analytic updates of the posterior we achieve an higher accuracy and less epochs are needed which can be confirmed in Figure 5 and in Section 5 empirically.

Figure 5 shows the convergence of SRGP on a 1-D toy example with data samples and inducing points. The parameters are sequentially optimized with our recursive approach (blue) and as comparison with SVGP (green) with a mini-batch size of over several epochs. The root-mean-squared-error (RMSE) computed on test points, the bound of the log-marginal likelihood (LML) as well as the hyperparameters converge in a few iterations to the corresponding batch values of VFE (red). Due to the analytic updates of the posterior, the accuracy is higher and SRGP needs much less epochs until convergence.