1 Introduction
Regression methods based on Gaussian processes
(GPs) are used in many machine learning applications due to their modelling flexibility, robustness to overfitting and availability of wellcalibrated predictive uncertainty estimates. The areas of applications range from social and natural science through engineering. In the area of control engineering, GPs have been used in system identification for impulse response estimation
[24, 7, 25, 23], nonlinear ARX models [17, 3], learning of ODEs [1, 19] and latent force modelling [37]. The problem of learning the state space of a nonlinear dynamical system using GPs is discussed in [9, 21, 33]. Besides the many benefits, GPs have one main drawback: they are not suitable for large data sets due to their memory requirements and computational cost, where is the number of training samples.Over the past decade, many different approximations were developed to reduce this cost. The most common and successful methodology to approximate GPs is based on socalled inducing point methods, where the unknown function is represented by its values at a set of pseudoinputs (called inducing points), with . Early attempts to sparse GP regression were based on the Subset of Regressors (SoR/DIC) approximation due to [30, 36, 31]. However, these methods produce overconfident predictions when leaving the training data. In order to produce sensible uncertainty estimates, Deterministic Training Conditional (DTC) [8, 29], Fully Independent Training Conditional (FITC) [32], Fully Independent Conditional (FIC) [26] and Partially Independent Training Conditional (PITC) [26] were suggested where in each model the joint prior over the latent function and test values is modified differently. Titsias [34] instead proposed to retain the exact prior but to perform approximate (variational) inference, leading to the Variational Free Energy (VFE) method which converges to full GP as increases. Based on the minimization of an divergence, Bui et al. [5] presented Power Expectation Propagation (PEP), which unifies several of the previous mentioned models. Inference in such models can typically be done in time and
space. In order to find good parameters (inducing input points and kernel hyperparameters), either the marginal loglikelihood of the sparse models or a lower bound are used as an objective function for numerical optimization.
The majority of previous work in GP approximation has focused on the batch setting, that is, all data is available at once and can be processed together. However, for big data, where the number of samples can be many millions, keeping all data in memory is not possible and the data might even arrive sequentially.
In a streaming setup, Bui et al. [4] developed an algorithm to update hyperparameters in an online fashion. While this method is promising in this setting, its accuracy is limited by considering each sample only once. In this work we focus on the common setting where hyperparameters are learned by reconsidering minibatches several times. In order to speed up the optimization, we would like to update the parameters more frequently for a subset of data and update the posterior in a sequential way.
In this setting, Hensman et al. [12] applied Stochastic Variational Inference (SVI, [14]) to an uncollapsed lower bound of the marginal likelihood. Compared to the collapsed bound of the VFE model, the (variational) posterior distribution is not eliminated analytically and is an explicit part of the objective function. The resulting Stochastic Variational Gaussian Process (SVGP) method allows to optimize the parameters with minibatches of the data. One limitation of SVGP is the large number of parameters (, where is the input space dimension) to be numerically estimated since the update of the posterior distribution is not given analytically which leads to a high dimensional optimization problem.
Inspired by the work of [12] , the authors in [22] demonstrated the high scalability of this method by exploiting distributed machine learning platforms. The work of [12] was extended by [13] to the DTC, FI(T)C and PITC models for a variational anytime framework based on SVI with the corresponding uncollapsed bounds. However, it was assumed that the parameters are given in advance, that is, only the (variational) posterior distribution is learnt via SVI.
Although showing high scalability and desirable approximation, all these methods based on SVI have two main drawbacks: i) the (variational) posterior is not given analytically, which leads to additional many parameters; ii) the uncollapsed bounds are in practice often less tight than the corresponding collapsed batch bounds because the (variational) posterior is not optimally eliminated. Overall, the huge number of parameters leads to an optimization problem that is hard to tune and relies particularly on appropriately decaying learning rates. Even for fixed parameters, when no nonlinearity is involved, there is still the need of reconsidering each sample many times.
An orthogonal direction was pursued by the authors in [11] and [28], where a connection between GPs and State Space models for particular kernels was established for spatiotemporal regression problems, which allows to apply sequential algorithm such as the Kalman Filter, see also [6, 2]. Inspired by this line of research, the authors in [35] focused on efficient implementation and extended the methodology to varying sampling locations over time. Although these approaches can deal with sequential data and solve the problem of temporal time complexity, the space complexity is still cubic in the number of measurements. In addition, the hyperparameters are assumed to be fixed in advance.
In order to improve these shortcomings, we show that sparse inducing point models can be seen as a Bayesian kernelized linear regression model with input dependent observation noise with a particular choice of basis and noise covariance function, respectively. Given these insights, we show how to apply recursive estimation algorithms like the Kalman Filter (KF) and Information Filter (IF)
[15], which allows to train sparse GP methods analytically and exactly in an online or distributed setting (considering each sample only once, as opposed to the work in [12] and [13]) for given parameters. After processing all data, the obtained posterior distribution is equivalent to the corresponding batch version. This might constitute an interesting method on its own for many applications where there is prior knowledge about the parameters or they can be estimated offline.For unknown parameters, we propose a recursive collapsed
lower bound to the marginal log likelihood, which can be used for stochastic optimization with minibatches. Our approach is based on recursively exploiting the chain rule for derivatives by recursively propagating the analytical gradients of the posterior which enables to compute the derivatives of the lower bound sequentially. When computing the gradients of the recursive collapsed bound in a nonstochastic way, they exactly match the corresponding batch ones. This approach constitutes an efficient method to train a very general class of sparse GP regression models with much fewer parameters to be estimated numerically (
) than state of the art sequential GP regression methods (). Since the number of inducing points determines the quality of the approximation to full GP, this reduction in number of parameters from to is crucial and results in more accurate and faster convergence than the state of the art approach (SVGP), which we demonstrate in several experiments.In Section 2, we briefly review exact and sparse GP for regression in the batch case as well as the state of the art for sequential sparse GP learning. In Section 3, we establish the connection between sparse GP regression and recursive estimation by introducing a Bayesian kernelized linear regression model with input dependent observation noise for which we discuss the batch as well as its recursive solutions for given parameters. For unknown parameters, we propose in Section 4 a novel learning procedures  Stochastic Recursive Gradient Propagation (SRGP) ^{1}^{1}1Code is available on https://github.com/manuelIDSIA/SRGP.  based on recursively propagating the analytic gradients of the posterior, which allows to optimize the parameters with minibatches. We demonstrate in Section 5 the resulting faster convergence and the superior performance of our proposed algorithm for synthetic as well realworld data with up to a million of samples. Section 6 concludes the work and presents future research directions.
2 GP Regression
Suppose we are given a training set of pairs of inputs and noisy scalar outputs generated by adding independent Gaussian noise to a latent function , that is , where . We denote with
the vector of observations and with
.We can model with a Gaussian Process (GP), which defines a prior over functions and can be converted into a posterior over functions once we have observed some data (see e.g. [27]). To describe a GP, we only need to specify a mean and a covariance function . The latter is a positive definite kernel function [27], for instance the squared exponential (SE) kernel with individual lengthscales for each dimension, that is with . We assume that the mean function is zero for the sake of simplicity and we use the SE kernel throughout this paper even thought all methods introduced here work with any positive definite kernel. Since the joint prior on the training values and a test latent function value at an unknown test point
is multivariate Gaussian, it is straightforward to update this knowledge with observed data and thus doing inference analytically by manipulating multivariate Gaussian distributions. Combining it with the likelihood
yields the posterior predictive distribution
with(1) 
We define for any and with the corresponding rows . For brevity, we use indicating . This posterior GP depends via the kernel matrices on the hyperparameters which can be estimated by maximizing the log marginal likelihood
(2) 
Although this is an elegant approach for probabilistic regression, it is intractable for large datasets since the computations for inference require the inversion of the matrix in Eq. (1) which scales as in time and for memory (for given ).
2.1 Batch Sparse GP Regression
In order to scale GPs to large data, several approaches have been proposed based on socalled inducing point methods, where the sparsity is achieved via inducing points to optimally summarize the dependency of the whole training data. The inducing inputs are in the dimensional input data space and the inducing outputs are the corresponding GPfunction values (consider also Figure 1). The GP prior over and is augmented with the inducing outputs , leading to a joint and marginal prior, respectively. By marginalizing out the inducing points, the original prior is recovered. The fundamental approximation in all sparse GP models is that given the inducing outputs , is independent for any . Consequently, inference in these models can be done in time and space [32]. From an abstract point of view, inducing point methods differ from each other in 3 main aspects:

training: inferring inducing outputs given inducing inputs and kernel hyperparameters

prediction: performing prediction at a new test point given , and

optimization: finding optimal inducing inputs and kernel hyperparameters
In the following, we briefly report a general sparse GP model due to [5] which unifies several previous sparse inducing point GP models in the batch case. It is based on the minimization of an divergence or can be equivalently seen as applying Power Expectation Propagation (PEP). We report the sparse predictive distribution which summarizes the training and prediction part as well as the bound to the marginal loglikelihood, which can be used for optimizing the parameters . In the following, we denote and for any . In the PEP model, the predictive distribution is given by
(3) 
where . For this model, a lower bound to the sparse log marginal likelihood is analytically available
(4) 
where we omit the explicit dependency on via and for the sake of brevity. Similar as with the marginal likelihood (2) for full GP, this bound can be used to learn the parameters . This model unifies several previous sparse GP inference approaches including FITC () [32] and VFE () [34].
We want to highlight the special case , since it was the first approach where the approximating process was explicitly linked to full GP and it serves as starting point for the sequential case discussed below. Instead of modifying the prior and applying exact inference, the idea is to retain the exact prior and to perform approximate (variational) inference. In its original formulation, [34] proposed to maximize a variational lower bound to the true GP marginal likelihood, obtaining the Variational Free Energy (VFE) or the collapsed lower bound
(5) 
where the variational distribution
over the inducing points is optimally eliminated and analytically available.
The novelty of this bound is that
the rightmost term in (5)
acts as a regularizer that prevents overfitting and has the effect that the sparse GP predictive distribution (3)
converges rigorously [34] to the exact GP predictive distribution (1) for increasing number of inducing points when optimizing with (5).
A more thorough overview of several sparse inducing point GP models is given in the Appendix B and for further details we refer to [5, 18, 26, 27] for recent reviews on the subject.
2.2 Sequential Sparse GP Regression
Focusing on the VFE model, the optimization for of the collapsed lower bound (5) requires to process the whole dataset in a batch sense, which is very inefficient and not feasible for large . We would like to update the parameters more frequently, therefore, we split the data into minibatches of size and denote the corresponding sparse GP value. In order to achieve faster convergence in the optimization, Hensman et al. [12] introduced Stochastic Variational Gaussian Process (SVGP) where they applied stochastic optimization to an uncollapsed lower bound to the log marginal likelihood
(6) 
where the variational distribution is part of the bound and explicitly parametrized as . This uncollapsed bound satisfies with equality when inserting the optimal mean and covariance of the variational distribution of VFE. The key property of this bound is that it can be written as a sum of terms, which allows Stochastic Variational Inference (SVI, [14]). Unfortunately, collapsing the bound, i.e. inserting the optimal distribution, reintroduces dependencies between the observations, and eliminates the global parameter which is needed for SVI. Therefore, all variational parameters are numerically estimated by following the noisy gradients of a stochastic estimate of the lower bound . By passing through the training data a sufficient number of times, the variational distribution converges to the batch solution of VFE method. However, the disadvantage of this approach is the large number of parameters, since in addition to the parameters , all entries in the mean vector and the covariance matrix have to be estimated numerically, which is in order .
3 Recursive Sparse GP Regression
In this section we establish the connection between Bayesian recursive estimation and sparse inducing point GP models. In a first step, we explicitly report for a large class of sparse inducing point GP models their weightspace view (see [27], Ch. 2.1) which can also be seen as a particular kernelized version of a Bayesian linear regression model. In particular, we discuss the explicit choices of basis functions, the model specific observation noise, and the predictive transition distribution in a Bayesian kernelized linear regression model with additional input dependent observation noise for several sparse GP methods discussed by [27] for the batch case. Using these insights, we can exploit in a next step the recursive estimation algorithms known as the update equations of the Kalman Filter (KF) and Information Filter (IF) which allows to train many sparse methods analytically either in a online or distributed setting for given parameters. For the purpose of parameter estimation (which we will discuss in the next section), we discuss the recursive marginal loglikelihood with a model specific regularization term.
3.1 WeightSpace View of Generic Sparse GP
For any , consider the generic sparse GP model
(7) 
where the sparse GP value is modeled by a linear combination of basisfunctions , (stochastic) weights with a prior and an input dependent error term that takes into account the sparse approximation. For , the noisy observations are obtained by adding independent noise to , yielding the model
(8)  
(9) 
where we explicitly distinguish the training and test cases depending on the input and . Assuming , and are independent, by linearity and Gaussianity we can compactly write
(10)  
(11)  
(12)  
(13) 
where and denote different covariances. Combining (10) and (12) by integrating out (40) gives rise to the likelihood where . Thus, the generic sparse GP model given by the prior together with this likelihood can be seen as a Bayesian nonlinear regression model with additional input dependent observation noise with a particular choice of basis functions and covariance structure. For inducing inputs , we explicit , and . By different choices of the quantities and , a range of different sparse GP models are obtained. For fixed , these models differ from each other only by the choice of the input dependent observation noise covariance and the prediction covariance . More concretely, for instance choosing , and , the sparse models FITC, VFE and PEP (see also Appendix B), respectively, are recovered. The details of different choices for more sparse models are summarized in the bottom table in Figure 2.
3.2 Training
After the prior and the likelihood are specified in a Bayesian regression model, the posterior over the weights conditioned on the data can be computed either in a batch or in a recursive manner.
3.2.1 Batch Estimation
The batch likelihood is with and a blockdiagonal matrix with blocks . The posterior over given the data can be obtained by Bayes’ rule, i.e.
(14) 
with and where we used the standard linear Gaussian identity in (41).
3.2.2 Recursive Estimation
An equivalent solution can be obtained by propagating recursively . By interpreting this previous posterior as the prior, the updated posterior can be recursively computed by
(15) 
where
and
.
Kalman Filter like updating
The Kalman Filter [15] constitutes an efficient way to update the mean and covariance of .
Applying
the matrix inversion lemma
(38) to in Eq. (15)
and introducing temporary variables
yields
(16) 
Starting the recursion with
and
,
the posterior distribution at step is equivalent to (14) independent of the order of the data.
We want to emphasize that the only difference in the estimation part between the sparse GP models is in the additional noise .
Information Filter like updating
Using the natural parameter representation of a Gaussian , it directly follows from (15) that the posterior can be recursively computed
(17) 
with and . For any , it holds and We can also write
(18) 
where and . It is interesting to observe that and are independent of any other or .
Whether using the KF or IF for the recursive computation of the sparse posterior depends on the the number of inducing points and the size of the minibatches . If the number of inducing points is larger than the size of the minibatches, i.e. , the KF is cheaper, since only the matrix of size has to be inverted, which is inexpensive regarding to the number of inducing points . On the other hand, if (and assuming diagonal), the recursive IF approach is computationally cheaper since the inversion of the posterior precision is inexpensive regarding to the size of the minibatch.
Transformation
Instead of running a KF with
,
and
, an equivalent predictive distribution is also obtained when using
and together with
.
For any we then propagate a transformed posterior distribution
,
and
,
,
respectively.
For IF we apply an equivalent transformation, using ,
and
which results in propagating
,
.
This parametrization constitutes a computational shortcut, since the basis functions are very easy to interpret and do not include any matrix multiplication.
Note that also the log marginal likelihood discussed below is not affected by this transformation.
3.3 Prediction
Given a new , the predictive distribution after seeing of the sparse GP methods can be obtained by
(19) 
using with and the model specific prediction covariance. The predictions for are obtained by adding to the covariance of . At step , by applying (38) to the batch covariance in (14), we get for the predictive distribution in (19)
(20)  
where Inserting the particular choices for , and yields the usual formulation for the sparse predictive distribution
(21) 
Depending on the choice of the covariances and , we obtain for instance (3), (42) or (45) for PEP, VFE and FITC, respectively.
3.4 Marginal Likelihood
For model selection (or hyperparameters estimation) in the batch setting, the log marginal likelihood of a Bayesian regression model can be computed by marginalizing out , that is
(22) 
For the recursive setting, can be factorized into where the relative marginal likelihood is given by
(23) 
The log of the joint marginal likelihood involving all terms of (3.4) can be explicitly written as
(24) 
By maximizing iteratively a lower bound of the recursive factorized marginal likelihood in (3.4) leads to the recursive KF updates in (16) for the posterior and the resulting lower bound
(25) 
includes a model specific regularization term (consider the corresponding column in Figure 2).
We refer to this as the recursive collapsed bound and
a detailed derivation
for the VFE model is given in Appendix D. Using the model specific quantities and , this recursive computation of the lower bound of the marginal likelihood are equivalent to the batch counterparts for all sparse models, for instance
(5) and (4) for VFE and PEP, respectively.
Compared to the collapsed bound in (5), our recursive collapsed bound in (25) decomposes into a recursive sum over the minibatches which allows to optimize the hyperparameters sequentially. The advantage, compared to the uncollapsed bound in (6), is that the variational distribution is recursively and analytically eliminated, thus reducing the number of parameters to be numerically estimated drastically from to
.
3.5 Online and Distributed Learning
The connection between sparse GP models and recursive estimation established in the previous sections allows us to train the sparse GP models analytically either online for streaming data or in a distributed setting for fixed . In the first case, we propagate either the mean and covariance (16) or the natural mean and precision (17) (depending on and ) of the current posterior for each recursively. In the second case we compute and (18) distributed among computational nodes, and aggregate afterwards centralized the joint full posterior . At step , after seeing each sample once, we get exactly the same distribution for the inducing outputs and the same predictive distribution as the corresponding batch counterparts. In particular, exploiting the transformation from the Section 3.2.2 allows efficient online and distributed inference.
Suppose we have data samples in (only for illustration purposes, it works as well as for more dimensions) and we want to use the PEP model with with inducing points. We show on a toy example the above proposed online and distributed learning procedures for sparse GPs when is available in advance.
3.5.1 Online Setting
Assuming the data samples arrive sequentially in a stream. Thus we have and . Using the transformation, we have and and we set to . For each data sample we compute the residual
, the innovation variance
and the Kalman gain . The (transformed) posterior distribution over the inducing points can then be updated by(26) 
In order to make predictions for new data , we use and and apply (19). Note that there is no need to transform back the posterior over the inducing points, since it is already taken into account in the prediction step. After processing all samples, the predictive distribution and the cumulative bound of marginal loglikelihood correspond to the batch version, which is depicted in Figure 3.
3.5.2 Distributed Setting
Assuming the data is available in a batch, but we want to distribute the computation among computational nodes, thus we have . Using the transformation, we have , , where . Each node now computes
Afterwards, we sum up at a central node the (transformed) posterior distribution over the inducing points with and followed by and . Prediction can be done with and as explained in the online setting. This procedure is illustrated in Figure 4. Note that the samples are only sorted for illustration purposes.
4 Hyperparameters Estimation
In the previous section we discussed online and distributed procedures for analytical computation of sparse GP models for fixed where we recovered the batch solution after seeing each sample once. In this section we show how to exploit the connection to recursive estimation when optimizing in a sequential or distributed way. Note that we only estimate these parameters and not all entries in the posterior mean vector and covariance matrix compared to SVGP where no analytic updates of these quantities are available.
In particular, we explain how to use the recursive collapsed bound in (28) by using recursive gradient propagation which enables the application of stochastic optimization.
Moreover, we present an idea based on distributed gradient cumulation of the natural mean and precision with respect to the parameters which allows to alternatively distribute the computations.
Both these techniques can be applied to all sparse GP models discussed in Section 2.1.
Finding a
maximizer
for of an objective function can be achieved by applying Stochastic Gradient Descent(SGD), where the update can be written as
(27) 
where might be a sophisticated function of (for instance using ADAM [16], where also a bias correction term is included). We call one pass over the
minibatches an epoch. We denote
the estimate of in epoch for minibatch .4.1 Recursive Gradient Propagation (RGP)
We recall the recursive collapsed bound in (25)
(28) 
where and the model specific regularization term. Since decomposes into a (recursive) sum over the minibatches, we directly compute the derivative of w.r.t. . The derivative of is straightforward, for the former we can write
(29) 
with and . It is important to note that ignoring naively the dependency of through and completely forgets the past and thus results in overfitting the current minibatch. In order to compute the derivatives of and , we exploit the chain rule for derivatives and recursively propagate the gradients of the natural mean and the precision over time, that is
(30) 
For simplicity, we show here the gradient propagation of the IF, depending on the and , it might be cheaper to use the KF to propagate recursively the gradients of the computations of and in (16), that is
(31)  
where , and are computed recursively according to (16). Computing the derivatives of as explained in (29) and (30)/(31), the stochastic gradient
(32) 
can be computed for each minibatch . Suppose keeping constant, the cumulative derivative at step for the recursive collapsed bound is equal to the corresponding derivatives of the batch sparse bound, which can be verified for the toy example in Figure 3. The numbers in the bottom left and right corners show the cumulative recursive collapsed bound and its cumulative derivative (abbreviated as ) with respect to the lengthscale, respectively. We get for the lower bound of the marginal likelihood as well as its derivatives exactly the same value
as the corresponding batch counterpart in Figure 1.
For Stochastic Recursive Gradient Propagation (SRGP),
in each epoch and minibatch , we interleave the update step of the inducing points in Equation (15) with the SGD update (27) of the parameters , i.e.
(33) 
More concretely, we update after each minibatch the parameters with (32),(27) and propagate recursively the posterior with (18) and its derivative (30). In order to compute all the derivatives with respect to , we can exploit several matrix derivative rules which simplifies the computation significantly. A detailed algorithm is provided in the Appendix C.
In the following, we assume that the batch size is larger than the number of inducing points . For one minibatch, the time complexity to update the posterior is dominated by
matrix multiplications of size and , thus
.
In order to propagate the gradients of the posterior and to compute the derivative of the bound needs for a minibatch and a parameter . Thus, updating a minibatch including all parameters costs for the SRGP method.
Since SRGP stores the gradients of the posterior, it requires storage.
On the other hand, SVGP needs storage and time per minibatch, where the latter can be broken down into once and for each of the parameters.
This means, for moderate dimensions, our algorithm has the same time complexity as state of the art method SVGP. However, due to the analytic updates of the posterior we achieve an higher accuracy and less epochs are needed which can be confirmed in Figure 5 and in Section 5 empirically.
Figure 5 shows the convergence of SRGP on a 1D toy example with data samples and inducing points. The parameters are sequentially optimized with our recursive approach (blue) and as comparison with SVGP (green) with a minibatch size of over several epochs. The rootmeansquarederror (RMSE) computed on test points, the bound of the logmarginal likelihood (LML) as well as the hyperparameters converge in a few iterations to the corresponding batch values of VFE (red). Due to the analytic updates of the posterior, the accuracy is higher and SRGP needs much less epochs until convergence.