# Consistent Online Gaussian Process Regression Without the Sample Complexity Bottleneck

Gaussian processes provide a framework for nonlinear nonparametric Bayesian inference widely applicable across science and engineering. Unfortunately, their computational burden scales cubically with the training sample size, which in the case that samples arrive in perpetuity, approaches infinity. This issue necessitates approximations for use with streaming data, which to date mostly lack convergence guarantees. Thus, we develop the first online Gaussian process approximation that preserves convergence to the population posterior, i.e., asymptotic posterior consistency, while ameliorating its intractable complexity growth with the sample size. We propose an online compression scheme that, following each a posteriori update, fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior, and greedily tosses out past kernel dictionary elements until its boundary is hit. We call the resulting method Parsimonious Online Gaussian Processes (POG). For diminishing error radius, exact asymptotic consistency is preserved (Theorem 1(i)) at the cost of unbounded memory in the limit. On the other hand, for constant error radius, POG converges to a neighborhood of the population posterior (Theorem 1(ii)), but with finite memory at worst determined by the metric entropy of the feature space (Theorem 2). Experimental results are presented on several nonlinear regression problems which illuminate the merits of this approach as compared with alternatives that fix the subspace dimension defining the history of past points.



## 1 Introduction

Gaussian process regression (Rasmussen, 2004) (kriging (Krige, 1951)) is a framework for nonlinear nonparametric Bayesian inference widely used in chemical processing (Kocijan, 2016), robotics (Deisenroth et al., 2015), and machine learning (Rasmussen, 2004), among other applications. Unfortunately, in the batch setting, its complexity scales cubically with the training sample size $N$. Moreover, in the online/stochastic setting, where the total number of training samples may be infinite or samples continually arrive in a streaming fashion, this complexity tends to infinity. Thus, the popular perception is that GPs cannot work online. In this work, we upend this perception by developing a method that approximately preserves the distributional properties of GPs in the online setting while yielding a worst-case complexity that is independent of the sample size, and is instead determined by the metric entropy of the feature space (Zhou, 2002).

Consider the batch setting with $N$ training examples. The complexity bottleneck comes from the posterior mean and covariance’s dependence on a data matrix, or kernel dictionary, that accumulates all past observations. To address this memory explosion, one must choose a subset of possible model points from the training set and “project the posterior distribution” onto the “subspace” defined by these samples. Various criteria for projection have been proposed based on information gain (Seeger et al., 2003), greedy compression (Smola and Bartlett, 2001), Nyström sampling (Williams and Seeger, 2001), probabilistic criteria (Solin and Särkkä, 2014; McIntire et al., 2016; Bauer et al., 2016), and many others (Bui et al., 2017). Unfortunately, when $N \to \infty$, all of these methods fail to attain the Bayesian notion of optimality known as asymptotic posterior consistency (Barron et al., 1999; Ghosal et al., 2000). Posterior consistency means that the empirical distribution tends to its population counterpart, and hence the Gaussian mean and covariance tend to their Bayesian ground truth, as the sample size tends to infinity.

Now, let’s shift focus to GPs in the online setting (Roberts et al., 2013). When written in a sequential manner, the posterior mean and covariance updates are parametrized by a kernel dictionary that grows by one every time a new sample arrives. Clearly, this is untenable for situations where samples arrive ad infinitum. The question, then, is how to selectively retain a subset of past training examples to ensure the posterior distribution nearly locks onto the population probability measure, while bringing the memory under control.

In the case of supervised learning (frequentist regression/classification), one may break the curse of kernelization (Koppel et al., 2019, 2018) by seeking a Lyapunov function of the sequential estimates, and projecting the function iterates in such a way as to throw away as much information as possible while preserving the Lyapunov function’s per-step evolution. In supervised learning, this may be achieved by ensuring the projection satisfies a stochastic descent property (Bottou, 1998; Nemirovski et al., 2009). By contrast, in Bayesian a posteriori estimation, the Lyapunov function is the distance between the current empirical distribution and its population counterpart, which contracts to null as the number of samples becomes large (Ghosal et al., 2000).

More specifically, posterior contraction means the empirical posterior probability measure based upon $t$ samples and its population counterpart become closer according to some metric. The Hellinger metric may be computed in closed form for multivariate Gaussians (Abou-Moustafa and Ferrie, 2012), and many rate results can be specified with respect to this metric (Choi and Schervish, 2007; van der Vaart and van Zanten, 2008; van der Vaart et al., 2009; Kruijer et al., 2010; Stuart and Teckentrup, 2018). Thus, we define as a Lyapunov function of the sequential posterior estimates the Hellinger distance between the empirical posterior probability from $t$ samples and its population counterpart. Through this insight, our contributions are to:

• develop a new algorithm called Parsimonious Online Gaussian Processes (POG), which executes online compression by fixing an error neighborhood of radius $\epsilon_t$, in terms of the Hellinger distance, around the current posterior, and greedily tossing out as many model points as possible while staying inside this neighborhood (Section 3). This compression uses a custom destructive variant of matching pursuit (Pati et al., 1993; Vincent and Bengio, 2002) that quantifies error with respect to the Hellinger metric. Here $\epsilon_t$ denotes the compression budget (Koppel et al., 2019), and the greedy compression may be viewed as a hard-thresholding projection (Parikh et al., 2014) onto subspaces of posterior distributions.

• establish that POG exactly preserves the asymptotic posterior consistency of the standard GP update when the compression budget $\epsilon_t$ approaches null as $t \to \infty$ (Theorem 4.1i, Section 4).

• establish that under a constant compression budget $\epsilon_t = \epsilon > 0$, the Bayesian posterior converges to within an $\epsilon$-neighborhood of asymptotic posterior consistency (Theorem 4.1ii). Moreover, with this budget selection, the size of the kernel dictionary is at worst finite, and defined by the metric entropy of the feature space (Theorem 4.2). Thus, we obtain an optimal tradeoff of approximate consistency and model parsimony.

• experimentally (Section 5), on the boston, abalone, and kin-40k data sets, we observe favorable tradeoffs between performance and complexity relative to fixed-subspace approaches in the offline (Snelson and Ghahramani, 2006) and online (Csató and Opper, 2002) settings.

Context: We expand upon related approaches, as seeking tractable approximations of online GPs has been studied for years. One approach is to focus on the exact GP likelihood and compute the inference approximately, in the spirit of John Tukey’s truism: “Far better an approximate answer to the right question…than an exact answer to the wrong question…” (Tukey, 1962). This approach, originating in (Csató and Opper, 2002), derives an additive decomposition of the GP likelihood, and then projects the posterior parameters onto a subspace of fixed size. The resulting method, called Sparse Online Gaussian Processes (SOGP), has given rise to numerous variations that modify the projection criterion using, e.g., information gain (Seeger et al., 2003), the negative log-likelihood of the posterior (Smola and Bartlett, 2001; Keerthi and Chu, 2006), or regularizations of the likelihood (McIntire et al., 2016). The commonality of these approaches is that their posterior subspace dimension is fixed in advance. In the offline setting, one may additionally optimize over the retained points in the subspace, an approach dubbed pseudo-input search (Snelson and Ghahramani, 2006).

An alternative approach is to employ a variational upper-bound of the GP likelihood, e.g., the Variational Free Energy (VFE) (Titsias, 2009) or Expectation Propagation (Csató, 2002), which, as illuminated by (Bui et al., 2017), can be employed in the online setting by subsuming the GP approximation into streaming variational Bayes (Broderick et al., 2013). Online variational approaches remain an active area of research, and are well-summarized in (Liu et al., 2018). Recently, the performance of various approximations has been characterized in terms of approximate convergence to a variational upper-bound on the population likelihood (Reeb et al., 2018; Burt et al., 2019). Moreover, (Toth and Oberhauser, 2019) employs them for time-series (non-i.i.d.) analysis, and (Shi et al., 2019) builds geometric intuition for the approximation subspace.

Predominantly, the aforementioned approaches fix the subspace size, which is not inherently a drawback, except that it is challenging to know a priori how many points are required for convergence in terms of any law of large numbers (Ghosal et al., 2000), or a variational upper-bound thereof (Reeb et al., 2018). See (Roberts et al., 2013; Bauer et al., 2016) for reviews of challenges and approaches for the online setting, or (Quinonero-Candela et al., 2007; Banerjee et al., 2012; Rasmussen, 2004)[Ch. 8] for experimentation on offline approximations.

By contrast, in this work, we do not fix the memory that defines the number of past retained points, but instead fix the compression-induced error, and allow the subspace dimension to grow/shrink with the importance of new information. Most similar to this work is (McIntire et al., 2016), which experimentally (but not theoretically) substantiates the merits of allowing the subspace dimension to continually evolve. In this work, we derive a compression rule directly motivated by laws of large numbers for GPs (Ghosal et al., 2000; van der Vaart and van Zanten, 2008), whose tradeoffs we establish both in theory and practice. In particular, the proposed compression routine provably converges to near-optimality, and retains finitely many points in its approximate posterior. See Table 1 for a summary.

## 2 Gaussian Process Regression

In GP regression (Krige, 1951; Rasmussen, 2004) there is some function $f$ that models the relationship between random variables $x \in \mathcal{X}$ and $y \in \mathbb{R}$, i.e., $y = f(x)$, that we are trying to estimate upon the basis of $N$ training examples $S = \{(x_n, y_n)\}_{n=1}^{N}$. Unlike in expected risk minimization (ERM), we do not learn this estimator by solving an optimization problem that defines its merit of fitness; instead, we assume that this function follows some particular parametrized family of distributions, and then seek to estimate those parameters.

In particular, for GPs, we place a uniform prior on the distribution of $f$ as a Gaussian distribution, namely, $f \sim \mathcal{N}(\mu, \Sigma)$. Here $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate Gaussian distribution in $N$ dimensions with mean vector $\mu$ and covariance $\Sigma$. In GP regression, the covariance is constructed from a distance-like kernel function $\kappa(x, x')$ defined over the product set $\mathcal{X} \times \mathcal{X}$ of the feature space. The kernel expresses some prior about how to measure distance between points, a common example of which is itself the Gaussian, $\kappa(x, x') = \exp\left(-\|x - x'\|^2 / (2h^2)\right)$, with bandwidth hyper-parameter $h$.
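As a concrete illustration, a Gaussian kernel and its Gram matrix can be sketched in numpy as follows (the function names and default bandwidth are our own, not the paper's):

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Squared-exponential (Gaussian) kernel between two feature vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))

def kernel_matrix(X, bandwidth=1.0):
    """Gram matrix K with K[i, j] = kappa(x_i, x_j), computed vectorized."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))
```

Note that $\kappa(x, x) = 1$ for every $x$, and the Gram matrix is symmetric by construction.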

In standard GP regression, we further place a zero-mean Gaussian prior on the noise that corrupts $f(x_n)$ to form the observations $y_n$, i.e., $y_n = f(x_n) + w_n$ with $w_n \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is some variance parameter. In Section 4 we use $\Pi$ to denote the measure associated with the Gaussian prior. We may integrate out the prior on $f$ to obtain the marginal likelihood for $y$ as

$$P\left(y \mid S\right) = \mathcal{N}\left(0,\, K_N + \sigma^2 I\right) \tag{1}$$

Upon receiving a new data point $x_{N+1}$, we make a Bayesian inference for $y_{N+1}$ not by defining a point estimate; instead, we formulate the entire posterior distribution for $y_{N+1}$ as:

$$P\left(y_{N+1} \mid S \cup x_{N+1}\right) = \mathcal{N}\left(\mu_{N+1 \mid S},\, \Sigma_{N+1 \mid S}\right) \tag{2}$$

where the mean and covariance in (2) are given by

$$\begin{aligned}
\mu_{N+1 \mid S} &= k_S(x_{N+1})\left[K_N + \sigma^2 I\right]^{-1} y_N \\
\Sigma_{N+1 \mid S} &= \kappa(x_{N+1}, x_{N+1}) - k_S^\top(x_{N+1})\left[K_N + \sigma^2 I\right]^{-1} k_S(x_{N+1}) + \sigma^2
\end{aligned} \tag{3}$$

The expressions (2)-(3) are standard – see (Rasmussen, 2004)[Chapter 2]. Here $k_S(x) = [\kappa(x_1, x), \dots, \kappa(x_N, x)]^\top$ denotes the empirical kernel map. While this approach to sequential Bayesian inference provides a powerful framework for fitting a mean and covariance envelope around observed data, it requires, for each $N$, the computation of $\mu_{N+1 \mid S}$ and $\Sigma_{N+1 \mid S}$, which crucially depend on computing the inverse of the kernel matrix $K_N$ every time a new data point arrives. It is well-known that matrix inversion has cubic complexity in the variable dimension $N$, which may be reduced through use of Cholesky factorization (Foster et al., 2009) or subspace projections (Banerjee et al., 2012) combined with various compression criteria such as information gain (Seeger et al., 2003), mean square error (Smola and Bartlett, 2001), integral approximation for Nyström sampling (Williams and Seeger, 2001), probabilistic criteria (McIntire et al., 2016; Bauer et al., 2016), and many others (Bui et al., 2017).
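To make the update concrete, here is a minimal numpy sketch of the posterior mean and variance at a single query point (illustrative function names and kernel choice, not the paper's implementation); the linear solves against $K_N + \sigma^2 I$ are where the cubic cost in $N$ enters:

```python
import numpy as np

def rbf(a, b, h=1.0):
    """Gaussian kernel with bandwidth h (illustrative choice)."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return float(np.exp(-np.sum((a - b) ** 2) / (2 * h ** 2)))

def gp_posterior(X, y, x_new, sigma2):
    """Posterior mean and variance at x_new, following (3)."""
    N = len(X)
    K = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # kernel matrix K_N
    k = np.array([rbf(xi, x_new) for xi in X])              # empirical kernel map k_S(x_new)
    A = K + sigma2 * np.eye(N)                              # [K_N + sigma^2 I]: O(N^3) to factor
    mean = k @ np.linalg.solve(A, np.asarray(y, dtype=float))
    var = rbf(x_new, x_new) - k @ np.linalg.solve(A, k) + sigma2
    return mean, var
```

For a single training point at $x = 0$ with $y = 1$, $\sigma^2 = 1$, and query $x_{\text{new}} = 0$, this gives mean $1/2$ and variance $3/2$, matching (3) by hand.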

However, even with this complexity reduction, if one tries to run GP regression in true streaming applications where the sample size $N$ is not necessarily finite, any computational savings are eventually rendered useless unless one ensures the complexity remains independent of the sample size. Thus, our objective is to find an approximate GP whose memory is independent of the sample size, yet is as close as possible in distribution to the fully dense GP (as $N \to \infty$). Next, we shift focus to developing such a method.

## 3 Online Gaussian Processes

In this section, we derive our new algorithm, Parsimonious Online Gaussian Processes (POG), which is nothing more than a rewriting of the posterior update (2) in a time-series manner, together with an online sparsification rule that ensures its complexity remains under control while nearly preserving posterior consistency. Define the time series of observations as $S_t = \{(x_u, y_u)\}_{u \le t}$, and then rewrite (2) in a time-varying way in terms of $S_t$ as

$$\begin{aligned}
\mu_{t+1 \mid S_t} &= k_{S_t}(x_{t+1})\left[K_t + \sigma^2 I\right]^{-1} y_t \\
\Sigma_{t+1 \mid S_t} &= \kappa(x_{t+1}, x_{t+1}) - k_{S_t}^\top(x_{t+1})\left[K_t + \sigma^2 I\right]^{-1} k_{S_t}(x_{t+1}) + \sigma^2.
\end{aligned} \tag{4}$$

Observe that this update causes the kernel dictionary to grow by one at each iteration, i.e., $X_{t+1} = [X_t;\, x_{t+1}]$, and that the posterior estimates at time $t$ use all past observations in the kernel dictionary $X_t$. Subsequently, we refer to the number of columns in the dictionary matrix as the model order $M_t$. The GP's posterior estimates at time $t$ have model order $M_t = t$. For future reference, we denote the posterior distribution of $y_{t+1}$ defined by (3) as $\rho_t$, and the true population posterior as $\rho_0$.

### 3.1 Compressing the Posterior Distributions

Our memory-reduction technique relies on the following conceptual link between sparsity in stochastic programming and Bayesian inference: a typical Lyapunov (potential) function of an online supervised learning algorithm is the mean sub-optimality, which quantifies the “energy” in an optimization algorithm. If the method has been appropriately designed, the Lyapunov function is negative semi-definite, and thus flows to the minimum energy state, yielding convergence to optimality (Khalil, 1996). When sparsity considerations are additionally present, the role of a proximal (Atchade et al., 2014)/hard-thresholding (Nguyen et al., 2017) operator in tandem with the descent direction must be analyzed to establish decrement in expectation.

Motivated by the convergence of stochastic gradient hard-thresholding algorithms and their use in sparse nonparametric function estimation (Koppel et al., 2019), we seek a Lyapunov function of the sequential Bayesian MAP update for GPs, and develop a custom hard-thresholding projection that preserves its per-step behavior. In Bayesian inference, there is no optimization iterate or gradient, but instead only posterior estimates and their associated distributions. Thus, to quantify convergence, we propose measuring how far the empirical posterior is from its population ground truth according to some metric, which actually quantifies distance from posterior consistency (Barron et al., 1999).

We fix our choice of metric as the Hellinger distance because it is easily computable for two multivariate Gaussians (Abou-Moustafa and Ferrie, 2012). For two continuous distributions $\nu$ and $\lambda$ over the feature space $\mathcal{X}$, the Hellinger distance is defined as

$$d_H(\nu, \lambda) := \frac{1}{2} \int_{x \in \mathcal{X}} \left(\sqrt{\nu(x)} - \sqrt{\lambda(x)}\right)^2 dx, \tag{5}$$

which, when both distributions are normal, $\nu = \mathcal{N}(\mu_1, \Sigma_1)$ and $\lambda = \mathcal{N}(\mu_2, \Sigma_2)$ (Abou-Moustafa and Ferrie, 2012), takes the form

$$d_H(\nu, \lambda) = 1 - \frac{\det(\Sigma_1)^{1/4} \det(\Sigma_2)^{1/4}}{\det(\bar{\Sigma})^{1/2}} \exp\left(-\frac{1}{8}(\mu_1 - \mu_2)^\top \bar{\Sigma}^{-1} (\mu_1 - \mu_2)\right) \tag{6}$$

where $\bar{\Sigma} = (\Sigma_1 + \Sigma_2)/2$. The distribution defined by the Bayesian updates (2) converges to the underlying population distribution in Hellinger distance with probability 1 with respect to the population posterior distribution (see (Choi and Schervish, 2007)[Theorem 6], (van der Vaart and van Zanten, 2008)[Theorem 3.3], (Kruijer et al., 2010)[Theorem 2]). These posterior contraction results motivate the subsequent design of the compression sub-routine. If we select some other kernel dictionary $D$ rather than $X_t$, for some model order $M \le t$, the only difference is that the kernel matrix in (3) and the empirical kernel map are substituted by $K_{D,D}$ and $k_D(\cdot)$, respectively, where the entries are $[K_{D,D}]_{m,m'} = \kappa(d_m, d_{m'})$ and $[k_D(x)]_m = \kappa(d_m, x)$. Let's rewrite (3) for sample $x_{t+1}$ with $D$ as the kernel dictionary rather than $X_t$ as

$$\begin{aligned}
\mu_{t+1 \mid D} &= k_D(x_{t+1})\left[K_{D,D} + \sigma^2 I\right]^{-1} y_t \\
\Sigma_{t+1 \mid D} &= \kappa(x_{t+1}, x_{t+1}) - k_D^\top(x_{t+1})\left[K_{D,D} + \sigma^2 I\right]^{-1} k_D(x_{t+1}) + \sigma^2.
\end{aligned} \tag{7}$$

The question, then, is how to select a sequence of dictionaries $D_t$ whose columns comprise a subset of those of $X_t$ in such a way as to preserve asymptotic posterior consistency.
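As an aside, the closed-form Hellinger distance between two multivariate Gaussians reported in (Abou-Moustafa and Ferrie, 2012) — one minus the Bhattacharyya coefficient, under the convention of (5) — can be sanity-checked numerically (a sketch; the function name is ours):

```python
import numpy as np

def hellinger_gaussians(mu1, S1, mu2, S2):
    """Hellinger distance between N(mu1, S1) and N(mu2, S2):
    one minus the Bhattacharyya coefficient."""
    Sbar = 0.5 * (S1 + S2)                                   # average covariance
    dmu = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    coeff = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
             / np.sqrt(np.linalg.det(Sbar)))
    expo = np.exp(-0.125 * dmu @ np.linalg.solve(Sbar, dmu))
    return 1.0 - coeff * expo
```

For identical Gaussians the distance is zero, and with identity covariances it reduces to $1 - \exp(-\|\mu_1 - \mu_2\|^2 / 8)$.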

Suppose we have a dictionary $D_t$ at time $t$ and observe the point $(x_{t+1}, y_{t+1})$. We compute its associated posterior distribution $\mathcal{N}(\mu_{t+1 \mid D_t}, \Sigma_{t+1 \mid D_t})$, where the expressions for the mean and covariance can be obtained by substituting $D_t$ for the kernel dictionary in (3), assuming that $D_t$ has already been chosen.

We propose compressing the dictionary $D_t$ of model order $M_t$ to obtain a dictionary $\tilde{D}_{t+1}$ of smaller model complexity by executing the update in (3.1) and fixing an error neighborhood of radius $\epsilon_t$, centered at the current posterior, in the Hellinger metric. Then, we prune dictionary elements greedily with respect to the Hellinger metric until we hit the boundary of this error neighborhood. This is a destructive variant of matching pursuit (Pati et al., 1993; Vincent and Bengio, 2002) that has been customized to operate with the Hellinger distance, and is motivated by the fact that we can tune its stopping criterion to ensure that the intrinsic distributional properties of the Bayesian update are almost unchanged. We call this routine Destructive Hellinger Matching Pursuit (DHMP) with budget parameter $\epsilon_t$.

DHMP with compression budget $\epsilon_t$, summarized in Algorithm 2, operates by taking as input a kernel dictionary with associated posterior mean and covariance estimates, and initializing its approximation as the input. It then sequentially and greedily removes the kernel dictionary element that causes the least error in Hellinger distance, and terminates when the resulting distribution hits the boundary of an $\epsilon_t$-error neighborhood in Hellinger distance from its input.
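The pruning loop can be sketched as follows (an illustrative scalar-input implementation with our own helper names and an RBF kernel; Algorithm 2 in the paper operates on the full posterior, whereas this sketch tracks the posterior at a single query point for brevity):

```python
import numpy as np

def rbf(a, b, h=1.0):
    """Gaussian kernel with bandwidth h (illustrative choice)."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return float(np.exp(-np.sum((a - b) ** 2) / (2 * h ** 2)))

def posterior_at(D, y, x_new, sigma2):
    """Posterior mean/variance at x_new given dictionary D and targets y, cf. (7)."""
    K = np.array([[rbf(xi, xj) for xj in D] for xi in D])
    k = np.array([rbf(xi, x_new) for xi in D])
    A = K + sigma2 * np.eye(len(D))
    mu = k @ np.linalg.solve(A, np.asarray(y, dtype=float))
    var = rbf(x_new, x_new) - k @ np.linalg.solve(A, k) + sigma2
    return mu, var

def hellinger_1d(m1, v1, m2, v2):
    """Hellinger distance (one minus Bhattacharyya coefficient), scalar Gaussians."""
    vbar = 0.5 * (v1 + v2)
    bc = (v1 * v2) ** 0.25 / np.sqrt(vbar) * np.exp(-((m1 - m2) ** 2) / (8 * vbar))
    return 1.0 - bc

def dhmp(D, y, x_new, sigma2, eps):
    """Greedy destructive pruning: repeatedly drop the dictionary element whose
    removal perturbs the posterior least in Hellinger distance, stopping once
    the next removal would leave the eps-neighborhood of the input posterior."""
    mu_ref, var_ref = posterior_at(D, y, x_new, sigma2)
    keep = list(range(len(D)))
    while len(keep) > 1:
        best_j, best_d = None, np.inf
        for j in keep:
            idx = [i for i in keep if i != j]
            mu, var = posterior_at([D[i] for i in idx], [y[i] for i in idx],
                                   x_new, sigma2)
            d = hellinger_1d(mu_ref, var_ref, mu, var)
            if d < best_d:
                best_j, best_d = j, d
        if best_d > eps:       # boundary of the error neighborhood reached
            break
        keep.remove(best_j)
    return keep                # indices of retained dictionary elements
```

On a toy dictionary with a duplicated input, the duplicate is the first element discarded, since removing it barely moves the posterior; with a zero budget nothing is pruned.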

The full algorithm, Parsimonious Online Gaussian Processes (POG), is summarized as Algorithm 1. It is the standard sequential Bayesian MAP update of GP regression (3) with current dictionary $D_t$, operating in tandem with DHMP (Algorithm 2), i.e.,

$$\left(\tilde{\mu}_{\tilde{D}_{t+1}},\ \tilde{\Sigma}_{\tilde{D}_{t+1}},\ \tilde{D}_{t+1}\right) = \text{DHMP}\left(\mu_{t+1 \mid D_t},\ \Sigma_{t+1 \mid D_t},\ D_t,\ \epsilon_t\right). \tag{8}$$

After compression completes, the latest sample $x_{t+1}$ is appended to the dictionary to form the prior for the next iteration as $D_{t+1} = [\tilde{D}_{t+1};\, x_{t+1}]$. We denote the Gaussian distribution with the compressed mean and covariance at time $t$ as $\tilde{\rho}_t$.

The compression budget $\epsilon_t$ may be chosen to carefully trade off closeness to the population posterior distribution against model parsimony. This tradeoff is the subject of the subsequent section.

## 4 Balancing Consistency and Parsimony

The foundation of our technical results is the well-developed theory of convergence of the empirical posterior (3) to the true population posterior (Barron et al., 1999). Various posterior contraction rates are available, but they depend on the choice of prior, the underlying smoothness of the generative process, the choice of metric (van der Vaart and van Zanten, 2008; van der Vaart et al., 2009; Kruijer et al., 2010), the sampling coverage and radius of the feature space (Stuart and Teckentrup, 2018), among other technicalities. To avoid descending down the rabbit hole of measure theory and functional analysis, we opt for the simplest technical setting we could find for asymptotic posterior consistency of (3), which is stated next.

###### Lemma 1

(Choi and Schervish, 2007)[Theorem 6(2) and Sec. 6, Ex. 2] Assume that $f_0$ is the true response function, $\sigma_0^2$ is the true noise variance that corrupts the observations, and $\rho_0$ is the associated true population posterior. Assume the following conditions:

1. Suppose training examples are sampled from $P_0$, where $P_0$ is the true joint distribution of $(x, y)$.

2. Denote as the Lebesgue measure on the hypercube. There exists a constant such that whenever , contains at least one sample .

3. For each $N$, the kernel matrix $K_N$ is positive definite.

4. The kernel is of the form for some strictly positive .

5. For each , there exist positive constants and , such that the covariance kernel parameter satisfies .

Then, for every $\gamma > 0$, the posterior defined by (3) is asymptotically consistent, i.e.,

$$P_\Pi\left\{d_H(\rho_t, \rho_0) < \gamma \mid S_t\right\} \to 1 \quad \text{a.s. with respect to } P_0 \tag{9}$$

where the probability in (9) is computed with respect to the Gaussian prior density $\Pi$ as in Sec. 2.

This result forms the foundation of our Lyapunov-style analysis of the Bayesian learning algorithm (Algorithm 1). To establish our main result, we additionally require the following assumption, which says that the per-step change in the posterior likelihood is unchanged by compression.

###### Assumption 1

Define the events $\eta_t$ and $\tilde{\eta}_t$, for any constant $\gamma > 0$, as the single-step changes in the posterior likelihood of the uncompressed and compressed updates, respectively, based upon the same sample point. The single-step likelihood change with respect to the Gaussian prior (Sec. 2, paragraph 2) of the uncompressed posterior is at least as likely as that of the compressed posterior, i.e.,

$$P_\Pi\{\eta_t\} \ge P_\Pi\{\tilde{\eta}_t\},$$

where $P_\Pi$ denotes the prior Gaussian likelihood in Sec. 2.

Assumption 1 is reasonable because the uncompressed and compressed updates observe the same sample at time $t$ and formulate conditional Gaussian likelihoods based upon it. While they are conditioned on different dictionaries $X_t$ and $\tilde{D}_t$, the likelihood of the fully dense posterior is at least as large as that of the sparse GP. In the analysis, Assumption 1 plays the role of a Bayesian analogue of the nonexpansiveness of projection operators. Under this assumption and the conditions of the aforementioned lemma, we can establish almost sure convergence under both diminishing and constant compression budget selections, as stated next.

###### Theorem 4.1

Under the same conditions as Lemma 1, Algorithm 1 attains the following posterior consistency results almost surely:

1. for decreasing compression budget $\epsilon_t \to 0$: for any $\gamma > 0$, as $t \to \infty$, we have $P_\Pi\{d_H(\tilde{\rho}_t, \rho_0) < \gamma \mid S_t\} \to 1$ w.r.t. the population posterior $P_0$.

2. for fixed budget $\epsilon_t = \epsilon > 0$: $\epsilon$-approximate convergence with respect to the Hellinger metric is attained, i.e., for any $\gamma > 0$, as $t \to \infty$, $P_\Pi\{d_H(\tilde{\rho}_t, \rho_0) < \gamma + \epsilon \mid S_t\} \to 1$ w.r.t. the population posterior $P_0$.

###### Proof

See Appendix B.

Theorem 4.1 establishes a formal tradeoff between the choice of compression budget and the accuracy with which we are able to lock onto the population posterior distribution. Specifically, for attenuating compression budget, the algorithm retains more and more sample points as time progresses, such that in the limit it exactly locks onto the fully infinite dimensional population posterior. On the other hand, for constant compression budget, we can converge to a neighborhood of the population posterior, but for this selection we can additionally guarantee that the complexity of the distribution’s parameterization never grows out of control. This finite memory property is formalized in the following theorem.

###### Theorem 4.2

Suppose Algorithm 1 is run with constant compression budget $\epsilon_t = \epsilon > 0$. Then the model order $M_t$ of the posterior distributions remains finite for all $t$, and subsequently the limiting distribution has finite model complexity $M^\infty$, with $M_t \le M^\infty$ for all $t$. More specifically, we have the following relationship between the model complexity, the compression budget, and the parameter dimension $p$:

$$M_t \le O\left(\frac{1}{\epsilon}\right)^{p} \quad \text{for all } t.$$
###### Proof

See Appendix C.
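For intuition on where the exponent $p$ in Theorem 4.2 arises (an informal covering-number sketch, not the paper's proof; $c$ and $R$ are our notation): a compact feature space $\mathcal{X} \subset \mathbb{R}^p$ of diameter $R$ admits a covering by balls of radius proportional to $\epsilon$ whose cardinality obeys

```latex
\mathcal{N}(\epsilon; \mathcal{X}) \;\le\; \left(\frac{cR}{\epsilon}\right)^{p},
```

so if every retained dictionary element must contribute at least on the order of $\epsilon$ of Hellinger error upon removal, no more than on the order of $\mathcal{N}(\epsilon;\mathcal{X})$ elements can coexist in the dictionary, which is consistent with $M_t \le O(1/\epsilon)^p$ uniformly in $t$.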

Theorem 4.2 establishes a unique result for approximate GPs, namely, that the attainment of approximate posterior consistency comes with the salient feature that the posterior admits a parsimonious representation. In the worst case, the model complexity depends on the metric entropy of the feature space $\mathcal{X}$, rather than growing unboundedly with the iteration index $t$. The combination of Theorems 4.1ii and 4.2 establishes a rate-distortion theorem for GPs over compact feature spaces. In the subsequent section we empirically validate the aforementioned theoretical results on real data sets.

## 5 Experiments

In this section, we compare the performance of POG with standard online and offline approximations of GPs: Sparse Online Gaussian Processes (SOGP) (Csató and Opper, 2002), which fixes the subspace dimension a priori, and an offline fixed-dimension approach which additionally performs a gradient-based search over the space of training examples, dubbed pseudo-input search (Sparse Pseudo-input Gaussian Processes, or SPGP) (Snelson and Ghahramani, 2006). Along with the above three algorithms, we also consider the performance of the Dense GP, which stores all training samples and does no compression. For our experiments, the Dense GP is implemented in an online fashion for the purpose of visualization, but in practice this is impossible as the size of the training set becomes large. Thus, we report its limiting test error as an offline approach in Table 2.

To ensure competitiveness of each approach, we conduct hyperparameter optimization at the outset over a number of different bandwidth selections using MATLAB's inbuilt function (”fmincon”) for POG and the Dense GP, whereas for SOGP and SPGP we employ their in-built hyper-parameter selection schemes. We note that since each input dimension defines a covariance component of the kernel, rather than use a single kernel bandwidth, we tune a diagonal matrix of bandwidth parameters, often referred to in the literature as “automatic relevance determination.” These parameters are tuned over a randomly selected small subset of the training samples. An analogous procedure was used to tune the noise prior in experiments.

For our experiments, we consider the Gaussian kernel with varied length-scale, i.e., $\kappa(\mathbf{x}, \mathbf{y}) = \exp\left(-\sum_{d=1}^{p} \frac{(x^{(d)} - y^{(d)})^2}{2\sigma_d^2}\right)$, with $\{\sigma_d\}_{d=1}^{p}$ as hyperparameters, where the $d$th superscript in the variable denotes the $d$th component of the vector.

POG being an online algorithm, for every training instant $t$ we run Algorithm 1 and use the dictionary $D_t$ obtained from the compression Algorithm 2 to evaluate the following measures on the test data set using (3.1): the standardized mean squared error (SMSE) and the mean standardized log loss (MSLL), i.e.,

$$\begin{aligned}
\mu_{i}^{\text{test} \mid D_t} &= k_{D_t}(x_i^{\text{test}})\left[K_{D_t,D_t} + \sigma^2 I\right]^{-1} y_t \\
\Sigma_{i}^{\text{test} \mid D_t} &= \kappa(x_i^{\text{test}}, x_i^{\text{test}}) - k_{D_t}^\top(x_i^{\text{test}})\left[K_{D_t,D_t} + \sigma^2 I\right]^{-1} k_{D_t}(x_i^{\text{test}}) + \sigma^2,
\end{aligned} \tag{10}$$

where $i$ denotes the index of test data samples and $x_i^{\text{test}}$ is the $i$th sample from the test data set. The index varies over $i = 1, \dots, N_{\text{test}}$, where $N_{\text{test}}$ is the total number of samples in the test data set. The specific way SMSE and MSLL are calculated on the test set for every training index $t$ is given below:

$$\begin{aligned}
\text{SMSE} &= \frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}} \frac{\left(y_i^{\text{test}} - \mu_i^{\text{test} \mid D_t}\right)^2}{\text{var}[y]} \\
\text{MSLL} &= \frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}} \left[\frac{\left(y_i^{\text{test}} - \mu_i^{\text{test} \mid D_t}\right)^2}{2\,\Sigma_i^{\text{test} \mid D_t}} + \frac{1}{2}\log \Sigma_i^{\text{test} \mid D_t}\right]
\end{aligned} \tag{11}$$

where $\text{var}[y]$ is the variance of the training data and $y_i^{\text{test}}$ is the actual test value. Observe that MSLL is just the negative log likelihood with a constant of $\frac{1}{2}\log(2\pi)$ omitted, since this term does not reflect the accuracy of the mean and covariance estimates. Thus, smaller values of MSLL are better in the sense of maximum a posteriori estimation. We next use SMSE and MSLL for the performance evaluation of the algorithms on the Boston (Harrison Jr and Rubinfeld, 1978), kin40k (Seeger et al., 2003), and Abalone (Nash et al., 1994) data sets, to faithfully compare against benchmarks that appeared early in this literature, i.e., Seeger et al. (2003).
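For concreteness, the two test metrics can be sketched as follows (function names are ours; MSLL here is the per-point Gaussian negative log likelihood with the constant term dropped, averaged over the test set, as described above):

```python
import numpy as np

def smse(y_test, mu, var_y_train):
    """Standardized mean squared error: MSE normalized by the training variance."""
    y_test, mu = np.asarray(y_test, dtype=float), np.asarray(mu, dtype=float)
    return float(np.mean((y_test - mu) ** 2) / var_y_train)

def msll(y_test, mu, sigma2_test):
    """Mean standardized log loss: Gaussian NLL per test point with the
    constant 0.5*log(2*pi) omitted, averaged over the test set."""
    y_test, mu, s2 = (np.asarray(a, dtype=float) for a in (y_test, mu, sigma2_test))
    return float(np.mean((y_test - mu) ** 2 / (2.0 * s2) + 0.5 * np.log(s2)))
```

A perfect predictor with unit predictive variance attains SMSE of zero and MSLL of zero, and smaller values of both are better.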

For a fair comparison of POG with SOGP and SPGP, we have kept the model order (total number of dictionary points) of all three algorithms the same. However, this restriction is infeasible for the Dense GP, whose model order grows by one with every training sample; the performance of the Dense GP therefore stands as a benchmark. The exact value of the model order differs for each data set and is reported in Table 2, where the performance in terms of (11) is also presented. The values for SMSE and MSLL are obtained by computing their average over the test data set for the last samples of the training set. We categorize approaches as offline or online based upon the tractability of computing the GP posterior on the fly.

### 5.1 Boston Housing data

In this section we study the performance of POG, along with the three other algorithms mentioned above, on a real data set obtained from housing information in the area of Boston (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). There are a total of training samples and test samples with an input dimension of size . The compression budget was fixed at .

For a fair comparison, we have set the constant model order of SOGP and SPGP equal to the final settled model order of POG (as can be observed from Fig. 0(d)). In Figs. 0(c) and 0(d), we demonstrate the evolution of POG in terms of the Hellinger distance and model order. In Fig. 0(d), we present the evolution of the model order of all approaches, from which we may observe that the Dense GP retains all past points in its posterior representation, and hence its complexity grows unbounded. By contrast, the model complexity of POG settles down. The fixed-dimension behavior of the SOGP and SPGP algorithms can also be verified from Fig. 0(d).

We compare the performance of POG, SOGP, Dense GP, and SPGP in Figs. 0(a) and 0(b). Observe that for both test error metrics (SMSE and MSLL) (11), POG outperforms SOGP and gives comparable performance to the Dense GP. However, the Dense GP achieves better performance at the cost of a growing dictionary size, which is evident from Fig. 0(d), in comparison to the settled model order of POG. Note that SPGP achieves the best performance, but this performance gain is attained by conducting hyper-parameter (pseudo-input) search over the training examples themselves, which is possible by virtue of offline processing. In contrast, POG, SOGP, and the Dense GP operate online; this is why SPGP is labeled in the legend as an “Offline benchmark.” Amongst the online approaches, POG attains a favorable tradeoff of complexity and model fitness.

### 5.2 Abalone data

We shift focus to studying the performance of the POG algorithm, relative to the aforementioned comparators, on the larger abalone data set (Nash et al., 1994), comprised of 2983 training samples together with a held-out test set. The data set focuses on the task of predicting physical properties of abalone, a type of shellfish. The objective is to determine the age of an abalone by predicting its number of rings from the input vector, which consists of attributes such as sex, length, height, and several weight measurements. The hyperparameter optimization and noise prior are obtained exactly as explained in Sec. 5. The compression budget $\epsilon_t$ is fixed at a constant value.

As for the Boston data set, we keep the model order of SOGP and SPGP constant at the final settled model order of POG for a fair comparison. From Figs. 2(a) and 2(b), it can be observed that POG performs better than SOGP and yields performance close to the Dense GP and SPGP on both test error metrics, with a model order growth directly determined by the statistical significance of training points across time (Fig. 2(d)). The evolution of the Hellinger distance is plotted in Fig. 2(c). From Fig. 2(d), one may observe that although 2983 training samples have been processed, POG determines to retain only a small number of dictionary points, in contrast to the full dictionary of 2983 points in the Dense GP. Overall, one may conclude that POG attains model fitness comparable to the Dense GP and SPGP while being amenable to online operation, in contrast to SPGP.

### 5.3 kin40k data

Next we study the performance of POG on the kin-40k data set, which belongs to the kin family of data sets from the DELVE archive; see (Seeger et al., 2003). The data set is generated from a realistic simulation of the forward dynamics of an 8-link all-revolute robot arm. We consider a small subset of 4000 training samples and 200 test samples from the original data set for our experiment. The goal is to estimate the distance of the robot arm head from a target, based on the 8-dimensional input consisting of joint positions and twist angles (http://www.cs.toronto.edu/~delve/data/kin/desc.html). The kin-40k data set is generated with maximum nonlinearity and little noise, thereby resulting in a highly difficult regression task (http://ida.first.fraunhofer.de/~anton/data.html; Chen and Wang, 2006).

The performance of the POG algorithm on the kin-40k data set is plotted in Fig. 3. The compression budget $\epsilon_t$ was fixed at a constant value. Observe in Figs. 3(a) and 3(b) that POG performs better than SOGP and gives performance comparable to SPGP and the Dense GP, if not better. The model order for SPGP and SOGP is fixed at the complexity discerned by POG in Fig. 3(d), in contrast to the linear increase of the Dense GP. We further visualize the distributional evolution of POG in Fig. 3(c).

In contrast to the better performance of SPGP relative to the Dense GP on the previous two data sets (Boston and Abalone), here we see the reverse. This may be an artifact of our use of a subset of the data (4000 of 10000 training samples and 200 of 30000 test samples) from the full kin-40k data set. When one operates with a data subset, SPGP is not able to optimize its pseudo-inputs to be representative of the subset, whereas the Dense GP stores all points in its dictionary and thus performs favorably. This inversion of the ranking observed on the Boston and Abalone data sets suggests that the pseudo-input search of SPGP must happen together with the selection of an appropriate number of points, which in practice is challenging.

## 6 Conclusion

We presented a new approach to sparse approximation of GPs in streaming settings. In standard GPs, the computational complexity of evaluating the posterior mean and covariance grows with the number of samples processed, which renders them inoperable in online settings.

To ameliorate this issue, we proposed a sparsification routine that operates on a faster time-scale than the Bayesian inference and approximates the posterior distribution. We use the Hellinger distance between the current posterior and its population counterpart to study the theoretical guarantees of our sparse GP algorithm; since the Hellinger distance is computable in closed form for Gaussians, it is a suitable choice among the valid Lyapunov functions considered for studying the asymptotic behavior of GPs. By introducing hard-thresholding projections based on matching pursuit, we were able to design sparsification rules that nearly preserve the theoretical statistical behavior of GPs while bringing their memory usage under control.
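The sparsification rule just described can be sketched as a greedy loop that keeps removing the dictionary point whose deletion perturbs the posterior least, stopping once the Hellinger budget would be exceeded. This is only a schematic under our own naming, not the paper's exact DHMP routine: `posterior` is assumed to map a dictionary to Gaussian moments `(mean, cov)`, and `hellinger` to the closed-form distance between two such pairs.

```python
import numpy as np

def compress(dictionary, posterior, hellinger, eps):
    """Greedily drop dictionary points while the compressed posterior
    stays within Hellinger radius eps of the uncompressed one."""
    ref = posterior(dictionary)              # uncompressed moments (mean, cov)
    current = list(dictionary)
    while len(current) > 1:
        # evaluate the single removal that perturbs the posterior least
        dists = [hellinger(*ref, *posterior(current[:i] + current[i + 1:]))
                 for i in range(len(current))]
        i_best = int(np.argmin(dists))
        if dists[i_best] > eps:              # budget would be violated: stop
            break
        current.pop(i_best)
    return current
```

A practical implementation would amortize the per-candidate posterior evaluations; this sketch favors clarity over efficiency.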

The performance of the algorithm was tested on three highly nonlinear real data sets and compared against benchmark alternatives. The experiments illuminate the benefits, in practical settings, of fixing the statistical error rather than the subspace dimension a priori, and support our theoretical findings. Future directions include examining the merits and drawbacks of variable-subspace approaches to variational approximations of the GP likelihood, as well as developing a better conceptual understanding of methods based on pseudo-input search.

## Appendix A Technical Lemmas

###### Lemma 2

Under the same conditions as Lemma 1, the probability of the event $\{d_H(\rho_t, \rho_{t-1}) < \gamma\}$ with respect to the prior $\Pi$ approaches $1$ as $t \to \infty$ for any $\gamma > 0$.

###### Proof

Let us analyze the probability of the event $\{d_H(\rho_t, \rho_{t-1}) < \gamma\}$ with respect to the prior $\Pi$ by first looking at the argument of the event, $d_H(\rho_t, \rho_{t-1})$, and applying the triangle inequality:

$$d_H(\rho_t, \rho_{t-1}) \leq d_H(\rho_t, \rho_0) + d_H(\rho_{t-1}, \rho_0) \tag{12}$$

From the posterior contraction rates of Ghosal et al. (2000)[Theorem 2.5], we have $d_H(\rho_t, \rho_0) \leq d_H(\rho_{t-1}, \rho_0)$ in probability (w.r.t. $\Pi$), which we may substitute into the right-hand side of (12) to obtain

$$d_H(\rho_t, \rho_{t-1}) \leq 2\, d_H(\rho_{t-1}, \rho_0) \tag{13}$$

We can now write the following relationship between events, for any $\gamma > 0$:

$$\{d_H(\rho_t, \rho_{t-1}) < \gamma\} \subset \{2\, d_H(\rho_{t-1}, \rho_0) < \gamma\}. \tag{14}$$
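The passage from the containment (14) to the probability bound (15) is monotonicity of the conditional prior measure; spelled out in the notation above:

```latex
% Monotonicity: A \subset B implies P_\Pi(A \mid S_t) \le P_\Pi(B \mid S_t), with
A = \{ d_H(\rho_t, \rho_{t-1}) < \gamma \}, \qquad
B = \{ 2\, d_H(\rho_{t-1}, \rho_0) < \gamma \}
  = \{ d_H(\rho_{t-1}, \rho_0) < \tfrac{\gamma}{2} \}.
```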

Now we compute the prior probability conditioned on $S_t$:

$$P_\Pi\{d_H(\rho_t, \rho_{t-1}) < \gamma \mid S_t\} \leq P_\Pi\{d_H(\rho_{t-1}, \rho_0) < \tfrac{\gamma}{2} \mid S_t\}. \tag{15}$$

By shifting the time index of the term on the right-hand side of (15), the prior probability conditioned on $S_t$ can be computed using Lemma 1 [cf. (9)] and written as

$$\lim_{t \to \infty} P_\Pi\{d_H(\rho_t, \rho_0) < \tfrac{\gamma}{2} \mid S_t\} = 1 \tag{16}$$

Thus using (16), we may conclude that the left-hand side of (15) then has limit less than or equal to 1, and hence its lim sup satisfies

$$\limsup_{t \to \infty} P_\Pi\big(d_H(\rho_t, \rho_{t-1}) < \gamma\big) = 1 \tag{17}$$

However, since the GP posterior and the Hellinger metric are continuous, the preceding limit exists, and hence is unique. Therefore, we may conclude that the prior probability of the left-hand side of (14) also converges to one, yielding Lemma 2.

## Appendix B Proof of Theorem 4.1

Proof of Theorem 4.1(i): We relate the per-step behavior of Algorithm 1 to the sample path of the uncompressed posterior sequence $\rho_t$. First, apply the triangle inequality again to obtain that this distance decomposes into two terms:

$$d_H(\rho_{D_t}, \rho_{D_{t-1}}) \leq d_H(\rho_{D_t}, \rho_{\tilde{D}_t}) + d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) \tag{18}$$

The first term on the right-hand side of (18) is exactly the DHMP stopping criterion, and thus is no more than the compression budget $\epsilon_t$. Therefore, we have the following containment relationship for events:

$$\{d_H(\rho_{D_t}, \rho_{D_{t-1}}) < \alpha\} \subset \{d_H(\rho_{D_t}, \rho_{\tilde{D}_t}) + d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) < \alpha\} \subset \{d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) + \epsilon_t < \alpha\} \tag{19}$$

We compute the prior probability of (19) for any $\alpha > 0$ as

$$P_\Pi\{d_H(\rho_{D_t}, \rho_{D_{t-1}}) < \alpha \mid S_t\} \leq P_\Pi\{d_H(\rho_{D_t}, \rho_{\tilde{D}_t}) + d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) < \alpha \mid S_t\} \leq P_\Pi\{d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) + \epsilon_t < \alpha \mid S_t\} \tag{20}$$

where in the last expression we have applied the DHMP stopping criterion. Now, subtract the constant $\epsilon_t$ from both sides inside the event on the right-hand side of (20) to define the event $\tilde{\eta}_t := \{d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) < \alpha - \epsilon_t\}$. Subsequently, define $\eta_t := \{d_H(\rho_t, \rho_{t-1}) < \alpha - \epsilon_t\}$.

The event sequences $\eta_t$ (defined in Lemma 2) and $\tilde{\eta}_t$ quantify the effect of the same Bayesian MAP updates across time, but using posterior distributions parameterized by different kernel dictionaries [cf. (3) versus (3.1)]. To the event $\tilde{\eta}_t$ we can apply Assumption 1 to write

$$P_\Pi\{d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) < \alpha - \epsilon_t \mid S_t\} = P_\Pi\{\tilde{\eta}_t \mid S_t\} \leq P_\Pi\{\eta_t \mid S_t\} \tag{21}$$

Now, suppose $\epsilon_t \to \epsilon$ as $t \to \infty$, so that $\alpha - \epsilon_t \to \alpha - \epsilon$. Thus, by using Lemma 2 on the right-hand side of (21), we conclude that

$$\limsup_{t \to \infty} P_\Pi\big(d_H(\rho_{D_t}, \rho_{D_{t-1}}) < \alpha - \epsilon \mid S_t\big) = 1 \tag{22}$$

Again, we use continuity of the GP posterior and the Hellinger metric to conclude the preceding limit exists, and therefore

$$\lim_{t \to \infty} P_\Pi\big(d_H(\rho_{D_t}, \rho_{D_{t-1}}) < \alpha - \epsilon \mid S_t\big) = 1 \tag{23}$$

Therefore, for the choice of a diminishing compression budget $\epsilon_t \to 0$, (23) holds with $\epsilon = 0$ for any $\alpha > 0$. Substituting in the definition of $\rho_{D_t}$ yields Theorem 4.1(i).

Proof of Theorem 4.1(ii): We again relate the asymptotic probabilistic behavior of $\rho_{D_t}$ defined by Algorithm 1 to the true uncompressed sequence $\rho_t$ given in (3). Begin with the expression (20), followed by subtracting $\epsilon$ from both sides, to obtain:

$$P_\Pi\{d_H(\rho_{D_t}, \rho_{D_{t-1}}) < \alpha \mid S_t\} \leq P_\Pi\{d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) < \alpha - \epsilon \mid S_t\} \tag{24}$$

The right-hand side of (24) is $P_\Pi\{\tilde{\eta}_t \mid S_t\}$, to which we may apply Assumption 1, which states that $P_\Pi\{\tilde{\eta}_t \mid S_t\} \leq P_\Pi\{\eta_t \mid S_t\}$, to obtain

$$P_\Pi\{d_H(\rho_{\tilde{D}_t}, \rho_{D_{t-1}}) < \alpha - \epsilon \mid S_t\} \leq P_\Pi\{d_H(\rho_t, \rho_{t-1}) < \alpha - \epsilon \mid S_t\} \tag{25}$$

By Lemma 2, the lim sup of the probability on the right-hand side of (25) approaches $1$ as $t \to \infty$ for $\gamma = \alpha - \epsilon$. Thus, the left-hand side of (24) can be written as:

$$\limsup_{t \to \infty} P_\Pi\{d_H(\rho_{D_t}, \rho_{D_{t-1}}) < \alpha \mid S_t\} = 1 \tag{26}$$

Again, we exploit the continuity of the GP posterior and the Hellinger metric to conclude that the preceding limit exists. Theorem 4.1(ii) follows from substituting in the definition of $\rho_{D_t}$.