DeepAI
Log In Sign Up

Bézier Gaussian Processes for Tall and Wide Data

Modern approximations to Gaussian processes are suitable for "tall data", with a cost that scales well in the number of observations, but under-performs on “wide data”, scaling poorly in the number of input features. That is, as the number of input features grows, good predictive performance requires the number of summarising variables, and their associated cost, to grow rapidly. We introduce a kernel that allows the number of summarising variables to grow exponentially with the number of input features, but requires only linear cost in both number of observations and input features. This scaling is achieved through our introduction of the Bézier buttress, which allows approximate inference without computing matrix inverses or determinants. We show that our kernel has close similarities to some of the most used kernels in Gaussian process regression, and empirically demonstrate the kernel's ability to scale to both tall and wide datasets.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

04/03/2020

Faster Gaussian Processes via Deep Embeddings

Gaussian processes provide a probabilistic framework for quantifying unc...
08/01/2020

Convergence of Sparse Variational Inference in Gaussian Processes Regression

Gaussian processes are distributions over functions that are versatile a...
03/11/2020

General linear-time inference for Gaussian Processes on one dimension

Gaussian Processes (GPs) provide a powerful probabilistic framework for ...
02/15/2021

High-Dimensional Gaussian Process Inference with Derivatives

Although it is widely known that Gaussian processes can be conditioned o...
03/18/2019

Linear-Time Inference for Pairwise Comparisons with Gaussian-Process Dynamics

We present a probabilistic model of pairwise-comparison outcomes that ca...
03/21/2020

Scaling up Kernel Ridge Regression via Locality Sensitive Hashing

Random binning features, introduced in the seminal paper of Rahimi and R...
10/24/2018

Scalable Gaussian Processes on Discrete Domains

Kernel methods on discrete domains have shown great promise for many cha...

1. Background

Bézier curves and surfaces are parametrised geometric objects that have found great usage in computer-aided design and robotics (Prautzsch et al., 2002)

. The simplest Bézier curve is the linear interpolation of two points

and in ; the Bézier curve of order

(1.1)

Higher order curves are generalised in the following way: the order- Bézier curve is defined as

(1.2)

In Bézier terms, are referred to as control points. Notice an order- Bézier curve has control points. denotes the th Bernstein polynomial of order . They are defined as

(1.3)

By going from curves to surfaces we wish to extend from the scalar to a spatial input , for . Here, we can define Bézier -surfaces as

(1.4)
Figure 1. Left: Illustration of d control points (orange) and the d grid spanning the input space . We see how the function (or surface) gets impacted by the control points. Right: How the points cover the hypercube in dimensions. Here a grid in d with orders .

Figure 1 gives a visual illustration of a -dimensional surface embedded in . In the literature, it is difficult to find any studies of -surfaces for . This paper targets especially this high-dimensional input case. We restrict our output dimension to , the regression problem, but the methods naturally extend to multidimensional outputs. The red points in Figure 1 show how each control points has an associated location in the input space. They are placed on a grid-like structure, and the order of each dimension determines how fine the mesh-grid of the hypercube is; i.e. how dense the input-space is filled.

Gaussian processes

(GPs) are meticulously studied in probability and statistics

(Williams and Rasmussen, 2006)

. They provide a way to define a probability distribution over functions. This makes them useful, as priors, to build structures for quantifying uncertainty in prediction. They are defined as a probability measure over functions

, such that any collection of elements in have their associated output

following a joint Gaussian distribution. This distribution is fully determined by a mean function

and a positive semi-definite kernel function . They admit exact Bayesian inference. However, exact inference comes at a prohibitive worst-case computational cost of , where is the number of training points, due to computing inverse and determinant of the kernel matrix.

Sparse Gaussian processes (Snelson and Ghahramani, 2005) overcome this burden by conditioning on inducing points, reducing complexity to , where usually . The inducing points, denoted , are then marginalised to obtain an approximate posterior of

. The variational posterior mean and variance, at a location

are then given by

(1.5)
(1.6)

under the assumption of a constant zero prior mean function. This assumption is easily relaxed if needed. Here denotes the inducing locations in the input space, i.e. Under further assumption of Gaussian observation noise , i.e. , then Titsias (2009) showed the optimal and are known analytically. In a sought analogy to Figure 1, would be the red points and would be the orange.

2. Bézier Gaussian Processes

Inspired by Bézier surfaces, we construct a Gaussian process as

(2.1)

where are Gaussian variables and . Here, for . It is easy to verify that satisfies the definition of a GP since it, for any , is a scaled sum of Gaussians. We assume that all are fully independent. With that assumption, we can make the following observation for the mean and kernel function

(2.2)
(2.3)

The Bernstein polynomials can approximate any continuous function given the order is large enough, thus they make a good basis for GP regression (Hildebrandt and Schoenberg, 1933). Naturally, selecting a prior over comes down to selecting a prior over the random control points . The most common prior mean in GP regression, the constant zero function, is then easily obtained by for all . By construction, the choice of needs consideration to yield a convenient prior over . Mindlessly setting would make collapse to zero quickly in the central region of the domain, especially as dimensions grow. This, of course, gives a much too narrow prior over .

Figure 2

(middle) shows that in the central region the standard deviation of

is smaller due to the nature of the Bernstein polynomials. If we consider instead a two-dimensional input the standard deviation would collapse even more, as we would then see the shrinking effect for both dimension and multiply them. We can, however, adjust for this.

We define the inverse squared Bernstein adjusted prior to counter this effect. In all dimensions , let

(2.4)

and denotes the order of the dimension . Then setting ensures that over the entire domain . Eq. 2.4 solves a linear system, such that , for . This means a prior hardly distinguishable from standard stationary ones such as the RBF kernel. Visual representation of this prior is shown in Figure 2 (right). This adjustment works up to , after which negative values occur.

Figure 2. Left: The Bernstein polynomials that make up the basis of a Bézier GP of order . We observe how they each impact a ‘local’ region along the [0,1] domain. Middle: If not accounting for the Bernstein polynomials behaviour, the variance is of is more narrow in the central region. Right: By adjusting, see Eq. 2.4, we can enforce a uniform variance over the domain.

Summarising, we introduced a kernel based on Bézier surfaces. An alternative viewpoint is that is a polynomial GP, but with Bernstein basis rather than the canonical basis. We remark that is defined outside the domain ; any intuition about the prior there is not considered, and we will not pursue investigating data points outside this domain. Of course, for practical purposes this domain generalises without loss of generality to any rectangular domain . For presentation purposes we keep writing . Next, we show how we infer an approximate posterior given data.

2.1. Variational inference

Let denote the set of all . As the prior of is fully determined by the random control points , the posterior of is determined by the posterior of these. As per above, we set the prior

(2.5)

where . We utilise variational inference to approximate the posterior of the control points, and hence . This means we introduce variational control points. We assume they are fully independent (usually called the mean-field assumption), and have free parameters for the mean and variance, such that .

Assume we have observed data

. The key quantity in variational inference is the Kullback-Leibler divergence between the true posterior

and the variational approximation – which we denote . The smaller divergence, the better approximation of the true posterior. Without access to the true posterior, the quantity is not computable. However, it has been shown this divergence is equal to the slack in Jensen’s inequality used of the log-marginal likelihood: .

(2.6)
(2.7)

Knowing this, we can approximate the true posterior with by maximising Eq. (2.7). This is the evidence lower bound, and it is maximised with respect to the variational parameters and . This is fully analytical when the variational parameters and are known.

We assume our observation model is disturbed with additive Gaussian noise, which in other words means our likelihood is Gaussian

(2.8)

for each and we assume they are independent conditioned on . With these assumption the first term in Eq. (2.7) becomes

(2.9)

where and are given as Eq. (2.2) and (2.3) respectively, but with the variational parameters and used.

The second term in Eq. (2.7) enjoys the independence of control points to split into sums

(2.10)
(2.11)

When inspecting the evidence lower bound, Eq. (2.7), we see it has a data term forcing control points to fit the data, and a KL-term to make control points revert to the prior. Knowing how control points are allocated in the input domain, we expect control points in regions of no data revert to the prior. This is similar to what stationary kernels do in said regions. We verify visually in Section 4.

All together, Bézier GPs can be adjusted to have priors similar to stationary GPs, and have analogous posterior behaviour, which is favourable to many practitioners. But Bézier GPs scale. None of the terms in the evidence lower bound require matrix inversions or determinants. It is simple to mini-batch over the data points, utilising stochastic variational inference (Hoffman et al., 2013), to scale it to large . However, nearly all terms require evaluations of huge sums if the input dimension is high. The next section is aimed at this problem.

2.2. Scalability with the Bézier buttress

Until this point, we have omitted addressing the number of random control points needed for Bézier GPs. Let us denote this number . It can quickly be checked that . This implies that to evaluate we must sum over summands, which as increases, quickly becomes computationally cumbersome. increases exponentially with . It is justifiable to view the random control points as inducing points; after all, they are Gaussian variables in the output space. Thus, it would be extremely valuable to manage exponentially many of them.

To overcome this, we introduce the Bézier buttress111A buttress is an architectural structure that provides support to a building.. We assume parameters of the random control points, say , can parametrise , where . This assumption is the key of the Bézier buttress. Figure 3 provides visualisation. It visualises a source-sink graph, where each unique path from source to sink represents one unique control point with above parametrisation. The cyan highlighted path represents the , where we multiply the values along the path from source to sink. Notice last edges have value .

Figure 3. A Bézier buttress visualised. Here, the input dimension is , hence layers. Each layer has nodes, since each dimension has order . There exist paths from source (left square) to sink (right square), each path represents one unique control point. We can sum over all control points by sequential matrix multiplication from source to sink.

In the Bézier buttress there are layers, one for each input dimension, and nodes in each layer

. Borrowing from neural network terminology, a forward-pass is a sequential series of matrix multiplications which are element-wise warped with non-linearities, such as

or ReLU. If we let our sequence of matrices be , where is the matrix with entries , then fixing ’the input’ to and the last matrix to , a forward pass is

(2.12)

We see this forward pass computes a sum over all control points means. Naturally, we construct a Bézier buttress for summing over variances too, which must be restricted to positive weights. It was these sums that formed our bottleneck, in this parametrisation it is just a sequence of matrix products.

What about computing ? It comes down to a use of ‘non-linearities’ in the buttress. Multiplying element-wise the Bernstein polynomials as seen in Figure 3 (visualised only on rd layer), a forward pass computes either , or if using squared Bernstein polynomials. Each control point is then exactly multiplied by the correct polynomials from its way from source to sink. Notice, the ‘input’ is fixed to in the source, but the observed is appearing via the Bernstein polynomials along the way. We can write this too as a sequence of matrix products

(2.13)

where is a diagonal matrix with the Bernstein polynomials on its diagonal. For the variance of there exists a similar expression, of course with squared polynomials, and with the positive weights associated with the variance Bézier buttress.

On this inspection, all terms needed to compute Eq. (2.9) are available. All terms for the KL divergences are algebraic manipulations of Eq. (2.12) – these are explicit in the supplementary material. The key takeaway for Bézier buttress is that we parametrise each random control point as a product of weights. This can be seen as an amortisation

, and we do inference on these weights rather than the control points themselves. Hence, no matrix inversions are needed to backpropagate through our objective, and a forward pass is only a sequence of matrix products.

2.3. Marginalising matrix commutativity

Matrix multiplication is not commutative. This implies the ordering of the matrices in Eq. (2.12) matter, which again implies how we order the input dimensions in the Bézier buttress is of importance. This is the price for the computational benefit this parametrisation gives. An ordering of the input dimensions is somewhat unnatural for spatial regression, so we present a way to overcome this in approximate Bayesian manner. Define , where each of these individual , , are Bézier GPs with a random permutation of the ordering in the associated Bézier buttress. In other words, we let be an ensemble of Bézier GPs to (approximately) marginalise over all possible orderings. The number of all possible orderings quickly becomes too large for which to account; in practice, we set in a feasible region, say , which gives satisfactory empirical performance.

Another lens on this is that each control point is a sum of control points – each control point’s standard deviation is scaled by

, to obtain the same prior as so far discussed. Oddly, we have then circumvented the problem of too many control points by introducing even more control points. That is, each control point mean parametrise

, where denotes a random permutation of . We remind again that similar expression exist for control point variances, restricted to positive weights.

As remarked, inference comes down to a forward pass in the Bézier buttress, a sequence of matrix multiplications. Assume all dimension are of order , then the computational complexity of one forward pass is . Now we need forward passes to marginalise the ordering, and forward passes, one for each observation, leaving final complexity of . Linear in and .

3. Related Work

Variational inference in GPs was initially considered by Csató et al. (1999) and Gibbs and MacKay (2000). In recent times, the focus has shifted focus to scalable solutions to accommodate the big data era. In this respect, Titsias (2009) took a variational approach to the inducing points methods (Quinonero-Candela and Rasmussen, 2005); later Hensman et al. (2013) further enhanced scalability to allow for mini-batching and a wider class of likelihoods. Still, the need for more inducing points is of importance, especially as the number of input features grows.

The response to this has mostly revolved around exploiting some structure of the kernel. Wilson and Nickisch (2015) and Wu et al. (2021) exploit specific structure that allow for fast linear algebra methods; similar to our method inducing locations tend to lie on grid. These grids expand fast as the input dimension increases, as also pointed out earlier in the article. Kapoor et al. (2021) remedy this by instead of a rectangular grid, they consider the permutohedral lattice, such that each observation only embeds to neighbours instead of , as in (Wilson and Nickisch, 2015).

Another approach to allowing for more inducing points is incorporating nearest neighbour search in the approximation (Kim et al., 2005; Nguyen-Tuong et al., 2008). Tran et al. (2021) introduced sparse-within-sparse where they have many inducing points in memory, but each time search for the -nearest ones to any observation. They discard the remaining ones as they have little to no influence on the observation. Wu et al. (2022) made a variational pendant to this method.

Lastly, when dealing with high-dimensional inputs it is worth mentioning feature learning. That is, learning more low-dimensional features where GPs have better performance. The success of deep models has been transported to GPs by Damianou and Lawrence (2013) and later scaled to large datasets in (Salimbeni and Deisenroth, 2017). Another approach is Deep Kernel Learning (Wilson et al., 2016; Bradshaw et al., 2017)

, where feature extraction happens inside the kernel function; lately

Ober et al. (2021) has investigated the limitations and benefits of these models.

We have treated structured control points as our version of inducing points; and by parametrising them with a Bézier buttress, we limit the expansion of grids to linear growth in parameters. We are not the first to consider the Bernstein polynomials as a basis for learning functions. Petrone (1999b)

used it to model kernel estimate probability density functions, and several follow up works

(Petrone, 1999a; Petrone and Wasserman, 2002). Hug et al. (2020) recently introduced Bézier GPs, but with a focus on time series (Hug et al., 2022). Our emphasis has been on spatial input; even the Bézier surface literature contain close to nothing on more than -dimensional surfaces.

4. Evaluation

We split our evaluation into four parts. First, we visually inspect the posterior on a one dimensional toy dataset to show how the control points behave, and indicate that there indeed is a stationary-like behaviour on the domain of the hypercube. Next, we test empirically on some standard UCI Benchmark datasets, to gives insight into when Bézier GPs are applicable. After that, we switch to tall and wide data – large both in the input dimension and in number of data points. These experiments give certainty that the method delivers on its key promise: scalability. Lastly, we turn our eyes to the method itself and investigate how performance is influenced by the ordering of dimensions.

Care is needed in optimising a Bézier GP – not all parameters are born equal. We split optimisation into two phases. First, we optimise all variational parameters, keeping the likelihood variance fixed as , with being the number of control points. After this initial phase, we optimise with all variational parameters fixed. We let both phases run for 10000 iterations with a mini-batch size of 500, for all datasets. Both phases use the Adam optimiser (Kingma and Ba, 2015), the first phase with learning rate , and the second with learning rate

. If not following a such a bespoke training scheme, we see a tendency for the posterior to revert to the prior, because the KL-term becomes too dominating initially. This training scheme is designed for the Gaussian likelihood, but we wish to emphasise that, in principle, the loss function is accurate for any choice of likelihood.

4.1. One dimensional visual inspection

We hypothesised the objective function, Eq. 16, would ensure, within the hypercube domain, that reverts to its prior in regions where data is scarce. To verify this we construct a small dataset to inspect. We generate one-dimensional inputs uniformly in the regions and

; we sample 20 observation in each region. The responsive variable is generated as

. According to the hypothesis, should in the region , tend towards zero in mean, and increase its variation here. We use a BézierGP of order to model the observations, since they are highly non-linear. Figure 4 shows the posterior distribution of to the left; we observe tends towards the prior in the middle region. The middle plot illustrates the distribution, both prior and posterior, of the control points. There is a clear tendency for the central-most points to align the posterior and posterior, enforcing this behaviour in . The non-equal priors are due to the inverse-squared Bernstein adjusted prior which ensures a uniform variation in over the domain, see Figure 2. The plot to the right in Figure 4 shows the behaviour foundational to practitioners of Bayesian optimisation and active learning etc., the variance increase away from data regions.

Figure 4. Left: Posterior distribution of . Middle: Posterior and prior distribution of control points. Right: Posterior variance as function over the domain. Variance increases in scarce data regions.
power protein energy boston bike keggdirected concrete elevators
Test log-likelihood
SGPR
SimplexGP NA NA
BezierGP
Test RMSE
SGPR
SimplexGP NA NA
BezierGP
Table 1. Results on eight standard UCI Benchmark datasets. We list test-set log-likehood and RMSE, both are averages over train/test splits. On top are test log-likelihoods (higher is better), and bottom is test RMSE (lower is better). NA indicates Cholesky error.

4.2. UCI Benchmark

We evaluate on eight small to mid-size real world datasets commonly used to benchmark regression (Hernandez-Lobato and Adams, 2015). We split each dataset into train/test-split with the ratio . We do this over random splits and report test set RMSE and log-likelihood average and standard deviation over splits. We choose baselines to be SGPR, following the method from Titsias (2009); we do both for and inducing variables. SimplexGP is another baseline, they suggest their approximation is beneficial for semi-high dimensional inputs (between and ) (Kapoor et al., 2021), hence they are an obvious baseline. SimplexGP usually use a validation set to choose the final model. This is due to a highly noisy optimisation scheme using a high error-tolerance () for conjugate gradients. We remedy this by setting the error-tolerance to (), which harms scalability, but we can omit using a validation set for better comparability. Wang et al. (2019) recommend this error-tolerance, but remark it is more stable for RMSE than for log-likelihood. For BézierGP, we fix the number of permutations to , and vary the order in . The order is identical over input dimensions. The inputs are pre-processed such that the training set is contained in .

Table 1 contains the results of this experiment. We make the following observations about our presented BézierGP. On keggdirected and elevators there are test points outside the defined domain on some splits, which cause the RMSE to be extreme, but the likelihood is more forgiving. This highlights the constraint of our model: it needs a box-bounded domain to be a priori known. We could not reproduce results from Kapoor et al. (2021) on keggdirected, the optimisation was too noisy and with no use of validation. On concrete, boston and energy we see overfitting tendencies. Even though BézierGP is the optimal choice on energy there is a mismatch between train and test error. On concrete this shows in better test RMSE, than the baselines, but the variance is overfitted yielding non-optimal likelihood. We conjecture this happens because the -ratio is low; which makes it more likely to overfit the control points – especially for higher orders . Knowing these model fallacies, we observe that BézierGP outperforms on baselines on multiple datasets, most notably the -dimensional bike dataset.

4.3. Large scale regression

Figure 5. Test negative log-likelihood (lower is better) and RMSE (lower is better) for large scale regression. Colours indicate the origin of numbers: red from Salimbeni and Deisenroth (2017), purple from Wang et al. (2019), and cyan from Kapoor et al. (2021). Orange is ours. We observe our BézierGP is highly competitive on large datasets.

Figure 5 shows the results of regression tasks in regimes of high dimensions and one in high number of observations. Here, we follow exactly the experimental setup of either Salimbeni and Deisenroth (2017) or Wang et al. (2019). If the latter, we use the validation-split they use as training data. Our optimisation scheme for BézierGP is consistent with above, except for slice, where the first training phase runs for iterations. We discard test points that are not in the domain – in no situation did this remove more than of the test set. The number after DGP, denotes the number of hidden layers in a Deep GP (Salimbeni and Deisenroth, 2017), after SGPR and SVGP it denotes the number of inducing points. SVGP refers to the method from Hensman et al. (2013). After B, it denotes the order used in BézierGP.

On year, we observe our (non-deep) model is on-par with -layered Deep GPs, and closer to in RMSE. The highest dimensional dataset, slice, sees us in the low -ratio again, and we are again faced with a too flexible model. This is why we report results for orders and , rather than , since the overfitting kicks in. Even for these small orders the test log-likelihood has high variance and under-performs compared to RMSE. With respect to RMSE it is top-performer signalling again it is overfitting the variance. On the remaining two datasets BézierGP is best-performing among baselines.

4.4. Influence of number of permutations

All experiments so far used ; that is, random permutations of the ordering of dimension used in the Bézier buttress. For a problem with input dimension , there exist possible permutations. Table 2 shows results with varying ; for each dataset, the results are over the same train/test split (). We fixed . Up to some noise in the optimisation phase, we see for the two highest dimensional datasets, protein and bike, performance improves with higher . Bike has over reduction in RMSE from to .

power () protein () bike ()
RMSE LL RMSE LL RMSE LL
Table 2. Performance in test RMSE and log-likelihood against the number of permutations, , on three datasets. In parenthesis shows the number of possible permutations of input dimensions. Higher dimensional datasets show improving performance with increasing .

Table 2

emphasises the results we have presented are not optimised over hyperparameter

and . They also illustrate an interesting direction of future research: optimising these hyperparameters. We chose permutations by random sampling, but choosing them in a principled deliberate manner could yield good performance with a computationally manageable . This result indicates, at least, protein and bike would see increased performance in Table 1 from better (or just more) permutations.

5. Discussion

We introduced the Bézier Gaussian Process – a GP, with a polynomial kernel in the Bernstein basis, that scales to a large number of observations, and remains space-filling for high number of input features (limited to a box-bounded domain). We illustrated that, with slight adjustments, the prior and posterior have similar behaviour to ‘usual’ stationary kernels. We presented the Bézier buttress, a weighted graph to which we amortise the inference, rather than inferring the control points themselves. The Bézier buttress allows GP inference without any matrix inversion. Using the Bézier buttress, we inferred control points for the high-dimensional slice dataset.

We highlighted weaknesses of the proposed model: most crucially the tendency of overfitting when the -ratio is low. The results demonstrate scalability in both and , but does not solve the short, but wide problem. The paper did not optimise over the hyperparameters of the proposed kernel, namely and , but it showcased briefly that doing so might enhance BézierGPs empirically; especially smart selection of the permutations is an interesting direction for future research.

Acknowledgements

MJ is supported by the Carlsberg Foundation.

References

  • J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani (2017) Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476. Cited by: §3.
  • D. R. Burt, C. E. Rasmussen, and M. van der Wilk (2020) Convergence of sparse variational inference in Gaussian processes regression.

    Journal of Machine Learning Research (JMLR)

    21 (131), pp. 1–63.
    Cited by: Bézier Gaussian Processes for Tall and Wide Data.
  • L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther (1999) Efficient approaches to gaussian process classification. Advances in neural information processing systems 12. Cited by: §3.
  • A. Damianou and N. D. Lawrence (2013) Deep Gaussian processes. In

    Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS)

    ,
    Cited by: §3.
  • J. Gardner, G. Pleiss, R. Wu, K. Weinberger, and A. Wilson (2018) Product kernel interpolation for scalable gaussian processes. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, A. Storkey and F. Perez-Cruz (Eds.), Proceedings of Machine Learning Research, Vol. 84, pp. 1407–1416. External Links: Link Cited by: Bézier Gaussian Processes for Tall and Wide Data.
  • M. N. Gibbs and D. J. MacKay (2000)

    Variational gaussian process classifiers

    .
    IEEE Transactions on Neural Networks 11 (6), pp. 1458–1464. Cited by: §3.
  • J. Hensman, N. Fusi, and N. D. Lawrence (2013) Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artifical Intelligence (UAI), Cited by: §3, §4.3.
  • J. M. Hernandez-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1861–1869. External Links: Link Cited by: §4.2.
  • T. Hildebrandt and I. Schoenberg (1933)

    On linear functional operations and the moment problem for a finite interval in one or several dimensions

    .
    Annals of Mathematics, pp. 317–328. Cited by: §2.
  • M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley (2013) Stochastic variational inference. Journal of Machine Learning Research. Cited by: §2.1.
  • R. Hug, S. Becker, W. Hübner, M. Arens, and J. Beyerer (2022) Bézier curve gaussian processes. arXiv preprint arXiv:2205.01754. Cited by: §3.
  • R. Hug, W. Hübner, and M. Arens (2020) Introducing probabilistic bézier curves for n-step sequence prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10162–10169. Cited by: §3.
  • S. Kapoor, M. Finzi, K. A. Wang, and A. G. G. Wilson (2021) SKIing on simplices: kernel interpolation on the permutohedral lattice for scalable gaussian processes. In International Conference on Machine Learning, pp. 5279–5289. Cited by: §3, Figure 5, §4.2, §4.2, Bézier Gaussian Processes for Tall and Wide Data.
  • H. Kim, B. K. Mallick, and C. C. Holmes (2005) Analyzing nonstationary spatial data using piecewise gaussian processes. Journal of the American Statistical Association 100 (470), pp. 653–668. Cited by: §3.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), Cited by: §4.
  • D. Nguyen-Tuong, J. Peters, and M. Seeger (2008) Local gaussian process regression for real time online model learning. Advances in neural information processing systems 21. Cited by: §3.
  • S. W. Ober, C. E. Rasmussen, and M. van der Wilk (2021) The promises and pitfalls of deep kernel learning. In Proceedings of the 37th Conference on Uncertainty in Artifical Intelligence (UAI), Cited by: §3.
  • S. Petrone and L. Wasserman (2002) Consistency of bernstein polynomial posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (1), pp. 79–100. Cited by: §3.
  • S. Petrone (1999a) Bayesian density estimation using bernstein polynomials. Canadian Journal of Statistics 27 (1), pp. 105–126. Cited by: §3.
  • S. Petrone (1999b) Random bernstein polynomials. Scandinavian Journal of Statistics 26 (3), pp. 373–393. Cited by: §3.
  • H. Prautzsch, W. Boehm, and M. Paluszny (2002) Bézier and b-spline techniques. Vol. 6, Springer. Cited by: §1.
  • J. Quinonero-Candela and C. E. Rasmussen (2005) A unifying view of sparse approximate gaussian process regression. The Journal of Machine Learning Research 6, pp. 1939–1959. Cited by: §3.
  • H. Salimbeni and M. Deisenroth (2017) Doubly stochastic variational inference for deep gaussian processes. Advances in neural information processing systems 30. Cited by: §3, Figure 5, §4.3.
  • E. Snelson and Z. Ghahramani (2005) Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems 18. Cited by: §1.
  • M. Titsias (2009) Variational learning of inducing variables in sparse gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §1, §3, §4.2.
  • G. Tran, D. Milios, P. Michiardi, and M. Filippone (2021) Sparse within sparse gaussian processes using neighbor information. In International Conference on Machine Learning, pp. 10369–10378. Cited by: §3.
  • K. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Q. Weinberger, and A. G. Wilson (2019) Exact gaussian processes on a million data points. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: Figure 5, §4.2, §4.3.
  • C. K. I. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. MIT Press Cambridge, MA. Cited by: §1.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §3.
  • A. Wilson and H. Nickisch (2015) Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International conference on machine learning, pp. 1775–1784. Cited by: §3, Bézier Gaussian Processes for Tall and Wide Data.
  • L. Wu, A. Miller, L. Anderson, G. Pleiss, D. Blei, and J. Cunningham (2021) Hierarchical inducing point gaussian process for inter-domian observations. In International Conference on Artificial Intelligence and Statistics, pp. 2926–2934. Cited by: §3.
  • L. Wu, G. Pleiss, and J. Cunningham (2022) Variational nearest neighbor gaussian processes. arXiv preprint arXiv:2202.01694. Cited by: §3.

Appendix A Computations in the Bézier Buttress

This section seeks to explain how the KL-divergence is computed using the Bézier buttress. It further explains more detailed parametrisation in the architecture. For completeness we here give a forward pass to compute .

(A.1)

here we make the choice that . This ensures positive weights and hence a positive output for the variance of . comes from the inverse squared Bernstein adjusted prior (see Section 2). are free parameters to be inferred in the variational posterior. This parametrisation makes computing the KL terms easier.

We remark all the following calculation are only for one Bézier buttress. Are there multiple Bézier buttresses, with different orderings of layers, the computations are equivalent for all of them.

For computing the KL we first recall from the paper

(A.2)
(A.3)

For easier reference we declare

(A.4)
(A.5)
(A.6)

We remind again that “hat” notation refers to parameters from variational posterior . We also remind that the prior variance is given . Because we have included in the parametrisation in the posterior , they are cancelling out in the expression in and . We get

(A.7)

where is element-wise on the matrices.

For we make the observation, based again on cancelling out in the fraction, that

(A.8)

That is, summing over is basically counting how many paths (i.e. control points) use . That is determined as . Here . Hence,

(A.9)

where denotes summing the elements in the matrix.

Notice how the variational parametrisation of was carefully chosen for easily computing and .

is more close to what described in main paper. We simply just need to square all the weights and correct with the prior variance. That is, correct with . Hence,

(A.10)

where here is the diagonal matrix with along its diagonal, for . Notice further here are the weights in the mean Bézier buttress.

Now

(A.11)

all of which are computed in a single forward pass in the Bézier buttress. is the number of all control points (in one buttress).

Appendix B Numerical results

For reproducibility we give the values used to generate Figure 4. These are given in Table 3.

year buzz houseelectric slice
Test log-likelihood
B20: B20: B20: B3: B5:
Test RMSE
B20: B20: B20: B3: B5:
Table 3. Numerical values used create Figure 4. Here is listed average and standard deviation over 3 splits. On year the test-set was not standardised to compare with baselines there.