1. Background
Bézier curves and surfaces are parametrised geometric objects that have found great usage in computer-aided design and robotics (Prautzsch et al., 2002). The simplest Bézier curve is the linear interpolation of two points $\beta_0$ and $\beta_1$ in $\mathbb{R}^d$; the Bézier curve of order $1$,

(1.1) $\gamma(t) = (1-t)\,\beta_0 + t\,\beta_1, \qquad t \in [0,1].$

Higher-order curves are generalised in the following way: the order-$n$ Bézier curve is defined as

(1.2) $\gamma(t) = \sum_{i=0}^{n} B_i^n(t)\,\beta_i, \qquad t \in [0,1].$

In Bézier terms, $\beta_0, \ldots, \beta_n$ are referred to as control points. Notice an order-$n$ Bézier curve has $n+1$ control points. $B_i^n$ denotes the $i$th Bernstein polynomial of order $n$. They are defined as

(1.3) $B_i^n(t) = \binom{n}{i}\, t^i (1-t)^{n-i}.$
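To make Eqs. (1.2)–(1.3) concrete, here is a minimal NumPy sketch; the function names are ours, not from the paper:

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """i-th Bernstein polynomial of order n: B_i^n(t) = C(n,i) t^i (1-t)^(n-i)."""
    t = np.asarray(t, dtype=float)
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def bezier_curve(control_points, t):
    """Order-n Bezier curve (Eq. 1.2); control_points has shape (n+1, d)."""
    P = np.asarray(control_points, dtype=float)
    n = P.shape[0] - 1
    basis = np.array([bernstein(n, i, t) for i in range(n + 1)])  # (n+1,)
    return basis @ P
```

Since the Bernstein basis forms a partition of unity on $[0,1]$, the curve is a convex combination of its control points and interpolates $\beta_0$ at $t=0$ and $\beta_n$ at $t=1$.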
By going from curves to surfaces we wish to extend from the scalar $t \in [0,1]$ to a spatial input $\mathbf{x} = (x_1, \ldots, x_D) \in [0,1]^D$, for $D \geq 2$. Here, we can define Bézier surfaces as

(1.4) $\varphi(\mathbf{x}) = \sum_{i_1=0}^{n_1} \cdots \sum_{i_D=0}^{n_D} B_{i_1}^{n_1}(x_1) \cdots B_{i_D}^{n_D}(x_D)\, \beta_{i_1,\ldots,i_D}.$
Figure 1 gives a visual illustration of a 2-dimensional surface embedded in $\mathbb{R}^3$. In the literature, it is difficult to find any studies of surfaces for $D > 2$. This paper targets especially this high-dimensional input case. We restrict our output dimension to one, the regression problem, but the methods naturally extend to multidimensional outputs. The red points in Figure 1 show how each control point has an associated location in the input space. They are placed on a grid-like structure, and the order of each dimension determines how fine the meshgrid of the hypercube is; i.e. how densely the input space is filled.
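The tensor-product form of Eq. (1.4) can be evaluated by contracting one input dimension at a time; a small sketch for arbitrary $D$ (names are ours):

```python
import numpy as np
from math import comb

def bernstein_basis(n, t):
    """Vector (B_0^n(t), ..., B_n^n(t))."""
    return np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])

def bezier_surface(control, x):
    """Evaluate Eq. (1.4): control has shape (n_1+1, ..., n_D+1) holding scalar
    control points; x is a point in [0,1]^D."""
    out = np.asarray(control, dtype=float)
    for xj in x:  # contract one dimension per loop iteration
        out = np.tensordot(bernstein_basis(out.shape[0] - 1, xj), out, axes=(0, 0))
    return float(out)
```

Because each one-dimensional basis sums to one, a constant grid of control points yields a constant surface, which is a convenient correctness check.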
Gaussian processes (GPs) are meticulously studied in probability and statistics (Williams and Rasmussen, 2006). They provide a way to define a probability distribution over functions. This makes them useful, as priors, for quantifying uncertainty in prediction. They are defined as a probability measure over functions $f: \mathcal{X} \to \mathbb{R}$, such that any finite collection of elements $x_1, \ldots, x_N$ in $\mathcal{X}$ has associated outputs $f(x_1), \ldots, f(x_N)$ following a joint Gaussian distribution. This distribution is fully determined by a mean function $\mu(\cdot)$ and a positive semi-definite kernel function $k(\cdot,\cdot)$. GPs admit exact Bayesian inference. However, exact inference comes at a prohibitive worst-case computational cost of $\mathcal{O}(N^3)$, where $N$ is the number of training points, due to computing the inverse and determinant of the kernel matrix. Sparse Gaussian processes (Snelson and Ghahramani, 2005) overcome this burden by conditioning on $M$ inducing points, reducing the complexity to $\mathcal{O}(NM^2)$, where usually $M \ll N$. The inducing points, denoted $\mathbf{u}$, are then marginalised to obtain an approximate posterior of $f$. The variational posterior mean and variance at a location $\mathbf{x}^*$ are then given by

(1.5) $\mu(\mathbf{x}^*) = k(\mathbf{x}^*, Z)\, k(Z,Z)^{-1}\, \mathbf{m},$

(1.6) $\sigma^2(\mathbf{x}^*) = k(\mathbf{x}^*, \mathbf{x}^*) - k(\mathbf{x}^*, Z)\, k(Z,Z)^{-1}\big(k(Z,Z) - S\big)\, k(Z,Z)^{-1}\, k(Z, \mathbf{x}^*),$

under the assumption of a constant zero prior mean function. This assumption is easily relaxed if needed. Here $Z$ denotes the inducing locations in the input space, i.e. $\mathbf{u} = f(Z)$, and $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ is the variational distribution over the inducing points. Under the further assumption of Gaussian observation noise, i.e. $y_i = f(\mathbf{x}_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2_n)$, Titsias (2009) showed the optimal $\mathbf{m}$ and $S$ are known analytically. In a sought analogy to Figure 1, $Z$ would be the red points and $\mathbf{u}$ would be the orange.
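The predictive equations (1.5)–(1.6) can be sketched in a few lines of NumPy. The RBF kernel and all names below are our illustrative assumptions, with a generic variational distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row-stacked inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def svgp_predict(Xs, Z, m, S, kern=rbf):
    """Variational predictive mean/variance (Eqs. 1.5-1.6), zero prior mean."""
    Kzz = kern(Z, Z) + 1e-8 * np.eye(len(Z))   # jitter for stability
    Ksz = kern(Xs, Z)
    A = np.linalg.solve(Kzz, Ksz.T).T          # K_sz K_zz^{-1}
    mean = A @ m
    var = np.diag(kern(Xs, Xs) - A @ (Kzz - S) @ A.T)
    return mean, var
```

As a sanity check, predicting at the inducing locations themselves with $S = k(Z,Z)$ recovers the prior variance, and the mean recovers $\mathbf{m}$.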
2. Bézier Gaussian Processes
Inspired by Bézier surfaces, we construct a Gaussian process as

(2.1) $f(\mathbf{x}) = \sum_{i_1=0}^{n_1} \cdots \sum_{i_D=0}^{n_D} B_{i_1}^{n_1}(x_1) \cdots B_{i_D}^{n_D}(x_D)\, \beta_{i_1,\ldots,i_D},$

where $\beta_{i_1,\ldots,i_D}$ are Gaussian variables and $\beta_{i_1,\ldots,i_D} \sim \mathcal{N}(m_{i_1,\ldots,i_D}, \sigma^2_{i_1,\ldots,i_D})$. Here, $x_j \in [0,1]$ for $j = 1, \ldots, D$. It is easy to verify that $f$ satisfies the definition of a GP since it, for any $\mathbf{x}$, is a scaled sum of Gaussians. We assume that all $\beta_{i_1,\ldots,i_D}$ are fully independent. With that assumption, we can make the following observation for the mean and kernel function:

(2.2) $\mu(\mathbf{x}) = \sum_{i_1=0}^{n_1} \cdots \sum_{i_D=0}^{n_D} B_{i_1}^{n_1}(x_1) \cdots B_{i_D}^{n_D}(x_D)\, m_{i_1,\ldots,i_D},$

(2.3) $k(\mathbf{x}, \mathbf{x}') = \sum_{i_1=0}^{n_1} \cdots \sum_{i_D=0}^{n_D} B_{i_1}^{n_1}(x_1) B_{i_1}^{n_1}(x'_1) \cdots B_{i_D}^{n_D}(x_D) B_{i_D}^{n_D}(x'_D)\, \sigma^2_{i_1,\ldots,i_D}.$

The Bernstein polynomials can approximate any continuous function given the order is large enough, thus they make a good basis for GP regression (Hildebrandt and Schoenberg, 1933). Naturally, selecting a prior over $f$ comes down to selecting a prior over the random control points $\beta$. The most common prior mean in GP regression, the constant zero function, is then easily obtained by $m_{i_1,\ldots,i_D} = 0$ for all $i_1, \ldots, i_D$. By construction, the choice of $\sigma^2_{i_1,\ldots,i_D}$ needs consideration to yield a convenient prior over $f$. Mindlessly setting $\sigma^2_{i_1,\ldots,i_D} = 1$ would make the variance of $f$ collapse to zero quickly in the central region of the domain, especially as the number of dimensions grows. This, of course, gives a much too narrow prior over $f$.
Figure 2 (middle) shows that in the central region the standard deviation of $f$ is smaller due to the nature of the Bernstein polynomials. If we consider instead a two-dimensional input, the standard deviation would collapse even more, as we would then see the shrinking effect for both dimensions and multiply them. We can, however, adjust for this. We define the inverse squared Bernstein adjusted prior to counter this effect. In all dimensions $j = 1, \ldots, D$, let $v^{(j)} = (v^{(j)}_0, \ldots, v^{(j)}_{n_j})$ solve

(2.4) $\sum_{i=0}^{n_j} B_i^{n_j}(t)^2\, v^{(j)}_i = 1,$

where $n_j$ denotes the order of dimension $j$. Then setting $\sigma^2_{i_1,\ldots,i_D} = \prod_{j=1}^{D} v^{(j)}_{i_j}$ ensures that $k(\mathbf{x},\mathbf{x}) = 1$ over the entire domain. Eq. 2.4 is a linear system in the $v^{(j)}_i$, solved such that the constraint holds at a grid of collocation points $t$. This means a prior hardly distinguishable from standard stationary ones such as the RBF kernel. A visual representation of this prior is shown in Figure 2 (right). This adjustment works up to a certain order, after which negative values occur.
Summarising, we introduced a kernel based on Bézier surfaces. An alternative viewpoint is that $f$ is a polynomial GP, but in the Bernstein basis rather than the canonical basis. We remark that $f$ is defined outside the domain $[0,1]^D$; any intuition about the prior there is not considered, and we will not pursue investigating data points outside this domain. Of course, for practical purposes this domain generalises without loss of generality to any rectangular domain $[a_1,b_1] \times \cdots \times [a_D,b_D]$. For presentation purposes we keep writing $[0,1]^D$. Next, we show how we infer an approximate posterior given data.
2.1. Variational inference
Let $\boldsymbol{\beta}$ denote the set of all $\beta_{i_1,\ldots,i_D}$. As the prior of $f$ is fully determined by the random control points $\boldsymbol{\beta}$, the posterior of $f$ is determined by the posterior of these. As per above, we set the prior

(2.5) $p(\beta_{i_1,\ldots,i_D}) = \mathcal{N}\big(0, \sigma^2_{i_1,\ldots,i_D}\big),$

where $\sigma^2_{i_1,\ldots,i_D} = \prod_{j=1}^{D} v^{(j)}_{i_j}$. We utilise variational inference to approximate the posterior of the control points, and hence $f$. This means we introduce variational control points. We assume they are fully independent (usually called the mean-field assumption) and have free parameters for the mean and variance, such that $q(\beta_{i_1,\ldots,i_D}) = \mathcal{N}\big(\hat{m}_{i_1,\ldots,i_D}, \hat{\sigma}^2_{i_1,\ldots,i_D}\big)$.
Assume we have observed data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. The key quantity in variational inference is the Kullback-Leibler divergence between the true posterior $p(\boldsymbol{\beta} \mid \mathcal{D})$ and the variational approximation $q(\boldsymbol{\beta})$ – which we denote $\mathrm{KL}(q \,\|\, p)$. The smaller the divergence, the better the approximation of the true posterior. Without access to the true posterior, this quantity is not computable. However, it has been shown this divergence is equal to the slack in Jensen's inequality used on the log-marginal likelihood:

(2.6) $\log p(\mathbf{y}) = \mathcal{L} + \mathrm{KL}\big(q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta} \mid \mathcal{D})\big),$

(2.7) $\mathcal{L} = \mathbb{E}_{q(\boldsymbol{\beta})}\big[\log p(\mathbf{y} \mid \boldsymbol{\beta})\big] - \mathrm{KL}\big(q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta})\big).$

Knowing this, we can approximate the true posterior with $q$ by maximising Eq. (2.7). This is the evidence lower bound, and it is maximised with respect to the variational parameters $\hat{m}_{i_1,\ldots,i_D}$ and $\hat{\sigma}^2_{i_1,\ldots,i_D}$. It is fully analytical once these variational parameters are known.
We assume our observation model is disturbed with additive Gaussian noise, which in other words means our likelihood is Gaussian,

(2.8) $p(y_i \mid \boldsymbol{\beta}) = \mathcal{N}\big(y_i \mid f(\mathbf{x}_i), \sigma^2_n\big),$

for each $i$, and we assume they are independent conditioned on $\boldsymbol{\beta}$. With these assumptions the first term in Eq. (2.7) becomes

(2.9) $\mathbb{E}_{q}\big[\log p(\mathbf{y} \mid \boldsymbol{\beta})\big] = \sum_{i=1}^{N} \left( \log \mathcal{N}\big(y_i \mid \hat{\mu}(\mathbf{x}_i), \sigma^2_n\big) - \frac{\hat{k}(\mathbf{x}_i, \mathbf{x}_i)}{2\sigma^2_n} \right),$

where $\hat{\mu}$ and $\hat{k}$ are given as Eq. (2.2) and (2.3) respectively, but with the variational parameters $\hat{m}_{i_1,\ldots,i_D}$ and $\hat{\sigma}^2_{i_1,\ldots,i_D}$ used.
The second term in Eq. (2.7) enjoys the independence of control points, splitting into sums:

(2.10) $\mathrm{KL}\big(q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta})\big) = \sum_{i_1=0}^{n_1} \cdots \sum_{i_D=0}^{n_D} \mathrm{KL}\big(q(\beta_{i_1,\ldots,i_D}) \,\|\, p(\beta_{i_1,\ldots,i_D})\big),$

(2.11) $\mathrm{KL}\big(q(\beta_{i_1,\ldots,i_D}) \,\|\, p(\beta_{i_1,\ldots,i_D})\big) = \frac{1}{2}\left( \frac{\hat{\sigma}^2_{i_1,\ldots,i_D} + \hat{m}^2_{i_1,\ldots,i_D}}{\sigma^2_{i_1,\ldots,i_D}} - 1 + \log \frac{\sigma^2_{i_1,\ldots,i_D}}{\hat{\sigma}^2_{i_1,\ldots,i_D}} \right).$
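Since both the prior and the variational posterior factorise into univariate Gaussians, the KL terms of Eqs. (2.10)–(2.11) reduce to a closed form; a minimal sketch (names are ours):

```python
import numpy as np

def kl_independent_gaussians(m_hat, v_hat, v_prior):
    """Sum of KL( N(m_hat, v_hat) || N(0, v_prior) ) over independent control
    points (Eqs. 2.10-2.11), with a zero-mean prior."""
    m_hat, v_hat, v_prior = map(np.asarray, (m_hat, v_hat, v_prior))
    return 0.5 * np.sum((v_hat + m_hat**2) / v_prior - 1.0
                        + np.log(v_prior / v_hat))
```

The divergence vanishes exactly when the posterior matches the prior, which is the "revert to the prior" behaviour discussed below.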
When inspecting the evidence lower bound, Eq. (2.7), we see it has a data term forcing control points to fit the data, and a KL-term making control points revert to the prior. Knowing how control points are allocated in the input domain, we expect control points in regions with no data to revert to the prior. This is similar to what stationary kernels do in such regions. We verify this visually in Section 4.
All together, Bézier GPs can be adjusted to have priors similar to stationary GPs, and have analogous posterior behaviour, which is favourable to many practitioners. But Bézier GPs scale. None of the terms in the evidence lower bound require matrix inversions or determinants. It is simple to minibatch over the data points, utilising stochastic variational inference (Hoffman et al., 2013), to scale to large $N$. However, nearly all terms require evaluations of huge sums if the input dimension $D$ is high. The next section is aimed at this problem.
2.2. Scalability with the Bézier buttress
Until this point, we have omitted addressing the number of random control points needed for Bézier GPs. Let us denote this number $N_\beta$. It can quickly be checked that $N_\beta = \prod_{j=1}^{D}(n_j + 1)$. This implies that to evaluate Eq. (2.2) we must sum over $N_\beta$ summands, which, as $D$ increases, quickly becomes computationally cumbersome: $N_\beta$ increases exponentially with $D$. It is justifiable to view the random control points as inducing points; after all, they are Gaussian variables in the output space. Thus, it would be extremely valuable to manage exponentially many of them.
To overcome this, we introduce the Bézier buttress¹ (¹A buttress is an architectural structure that provides support to a building.). We assume the parameters of the random control points, say the means $m_{i_1,\ldots,i_D}$, can be parametrised as $m_{i_1,\ldots,i_D} = \prod_{j=1}^{D} w^{(j)}_{i_{j-1}, i_j}$, where $i_0$ indexes the fixed source node. This assumption is the key of the Bézier buttress. Figure 3 provides visualisation. It visualises a source-sink graph, where each unique path from source to sink represents one unique control point under the above parametrisation. The cyan highlighted path represents one such control point, where we multiply the values along the path from source to sink. Notice the last edges have value $1$.
In the Bézier buttress there are $D$ layers, one for each input dimension, and $n_j + 1$ nodes in layer $j$. Borrowing from neural network terminology, a forward pass is a sequential series of matrix multiplications which are element-wise warped with non-linearities, such as $\tanh$ or ReLU. If we let our sequence of matrices be $W_1, \ldots, W_D$, where $W_j$ is the matrix with entries $w^{(j)}_{i_{j-1}, i_j}$, then fixing 'the input' to $1$ and the last matrix to a column of ones, a forward pass is

(2.12) $W_1 W_2 \cdots W_D\, \mathbf{1} = \sum_{i_1=0}^{n_1} \cdots \sum_{i_D=0}^{n_D} m_{i_1,\ldots,i_D}.$

We see this forward pass computes the sum over all control point means. Naturally, we construct a Bézier buttress for summing over variances too, which must be restricted to positive weights. It was these sums that formed our bottleneck; in this parametrisation each is just a sequence of matrix products.
What about computing $\mu(\mathbf{x})$? It comes down to a use of 'non-linearities' in the buttress. Multiplying element-wise with the Bernstein polynomials, as seen in Figure 3 (visualised only on the 3rd layer), a forward pass computes either $\mu(\mathbf{x})$, or $k(\mathbf{x}, \mathbf{x})$ if using squared Bernstein polynomials. Each control point is then multiplied by exactly the correct polynomials on its way from source to sink. Notice, the 'input' is fixed to $1$ at the source, but the observed $\mathbf{x}$ appears via the Bernstein polynomials along the way. We can write this too as a sequence of matrix products,

(2.13) $\mu(\mathbf{x}) = W_1 \Delta_1(x_1)\, W_2 \Delta_2(x_2) \cdots W_D \Delta_D(x_D)\, \mathbf{1},$

where $\Delta_j(x_j)$ is the diagonal matrix with the Bernstein polynomials $B_0^{n_j}(x_j), \ldots, B_{n_j}^{n_j}(x_j)$ on its diagonal. For the variance of $f(\mathbf{x})$ there exists a similar expression, of course with squared polynomials, and with the positive weights associated with the variance Bézier buttress.
On this inspection, all terms needed to compute Eq. (2.9) are available. All terms for the KL divergences are algebraic manipulations of Eq. (2.12) – these are explicit in the supplementary material. The key takeaway for the Bézier buttress is that we parametrise each random control point as a product of weights. This can be seen as an amortisation, and we do inference on these weights rather than the control points themselves. Hence, no matrix inversions are needed to backpropagate through our objective, and a forward pass is only a sequence of matrix products.
2.3. Marginalising matrix commutativity
Matrix multiplication is not commutative. This implies the ordering of the matrices in Eq. (2.12) matters, which again implies that how we order the input dimensions in the Bézier buttress is of importance. This is the price for the computational benefit this parametrisation gives. An ordering of the input dimensions is somewhat unnatural for spatial regression, so we present a way to overcome this in an approximate Bayesian manner. Define $f(\mathbf{x}) = \sum_{p=1}^{P} f_p(\mathbf{x})$, where each of these individual $f_p$, $p = 1, \ldots, P$, are Bézier GPs with a random permutation of the ordering in the associated Bézier buttress. In other words, we let $f$ be an ensemble of Bézier GPs to (approximately) marginalise over all possible orderings. The number of all possible orderings quickly becomes too large to account for; in practice, we set $P$ in a feasible region, which gives satisfactory empirical performance.
Another lens on this is that each control point is a sum of $P$ control points – each control point's standard deviation is scaled by $1/\sqrt{P}$, to obtain the same prior as so far discussed. Oddly, we have then circumvented the problem of too many control points by introducing even more control points. That is, each control point mean is parametrised as $m_{i_1,\ldots,i_D} = \sum_{p=1}^{P} \prod_{j=1}^{D} w^{(p,j)}_{i_{j-1}, i_j}$, where the layers of buttress $p$ are ordered according to a random permutation $\pi_p$ of $(1, \ldots, D)$. We remind again that similar expressions exist for the control point variances, restricted to positive weights. As remarked, inference comes down to a forward pass in the Bézier buttress, a sequence of matrix multiplications. Assume all dimensions are of order $n$; then the computational complexity of one forward pass is $\mathcal{O}(Dn^2)$. We need $P$ forward passes to marginalise the ordering, and $N$ forward passes, one for each observation, leaving a final complexity of $\mathcal{O}(NPDn^2)$. Linear in $N$ and $D$.
3. Related Work
Variational inference in GPs was initially considered by Csató et al. (1999) and Gibbs and MacKay (2000). In recent times, the focus has shifted to scalable solutions to accommodate the big-data era. In this respect, Titsias (2009) took a variational approach to inducing point methods (Quiñonero-Candela and Rasmussen, 2005); later, Hensman et al. (2013) further enhanced scalability to allow for minibatching and a wider class of likelihoods. Still, the need for more inducing points remains important, especially as the number of input features grows.
The response to this has mostly revolved around exploiting some structure of the kernel. Wilson and Nickisch (2015) and Wu et al. (2021) exploit specific structure that allows for fast linear-algebra methods; similar to our method, inducing locations tend to lie on a grid. These grids expand fast as the input dimension increases, as also pointed out earlier in the article. Kapoor et al. (2021) remedy this by considering, instead of a rectangular grid, the permutohedral lattice, such that each observation only embeds to $D+1$ neighbours instead of $2^D$, as in (Wilson and Nickisch, 2015).
Another approach to allowing for more inducing points is incorporating nearest-neighbour search in the approximation (Kim et al., 2005; Nguyen-Tuong et al., 2008). Tran et al. (2021) introduced sparse-within-sparse, where they keep many inducing points in memory, but each time search for the ones nearest to an observation. They discard the remaining ones, as these have little to no influence on that observation. Wu et al. (2022) made a variational counterpart to this method.
Lastly, when dealing with high-dimensional inputs it is worth mentioning feature learning; that is, learning lower-dimensional features where GPs have better performance. The success of deep models has been transported to GPs by Damianou and Lawrence (2013) and later scaled to large datasets in (Salimbeni and Deisenroth, 2017). Another approach is Deep Kernel Learning (Wilson et al., 2016; Bradshaw et al., 2017), where feature extraction happens inside the kernel function; lately, Ober et al. (2021) have investigated the limitations and benefits of these models. We have treated structured control points as our version of inducing points; and by parametrising them with a Bézier buttress, we limit the expansion of grids to linear growth in parameters. We are not the first to consider the Bernstein polynomials as a basis for learning functions. Petrone (1999b) used them to estimate probability density functions, with several follow-up works (Petrone, 1999a; Petrone and Wasserman, 2002). Hug et al. (2020) recently introduced Bézier GPs, but with a focus on time series (Hug et al., 2022). Our emphasis has been on spatial input; even the Bézier surface literature contains close to nothing on more than 2-dimensional surfaces.

4. Evaluation
We split our evaluation into four parts. First, we visually inspect the posterior on a one-dimensional toy dataset to show how the control points behave, and to indicate that there indeed is stationary-like behaviour on the domain of the hypercube. Next, we test empirically on standard UCI benchmark datasets, to give insight into when Bézier GPs are applicable. After that, we switch to tall and wide data – large both in the input dimension and in the number of data points. These experiments give certainty that the method delivers on its key promise: scalability. Lastly, we turn our eyes to the method itself and investigate how performance is influenced by the ordering of dimensions.
Care is needed in optimising a Bézier GP – not all parameters are born equal. We split optimisation into two phases. First, we optimise all variational parameters, keeping the likelihood variance fixed at a value determined by the number of control points. After this initial phase, we optimise the likelihood variance with all variational parameters fixed. We let both phases run for 10000 iterations with a minibatch size of 500, for all datasets. Both phases use the Adam optimiser (Kingma and Ba, 2015), each phase with its own learning rate. If not following such a bespoke training scheme, we see a tendency for the posterior to revert to the prior, because the KL-term becomes too dominant initially. This training scheme is designed for the Gaussian likelihood, but we wish to emphasise that, in principle, the loss function is valid for any choice of likelihood.
4.1. One dimensional visual inspection
We hypothesised the objective function, Eq. (2.7), would ensure, within the hypercube domain, that $f$ reverts to its prior in regions where data is scarce. To verify this we construct a small dataset to inspect. We generate one-dimensional inputs uniformly in two disjoint regions, one near each end of the domain; we sample 20 observations in each region. The response variable is generated as a noisy non-linear function of the input. According to the hypothesis, $f$ should, in the data-free middle region, tend towards zero in mean and increase its variance there. We use a Bézier GP of high order to model the observations, since they are highly non-linear. Figure 4 (left) shows the posterior distribution of $f$; we observe $f$ tends towards the prior in the middle region. The middle plot illustrates the distribution, both prior and posterior, of the control points. There is a clear tendency for the central-most points to align the posterior and the prior, enforcing this behaviour in $f$. The non-equal priors are due to the inverse squared Bernstein adjusted prior, which ensures uniform variation in $f$ over the domain; see Figure 2. The plot to the right in Figure 4 shows the behaviour foundational to practitioners of Bayesian optimisation, active learning, etc.: the variance increases away from data regions.

Table 1. Test log-likelihood and test RMSE for SGPR, Simplex-GP, and BézierGP on the power, protein, energy, boston, bike, keggdirected, concrete, and elevators datasets (Simplex-GP is NA on two datasets).
4.2. UCI Benchmark
We evaluate on eight small-to-midsize real-world datasets commonly used to benchmark regression (Hernandez-Lobato and Adams, 2015). We split each dataset into train/test splits, repeat this over several random splits, and report test-set RMSE and log-likelihood averages and standard deviations over the splits. We choose SGPR as a baseline, following the method from Titsias (2009), with two different numbers of inducing variables. Simplex-GP is another baseline; its authors suggest the approximation is beneficial for semi-high-dimensional inputs (Kapoor et al., 2021), hence it is an obvious comparison. Simplex-GP usually uses a validation set to choose the final model, due to a highly noisy optimisation scheme with a high error tolerance for conjugate gradients. We remedy this by lowering the error tolerance, which harms scalability, but lets us omit the validation set for better comparability. Wang et al. (2019) recommend this error tolerance, but remark it is more stable for RMSE than for log-likelihood. For BézierGP, we fix the number of permutations $P$ and vary the order $n$, which is identical over input dimensions. The inputs are preprocessed such that the training set is contained in $[0,1]^D$.
Table 1 contains the results of this experiment. We make the following observations about our presented BézierGP. On keggdirected and elevators there are test points outside the defined domain on some splits, which causes the RMSE to be extreme, but the likelihood is more forgiving. This highlights the constraint of our model: it needs a box-bounded domain to be known a priori. We could not reproduce the results from Kapoor et al. (2021) on keggdirected; the optimisation was too noisy without the use of validation. On concrete, boston and energy we see overfitting tendencies. Even though BézierGP is the optimal choice on energy, there is a mismatch between train and test error. On concrete this shows in better test RMSE than the baselines, but the variance is overfitted, yielding non-optimal likelihood. We conjecture this happens because the ratio of observations to control points is low, which makes it more likely to overfit the control points – especially for higher orders $n$. Knowing these model fallacies, we observe that BézierGP outperforms the baselines on multiple datasets, most notably the bike dataset.
4.3. Large scale regression
Figure 5 shows the results of regression tasks in regimes of high input dimension and of a high number of observations. Here, we follow exactly the experimental setup of either Salimbeni and Deisenroth (2017) or Wang et al. (2019); if the latter, we use the validation split they use as training data. Our optimisation scheme for BézierGP is consistent with the above, except for slice, where the first training phase runs for more iterations. We discard test points that are not in the domain – in no situation did this remove more than a small fraction of the test set. The number after DGP denotes the number of hidden layers in a Deep GP (Salimbeni and Deisenroth, 2017); after SGPR and SVGP it denotes the number of inducing points. SVGP refers to the method from Hensman et al. (2013). After B, it denotes the order used in BézierGP.
On year, we observe our (non-deep) model is on par with layered Deep GPs, and closer still in RMSE. The highest-dimensional dataset, slice, sees us in the low observation-to-parameter regime again, and we are again faced with a too-flexible model. This is why we report results for orders 3 and 5, rather than higher, since overfitting kicks in. Even for these small orders the test log-likelihood has high variance and underperforms compared to RMSE; with respect to RMSE it is the top performer, signalling again that it is overfitting the variance. On the remaining two datasets BézierGP is the best-performing among the baselines.
4.4. Influence of number of permutations
All experiments so far used a fixed number $P$ of random permutations of the ordering of dimensions in the Bézier buttress. For a problem with input dimension $D$, there exist $D!$ possible permutations. Table 2 shows results with varying $P$; for each dataset, the results are over the same train/test split, with the order $n$ fixed. Up to some noise in the optimisation phase, we see for the two highest-dimensional datasets, protein and bike, that performance improves with higher $P$. Bike in particular shows a substantial reduction in RMSE as $P$ grows.

Table 2. Test RMSE and log-likelihood (LL) on the power, protein, and bike datasets for varying numbers of permutations $P$.
Table 2 emphasises that the results we have presented are not optimised over the hyperparameters $P$ and $n$. It also illustrates an interesting direction for future research: optimising these hyperparameters. We chose permutations by random sampling, but choosing them in a principled, deliberate manner could yield good performance with a computationally manageable $P$. This result indicates that, at least, protein and bike would see increased performance in Table 1 from better (or just more) permutations.

5. Discussion
We introduced the Bézier Gaussian Process – a GP, with a polynomial kernel in the Bernstein basis, that scales to a large number of observations and remains space-filling for a high number of input features (limited to a box-bounded domain). We illustrated that, with slight adjustments, the prior and posterior have similar behaviour to 'usual' stationary kernels. We presented the Bézier buttress, a weighted graph to which we amortise the inference, rather than inferring the control points themselves. The Bézier buttress allows GP inference without any matrix inversion. Using the Bézier buttress, we inferred exponentially many control points for the high-dimensional slice dataset.
We highlighted weaknesses of the proposed model: most crucially the tendency to overfit when the ratio of observations to control points is low. The results demonstrate scalability in both $N$ and $D$, but do not solve the short-but-wide problem. The paper did not optimise over the hyperparameters of the proposed kernel, namely $P$ and the orders $n_j$, but it showcased briefly that doing so might enhance BézierGPs empirically; especially smart selection of the permutations is an interesting direction for future research.
Acknowledgements
MJ is supported by the Carlsberg Foundation.
References
Bradshaw, J., Matthews, A. G. de G., and Ghahramani, Z. (2017). Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476.
Burt, D. R., Rasmussen, C. E., and van der Wilk, M. (2020). Convergence of sparse variational inference in Gaussian processes regression. Journal of Machine Learning Research (JMLR) 21(131), pp. 1–63.
Csató, L., Fokoué, E., Opper, M., Schottky, B., and Winther, O. (1999). Efficient approaches to Gaussian process classification. Advances in Neural Information Processing Systems 12.
Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS).
Gardner, J., Pleiss, G., Wu, R., Weinberger, K., and Wilson, A. G. (2018). Product kernel interpolation for scalable Gaussian processes. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 84, pp. 1407–1416.
Gibbs, M. N. and MacKay, D. J. C. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks 11(6), pp. 1458–1464.
Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI).
Hernandez-Lobato, J. M. and Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML), PMLR 37, pp. 1861–1869.
Hildebrandt, T. H. and Schoenberg, I. J. (1933). On linear functional operations and the moment problem for a finite interval in one or several dimensions. Annals of Mathematics, pp. 317–328.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research.
Hug, R., et al. (2022). Bézier curve Gaussian processes. arXiv preprint arXiv:2205.01754.
Hug, R., et al. (2020). Introducing probabilistic Bézier curves for n-step sequence prediction. In Proceedings of the AAAI Conference on Artificial Intelligence 34, pp. 10162–10169.
Kapoor, S., Finzi, M., Wang, K. A., and Wilson, A. G. (2021). SKIing on simplices: kernel interpolation on the permutohedral lattice for scalable Gaussian processes. In International Conference on Machine Learning (ICML), pp. 5279–5289.
Kim, H.-M., Mallick, B. K., and Holmes, C. C. (2005). Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association 100(470), pp. 653–668.
Kingma, D. P. and Ba, J. (2015). Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).
Nguyen-Tuong, D., et al. (2008). Local Gaussian process regression for real time online model learning. Advances in Neural Information Processing Systems 21.
Ober, S. W., Rasmussen, C. E., and van der Wilk, M. (2021). The promises and pitfalls of deep kernel learning. In Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI).
Petrone, S. and Wasserman, L. (2002). Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(1), pp. 79–100.
Petrone, S. (1999a). Bayesian density estimation using Bernstein polynomials. Canadian Journal of Statistics 27(1), pp. 105–126.
Petrone, S. (1999b). Random Bernstein polynomials. Scandinavian Journal of Statistics 26(3), pp. 373–393.
Prautzsch, H., Boehm, W., and Paluszny, M. (2002). Bézier and B-spline techniques. Vol. 6, Springer.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, pp. 1939–1959.
Salimbeni, H. and Deisenroth, M. (2017). Doubly stochastic variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems 30.
Snelson, E. and Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems 18.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS).
Tran, G.-L., Milios, D., Michiardi, P., and Filippone, M. (2021). Sparse within sparse Gaussian processes using neighbor information. In International Conference on Machine Learning (ICML), pp. 10369–10378.
Wang, K. A., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. (2019). Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems 32.
Williams, C. K. I. and Rasmussen, C. E. (2006). Gaussian processes for machine learning. MIT Press, Cambridge, MA.
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016). Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS).
Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning (ICML), pp. 1775–1784.
Wu, L., et al. (2021). Hierarchical inducing point Gaussian process for inter-domain observations. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2926–2934.
Wu, L., et al. (2022). Variational nearest neighbor Gaussian processes. arXiv preprint arXiv:2202.01694.
Appendix A Computations in the Bézier Buttress
This section seeks to explain how the KL divergence is computed using the Bézier buttress. It further explains the more detailed parametrisation of the architecture. For completeness, we here give a forward pass to compute the sum of the posterior control point variances,

(A.1) $\sum_{i_1,\ldots,i_D} \hat{\sigma}^2_{i_1,\ldots,i_D} = \tilde{W}_1 \tilde{W}_2 \cdots \tilde{W}_D\, \mathbf{1},$

where we make the choice $(\tilde{W}_j)_{i_{j-1}, i_j} = v^{(j)}_{i_j} \exp\big(\hat{a}^{(j)}_{i_{j-1}, i_j}\big)$. This ensures positive weights and hence a positive output for the variance of $f$. The factor $v^{(j)}_{i_j}$ comes from the inverse squared Bernstein adjusted prior (see Section 2). The $\hat{a}^{(j)}_{i_{j-1}, i_j}$ are free parameters to be inferred in the variational posterior. This parametrisation makes computing the KL terms easier.
We remark that all the following calculations are for one Bézier buttress only. If there are multiple Bézier buttresses, with different orderings of layers, the computations are equivalent for all of them.
For computing the KL we first recall from the paper

(A.2) $\mathrm{KL}\big(q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta})\big) = \sum_{i_1,\ldots,i_D} \mathrm{KL}\big(q(\beta_{i_1,\ldots,i_D}) \,\|\, p(\beta_{i_1,\ldots,i_D})\big)$

(A.3) $= \sum_{i_1,\ldots,i_D} \frac{1}{2}\left( \frac{\hat{\sigma}^2_{i_1,\ldots,i_D} + \hat{m}^2_{i_1,\ldots,i_D}}{\sigma^2_{i_1,\ldots,i_D}} - 1 + \log \frac{\sigma^2_{i_1,\ldots,i_D}}{\hat{\sigma}^2_{i_1,\ldots,i_D}} \right).$

For easier reference we declare

(A.4) $T_1 = \sum_{i_1,\ldots,i_D} \frac{\hat{\sigma}^2_{i_1,\ldots,i_D}}{\sigma^2_{i_1,\ldots,i_D}},$

(A.5) $T_2 = \sum_{i_1,\ldots,i_D} \log \frac{\sigma^2_{i_1,\ldots,i_D}}{\hat{\sigma}^2_{i_1,\ldots,i_D}},$

(A.6) $T_3 = \sum_{i_1,\ldots,i_D} \frac{\hat{m}^2_{i_1,\ldots,i_D}}{\sigma^2_{i_1,\ldots,i_D}}.$
We remind again that "hat" notation refers to parameters from the variational posterior $q$. We also remind that the prior variance is given by $\sigma^2_{i_1,\ldots,i_D} = \prod_{j=1}^{D} v^{(j)}_{i_j}$. Because we have included $v^{(j)}_{i_j}$ in the parametrisation of the posterior, these factors cancel out in the expressions for $T_1$ and $T_2$. We get

(A.7) $T_1 = \exp\big(\hat{A}_1\big) \exp\big(\hat{A}_2\big) \cdots \exp\big(\hat{A}_D\big)\, \mathbf{1},$

where $\exp$ is element-wise on the matrices $\hat{A}_j$ with entries $\hat{a}^{(j)}_{i_{j-1}, i_j}$.
For $T_2$ we make the observation, based again on the $v^{(j)}_{i_j}$ cancelling out in the fraction, that

(A.8) $T_2 = -\sum_{i_1,\ldots,i_D} \sum_{j=1}^{D} \hat{a}^{(j)}_{i_{j-1}, i_j}.$

That is, summing over all control points is basically counting how many paths (i.e. control points) use each edge. That count is determined as $c_j = N_\beta / (d_{j-1} d_j)$, where $d_j$ denotes the number of nodes in layer $j$, with $d_0 = 1$ for the source. Hence,

(A.9) $T_2 = -\sum_{j=1}^{D} c_j\, \mathrm{sum}\big(\hat{A}_j\big),$

where $\mathrm{sum}(\cdot)$ denotes summing the elements in the matrix.
Notice how the variational parametrisation of $\hat{\sigma}^2$ was carefully chosen for easily computing $T_1$ and $T_2$. $T_3$ is closer to what is described in the main paper: we simply square all the weights and correct with the prior variance, that is, divide by $\prod_{j} v^{(j)}_{i_j}$. Hence,

(A.10) $T_3 = W_1^{\circ 2} V_1^{-1}\, W_2^{\circ 2} V_2^{-1} \cdots W_D^{\circ 2} V_D^{-1}\, \mathbf{1},$

where $W_j^{\circ 2}$ denotes the element-wise square and $V_j$ here is the diagonal matrix with $v^{(j)}_0, \ldots, v^{(j)}_{n_j}$ along its diagonal, for $j = 1, \ldots, D$. Notice further that the $W_j$ here are the weights in the mean Bézier buttress.
Now

(A.11) $\mathrm{KL}\big(q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta})\big) = \frac{1}{2}\big(T_1 + T_2 + T_3 - N_\beta\big),$

all of which is computed in a single forward pass in the Bézier buttress. $N_\beta$ is the number of all control points (in one buttress).
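The path-counting step in Eqs. (A.8)–(A.9) can be sanity-checked on a small buttress by comparing the collapsed weighted element sums against the explicit sum over every source-sink path; the helper names are ours:

```python
import numpy as np
from itertools import product

def t2_brute_force(As):
    """Sum over all paths of the per-edge terms a^{(j)}_{i_{j-1} i_j}.
    As[0] has shape (1, d_1); As[j] has shape (d_j, d_{j+1})."""
    sizes = [A.shape[1] for A in As]
    total = 0.0
    for idx in product(*[range(s) for s in sizes]):
        prev = 0
        for j, i in enumerate(idx):
            total += As[j][prev, i]
            prev = i
    return total

def t2_path_counting(As):
    """Each edge of layer j lies on N_beta / (d_{j-1} d_j) paths (Eq. A.8),
    so the double sum collapses to weighted element sums (Eq. A.9)."""
    n_beta = np.prod([A.shape[1] for A in As])
    return sum((n_beta / (A.shape[0] * A.shape[1])) * A.sum() for A in As)
```

The brute-force version visits $N_\beta$ paths, whereas the collapsed version touches each weight once, mirroring how the appendix avoids the exponential sum.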
Appendix B Numerical results
For reproducibility we give the values used to generate Figure 5. These are given in Table 3.

Table 3. Test log-likelihood and test RMSE for BézierGP on year, buzz, and houseelectric (order 20, denoted B20) and on slice (orders 3 and 5, denoted B3 and B5).