Gaussian processes (GPs) have become an often-used tool for regression due to their flexibility and good generalization properties. In addition to being successful at approximating unknown functions, GPs also come equipped with a measure of model uncertainty in the form of the posterior variance Rasmussen and Williams (2006). This quantity has shown promising results for estimating the error between the posterior mean and the underlying function Srinivas et al. (2012); Chowdhury and Gopalan (2017a); Lederer, Umlauft, and Hirche (2019); Maddalena, Scharnhorst, and Jones (2020); Sun et al. (2021). Due to this characteristic, GPs have become particularly interesting for safety-critical settings, i.e., whenever safety or performance constraints need to be considered during decision-making. Some examples include controller tuning Capone and Hirche (2019); Lederer, Capone, and Hirche (2020), estimating safe operating regions Berkenkamp et al. (2017), as well as recommendation engines Sui et al. (2015).
Uniform error bounds for Gaussian process regression are typically obtained by scaling the posterior standard deviation, where the scaling factor depends on the number of data, the input domain, as well as regularity assumptions on the unknown functionSrinivas et al. (2012); Chowdhury and Gopalan (2017b); Maddalena, Scharnhorst, and Jones (2020); Lederer, Umlauft, and Hirche (2019)
. The resulting error bound is generally of a probabilistic nature, i.e., it holds with high probability, with few exceptionsMaddalena, Scharnhorst, and Jones (2020); Wu and Schaback (1993). In Lederer, Umlauft, and Hirche (2019), an error bound is derived under the assumption that the true function is drawn from a GP. The bound is achieved by estimating the error on a finite grid over the input space, and then extending the bound to the whole input space using Lipschitz properties of the GP. In Srinivas et al. (2012); Chowdhury and Gopalan (2017b); Maddalena, Scharnhorst, and Jones (2020)
, error bounds are derived under the assumption that the underlying function belongs to a reproducing kernel Hilbert space. Additionally, error bounds from methods related to GPs can be directly exploited under weak restrictions. Since radial basis function interpolation yields regressors that are identical to the GP posterior mean for noise-free training dataKanagawa et al. (2018), deterministic error bounds can be proven using the theory of reproducing kernel Hilbert spaces Wu and Schaback (1993); Wendland (2004). Moreover, error bounds from regularized kernel regression, as derived in Mendelson (2002); Shi (2013); Dicker, Foster, and Hsu (2017), can be straightforwardly extended to GPs due to the equivalence of regularized kernel and GP regression under weak assumptions Rasmussen and Williams (2006).
While using the posterior variance to estimate the model error can prove efficient in practice, this type of approach hinges on a critical assumption that seldom holds in practice, namely that the GP hyperparameters have been specified accurately. This is because the error bounds are derived by using smoothness assumptions about the GP kernel to bound the model error. Hence, if the choice of hyperparameters is poor, then the posterior variance is typically not a good measure of the model error. This potentially leads to overconfident error bounds, rendering the model inapplicable in safety-critical settings, as illustrated in Figure 1.
Related work. This paper addresses GP uniform error bounds under misspecified kernel hyperparameters. To mitigate misspecified error bounds and guarantee no regret in a Bayesian optimization setting, Berkenkamp, Schoellig, and Krause (2019) developed an approach that gradually decreases the lengthscales of a GP, yielding convergence of the optimization algorithm towards the maximum of an unknown function. However, this is not useful for safety-critical settings, where the posterior variance, which varies strongly with the hyperparameters, is used to estimate the model error and determine the risk associated with decisions. For Matérn kernels, Tuo and Wang (2020); Wang, Tuo, and Jeff Wu (2020) have developed robust error bounds based on the fill distance under a misspecified smoothness parameter. Furthermore, Tuo and Wang (2020) provides a more general bound based on upper and lower bounds for the decay rate of the true function. Fiedler, Scherer, and Trimpe (2021) have derived error bounds for settings where the norm of the difference between the kernel used for regression and that of the reproducing kernel Hilbert space (RKHS) of the true function can be bounded. Similarly, Beckers, Umlauft, and Hirche (2018) provides an error bound when choosing a covariance function from a predefined set, under the assumption that the unknown function corresponds to a GP with a kernel from the same set. A drawback of these approaches is that they do not provide a principled or intuitive approach for obtaining hyperparameter bounds or candidate kernels.
Our contribution. In this work, we mitigate the risk of making a poor choice of hyperparameters by equipping a Gaussian process with an error bound that accounts for the lack of prior knowledge with regard to the hyperparameters. We additionally present a principled way of choosing robust uniform error bounds in a Bayesian setting without requiring any prior upper and lower bounds for the hyperparameters. The corresponding theoretical guarantees hold for a class of frequently used kernels, extending the applicability of Gaussian processes in safety-critical settings.
2 Gaussian Processes
In this section, we briefly review GPs and then discuss how error bounds and hyperparameters are chosen in practice.
We use GPs for regression, where we aim to infer an unknown function with . To this end, we treat function values2006). A Gaussian process is fully specified by a mean function and a positive-definite kernel . In this paper, we set
without loss of generality, and restrict ourselves to stationary kernels with radially non-increasing Fourier transform. This is specified by the following assumption.
The kernel is of the form
with . Furthermore, has Fourier transform for some non-increasing non-negative function .
Here denotes the kernel lengthscales, which scale the kernel inputs and specify the variability of the underlying function. A similar assumption can be found in other papers that analyze algorithms based on stationary kernels Bull (2011); Berkenkamp, Schoellig, and Krause (2019). When using this type of kernel, the similarity between any two function evaluations decreases with the weighted distance between the corresponding inputs. Many frequently encountered kernels satisfy these properties. Examples include the Gaussian and Matérn kernels, which are often employed because they satisfy the universal approximation property Micchelli, Xu, and Zhang (2006). Note that we assume only for the sake of simplicity, and all the results and tools presented in this paper can be straightforwardly extended to the more general case where the scaling factor of , i.e., the signal variance, is also a hyperparameter. This is discussed in Section 3.1.
Given a set of (potentially noisy) measurements , where , and , we are able to condition a Gaussian process on to obtain the posterior distribution of at an arbitrary test point . The corresponding distribution is normal with mean and variance
where and the entries of the covariance matrix are given by . In the case of stationary kernels, the posterior variance is typically low for test inputs that are close to the measurement data and vice versa.
2.1 Choosing Hyperparameters
By far the most common technique for choosing the hyperparameters of a Gaussian process is maximizing the log marginal likelihood of the measurements given the hyperparameters
By maximizing (1), we obtain a trade-off between model complexity and data fit. Coupled with the fact that the gradient of (1) can be computed analytically, choosing hyperparameters in this manner yields many practical benefits. Furthermore, if the marginal likelihood is well peaked, then the true hyperparameters are likely to be situated near the selected ones. However, if this is not the case, then this approach can lead to overconfident hyperparameters, as the marginal likelihood can decrease slowly away from the maximum, implying that the lengthscales are potentially smaller than the ones obtained. This type of behavior is particularly frequent if little data has been observed, as both short and long lengthscales explain the data consistently Rasmussen and Williams (2006). Less common approaches for choosing the Gaussian process hyperparameters include the cross validation Cressie (2015) and log-pseudo likelihood maximization Sundararajan and Keerthi (2001). In some settings, the hyperparameters are chosen based on prior knowledge about the system Kirschner et al. (2019).
3 Uniform Error Bounds for Unknown Hyperparameters
We now introduce a modified version of standard GP error bounds that aims to overcome the limitations mentioned in the previous chapter. The proofs of all results stated here can be found in the appendix.
In the Gaussian process regression literature, there are typically two different types of assumptions that can be made for analysis. On the one hand, the Bayesian case can be considered, where we assume the unknown function to be sampled from a Gaussian process Sun et al. (2021). On the other hand, the frequentist setting can be considered, where is assumed to have a bounded RKHS norm Srinivas et al. (2012); Chowdhury and Gopalan (2017b) with respect to the kernel . While some of the techniques presented in this paper can be useful both in a Bayesian and frequentist setting, the Bayesian paradigm provides a straightforward way of narrowing the set of candidate hyperparameters used to derive an error bound, whereas no such path is evident in the frequentist setting. Hence, we henceforth assume a Bayesian setting, as described in the following assumption.
Assumption 2 (Bayesian setting).
The unknown function corresponds to a sample from a Gaussian process with kernel , where the hyperparameters are drawn from a hyperprior
are drawn from a hyperprior.
This assumption is not very restrictive, and is often encountered in control and reinforcement learning settingsDeisenroth, Fox, and Rasmussen (2015); Kocijan et al. (2005); Hewing, Kabzan, and Zeilinger (2019). The choice of prior
can be based on any prior beliefs about the underlying function, e.g., Lipschitz continuity, which can be encoded into the prior using a chi-squared or uniform distribution.
We also assume to have a scaling function that specifies a uniform error bound for an arbitrary fixed vector of hyperparameters with high probability.
Assumption 3 (Scaling function).
There exists a known positive function , such that for any vector of lengthscales
holds, where denotes a sample from a Gaussian process with prior mean zero and kernel , conditioned on measurement data .
Error bounds for fixed hyperparameters in the Bayesian case can be derived by assuming that the input space is compact and that the kernel satisfies some regularity requirements Srinivas et al. (2012). Improved error bounds can be obtained if we are only interested in samples that satisfy a predefined Lipschitz continuity requirement Lederer, Umlauft, and Hirche (2019); Sun et al. (2021). Additionally, if we assume the input space to be finite, then is a constant, i.e., does not depend on Srinivas et al. (2012).
Our approach is then based on the following result.
Let be a kernel, vectors of lengthscales with , and let denote a measurement data set of an unknown function . Furthermore, choose with . Then
holds for all .
In essence, Lemma 1 states that the scaled posterior variance decreases with the lengthscales. In some cases, the scaling factor specified in Lemma 1 can be somewhat crude. However, in practice the unscaled posterior variance often decreases with the lengthscale and can be chosen. This is illustrated in Section 5. As a direct consequence, if a pair of bounding hyperparameters is available, we can use Lemma 1 to construct an error bound that contains all error bounds corresponding to lengthscales within then interval . This is specified in the following result.
Let Assumptions 1 and 3 hold. Let be vectors of lengthscales with , let be a measurement data set, and let be a sample from a Gaussian process with mean zero and kernel , conditioned on . Furthermore, let denote the posterior mean for arbitrary lengthscales with . Define
holds for all with probability at least .
creftypecap 1 implies that we can obtain a uniform error bound using the scaled posterior variance corresponding to the smallest vector of lengthscales . The term , which we add to in creftypecap 1, originates from the discrepancy between the posterior mean of the working lengthscales and those of the bounding lengthscales . Note that the probabilistic nature of creftypecap 1 stems from Assumption 3 also being probabilistic, and no additional uncertainty is introduced when deriving creftypecap 1. In other words, the confidence region generated by the robust uniform error bound fully contains all confidence regions corresponding to with . This is illustrated in Figure 2.
In order to be able to apply creftypecap 1 in the more general setting where the bounding vectors are not given a priori, the next step is to determine permissible . Since we are in a Bayesian scenario, an intuitive approach is to choose the bounding lengthscales such that they form a confidence interval for some . Note that deriving a similar technique in the frequentist setting, e.g., as in Chowdhury and Gopalan (2017b) or Srinivas et al. (2012), is not straightforward since the corresponding tools are not directly compatible with a distribution over the hyperparameters.
Formally, our approach consists of obtaining a pair of hyperparameters that lies within the set
where the posterior over the lengthscales given the data is computed using Bayes’ rule as
and the posterior is computed similarly to (1). The normalizing factor is then given by
By choosing the bounding lengthscales in this fashion, a vector of lengthscales sampled from the posterior distribution lies within the interval with high probability. We are then able to estimate the error in the setting with unknown hyperparameters by applying creftypecap 1, obtaining the following result.
3.1 Extension to Noise and Signal Variance
So far, we discussed the setting where the signal and noise variance are constant and known. However, the proposed tools extend straightforwardly to the setting where they are also unknown, provided that a corresponding prior distribution is also available. This is because the posterior variance increases with the noise and signal variance, hence a result of the form of creftypecap 2 can be easily obtained.
3.2 Choosing Bounding Hyperparameters
Typically, the set of bounding pairs contains more than one element. Hence, creftypecap 2 allows us some flexibility when deriving the uniform error bound, since we can choose any pair that lies within
. In the following, we derive a heuristic that aims to approximate the smallest possibleconfidence region around the working hyperparameters , which would automatically meet the requirements to apply creftypecap 2. Formally, this is achieved by solving the optimization problem
In other words, we aim to choose the pair of bounding hyperparameters that deviate the least from .
) generally cannot be computed analytically, we can resort to different approximations. In low-dimensional settings, we can solve the integral expression using numerical integration. A further option is to employ Markov chain Monte Carlo (MCMC) methodsRasmussen and Williams (2006). Alternatively, we can apply approximate inference methods, such as Laplace’s method or expectation propagation, which aim to approximate the posterior with a normal distribution. The latter methods are particularly well suited for settings with a high number of data points, as they do not require a high number of evaluations of the posterior distribution. Note that this is an advantage of the proposed technique compared to fully Bayesian GPs, which typically require some form of MCMC approach. Moreover, after computing the bounding hyperparameters , the computational complexity of evaluating the error bound is only twice that of a standard GP, whereas in a fully Bayesian GP a potentially high number of GPs has to be evaluated per prediction. A further advantage compared to fully Bayesian GPs is that, in practice, it is reasonable expect better error estimates, particularly if we choose a low risk parameter and a sufficiently smooth prior . This is because the marginal likelihood is typically well behaved in the hyperparameter space, and the fully Bayesian posterior mean and variance will lie within the uniform error bound computed with creftypecap 2. This is supported by the experimental results in Section 5. We note, however, that the proposed approach can also be combined with a fully Bayesian GP, i.e., by using the fully Bayesian posterior mean only for regression, and the presented techniques to bound the regression error.
It is also worth noting that there is no direct analogy between the approach presented in this paper and robust uniform error bounds in the frequentist setting, i.e., where the unknown function is assumed to be fixed with a bounded RKHS norm. We believe this to be a strong argument in favor of employing a Bayesian perspective instead of a frequentist one when performing regression.
4 Control with Performance Guarantees
creftypecap 2 can be employed straightforwardly to derive safety guarantees in different settings. In the following, we show how to apply it in a learning-based control setting with a commonly encountered system structure.
4.1 Control Problem
We consider an -dimensional dynamical system of the form
where and denote the system’s states and control input, respectively. The functions are unknown and modeled using GPs, whereas we assume to know . These assumptions are common in this setting Kwan and Lewis (2000); Capone and Hirche (2019).
Our goal is to design a control law that steers the subsystem towards a desired time-dependent trajectory . In other words, we aim to reduce the norm of the time-dependent error . In the following, we assume that is times continuously differentiable, and that all its derivatives are bounded, which is not a restrictive assumption. For the sake of simplicity, we henceforth use the notation and
In order to provide performance guarantees for (7), we require some additional smoothness assumptions with respect to the kernels used to model the functions . This is expressed formally in the following.
Each function is drawn from a Gaussian process with times differentiable kernel .
4.2 Backstepping Control
In order to track the desired trajectory , we employ a backstepping technique similar to the one proposed in Capone and Hirche (2019). The idea behind backstepping is to recursively design fictitious control inputs for each subsystem , which we then aim to track by means of the true control input . For more details, the reader is referred to Krstic, Kokotovic, and Kanellakopoulos (1995).
To model the unknown functions , we assume to have noisy measurements . The control input can then be computed recursively using
where denotes the posterior mean of the -th Gaussian process, the tracking error of the -th subsystem, and are state-dependent control gains.
The stability analysis of the closed-loop system is carried out using Lyapunov theory. By using a quadratic Lyapunov function, it is straightforward to show that there exists an ultimate upper bound for the tracking error , i.e., holds for large enough , where depends on the control gains (Capone and Hirche, 2019, Lemma 2). By using this result and creftypecap 2, we are able to determine adaptive control gains that yield a tracking error smaller or equal than a predetermined value , which is crucial in safety-critical applications. This is formally stated in the following result.
Let Assumptions 4 and 1 hold, and let the control input be given by (8). Furthermore, let be bounding hyperparameters for the subsystems , obtained with (6) and risk parameter , and let be the corresponding scaling functions. Choose , as in creftypecap 1, and choose each state-dependent control gain as
where are chosen as in Lemma 1. Then, with probability , there exists a , such that
holds for all .
Hence, we can use (9) to obtain the control gains required for the control performance specification .
We now present experimental results where the performance of the proposed error bound is compared to that of vanilla and fully Bayesian GPs. We first showcase the prediction error on regression benchmarks, then apply the proposed technique to design a learning-based control law. A Gaussian kernel is used in all cases except the Mauna Loa experiments, and we also consider uncertainty in the signal and noise variances, as discussed in Section 3. In all except the Sarcos experiments, the fully Bayesian GPs and integral expression in (6) are approximated using the No-U Turn Sampler algorithm for MCMC Hoffman, Gelman et al. (2014). The implementations are carried out using GPyTorch Gardner et al. (2018).
The values for proposed in theory are typically very conservative Srinivas et al. (2012), which makes their use impractical. Similarly, the scaling factor proposed by creftypecap 2 assumes very high values. In many safety-critical applications, it is common to choose the scaling factor as , independently of the hyperparameters Berkenkamp et al. (2017); Umlauft et al. (2017). Moreover, empirical results indicate that Lemma 1 holds with in the case of a Gaussian kernel, i.e., the error bound will be at least as robust as that of the vanilla GP for . Hence, to obtain a fair and not overly conservative comparison, we set in the following experiments. The working hyperparameters are chosen by maximizing the log marginal likelihood (1). We employ a confidence parameter of for (6), and compute the integral expressions using Monte Carlo integration. We assume uniform distributions as hyperpriors . Note that even though this imposes hard bounds on the hyperparameters, these are still very large.
5.1 Regression - Toy Problem and Benchmarks
In the regression experiments, to additionally illustrate the behavior of the proposed technique as more data becomes available, we train the GPs with data sets of varying sizes .
We first investigate how the proposed technique performs when estimating the regression error of a function that is sampled from a GP. The results can be seen in Figure 4. Our approach always captures the behavior of the underlying sample . Both the vanilla GP error bound and the fully Bayesian error bound fail to do so if little data is available (Figure 3(a)) or if the data is sparse (Figure 3(b)). The estimated error is only accurate for the vanilla and fully Bayesian GPs if the data covers the state space sufficiently well (Figure 3(c)).
We now apply the proposed technique to estimate the regression error of the Boston house prices data set Pedregosa et al. (2011), the UCI wine quality data set Cortez et al. (2009), the Mauna Loa CO time series, and the Sarcos data set. For the Mauna Loa data set, we employ a spectral mixture kernel with mixtures Wilson and Adams (2013), which consists of a sum of Gaussian kernels multiplied with sinusoidal kernels. It is straightforward to show that its Fourier transform increases with the lengthscales, hence we are able to apply creftypecap 2. The mixture means, which specify the frequency of the periodic components, are assumed to be fixed except for the fully Bayesian case. For the two largest Sarcos data sets, we perform a Laplace approximation around the maximum of the posterior, which yields a normal distribution MacKay (2002). To ensure that the corresponding covariance matrix is positive definite, we employ an empirical Bayes approach and specify a quadratic hyperprior around the log likelihood maximum. The bounding hyperparameters are then obtained by computing the corresponding rectangular confidence region as in Šidák (1967).
We run each scenario multiple times and select the training and test points randomly every time. The Sarcos experiments are repeated times due to high computational requirements, all other experiments are repeated times. To evaluate performance, we check how often the error bound is incorrect, which corresponds to the measuring the quantity
where the superscript test denotes test inputs/outputs, and
is the indicator function. Note that this essentially corresponds to the calibration error metric for regression suggested by Kuleshov, Fenner, and Ermon (2018).
5.2 Control Design with Little Data
We now apply the proposed technique to design a control law for a safety-critical setting, where a one-link planar manipulator with motor dynamics is to be steered towards the origin using the method presented in Section 4 and training data points. Note that is not very small for control purposes in the proposed setting, since good performance can already be achieved with as little as data points Capone and Hirche (2019). The data is obtained using a low-gain sinusoidal input.
The manipulator dynamics are given by
where and are the system’s angle and torque, respectively, is the motor voltage, which we can control directly, and , , , , , are system parameters. By approximating the differential equation of each state , , using a separate Gaussian process, we are able to employ a backstepping technique to track a desired trajectory, as described in Section 4. We then use creftypecap 3 to choose the control gains , , by setting the desired error to .
To evaluate the control performance, we run simulations for each setting, where the initial conditions of the manipulator are randomly sampled from a normal distribution. The norm of the resulting tracking error is displayed in Figure 5. As can be seen, the gains obtained using the proposed approach perform considerably better than the ones obtained with vanilla and fully Bayesian GPs. This is because the posterior variance of the Gaussian process decreases too slowly away from the collected data points, expressing too much confidence in the posterior mean.
We have presented robust uniform error bounds for Gaussian processes with unknown hyperparameters. Our approach is applicable for stationary radially non-decreasing kernels, which are commonly employed in practice. It hinges on computing a confidence region for the hyperparameters, which does not require pre-specified bounds for the hyperparameters. The presented theoretical results make them flexible and easily applicable to safety-critical scenarios, where theoretical guarantees are often required. In numerical regression benchmarks, the proposed error bound was shown to outperform the error bound obtained with standard and fully Bayesian Gaussian processes. Furthermore, the presented tool resulted in better control performance in a control problem, indicating better suitability for safety-critical settings.
Appendix A Proofs
We begin by listing some well-known results required to prove the main statements from the paper.
Lemma 2 (Wendland, 2004, Theorem 6.11).
Suppose that is a continuous function. Then is positive definite if and only if its Fourier transform is nonnegative and nonvanishing.
Lemma 3 (Rao et al., 1973, p.70).
Let be two symmetric positive-definite matrices, such that is symmetric positive-definite. Then is symmetric positive-definite.
The following result is a direct consequence of the block matrix inversion formula, which can be found, e.g., in Horn and Johnson, 2012, p.18.
Lemma 4 (Horn and Johnson, 2012, p.18).
be an non-singular partitioned matrix. Then the lower-right entry of its inverse is given by
be symmetric positive definite partitioned matrices, such that is also positive definite. Then
Using Lemma 2, we then obtain the following.
Let , be two kernels with lengthscales , let denote a measurement data set of an unknown function , and let and denote the corresponding covariance matrices. Then the matrix
is positive semi-definite.
For an arbitrary , define . By Assumption 1, the Fourier transform of is given by
Since is non-increasing, for any pair of lengthschales with ,
holds, where . Hence, (11) implies
for all . Since , it then follows from Lemma 2 that
is a positive-definite function, which implies the desired result. ∎
In order to prove creftypecap 2, we aim to bound the difference between posterior means . To this end, we employ the two following results.
Let denote the RKHS norm with respect to a kernel , and consider a function with bounded RKHS norm . Then, for all ,
holds, where is the noise variance and are the measurement inputs.
See Srinivas et al., 2012, Appendix B, eq. (11). ∎
Lemma 8 (Bull, 2011).
Let be a function with bounded reproducing kernel Hilbert space norm with respect to the kernel . Then, for all ,
We are now able to bound the difference between the means of two Gaussian processes conditioned on the same data, as described in the next statement.
Let Assumption 1 hold, let be vectors of lengthscales with , and let denote GP posterior means conditioned on some measurement data set . Then
holds for all , where is the noise variance, and is the measurement data.
Define . From Lemma 7, we have
Now, the RKHS norm with respect to of a posterior mean function can be upper-bounded as
where the first inequality is due to the eigenvalues ofbeing smaller or equal to one, whereas the last inequality holds because the eigenvalues of are greater or equal to . Furthermore, the summation in (12) can be bounded as