DeepAI

# Gaussian Process Uniform Error Bounds with Unknown Hyperparameters for Safety-Critical Applications

Gaussian processes have become a promising tool for various safety-critical settings, since the posterior variance can be used to directly estimate the model error and quantify risk. However, state-of-the-art techniques for safety-critical settings hinge on the assumption that the kernel hyperparameters are known, which does not apply in general. To mitigate this, we introduce robust Gaussian process uniform error bounds in settings with unknown hyperparameters. Our approach computes a confidence region in the space of hyperparameters, which enables us to obtain a probabilistic upper bound for the model error of a Gaussian process with arbitrary hyperparameters. We do not require to know any bounds for the hyperparameters a priori, which is an assumption commonly found in related work. Instead, we are able to derive bounds from data in an intuitive fashion. We additionally employ the proposed technique to derive performance guarantees for a class of learning-based control problems. Experiments show that the bound performs significantly better than vanilla and fully Bayesian Gaussian processes.

• 8 publications
• 19 publications
• 43 publications
06/04/2019

### Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control

Data-driven models are subject to model errors due to limited and noisy ...
02/23/2022

### Networked Online Learning for Control of Safety-Critical Resource-Constrained Systems based on Gaussian Processes

Safety-critical technical systems operating in unknown environments requ...
08/22/2022

### Scale invariant process regression

Gaussian processes are the leading method for non-parametric regression ...
05/06/2021

### Practical and Rigorous Uncertainty Bounds for Gaussian Process Regression

Gaussian Process Regression is a popular nonparametric regression method...
01/13/2021

### Uniform Error and Posterior Variance Bounds for Gaussian Process Regression with Application to Safe Control

In application areas where data generation is expensive, Gaussian proces...
10/29/2018

### Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds

Gaussian Processes (GPs) are a generic modelling tool for supervised lea...
10/20/2022

### Scalable Bayesian Transformed Gaussian Processes

The Bayesian transformed Gaussian process (BTG) model, proposed by Kedem...

## 1 Introduction

Gaussian processes (GPs) have become an often-used tool for regression due to their flexibility and good generalization properties. In addition to being successful at approximating unknown functions, GPs also come equipped with a measure of model uncertainty in the form of the posterior variance Rasmussen and Williams (2006). This quantity has shown promising results for estimating the error between the posterior mean and the underlying function Srinivas et al. (2012); Chowdhury and Gopalan (2017a); Lederer, Umlauft, and Hirche (2019); Maddalena, Scharnhorst, and Jones (2020); Sun et al. (2021). Due to this characteristic, GPs have become particularly interesting for safety-critical settings, i.e., whenever safety or performance constraints need to be considered during decision-making. Some examples include controller tuning Capone and Hirche (2019); Lederer, Capone, and Hirche (2020), estimating safe operating regions Berkenkamp et al. (2017), as well as recommendation engines Sui et al. (2015).

Uniform error bounds for Gaussian process regression are typically obtained by scaling the posterior standard deviation, where the scaling factor depends on the number of data, the input domain, as well as regularity assumptions on the unknown function

Srinivas et al. (2012); Chowdhury and Gopalan (2017b); Maddalena, Scharnhorst, and Jones (2020); Lederer, Umlauft, and Hirche (2019)

. The resulting error bound is generally of a probabilistic nature, i.e., it holds with high probability, with few exceptions

Maddalena, Scharnhorst, and Jones (2020); Wu and Schaback (1993). In Lederer, Umlauft, and Hirche (2019), an error bound is derived under the assumption that the true function is drawn from a GP. The bound is achieved by estimating the error on a finite grid over the input space, and then extending the bound to the whole input space using Lipschitz properties of the GP. In Srinivas et al. (2012); Chowdhury and Gopalan (2017b); Maddalena, Scharnhorst, and Jones (2020)

, error bounds are derived under the assumption that the underlying function belongs to a reproducing kernel Hilbert space. Additionally, error bounds from methods related to GPs can be directly exploited under weak restrictions. Since radial basis function interpolation yields regressors that are identical to the GP posterior mean for noise-free training data

Kanagawa et al. (2018), deterministic error bounds can be proven using the theory of reproducing kernel Hilbert spaces Wu and Schaback (1993); Wendland (2004). Moreover, error bounds from regularized kernel regression, as derived in Mendelson (2002); Shi (2013); Dicker, Foster, and Hsu (2017), can be straightforwardly extended to GPs due to the equivalence of regularized kernel and GP regression under weak assumptions Rasmussen and Williams (2006).

While using the posterior variance to estimate the model error can prove efficient in practice, this type of approach hinges on a critical assumption that seldom holds in practice, namely that the GP hyperparameters have been specified accurately. This is because the error bounds are derived by using smoothness assumptions about the GP kernel to bound the model error. Hence, if the choice of hyperparameters is poor, then the posterior variance is typically not a good measure of the model error. This potentially leads to overconfident error bounds, rendering the model inapplicable in safety-critical settings, as illustrated in Figure 1.

Related work. This paper addresses GP uniform error bounds under misspecified kernel hyperparameters. To mitigate misspecified error bounds and guarantee no regret in a Bayesian optimization setting, Berkenkamp, Schoellig, and Krause (2019) developed an approach that gradually decreases the lengthscales of a GP, yielding convergence of the optimization algorithm towards the maximum of an unknown function. However, this is not useful for safety-critical settings, where the posterior variance, which varies strongly with the hyperparameters, is used to estimate the model error and determine the risk associated with decisions. For Matérn kernels, Tuo and Wang (2020); Wang, Tuo, and Jeff Wu (2020) have developed robust error bounds based on the fill distance under a misspecified smoothness parameter. Furthermore, Tuo and Wang (2020) provides a more general bound based on upper and lower bounds for the decay rate of the true function. Fiedler, Scherer, and Trimpe (2021) have derived error bounds for settings where the norm of the difference between the kernel used for regression and that of the reproducing kernel Hilbert space (RKHS) of the true function can be bounded. Similarly, Beckers, Umlauft, and Hirche (2018) provides an error bound when choosing a covariance function from a predefined set, under the assumption that the unknown function corresponds to a GP with a kernel from the same set. A drawback of these approaches is that they do not provide a principled or intuitive approach for obtaining hyperparameter bounds or candidate kernels.

Our contribution. In this work, we mitigate the risk of making a poor choice of hyperparameters by equipping a Gaussian process with an error bound that accounts for the lack of prior knowledge with regard to the hyperparameters. We additionally present a principled way of choosing robust uniform error bounds in a Bayesian setting without requiring any prior upper and lower bounds for the hyperparameters. The corresponding theoretical guarantees hold for a class of frequently used kernels, extending the applicability of Gaussian processes in safety-critical settings.

## 2 Gaussian Processes

In this section, we briefly review GPs and then discuss how error bounds and hyperparameters are chosen in practice.

We use GPs for regression, where we aim to infer an unknown function with . To this end, we treat function values

as random variables, of which any subset is jointly normally distributed

Rasmussen and Williams (2006). A Gaussian process is fully specified by a mean function and a positive-definite kernel . In this paper, we set

without loss of generality, and restrict ourselves to stationary kernels with radially non-increasing Fourier transform. This is specified by the following assumption.

###### Assumption 1.

The kernel is of the form

 kϑ(x,x′)\coloneqqk(x1−x′1ϑ1,…,xd−x′dϑd),

with . Furthermore, has Fourier transform for some non-increasing non-negative function .

Here denotes the kernel lengthscales, which scale the kernel inputs and specify the variability of the underlying function. A similar assumption can be found in other papers that analyze algorithms based on stationary kernels Bull (2011); Berkenkamp, Schoellig, and Krause (2019). When using this type of kernel, the similarity between any two function evaluations decreases with the weighted distance between the corresponding inputs. Many frequently encountered kernels satisfy these properties. Examples include the Gaussian and Matérn kernels, which are often employed because they satisfy the universal approximation property Micchelli, Xu, and Zhang (2006). Note that we assume only for the sake of simplicity, and all the results and tools presented in this paper can be straightforwardly extended to the more general case where the scaling factor of , i.e., the signal variance, is also a hyperparameter. This is discussed in Section 3.1.

Given a set of (potentially noisy) measurements , where , and , we are able to condition a Gaussian process on to obtain the posterior distribution of at an arbitrary test point . The corresponding distribution is normal with mean and variance

 μϑ(x∗)\coloneqq−kϑ(x∗)T(Kϑ+σ2I)−1yσ2ϑ(x∗)\coloneqqkϑ(x∗,x∗)−kϑ(x∗)T(Kϑ+σ2I)−1kϑ(x∗),

where and the entries of the covariance matrix are given by . In the case of stationary kernels, the posterior variance is typically low for test inputs that are close to the measurement data and vice versa.

### 2.1 Choosing Hyperparameters

By far the most common technique for choosing the hyperparameters of a Gaussian process is maximizing the log marginal likelihood of the measurements given the hyperparameters

 logp(y|X,ϑ)=−12log|Kϑ+σ2I|−N2log(2π)−12y⊺(Kϑ+σ2I)−1y. (1)

By maximizing (1), we obtain a trade-off between model complexity and data fit. Coupled with the fact that the gradient of (1) can be computed analytically, choosing hyperparameters in this manner yields many practical benefits. Furthermore, if the marginal likelihood is well peaked, then the true hyperparameters are likely to be situated near the selected ones. However, if this is not the case, then this approach can lead to overconfident hyperparameters, as the marginal likelihood can decrease slowly away from the maximum, implying that the lengthscales are potentially smaller than the ones obtained. This type of behavior is particularly frequent if little data has been observed, as both short and long lengthscales explain the data consistently Rasmussen and Williams (2006). Less common approaches for choosing the Gaussian process hyperparameters include the cross validation Cressie (2015) and log-pseudo likelihood maximization Sundararajan and Keerthi (2001). In some settings, the hyperparameters are chosen based on prior knowledge about the system Kirschner et al. (2019).

## 3 Uniform Error Bounds for Unknown Hyperparameters

We now introduce a modified version of standard GP error bounds that aims to overcome the limitations mentioned in the previous chapter. The proofs of all results stated here can be found in the appendix.

In the Gaussian process regression literature, there are typically two different types of assumptions that can be made for analysis. On the one hand, the Bayesian case can be considered, where we assume the unknown function to be sampled from a Gaussian process Sun et al. (2021). On the other hand, the frequentist setting can be considered, where is assumed to have a bounded RKHS norm Srinivas et al. (2012); Chowdhury and Gopalan (2017b) with respect to the kernel . While some of the techniques presented in this paper can be useful both in a Bayesian and frequentist setting, the Bayesian paradigm provides a straightforward way of narrowing the set of candidate hyperparameters used to derive an error bound, whereas no such path is evident in the frequentist setting. Hence, we henceforth assume a Bayesian setting, as described in the following assumption.

###### Assumption 2 (Bayesian setting).

The unknown function corresponds to a sample from a Gaussian process with kernel , where the hyperparameters

are drawn from a hyperprior

.

This assumption is not very restrictive, and is often encountered in control and reinforcement learning settings

Deisenroth, Fox, and Rasmussen (2015); Kocijan et al. (2005); Hewing, Kabzan, and Zeilinger (2019). The choice of prior

can be based on any prior beliefs about the underlying function, e.g., Lipschitz continuity, which can be encoded into the prior using a chi-squared or uniform distribution.

We also assume to have a scaling function that specifies a uniform error bound for an arbitrary fixed vector of hyperparameters with high probability.

###### Assumption 3 (Scaling function).

There exists a known positive function , such that for any vector of lengthscales

 P(|f(x)−μϑ(x)|≤β12(ϑ)σϑ(x)  ∀x∈X)≥1−ρ,

holds, where denotes a sample from a Gaussian process with prior mean zero and kernel , conditioned on measurement data .

Error bounds for fixed hyperparameters in the Bayesian case can be derived by assuming that the input space is compact and that the kernel satisfies some regularity requirements Srinivas et al. (2012). Improved error bounds can be obtained if we are only interested in samples that satisfy a predefined Lipschitz continuity requirement Lederer, Umlauft, and Hirche (2019); Sun et al. (2021). Additionally, if we assume the input space to be finite, then is a constant, i.e., does not depend on Srinivas et al. (2012).

Our approach is then based on the following result.

###### Lemma 1.

Let be a kernel, vectors of lengthscales with , and let denote a measurement data set of an unknown function . Furthermore, choose with . Then

 σϑ(x)≤γσϑ′(x)

holds for all .

In essence, Lemma 1 states that the scaled posterior variance decreases with the lengthscales. In some cases, the scaling factor specified in Lemma 1 can be somewhat crude. However, in practice the unscaled posterior variance often decreases with the lengthscale and can be chosen. This is illustrated in Section 5. As a direct consequence, if a pair of bounding hyperparameters is available, we can use Lemma 1 to construct an error bound that contains all error bounds corresponding to lengthscales within then interval . This is specified in the following result.

###### Theorem 1.

Let Assumptions 1 and 3 hold. Let be vectors of lengthscales with , let be a measurement data set, and let be a sample from a Gaussian process with mean zero and kernel , conditioned on . Furthermore, let denote the posterior mean for arbitrary lengthscales with . Define

 ¯β=γ((2+2σ−2n∥y∥22)12+maxϑ′≤ϑ≤ϑ′′β12(ϑ))

Then,

 |f(x)−μϑ0(x)|≤¯β12σϑ′(x)

holds for all with probability at least .

creftypecap 1 implies that we can obtain a uniform error bound using the scaled posterior variance corresponding to the smallest vector of lengthscales . The term , which we add to in creftypecap 1, originates from the discrepancy between the posterior mean of the working lengthscales and those of the bounding lengthscales . Note that the probabilistic nature of creftypecap 1 stems from Assumption 3 also being probabilistic, and no additional uncertainty is introduced when deriving creftypecap 1. In other words, the confidence region generated by the robust uniform error bound fully contains all confidence regions corresponding to with . This is illustrated in Figure 2.

In order to be able to apply creftypecap 1 in the more general setting where the bounding vectors are not given a priori, the next step is to determine permissible . Since we are in a Bayesian scenario, an intuitive approach is to choose the bounding lengthscales such that they form a confidence interval for some . Note that deriving a similar technique in the frequentist setting, e.g., as in Chowdhury and Gopalan (2017b) or Srinivas et al. (2012), is not straightforward since the corresponding tools are not directly compatible with a distribution over the hyperparameters.

Formally, our approach consists of obtaining a pair of hyperparameters that lies within the set

 Pδ=⎧⎪ ⎪⎨⎪ ⎪⎩(ϑ′,ϑ′′)∈Θ2 ∣∣∣∫ϑ′≤ϑ≤ϑ′′p(ϑ|D)dϑ≥1−δ⎫⎪ ⎪⎬⎪ ⎪⎭, (2)

where the posterior over the lengthscales given the data is computed using Bayes’ rule as

 p(ϑ|y,X)=p(y|X,ϑ)p(ϑ)p(y|X), (3)

and the posterior is computed similarly to (1). The normalizing factor is then given by

 p(y|X)=∫Θp(y|X,ϑ)p(ϑ)dϑ. (4)

By choosing the bounding lengthscales in this fashion, a vector of lengthscales sampled from the posterior distribution lies within the interval with high probability. We are then able to estimate the error in the setting with unknown hyperparameters by applying creftypecap 1, obtaining the following result.

###### Theorem 2.

Let Assumptions 1 and 3 hold. Let be vectors of lengthscales with and choose as in creftypecap 1. Furthermore, let , and let denote a sample from a Gaussian process with kernel as specified by Assumption 2. Then

 (5)

holds for all with probability .

### 3.1 Extension to Noise and Signal Variance

So far, we discussed the setting where the signal and noise variance are constant and known. However, the proposed tools extend straightforwardly to the setting where they are also unknown, provided that a corresponding prior distribution is also available. This is because the posterior variance increases with the noise and signal variance, hence a result of the form of creftypecap 2 can be easily obtained.

### 3.2 Choosing Bounding Hyperparameters

Typically, the set of bounding pairs contains more than one element. Hence, creftypecap 2 allows us some flexibility when deriving the uniform error bound, since we can choose any pair that lies within

. In the following, we derive a heuristic that aims to approximate the smallest possible

confidence region around the working hyperparameters , which would automatically meet the requirements to apply creftypecap 2. Formally, this is achieved by solving the optimization problem

 minϑ′,ϑ′′∈Θ∥ϑ′′−ϑ′∥2s.t.∫ϑ′≤ϑ≤ϑ′′p(ϑ|D)dϑ≥1−δ. (6)

In other words, we aim to choose the pair of bounding hyperparameters that deviate the least from .

### 3.3 Discussion

Although the integrals in Equation 2 and (4

) generally cannot be computed analytically, we can resort to different approximations. In low-dimensional settings, we can solve the integral expression using numerical integration. A further option is to employ Markov chain Monte Carlo (MCMC) methods

Rasmussen and Williams (2006). Alternatively, we can apply approximate inference methods, such as Laplace’s method or expectation propagation, which aim to approximate the posterior with a normal distribution. The latter methods are particularly well suited for settings with a high number of data points, as they do not require a high number of evaluations of the posterior distribution. Note that this is an advantage of the proposed technique compared to fully Bayesian GPs, which typically require some form of MCMC approach. Moreover, after computing the bounding hyperparameters , the computational complexity of evaluating the error bound is only twice that of a standard GP, whereas in a fully Bayesian GP a potentially high number of GPs has to be evaluated per prediction. A further advantage compared to fully Bayesian GPs is that, in practice, it is reasonable expect better error estimates, particularly if we choose a low risk parameter and a sufficiently smooth prior . This is because the marginal likelihood is typically well behaved in the hyperparameter space, and the fully Bayesian posterior mean and variance will lie within the uniform error bound computed with creftypecap 2. This is supported by the experimental results in Section 5. We note, however, that the proposed approach can also be combined with a fully Bayesian GP, i.e., by using the fully Bayesian posterior mean only for regression, and the presented techniques to bound the regression error.

It is also worth noting that there is no direct analogy between the approach presented in this paper and robust uniform error bounds in the frequentist setting, i.e., where the unknown function is assumed to be fixed with a bounded RKHS norm. We believe this to be a strong argument in favor of employing a Bayesian perspective instead of a frequentist one when performing regression.

## 4 Control with Performance Guarantees

creftypecap 2 can be employed straightforwardly to derive safety guarantees in different settings. In the following, we show how to apply it in a learning-based control setting with a commonly encountered system structure.

### 4.1 Control Problem

We consider an -dimensional dynamical system of the form

 ˙x1=f1(x1)+g(x1)x2˙x2=f2(x1,x2)+g(x1,x2)x3AAa⋮˙xm=fm(x1,…,xm)+g(x1,…,xm)u, (7)

where and denote the system’s states and control input, respectively. The functions are unknown and modeled using GPs, whereas we assume to know . These assumptions are common in this setting Kwan and Lewis (2000); Capone and Hirche (2019).

Our goal is to design a control law that steers the subsystem towards a desired time-dependent trajectory . In other words, we aim to reduce the norm of the time-dependent error . In the following, we assume that is times continuously differentiable, and that all its derivatives are bounded, which is not a restrictive assumption. For the sake of simplicity, we henceforth use the notation and

In order to provide performance guarantees for (7), we require some additional smoothness assumptions with respect to the kernels used to model the functions . This is expressed formally in the following.

###### Assumption 4.

Each function is drawn from a Gaussian process with times differentiable kernel .

Assumption 4 implies that the functions are times differentiable, which is not a restrictive assumption, as it applies for many systems of the form given by (7), e.g., robotic manipulators, jet engines, and induction motors Kwan and Lewis (2000).

### 4.2 Backstepping Control

In order to track the desired trajectory , we employ a backstepping technique similar to the one proposed in Capone and Hirche (2019). The idea behind backstepping is to recursively design fictitious control inputs for each subsystem , which we then aim to track by means of the true control input . For more details, the reader is referred to Krstic, Kokotovic, and Kanellakopoulos (1995).

To model the unknown functions , we assume to have noisy measurements . The control input can then be computed recursively using

 x2,d=g−11(−μ1+˙xd−C1e1)AAa⋮xi,d=g−1i−1(−μi−1+˙xi−1,d−Ci−2ei−2)AAa⋮u=g−1m(−μm+˙xm,d−em−gm−1em−1), (8)

where denotes the posterior mean of the -th Gaussian process, the tracking error of the -th subsystem, and are state-dependent control gains.

The stability analysis of the closed-loop system is carried out using Lyapunov theory. By using a quadratic Lyapunov function, it is straightforward to show that there exists an ultimate upper bound for the tracking error , i.e., holds for large enough , where depends on the control gains (Capone and Hirche, 2019, Lemma 2). By using this result and creftypecap 2, we are able to determine adaptive control gains that yield a tracking error smaller or equal than a predetermined value , which is crucial in safety-critical applications. This is formally stated in the following result.

###### Theorem 3.

Let Assumptions 4 and 1 hold, and let the control input be given by (8). Furthermore, let be bounding hyperparameters for the subsystems , obtained with (6) and risk parameter , and let be the corresponding scaling functions. Choose , as in creftypecap 1, and choose each state-dependent control gain as

 Ci(x)\coloneqq1εd ⎷m∑j=1¯βjσ2ϑ′j(x), (9)

where are chosen as in Lemma 1. Then, with probability , there exists a , such that

 ∥x(t)−xd(t)∥2≤εd (10)

holds for all .

Hence, we can use (9) to obtain the control gains required for the control performance specification .

## 5 Results

We now present experimental results where the performance of the proposed error bound is compared to that of vanilla and fully Bayesian GPs. We first showcase the prediction error on regression benchmarks, then apply the proposed technique to design a learning-based control law. A Gaussian kernel is used in all cases except the Mauna Loa experiments, and we also consider uncertainty in the signal and noise variances, as discussed in Section 3. In all except the Sarcos experiments, the fully Bayesian GPs and integral expression in (6) are approximated using the No-U Turn Sampler algorithm for MCMC Hoffman, Gelman et al. (2014). The implementations are carried out using GPyTorch Gardner et al. (2018).

The values for proposed in theory are typically very conservative Srinivas et al. (2012), which makes their use impractical. Similarly, the scaling factor proposed by creftypecap 2 assumes very high values. In many safety-critical applications, it is common to choose the scaling factor as , independently of the hyperparameters Berkenkamp et al. (2017); Umlauft et al. (2017). Moreover, empirical results indicate that Lemma 1 holds with in the case of a Gaussian kernel, i.e., the error bound will be at least as robust as that of the vanilla GP for . Hence, to obtain a fair and not overly conservative comparison, we set in the following experiments. The working hyperparameters are chosen by maximizing the log marginal likelihood (1). We employ a confidence parameter of for (6), and compute the integral expressions using Monte Carlo integration. We assume uniform distributions as hyperpriors . Note that even though this imposes hard bounds on the hyperparameters, these are still very large.

### 5.1 Regression - Toy Problem and Benchmarks

In the regression experiments, to additionally illustrate the behavior of the proposed technique as more data becomes available, we train the GPs with data sets of varying sizes .

We first investigate how the proposed technique performs when estimating the regression error of a function that is sampled from a GP. The results can be seen in Figure 4. Our approach always captures the behavior of the underlying sample . Both the vanilla GP error bound and the fully Bayesian error bound fail to do so if little data is available (Figure 3(a)) or if the data is sparse (Figure 3(b)). The estimated error is only accurate for the vanilla and fully Bayesian GPs if the data covers the state space sufficiently well (Figure 3(c)).

We now apply the proposed technique to estimate the regression error of the Boston house prices data set Pedregosa et al. (2011), the UCI wine quality data set Cortez et al. (2009), the Mauna Loa CO time series, and the Sarcos data set. For the Mauna Loa data set, we employ a spectral mixture kernel with mixtures Wilson and Adams (2013), which consists of a sum of Gaussian kernels multiplied with sinusoidal kernels. It is straightforward to show that its Fourier transform increases with the lengthscales, hence we are able to apply creftypecap 2. The mixture means, which specify the frequency of the periodic components, are assumed to be fixed except for the fully Bayesian case. For the two largest Sarcos data sets, we perform a Laplace approximation around the maximum of the posterior, which yields a normal distribution MacKay (2002). To ensure that the corresponding covariance matrix is positive definite, we employ an empirical Bayes approach and specify a quadratic hyperprior around the log likelihood maximum. The bounding hyperparameters are then obtained by computing the corresponding rectangular confidence region as in Šidák (1967).

We run each scenario multiple times and select the training and test points randomly every time. The Sarcos experiments are repeated times due to high computational requirements, all other experiments are repeated times. To evaluate performance, we check how often the error bound is incorrect, which corresponds to the measuring the quantity

 1Ntest⎛⎝Ntest∑m=1I(|ytestm−μϑ0(xtestm)|−β12σϑ′(xtestm))⎞⎠,

where the superscript test denotes test inputs/outputs, and

 I(z)={0,if z≤01,otherwise

is the indicator function. Note that this essentially corresponds to the calibration error metric for regression suggested by Kuleshov, Fenner, and Ermon (2018).

The results are summarized in Table 1. As can be seen, our bound is always more accurate than that of vanilla or fully Bayesian GPs, particularly in low-data regimes. This is to be expected, as the marginal likelihood function is known to be poorly peaked for small MacKay (1999).

### 5.2 Control Design with Little Data

We now apply the proposed technique to design a control law for a safety-critical setting, where a one-link planar manipulator with motor dynamics is to be steered towards the origin using the method presented in Section 4 and training data points. Note that is not very small for control purposes in the proposed setting, since good performance can already be achieved with as little as data points Capone and Hirche (2019). The data is obtained using a low-gain sinusoidal input.

The manipulator dynamics are given by

 D¨φ+B˙φ+Gsin(φ)= τ M+˙τ+Hτ+Z˙φ= u,

where and are the system’s angle and torque, respectively, is the motor voltage, which we can control directly, and , , , , , are system parameters. By approximating the differential equation of each state , , using a separate Gaussian process, we are able to employ a backstepping technique to track a desired trajectory, as described in Section 4. We then use creftypecap 3 to choose the control gains , , by setting the desired error to .

To evaluate the control performance, we run simulations for each setting, where the initial conditions of the manipulator are randomly sampled from a normal distribution. The norm of the resulting tracking error is displayed in Figure 5. As can be seen, the gains obtained using the proposed approach perform considerably better than the ones obtained with vanilla and fully Bayesian GPs. This is because the posterior variance of the Gaussian process decreases too slowly away from the collected data points, expressing too much confidence in the posterior mean.

## 6 Conclusion

We have presented robust uniform error bounds for Gaussian processes with unknown hyperparameters. Our approach is applicable for stationary radially non-decreasing kernels, which are commonly employed in practice. It hinges on computing a confidence region for the hyperparameters, which does not require pre-specified bounds for the hyperparameters. The presented theoretical results make them flexible and easily applicable to safety-critical scenarios, where theoretical guarantees are often required. In numerical regression benchmarks, the proposed error bound was shown to outperform the error bound obtained with standard and fully Bayesian Gaussian processes. Furthermore, the presented tool resulted in better control performance in a control problem, indicating better suitability for safety-critical settings.

## Appendix A Proofs

We begin by listing some well-known results required to prove the main statements from the paper.

###### Lemma 2 (Wendland, 2004, Theorem 6.11).

Suppose that is a continuous function. Then is positive definite if and only if its Fourier transform is nonnegative and nonvanishing.

###### Lemma 3 (Rao et al., 1973, p.70).

Let be two symmetric positive-definite matrices, such that is symmetric positive-definite. Then is symmetric positive-definite.

The following result is a direct consequence of the block matrix inversion formula, which can be found, e.g., in Horn and Johnson, 2012, p.18.

###### Lemma 4 (Horn and Johnson, 2012, p.18).

Let

 K=(~Kkk⊤k)

be an non-singular partitioned matrix. Then the lower-right entry of its inverse is given by

 [K−1]N,N=(k−k⊤~K−1k)−1.

By putting Lemma 3 and Lemma 4 together, we obtain the following statement.

###### Lemma 5.

Let

 K1=(~K1k1k⊤1k1),K2=(~K2k2k⊤2k2)

be symmetric positive definite partitioned matrices, such that is also positive definite. Then

 k1−k⊤1~K−11k1≥k2−k⊤2~K−12k2.
###### Proof.

From Lemma 3, we obtain . In particular, for , this implies, together with Lemma 4,

 (k2−k⊤2~K−12k2)−1=q⊤K−12q ≥ q⊤K−11q=(k1−k⊤1~K−11k1)−1,

i.e., . ∎

Using Lemma 2, we then obtain the following.

###### Lemma 6.

Let , be two kernels with lengthscales , let denote a measurement data set of an unknown function , and let and denote the corresponding covariance matrices. Then the matrix

 Kϑ′−d∏i=1ϑ′iϑiKϑ

is positive semi-definite.

###### Proof.

For an arbitrary , define . By Assumption 1, the Fourier transform of is given by

 ^kϑ(ω)=∫Rdk(Tx)e−i2πx⊤ωdx\mathclapz=Tx=∫Rdk(z)e−i2πz⊤T−⊤ωdet(T−1)dz=(d∏i=1ϑi)∫Rdk(z)e−i2πz⊤T−⊤ωdz=(d∏i=1ϑi)^k(T−1ω)=(d∏i=1ϑi)κ(∥T−1ω∥2). (11)

Since is non-increasing, for any pair of lengthschales with ,

 κ(∥(T′)−1ω∥2)≥κ(∥T−1ω∥2)

holds, where . Hence, (11) implies

 d∏i=1ϑiϑ′i^kϑ′(ω)−^kϑ(ω)≥0

for all . Since , it then follows from Lemma 2 that

 d∏i=1ϑiϑ′ikϑ′(⋅,⋅)−kϑ′(⋅,⋅)

is a positive-definite function, which implies the desired result. ∎

###### Proof of Lemma 1.

The result follows directly from Lemma 5 and Lemma 6. ∎

In order to prove creftypecap 2, we aim to bound the difference between posterior means . To this end, we employ the two following results.

###### Lemma 7.

Let denote the RKHS norm with respect to a kernel , and consider a function with bounded RKHS norm . Then, for all ,

 |μ(x)|2≤σ2ϑ(x)(∥μ∥2kϑ+N∑i=1(μ(xi)σn)2),

holds, where is the noise variance and are the measurement inputs.

###### Proof.

See Srinivas et al., 2012, Appendix B, eq. (11). ∎

###### Lemma 8 (Bull, 2011).

Let be a function with bounded reproducing kernel Hilbert space norm with respect to the kernel . Then, for all ,

 ∥μ∥2kϑ′≤d∏i=1ϑiϑ′i∥μ∥2kϑ.

We are now able to bound the difference between the means of two Gaussian processes conditioned on the same data, as described in the next statement.

###### Lemma 9.

Let Assumption 1 hold, let be vectors of lengthscales with , and let denote GP posterior means conditioned on some measurement data set . Then

 |μϑ0(x)−μϑ(x)|2≤σ2ϑ′(x)γ2(2+2∥y∥22σ2n)σ2ϑ(x)

holds for all , where is the noise variance, and is the measurement data.

###### Proof.

Define . From Lemma 7, we have

 |μϑ0(x)−μϑ(x)|2≤σ2ϑ′(x)(∥μϑ0−μϑ∥2kϑ′+N∑i=1(μϑ(xi)−μϑ0(xi))2σ2n) (12)

Now, the RKHS norm with respect to of a posterior mean function can be upper-bounded as

 ∥μϑ∥2kϑ=y⊤~K−1ϑKϑ~K−1ϑy≤y⊤~K−1ϑy≤∥y∥2σ2n,

where the first inequality is due to the eigenvalues of

being smaller or equal to one, whereas the last inequality holds because the eigenvalues of are greater or equal to . Furthermore, the summation in (12) can be bounded as

 N∑i=1(μϑ(xi)−μϑ(xi))2σ2n = 1σ2n∥Kϑ~K−1ϑy−Kϑ0~K−1ϑ0y∥22 ≤ 1σ2n∥Kϑ~K−1ϑy∥22+1σ2n∥Kϑ0~K−1ϑ0y∥22≤2∥y∥22σ2n.

Plugging this into (12) together with Lemmas 8 and 8 and yields

 |μϑ0(x)−μϑϑ(x)|2≤σ2ϑ′(x)(∥μϑ0−μϑ∥2kϑ′+N∑i=1(μϑ(xi)−μϑ0(xi))2σ2n)≤σ2ϑ′(x)(∥μϑ0∥2kϑ′+∥μϑ∥2kϑ′+2∥y∥22σ2n)≤σ2ϑ′(x)(d∏i=1ϑ0,iϑ′i∥μϑ0∥2kϑ0+d∏i=1ϑiϑ′i∥μϑ∥2kϑ+2∥y∥22σ2n)≤σ2ϑ′(x)(γ2∥μϑ0∥2kϑ0+γ2∥μϑ∥2kϑ+2∥y∥22σ2n)≤σ2ϑ′(x)γ2(2+2∥y∥22σ2n).

###### Proof of creftypecap 1.

By applying Lemmas 9, 1 and 3, we obtain

 |f(x)−μϑ0(x)| ≤ |f(x)−μϑ(x)|+|μϑ(x)−μϑ0(x)| ≤ (maxϑ′≤ϑ≤ϑ′′β12(ϑ))γσϑ′(x)+γ(2+2∥y∥22σ2n)12σϑ′(x) = ¯β12σϑ′(x).

. ∎

###### Proof of creftypecap 2.

It follows from the definition of that holds with probability at least . creftypecap 2 is then a direct consequence of creftypecap 1. ∎

###### Proof of creftypecap 3.

The dynamics of the control error can be written in the compact form (Capone and Hirche, 2019)

 ˙e=˙x−˙xd=Δf−(C+G)e, (13)

where , ,

 G\coloneqq⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0g10⋯0−g10g2⋮0−g20⋮⋱gd−10⋯−gd−10