# Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control

Data-driven models are subject to model errors due to limited and noisy training data. Key to the application of such models in safety-critical domains is the quantification of their model error. Gaussian processes provide such a measure and uniform error bounds have been derived, which allow safe control based on these models. However, existing error bounds require restrictive assumptions. In this paper, we employ the Gaussian process distribution and continuity arguments to derive a novel uniform error bound under weaker assumptions. Furthermore, we demonstrate how this distribution can be used to derive probabilistic Lipschitz constants and analyze the asymptotic behavior of our bound. Finally, we derive safety conditions for the control of unknown dynamical systems based on Gaussian process models and evaluate them in simulations of a robotic manipulator.


## 1 Introduction

The application of machine learning techniques in control tasks holds significant promise. The identification of highly nonlinear systems through supervised learning techniques [norgaard2000neural] and the automated policy search in reinforcement learning [deisenroth2013survey] enable the control of complex unknown systems. Nevertheless, applications in safety-critical domains such as autonomous driving, robotics or aviation are rare. Even though the data efficiency and performance of self-learning controllers are impressive, engineers still hesitate to rely on learning approaches if the physical integrity of systems is at risk, in particular if humans are involved. Empirical evaluations, e.g. for autonomous driving [huval2015empirical], are available; however, they might not be sufficient to reach the desired level of reliability and autonomy.

Limited and noisy training data lead to imperfections in data-driven models. This makes the quantification of the uncertainty in the model and the knowledge about a model's ignorance key for the utilization of learning approaches in safety-critical applications. Gaussian process (GP) models provide this measure of their own imprecision and have therefore gained attention in the control community [beckers2019stable; Berkenkamp2016a; Umlauft2018]. These approaches rely heavily on error bounds of Gaussian process regression and are therefore limited by the strict assumptions made in previous works on GP uniform error bounds [Srinivas2012; Chowdhury2017a].

The main contribution of this paper is therefore the derivation of a novel GP uniform error bound, which requires less prior knowledge and fewer assumptions than previous approaches and is therefore applicable to a wider range of problems. Furthermore, we derive a Lipschitz constant for the samples of GPs and investigate the asymptotic behavior in order to demonstrate that arbitrarily small error bounds can be guaranteed with sufficient computational resources and data. The proposed GP bounds are employed to derive safety guarantees for unknown dynamical systems which are controlled based on a GP model. By employing Lyapunov theory [Khalil2002], we prove that the closed-loop system, for which we take a robotic manipulator as an example, converges to a small fraction of the state space and can therefore be considered safe.

The remainder of this paper is structured as follows: We briefly introduce Gaussian process regression and discuss related error bounds in Section 2. The novel proposed GP uniform error bound, the probabilistic Lipschitz constant and the asymptotic analysis are presented in Section 3. In Section 4 we show safety of a GP model based controller and evaluate it on a robotic manipulator in Section 5.

## 2 Background

### 2.1 Gaussian Process Regression

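Gaussian process regression is a Bayesian machine learning method based on the assumption that any finite collection of random variables $y_i \in \mathbb{R}$ follows a joint Gaussian distribution with prior mean $0$ and covariance kernel $k(\cdot,\cdot)$ [Rasmussen2006]. Therefore, the variables $y_i$ are observations of a sample function $f(\cdot)$ of the GP distribution perturbed by zero mean Gaussian noise with variance $\sigma_n^2$. By concatenating $N$ input data points $\boldsymbol{x}_i$ in a matrix $\boldsymbol{X}_N$, the elements of the GP kernel matrix $\boldsymbol{K}(\boldsymbol{X}_N,\boldsymbol{X}_N)$ are defined as $K_{ij} = k(\boldsymbol{x}_i,\boldsymbol{x}_j)$, $i,j=1,\ldots,N$, and $\boldsymbol{k}(\boldsymbol{X}_N,\boldsymbol{x})$ denotes the kernel vector, which is defined accordingly. The probability distribution of the GP at a point $\boldsymbol{x}$ conditioned on the training data concatenated in $\boldsymbol{X}_N$ and $\boldsymbol{y}_N$ is then given as a normal distribution with mean

$$\nu_N(\boldsymbol{x}) = \boldsymbol{y}_N^T \left(\boldsymbol{K}(\boldsymbol{X}_N,\boldsymbol{X}_N) + \sigma_n^2 \boldsymbol{I}_N\right)^{-1} \boldsymbol{k}(\boldsymbol{X}_N,\boldsymbol{x})$$

and variance

$$\sigma_N^2(\boldsymbol{x}) = k(\boldsymbol{x},\boldsymbol{x}) - \boldsymbol{k}^T(\boldsymbol{X}_N,\boldsymbol{x}) \left(\boldsymbol{K}(\boldsymbol{X}_N,\boldsymbol{X}_N) + \sigma_n^2 \boldsymbol{I}_N\right)^{-1} \boldsymbol{k}(\boldsymbol{X}_N,\boldsymbol{x}).$$

Notation: Lower/upper case bold symbols denote vectors/matrices and $\mathbb{R}_+$/$\mathbb{R}_{+,0}$ all real positive/non-negative numbers. $\mathbb{N}$ denotes all natural numbers, $\boldsymbol{I}_n$ the $n \times n$ identity matrix, the dot in $\dot{x}$ the derivative of $x$ with respect to time and $\|\cdot\|$ the Euclidean norm. A function $f(\boldsymbol{x})$ is said to admit a modulus of continuity $\omega(\cdot)$ if and only if $|f(\boldsymbol{x}) - f(\boldsymbol{x}')| \le \omega(\|\boldsymbol{x} - \boldsymbol{x}'\|)$. The $\tau$-covering number $M(\tau,\mathbb{X})$ of a set $\mathbb{X}$ (with respect to the Euclidean metric) is defined as the minimum number of spherical balls with radius $\tau$ which is required to completely cover $\mathbb{X}$. Big $\mathcal{O}$ notation is used to describe the asymptotic behavior of functions.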
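The posterior mean and variance described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the squared exponential kernel, its hyperparameters `sigma_f` and `length`, the function names and the toy data are all assumptions made for the example.

```python
import numpy as np


def se_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential kernel matrix between the rows of A (n x d) and B (m x d)."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length**2)


def gp_posterior(X_train, y_train, X_test, sigma_n=0.1, sigma_f=1.0, length=1.0):
    """Posterior mean nu_N(x) and variance sigma_N^2(x) of a zero-mean GP."""
    N = X_train.shape[0]
    K = se_kernel(X_train, X_train, sigma_f, length) + sigma_n**2 * np.eye(N)
    k_star = se_kernel(X_test, X_train, sigma_f, length)   # shape (m, N)
    chol = np.linalg.cholesky(K)                            # K = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_star @ alpha
    v = np.linalg.solve(chol, k_star.T)                     # shape (N, m)
    var = sigma_f**2 - np.sum(v**2, axis=0)                 # k(x, x) = sigma_f^2
    return mean, np.maximum(var, 0.0)


# Hypothetical toy data: noisy samples of sin(x) on [0, 2*pi].
rng = np.random.default_rng(0)
X_train = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(20)
X_test = np.linspace(0.0, 2.0 * np.pi, 50)[:, None]
mean, var = gp_posterior(X_train, y_train, X_test)
print(mean[:3], np.sqrt(var[:3]))
```

The Cholesky factorization is used instead of an explicit matrix inverse for numerical stability; the returned variance is clipped at zero to guard against round-off.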

### 2.2 Related Work

For many methods closely related to Gaussian process regression, uniform error bounds are very common. When dealing with noise-free data, i.e. in interpolation of multivariate functions, results from the field of scattered data approximation with radial basis functions can be applied [Wendland2005]. In fact, many of the results from interpolation with radial basis functions can be directly applied to noise-free GP regression with stationary kernels. The classical result in [Wu1993] employs Fourier transform methods to derive an error bound for functions in the reproducing kernel Hilbert space (RKHS) attached to the interpolation kernel. By additionally exploiting properties of the RKHS, a uniform error bound with increased convergence rate is derived in [Schaback2002]. As is typical for this form of bound, it crucially depends on the so-called power function, which corresponds to the posterior standard deviation of Gaussian process regression under certain conditions [Kanagawa2018]. In [Hubbert2004] an error bound for data distributed on a sphere is developed, while the bound in [Narcowich2006] extends existing approaches to functions from Sobolev spaces. Bounds for anisotropic kernels and the derivatives of the interpolant are developed in [Beatson2010]. A Sobolev type error bound for interpolation with Matérn kernels is derived in [Stuart2018]. Moreover, it is shown that convergence of the interpolation error implies convergence of the posterior GP variance.

Regularized kernel regression is a method which extends many ideas from scattered data interpolation to noisy observations and it is highly related to Gaussian process regression, as pointed out in [Kanagawa2018]. In fact, the GP posterior mean function is identical to kernel ridge regression with squared cost function [Rasmussen2006]. Many error bounds such as [Mendelson2002] depend on the empirical covering number and the norm of the unknown function in the RKHS attached to the regression kernel. In [Zhang2005] the effective dimension of the feature space, in which regression is performed, is employed to derive a uniform error bound. The effect of approximations of the kernel, e.g. with the Nyström method, on the regression error is analyzed in [Cortes2010]. Tight error bounds using empirical covering numbers are derived under mild assumptions in [Shi2013]. Finally, error bounds for general regularization are developed in [Dicker2017], which depend on regularization and the RKHS norm of the function.

Using similar RKHS based methods for Gaussian process regression, uniform error bounds depending on the maximum information gain and the RKHS norm have been developed in [Srinivas2012]. While regularized kernel regression allows a wide range of observation noise distributions, the bound in [Srinivas2012] only holds for bounded sub-Gaussian noise. Based on this work, an improved bound is derived in [Chowdhury2017a] in order to analyze the regret of an upper confidence bound algorithm in multi-armed bandit problems. Although these bounds are frequently used in safe reinforcement learning and control, they suffer from several issues. On the one hand, they depend on constants which are very difficult to calculate. While this is no problem for theoretical analysis, it prohibits the integration of these bounds into algorithms, and often estimates of the constants must be used. On the other hand, they suffer from the general problem of RKHS approaches that the space of functions for which the bounds hold becomes smaller the smoother the kernel is [Narcowich2006]. In fact, the RKHS attached to a covariance kernel is usually small compared to the support of the prior distribution of a Gaussian process [VanderVaart2011].

The latter issue has been addressed by considering the support of the prior distribution of the Gaussian process as belief space. Based on bounds for the suprema of GPs [Adler2007] and existing error bounds for interpolation with radial basis functions, a uniform error bound for Kriging (an alternative term for GP regression with noise-free training data) is derived in [Wang2019]. However, to the best of our knowledge, the uniform error of Gaussian process regression with noisy observations has not been analyzed with the help of the prior GP distribution.

## 3 Probabilistic Uniform Error Bound

While uniform error bounds for the cases of noise-free observations and the restriction to subspaces of an RKHS are widely used, they often rely on constants which are difficult to compute and are typically limited to unnecessarily small function spaces. The inherent probability distribution of GPs, which is the largest possible function space for regression with a certain GP, has not been exploited to derive uniform error bounds for Gaussian process regression with noisy observations. Under the weak assumption of Lipschitz continuity of the covariance kernel and the unknown function, a directly computable uniform error bound is derived in Section 3.1. We demonstrate how Lipschitz constants for unknown functions directly follow from the assumed distribution over the function space in Section 3.2. Finally, we show that an arbitrarily small error bound can be reached with sufficiently many and well distributed training data in Section 3.3.

### 3.1 Exploiting Lipschitz Continuity of the Unknown Function

In contrast to the RKHS based approaches in [Srinivas2012; Chowdhury2017a], we make use of the inherent probability distribution over the function space defined by Gaussian processes. We achieve this through the following assumption.

###### Assumption 3.1.

The unknown function $f(\cdot)$ is a sample from a Gaussian process with covariance kernel $k(\cdot,\cdot)$ and observations $y_i$ are perturbed by zero mean i.i.d. Gaussian noise with variance $\sigma_n^2$.

This assumption includes much information about the regression problem. The space of sample functions $\mathcal{F}$ is limited through the choice of the kernel $k(\cdot,\cdot)$ of the Gaussian process. Using Mercer's decomposition $\phi_i(\boldsymbol{x})$, $i=1,\ldots,\infty$ [Mercer1909] of the kernel $k(\cdot,\cdot)$, this space is defined through

$$\mathcal{F} = \left\{ f(\boldsymbol{x}) : \exists \lambda_i,\ i=1,\ldots,\infty \text{ such that } f(\boldsymbol{x}) = \sum_{i=1}^{\infty} \lambda_i \phi_i(\boldsymbol{x}) \right\}, \tag{1}$$

which contains all functions that can be represented in terms of the kernel $k(\cdot,\cdot)$. By choosing a suitable class of covariance functions $k(\cdot,\cdot)$, this space can be designed in order to incorporate prior knowledge of the unknown function $f(\cdot)$. For example, for covariance kernels $k(\cdot,\cdot)$ which are universal in the sense of [Steinwart2001], continuous functions can be learned with arbitrary precision. Moreover, for the squared exponential kernel the space of sample functions corresponds to the space of continuous functions on $\mathbb{X}$, while its RKHS is limited to analytic functions [VanderVaart2011]. Furthermore, Assumption 3.1 defines a prior GP distribution over the sample space $\mathcal{F}$, which is the basis for the calculation of the posterior probability. The prior distribution is typically shaped by the hyperparameters of the covariance kernel $k(\cdot,\cdot)$, e.g. slowly varying functions can be assigned a higher probability than functions with high derivatives. Finally, Assumption 3.1 allows Gaussian observation noise, which is in contrast to the bounded noise required e.g. in [Srinivas2012; Chowdhury2017a].

In addition to Assumption 3.1 we need Lipschitz continuity of the kernel $k(\cdot,\cdot)$ and the unknown function $f(\cdot)$. We define the Lipschitz constant of a differentiable covariance kernel $k(\cdot,\cdot)$ through

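$$L_k := \max_{\boldsymbol{x},\boldsymbol{x}' \in \mathbb{X}} \left\| \begin{bmatrix} \frac{\partial k(\boldsymbol{x},\boldsymbol{x}')}{\partial x_1} & \cdots & \frac{\partial k(\boldsymbol{x},\boldsymbol{x}')}{\partial x_d} \end{bmatrix}^T \right\|. \tag{2}$$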

Since most of the practically used covariance kernels such as squared exponential and Matérn kernels are Lipschitz continuous [Rasmussen2006], this is a weak restriction on covariance kernels. However, it allows us to prove continuity of the posterior mean function $\nu_N(\cdot)$ and the posterior standard deviation $\sigma_N(\cdot)$, which is exploited to derive a uniform error bound in the following theorem. The proofs for all the following theorems can be found in the supplementary material.

###### Theorem 3.1.

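Consider a zero mean Gaussian process defined through the continuous covariance kernel $k(\cdot,\cdot)$ with Lipschitz constant $L_k$ on the compact set $\mathbb{X}$. Furthermore, consider a continuous unknown function $f(\cdot)$ with Lipschitz constant $L_f$ and $N$ observations $y_i$ satisfying Assumption 3.1. Then, the posterior mean $\nu_N(\cdot)$ and standard deviation $\sigma_N(\cdot)$ of a Gaussian process conditioned on the training data are continuous with Lipschitz constant $L_{\nu_N}$ and modulus of continuity $\omega_{\sigma_N}(\cdot)$ on $\mathbb{X}$ such that

$$L_{\nu_N} \le L_k \sqrt{N} \left\| \left(\boldsymbol{K}(\boldsymbol{X}_N,\boldsymbol{X}_N) + \sigma_n^2 \boldsymbol{I}_N\right)^{-1} \boldsymbol{y}_N \right\| \tag{3}$$

$$\omega_{\sigma_N}(\tau) \le \sqrt{2 \tau N L_k \left\| \left(\boldsymbol{K}(\boldsymbol{X}_N,\boldsymbol{X}_N) + \sigma_n^2 \boldsymbol{I}_N\right)^{-1} \right\| \max_{\boldsymbol{x},\boldsymbol{x}' \in \mathbb{X}} k(\boldsymbol{x},\boldsymbol{x}')}. \tag{4}$$

Moreover, pick $\delta \in (0,1)$, $\tau \in \mathbb{R}_+$ and set

$$\beta(\tau) = \log\left(\frac{M(\tau,\mathbb{X})}{\delta}\right) \tag{5}$$

$$\gamma(\tau) = \left(L_{\nu_N} + L_f\right)\tau + \omega_{\sigma_N}(\tau). \tag{6}$$

Then, it holds that

$$P\left( |f(\boldsymbol{x}) - \nu_N(\boldsymbol{x})| \le \sqrt{\beta(\tau)}\,\sigma_N(\boldsymbol{x}) + \gamma(\tau),\ \forall \boldsymbol{x} \in \mathbb{X} \right) \ge 1 - \delta. \tag{7}$$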

Note that most of the equations in Theorem 3.1 can be directly evaluated. Although our expression for $\beta(\tau)$ depends on the covering number of $\mathbb{X}$, which is in general difficult to calculate, upper bounds can be computed easily. For example, for a hypercubic set $\mathbb{X} \subset \mathbb{R}^d$ the covering number can be bounded by

$$M(\tau,\mathbb{X}) \le \left(1 + \frac{r}{\tau}\right)^d, \tag{8}$$

where $r$ is the edge length of the hypercube. Furthermore, (3) and (4) depend only on the training data and kernel expressions, which can be calculated analytically in general. Therefore, (7) can be computed for fixed $\tau$ and $\delta$ if an upper bound for the Lipschitz constant $L_f$ of the unknown function is known. Prior bounds on the Lipschitz constant $L_f$ are often available for control systems, e.g. due to simplified first order physical models. However, we demonstrate a method to obtain probabilistic Lipschitz constants from Assumption 3.1 in Section 3.2. Therefore, it is straightforward to compute all expressions in Theorem 3.1 or upper bounds thereof, which emphasizes the high applicability of Theorem 3.1 in safe control of unknown systems.

Moreover, it should be noted that $\tau$ can be chosen arbitrarily small such that the effect of the constant $\gamma(\tau)$ can always be reduced to an amount which is negligible compared to $\sqrt{\beta(\tau)}\,\sigma_N(\boldsymbol{x})$. Even a conservative approximation of the Lipschitz constants $L_{\nu_N}$ and $L_f$ as well as a loose modulus of continuity $\omega_{\sigma_N}(\tau)$ do not affect the error bound (7) much since (5) grows merely logarithmically with diminishing $\tau$. In fact, we use this logarithmic behavior to prove a vanishing uniform error bound under weak assumptions in Section 3.3.
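To make the computability claim concrete, the following sketch evaluates (3) to (8) for an isotropic squared exponential kernel on a hypercube. It is an illustration under stated assumptions rather than the authors' code: the kernel choice, the closed form `L_k = sigma_f**2 / (length * sqrt(e))` (obtained here by maximizing the gradient norm of the kernel), the reading of (4) with the square root spanning the whole product, and all names and toy values are assumptions.

```python
import numpy as np


def se_kernel(A, B, sigma_f, length):
    """Isotropic squared exponential kernel matrix between rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq, 0.0) / length**2)


def uniform_error_bound(X_train, y_train, X_test, L_f, r, tau, delta,
                        sigma_f=1.0, length=1.0, sigma_n=0.1):
    """Return the posterior mean and sqrt(beta) * sigma_N(x) + gamma from (7)."""
    N, d = X_train.shape
    K = se_kernel(X_train, X_train, sigma_f, length) + sigma_n**2 * np.eye(N)
    K_inv = np.linalg.inv(K)  # acceptable for the small N of this sketch

    # Assumed kernel constants for the SE kernel: the gradient norm
    # k * ||x - x'|| / length^2 is maximal at ||x - x'|| = length.
    L_k = sigma_f**2 / (length * np.sqrt(np.e))
    k_max = sigma_f**2

    L_nu = L_k * np.sqrt(N) * np.linalg.norm(K_inv @ y_train)                  # (3)
    omega = np.sqrt(2.0 * tau * N * L_k * np.linalg.norm(K_inv, 2) * k_max)    # (4)
    log_M = d * np.log1p(r / tau)               # log of the covering bound (8)
    beta = log_M - np.log(delta)                                                # (5)
    gamma = (L_nu + L_f) * tau + omega                                          # (6)

    k_star = se_kernel(X_test, X_train, sigma_f, length)
    mean = k_star @ (K_inv @ y_train)
    var = sigma_f**2 - np.sum((k_star @ K_inv) * k_star, axis=1)
    sigma = np.sqrt(np.maximum(var, 0.0))
    return mean, np.sqrt(beta) * sigma + gamma                                  # (7)


# Hypothetical one-dimensional example on the interval [0, 2*pi].
X_train = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y_train = np.sin(X_train[:, 0])
X_test = np.linspace(0.0, 2.0 * np.pi, 30)[:, None]
mean, bound = uniform_error_bound(X_train, y_train, X_test,
                                  L_f=1.0, r=2.0 * np.pi, tau=1e-8, delta=0.01)
print(bound.max())
```

Working with the logarithm of the covering number bound avoids overflow for very small `tau`, which matches the observation that (5) grows only logarithmically as $\tau$ shrinks.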

### 3.2 Probabilistic Lipschitz Constants for Gaussian Processes

If little prior knowledge of the unknown function $f(\cdot)$ is given, it might not be possible to directly derive a Lipschitz constant $L_f$ on $\mathbb{X}$. However, we indirectly assume a certain distribution of the derivatives of $f(\cdot)$ with Assumption 3.1. Therefore, it is possible to derive a probabilistic Lipschitz constant $L_f$ from this assumption, which is described in the following theorem.

###### Theorem 3.2.

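Consider a zero mean Gaussian process defined through the covariance kernel $k(\cdot,\cdot)$ with continuous partial derivatives up to the fourth order and partial derivative kernels

$$k^{\partial i}(\boldsymbol{x},\boldsymbol{x}') = \frac{\partial^2}{\partial x_i \partial x_i'} k(\boldsymbol{x},\boldsymbol{x}') \quad \forall i = 1,\ldots,d. \tag{9}$$

Let $L_k^{\partial i}$ denote the Lipschitz constants of the partial derivative kernels $k^{\partial i}(\cdot,\cdot)$ on the set $\mathbb{X}$ with maximal extension $r$. Then, a sample function $f(\cdot)$ of the Gaussian process is almost surely continuous on $\mathbb{X}$ and with probability of at least $1-\delta_L$ it holds that

$$L_f = \left\| \begin{bmatrix} \sqrt{2\log\left(\frac{2d}{\delta_L}\right)} \max\limits_{\boldsymbol{x}\in\mathbb{X}} \sqrt{k^{\partial 1}(\boldsymbol{x},\boldsymbol{x})} + 12\sqrt{6d}\, \max\left\{ \max\limits_{\boldsymbol{x}\in\mathbb{X}} \sqrt{k^{\partial 1}(\boldsymbol{x},\boldsymbol{x})},\ \sqrt{r L_k^{\partial 1}} \right\} \\ \vdots \\ \sqrt{2\log\left(\frac{2d}{\delta_L}\right)} \max\limits_{\boldsymbol{x}\in\mathbb{X}} \sqrt{k^{\partial d}(\boldsymbol{x},\boldsymbol{x})} + 12\sqrt{6d}\, \max\left\{ \max\limits_{\boldsymbol{x}\in\mathbb{X}} \sqrt{k^{\partial d}(\boldsymbol{x},\boldsymbol{x})},\ \sqrt{r L_k^{\partial d}} \right\} \end{bmatrix} \right\| \tag{10}$$

is a Lipschitz constant of $f(\cdot)$ on $\mathbb{X}$.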

Note that a higher differentiability of the covariance kernel $k(\cdot,\cdot)$ is required compared to Theorem 3.1. The reason for this is that the proof of Theorem 3.2 exploits the fact that the partial derivative $k^{\partial i}(\cdot,\cdot)$ of a differentiable kernel is again a covariance function, which defines a derivative Gaussian process [Ghosal2006]. In order to obtain continuity of the samples of these derivative processes, the derivative kernels $k^{\partial i}(\cdot,\cdot)$ must be continuously differentiable [Dudley1967].

Furthermore, note that all the values required in (10) can be directly computed. The maximum of the derivative kernels $k^{\partial i}(\cdot,\cdot)$ as well as their Lipschitz constants $L_k^{\partial i}$ can be calculated analytically for many kernels. Therefore, the Lipschitz constant obtained with Theorem 3.2 can be directly used in Theorem 3.1 through application of the union bound. Since the Lipschitz constant $L_f$ has only a logarithmic dependence on the probability $\delta_L$, small error probabilities for the Lipschitz constant can easily be achieved.
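As a sketch of how (10) could be evaluated, the function below takes the per-dimension maxima of the derivative kernels and their Lipschitz constants as inputs and returns the resulting probabilistic Lipschitz constant. The function name, argument names and the numbers in the usage lines are hypothetical; how the inputs are obtained depends on the kernel and is not covered here.

```python
import numpy as np


def probabilistic_lipschitz_constant(max_dk, L_dk, r, delta_L):
    """Evaluate (10).

    max_dk  : array of max_x k^{di}(x, x) for i = 1..d
    L_dk    : array of Lipschitz constants of the derivative kernels k^{di}
    r       : maximal extension of the input set
    delta_L : admissible failure probability for the Lipschitz constant
    """
    max_dk = np.asarray(max_dk, dtype=float)
    L_dk = np.asarray(L_dk, dtype=float)
    d = max_dk.size
    sqrt_max = np.sqrt(max_dk)
    first = np.sqrt(2.0 * np.log(2.0 * d / delta_L)) * sqrt_max
    second = 12.0 * np.sqrt(6.0 * d) * np.maximum(sqrt_max, np.sqrt(r * L_dk))
    return float(np.linalg.norm(first + second))


# Hypothetical inputs for a two-dimensional problem.
L_f = probabilistic_lipschitz_constant(max_dk=[1.0, 4.0], L_dk=[1.5, 12.0],
                                       r=3.0, delta_L=0.01)
print(L_f)
```

Because `delta_L` enters only through the logarithm, tightening it by several orders of magnitude changes the result only moderately, in line with the remark above.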

### 3.3 Analysis of Asymptotic Behavior

In safe reinforcement learning and control of unknown systems an important question regards the existence of lower bounds for the learning error because they limit the achievable control performance. It is clear that the available data and constraints on the computational resources pose such lower bounds in practice. However, it is not clear under which conditions, e.g. requirements of computational power, an arbitrarily low uniform error can be guaranteed. The asymptotic analysis of the error bound, i.e. investigation of the bound (7) in the limit $N \to \infty$, can clarify this question. The following theorem is the result of this analysis.

###### Theorem 3.3.

Consider a zero mean Gaussian process defined through the continuous covariance kernel $k(\cdot,\cdot)$ with Lipschitz constant $L_k$ on the set $\mathbb{X}$. Furthermore, consider a process generating infinitely many observations of an unknown function $f(\cdot)$ with Lipschitz constant $L_f$ and maximum absolute value  on $\mathbb{X}$ which satisfies Assumption 3.1. Let $\nu_N(\cdot)$ and $\sigma_N(\cdot)$ denote the mean and standard deviation of the Gaussian process conditioned on the first $N$ observations. If there exists a  such that the standard deviation satisfies , , then it holds for every $\delta \in (0,1)$ that

$$P\left( \lim_{N\to\infty} \sup_{\boldsymbol{x}\in\mathbb{X}} \left\| \nu_N(\boldsymbol{x}) - f(\boldsymbol{x}) \right\| = 0 \right) \ge 1 - \delta. \tag{11}$$

In addition to the conditions of Theorem 3.1, the absolute value of the unknown function is required to be bounded by a value . This is necessary to bound the Lipschitz constant of the posterior mean function in the limit of infinite training data. Even if no such constant is known, it can be derived from properties of the GP under weak conditions similarly as in Theorem 3.2. Furthermore, the posterior variance has to converge to zero sufficiently fast with increasing number of training data $N$. This can be seen as a condition for the distribution of the training data, which depends on the structure of the covariance kernel. In fact, it is straightforward to derive a similar condition for the uniform error bounds in [Srinivas2012; Chowdhury2017a]. However, due to their dependence on the maximal information gain, the required decrease rates depend on the covariance kernel and are typically higher. For example, the posterior variance of a Gaussian process with squared exponential kernel must satisfy for [Srinivas2012] and for [Chowdhury2017a].

## 4 Safety Guarantees for Control of Unknown Dynamical Systems

Safety guarantees for dynamical systems in terms of upper bounds for the tracking error are becoming more and more relevant as learning controllers are applied in safety-critical applications like autonomous driving or robots working in close proximity to humans. We therefore show how the results in Theorem 3.1 can be applied to safely control unknown dynamical systems. In Section 4.1 we propose an approach for the tracking control problem of systems which are learned with GPs. The stability of the resulting controller is analyzed in Section 4.2.

### 4.1 Tracking Control Design

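Consider the nonlinear control affine dynamical system

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = f(\boldsymbol{x}) + u, \tag{12}$$

with state $\boldsymbol{x} = [x_1\ x_2]^T$ and control input $u$. While the structure of the dynamics (12) is known, the function $f(\cdot)$ is not. However, we assume that it is a sample from a GP with kernel $k(\cdot,\cdot)$. Systems of the form (12) cover a large range of applications including Lagrangian dynamics and many physical systems.

The task is to define a policy $\pi(\cdot)$ for which the output $x_1$ tracks the desired trajectory $x_d(t)$ such that the tracking error $\boldsymbol{e} = [e_1\ e_2]^T = \boldsymbol{x} - \boldsymbol{x}_d$ with $\boldsymbol{x}_d = [x_d\ \dot{x}_d]^T$ vanishes over time, i.e. $\lim_{t\to\infty} \|\boldsymbol{e}\| = 0$. For notational simplicity, we introduce the filtered state $r = \lambda e_1 + e_2$, $\lambda \in \mathbb{R}_+$.

A well-known method for tracking control of control affine systems is feedback linearization [Khalil2002], which aims for a model-based compensation of the nonlinearity $f(\cdot)$ using an estimate $\hat{f}(\cdot)$ and then applies linear control principles for the tracking. The policy then reads

$$u = \pi(\boldsymbol{x}) = -\hat{f}(\boldsymbol{x}) + \nu, \tag{13}$$

where the linear control law $\nu$ is the PD controller

$$\nu = \ddot{x}_d - k_c r - \lambda e_2, \tag{14}$$

with control gains $k_c, \lambda \in \mathbb{R}_+$. This results in the dynamics of the filtered state

$$\dot{r} = f(\boldsymbol{x}) - \hat{f}(\boldsymbol{x}) - k_c r. \tag{15}$$

Assuming training data of the real system $(\boldsymbol{x}_i, y_i)$, $i = 1,\ldots,N$, are available, we utilize the posterior mean function $\nu_N(\cdot)$ for the model estimate $\hat{f}(\cdot)$.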
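The closed loop formed by (12) to (15) can be simulated with a simple explicit Euler scheme. The sketch below is illustrative only: the dynamics `f_true`, the perturbed estimate `f_model` (standing in for a learned posterior mean), the sinusoidal reference, the initial state and the step size are all assumptions chosen for the example. If the model error is bounded by some constant, the filtered state dynamics (15) imply that $|r|$ eventually stays below that constant divided by $k_c$, up to discretization error.

```python
import numpy as np


def simulate_tracking(f, f_hat, k_c=2.0, lam=1.0, t_end=10.0, dt=1e-3):
    """Euler simulation of (12) under the policy (13)-(14); returns r over time."""
    x = np.array([1.0, 1.0])                      # hypothetical initial state
    n_steps = int(round(t_end / dt))
    r_hist = np.empty(n_steps)
    for n in range(n_steps):
        t = n * dt
        x_d, dx_d, ddx_d = np.sin(t), np.cos(t), -np.sin(t)   # assumed reference
        e = x - np.array([x_d, dx_d])             # tracking error
        r = lam * e[0] + e[1]                     # filtered state
        nu = ddx_d - k_c * r - lam * e[1]         # PD law (14)
        u = -f_hat(x) + nu                        # feedback linearization (13)
        x = x + dt * np.array([x[1], f(x) + u])   # Euler step of (12)
        r_hist[n] = r
    return r_hist


def f_true(x):
    """Hypothetical unknown dynamics."""
    return np.sin(x[0]) + 0.5 * x[1]


def f_model(x):
    """Hypothetical model whose error is bounded by 0.1."""
    return f_true(x) + 0.1 * np.cos(3.0 * x[0])


r_hist = simulate_tracking(f_true, f_model)
print(np.max(np.abs(r_hist[-1000:])))
```

With these values the model error is at most 0.1 and $k_c = 2$, so the filtered state is expected to settle within roughly 0.05 of zero.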

### 4.2 Stability Analysis

Due to safety constraints, e.g. robots working in close proximity to humans, it is usually necessary to verify that the model $\hat{f}(\cdot)$ is sufficiently precise and the parameters of the controller $k_c$, $\lambda$ are chosen properly. These safety certificates can be achieved if there exists an upper bound for the tracking error as defined in the following.

###### Definition 4.1 (Ultimate Boundedness).

The trajectory $\boldsymbol{x}(t)$ of a dynamical system is globally ultimately bounded if there exists a positive constant $b$ such that for every $a > 0$ there is a $T = T(a,b) \ge 0$ such that

$$\|\boldsymbol{x}(t_0)\| \le a \ \Rightarrow\ \|\boldsymbol{x}(t)\| \le b, \quad \forall t \ge t_0 + T.$$

Since the solutions $\boldsymbol{x}(t)$ cannot be computed analytically, a stability analysis is necessary, which is typically based on Lyapunov theory. It allows conclusions about the closed-loop behavior without running the policy on the real system [Khalil2002].

###### Lemma 4.1.

A dynamical system is globally ultimately bounded to a set $\mathbb{B} \subset \mathbb{X}$, containing the origin, if there exists a positive definite (so-called Lyapunov) function $V(\cdot)$ for which $\dot{V}(\boldsymbol{x}) < 0$ for all $\boldsymbol{x} \in \mathbb{X} \setminus \mathbb{B}$.

To check whether the controller (13) adheres to the safety requirements, the set $\mathbb{B}$ must be computed as shown in the following.

###### Theorem 4.1.

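Consider a control affine system (12), where $f(\cdot)$ admits a Lipschitz constant $L_f$ on $\mathbb{X}$. Assume that $f(\cdot)$ and the observations $y_i$, $i = 1,\ldots,N$, satisfy the conditions of Assumption 3.1. Then, the feedback linearizing controller (13) with $\hat{f}(\cdot) = \nu_N(\cdot)$ guarantees with probability $1-\delta$ that the tracking error $\boldsymbol{e}$ converges to

$$\mathbb{B} = \left\{ \boldsymbol{x} \in \mathbb{X} \,\middle|\, \|\boldsymbol{e}\| \le \frac{\sqrt{\beta(\tau)}\,\sigma_N(\boldsymbol{x}) + \gamma(\tau)}{k_c \sqrt{\lambda^2 + 1}} \right\}, \tag{16}$$

with $\beta(\tau)$ and $\gamma(\tau)$ defined in Theorem 3.1.

It can directly be seen that the ultimate bound can be made arbitrarily small by increasing the gains $k_c$, $\lambda$ or by using more training points to decrease $\sigma_N(\cdot)$.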
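The right-hand side of (16) is cheap to evaluate once $\beta(\tau)$, $\gamma(\tau)$ and the posterior standard deviation are available. The helper below is a hedged sketch: replacing $\sigma_N(\boldsymbol{x})$ by a single value `sigma` (for instance an upper bound over the region of interest) is a simplifying assumption that yields a conservative radius, and the function name and numbers are hypothetical.

```python
import math


def ultimate_bound_radius(beta, sigma, gamma, k_c, lam):
    """Right-hand side of (16) for one value of the posterior standard deviation."""
    return (math.sqrt(beta) * sigma + gamma) / (k_c * math.sqrt(lam**2 + 1.0))


# Hypothetical values: beta and gamma as in Theorem 3.1, sigma an upper bound on sigma_N.
radius = ultimate_bound_radius(beta=25.0, sigma=0.1, gamma=0.01, k_c=2.0, lam=1.0)
radius_high_gain = ultimate_bound_radius(beta=25.0, sigma=0.1, gamma=0.01, k_c=4.0, lam=1.0)
print(radius, radius_high_gain)
```

Doubling `k_c` halves the radius, which reflects the inverse dependence of the bound on the gain in (16).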

## 5 Numerical Evaluation

We evaluate our theoretical results in two simulations. In Section 5.1 we investigate the effect of applying Theorem 3.2 to determine a probabilistic Lipschitz constant for an unknown synthetic system. Furthermore, we analyze the effect of unevenly distributed training samples on the tracking error bound from Theorem 4.1. In Section 5.2 we apply the feedback linearizing controller (13) to a tracking problem with a robotic manipulator.

### 5.1 Synthetic System with Unknown Lipschitz Constant $L_f$

As an example for a system of form (12), we consider . Based on a uniform grid over  the training set is formed of  points with . The reference trajectory is a circle  and the controller gains are  and . We choose a probability of failure  and set . The state space is the rectangle . A squared exponential kernel with automatic relevance determination is utilized, for which  and  is derived analytically for the optimized hyperparameters. We make use of Theorem 3.2 to estimate the Lipschitz constant $L_f$, and it turns out to be a conservative bound (factor ). However, this is not crucial, because $\tau$ can be chosen arbitrarily small and $\gamma(\tau)$ is dominated by $\sqrt{\beta(\tau)}\,\sigma_N(\boldsymbol{x})$. As Theorems 3.1 and 3.2 are subsequently utilized in this example, a union bound approximation can be applied to combine $\delta$ and $\delta_L$.

The results are shown in Figs. 1 and 2. Both plots show that the safety bound here is rather conservative, which also results from the fact that the violation probability was set to , among others. The simulations

### 5.2 Robotic Manipulator with 2 Degrees of Freedom

Furthermore, we consider a planar robotic manipulator with 2 degrees of freedom (DoFs), with unit length and unit masses / inertia for all links. For this example, we consider  to be known and extend Theorem 3.1 to the multidimensional case using the union bound. The state space is here four dimensional  and we consider . The training points are distributed in  and the control gain is , while other constants remain the same as in Section 5.1. The desired trajectories for both joints are again sinusoidal as shown in Fig. 3 on the right side. The robot dynamics are derived according to [murray1994mathematical, Chapter 4].

Theorem 3.1 allows us to derive an error bound in the joint space of the robot according to Theorem 4.1, which can be transformed into the task space as shown in Fig. 3 on the left. Thus, based on the learned (initially unknown) dynamics, it can be guaranteed that the robot will not leave the depicted area and can thereby be considered safe.

Previous error bounds for GPs are not applicable to this practical setting, because they i) do not allow the observation noise on the training data to be Gaussian [Srinivas2012], which is a common assumption in control, ii) utilize constants which cannot be computed efficiently (e.g. the maximal information gain in [srinivas2010gaussian]), or iii) make assumptions that are difficult to verify in practice (e.g. the RKHS norm of the unknown dynamical system [Berkenkamp2016a]).

## 6 Conclusion

This paper presents a novel uniform error bound for Gaussian process regression. By exploiting the inherent probability distribution of Gaussian processes instead of the RKHS attached to the covariance kernel, a wider class of functions can be considered. Furthermore, we demonstrate how probabilistic Lipschitz constants can be estimated from the GP distribution and derive sufficient conditions to reach arbitrarily small uniform error bounds. We employ the derived results to show safety bounds for a tracking control algorithm and evaluate them in simulations of a robotic manipulator.

## References

• (1) P. M. Nørgård, O. Ravn, N. K. Poulsen, and L. K. Hansen, Neural Networks for Modelling and Control of Dynamic Systems - A Practitioner’s Handbook.   London: Springer, 2000.
• (2) M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends® in Robotics, vol. 2, no. 1–2, pp. 1–142, Aug. 2013. [Online]. Available: http://dx.doi.org/10.1561/2300000021
• (3) B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, F. Mujica, A. Coates, and A. Y. Ng, “An empirical evaluation of deep learning on highway driving,” 2015.

• (4) T. Beckers, D. Kulić, and S. Hirche, “Stable Gaussian process based tracking control of Euler-Lagrange systems,” Automatica, vol. 23, no. 103, pp. 390–397, 2019.
• (5) F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,” in 2016 IEEE 55th Conference on Decision and Control, CDC 2016, 2016, pp. 4661–4666.
• (6) J. Umlauft, L. Pöhler, and S. Hirche, “An Uncertainty-Based Control Lyapunov Approach for Control-Affine Systems Modeled by Gaussian Process,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 483–488, 2018.
• (7) N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-theoretic regret bounds for Gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012.
• (8) S. R. Chowdhury and A. Gopalan, “On Kernelized Multi-armed Bandits,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 844–853.
• (9) H. K. Khalil, Nonlinear Systems; 3rd ed.   Upper Saddle River, NJ: Prentice-Hall, 2002.
• (10) C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning.   The MIT Press, 2006.
• (11) H. Wendland, Scattered Data Approximation.   Cambridge University Press, 2004.
• (12) Z. M. Wu and R. Schaback, “Local error estimates for radial basis function interpolation of scattered data,” IMA Journal of Numerical Analysis, vol. 13, no. 1, pp. 13–27, 1993.
• (13) R. Schaback, “Improved error bounds for scattered data interpolation by radial basis functions,” Mathematics of Computation, vol. 68, no. 225, pp. 201–217, 2002.
• (14) M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur, “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences,” arXiv preprint arXiv:1807.02582, pp. 1–64, 2018. [Online]. Available: http://arxiv.org/abs/1807.02582
• (15) S. Hubbert and T. M. Morton, “Lp-error estimates for radial basis function interpolation on the sphere,” Journal of Approximation Theory, vol. 129, no. 1, pp. 58–77, 2004.
• (16) F. J. Narcowich, J. D. Ward, and H. Wendland, “Sobolev Error Estimates and a Bernstein Inequality for Scattered Data Interpolation via Radial Basis Functions,” Constructive Approximation, vol. 24, no. 2, pp. 175–186, 2006.
• (17) R. Beatson, O. Davydov, and J. Levesley, “Error bounds for anisotropic RBF interpolation,” Journal of Approximation Theory, vol. 162, no. 3, pp. 512–527, 2010.
• (18) A. M. Stuart and A. L. Teckentrup, “Posterior Consistency for Gaussian Process Approximations of Bayesian Posterior Distributions,” Mathematics of Computation, vol. 87, no. 310, pp. 721–753, 2018. [Online]. Available: http://arxiv.org/abs/1603.02004
• (19) S. Mendelson, “Improving the sample complexity using global data,” IEEE Transactions on Information Theory, vol. 48, no. 7, pp. 1977–1991, 2002.
• (20) T. Zhang, “Learning bounds for kernel regression using effective data dimensionality,” Neural Computation, vol. 17, no. 9, pp. 2077–2098, 2005.
• (21) C. Cortes, M. Mohri, and A. Talwalkar, “On the Impact of Kernel Approximation on Learning Accuracy,” Proceedings of 13th International Conference on Artificial Intelligence and Statistics, vol. 9, pp. 113–120, 2010.
• (22) L. Shi, “Learning theory estimates for coefficient-based regularized regression,” Applied and Computational Harmonic Analysis, vol. 34, no. 2, pp. 252–265, 2013. [Online]. Available: http://dx.doi.org/10.1016/j.acha.2012.05.001
• (23) L. H. Dicker, D. P. Foster, and D. Hsu, “Kernel ridge vs. Principal component regression: Minimax bounds and the qualification of regularization operators,” Electronic Journal of Statistics, vol. 11, no. 1, pp. 1022–1047, 2017.
• (24) A. van der Vaart and H. van Zanten, “Information Rates of Nonparametric Gaussian Process Methods,” Journal of Machine Learning Research, vol. 12, pp. 2095–2119, 2011. [Online]. Available: http://jmlr.csail.mit.edu/papers/volume12/vandervaart11a/vandervaart11a.pdf
• (25) R. Adler and J. Taylor, Random Fields and Geometry. Springer Science & Business Media, 2007.
• (26) W. Wang, R. Tuo, and C. F. J. Wu, “On Prediction Properties of Kriging: Uniform Error Bounds and Robustness,” Journal of the American Statistical Association, pp. 1–38, 2019. [Online]. Available: http://arxiv.org/abs/1710.06959
• (27) J. Mercer, “Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, no. 441-458, pp. 415–446, 1909.
• (28) I. Steinwart, “On the Influence of the Kernel on the Consistency of Support Vector Machines,” Journal of Machine Learning Research, vol. 2, pp. 67–93, 2001.
• (29) S. Ghosal and A. Roy, “Posterior consistency of Gaussian process prior for nonparametric binary regression,” The Annals of Statistics, vol. 34, no. 5, pp. 2413–2429, 2006.
• (30) R. M. Dudley, “The sizes of compact subsets of Hilbert space and continuity of Gaussian processes,” Journal of Functional Analysis, vol. 1, no. 3, pp. 290–330, 1967.
• (31) R. Murray, Z. Li, and S. Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.
• (32) N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” in International Conference on Machine Learning (ICML), J. Fürnkranz and T. Joachims, Eds. Haifa, Israel: Omnipress, Jun. 2010, pp. 1015–1022. [Online]. Available: http://www.icml2010.org/papers/422.pdf
• (33) S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor, “Regret Bounds for Gaussian Process Bandit Problems,” Journal of Machine Learning Research, vol. 9, pp. 273–280, 2010.
• (34) M. Talagrand, “Sharper Bounds for Gaussian and Empirical Processes,” The Annals of Probability, vol. 22, no. 1, pp. 28–76, 1994.
• (35) B. Laurent and P. Massart, “Adaptive Estimation of a Quadratic Functional by Model Selection,” The Annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000.

## Appendix A Proof of Theorem 3.1

###### Proof of Theorem 3.1.

We first derive the Lipschitz constant of the posterior mean $\nu_N(\cdot)$ and the modulus of continuity of the posterior standard deviation $\sigma_N(\cdot)$, before deriving the bound on the regression error. The norm of the difference between the posterior mean evaluated at two different points $x, x' \in \mathbb{X}$ is given by

$$\|\nu_N(x)-\nu_N(x')\| = \left\|\left(k(x,X_N)-k(x',X_N)\right)\alpha\right\|$$

with

$$\alpha=\left(K(X_N,X_N)+\sigma_n^2 I_N\right)^{-1}y_N. \tag{17}$$

Due to the Cauchy-Schwarz inequality and the Lipschitz continuity of the kernel we obtain

$$\|\nu_N(x)-\nu_N(x')\| \le L_k\sqrt{N}\,\|\alpha\|\,\|x-x'\|,$$

which proves Lipschitz continuity of the mean $\nu_N(\cdot)$. In order to calculate a modulus of continuity for the posterior standard deviation $\sigma_N(\cdot)$, observe that the difference of the variance at two points can be expressed as

$$|\sigma_N^2(x)-\sigma_N^2(x')| = |\sigma_N(x)-\sigma_N(x')|\,|\sigma_N(x)+\sigma_N(x')|. \tag{18}$$

Since the standard deviation is non-negative, we have

$$|\sigma_N(x)+\sigma_N(x')| \ge |\sigma_N(x)-\sigma_N(x')| \tag{19}$$

and hence, we obtain

$$|\sigma_N^2(x)-\sigma_N^2(x')| \ge |\sigma_N(x)-\sigma_N(x')|^2. \tag{20}$$

Therefore, it is sufficient to bound the difference of the variance at two points and take the square root of the resulting expression. Due to the Cauchy-Schwarz inequality the absolute value of the difference of the variance can be bounded by

$$|\sigma_N^2(x)-\sigma_N^2(x')| \le \left\|k(x,X_N)-k(x',X_N)\right\| \left\|\left(K(X_N,X_N)+\sigma_n^2 I_N\right)^{-1}\right\| \left\|k(X_N,x)+k(X_N,x')\right\|. \tag{21}$$

On the one hand, we have

$$\|k(x,X_N)-k(x',X_N)\| \le \sqrt{N}L_k\|x-x'\| \tag{22}$$

due to Lipschitz continuity of $k(\cdot,\cdot)$. On the other hand, we have

$$\|k(x,X_N)+k(x',X_N)\| \le 2\sqrt{N}\max_{x,x'\in\mathbb{X}}k(x,x'). \tag{23}$$

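The modulus of continuity follows from substituting (22) and (23) in (21) and taking the square root of the resulting expression. Finally, we prove the probabilistic uniform error bound by exploiting the fact that for every grid $\mathbb{X}_\tau$ with $|\mathbb{X}_\tau|$ grid points and

$$\max_{x\in\mathbb{X}}\min_{x'\in\mathbb{X}_\tau}\|x-x'\| \le \tau \tag{24}$$

it holds with probability of at least $1-|\mathbb{X}_\tau|\mathrm{e}^{-\beta(\tau)/2}$ that (Srinivas et al. [32])

$$|f(x)-\nu_N(x)| \le \sqrt{\beta(\tau)}\,\sigma_N(x) \quad \forall x\in\mathbb{X}_\tau. \tag{25}$$

Choose $\beta(\tau)=2\log\left(|\mathbb{X}_\tau|/\delta\right)$, then

$$|f(x)-\nu_N(x)| \le \sqrt{\beta(\tau)}\,\sigma_N(x) \quad \forall x\in\mathbb{X}_\tau \tag{26}$$

holds with probability of at least $1-\delta$. Due to continuity of $f(\cdot)$, $\nu_N(\cdot)$ and $\sigma_N(\cdot)$ we obtain

$$\min_{x'\in\mathbb{X}_\tau}|f(x)-f(x')| \le \tau L_f \quad \forall x\in\mathbb{X} \tag{27}$$

$$\min_{x'\in\mathbb{X}_\tau}|\nu_N(x)-\nu_N(x')| \le \tau L_{\nu_N} \quad \forall x\in\mathbb{X} \tag{28}$$

$$\min_{x'\in\mathbb{X}_\tau}|\sigma_N(x)-\sigma_N(x')| \le \omega_{\sigma_N}(\tau) \quad \forall x\in\mathbb{X}. \tag{29}$$

Moreover, the minimum number of grid points satisfying (24) is given by the covering number $M(\tau,\mathbb{X})$. For an arbitrary $x\in\mathbb{X}$ and its closest grid point $x'\in\mathbb{X}_\tau$, the triangle inequality together with (26) to (29) yields $|f(x)-\nu_N(x)| \le L_f\tau + \sqrt{\beta(\tau)}\left(\sigma_N(x)+\omega_{\sigma_N}(\tau)\right) + L_{\nu_N}\tau$. Hence, we obtain

$$P\left(|f(x)-\nu_N(x)| \le \sqrt{\beta(\tau)}\,\sigma_N(x)+\gamma(\tau),\ \forall x\in\mathbb{X}\right) \ge 1-\delta, \tag{30}$$

where

$$\beta(\tau) = 2\log\left(\frac{M(\tau,\mathbb{X})}{\delta}\right) \tag{31}$$

$$\gamma(\tau) = (L_f+L_{\nu_N})\tau+\sqrt{\beta(\tau)}\,\omega_{\sigma_N}(\tau). \tag{32}$$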
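Every quantity appearing in this proof can be computed from the training data. The following is a minimal numerical sketch, not the authors' implementation. It assumes a one-dimensional input interval $[0, r]$, a squared-exponential kernel with hand-picked hyperparameters, $\sin$ as a stand-in for the unknown function, and the covering number $\lceil r/(2\tau)\rceil$ of an interval; all variable names are illustrative. For this kernel the prior variance is constant, so substituting (22) and (23) into (21) gives $\omega_{\sigma_N}(\tau)=\sqrt{2NL_k\tau\,\|(K(X_N,X_N)+\sigma_n^2I_N)^{-1}\|\max k}$. The grid bound (26) is then transferred to every test point via (27) to (29).

```python
import numpy as np

def se_kernel(a, b, sf=1.0, ell=0.5):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (diff / ell) ** 2)

rng = np.random.default_rng(0)
sf, ell, sn = 1.0, 0.5, 0.1   # assumed signal std, length scale, noise std
r = 4.0                       # input set is the interval [0, r] (assumption)
f, L_f = np.sin, 1.0          # stand-in unknown function and its Lipschitz constant
N = 30
X = rng.uniform(0.0, r, N)
y = f(X) + sn * rng.standard_normal(N)

A = np.linalg.inv(se_kernel(X, X, sf, ell) + sn**2 * np.eye(N))
alpha = A @ y                                       # equation (17)

L_k = sf**2 * np.exp(-0.5) / ell                    # max slope of the SE kernel in one argument
k_max = sf**2                                       # max of k over the input set
L_nu = L_k * np.sqrt(N) * np.linalg.norm(alpha)     # Lipschitz bound of the posterior mean

def omega_sigma(tau):
    """Modulus of continuity of the posterior std from (21)-(23)."""
    return np.sqrt(2.0 * N * L_k * tau * np.linalg.norm(A, 2) * k_max)

def posterior(xs):
    """Posterior mean and standard deviation at test inputs xs."""
    Ks = se_kernel(xs, X, sf, ell)
    mean = Ks @ alpha
    var = sf**2 - np.einsum("ij,jk,ik->i", Ks, A, Ks)
    return mean, np.sqrt(np.maximum(var, 0.0))

delta, tau = 0.01, 1e-6
M = int(np.ceil(r / (2.0 * tau)))    # covering number of [0, r] with radius tau
beta = 2.0 * np.log(M / delta)       # equation (31)

xs = np.linspace(0.0, r, 400)
mean, std = posterior(xs)
# Grid bound (26) moved to arbitrary x with (27)-(29) and the triangle inequality.
bound = np.sqrt(beta) * (std + omega_sigma(tau)) + (L_f + L_nu) * tau
print("largest error:", np.abs(f(xs) - mean).max(), "smallest bound:", bound.min())
```

Shrinking `tau` reduces the additive terms but increases `beta` only logarithmically, which is the trade-off the choice of the grid constant controls.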

## Appendix B Proof of Theorem 3.2

In order to prove Theorem 3.2, several auxiliary results are necessary, which are derived in the following. The first lemma concerns the expected supremum of a Gaussian process.

###### Lemma B.1.

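Consider a Gaussian process with a continuously differentiable covariance function $k(\cdot,\cdot)$ and let $L_k$ denote its Lipschitz constant on the set $\mathbb{X}$ with maximum extension $r$. Then, the expected supremum of a sample function $f(\cdot)$ of this Gaussian process satisfies

$$\mathbb{E}\left[\sup_{x\in\mathbb{X}}f(x)\right] \le 12\sqrt{6d}\,\max\left\{\max_{x\in\mathbb{X}}\sqrt{k(x,x)},\ \sqrt{rL_k}\right\}. \tag{33}$$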
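As a rough sanity check of (33), the expected supremum can be estimated by Monte Carlo sampling. The sketch below is illustrative only: it assumes a squared-exponential kernel on the interval $[0, r]$ with $d = 1$ and hand-picked hyperparameters, and it replaces the supremum by a maximum over a finite grid, which can only underestimate the true supremum.

```python
import numpy as np

rng = np.random.default_rng(1)
sf, ell, r, d = 1.0, 0.5, 4.0, 1   # assumed SE hyperparameters, interval [0, r], dimension

x = np.linspace(0.0, r, 200)
K = sf**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2)

# Draw zero-mean GP samples through an eigendecomposition of K; clipping
# guards against tiny negative eigenvalues caused by round-off.
w, V = np.linalg.eigh(K)
root = V * np.sqrt(np.clip(w, 0.0, None))
samples = root @ rng.standard_normal((x.size, 2000))   # one sample path per column

mc_sup = samples.max(axis=0).mean()            # estimate of E[max over the grid]
L_k = sf**2 * np.exp(-0.5) / ell               # Lipschitz constant of the SE kernel
lemma_bound = 12.0 * np.sqrt(6.0 * d) * max(sf, np.sqrt(r * L_k))
print("Monte Carlo estimate:", mc_sup, "bound (33):", lemma_bound)
```

In this setting the bound is expected to be far from tight, since it is built from a worst-case metric entropy argument and not from the specific kernel.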
###### Proof.

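We prove this lemma by making use of the metric entropy criterion for the sample continuity of some version of a Gaussian process (Dudley [30]). This criterion allows bounding the expected supremum of a sample function $f(\cdot)$ by

$$\mathbb{E}\left[\sup_{x\in\mathbb{X}}f(x)\right] \le \int_0^{\max_{x\in\mathbb{X}}\sqrt{k(x,x)}}\sqrt{\log\left(N(\varrho,\mathbb{X})\right)}\,\mathrm{d}\varrho, \tag{34}$$

where $N(\varrho,\mathbb{X})$ is the $\varrho$-packing number of $\mathbb{X}$ with respect to the covariance pseudo-metric

$$d_k(x,x')=\sqrt{k(x,x)+k(x',x')-2k(x,x')}. \tag{35}$$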
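The pseudo-metric (35) is straightforward to evaluate for a concrete kernel. The sketch below assumes a squared-exponential kernel with illustrative hyperparameters; the function names are hypothetical. It also compares $d_k(x,x')$ with $\sqrt{2L_k\|x-x'\|}$, a relation that follows from Lipschitz continuity of the kernel in each argument and that links distances in the pseudo-metric to Euclidean distances.

```python
import numpy as np

def se_kernel(x, y, sf=1.0, ell=0.5):
    """Squared-exponential kernel for two points x, y of equal dimension."""
    return sf**2 * np.exp(-0.5 * np.sum((x - y) ** 2) / ell**2)

def d_k(x, y, k=se_kernel):
    """Covariance pseudo-metric (35); clipped at zero against round-off."""
    return np.sqrt(max(k(x, x) + k(y, y) - 2.0 * k(x, y), 0.0))

rng = np.random.default_rng(2)
L_k = np.exp(-0.5) / 0.5                          # Lipschitz constant for sf=1, ell=0.5
pairs = rng.uniform(0.0, 4.0, size=(1000, 2, 3))  # random point pairs in [0, 4]^3

# Ratio of the pseudo-metric to its Euclidean-distance bound for each pair.
ratios = [d_k(p, q) / np.sqrt(2.0 * L_k * np.linalg.norm(p - q)) for p, q in pairs]
print("largest ratio:", max(ratios))
```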

Instead of bounding the $\varrho$-packing number, we bound the $\varrho/2$-covering number, which is known to be an upper bound. The covering number can be easily bounded by transforming the problem of covering $\mathbb{X}$ with respect to the pseudo-metric $d_k(\cdot,\cdot)$ into a coverage problem in the original metric of $\mathbb{X}$. For this reason, define