DeepAI

# Robustness Guarantees for Bayesian Inference with Gaussian Processes

Bayesian inference and Gaussian processes are widely used in applications ranging from robotics and control to biological systems. Many of these applications are safety-critical and require a characterization of the uncertainty associated with the learning model and formal guarantees on its predictions. In this paper we define a robustness measure for Bayesian inference against input perturbations, given by the probability that, for a test point and a compact set in the input space containing the test point, the prediction of the learning model will remain δ-close for all the points in the set, for δ>0. Such measures can be used to provide formal guarantees for the absence of adversarial examples. By employing the theory of Gaussian processes, we derive tight upper bounds on the resulting robustness by utilising the Borell-TIS inequality, and propose algorithms for their computation. We evaluate our techniques on two examples, a GP regression problem and a fully-connected deep neural network, where we rely on weak convergence to GPs to study adversarial examples on the MNIST dataset.

• 12 publications
• 45 publications
• 20 publications
• 11 publications
04/07/2021

### Adversarial Robustness Guarantees for Gaussian Processes

Gaussian processes (GPs) enable principled computation of model uncertai...
03/05/2019

### Statistical Guarantees for the Robustness of Bayesian Neural Networks

We introduce a probabilistic robustness measure for Bayesian Neural Netw...
04/01/2022

### Autoencoder Attractors for Uncertainty Estimation

The reliability assessment of a machine learning model's prediction is a...
11/17/2017

### How Wrong Am I? - Studying Adversarial Examples and their Impact on Uncertainty in Gaussian Process Machine Learning Models

Machine learning models are vulnerable to adversarial examples: minor, i...
05/28/2019

### Robustness Quantification for Classification with Gaussian Processes

We consider Bayesian classification with Gaussian processes (GPs) and de...
02/10/2021

### Bayesian Inference with Certifiable Adversarial Robustness

We consider adversarial training of deep neural networks through the len...
11/29/2019

### Safety Guarantees for Planning Based on Iterative Gaussian Processes

Gaussian Processes (GPs) are widely employed in control and learning bec...

## Introduction

The widespread deployment of machine learning models, coupled with the discovery of their fragility against carefully crafted manipulation of training and/or test samples

[Biggio and Roli2017, Grosse et al.2017a, Szegedy et al.2013], calls for safe approaches to AI to enable their use in safety-critical applications, as argued, e.g., in [Seshia, Sadigh, and Sastry2016, Dreossi, Donzé, and Seshia2017]. Bayesian techniques, in particular, provide a principled way of combining a-priori information into the training process, so as to obtain an a-posteriori distribution on test data, which also takes into account the uncertainty in the learning process. Recent advances in Bayesian learning include adversarial attacks [Grosse et al.2017b]

and methods to compute pointwise uncertainty estimates in Bayesian deep learning

[Gal and Ghahramani2016]. However, much of the work on formal guarantees for machine learning models has focused on non-Bayesian models, such as deep neural networks (NNs) [Huang et al.2017, Hein and Andriushchenko2017] and, to the best of our knowledge, there is no work directed at providing formal guarantees for the absence of adversarial local input perturbations in Bayesian prediction settings.

Gaussian processes (GPs) are a class of stochastic processes that are, due to their many favourable properties, widely employed for Bayesian learning [Rasmussen2004], with applications spanning robotics, control systems and biological processes [Sadigh and Kapoor2015, Laurenti et al.2017, Bortolussi et al.2018]

. Further, driven by pioneering work that first recognized the convergence of fully-connected NNs to GPs in the limit of infinitely many neurons

[Neal2012], GPs have been used recently as a model to characterize the behaviour of NNs in terms of convergence analysis [Matthews et al.2018], approximated Bayesian inference [Lee et al.2017] and training algorithms [Chouza, Roberts, and Zohren2018].

In this paper we compute formal local robustness guarantees for Bayesian inference with GP priors. The resulting guarantees are probabilistic, as they take into account the uncertainty intrinsic in the Bayesian learning process and explicitly work with the a-posteriori output distribution of the GP. More specifically, given a GP model trained on a given data set, a test input point and a neighborhood around the latter, we are interested in computing the probability that there exists a point in the neighbourhood such that the prediction of the GP on the latter differs from the initial test input point by at least a given threshold. This implicitly gives guarantees on the absence of adversarial examples, that is, input samples that trick a machine learning model into performing wrong predictions.

Unfortunately, computing such a probability is far from trivial. In fact, given a compact set and the above measure reduces to computing the probability that there exists a function sampled from the GP such that there exists for which , where and is a metric norm. Since the set is composed of an infinite number of points, computing such a measure for general stochastic processes is extremely challenging. However, for GPs we can obtain tight upper bounds on the above probability by making use of inequalities developed in the theory of GPs, such as the Borell-TIS inequality [Adler and Taylor2009] and the Dudley’s Entropy Integral [Dudley1967]

. To do this, we need to obtain lower and upper bounds on the extrema of the a-posteriori GP mean and variance functions on neighborhoods of a given test point. We obtain these bounds by constructing lower and upper approximations for the GP kernel as a function of the test point, which are then propagated through the GP inference formulas. Then, safe approximations for these values are obtained by posing a series of optimization problems that can be solved either analytically or by standard convex optimization techniques. We illustrate the above framework with explicit algorithmic techniques for GPs built with squared-exponential and ReLU kernel.

Finally, we apply the methods presented here to characterize the robustness of GPs with ReLU kernel trained on a subset of images included in the MNIST dataset. Relying on the weak convergence between fully-connected NNs with ReLU activation functions and the corresponding GPs with ReLU kernel, we analyze the behaviour of such networks on adversarial images in a Bayesian setting. We use SIFT

[Lowe2004] to focus on important patches of the image, and perform feature-level safety analyses of test points included in the dataset. We apply the proposed methods to evaluate the resilience of features against (generic) adversarial perturbations bounded in norm, and discuss how this is affected by stronger perturbations and different misclassification thresholds. We perform a parametric optimization analysis of maximum prediction variance around specific test points in an effort to characterize active defenses against adversarial examples that rely on variance thresholding. In the examples we studied, we have consistently observed that, while an increased number of training samples may significantly help detect adversarial examples by means of prediction uncertainty, the process may be undermined by more complex architectures.

In summary, the paper makes the following main contributions:

• We provide tight upper bounds on the probability that the prediction of a Gaussian process remains close to a given test point in a neighbourhood, which can be used to quantify local robustness against adversarial examples.

• We develop algorithmic methods for the computation of extrema of GP mean and variance over a compact set.

• Relying on convergence between fully-connected NNs and GPs, we apply the developed methods to provide feature-level analysis of the behaviour of the former on the MNIST dataset.

#### Why probabilistic local robustness guarantees?

Our results provide formal local robustness guarantees in the sense that the resulting bounds are sound with respect to a neighbourhood of an input, and numerical methods have not been used. This enables certification for Bayesian methods that is necessary in safety-critical applications, and is in contrast with many existing pointwise approaches to detect adversarial examples in Bayesian models, which are generally based on heuristics, such as to reject test points with high uncertainty

[Li and Gal2017, Feinman et al.2017]. We illustrate the intuition with the following simple example.

###### Example 1.

Let be a zero-mean stochastic process with values in . Consider the following widely used definition of safety for a set

 Psafe(z,T,δ)=Prob(∀x∈T,z(x)<δ), (1)

where is a given threshold. Assume that, for all , and

are independently and equally distributed random variables such that for each

we have Then, if we compute the above property we obtain

 Psafe(z,T,δ)=0.8510≈0.197.

Thus, even though at each point has relatively high probability of being safe, is still small. This is because safety, as defined in Eqn 1, depends on a set of points, and this must be accounted for to give robustness guarantees for a given stochastic model. Note that, to simplify, we used a discrtete set , but the same reasoning remains valid even if , as in this paper.

### Related Work

Existing formal approaches for machine learning models mostly focus on computing non-probabilistic local safety guarantees [Raghunathan, Steinhardt, and Liang2018, Huang et al.2017, Ruan, Huang, and Kwiatkowska2018] and generally neglect the uncertainty intrinsic in the learning process, which is intrinsic in a Bayesian model. Recently, empirical methods to detect adversarial examples for Bayesian NNs that utilise pointwise uncertainty have been introduced [Li and Gal2017, Feinman et al.2017]. However, these approaches can be fooled by attacks that generate adversarial examples with small uncertainty as shown in [Carlini and Wagner2017]. Unfortunately, obtaining formal guarantees for Bayesian NNs is challenging since their posterior distribution, which can be obtained in closed form for GPs, is generally analytically intractable [Gal and Ghahramani2016]. In [Grosse et al.2017b] attacks for Bayesian inference with Gaussian processes based on local perturbations of the mean have been presented.

Notions of safety for Gaussian processes have been recently studied in the context of system design for stochastic models (see, e.g. [Wachi et al.2018, Bartocci et al.2015, Sadigh and Kapoor2015, Sui et al.2015]). In [Sadigh and Kapoor2015], the authors synthesize safe controllers against Probabilistic Signal Temporal Logic (PrSTL) specifications, which suffer from the issue illustrated in Example 1. Another related approach is that in [Sui et al.2015], where the authors build on [Srinivas et al.2012]

and introduce SAFEOPT, a Bayesian optimization algorithm that additionally guarantees that, for the optimized parameters, with high probability the resulting objective function (sampled from a GP) is greater than a threshold. However, they do not give guarantees against perturbation of the synthesized parameters, i.e., will the resulting behaviour still be safe and close to the optimal if parameters close to the optimal, but for example corrupted by noise, are applied? Our approach allows one to quantify such a probability. We should also stress that, while it is often the case that the guarantees provided by existing algorithms are statistical (i.e., given in terms of confidence intervals), the bounds presented in this paper are probabilistic.

## Problem Formulation

We consider a Gaussian process with values in and with a Gaussian probability measure such that, for any ,

is a multivariate normal distribution

222In this paper we assume is a separable stochastic process. This is a standard and common assumption [Adler and Taylor2009]. The separability of guarantees that Problem 1 and 2 are measurable.. We consider Bayesian inference for . That is, as illustrated in detail in the next section, given a dataset we consider the process

 z(x)|D,x∈Rm,

which represents the conditional distribution of given the set of observations in . The first problem we examine is Problem 1, where we want to compute the probability that local perturbations of a given test point result in predictions that remain close to the original.

###### Problem 1.

(Probabilistic Safety). Consider the training dataset . Let and fix For call

 ϕi1(x∗, T,δ|D)= P(∃x′∈Ts.t.(z(i)(x∗)−z(i)(x′))>δ|D),

where is the i-th component of Then we say that component in is safe with probability for with respect to set and perturbation iff

 ϕi1(x∗ ,T,δ|D)≤ϵ. (2)

Intuitively, we consider a test point and a compact set containing , and compute the probability that the predictions of remain close for each . We consider the components of the GP individually and with sign, enabling one-sided analysis. Note that is composed of an uncountable number of points, making the probability computation challenging. Problem 1 can be generalized to local invariance of with respect to a given metric (Problem 2 below). In the next section, for the corresponding solution, we will work with the norm, but all the results can be easily extended to any norm, including .

###### Problem 2.

(Probabilistic Invariance) Consider the training dataset . Let and assume For metric and call

 ϕ2(x∗,

Then we say that is invariant with respect to metric for in and perturbation with probability iff

 ϕ2(x∗, T,δ|D)≤ϵ. (3)

Probabilistic invariance, as defined in Problem 2, bounds the probability that each function sampled from remains within a distance of at most to the initial point. Note that both Problem 1 and 2 quantify the probability of how the output of a learning process changes its value in a set around a given test input point, which implicitly gives probabilistic guarantees for local robustness against adversarial examples. In the next section, in Theorem 1 and 2, we give analytic upper bounds for Problem 1 and 2. In fact, analytic distributions of the supremum of a GP, which would allow one to solve the above problems, are known only for a very limited class of GPs (and always for GPs evolving over time) [Adler and Taylor2009], making exact computation impossible. However, first, we illustrate the intuition behind the problems studied here on a GP regression problem.

###### Example 2.

We consider a regression problem taken from [Bach2009], where we generate 128 samples from a random two-dimensional covariance matrix, and define labels as a (noisy) quadratic polynomial of the two input variables. We train a GP with squared-exponential kernel on this dataset, using a maximum likelihood estimation of the kernel hyper-parameters [Rasmussen2004]. The mean and variance of the GP obtained after training are plotted in Figure 3, along with the samples used for training.

Consider the origin point , let and define . As is a saddle point for the mean function, variations of the mean around it are relatively small. Analogously, the variance function exhibits a flat behaviour around , meaning greater confidence of the GP in performing predictions around . As such we expect realizations of the GP to be consistently stable in a neighbourhood of , which in turn translates to low values for and where in and , to simplify the notation, we omit the dataset used for training. On the other hand, around the a-posteriori mean changes quickly and the variance is high, reflecting higher uncertainty. Hence, letting , we expect the values of and to be greater than those computed for .

In the next section we show how and can be computed to quantify the uncertainty and variability of the predictions around and

## Theoretical Results

Since is a Gaussian process, its distribution is completely defined by its mean and covariance (or kernel) function Consider a set of training data and call Training in a Bayesian framework is equivalent to computing the distribution of given the dataset , that is, the distribution of the process

 ¯z=z|D.

Given a test point and training inputs in

, consider the joint distribution

, which is still Gaussian with mean and covariance matrix given by

where

is the covariance matrix relative to vector

Then, it follows that is still Gaussian with mean and covariance matrix defined as follows:

 ¯μ(x∗)=μ(x∗)+Σx∗,DΣ−1D,D(y−μD) (4) ¯Σx∗,x∗=Σx∗,x∗−Σx∗,DΣ−1D,DΣTx∗,D, (5)

where Hence, for GPs the distribution of can be computed exactly.

Given two test points and , the above calculations can still be applied to compute the joint distribution

 ¯z(x∗)=([z(x∗1),z(x∗2)]|D).

In particular, is still Gaussian and with mean and covariance matrix given by Eqns (4) and (5) but with and From we can obtain the distribution of the following random variable

 zo(x∗1,x∗2)=(z(x∗1)−z(x∗2))|D,

which represents the difference of at two distinct test points after training. It is straightforward to show that, given such that where

is the identity matrix of dimension

is Gaussian with mean and variance

 μo(x∗1,x∗2)=B¯μ(x∗)Σox∗1,x∗2=B¯Σx∗,x∗BT.

is the distribution of how after training, changes with respect to two different test points. However, to solve Problem 1 and 2, we need to take into account all the test points in and compute the probability that in at least one of them exits from a given set of the output space. This is done in Theorem 1 by making use of the Borell-TIS inequality and of the Dudley’s entropy integral [Adler and Taylor2009, Dudley1967]. The above inequalities allow one to study Gaussian processes by appropriately defining a metric on the variance of the GPs. In order to define such a metric we call the GP with the same covariance matrix as but with zero mean and its -th component. For , a test point and we define the (pseudo-)metric by

 d(i)x∗(x1,x2)= √E[(^zo,(i)(x∗,x1)−^zo,(i)(x∗,x2))2] (6) = √E[(^z(i)(x2)−^z(i)(x1))2],

where is the -th component of the zero-mean version of . Note that does not depend on Additionally, we assume there exists a constant such that for a compact and 333Note that here we work with the norm, but any other metric would work.

 d(i)x∗(x1,x2)≤K(i)x∗||x1−x2||2.

Now, we are finally ready to state the following theorem.

###### Theorem 1.

Assume is a hyper-cube with layers of length . For and let

 ηi=δ−(supx∈Tμo,(i)(x∗,x)+

Assume . Then, it holds that

 ϕi1(x∗,T,δ|D)≤e−η2i2ξ(i),

where is the supremum of the component of the covariance matrix .

In Theorem 1 we derive upper bounds for . Considering that

is a Gaussian process, it is interesting to note that the resulting bounds still follow an exponential distribution. From Theorem

1 we have the following result

###### Theorem 2.

Assume is a hyper-cube with layers of length . For let

 ¯ηi=δ−supx∈T|μo(x∗,x)|1n−

For each assume . Then, it holds that

 ϕ2(x∗,T,δ|D)≤2n∑i=1e−¯η2i2ξ(i),

where .

Note that in Theorem 1 and 2 we assume that is a hyper-cube. However, proofs of both theorems (reported in the Supplementary Materials) can be easily extended to more general compact sets, at a cost of more complex analytic expressions or less tight bounds. Both theorems require the computation of a set of constants, which depends on the particular kernel. In particular, and are upper bound of variance and mean a-posteriori while, for a test point and represent local Lipschitz constant and upper bound for in In the next section, we show how these constants can be computed.

###### Example 3.

We illustrate the upper bounds for and , as given by Theorem 1 and 2, on the GP introduced in Example 2. Figure 4 shows the values obtained for and on and for between and . We observe that values computed for are consistently greater than those computed for , which captures and probabilistically quantifies the increased uncertainty of the GP around , as well as the increased ratio of mean variation around it (see Figure 3). Notice also that values for are always smaller than the corresponding values. This is a direct consequence of the fact that probabilistic invariance is a stronger requirement than probabilistic safety, as defined in Problem 1, as the latter is not affected by variations that tend to increase the value of the GP output (translating to increased confidence in classification settings).

### Constant Computation

We introduce a general framework for the computation of the constants involved in the bounds presented in the previous section with an approach based on a generalisation of that of [Jones, Schonlau, and Welch1998] for squared-exponential kernels in the setting of Kriging regression. Namely, we assume the existence of a suitable decomposition of the kernel function as for all and , such that:

1. is a continuous function;

2. is differentiable, with continuous;

3. can be exactly computed for each and , .

While assumptions 1 and 2 usually follow from smoothness of the kernel used, assumption 4 depends on the particular defined. Intuitively, should represent the smallest building block of the kernel which captures the dependence on the two input points. For example for the squared exponential kernel this has the form of a separable quadratic polynomial so that assumption 3 is verified. Similarly, for the ReLU kernel can be defined as the dot product between the two input points.

Assumptions 1 and 2 guarantee the existence for every of a set of constants , , and such that:

 aiL+biLφΣ(x,xi)≤Σx,xi≤aiU+biUφΣ(x,xi)∀x∈T. (7)

In fact, it follows from those that has a finite number of flex points. Hence, we can iteratively find lower and upper approximation in convex and concave parts, and merge them together as detailed in the Supplementary Material. The key point is that, due to linearity, this upper and lower bound on the kernel can be propagated through the inference equations for Gaussian processes, so as to obtain lower and upper linear bounds on the a-posteriori mean and variance with respect to . Thanks to assumption 3, these bounds can be solved for optimal points exactly, thus providing formal lower and upper values on optimization over a-posteriori mean and variance of the Gaussian process in . The described approach can be used to compute , and . In the following subsection we give details for the computation of . We refer to the Supplementary Materials for the details for the other constants and for squared-exponential and ReLU kernels.

#### Mean Computation

As is fixed, we have that , hence we need just to compute . Using Eqn (7) we can compute a lower and upper bound to this inferior, which can be refined using standard branch and bound techniques. Let , then by the inference formula for Gaussian processes and Eqn (7) we have that:

 ¯μ(x)=n∑i=1tiΣx,xi≥n∑i=1ti(ai+biφΣ(x,xi))∀x∈T

where we choose: Let be an inferior point for to the right-hand side Equation (we refer to the Supplementary Materials to show how this can be computed for squared-exponential and ReLU kernels), then, by the definition of inferior we have that:

 ¯μ(¯x)≥infx∈T¯μ(x)≥n∑i=1tiai+n∑i=1tibiφΣ(¯x,xi).

The latter provide bounds on that can be used within a branch and bound algorithm for further refinement.

### Computational Complexity

Performing inference with GPs has a cost that is where is the size of the dataset. Once inference has been performed the cost of computing upper and lower bounds for is where is a constant that depends on the particular kernel. For instance, for the squared-exponential kernel while for the ReLU kernel (Eqn (8)) where is the number of layers of the corresponding neural network. The computation of the bounds for requires solving a convex problem in variables, while and can be bounded in constant time. Refining the bounds with a branch and bound approach has a worst-case cost that is exponential in the dimension of the input space. Hence, sparse approximations, which mitigate the cost of performing inference with GP [Seeger, Williams, and Lawrence2003], are appealing.

## Experimental Evaluation: Robustness Analysis of Deep Neural Networks

In this section we apply the methods presented above to GP defined with deep kernels, in an effort to provide a probabilistic analysis of adversarial examples. This analysis is exact for GPs, but only approximate for fully-connected NNs, by virtue of weak convergence of the induced distributions between deep kernel GPs and deep fully-connected NNs.

### Experimental Setting

We focus on GPs with ReLU kernel, which directly correspond to fully-connected NNs with ReLU activation functions. Given the number of layers , the regularization parameters (prior variance on the weights) and (prior variance on the bias), the ReLU covariance between two input points is iteratively defined by the set of equations [Lee et al.2017]:

 Σl(x1,x2) =σ2b+σ2w2π√Σl−1(x1,x1)Σl−1(x2,x2) (sinβl−1x1,x2+(π−βl−1x1,x2)cosβl−1x1,x2) (8) βlx1,x2 =cos−1(Σl(x1,x2)√Σl(x1,x1)Σl(x2,x2))

for , where .

##### Training

We follow the experimental setting of [Lee et al.2017], that is, we train a selection of ReLU GPs on a subset of the MNIST dataset using least-square classification (i.e. posing a regression problem to solve the classification task) and rely on optimal hyper-parameter values estimated in the latter work. Note that the methods we presented are not constrained to specific kernels or classification models, and can be generalized by suitable modifications to the constant computation part. Classification accuracy obtained on the full MNIST test set varied between (by training only on 100 samples) to (training on 2000 samples). Unless otherwise stated, we perform analysis on the best model obtained using training samples, that is, a two-hidden-layer architecture with and .

##### Analysis

For scalability purposes we adopt the idea from [Wicker, Huang, and Kwiatkowska2018, Ruan, Huang, and Kwiatkowska2018] of performing a feature-level analysis. Namely, we pre-process each image using SIFT [Lowe2004]. From its output, we keep salient points and their relative magnitude, which we use to extract relevant patches from each image, in the following referred to as features. We apply the analysis to thus extracted features. Unless otherwise stated, feature numbering follows the descending order of magnitude.

### Feature-based Analysis

In the first row of Figure 5

we consider three images from the MNIST test data, and for each we highlight the first five features extracted by SIFT (or less if SIFT detected less than five features). For each image

, feature and we consider the set of images given by the images differing from in only the pixels included in and by no more than for each pixel.

We plot the values obtained for as a function of for and , respectively, on the second and third row of Figure 5. Recall that represents the probability of finding such that the classification confidence for the correct class in drops by more than compared to that of . Since a greater value implies a larger neighborhood , intuitively will monotonically increase along with the value of . Interestingly, the rate of increase is significantly different for different features. In fact, while most of the 14 features analyzed in Figure 5 have similar values for , the values computed for some of the features using are almost double (e.g. feature 4 for the third image), and remains fairly similar for others (e.g. feature 3 for the first image). Also the relative ordering in robustness for different features is not consistent for different values of (e.g. features 2 and 5 from the first image). This highlights the need of performing parametric analysis of adversarial attacks, which take into account different strengths and misclassification thresholds, as suggested in [Biggio and Roli2017]. Finally, notice that, though only 14 features are explored here, the experiment shows no clear relationship between feature magnitude as estimated by SIFT and feature robustness, which calls for caution in adversarial attacks and defences that rely on feature importance.

### Variance Analysis

Most active defences are based upon rejecting input samples characterized by high uncertainty values. After uncertainty is estimated, defences of this type usually proceed by setting a meta-learning problem whose goal is to distinguish between low and high variance input points, so as to flag potential adversarial examples [Grosse et al.2017b, Feinman et al.2017]. However, mixed results are obtained with this approach [Carlini and Wagner2017].

In this subsection we aim at analyzing how the variance around test samples changes with different training settings for the three test points previously discussed. We use the method developed for variance optimisation to compute:

 σ2n(x∗)=1¯Σx∗,x∗supx∈Tf1,γx∗¯Σx,x,

that is, we look for the highest variance point in the neighbourhood of , and normalise its value with respect to the variance at . We use and perform the analysis only on feature 1 of each image.

Figure 6 plots values of as a function of the number of layers (from 1 to 10) and samples (from 100 to 2000) included in the training set. Firstly, notice how maximum values of are perfectly aligned with the results of Figure 5. That is, less robust features are associated with higher values of (e.g. feature 1 for image 1). This highlights the relationship between the existence of adversarial examples in the neighbourhood of a point and model uncertainty. We observe the normalised variance value to consistently monotonically increase with respect to the number of training samples used. This suggests that, as more and more training samples are input into the training model, the latter become more confident in predicting “natural” test samples compared to “artificial” ones. Unfortunately, as the number of layers increases, the value of decreases rapidly to a plateau. This seems to point to the fact that defence methods based on a-posteriori variance thresholding become less effective with more complex neural network architectures, which could be a justification for the mixed results obtained so far using active defences.

## Conclusion

In this paper we presented a formal approach for safety analysis of Bayesian inference with Gaussian process priors with respect to adversarial examples and invariance properties. As the properties considered in this paper cannot be computed exactly for general GPs, we compute their safe over-approximations. Our bounds are based on the Borell-TIS inequality and the Dudley entropy integral, which are known to give tight bounds for the study of suprema of Gaussian processes [Adler and Taylor2009]. On examples of regression tasks for GPs and deep neural networks, we showed how our results allow one to quantify the uncertainty associated to a given prediction, also taking into account of local perturbations of the input space. Hence, we believe our results represent a step towards the application of Bayesian models in safety-critical applications.

## Appendix A Supplementary Materials

In what follows we report the supplementary material of the paper. We first report the proofs of the main results and then give further details of the algorithmic framework we develop to compute the constants required in Theorem 1 and 2. Finally, we give details for the case of the squared-exponential kernel and ReLu kernel.

## Proofs

##### Proof of Theorem 1
 P(∃x∈Ts.t.(z(i)(x)−z(i)(x∗)>δ) (By definition of supremum) = P(supx∈Tzo,(i)(x∗,x)>δ) (By linearity of GPs) = (By definition of supremum) ≤ P(supx∈T^zo,(i)(x∗,x)>δ−supx1∈TE[zo,(i)(x∗,x1)]).

where is the zero mean Gaussian process with same variance of The last inequality can be bound from above using the following inequality, called Borell-TIS inequality [Adler and Taylor2009].

###### Theorem 3.

(Borell-TIS inequality) Let a zero-mean unidimensional Gaussian process with covariance matrix . Assume . Then, for any it holds that

 P(supx∈T^z(x)>u)≤e(u−E[supt∈T^z(x)])22σ2T, (9)

where .

In order to use the Borell-TIS inequality we need to bound from above , the expectation of the supremum of . For Gaussian processes we can use the Dudley’s entropy integral [Adler and Taylor2009], which guarantees that

 E[supx∈T ^z(x)]≤12∫supx1,x2∈Td(x1,x2)0√ln(N(d,x,T))dx,

where is the smallest number of balls of radius according to metric that completely cover (see [Adler and Taylor2009] for further details). For a hyper-cube of dimension , in order to compute we first need to compute , the number of covering balls of diameter of under norm. As the largest hyper-cube contained inside a sphere of diameter has a side of length we obtain

 N(L2,r,T)≤(1+D√mr)m.

Now we know that for

 supx1,x2∈Td(i)x∗(x2,x1)≤K(i)^x||x2−x1||2,

Thus, this implies that all the points inside a ball of radius will have a distance in the d metric smaller or equal than . Thus, the number of covering balls of radius for , according to pseudo-metric is upper-bounded by

 N(d,x,T)≤(√mDK(i)^xx+1)m.
##### Proof of Theorem 2
 P(∃x∈Ts.t.||z(x)−z(x∗)||1>δ) (By definition of supremum) = P(supx∈T||z(x)−z(x∗)||1>δ) (By definition of L1 norm) = P(supx∈Tn∑i=1|z(i)(x)−z(i)(x∗)|>δ) (By closure of GPs wrt linear operations) ≤ P(supx∈Tn∑i=1|^z(i)(x)−^z(i)(x∗)|>δ−supx1∈T||μo(x1,x∗)||1) (By the fact that ∀i∈{1,...,n}|^z(i)(x)−^z(i)(x∗)|≥0) ≤ P(∨i∈{1,...,n}supx∈T|^z(i)(x)−^z(i)(x∗)|>δ−supx1∈T||μo(x1,x∗)||1n) (By the union bound and symmetric properties of % Gaussian distributions) ≤ 2n∑i=1P(supx∈T^zo,(i)(x,x∗)>δ−supx1∈T||μo(x1,x∗)||1n) .

Last term can be bounded by using the Borell-TIS inequality and Dudley’s entropy integral, as shown in the proof of Theorem 1.

###### Lemma 1.

Let be a symmetric and positive definite matrix and . Then, it holds that

 aTBa+bTBb−2bTBa>0.
###### Proof.

Let

be the diagonal matrix with diagonal entries the eigenvalues of

. Then, there exists an ortonormal matrix such that . Consider the vectors and . Then, we obtain

 aTBa+bTBb−2bTBa=(Ca1)TBCTa1+(Cb1)TBCb1−2(Cb1)TBCa1= aT1Da1+bT1Db1−2bT1Da1=m∑i=1λi(a(i)1−b(i)1)2,

which is greater than zero as all the eigenvalues of are positive by definition of positive definite matrix. ∎

## Appendix B Constants Computation

### Lower and Upper bound to Kernel Function

In this subsection we describe a method for computing lower and linear approximation to the kernel function. Namely, given and , in this we show how to compute , such that:

 aL+bLφΣ(x,x∗)≤Σx,x∗∀x∈T.

Notice that the same techniques can be used to find and coefficients of an upper-bound, simply by considering . Let and be maximum and minimum values of for , and consider the univariate and unidimensional function . We can then compute and by using the methods described below.

##### Case 1

If happens to be concave function, than by definition of concave function, a lower bound is given by the line that links the points and .

##### Case 2

If on the other hand happens to be a convex function, than by definition of convex function, a valid lower bound is given by the tangent line in the middle point of the interval.

##### Case 3

Assume now, that is concave in , and convex in , for a certain (the same line of arguments can be used by reversing convexity and concavity). Let , coefficients for linear lower approximation in and , analogous coefficients in (respectively computed as for Case 1 and 2), and call and the corresponding functions. Define to be the linear function of coefficients and that goes through the two points and . We then have that is a valid linear lower bound in in fact:

1. if : in this case we have that , and . Hence