# Verifying Controllers Against Adversarial Examples with Bayesian Optimization

Recent successes in reinforcement learning have led to the development of complex controllers for real-world robots. As these robots are deployed in safety-critical applications and interact with humans, it becomes critical to ensure safety in order to avoid causing harm. A first step in this direction is to test the controllers in simulation. To be able to do this, we need to capture what we mean by safety and then efficiently search the space of all behaviors to see if they are safe. In this paper, we present an active-testing framework based on Bayesian Optimization. We specify safety constraints using logic and exploit structure in the problem in order to test the system for adversarial counterexamples that violate the safety specifications. These specifications are defined as complex boolean combinations of smooth functions on the trajectories and, unlike reward functions in reinforcement learning, are expressive and impose hard constraints on the system. In our framework, we exploit regularity assumptions on individual functions in the form of a Gaussian Process (GP) prior. We combine these into a coherent optimization framework using problem structure. The resulting algorithm is able to provably verify complex safety specifications or alternatively find counterexamples. Experimental results show that the proposed method is able to find adversarial examples quickly.


## I Introduction

In recent years, research in control theory and robotics has focused on developing efficient controllers for robots that operate in the real world. Controller synthesis techniques such as reinforcement learning, optimal control, and model predictive control have been used to synthesize complex policies. However, if there is a large amount of uncertainty about the real world environment that the system interacts with, the robustness of the synthesized controller becomes critical. This is particularly true in safety-critical systems, where the actions of an autonomous agent may affect human lives. This motivates us to provably verify the properties of controllers in simulation before deployment in the real world.

In this paper, we present an active machine learning framework that is able to verify black-box systems against, or alternatively find, adversarial counterexamples to a given set of safety specifications. We test the controller safety under uncertainty that arises from stochastic environments and errors in modeling. In essence, we actively search for adversarial environments under which the controller could have to operate that lead to failure modes in simulation.

Historically, designing robust controllers has been considered in control theory [1, 2]. A common issue with these techniques is that, although they consider uncertainty, they rely on simple linear models of the underlying system. This means that resulting controllers are often either overly conservative or violate safety constraints if they fail to capture nonlinear effects.

For nonlinear models with complex dynamics, reinforcement learning has been successful for synthesizing high fidelity controllers. Recently, algorithms based on reinforcement learning that can handle uncertainty have been proposed [3, 4, 5], where the performance is measured in expectation. A fundamental issue with learned controllers is that it is difficult to provide formal guarantees for safety in the presence of uncertainty. For example, a controller for an autonomous vehicle must consider human driver behaviors, pedestrian behaviors, traffic lights, uncertainty due to sensors, etc. Without formally verifying that these controllers are indeed safe, deploying them on the road could lead to loss of property or human lives.

Formal safety certificates, i.e., mathematical proofs for safety, have been considered in the formal methods community, where safety requirements are referred to as a specification. There, the goal is to verify that the behaviors of a particular model satisfy a specification [6, 7]. Synthesizing controllers that satisfy a high-level temporal specification has been studied in the context of motion planning [8] and for cyber-physical systems [9]. However, these techniques rely on simple model dynamics. For nonlinear systems, reachability algorithms based on level set methods have been used to approximate backward reachable sets for safety verification [10, 11]. However, these methods suffer from two major drawbacks: (1) the curse of dimensionality of the state space, which limits them to low-dimensional systems; and (2) the need for a priori knowledge of the system dynamics.

A dual, and often simpler, problem is falsification, which tests the system within a set of environment conditions for adversarial examples. Adversarial examples have recently been considered for neural networks [12, 13, 14, 15], where the input is typically perturbed locally in order to find counterexamples. In [16], the authors compute adversarial perturbations for a trained neural network policy for a subset of white-box and black-box systems. However, these local perturbations are often not meaningful for dynamic systems. Recently, [17, 18] have focused on testing of closed-loop safety-critical systems with neural networks by finding “meaningful” perturbations.

Testing black-box systems in simulators is a well-studied problem in the formal methods community [19, 20, 21]. The heart of research in black-box testing focuses on developing smarter search techniques that efficiently sample the uncertainty space. Indeed, in recent years, several sequential search algorithms based on heuristics such as Simulated Annealing [21], Tabu search [22], and CMA-ES [23] have been suggested. Although these algorithms sample the uncertainty space efficiently, they do not utilize any of the information gathered during previous simulations.

One active method that has been used recently for testing black-box systems is Bayesian Optimization (BO) [24], an optimization method that aims to find the global optimum of an a priori unknown function based on noisy evaluations. Typically, BO algorithms are based on Gaussian Process (GP [25]) models of the underlying function and certain algorithms provably converge close to the global optimum [26]. It has been used in robotics to, for example, safely optimize controller parameters of a quadrotor [27]. In the testing setting, BO has been used to actively find counter examples by treating the search problem as a minimization problem in [28] over adversarial control signals. However, the authors do not consider the structure of the problem and thereby violate the smoothness assumptions made by the GP model. As a result, their methods are slow to converge or may fail to find counterexamples.

In this paper, we provide a formal framework that uses BO to actively test and verify closed-loop black-box systems in simulation. We model the relation between environments and safety specifications using GPs and use BO to predict the environment scenarios most likely to cause failures in our controllers. Unlike previous approaches, we exploit structure in the problem in order to provide a formal way to reason across multiple safety constraints when searching for counterexamples. Hence, our approach is able to find counterexamples more quickly than previous approaches. Our main contributions are:

• An active learning framework for testing and verifying robotic controllers in simulation. Our framework can find adversarial examples for a synthesized controller independent of its structure or how it was synthesized.

• A common GP framework to model logical safety specifications along with theoretical analysis on when a system is verified.

## II Problem Statement

We address the problem of testing complex black-box closed-loop robotic systems in simulation. We assume that we have access to a simulation of the robot that includes the control strategy, i.e., the closed-loop system. The simulator is parameterized by a set of parameters $w \in W$, which model all sources of uncertainty. For example, they can represent environment effects such as weather, non-deterministic components such as other agents interacting with the simulator, or uncertain parameters of the physical system, e.g., friction.

The goal is to test whether the system remains safe for all possible sources of uncertainty in $W$. We specify these safety constraints on finite-length trajectories of the system that can be obtained by simulating the robot for a given set of environment parameters $w \in W$. Safety constraints on these trajectories are specified using logic. We explain this in detail in Sec. III-A, but the result is a specification $\varphi$ that can, in general, be written as a requirement $\varphi(w) > 0$ for all $w \in W$. For example, $\varphi$ can encode state or input constraints that have to be satisfied over time.

We want to test whether there exists an adversarial example $w \in W$ for which the specification is violated, i.e., $\varphi(w) \le 0$. Typically, adversarial examples are found by randomly sampling the environment and simulating the behaviors. However, this approach does not provide any guarantees and does not allow us to conclude that no adversarial example exists if none is found in our samples. Moreover, since high-fidelity simulations can often be very expensive, we want to minimize the number of simulations that we have to carry out in order to find a counterexample.

We propose an active learning framework for testing, where we utilize the results from previous simulation runs to make more informed decisions about which environment to simulate next. In particular, we pose the search problem for a counterexample as an optimization problem,

$$\operatorname*{arg\,min}_{w \in W} \; \varphi(w), \tag{1}$$

where we want to minimize the number of queries $w$ until a counterexample is found or we can verify that no counterexample exists. The main challenge is that the functional dependence $\varphi(\cdot)$ between parameters in $W$ and the specification is unknown a priori, since we treat the simulator as a black box. Solving this problem is difficult in general, but we can exploit regularity properties of $\varphi(w)$. In particular, in the following we use GPs to model the specification and use the model to pick parameters that are likely to be counterexamples.

## III Background

In this section, we introduce an overview of formal safety specifications and Gaussian processes, which we use in Sec. IV to verify the closed-loop black-box system.

### III-A Safety Specification

In the formal methods community, complex safety requirements are expressed using automata [29] and temporal logic [30, 31]. These allow us to specify complex constraints, which can also have temporal dependence.

###### Example 1.

A safety constraint for a quadcopter might be that the quadcopter cannot fly at an altitude greater than 3 m when the battery level is below 30%.

In logic, we can express this as “(battery level < 30%) implies ($\to$) (altitude < 3 m)”, which in words says that if the battery level is less than 30%, the quadcopter is flying at a height less than 3 m.

Importantly, these kinds of specifications make no assumptions about the underlying system itself. They just state requirements that must hold for all simulations in $W$. Formally, a logic specification is a function that tests properties of a particular trajectory. However, we will continue to write $\varphi(w)$ to denote the specification that tests trajectories generated by the simulator with parameters $w$.

A specification $\varphi$ consists of multiple individual constraints, called predicates, which form the basic building blocks of the logic. These predicates can be combined using a syntax or grammar of logical operations:

$$\varphi := \mu \;|\; \neg\mu \;|\; \varphi \wedge \psi \;|\; \varphi \vee \psi, \tag{2}$$

where $\mu$ is a predicate, assumed to be a smooth and continuous function of a trajectory $\xi$. The constraint $\mu(\xi) > 0$ forms the basic building block of the overall system specification $\varphi$. We say a predicate is satisfied if $\mu(\xi)$ is greater than $0$ and falsified otherwise. The operations $\neg$, $\wedge$, and $\vee$ represent negation, conjunction (and), and disjunction (or), respectively. These basic operations can be combined to define complex boolean formulas such as implication, $\to$, and if-and-only-if, $\leftrightarrow$, using the rules

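$$\varphi \to \psi := \neg\varphi \vee \psi, \quad \text{and} \quad \varphi \leftrightarrow \psi := (\neg\varphi \wedge \neg\psi) \vee (\varphi \wedge \psi). \tag{3}$$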

Since $\mu$ is a real-valued function, we can convert these boolean logic statements into an equivalent equation with continuous output, which defines the quantitative semantics,

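$$\begin{aligned} \mu(\xi) &:= \mu(\xi), & (\varphi \wedge \psi)(\xi) &:= \min(\varphi(\xi), \psi(\xi)), \\ \neg\mu(\xi) &:= -\mu(\xi), & (\varphi \vee \psi)(\xi) &:= \max(\varphi(\xi), \psi(\xi)). \end{aligned} \tag{4}$$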

This allows us to confirm that a logic statement $\varphi$ holds true for all trajectories generated by the simulator with parameters $w \in W$, by confirming that the function $\varphi(w)$ takes positive values for all $w \in W$.

In the quantitative semantics (4), the satisfaction of a requirement is no longer a yes or no answer, but can be quantified by a real number. The nature of this quantification is similar to that of a reward function, where lower values indicate a larger safety violation. This allows us to introduce a ranking among failures: $\varphi(w_1) < \varphi(w_2)$ implies $w_1$ is a more “dangerous” failure case than $w_2$. To guarantee safety, we have to take a pessimistic outlook, and denote $\varphi(w) \le 0$ as a violation and $\varphi(w) > 0$ as satisfaction of the specification $\varphi$.

###### Example 2.

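Let us look at the specification in Example 1, $\varphi := (b < 30) \to (h < 3)$, where $b$ denotes the battery level in percent and $h$ the altitude in meters. Applying the rewrite rule (3), this can be written as $\neg(b < 30) \vee (h < 3)$. Applying the quantitative semantics (4), we get $\varphi = \max(b - 30, 3 - h)$, which consists of two predicates, $\mu_1 = b - 30$ and $\mu_2 = 3 - h$. Intuitively, this means $\varphi > 0$, i.e., the specification is satisfied, if the battery level is greater than 30% or if the quadcopter flies at an altitude less than 3 m.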
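
As a minimal sketch of how the quantitative semantics turn this example into a number, the Python snippet below evaluates the quadcopter specification at a single time step rather than over a full trajectory. The function names (`neg`, `disj`, `quadcopter_spec`) and the argument units are illustrative assumptions and are not taken from the authors' implementation.

```python
def neg(value):
    """Quantitative negation: flip the sign of a predicate value."""
    return -value


def disj(left, right):
    """Quantitative disjunction: the larger (safer) of the two values."""
    return max(left, right)


def quadcopter_spec(battery_percent, altitude_m):
    """Robustness of 'battery below 30% implies altitude below 3 m'.

    Hypothetical single-time-step version: positive means satisfied,
    non-positive means violated.
    """
    low_battery = 30.0 - battery_percent  # > 0 iff battery is below 30%
    low_altitude = 3.0 - altitude_m       # > 0 iff altitude is below 3 m
    # Implication a -> b is rewritten as (not a) or b.
    return disj(neg(low_battery), low_altitude)


violation = quadcopter_spec(battery_percent=20.0, altitude_m=5.0)
satisfied = quadcopter_spec(battery_percent=20.0, altitude_m=1.0)
```

Here `violation` is negative because the battery is low while the quadcopter flies high, and its magnitude indicates how severe the violation is.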

### III-B Gaussian Process

For general black-box systems, the dependence of the specification $\varphi$ on the parameters $w \in W$ is unknown a priori. We use a GP to approximate each predicate $\mu$ in the domain $W$. We detail the modeling of $\varphi$ in Sec. IV. The following introduction about GPs is based on [25].

GPs are a non-parametric regression method from machine learning, where the goal is to find an approximation of the nonlinear function from an environment $w$ to the function value $\mu(w)$. This is done by considering the function values $\mu(w)$ to be random variables, such that any finite number of them have a joint Gaussian distribution.

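The Bayesian, non-parametric regression is based on a prior mean function and the kernel function $k(w, w')$, which defines the covariance between the function values $\mu(w)$, $\mu(w')$ at two points $w, w' \in W$. We set the prior mean to zero, since we do not have any knowledge about the system. The choice of kernel function is problem-dependent and encodes assumptions about the unknown function.

We can obtain the posterior distribution of a function value $\mu(w)$ at an arbitrary state $w \in W$ by conditioning the GP distribution of $\mu$ on a set of $n$ past measurements, $y_n = (\hat{\mu}(w_1), \dots, \hat{\mu}(w_n))$, at environment scenarios $W_n = \{w_1, \dots, w_n\}$, where $\hat{\mu}(w) = \mu(w) + \omega$ and $\omega \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise. The posterior over $\mu(w)$ is a GP distribution again, with mean $m_n(w)$, covariance $k_n(w, w')$, and variance $\sigma_n^2(w)$:

$$\begin{aligned} m_n(w) &= k_n(w)\,(K_n + \sigma^2 I_n)^{-1} y_n, \\ k_n(w, w') &= k(w, w') - k_n(w)\,(K_n + \sigma^2 I_n)^{-1} k_n^{\mathsf{T}}(w'), \\ \sigma_n^2(w) &= k_n(w, w), \end{aligned} \tag{5}$$

where the vector $k_n(w) = [k(w, w_1), \dots, k(w, w_n)]$ contains the covariances between the new environment, $w$, and the environment scenarios in $W_n$, the kernel matrix $K_n \in \mathbb{R}^{n \times n}$ has entries $[K_n]_{(i,j)} = k(w_i, w_j)$, with $i, j \in \{1, \dots, n\}$, and $I_n$ is the identity matrix.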

### III-C Bayesian Optimization (BO)

In the following we use BO in order to find the minimum of the unknown function $\varphi$, which we construct using the GP models on $\mu$ in Sec. IV. BO uses a GP model to query parameters that are informative about the minimum of the function. In particular, the GP-LCB algorithm from [26] uses the GP prediction and associated uncertainty in (5) to trade off exploration and exploitation by, at iteration $n$, selecting an environment according to

$$w_n = \operatorname*{arg\,min}_{w \in W} \; m_{n-1}(w) - \beta_n^{1/2} \sigma_{n-1}(w), \tag{6}$$

where $\beta_n$ determines the confidence interval. We provide an appropriate choice for $\beta_n$ in Theorem 1.

At each iteration, (6) selects parameters for which the lower confidence bound of the GP is minimal. Repeatedly evaluating the true function $\varphi$ at samples given by (6) improves the GP model and decreases uncertainty at candidate locations for the minimum, such that the global minimum is found eventually [26].
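
The following NumPy sketch illustrates one GP-LCB step for a single function: it computes the standard zero-mean GP posterior and then picks the candidate with the smallest lower confidence bound. The squared-exponential kernel, the fixed `beta`, the noise variance, the candidate grid, and the toy `sin` measurements are all assumptions made for illustration; they are not the settings used in the paper.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel k(w, w') with unit prior variance."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(w_train, y_train, w_query, noise_var=1e-4, lengthscale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at w_query."""
    y = np.asarray(y_train, dtype=float)
    gram = rbf_kernel(w_train, w_train, lengthscale) + noise_var * np.eye(len(y))
    k_star = rbf_kernel(w_query, w_train, lengthscale)
    mean = k_star @ np.linalg.solve(gram, y)
    # Prior variance is 1 for this kernel; subtract the explained part.
    explained = np.einsum("ij,ji->i", k_star, np.linalg.solve(gram, k_star.T))
    return mean, np.sqrt(np.clip(1.0 - explained, 0.0, None))


def lcb_next_point(w_train, y_train, candidates, beta=4.0):
    """Return the candidate minimising mean - sqrt(beta) * std, and the bound."""
    mean, std = gp_posterior(w_train, y_train, candidates)
    lcb = mean - np.sqrt(beta) * std
    return candidates[int(np.argmin(lcb))], lcb


candidates = np.linspace(0.0, 10.0, 201)   # finite stand-in for the domain
w_train = np.array([1.0, 4.0, 8.0])        # previously simulated parameters
y_train = np.sin(w_train)                  # toy measurements
w_next, lcb = lcb_next_point(w_train, y_train, candidates)
```

Minimizing over a fixed grid keeps the sketch short; in a continuous domain the same bound would be handed to a global optimizer instead.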

## IV Active Testing for Counterexamples

In this section, we show how to model specifications $\varphi$ in (1) using GPs without violating smoothness assumptions and use this to find adversarial counterexamples.

In order to use BO to optimize (1), we need to construct reliable confidence intervals on $\varphi$. However, if we were to model $\varphi$ as a GP with commonly used kernels, it would need to be a smooth function of $w$. Even though the predicates, $\mu$, are typically smooth functions of the trajectories, and hence smooth in $w$, conjunction and disjunction ($\min$ and $\max$) in (4) are non-smooth operators that render $\varphi$ non-smooth as well. Instead, we exploit the structure of the specification $\varphi$ and decompose $\varphi$ into a parse tree, where the leaf nodes are the predicates.

###### Definition 1 (Parse Tree T).

Given a specification formula $\varphi$, the corresponding parse tree, $T$, has leaf nodes that correspond to function predicates, while other nodes are $\max$ (disjunctions) and $\min$ (conjunctions).

A parse tree is an equivalent graphical representation of $\varphi$. For example, consider the specification

$$\varphi := (\mu_1 \vee \mu_2) \to (\mu_3 \vee \mu_4) = (\neg\mu_1 \wedge \neg\mu_2) \vee (\mu_3 \vee \mu_4), \tag{7}$$

where the second equality follows from the definition of implication in (3) and De Morgan's law. We can obtain an equivalent function $\varphi(w)$ with (4),

$$\varphi(w) = \max\bigl(\min(-\mu_1(w), -\mu_2(w)),\; \max(\mu_3(w), \mu_4(w))\bigr). \tag{8}$$

The parse tree, $T$, for $\varphi$ in (8) is shown in Fig. 2. We can use the parse tree to decompose any complex specification into $\min$ and $\max$ functions of the individual predicates; that is, $\varphi(w) = T(\mu_1(w), \dots, \mu_q(w))$.
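
One possible way to represent such a parse tree in code is sketched below. The `Leaf` and `Node` classes and the `evaluate` function are hypothetical names chosen for illustration; the example tree encodes the specification $(\mu_1 \vee \mu_2) \to (\mu_3 \vee \mu_4)$ discussed above, with negation pushed down to the leaves.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    index: int             # position of the predicate value in the input list
    negated: bool = False  # True if the predicate appears as "not mu"


@dataclass
class Node:
    op: str                # "min" for conjunction, "max" for disjunction
    children: Sequence["Tree"]


Tree = Union[Leaf, Node]


def evaluate(tree, values):
    """Recursively apply min/max to the predicate values at the leaves."""
    if isinstance(tree, Leaf):
        value = values[tree.index]
        return -value if tree.negated else value
    child_values = [evaluate(child, values) for child in tree.children]
    return min(child_values) if tree.op == "min" else max(child_values)


# max( min(-mu1, -mu2), max(mu3, mu4) )
spec = Node("max", [
    Node("min", [Leaf(0, negated=True), Leaf(1, negated=True)]),
    Node("max", [Leaf(2), Leaf(3)]),
])

robustness = evaluate(spec, [1.0, 2.0, -3.0, -4.0])
```

In this call the premise holds (both $\mu_1$ and $\mu_2$ are positive) while neither $\mu_3$ nor $\mu_4$ does, so the result is negative and the specification is violated.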

We now model each predicate $\mu_i$ in the parse tree of $\varphi$ with a GP and combine them with the parse tree to obtain confidence intervals on the overall specification $\varphi$ for BO. GP-LCB as expressed in (6) can be used to search for the minimum of a single GP. A key insight in extending (6) across multiple GPs is that the minimum of (1) is, with high probability, lower bounded by the lower-confidence interval of one of the GPs used to model the predicates of $\varphi$. This is because the $\min$ and $\max$ operators do not change the value of the predicates, but only make a choice between them. As a consequence, we can model the smooth parts of $\varphi$, i.e., the predicates, using GPs and then consider the non-smoothness through the parse tree.

For each predicate $\mu_i$ in the parse tree of $\varphi$, we construct a lower confidence bound $l_i(w) = m^i_{n-1}(w) - \beta_n^{1/2} \sigma^i_{n-1}(w)$, where $m^i_{n-1}$ and $\sigma^i_{n-1}$ are the mean and standard deviation of the GP corresponding to $\mu_i$. From this, we can construct a lower-confidence interval on $\varphi$ as $T(l_1(w), \dots, l_q(w))$, where we replace the $i$th leaf node $\mu_i$ of the parse tree with the pessimistic prediction of the corresponding GP. Similar to (6), the corresponding acquisition function for BO uses this lower bound to select the next evaluation point,

$$w_n = \operatorname*{arg\,min}_{w \in W} \; T(l_1(w), \dots, l_q(w)). \tag{9}$$

Intuitively, the next environment selected to simulate is the one that minimizes the worst-case predictions on $\varphi$. Effectively, we propagate the confidence intervals associated with the GP for each predicate through the parse tree $T$ in order to obtain predictions about $\varphi$ directly. Note that (9) does not return an environment sample that minimizes the satisfaction of all the predicates; it only minimizes the lower bound on $\varphi$.
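
The propagation of per-predicate lower bounds through the tree can be sketched as follows. The posterior means and standard deviations are hard-coded stand-ins rather than outputs of fitted GPs, the three-predicate tree $\mu_1 \wedge (\mu_2 \vee \mu_3)$ is a made-up example without negations, and `select_next` is a hypothetical helper name.

```python
import numpy as np


def select_next(candidates, means, stds, tree, beta=4.0):
    """Minimise the tree applied to the lower confidence bound of each predicate."""
    bounds = [m - np.sqrt(beta) * s for m, s in zip(means, stds)]
    phi_lower = tree(*bounds)                # element-wise over all candidates
    best = int(np.argmin(phi_lower))
    return candidates[best], phi_lower[best]


def tree(l1, l2, l3):
    """Parse tree of mu1 AND (mu2 OR mu3): min(l1, max(l2, l3))."""
    return np.minimum(l1, np.maximum(l2, l3))


candidates = np.linspace(0.0, 1.0, 101)
# Stand-in posterior means and standard deviations for three predicates.
means = [1.0 - candidates, candidates - 0.5, np.full_like(candidates, 0.2)]
stds = [np.full_like(candidates, 0.05) for _ in means]

w_next, bound = select_next(candidates, means, stds, tree)
verified = bound > 0.0   # a positive worst-case bound would indicate verification
```

With these stand-in values the lower bound becomes negative near the right end of the domain, so that region would be simulated next and the system could not yet be declared verified.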

Algorithm 1 describes our active testing procedure. The algorithm proceeds by first computing the parse tree $T$ from the specification $\varphi$. At each iteration $n$ of BO, we select new environment parameters $w_n$ according to (9). We then simulate the system with parameters $w_n$ and evaluate each predicate $\mu_i$ on the simulated trajectories. Lastly, we update each GP with the corresponding measurement of $\mu_i$. The algorithm either returns a counterexample that minimizes (1) or, when the minimum of the lower bound in (9) is greater than zero, allows us to conclude that the system has been verified.

### IV-A Theoretical Results

We can transfer theoretical convergence results for GP-LCB [26] to the setting of Algorithm 1. To do this, we need to make structural assumptions about the predicates. In particular, we assume that they have bounded norm in the Reproducing Kernel Hilbert Space (RKHS, [32]) that corresponds to the GP’s kernel. These are well-behaved functions of the form $\mu(w) = \sum_{i \ge 0} \alpha_i k(w_i, w)$ with representer points $w_i \in W$ and weights $\alpha_i \in \mathbb{R}$ that decay sufficiently quickly. We leverage theoretical results from [33] and [27] that allow us to build reliable confidence intervals using the GP models from Sec. III-B. We have the following result.

###### Theorem 1.

Assume that each predicate has RKHS norm bounded by  and that the measurement noise is -sub-Gaussian. Select $w_n$ according to (9), and let . If , then with probability at least we have that and the system has been verified against all environments in $W$.

Here  is the mutual information between , the  noisy measurements of , and the GP prior of . This function was shown to be sublinear in  for many commonly-used kernels in [26], see the appendix for more details. Theorem 1 states that we can verify the system against adversarial examples with high probability, by checking whether the worst-case lower-confidence bound is greater than zero. We provide additional theoretical results about the existence of a finite  such that the system can be verified up to  accuracy in the appendix.

## V Evaluation

In this section, we evaluate our method on several challenging test cases. A Python implementation of our framework and the following experiments can be found at https://github.com/shromonag/adversarial_testing.git.

In order to use Algorithm 1, we have to solve the optimization problem (9). In practice, different optimization techniques have been proposed to find the global minimum of the function. One popular algorithm is DIRECT [34], a gradient-free optimization method. An alternative is to use gradient-based methods together with random restarts. In particular, we sample a large number of potential environment scenarios at random from $W$, and run separate optimization routines to minimize (9) from these.

Another challenge is that the dimensionality of the optimization problem can often be very large. However, methods that allow for more efficient computation do exist. These methods reduce the effective size of the input space and thereby make the optimization problem more tractable. One possibility is to use random embedding to reduce the input dimension as done in Random Embedding Bayesian Optimization (REMBO [35]). We can then model the GP in this smaller input dimension and carry out BO in the lower dimension input space.
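
The random-embedding idea can be sketched in a few lines: a fixed Gaussian random matrix maps a low-dimensional search point back into the original box-constrained parameter space, so that the GP and BO only ever see the low-dimensional variable. The dimensions, bounds, seed, and helper names (`make_embedding`, `lift`) below are illustrative assumptions, not values from the experiments.

```python
import numpy as np


def make_embedding(high_dim, low_dim, seed=0):
    """Draw a fixed random matrix mapping low-dimensional points to high dimension."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((high_dim, low_dim))


def lift(z, embedding, lower, upper):
    """Map a low-dimensional point z into the original box by projecting and clipping."""
    return np.clip(embedding @ np.asarray(z, dtype=float), lower, upper)


embedding = make_embedding(high_dim=100, low_dim=5)
z = np.array([0.3, -0.1, 0.0, 0.5, -0.4])   # point chosen by BO in the small space
w = lift(z, embedding, lower=-1.0, upper=1.0)
```

The simulator is then called with `w`, while the measured predicate values are attached to `z` when the GPs are updated.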

### V-A Modeling smooth functions vs non-smooth functions

In the following, we show the effectiveness of modeling smooth functions by GPs and considering the non-smooth operations in the BO search as opposed to modeling the non-smooth function by a single GP.

Consider the following, illustrative optimization problem,

$$w^* = \operatorname*{arg\,min}_{w \in (0, 10)} \; \max(\sin(w) + 0.65,\; \cos(w) + 0.65). \tag{10}$$

We consider two modeling scenarios: one where we model $\max(\sin(w) + 0.65, \cos(w) + 0.65)$ as a single GP, and another where we model $\sin(w) + 0.65$ by one GP and $\cos(w) + 0.65$ by another. We initialize the GP models for all three functions with 5 samples chosen at random. We then use BO to find $w^*$. We were able to model smooth functions like $\sin$ and $\cos$ with GPs, even with few samples. At each iteration of BO, we computed the next sample by solving for the $w$ that minimized the maximum across the two GPs. This quickly stabilizes to the true $w^*$ (Fig. 2(c)). When we model the maximum using a single GP, in Fig. 2(b), the initial 5 samples were not able to model it well. In fact, the original function in orange is not contained within the uncertainty bounds of the GP. Hence, in each iteration of BO, where we chose the $w$ that minimized this function, we were never able to converge to $w^*$. It is not surprising to see that, given these models, BO does not always converge when we model non-smooth functions such as the one in (10).

To support our claim, we repeat this experiment 15 times with different initial samples. In each experiment we run BO for 50 iterations. When modeling $\sin(w) + 0.65$ and $\cos(w) + 0.65$ as separate GPs, BO stabilized to $w^*$ in about 5 iterations in all 15 experiments. However, when modeling the maximum as a single GP, it takes over 35 iterations to converge, and in 5 out of the 15 cases it did not converge to $w^*$. We show these two different behaviors in Fig. 4.
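
For reference, the ground truth of this illustrative problem can be checked numerically. The sketch below locates the minimizer on a dense grid and compares the one-sided slopes at the analytic crossing point $5\pi/4$ of the two branches, which makes the kink that troubles a single smooth GP visible. The grid resolution and step size `eps` are arbitrary choices.

```python
import numpy as np


def phi(w):
    """Objective of the illustrative problem: max of two shifted sinusoids."""
    return np.maximum(np.sin(w) + 0.65, np.cos(w) + 0.65)


grid = np.linspace(0.0, 10.0, 100001)
w_star = grid[np.argmin(phi(grid))]   # grid approximation of the minimiser
phi_star = phi(w_star)

# One-sided finite differences at the point where sin and cos cross.
kink = 5.0 * np.pi / 4.0
eps = 1e-6
left_slope = (phi(kink) - phi(kink - eps)) / eps
right_slope = (phi(kink + eps) - phi(kink)) / eps
```

The minimizer coincides with the kink, where the slope jumps from negative to positive, and the minimum value is slightly below zero, so a counterexample exists for this toy specification.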

### V-B Collision Avoidance with High Dimensional Uncertainty

Consider an autonomous car that travels on a straight road with an obstacle at . We require that the car can come to a stop before colliding with an obstacle. The car has two states, location $x$ and velocity $v$, and one control input, acceleration $a$. The dynamics of the car are given by

$$\dot{x} = v, \quad \dot{v} = a. \tag{11}$$

Our safety specification for collision avoidance is given by, , i.e., the minimum distance between the position of the car and the obstacle over a horizon of length . We assume that the car does not know where the obstacle is a priori, but receives locations of the obstacle through a sensor at each time instant, . The controller is a simple linear state feedback control, , such that at time , .

We assume that the car initially starts at location , with a velocity . Let the obstacle be at , which is not known by the car. Instead, it receives sensor readings for the location of the obstacle such that . If is negative, then for some which signifies collision. Moreover, we constrain the acceleration to lie in .

The domain of our uncertainty is , i.e., the sensor readings over the horizon . We compare across three experimental setups, first, we model the GP in the original space of i.e., with inputs; second, we model the GP in a lower dimension input space as described in the preamble of this section; and third, we randomly sample inputs and test them. We run BO for 250 iterations on the GPs, and consider 250 random samples for the random testing. We repeat this experiment 10 times and show our results in Fig. 5.

The green and blue bar in Fig. 5 show the average number of counterexamples returned running BO on the GP defined over the original input space and in the low dimension input space. In general, active testing in the high-dimensional input space gives the best results, which deteriorates with an increase in compression of the input space. Random testing, shown in red performs the worst. This is not surprising as, (1) samples is not sufficient to cover an input space of dimensions uniformly; and (2) the samples are all independent of each other. Moreover, in the uncompressed input case, the specification evaluated at the worst counterexample, , has a mean and standard deviation of and as compared to and for random sampling.

### V-C OpenAI Gym Environments

We interfaced our tool with environments from OpenAI Gym [36] to test controllers from OpenAI Baselines [37]. For brevity, we refer the reader to [38] for the details of the environments. In both case studies, we introduce uncertainty around the parameters the controller has been trained for. The rationale behind this is that the parameters in a simulator are an estimate of the true values. This ensures that the counterexamples found can indeed occur in the real system.

#### V-C1 Reacher

In the reacher environment, we have a 2D robot trying to reach a target. For this environment we have six sources of uncertainty: two for the goal position, , two for state perturbations and two for velocity perturbations . The state of the reacher is tuple with the current location, , velocity , and rotation, . A trajectory of the system, , is a sequence of states over time, i.e., . Our uncertainty space is, . Given an instance of , the trajectory, , of the system is uniquely defined.

We trained a controller using the Proximal Policy Optimization (PPO) [39] implementation available at Open AI baselines. We determine a trajectory to be safe if either the reacher reaches the goal, or if it does not rotate unnecessarily. This can be captured as , where, is the minimum distance between the trajectory and the goal position, and is total rotation accumulated over the trajectory; and its continuous variant, .

Using our modeling approach, we model this using two GPs, one for and another for . We compare this to modeling as a single GP and random sampling. We run 200 BO iterations and consider 200 random samples for random testing. We repeat this experiment 10 times.

In Fig. 6, we plot the number of counterexamples found by each of the three methods over 10 runs of the experiment. Modeling the predicates by separate GPs and applying BO across them (shown in green) consistently performs better than applying BO on a single GP modeling $\varphi$ (shown in blue) and random testing (shown in red). We see that random testing performs very poorly and in some cases (experiment runs ) finds no counterexamples.

By modeling the predicates separately, the specification evaluated at the worst counterexample, , has a mean and standard deviation of and as compared to and when considering a single GP. This suggests, that using our modeling paradigm BO converges (since the standard deviation is small) to a more falsifying counterexample (since the mean is smaller).

#### V-C2 Mountain Car Environment

The mountain car environment in OpenAI gym, is a car on a one-dimensional track, positioned between two mountains. The goal is to drive the car up the mountain on the right. The environment comes with one source of uncertainty, the initial state . We introduced four other sources of uncertainty, for the initial velocity, ; goal location, ; maximum speed, and maximum power magnitude, . The state of the mountain car is a tuple with the current location, , and velocity, . A trajectory of the system, , is a sequence of states over time, i.e., . Our uncertainty space is given by, . Given an instance of , the trajectory, , of the system is uniquely defined.

We trained two controllers, one using PPO and another using an actor-critic method (DDPG) for continuous deep Q-learning [40]. We determine a trajectory to be safe if it reaches the goal quickly or if it does not deviate too much from its initial location and always maintains its velocity within some bound. Our safety specification can be written as , where, is time taken to reach the goal, is the deviation from the initial location and is the deviation from the velocity bound; and its continuous variant of . We model $\varphi$ by modeling each predicate by a GP. We compare this to modeling $\varphi$ with a single GP and random sampling. We run 200 BO iterations for the GPs and consider 200 random samples for random testing. We repeat this experiment 10 times. We show our results in Fig. 7, where we plot the number of counterexamples found by each of the three methods over 10 runs of the experiment for each controller.

Fig. 7 demonstrates the strength of our approach. The number of counterexamples found by our method (green bars) is much higher than for random sampling (red) and for modeling $\varphi$ as a single GP (blue). In Fig. 6(a) the blue bars are smaller than even the ones in red, suggesting that random sampling performs better than applying BO on the single GP modeling $\varphi$. This is because the GP is not able to model $\varphi$ and is so far away from the true model that the sample returned by BO is worse than if we were to sample randomly.

This is further highlighted by the value of the specification at the worst counterexample, . The mean and standard deviation for over the 10 experiment runs are and for our method, and when $\varphi$ is modeled as a single GP; and and for random sampling. A similar but less drastic result holds in the case of the controller trained with DDPG.

## VI Conclusion

We presented an active testing framework that uses Bayesian Optimization to test and verify closed-loop robotic systems in simulation. Our framework handles complex logic specifications and models them efficiently using Gaussian Processes in order to find adversarial examples faster. We showed the effectiveness of our framework on controllers designed on OpenAI gym environments. As future work, we would like to extend this framework to test more complex robotic systems and find regions in the environment parameter space where the closed-loop control is expected to fail.

## Acknowledgments

Research reported in this paper was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-17-2-0196, and in part by Toyota under the iCyPhy center. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

## Appendix A Proofs

In this section, we prove the convergence of our algorithm under specified regularity assumptions on the underlying predicates. Consider the specification

$$\varphi(w) = T(\mu_1(w), \dots, \mu_q(w)), \tag{12}$$

where $q$ represents the number of predicates. Let the domain of the predicate indices be represented by $\mathcal{I} = \{1, \dots, q\}$. The convergence proofs for classical Bayesian optimization in [26, 33] proceed by building reliable confidence intervals for the underlying function and then showing that these confidence intervals concentrate quickly enough at the location of the optimum under the proposed evaluation strategy. For ease of exposition, we assume that measurements of each predicate $\mu_i$ are corrupted by the same measurement noise.

To leverage these proofs, we need to account for the fact that our GP model is composed of several individual predicates and that we obtain one measurement for each predicate at every iteration of the algorithm.

We start by defining a composite function $f(w, i)$, which returns the function values for the individual predicates indexed by $i$,

$$f(w, i) = \mu_i(w). \tag{13}$$

The function $f(\cdot, \cdot)$ is a single-output function, which can be modeled with a single GP with a scalar output over the extended input space, $\mathcal{W} \times \mathcal{I}$. For example, if we assume that the predicates are independent of each other, the kernel function for $f$ would look like

$$k((w, i), (w', i')) = \begin{cases} k_i(w, w') & \text{if } i = i', \\ 0 & \text{otherwise}, \end{cases} \tag{14}$$

where $k_i$ is the kernel function corresponding to the GP for the $i$-th predicate, $\mu_i$. It is straightforward to include correlations between functions in this formulation too.
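A kernel of this block form is easy to write down. The sketch below assumes, purely for illustration, that every predicate uses a squared-exponential kernel with its own lengthscale; the names `rbf`, `composite_kernel`, and `lengthscales` are hypothetical and not taken from the paper.

```python
import math


def rbf(w, w_prime, lengthscale):
    """Squared-exponential kernel between two parameter vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(w, w_prime))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def composite_kernel(w, i, w_prime, i_prime, lengthscales):
    """Kernel over extended inputs (w, i) for independent predicates.

    Inputs that belong to different predicates are uncorrelated; inputs
    that share a predicate index use that predicate's own kernel.
    """
    if i != i_prime:
        return 0.0
    return rbf(w, w_prime, lengthscales[i])
```

With this kernel, the Gram matrix over all measurements is block diagonal when the data are grouped by predicate index, so the single GP over the extended input space behaves like $q$ separate GPs.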

This reformulation allows us to build reliable confidence intervals on the underlying predicates, given regularity assumptions. In particular, we make the assumption that the function $f$ has bounded norm in the Reproducing Kernel Hilbert Space (RKHS, [32]) corresponding to the same kernel $k$ that is used for the GP on $f$.

###### Remark 1.

Note that this model is more general than the case where we assume that each predicate, $\mu_i$, individually has bounded RKHS norm. In this case, the function $f$ also has bounded RKHS norm with respect to the kernel in Eq. (14).

###### Lemma 1.

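Assume that $f$ has RKHS norm bounded by $B$ and the measurements are corrupted by $\sigma$-sub-Gaussian noise. If the scaling factor $\beta_{q \cdot n}$ is chosen appropriately, then the following holds for all environment scenarios, $w \in \mathcal{W}$, predicate indices, $i \in \mathcal{I}$, and iterations $n \geq 1$ jointly with probability at least $1 - \delta$,

$$|f(w, i) - m_{q \cdot (n-1)}(w, i)| \leq \beta^{1/2}_{q \cdot n} \, \sigma_{q \cdot (n-1)}(w, i). \tag{15}$$

###### Proof.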

This follows directly from [27], which extends the results from [33] and Lemma 5.1 from [26] to the case of multiple measurements. ∎

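The scaling factor for the confidence intervals, $\beta_{q \cdot n}$, depends on the mutual information $I(y_{q \cdot (n-1)}; f)$ between the GP model of $f$ and the $q \cdot (n-1)$ measurements of the individual predicates that we have obtained over the time steps so far. It can easily be computed as

$$\begin{aligned} I(y_{q \cdot (n-1)}; f) &= \log \left| \mathbf{I} + \frac{1}{\sigma^2} K_{q \cdot (n-1)} \right| \\ &= \sum_{j=1}^{n-1} \sum_{i=1}^{q} \log\left(1 + \sigma^2_{j \cdot q}(w_j, i) / \sigma^2\right), \end{aligned} \tag{16}$$

where $K_{q \cdot (n-1)}$ is the kernel matrix of the single GP over the extended parameter space and the inner sum in the second line indicates the fact that we obtain $q$ measurements at every iteration.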
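The equality of the two expressions for the mutual information can be checked by hand for two measurements. The sketch below assumes the first expression is read as a log-determinant and that the variance in the second expression is the GP posterior variance given all earlier measurements. The function names are hypothetical.

```python
import math


def info_logdet(k11, k12, k22, noise_var):
    """log det(I + K / noise_var) for a 2x2 kernel matrix K."""
    a = 1.0 + k11 / noise_var
    d = 1.0 + k22 / noise_var
    b = k12 / noise_var
    return math.log(a * d - b * b)


def info_sequential(k11, k12, k22, noise_var):
    """Sum of log(1 + posterior variance / noise_var) over both points."""
    var_first = k11                                   # prior variance at point 1
    var_second = k22 - k12 ** 2 / (k11 + noise_var)   # variance at point 2 given point 1
    return (math.log(1.0 + var_first / noise_var)
            + math.log(1.0 + var_second / noise_var))
```

Both functions return the same value up to rounding. This is why the quantity can be accumulated one measurement at a time while the algorithm runs, without factorizing the full kernel matrix.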

Based on these individual confidence intervals on $f$, we can construct confidence intervals on $\varphi$. In particular, let

$$\begin{aligned} l_i(w) &= m_{q \cdot (n-1)}(w, i) - \beta^{1/2}_{q \cdot n} \, \sigma_{q \cdot (n-1)}(w, i), \\ u_i(w) &= m_{q \cdot (n-1)}(w, i) + \beta^{1/2}_{q \cdot n} \, \sigma_{q \cdot (n-1)}(w, i) \end{aligned} \tag{17}$$

be the lower and upper confidence bounds on each predicate. From this, we construct reliable confidence intervals on $\varphi$ as follows:

###### Lemma 2.

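Under the assumptions of Lemma 1, let $T$ be the parse tree corresponding to $\varphi$. Then the following holds for all environment scenarios, $w \in \mathcal{W}$, and iterations $n \geq 1$ jointly with probability at least $1 - \delta$,

$$T(l_1(w), \dots, l_q(w)) \leq \varphi(w) \leq T(u_1(w), \dots, u_q(w)). \tag{18}$$

###### Proof.

This is a direct consequence of Lemma 1 and the properties of the $\min$ and $\max$ operators: both are monotone in each argument, so replacing every predicate by its lower (upper) bound can only decrease (increase) the value of $T$. ∎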
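The interval propagation in Lemma 2 can be sketched as a recursive evaluation of the parse tree. The tuple encoding of the tree used here is an assumption chosen for brevity, not the representation used by the authors.

```python
def evaluate(tree, values):
    """Evaluate a parse tree of min/max nodes.

    A tree is either an integer leaf (index of a predicate) or a tuple
    ("min" | "max", child, child, ...).
    """
    if isinstance(tree, int):
        return values[tree]
    op, *children = tree
    results = [evaluate(child, values) for child in children]
    return min(results) if op == "min" else max(results)


def phi_bounds(tree, lower, upper):
    """Lower and upper bounds on phi from per-predicate bounds."""
    return evaluate(tree, lower), evaluate(tree, upper)
```

For example, with `tree = ("max", 0, ("min", 1, 2))`, any predicate values that lie between `lower` and `upper` entry by entry give a value of `evaluate(tree, values)` that lies between the two numbers returned by `phi_bounds`.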

We are now able to prove the main theorem as a direct consequence of Lemma 2.

See Theorem 1.

###### Proof.

For independent variables the mutual information decomposes additively. Following Remark 1, the result is a direct consequence of Lemma 2, since Eq. (18) holds for all $w \in \mathcal{W}$ with probability at least $1 - \delta$. ∎

### A-A Convergence proof

In the following, we prove a stronger result about convergence of our algorithm.

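The key quantity in the behavior of the algorithm is the mutual information in Eq. (16). Importantly, it was shown in [27] that it can be upper bounded by the worst-case mutual information, the information capacity, which in turn was shown to be sublinear by [26]. In particular, let $f_{\mathcal{D}}$ denote the noisy measurements obtained when evaluating the function $f$ at points in $\mathcal{D}$. The mutual information obtained by the algorithm can be bounded according to

$$I(f_{\mathcal{W}_n \times \mathcal{I}}; f) \leq \max_{\bar{\mathcal{W}} \subset \mathcal{W},\, |\bar{\mathcal{W}}| \leq n} I(f_{\bar{\mathcal{W}} \times \mathcal{I}}; f) \leq \max_{\mathcal{D} \subset \mathcal{W} \times \mathcal{I},\, |\mathcal{D}| \leq n \cdot q} I(f_{\mathcal{D}}; f) = \gamma_{q \cdot n}, \tag{19}$$

where $\gamma_n$ is the worst-case mutual information that we can obtain from $n$ measurements,

$$\gamma_n = \max_{\mathcal{D} \subset \mathcal{W} \times \mathcal{I},\, |\mathcal{D}| = n} I(f_{\mathcal{D}}; f). \tag{20}$$

This quantity was shown to be sublinear in $n$ for many commonly used kernels in [26].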

A key quantity to show convergence of the algorithm is the instantaneous regret,

$$r_n = \varphi(w_n) - \min_{w \in \mathcal{W}} \varphi(w), \tag{21}$$

the gap between the value of $\varphi$ at the environment parameters $w_n$ that Algorithm 1 selects at iteration $n$ and the unknown true minimum of $\varphi$. If the instantaneous regret is equal to zero, the algorithm has converged.

In the following, we will show that the cumulative regret, $R_n = \sum_{i=1}^{n} r_i$, is sublinear in $n$, which implies convergence of Algorithm 1.

We start by bounding the regret in terms of the confidence intervals on $f$.

###### Lemma 3.

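Fix $n \geq 1$. If the confidence intervals in Eq. (15) hold for all $w \in \mathcal{W}$ and $i \in \mathcal{I}$, then the regret $r_n$ is bounded by a constant multiple of $\beta^{1/2}_{q \cdot n} \max_{i \in \mathcal{I}} \sigma_{q \cdot (n-1)}(w_n, i)$.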

###### Proof.

The proof is analogous to [26, Lemma 5.2]. The maximum standard deviation follows from the properties of the $\min$ and $\max$ operators in the parse tree $T$. In particular, let  with . Then for all  and  we have that

$$a_1 - b_1 \leq \min(a_1 + c_1, a_2 + c_2) \leq a_1 + b_1. \tag{22}$$

The $\max$ operator is analogous. Thus, since the parse tree $T$ is composed only of $\min$ and $\max$ nodes, the regret is bounded by the maximum error over all predicates. The result follows. ∎

###### Lemma 4.

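Pick $\delta \in (0, 1)$ and $\beta_{q \cdot n}$ as shown in Lemma 1. Then the following holds with probability at least $1 - \delta$,

$$\sum_{i=1}^{n} r_i^2 \leq \beta_{q \cdot n} C_1 q \, I(f_{\mathcal{W}_n \times \mathcal{I}}; f) \leq \beta_{q \cdot n} C_1 \gamma_{q \cdot n}, \tag{23}$$

where $r_i$ is the regret between the true minimizing environment scenario, $w^*$, and the sample at iteration $i$, $w_i$; and $C_1$ is a constant.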

###### Proof.

The first inequality follows similarly to [26, Lemma 5.4] and the proofs in [27]. In particular, as in [27],

$$r_n^2 \leq 4 \beta_{q \cdot n} \max_{i \in \mathcal{I}} \sigma^2_{q \cdot (n-1)}(w_n, i).$$

The second inequality follows from Eq. (19). ∎

###### Lemma 5.

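Under the assumptions of Lemma 2, let $\delta \in (0, 1)$ and choose $\beta_{q \cdot n}$ according to Eq. (9). Then, the cumulative regret $R_N$ over $N$ iterations of Algorithm 1 is bounded with high probability,

$$\Pr\left\{ R_N \leq \sqrt{C_1 \beta_{q \cdot N} N \gamma_{q \cdot N}} \;\; \forall N \geq 1 \right\} \geq 1 - \delta, \tag{24}$$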

where .

###### Proof.

Since $R_N = \sum_{n=1}^{N} r_n$, the Cauchy-Schwarz inequality gives $R_N^2 \leq N \sum_{n=1}^{N} r_n^2$. The rest follows from Lemma 4. ∎
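The Cauchy-Schwarz step used in this proof, which bounds a sum of regrets by the square root of the number of terms times the sum of their squares, can be checked numerically. The regret values below are made up for illustration.

```python
import math


def cauchy_schwarz_bound(regrets):
    """Upper bound sqrt(N * sum of squared regrets) on the sum of regrets."""
    n = len(regrets)
    return math.sqrt(n * sum(r * r for r in regrets))


regrets = [0.5, 0.2, 0.1, 0.0]   # hypothetical instantaneous regrets
cumulative = sum(regrets)        # the cumulative regret
bound = cauchy_schwarz_bound(regrets)
```

The bound is tight only when all instantaneous regrets are equal. If the sum of squared regrets grows sublinearly in the number of iterations, the bound on the cumulative regret does too, which is the property the convergence argument relies on.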

We introduce some notation. Let

$$\hat{w}_n = \operatorname*{argmin}_{w \in \{w_1, \dots, w_n\}} \varphi(w) \tag{25}$$

be the minimizing environment scenario sampled by BO in $n$ iterations and let

$$w^* = \operatorname*{argmin}_{w \in \mathcal{W}} \varphi(w) \tag{26}$$

be the unknown, optimal parameter.

###### Corollary 1.

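For any $\epsilon > 0$ and $\delta \in (0, 1)$, there exists an $n^*$,

$$\frac{n^*}{\beta_{q \cdot n^*} \gamma_{q \cdot n^*}} = \frac{C_1}{\epsilon^2}, \tag{27}$$

such that $\varphi(\hat{w}_{n^*}) - \varphi(w^*) \leq \epsilon$ holds with probability at least $1 - \delta$.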

###### Proof.

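Consider the cumulative regret $R_n$ over $n$ iterations, where $w_i$ is the $i$-th BO sample. Defining $\hat{w}_n$ as in Eq. (25), we have $\varphi(\hat{w}_n) \leq \varphi(w_i)$ for all $i \leq n$, and therefore

$$R_n = \sum_{i=1}^{n} \left( \varphi(w_i) - \varphi(w^*) \right) \geq \sum_{i=1}^{n} \left( \varphi(\hat{w}_n) - \varphi(w^*) \right) = n \left( \varphi(\hat{w}_n) - \varphi(w^*) \right). \tag{28}$$

Combining this result with Lemma 5, we have with probability greater than $1 - \delta$ that

$$\varphi(\hat{w}_n) - \varphi(w^*) \leq \frac{R_n}{n} \leq \sqrt{\frac{C_1 \beta_{q \cdot n} \gamma_{q \cdot n}}{n}}. \tag{29}$$

To find $n^*$, we bound the RHS by $\epsilon$,

$$\sqrt{\frac{C_1 \beta_{q \cdot n^*} \gamma_{q \cdot n^*}}{n^*}} \leq \epsilon \;\Rightarrow\; \frac{n^*}{\beta_{q \cdot n^*} \gamma_{q \cdot n^*}} \geq \frac{C_1}{\epsilon^2}. \tag{30}$$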

For , the minimum . ∎
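Given concrete growth rates for the scaling factor and the information capacity, the smallest iteration count that meets the accuracy target in the corollary can be found by direct search. The sketch below is only illustrative: `beta` and `gamma` are hypothetical callables supplied by the user (with the factor $q$ folded into them), and the constant `c1` is a placeholder.

```python
import math


def find_n_star(epsilon, c1, beta, gamma, n_max=10 ** 6):
    """Smallest n with sqrt(c1 * beta(n) * gamma(n) / n) <= epsilon.

    Returns None if no such n exists up to n_max.
    """
    for n in range(1, n_max + 1):
        if math.sqrt(c1 * beta(n) * gamma(n) / n) <= epsilon:
            return n
    return None


# Hypothetical rates: constant scaling factor, logarithmic information capacity.
n_star = find_n_star(
    epsilon=0.5,
    c1=1.0,
    beta=lambda n: 2.0,
    gamma=lambda n: math.log(1.0 + n),
)
```

A finite answer exists whenever the product of the two rates grows sublinearly in the number of iterations, which is the case for the kernels referred to in [26].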

We are now ready to prove our main convergence theorem.

###### Theorem 2.

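Under the assumptions of Lemma 2, choose $\delta \in (0, 1)$ and $\epsilon > 0$, and define $n^*$ using Corollary 1. If $n \geq n^*$ and $\varphi(\hat{w}_n) > \epsilon$, then, with probability greater than $1 - \delta$, the following statements hold jointly:

• The closed loop satisfies $\varphi$, i.e., the controller can safely control the system in all environment scenarios, $w \in \mathcal{W}$.

• The system has been verified against all environments, $w \in \mathcal{W}$.

###### Proof.

This follows from Lemma 5 and Corollary 1. From Corollary 1, we have $\varphi(\hat{w}_n) - \varphi(w^*) \leq \epsilon$ for $n \geq n^*$. If there is an $n \geq n^*$ such that $\varphi(\hat{w}_n) > \epsilon$, then we have $\varphi(w^*) > 0$, i.e., the minimum value $\varphi$ can achieve on the closed-loop system is greater than $0$. Hence, $\varphi$ is satisfied by our system in all $w \in \mathcal{W}$. ∎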
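Read as a stopping rule, the theorem suggests a three-way outcome for a testing run. The sketch below is one possible reading, not the authors' implementation. It assumes that a negative specification value marks a violated specification, that the observed values are noise-free, and that `n_star` has been obtained from Corollary 1.

```python
def verdict(phi_samples, epsilon, n_star):
    """Classify a testing run from the specification values observed so far.

    phi_samples holds the value of the specification at each sampled scenario.
    """
    worst = min(phi_samples)
    if worst < 0.0:
        return "counterexample"   # a sampled scenario violates the specification
    if len(phi_samples) >= n_star and worst > epsilon:
        return "verified"         # enough samples and a margin larger than epsilon
    return "inconclusive"         # keep sampling
```

In practice the run can stop as soon as a counterexample appears, whereas the "verified" outcome requires both the sample budget and the safety margin.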