In recent years, research in control theory and robotics has focused on developing efficient controllers for robots that operate in the real world. Controller synthesis techniques such as reinforcement learning, optimal control, and model predictive control have been used to synthesize complex policies. However, if there is a large amount of uncertainty about the real world environment that the system interacts with, the robustness of the synthesized controller becomes critical. This is particularly true in safety-critical systems, where the actions of an autonomous agent may affect human lives. This motivates us to provably verify the properties of controllers in simulation before deployment in the real world.
In this paper, we present an active machine learning framework that verifies black-box systems against a given set of safety specifications or, alternatively, finds adversarial counterexamples to them. We test controller safety under uncertainty that arises from stochastic environments and errors in modeling. In essence, we actively search for adversarial environments under which the controller may have to operate and that lead to failure modes in simulation.
Historically, designing robust controllers has been considered in control theory [1, 2]. A common issue with these techniques is that, although they consider uncertainty, they rely on simple linear models of the underlying system. This means that resulting controllers are often either overly conservative or violate safety constraints if they fail to capture nonlinear effects.
For nonlinear models with complex dynamics, reinforcement learning has been successful at synthesizing high-fidelity controllers. Recently, algorithms based on reinforcement learning that can handle uncertainty have been proposed [3, 4, 5], where performance is measured in expectation. A fundamental issue with learned controllers is that it is difficult to provide formal guarantees for safety in the presence of uncertainty. For example, a controller for an autonomous vehicle must consider human driver behaviors, pedestrian behaviors, traffic lights, uncertainty due to sensors, etc. Without formally verifying that these controllers are indeed safe, deploying them on the road could lead to loss of property or human lives.
Formal safety certificates, i.e., mathematical proofs of safety, have been considered in the formal methods community, where safety requirements are referred to as a specification. There, the goal is to verify that the behaviors of a particular model satisfy a specification ([6, 7]). Synthesizing controllers that satisfy a high-level temporal specification has been studied in the context of motion planning and for cyber-physical systems. However, these techniques rely on simple model dynamics. For nonlinear systems, reachability algorithms based on level-set methods have been used to approximate backward reachable sets for safety verification [10, 11]. However, these methods suffer from two major drawbacks: (1) the curse of dimensionality of the state space, which limits them to low-dimensional systems; and (2) they require a priori knowledge of the system dynamics.
A dual, and often simpler, problem is falsification, which tests the system within a set of environment conditions for adversarial examples. Adversarial examples have recently been considered for neural networks [12, 13, 14, 15], where the input is typically perturbed locally in order to find counterexamples. In , the authors compute adversarial perturbations for a trained neural network policy for a subset of white-box and black-box systems. However, these local perturbations are often not meaningful for dynamic systems. Recently, [17, 18] have focused on testing closed-loop safety-critical systems with neural networks by finding “meaningful” perturbations.
The heart of research in black-box testing focuses on developing smarter search techniques that efficiently sample the uncertainty space. Indeed, in recent years, several sequential search algorithms based on heuristics such as simulated annealing, Tabu search, and CMA-ES have been suggested. Although these algorithms sample the uncertainty space efficiently, they do not utilize any of the information gathered during previous simulations.
One active method that has been used recently for testing black-box systems is Bayesian Optimization (BO), an optimization method that aims to find the global optimum of an a priori unknown function based on noisy evaluations. Typically, BO algorithms are based on Gaussian process (GP) models of the underlying function, and certain algorithms provably converge close to the global optimum. BO has been used in robotics to, for example, safely optimize the controller parameters of a quadrotor. In the testing setting, BO has been used to actively find counterexamples by treating the search problem as a minimization problem in  over adversarial control signals. However, the authors do not consider the structure of the problem and thereby violate the smoothness assumptions made by the GP model. As a result, their methods are slow to converge or may fail to find counterexamples.
In this paper, we provide a formal framework that uses BO to actively test and verify closed-loop black-box systems in simulation. We model the relation between environments and the safety specification using GPs and use BO to predict the environment scenarios most likely to cause failures in our controllers. Unlike previous approaches, we exploit structure in the problem to provide a formal way to reason across multiple safety constraints when searching for counterexamples. Hence, our approach is able to find counterexamples more quickly than previous approaches. Our main contributions are:
An active learning framework for testing and verifying robotic controllers in simulation. Our framework can find adversarial examples for a synthesized controller independent of its structure or how it was synthesized.
A common GP framework to model logical safety specifications along with theoretical analysis on when a system is verified.
II Problem Statement
We address the problem of testing complex black-box closed-loop robotic systems in simulation. We assume that we have access to a simulation of the robot that includes the control strategy, i.e., the closed-loop system. The simulator is parameterized by a set of parameters , which model all sources of uncertainty. For example, they can represent environment effects such as weather, non-deterministic components such as other agents interacting with the simulator, or uncertain parameters of the physical system, e.g., friction.
The goal is to test whether the system remains safe for all possible sources of uncertainty in . We specify these safety constraints on finite-length trajectories of the system that can be obtained by simulating the robot for a given set of environment parameters . Safety constraints on these trajectories are specified using logic. We explain this in detail in Sec. III-A, but the result is a specification that can, in general, be written as a requirement . For example, can encode state or input constraints that have to be satisfied over time.
We want to test whether there exists an adversarial example for which the specification is violated, i.e., . Typically, adversarial examples are found by randomly sampling the environment and simulating the behaviors. However, this approach does not provide any guarantees and does not allow us to conclude that no adversarial example exists if none are found in our samples. Moreover, since high-fidelity simulations can often be very expensive, we want to minimize the number of simulations that we have to carry out in order to find a counterexample.
We propose an active learning framework for testing, where we utilize the results from previous simulation runs to make more informed decisions about which environment to simulate next. In particular, we pose the search problem for a counterexample as an optimization problem,
where we want to minimize the number of queries until a counterexample is found or we can verify that no counterexample exists. The main challenge is that the functional dependence between the parameters in  and the specification is unknown a priori, since we treat the simulator as a black box. Solving this problem is difficult in general, but we can exploit regularity properties of . In particular, in the following we use a GP to model the specification and use the model to pick parameters that are likely to be counterexamples.
III Background
In this section, we give an overview of formal safety specifications and Gaussian processes, which we use in Sec. IV to verify the closed-loop black-box system.
III-A Safety Specification
In the formal methods community, complex safety requirements are expressed using automata and temporal logic [30, 31]. These allow us to specify complex constraints, which can also have temporal dependencies.
A safety constraint for a quadcopter might be that the quadcopter cannot fly at an altitude greater than 3 m when the battery level is below 30%.
In logic, we can express this as “ implies ”, which in words says that if the battery level is less than 30%, then the quadcopter flies at a height of less than 3 m.
Importantly, these kinds of specifications make no assumptions about the underlying system itself. They merely state requirements that must hold for all simulations in . Formally, a logic specification is a function that tests properties of a particular trajectory. However, we will continue to write  to denote the specification that tests trajectories generated by the simulator with parameters .
A specification consists of multiple individual constraints, called predicates, which form the basic building blocks of the logic. These predicates can be combined using a syntax or grammar of logical operations:
where  is a predicate, and is assumed to be a smooth and continuous function of a trajectory . The constraint forms the basic building block of the overall system specification . We say a predicate is satisfied if  is greater than zero and falsified otherwise. The operations represent negation, conjunction (and), and disjunction (or), respectively. These basic operations can be combined to define complex boolean formulas such as implication, , and if-and-only-if, , using the rules
Since  is a real-valued function, we can convert these boolean logic statements into an equivalent equation with continuous output, which defines the quantitative semantics,
This allows us to confirm that a logic statement holds true for all trajectories generated by simulators , by confirming that the function takes positive values for all .
In the quantitative semantics (4), the satisfaction of a requirement is no longer a yes-or-no answer, but is quantified by a real number. This quantification is similar in nature to a reward function, where lower values indicate a larger safety violation. This allows us to introduce a ranking among failures:  implies that  is a more “dangerous” failure case than . To guarantee safety, we take a pessimistic outlook and denote  as a violation and  as satisfaction of the specification .
Let us look at the specification in Example 1, . Applying the rewrite rule (3), this can be written as . Applying the quantitative semantics (4), we get , which consists of two predicates,  and . Intuitively, this means , i.e., the specification is satisfied, if the battery level is greater than 30% or if the quadcopter flies at an altitude of less than 3 m.
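As an illustration of the quantitative semantics, the example above can be sketched in Python. The 30% battery and 3 m altitude thresholds come from Example 1; the trajectory encoding as a list of (battery, height) pairs is an assumption for illustration.

```python
def spec(traj):
    """Quantitative semantics of "battery < 30% implies altitude < 3 m",
    rewritten as "battery >= 30% or altitude < 3 m".
    Disjunction becomes max; the requirement must hold at every time step,
    so we take the min over the trajectory. A positive value means satisfied."""
    return min(max(battery - 30.0, 3.0 - height) for battery, height in traj)

# low battery but low altitude: specification satisfied
safe = [(50.0, 2.0), (25.0, 2.5)]
# low battery while flying at 3.5 m: specification violated
unsafe = [(50.0, 2.0), (20.0, 3.5)]
```

The returned value also provides the ranking discussed above: the more negative it is, the more severe the violation.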
III-B Gaussian Processes
For general black-box systems, the dependence of the specification on the parameters is unknown a priori. We use a GP to approximate each predicate in the domain . We detail the modeling of in Sec. IV. The following introduction about GPs is based on .
GPs are a non-parametric regression method from machine learning, where the goal is to find an approximation of the nonlinear function that maps an environment to the function value . This is done by treating the function values at different inputs as jointly Gaussian random variables.
The Bayesian, non-parametric regression is based on a prior mean function and the kernel function , which defines the covariance between the function values at two points . We set the prior mean to zero, since we do not have any knowledge about the system. The choice of kernel function is problem-dependent and encodes assumptions about the unknown function.
We can obtain the posterior distribution of a function value at an arbitrary input by conditioning the GP distribution of  on a set of past measurements at environment scenarios , where  and  is Gaussian noise. The posterior over  is again a GP distribution, with mean , covariance , and variance , where the vector  contains the covariances between the new environment  and the environment scenarios in , the kernel matrix  has entries , with , and  is the identity matrix.
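A minimal sketch of this posterior computation, assuming a one-dimensional environment, a zero prior mean, and a squared-exponential kernel (the kernel choice, lengthscale, and noise level are illustrative):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # squared-exponential kernel k(x, x') for 1-D inputs
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """Posterior mean and variance at test points Xs, given noisy
    observations y at inputs X (zero prior mean)."""
    K = rbf(X, X) + noise * np.eye(len(X))   # kernel matrix plus noise
    ks = rbf(X, Xs)                          # covariances to the test points
    mu = ks.T @ np.linalg.solve(K, y)        # posterior mean
    v = np.linalg.solve(K, ks)
    var = rbf(Xs, Xs).diagonal() - np.sum(ks * v, axis=0)  # posterior variance
    return mu, var
```

Far away from the data, the posterior reverts to the prior (zero mean, unit variance), which is what drives exploration in the BO search below.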
III-C Bayesian Optimization (BO)
In the following, we use BO in order to find the minimum of the unknown function , which we construct from the GP models of  in Sec. IV. BO uses a GP model to query parameters that are informative about the minimum of the function. In particular, the GP-LCB algorithm from  uses the GP prediction and the associated uncertainty in (5) to trade off exploration and exploitation by, at iteration , selecting an environment according to
where  determines the confidence interval. We provide an appropriate choice for  in Theorem 1.
At each iteration, (6) selects the parameters for which the lower confidence bound of the GP is minimal. Repeatedly evaluating the true function at the samples given by (6) improves the GP model and decreases the uncertainty at candidate locations for the minimum, so that the global minimum is found eventually.
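The GP-LCB selection rule can be sketched as follows, assuming the posterior mean and standard deviation have already been evaluated on a discretized candidate set (the square-root scaling of the confidence parameter follows the usual GP-LCB form; the discretization is an illustrative simplification):

```python
import numpy as np

def gp_lcb_select(candidates, mu, sigma, beta=2.0):
    """GP-LCB acquisition step: return the candidate that minimizes the
    lower confidence bound mu - sqrt(beta) * sigma, plus that bound."""
    lcb = mu - np.sqrt(beta) * sigma
    i = int(np.argmin(lcb))
    return candidates[i], lcb[i]
```

Note that a point can be selected either because its predicted value is low (exploitation) or because its uncertainty is high (exploration), which is exactly the trade-off described above.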
IV Active Testing for Counterexamples
In this section, we show how to model specifications of the form (1) using GPs without violating smoothness assumptions, and how to use these models to find adversarial counterexamples.
In order to use BO to optimize (1), we need to construct reliable confidence intervals on . However, if we were to model  as a GP with commonly used kernels, it would need to be a smooth function of . Even though the predicates are typically smooth functions of the trajectories, and hence smooth in , conjunction and disjunction ( and ) in (4) are non-smooth operators that render  non-smooth as well. Instead, we exploit the structure of the specification and decompose  into a parse tree, whose leaf nodes are the predicates.
Definition 1 (Parse Tree).
Given a specification formula , the corresponding parse tree, , has leaf nodes that correspond to function predicates, while all other nodes are disjunctions () and conjunctions ().
A parse tree is an equivalent graphical representation of . For example, consider the specification
where the second equality follows from De Morgan’s law. We can obtain an equivalent function with (4),
We now model each predicate in the parse tree of  with a GP and combine the models through the parse tree to obtain confidence intervals on the overall specification for BO. GP-LCB as expressed in (6) can be used to search for the minimum with a single GP. The key insight for extending (6) across multiple GPs is that the minimum of (1) is, with high probability, lower-bounded by the lower confidence bound of one of the GPs used to model the predicates of . This is because the  and  operators do not change the values of the predicates, but only select between them. As a consequence, we can model the smooth parts of , i.e., the predicates, using GPs and then account for the non-smoothness through the parse tree.
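This propagation of per-predicate confidence bounds through the parse tree can be sketched as a small recursion, where conjunction maps to min and disjunction to max (the tuple-based tree encoding is an assumption for illustration):

```python
def lcb_of_spec(node):
    """Lower confidence bound of a specification parse tree.
    Leaves are per-predicate lower bounds mu_i - beta * sigma_i (floats);
    internal nodes are ('and', [children]) or ('or', [children])."""
    if isinstance(node, tuple):
        op, children = node
        vals = [lcb_of_spec(c) for c in children]
        # conjunction -> min of the bounds, disjunction -> max
        return min(vals) if op == 'and' else max(vals)
    return float(node)
```

Since min and max only select among their arguments, a valid lower bound on every leaf yields a valid lower bound on the root.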
For each predicate in the parse tree of , we construct a lower confidence bound , where  and  are the mean and standard deviation of the corresponding GP. From this, we can construct a lower confidence bound on  as , where we replace the th leaf node of the parse tree with the pessimistic prediction of the corresponding GP. Similar to (6), the corresponding acquisition function for BO uses this lower bound to select the next evaluation point,
Intuitively, the next environment selected for simulation is the one that minimizes the worst-case prediction of . Effectively, we propagate the confidence intervals associated with the GP for each predicate through the parse tree in order to obtain predictions about  directly. Note that (9) does not return an environment sample that minimizes the satisfaction of all predicates; it only minimizes the lower bound on .
Algorithm 1 describes our active testing procedure. The algorithm first computes the parse tree from the specification . At each iteration of BO, we select new environment parameters according to (9). We then simulate the system with parameters  and evaluate each predicate on the simulated trajectories. Lastly, we update each GP with the corresponding measurement of . The algorithm either returns a counterexample that minimizes (1), or, when the lower bound is greater than zero, we can conclude that the system has been verified.
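A self-contained sketch of this active testing loop, assuming a one-dimensional environment space, a squared-exponential kernel, and a discretized candidate set (all hyperparameters are illustrative, not those used in our experiments):

```python
import numpy as np

def rbf(A, B, ell=0.5):
    # squared-exponential kernel for 1-D inputs
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def posterior(X, y, Xs, noise=1e-4):
    # GP posterior mean and standard deviation (zero prior mean)
    K = rbf(X, X) + noise * np.eye(len(X))
    ks = rbf(X, Xs)
    mu = ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def active_test(preds, spec_lb, domain, n_init=5, n_iter=30, beta=2.0, seed=0):
    """Sketch of Algorithm 1: one GP per predicate, BO on the propagated bound.
    `spec_lb` combines a list of per-predicate lower bounds (the parse tree)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(*domain, size=n_init)
    Y = np.array([[p(w) for p in preds] for w in X])
    cand = np.linspace(*domain, 200)
    for _ in range(n_iter):
        # lower confidence bound for each predicate, then propagate
        lbs = [mu - beta * sd
               for mu, sd in (posterior(X, Y[:, i], cand)
                              for i in range(len(preds)))]
        w = cand[int(np.argmin(spec_lb(lbs)))]    # acquisition step
        vals = np.array([p(w) for p in preds])    # run the "simulation"
        X, Y = np.append(X, w), np.vstack([Y, vals])
        if spec_lb([np.array([v]) for v in vals])[0] < 0:
            return w                              # counterexample found
    return None
```

For example, for a disjunction of two predicates, `spec_lb` is `lambda lbs: np.maximum(lbs[0], lbs[1])`; the loop then returns an environment where the true specification value is negative, if it finds one within the iteration budget.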
IV-A Theoretical Results
We can transfer theoretical convergence results for GP-LCB  to the setting of Algorithm 1. To do this, we need to make structural assumptions about the predicates. In particular, we assume that they have bounded norm in the Reproducing Kernel Hilbert Space (RKHS, ) that corresponds to the GP’s kernel. These are well-behaved functions of the form with representer points and weights that decay sufficiently quickly. We leverage theoretical results from  and  that allow us to build reliable confidence intervals using the GP models from Sec. III-B. We have the following result.
Theorem 1. Assume that each predicate has an RKHS norm bounded by  and that the measurement noise is -sub-Gaussian. Select , according to (9), and let . If , then with probability at least  we have that  and the system has been verified against all environments in .
Here,  is the mutual information between , the noisy measurements of , and the GP prior of . This function was shown to be sublinear in  for many commonly used kernels in ; see the appendix for more details. Theorem 1 states that we can verify the system against adversarial examples with high probability by checking whether the worst-case lower confidence bound is greater than zero. We provide additional theoretical results on the existence of a finite  such that the system can be verified up to accuracy  in the appendix.
V Experiments
In this section, we evaluate our method on several challenging test cases. A Python implementation of our framework and the following experiments can be found at https://github.com/shromonag/adversarial_testing.git
In order to use Algorithm 1, we have to solve the optimization problem (9). In practice, different optimization techniques have been proposed to find the global minimum of a function. One popular algorithm is DIRECT, a gradient-free optimization method. An alternative is to use gradient-based methods together with random restarts. In particular, we sample a large number of potential environment scenarios at random from , and run separate optimization routines to minimize (9) starting from these.
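A minimal sketch of the random-restart strategy, using SciPy's L-BFGS-B as the local gradient-based optimizer (the number of restarts and the test function below are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def multistart_minimize(f, bounds, n_starts=20, seed=0):
    """Gradient-based minimization with random restarts over a box domain.
    Returns the best local minimum found across all restarts."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    best_x, best_f = None, np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)                 # random restart point
        res = minimize(f, x0, method="L-BFGS-B", bounds=bounds)
        if res.fun < best_f:
            best_x, best_f = res.x, float(res.fun)
    return best_x, best_f
```

Each restart can only find a local minimum; sampling many starting points makes it likely that at least one lands in the basin of the global minimum.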
Another challenge is that the dimensionality of the optimization problem can often be very large. However, methods that allow for more efficient computation do exist. These methods reduce the effective size of the input space and thereby make the optimization problem more tractable. One possibility is to use random embeddings to reduce the input dimension, as done in Random Embedding Bayesian Optimization (REMBO). We can then model the GP in this smaller input dimension and carry out BO in the lower-dimensional input space.
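The random-embedding idea can be sketched as follows: draw a random projection matrix once, optimize over a low-dimensional point, and map it into the simulator's input space (the clipping to a box domain follows the REMBO construction; the dimensions and box bounds are illustrative):

```python
import numpy as np

def make_embedding(d_low, d_high, seed=0):
    """REMBO-style random embedding: BO runs over z in d_low dimensions,
    while the simulator is evaluated at the projected d_high-dim point."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d_high, d_low))   # random projection, drawn once
    def embed(z, lo=-1.0, hi=1.0):
        # map the low-dimensional point into the simulator's box domain
        return np.clip(A @ z, lo, hi)
    return embed
```

The GP is then defined over z rather than the full input, which keeps the kernel and the acquisition optimization low-dimensional.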
V-A Modeling Smooth Functions vs. Non-Smooth Functions
In the following, we show the effectiveness of modeling smooth functions by GPs and considering the non-smooth operations in the BO search as opposed to modeling the non-smooth function by a single GP.
Consider the following illustrative optimization problem,
Fig. 2(b) shows that the confidence interval of the GP (shaded blue) with its mean estimate (blue line) does not capture the true function (orange). In fact, the minimum (red star) is not contained within the shaded region, causing the optimization to diverge: BO converges to the green dot, which is not a counterexample. Instead, modeling the two predicates individually and combining them with the parse tree leads to the model in Fig. 2(c). Here, the true function is completely captured in the confidence interval. As a consequence, BO converges to the global minimum (the red star and green dot coincide).
We consider two modeling scenarios: one where we model  as a single GP, and another where we model  by one GP and  by another. We initialize the GP models for ,  and  with 5 samples chosen at random. We then use BO to find . We were able to model smooth functions such as  and  well with GPs, even with few samples. At each iteration of BO, we computed the next sample by solving for the  that minimized the maximum across the two GPs. This quickly stabilizes to the true  (Fig. 2(c)). When we model  with a single GP (Fig. 2(b)), the initial 5 samples were not able to model it well. In fact, the original function (orange) is not contained within the uncertainty bounds of the GP. Hence, in each iteration of BO, where we chose the  that minimized this model, we were never able to converge. It is not surprising, given these models, that BO does not always converge when we model non-smooth functions such as the one in (10) with a single GP.
To support our claim, we repeated this experiment 15 times with different initial samples. In each experiment, we ran BO for 50 iterations. When modeling  and  as separate GPs, BO stabilized to  in about 5 iterations in all 15 experiments. However, when modeling  as a single GP, it took over 35 iterations to converge, and in 5 out of the 15 cases it did not converge to . We show these two different behaviors in Fig. 4.
V-B Collision Avoidance with High-Dimensional Uncertainty
Consider an autonomous car that travels on a straight road with an obstacle at . We require that the car come to a stop before colliding with the obstacle. The car has two states, location  and velocity , and one control input, acceleration . The dynamics of the car are given by,
Our safety specification for collision avoidance is given by , i.e., the minimum distance between the position of the car and the obstacle over a horizon of length . We assume that the car does not know where the obstacle is a priori, but receives the location of the obstacle through a sensor at each time instant, . The controller is a simple linear state-feedback controller, , such that at time , .

We assume that the car initially starts at location , with velocity . Let the obstacle be at , which is not known to the car. Instead, the car receives sensor readings for the location of the obstacle such that . If  is negative, then  for some , which signifies a collision. Moreover, we constrain the acceleration to lie in .
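A sketch of this setup, assuming a discrete-time double-integrator model and an illustrative clipped braking feedback law (the gains, time step, and the standoff distance at which braking begins are assumptions for illustration, not the paper's values):

```python
import numpy as np

def simulate(sensor, x0=0.0, v0=5.0, gains=(1.0, 2.0), dt=0.1, a_max=3.0):
    """Discrete-time double integrator with a clipped braking feedback law.
    `sensor` yields the (possibly corrupted) obstacle position at each step."""
    x, v, traj = x0, v0, []
    for s in sensor:
        # brake proportionally to speed and to proximity to the sensed obstacle;
        # the 5.0 m standoff and the gains are illustrative assumptions
        a = float(np.clip(-gains[0] * v - gains[1] * (x - s + 5.0), -a_max, a_max))
        v = max(v + a * dt, 0.0)   # the car does not reverse
        x = x + v * dt
        traj.append(x)
    return traj

def spec(traj, obstacle=10.0):
    # safety predicate: minimum distance to the obstacle along the trajectory
    return min(obstacle - x for x in traj)
```

With unbiased sensor readings the car stops short of the obstacle, so the specification is positive; a constant sensor corruption that reports the obstacle farther away than it is produces a collision, i.e., a counterexample with a negative specification value.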
The domain of our uncertainty is , i.e., the sensor readings over the horizon . We compare three experimental setups: first, we model the GP in the original space of , i.e., with  inputs; second, we model the GP in a lower-dimensional input space as described in the preamble of this section; and third, we randomly sample inputs and test them. We run BO for 250 iterations on the GPs, and consider 250 random samples for random testing. We repeat this experiment 10 times and show our results in Fig. 5.
The green and blue bars in Fig. 5 show the average number of counterexamples returned by running BO on the GP defined over the original input space and over the low-dimensional input space, respectively. In general, active testing in the original high-dimensional input space gives the best results, which deteriorate with increasing compression of the input space. Random testing, shown in red, performs the worst. This is not surprising, since (1)  samples are not sufficient to cover an input space of  dimensions uniformly; and (2) the samples are all independent of each other. Moreover, in the uncompressed input case, the specification evaluated at the worst counterexample, , has a mean and standard deviation of  and , as compared to  and  for random sampling.
V-C OpenAI Gym Environments
We interfaced our tool with environments from OpenAI Gym  to test controllers from OpenAI Baselines . For brevity, we refer the reader to  for the details of the environments. In both case studies, we introduce uncertainty around the parameters the controller was trained on. The rationale behind this is that the parameters in a simulator are only estimates of the true values. This ensures that the counterexamples found can indeed occur in the real system.
V-C1 Reacher Environment
In the reacher environment, we have a 2D robot trying to reach a target. For this environment, we have six sources of uncertainty: two for the goal position, , two for state perturbations, , and two for velocity perturbations, . The state of the reacher is a tuple with the current location , velocity , and rotation . A trajectory of the system, , is a sequence of states over time, i.e., . Our uncertainty space is . Given an instance of , the trajectory  of the system is uniquely defined.
We trained a controller using the Proximal Policy Optimization (PPO)  implementation available in OpenAI Baselines. We determine a trajectory to be safe if either the reacher reaches the goal or it does not rotate unnecessarily. This can be captured as , where  is the minimum distance between the trajectory and the goal position, and  is the total rotation accumulated over the trajectory; its continuous variant is .
Using our modeling approach, we model this using two GPs, one for and another for . We compare this to modeling as a single GP and random sampling. We run 200 BO iterations and consider 200 random samples for random testing. We repeat this experiment 10 times.
In Fig. 6, we plot the number of counterexamples found by each of the three methods over 10 runs of the experiment. Modeling the predicates with separate GPs and applying BO across them (shown in green) consistently performs better than applying BO to a single GP model of  (shown in blue) and than random testing (shown in red). We see that random testing performs very poorly and in some cases (experiment runs ) finds no counterexamples.
By modeling the predicates separately, the specification evaluated at the worst counterexample, , has a mean and standard deviation of  and , as compared to  and  when considering a single GP. This suggests that, using our modeling paradigm, BO converges (since the standard deviation is small) to a more falsifying counterexample (since the mean is smaller).
V-C2 Mountain Car Environment
The mountain car environment in OpenAI Gym is a car on a one-dimensional track, positioned between two mountains. The goal is to drive the car up the mountain on the right. The environment comes with one source of uncertainty, the initial state . We introduce four other sources of uncertainty: the initial velocity, ; the goal location, ; the maximum speed, ; and the maximum power magnitude, . The state of the mountain car is a tuple with the current location  and velocity . A trajectory of the system, , is a sequence of states over time, i.e., . Our uncertainty space is given by . Given an instance of , the trajectory  of the system is uniquely defined.
We trained two controllers, one using PPO and another using an actor-critic method (DDPG) for continuous deep Q-learning . We determine a trajectory to be safe if it reaches the goal quickly, or if it does not deviate too much from its initial location and always maintains its velocity within some bound. Our safety specification can be written as , where  is the time taken to reach the goal,  is the deviation from the initial location, and  is the deviation from the velocity bound; its continuous variant is . We model  by modeling each predicate, , with a GP. We compare this to modeling  with a single GP and to random sampling. We run 200 BO iterations for the GPs and consider 200 random samples for random testing. We repeat this experiment 10 times.

We show our results in Fig. 7, where we plot the number of counterexamples found by each of the three methods over 10 runs of the experiment for each controller. Fig. 7 demonstrates the strength of our approach: the number of counterexamples found by our method (green bars) is much higher than for random sampling (red) and for modeling  as a single GP (blue). In Fig. 6(a), the blue bars are smaller than even the red ones, suggesting that random sampling performs better than applying BO to a single GP model of . This is because the GP is not able to model , and is so far from the true model that the samples returned by BO are worse than if we were to sample randomly.
This is further highlighted by the value of the specification at the worst counterexample, . The mean and standard deviation of  over the 10 experiment runs are  and  for our method;  and  when  is modeled as a single GP; and  and  for random sampling. A similar but less drastic result holds for the controller trained with DDPG.
VI Conclusion
We presented an active testing framework that uses Bayesian Optimization to test and verify closed-loop robotic systems in simulation. Our framework handles complex logic specifications and models them efficiently using Gaussian processes in order to find adversarial examples faster. We showed the effectiveness of our framework on controllers designed for OpenAI Gym environments. As future work, we would like to extend this framework to test more complex robotic systems and to find regions in the environment parameter space where the closed-loop control is expected to fail.
Research reported in this paper was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-17-2-0196, and in part by Toyota under the iCyPhy center. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
-  S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence, and Robustness. Prentice-Hall, Inc., 1989.
-  R. F. Stengel, Stochastic Optimal Control: Theory and Application. John Wiley & Sons, Inc., 1986.
-  G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, “Uncertainty-aware reinforcement learning for collision avoidance,” CoRR, vol. abs/1702.01182, 2017.
-  Y. Niv, D. Joel, I. Meilijson, and E. Ruppin, “Evolution of reinforcement learning in uncertain environments: A simple explanation for complex foraging behaviors,” 2002.
-  P. Poupart and N. Vlassis, “Model-based Bayesian reinforcement learning in partially observable domains,” in Proc. Int. Symp. on Artificial Intelligence and Mathematics, 2008, pp. 1–2.
-  E. M. Clarke, Jr., O. Grumberg, and D. A. Peled, Model Checking. MIT Press, 1999.
-  I. Mitchell and C. J. Tomlin, “Level set methods for computation in hybrid systems,” in International Workshop on Hybrid Systems: Computation and Control. Springer, 2000, pp. 310–323.
-  A. Bhatia, L. E. Kavraki, and M. Y. Vardi, “Sampling-based motion planning with temporal goals,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 2689–2696.
-  V. Raman, A. Donzé, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, and S. A. Seshia, “Model predictive control with signal temporal logic specifications,” in Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on. IEEE, 2014, pp. 81–87.
-  I. Mitchell, A. Bayen, and C. J. Tomlin, “Computing reachable sets for continuous dynamic games using level set methods,” Submitted January, 2004.
-  F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proc. of Neural Information Processing Systems (NIPS), 2017.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
-  N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against deep learning systems using adversarial examples,” arXiv preprint arXiv:1602.02697, 2016.
-  V. Behzadan and A. Munir, “Vulnerability of deep reinforcement learning to policy induction attacks,” CoRR, vol. abs/1701.04143, 2017.
-  N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in Security and Privacy (SP). IEEE, 2017, pp. 39–57.
-  S. H. Huang, N. Papernot, I. J. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial attacks on neural network policies,” CoRR, vol. abs/1702.02284, 2017.
-  T. Dreossi, A. Donzé, and S. A. Seshia, “Compositional falsification of cyber-physical systems with machine learning components,” in NASA Formal Methods Symposium. Springer, 2017, pp. 357–372.
-  K. Pei, Y. Cao, J. Yang, and S. Jana, “DeepXplore: Automated whitebox testing of deep learning systems,” in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP ’17, 2017.
-  A. Donzé, “Breach, a toolbox for verification and parameter synthesis of hybrid systems,” in CAV. Springer, 2010, pp. 167–170.
-  P. S. Duggirala, S. Mitra, M. Viswanathan, and M. Potok, “C2E2: A verification tool for stateflow models,” in Tools and Algorithms for the Construction and Analysis of Systems, C. Baier and C. Tinelli, Eds., 2015.
-  Y. Annpureddy, C. Liu, G. E. Fainekos, and S. Sankaranarayanan, “S-taliro: A tool for temporal logic falsification for hybrid systems.” in TACAS, vol. 6605. Springer, 2011, pp. 254–257.
-  J. Deshmukh, X. Jin, J. Kapinski, and O. Maler, “Stochastic local search for falsification of hybrid systems,” in ATVA. Springer, 2015.
-  N. Hansen, “The CMA evolution strategy: A tutorial,” arXiv preprint arXiv:1604.00772, 2016.
-  J. Mockus, Bayesian approach to global optimization: theory and applications. Springer Science & Business Media, 2012, vol. 37.
-  C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
-  N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: no regret and experimental design,” IEEE Transactions on Information Theory, vol. 58, 2012.
-  F. Berkenkamp, A. Krause, and A. P. Schoellig, “Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics,” arXiv, 2016.
-  J. V. Deshmukh, M. Horvat, X. Jin, R. Majumdar, and V. S. Prabhu, “Testing cyber-physical systems through bayesian optimization,” Transactions on Embedded Computing Systems, 2017.
-  R. Alur and D. L. Dill, “A theory of timed automata,” Theoretical computer science, vol. 126, no. 2, pp. 183–235, 1994.
-  A. Pnueli, “The temporal logic of programs,” in Proceedings of the 18th Annual Symposium on Foundations of Computer Science, ser. SFCS ’77, 1977, pp. 46–57.
-  O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” in FORMATS/FTRTFT. Springer, 2004.
-  I. Steinwart and A. Christmann, Support vector machines. Springer Science & Business Media, 2008.
-  S. R. Chowdhury and A. Gopalan, “On kernelized multi-armed bandits,” in ICML, 2017, pp. 844–853.
-  D. E. Finkel, “Direct optimization algorithm user guide,” 2003.
-  Z. Wang, M. Zoghi, F. Hutter, D. Matheson, N. De Freitas et al., “Bayesian optimization in high dimensions via random embeddings.” in IJCAI, 2013, pp. 1778–1784.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.
-  “Open AI Baselines,” https://github.com/openai/baselines.
-  “Open AI Gym Environments,” https://gym.openai.com/envs.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
Appendix A Proofs
In this section, we prove the convergence of our algorithm under specified regularity assumptions on the underlying predicates. Consider a specification $\varphi$ over environment scenarios $w \in W$, composed of predicates $\mu_1(w), \dots, \mu_q(w)$ through $\min$ and $\max$ operators,
where $q$ represents the number of predicates. Let the domain of the predicate indices be represented by $\mathcal{I} = \{1, \dots, q\}$. The convergence proofs for classical Bayesian optimization in [26, 33] proceed by building reliable confidence intervals for the underlying function and then showing that these confidence intervals concentrate quickly enough at the location of the optimum under the proposed evaluation strategy. For ease of exposition, we assume that measurements of each predicate are corrupted by the same measurement noise.
To leverage these proofs, we need to account for the fact that our GP model is composed of several individual predicates and that we obtain one measurement for each predicate at every iteration of the algorithm.
We start by defining a composite function $g(w, i) = \mu_i(w)$, which returns the function value of the individual predicate indexed by $i \in \mathcal{I}$.
The function $g$ is a single-output function, which can be modeled with a single GP with a scalar output over the extended input space $W \times \mathcal{I}$. For example, if we assume that the predicates are independent of each other, the kernel function for $g$ would look like
$$k\big((w, i), (w', i')\big) = \mathbb{1}_{[i = i']}\, k_i(w, w'), \qquad (14)$$
where $k_i$ is the kernel function corresponding to the GP for the $i$th predicate, $\mu_i$. It is straightforward to include correlations between functions in this formulation too.
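As an illustration, the independent-predicate kernel over the extended input space can be sketched as follows. The squared-exponential per-predicate kernels and their length scales are hypothetical choices; the construction only requires some valid kernel $k_i$ for each predicate.

```python
import numpy as np

# Per-predicate kernels k_i; the squared-exponential choice and the
# length scales here are hypothetical -- any valid kernel works.
def rbf(w, w_prime, length_scale):
    d = np.asarray(w, dtype=float) - np.asarray(w_prime, dtype=float)
    return float(np.exp(-0.5 * d @ d / length_scale**2))

def composite_kernel(x, x_prime, length_scales):
    """Kernel over the extended input space (w, i) for independent
    predicates: k((w, i), (w', i')) = 1[i == i'] * k_i(w, w')."""
    (w, i), (w_p, i_p) = x, x_prime
    if i != i_p:
        return 0.0  # distinct predicates are modeled as independent
    return rbf(w, w_p, length_scales[i])

# Two predicates over a 2-D environment parameter space.
k_same = composite_kernel(((0.0, 0.0), 0), ((0.0, 0.0), 0), [1.0, 2.0])
k_diff = composite_kernel(((0.0, 0.0), 0), ((0.0, 0.0), 1), [1.0, 2.0])
print(k_same, k_diff)  # 1.0 0.0
```

A Gram matrix built from this kernel over all evaluated $(w, i)$ pairs is exactly the block-diagonal matrix one would obtain by stacking the $q$ independent per-predicate GPs.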
This reformulation allows us to build reliable confidence intervals on the underlying predicates, given regularity assumptions. In particular, we make the assumption that the function $g$ has bounded norm in the Reproducing Kernel Hilbert Space (RKHS, $\mathcal{H}_k$) corresponding to the same kernel $k$ that is used for the GP on $g$.
Note that this model is more general than the case where we assume that each predicate $\mu_i$ individually has bounded RKHS norm $\|\mu_i\|_{k_i} \leq B_i$. In this case, the function $g$ has RKHS norm with respect to the kernel in (14) bounded by $\big(\sum_{i} B_i^2\big)^{1/2}$.
Assume that $g$ has RKHS norm bounded by $B$ and that the measurements are corrupted by $R$-sub-Gaussian noise. If $\beta_t^{1/2} = B + R\sqrt{2\big(I(\mathbf{y}_t; g) + 1 + \ln(1/\delta)\big)}$, then the following holds for all environment scenarios $w \in W$, predicate indices $i \in \mathcal{I}$, and iterations $t \geq 1$ jointly with probability at least $1 - \delta$:
$$\big|\, g(w, i) - m_{t-1}(w, i) \,\big| \leq \beta_t^{1/2} \sigma_{t-1}(w, i),$$
where $m_{t-1}$ and $\sigma_{t-1}$ denote the posterior mean and standard deviation of the GP on $g$.
The scaling factor for the confidence intervals, $\beta_t$, depends on the mutual information between the GP model of $g$ and the measurements of the individual predicates that we have obtained for each time step so far. It can easily be computed as
$$I(\mathbf{y}_t; g) = \tfrac{1}{2} \ln \det\!\big(\mathbf{I} + \sigma^{-2} \mathbf{K}_t\big) = \tfrac{1}{2} \sum_{j=1}^{t} \sum_{i=1}^{q} \ln\!\big(1 + \sigma^{-2} \sigma_{j-1}^2(w_j, i)\big), \qquad (16)$$
where $\mathbf{K}_t$ is the kernel matrix of the single GP over the extended parameter space and the inner sum in the second equation indicates the fact that we obtain $q$ measurements at every iteration.
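The log-determinant form of the mutual information is straightforward to compute numerically. A minimal sketch, where the kernel matrix and noise variance are placeholders:

```python
import numpy as np

def mutual_information(K, noise_var):
    """I(y; g) = 0.5 * logdet(I + noise_var^{-1} K), where K is the
    kernel matrix of the single GP over the extended parameter space,
    evaluated at all (w, i) pairs measured so far."""
    n = K.shape[0]
    # slogdet is numerically stabler than log(det(...)) for large n.
    _, logdet = np.linalg.slogdet(np.eye(n) + K / noise_var)
    return 0.5 * logdet

# Placeholder: 3 measurements, unit prior variance, unit noise variance.
print(mutual_information(np.eye(3), 1.0))  # 1.5 * ln(2) ≈ 1.0397
```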
Based on these individual confidence intervals on $g$, we can construct confidence intervals on $\varphi$. In particular, let
$$l_t(w, i) = m_{t-1}(w, i) - \beta_t^{1/2} \sigma_{t-1}(w, i), \qquad u_t(w, i) = m_{t-1}(w, i) + \beta_t^{1/2} \sigma_{t-1}(w, i)$$
be the lower and upper confidence intervals on each predicate. From this, we construct reliable confidence intervals on $\varphi$ by propagating $l_t$ and $u_t$ through the parse tree of $\varphi$, taking the $\min$ ($\max$) of the children's bounds at each $\min$ ($\max$) node.
Under the assumptions of Lemma 1, let $\mathcal{T}$ be the parse tree corresponding to $\varphi$. Then the following holds for all environment scenarios $w \in W$ and iterations $t \geq 1$ jointly with probability at least $1 - \delta$:
$$l_t^{\varphi}(w) \;\leq\; \varphi(w) \;\leq\; u_t^{\varphi}(w),$$
where $l_t^{\varphi}$ and $u_t^{\varphi}$ are obtained by propagating $l_t$ and $u_t$ through $\mathcal{T}$.
This is a direct consequence of Lemma 1 and the properties of the $\min$ and $\max$ operators. ∎
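The bound propagation behind Lemma 2 can be sketched as follows. The tree encoding (leaves are predicate indices, internal nodes are `"min"`/`"max"` over subtrees) is a hypothetical choice; the argument only requires min/max composition and the monotonicity of both operators in each argument.

```python
# Parse tree of a specification: a leaf is a predicate index, an
# internal node is ("min" | "max", [subtrees]).

def propagate(tree, lower, upper):
    """Propagate per-predicate confidence bounds l_t(w, i), u_t(w, i)
    through the parse tree. Because min and max are monotone in each
    argument, the returned pair brackets the specification value."""
    if isinstance(tree, int):
        return lower[tree], upper[tree]
    op, children = tree
    bounds = [propagate(c, lower, upper) for c in children]
    f = min if op == "min" else max
    return f(b[0] for b in bounds), f(b[1] for b in bounds)

# phi = min(mu_0, max(mu_1, mu_2)) at a fixed environment scenario w.
phi = ("min", [0, ("max", [1, 2])])
print(propagate(phi, lower=[0.1, -0.5, 0.3], upper=[0.4, 0.0, 0.6]))
# (0.1, 0.4)
```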
We are now able to prove the main theorem as a direct consequence of Lemma 2.
A-A Convergence proof
In the following, we prove a stronger result about convergence of our algorithm.
The key quantity in the behavior of the algorithm is the mutual information in (16). Importantly, it was shown in [26] that it can be upper bounded by the worst-case mutual information, the information capacity, which in turn was shown to be sublinear by [26]. In particular, let $\mathbf{y}_T$ denote the noisy measurements obtained when evaluating the function $g$ at the points in $\{w_1, \dots, w_T\} \times \mathcal{I}$. The mutual information obtained by the algorithm can be bounded according to
$$I(\mathbf{y}_T; g) \leq \gamma_{qT},$$
where $\gamma_n$ is the worst-case mutual information that we can obtain from $n$ measurements,
$$\gamma_n = \max_{A \subseteq W \times \mathcal{I},\; |A| = n} I(\mathbf{y}_A; g).$$
This quantity was shown to be sublinear in $n$ for many commonly used kernels in [26].
A key quantity to show convergence of the algorithm is the instantaneous regret
$$r_t = \varphi(w_t) - \varphi(w^*),$$
the difference in specification value between the unknown true minimizer $w^*$ of $\varphi$ and the environment parameters $w_t$ that Algorithm 1 selects at iteration $t$. If the instantaneous regret is equal to zero, the algorithm has converged.
In the following, we will show that the cumulative regret, $R_T = \sum_{t=1}^{T} r_t$, is sublinear in $T$, which implies convergence of Algorithm 1.
We start by bounding the regret in terms of the confidence intervals on $g$.
Fix $t \geq 1$. If $|g(w, i) - m_{t-1}(w, i)| \leq \beta_t^{1/2} \sigma_{t-1}(w, i)$ for all $w \in W$ and $i \in \mathcal{I}$, then the instantaneous regret is bounded by $r_t \leq 2 \beta_t^{1/2} \max_{i \in \mathcal{I}} \sigma_{t-1}(w_t, i)$.
The proof is analogous to [26, Lemma 5.2]. The maximum standard deviation follows from the properties of the $\min$ and $\max$ operators in the parse tree $\mathcal{T}$. In particular, let $m = \min(a, b)$ and $\hat{m} = \min(\hat{a}, \hat{b})$ with $|a - \hat{a}| \leq \epsilon$ and $|b - \hat{b}| \leq \epsilon$. Then for all such $a, b, \hat{a}, \hat{b}$ we have that
$$|m - \hat{m}| \leq \max\big(|a - \hat{a}|, |b - \hat{b}|\big) \leq \epsilon.$$
The $\max$ operator is analogous. Thus, since the parse tree is composed only of $\min$ and $\max$ nodes, the regret is bounded by the maximum error over all predicates. The result follows. ∎
Pick $\delta \in (0, 1)$ and $\beta_t$ as shown in Lemma 1. Then the following holds with probability at least $1 - \delta$:
$$R_T = \sum_{t=1}^{T} r_t \leq \sqrt{C\, T\, \beta_T\, \gamma_{qT}},$$
where $r_t$ is the regret between the true minimizing environment scenario, $w^*$, and the current sample, $w_t$; and $C = 8 / \ln(1 + \sigma^{-2})$.
Since $R_T = \sum_{t=1}^{T} r_t$, from the Cauchy-Schwarz inequality we have $R_T \leq \sqrt{T \sum_{t=1}^{T} r_t^2}$. The rest follows from Lemma 4. ∎
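The Cauchy-Schwarz step in this proof can be expanded into the usual GP-UCB chain of inequalities; the constant $C$ and the final variance-to-information bound are assumed from the standard analysis in [26], sketched here for completeness:

```latex
R_T = \sum_{t=1}^{T} r_t
    \;\le\; \sqrt{T \sum_{t=1}^{T} r_t^2}
    \;\le\; \sqrt{T \sum_{t=1}^{T} 4\, \beta_T \max_{i \in \mathcal{I}} \sigma_{t-1}^2(w_t, i)}
    \;\le\; \sqrt{C\, T\, \beta_T\, \gamma_{qT}},
\qquad C = \frac{8}{\ln(1 + \sigma^{-2})}.
```

The second inequality uses the per-step regret bound, and the last uses the fact that the summed posterior variances are bounded by a constant multiple of the information capacity.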
We introduce some notation. Let
$$\hat{w}_T = \operatorname*{arg\,min}_{w \in \{w_1, \dots, w_T\}} \varphi(w)$$
be the minimizing environment scenario sampled by BO in $T$ iterations, and let
$$w^* = \operatorname*{arg\,min}_{w \in W} \varphi(w)$$
be the unknown, optimal parameter.
For any $\delta \in (0, 1)$ and $\epsilon > 0$, there exists a $T^*$,
$$T^* = \min\Big\{ T \in \mathbb{N} : \sqrt{C\, \beta_T\, \gamma_{qT} / T} \leq \epsilon \Big\},$$
such that $\varphi(\hat{w}_T) - \varphi(w^*) \leq \epsilon$ holds for all $T \geq T^*$ with probability at least $1 - \delta$.
We are now ready to prove our main convergence theorem.
The closed-loop system satisfies the specification $\varphi$, i.e., the controller can safely control the system in all environment scenarios: the system has been verified against all environments.