Safe Neurosymbolic Learning with Differentiable Symbolic Execution

03/15/2022
by   Chenxi Yang, et al.

We study the problem of learning worst-case-safe parameters for programs that use neural networks as well as symbolic, human-written code. Such neurosymbolic programs arise in many safety-critical domains. However, because they can use nondifferentiable operations, it is hard to learn their parameters using existing gradient-based approaches to safe learning. Our approach to this problem, Differentiable Symbolic Execution (DSE), samples control flow paths in a program, symbolically constructs worst-case "safety losses" along these paths, and backpropagates the gradients of these losses through program operations using a generalization of the REINFORCE estimator. We evaluate the method on a mix of synthetic tasks and real-world benchmarks. Our experiments show that DSE significantly outperforms the state-of-the-art DiffAI method on these tasks.


1 Introduction

Safety on worst-case inputs has recently emerged as a key challenge in deep learning research. Formal verification of neural networks (albarghouthi2021introduction) is an established response to this challenge. In particular, a body of recent work (mirman2018differentiable; madry2017towards; cohen2019certified; singh2018fast) seeks to incorporate formal verification into the training of neural networks. DiffAi, among the most prominent of such approaches, uses a neural network verifier to construct a differentiable, worst-case safety loss for the learner. This loss is used to regularize a standard data-driven loss, biasing the learner towards parameters that are both performant and safe.

A weakness of these methods is that they only consider functional properties (such as adversarial robustness) of isolated neural networks. By contrast, in real-world applications, neural networks are often embedded within human-written symbolic code, and correctness requirements apply to the neurosymbolic programs (chaudhuri2021neurosymbolic) obtained by composing the two. For example, consider a car directed by a neural controller (qin2000overview). Safety properties for the car are functions of its trajectories, and these trajectories depend not just on the controller but also the symbolic equations that define the environment. While recent work (christakis2021automated) has studied the verification of such neurosymbolic programs, there is no prior work on integrating verification and learning for such systems.

In this paper, we present the first steps towards such an integration. The fundamental technical challenge here is that while a neural network is differentiable in its parameters, the code surrounding it may be non-differentiable or even discontinuous. This makes our problem a poor fit for methods like DiffAi, which were designed for fully differentiable learning objectives.

We solve the problem using a new method, Differentiable Symbolic Execution (Dse), that can estimate gradients of worst-case safety losses of nondifferentiable neurosymbolic programs. Our method approximates the program’s worst-case safety loss by an integral over the safety losses along individual control flow paths in the program. We compute the gradients of this integral using a generalization of the classic Reinforce estimator (williams1992simple). The gradient updates resulting from this process balance two goals: (i) choosing the parameters of individual paths so that program trajectories that follow them are safer, and (ii) learning the parameters of conditional branches so that the program is discouraged from entering unsafe paths. The procedure requires a method for sampling paths and computing the safety loss for individual paths. We use a version of symbolic execution (baldoni2018survey) to this end.

We evaluate Dse through several problems from the embedded control and navigation domains, as well as several synthetic tasks. Our baselines include an extended version of DiffAi, the current state of the art, and an ablation that does not use an explicit safety loss. Our experiments show that Dse significantly outperforms the baselines in finding safe and high-performance model parameters.

To summarize, our main contributions are:


  • We present the first approach to worst-case-safe parameter learning for neural networks embedded within nondifferentiable, symbolic programs.

  • As part of our learning algorithm, we give a new way to bring together symbolic execution and stochastic gradient estimators that might have applications outside our immediate task.

  • We present experiments that show that Dse can handle programs with complex structure and also outperforms DiffAi on our problem.

2 Problem Formulation

1Example(x): // x ∈ [-5, 5]
2   y := NN_θ(x)
3   if y <= 1.0:
4        z := x + 10.0
5   else:
6        z := x - 5.0
7   assert (z <= 1)
Figure 1: (Left) An example program. NN_θ is a neural network with parameters θ. DiffAi fails to learn safe parameters for this program. (Right) The program as an STS. Each location corresponds to a line in the program. For brevity, the updates only depict the variable that changes value.

Programs. We define programs with embedded neural networks as symbolic transition systems (STS) (manna2012temporal). Formally, a program is a tuple P = (L, V, ℓ_init, Init, Safe, Δ). Here, L is a finite set of (control) locations, V is a set of real-valued variables, and ℓ_init ∈ L is the initial location. Init, a constraint over V, is an initial condition for the program. Safe is a map from locations to constraints over V; intuitively, Safe(ℓ) is a safety requirement asserted at location ℓ. Finally, Δ is a transition relation consisting of transitions (ℓ, G, U, ℓ′) such that: (i) ℓ is the source location and ℓ′ is the destination location; (ii) the guard G is a constraint over V; and (iii) the update U is a vector (u_1, …, u_|V|), where each u_i is a real-valued expression over V constructed using standard symbolic operators and neural networks with parameters θ. Intuitively, u_i represents the update to the i-th variable. We assume that each u_i is differentiable in θ. We also assume that each guard is of the form x ⋈ c, where x ∈ V, c ∈ R, and ⋈ is a comparison operator — note that this is not a significant restriction as x can be made to store the value of a complex expression using an update. Finally, we assume that programs are deterministic. That is, if G_1 and G_2 are guards for two distinct transitions from the same source location, then G_1 ∧ G_2 is unsatisfiable.

Programs in higher-level languages can be translated to the STS notation in a standard way. For example, Figure 1 (left) shows a simple high-level program. The STS for this program appears in Figure 1 (right). While the program is simple, the state-of-the-art DiffAi approach to verified learning fails to learn safe parameters for it.
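To make the translation concrete, the following Python sketch (ours, for illustration only; the class names, the nn_theta stub, the dictionary-based state, and the guard constants are assumptions that mirror the listing above) encodes the Figure 1 program as a small STS and executes it from concrete inputs.

# A minimal sketch of Figure 1's program encoded as a symbolic transition
# system. Names and the nn_theta stand-in are illustrative, not our tool's API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Transition:
    src: int                                   # source location
    dst: int                                   # destination location
    guard: Callable[[Dict[str, float]], bool]  # boolean guard over variables
    update: Callable[[Dict[str, float]], Dict[str, float]]  # variable update

def nn_theta(x: float, theta: float = 0.3) -> float:
    """Stand-in for the trainable network NN_theta (assumed linear form)."""
    return theta * x

# Locations 2..7 correspond to lines of the program in Figure 1 (left).
transitions: List[Transition] = [
    Transition(2, 3, lambda v: True, lambda v: {**v, "y": nn_theta(v["x"])}),
    Transition(3, 4, lambda v: v["y"] <= 1.0, lambda v: v),
    Transition(3, 6, lambda v: v["y"] > 1.0, lambda v: v),
    Transition(4, 7, lambda v: True, lambda v: {**v, "z": v["x"] + 10.0}),
    Transition(6, 7, lambda v: True, lambda v: {**v, "z": v["x"] - 5.0}),
]

# Safety requirement asserted at location 7: z <= 1.
safety = {7: lambda v: v["z"] <= 1.0}

def run(x0: float) -> bool:
    """Execute the deterministic STS from x = x0; return True if safe."""
    loc, valuation = 2, {"x": x0, "y": 0.0, "z": 0.0}
    while True:
        if loc in safety and not safety[loc](valuation):
            return False
        enabled = [t for t in transitions if t.src == loc and t.guard(valuation)]
        if not enabled:       # no enabled transition: the trajectory is maximal
            return True
        t = enabled[0]        # determinism: at most one transition is enabled
        valuation, loc = t.update(valuation), t.dst

if __name__ == "__main__":
    print([run(x) for x in (-5.0, 0.0, 5.0)])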

Safety Semantics. In classical formal methods, a program is considered safe if all of its executions satisfy a safety constraint. However, in learning settings, it helps to know not only whether a program is unsafe, but also the extent to which it is unsafe. Consequently, we define the safety semantics of programs in terms of a (worst-case) safety loss that quantifies a program’s safeness.

Formally, let a state of P be a pair s = (ℓ, v), where ℓ ∈ L is a location and v is an assignment of values to the variables (i.e., v(x) is the value of x). Such a state is said to be at location ℓ. For boolean constraints B and assignments v to the variables in V, let us write v ⊨ B if v satisfies B. A state (ℓ_init, v), where v ⊨ Init, is an initial state. A state (ℓ, v) is safe if v ⊨ Safe(ℓ).

Let v be an assignment to the variables. For a real-valued expression E over V, let E(v) be the value of E when each variable x is substituted by v(x). For an update U = (u_1, …, u_|V|), we define U(v) as the assignment (u_1(v), …, u_|V|(v)). A length-k trajectory of P is a sequence (ℓ_1, v_1), …, (ℓ_k, v_k), with k ≥ 1, such that: (i) (ℓ_1, v_1) is an initial state; and (ii) for each i, there is a transition (ℓ_i, G, U, ℓ_{i+1}) ∈ Δ such that v_i ⊨ G and v_{i+1} = U(v_i).

Let us fix a size bound T on trajectories. A trajectory τ is maximal if it has length T, or if there is no trajectory that extends τ and has length greater than that of τ. Because our programs are deterministic, there is a unique maximal trajectory from each initial state s. We denote this trajectory by τ_s.

Let us assume a real-valued loss Q(s) that quantifies the unsafeness of each state s. We require Q(s) = 0 if s is safe and Q(s) > 0 otherwise. We lift this measure to trajectories by letting Q(τ) = max_{s ∈ τ} Q(s). The safety loss for P is now defined as:

(1)    Q_P(θ) = sup { Q(τ_s) : s is an initial state of P }.

Thus, Q_P(θ) = 0 if and only if all program trajectories are safe.

Problem Statement. Our learning problem formalizes a setting in which we have training data for the neural networks inside a program P. While training the networks with respect to this data, we must ensure that the overall program satisfies its safety requirements.

To ensure that the parameters of the different neural networks in P are not incorrectly entangled, we assume in this exposition that only one of these networks has trainable parameters θ. (Some of our experimental benchmarks have several trainable neural modules. In these tasks, each neural module comes with its own training data; thus, in effect, we have one instance of the learning problem per module. Our implementation learns the modules simultaneously by interleaving their gradient updates. However, the gradient of the data loss for each module is only influenced by its own data.) We expect as input a training set of i.i.d. samples from an unknown distribution over the inputs and outputs of this network, and a differentiable data loss L_data(θ) that quantifies the network's fidelity to this training set. Our learning goal is to solve the following constrained optimization problem:

(2)    min_θ L_data(θ)   subject to   Q_P(θ) = 0.

3 Learning Algorithm

The learning approach in Dse is based on two ideas. First, we directly apply a recently-developed equivalence between constrained and regularized learning (agarwal2018reductions; le2019batch) to reduce Equation 2 to a series of unconstrained optimization problems. Second, we use the novel technique of differentiating through a symbolic executor to solve these unconstrained problems.

At the highest level, we convexify the class of programs by allowing stochastic combinations of programs. That is, we now allow stochastic programs P_η, where η is a probability distribution over parameter vectors θ. To execute such a program from a given initial state, we sample a specific θ from the distribution η and then execute P_θ from that state.

Equation 2 can now be rewritten as a saddle-point problem over η and a Lagrange multiplier λ, which in turn can be solved using a classic algorithm (freund1999adaptive) for computing equilibria in a two-player game. See Section A.1 for more details on this algorithm.

A key feature of our high-level algorithm is that it repeatedly solves the optimization problem min_θ (L_data(θ) + λ · Q_P(θ)) for fixed values of λ. This problem is challenging because while L_data(θ) is differentiable in θ, Q_P(θ) depends on the entirety of P and may not even be continuous in θ. Dse, our main contribution, addresses this challenge by constructing a differentiable approximation of Q_P(θ). We present the details of this method in the next section.
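As an illustration of this high-level loop, the following sketch (not our actual implementation; the toy data_loss and safety_loss functions and the doubling schedule for the multiplier are placeholders) minimizes the regularized objective for a fixed λ with a gradient optimizer and increases λ while the safety term remains positive. The full mixture-based game of Section A.1 refines this simple penalty scheme.

# Simplified penalty-style sketch of the regularized problem, assumptions ours.
import torch

def data_loss(theta: torch.Tensor) -> torch.Tensor:
    return (theta - 2.0).pow(2).sum()          # toy data-fit term

def safety_loss(theta: torch.Tensor) -> torch.Tensor:
    return torch.relu(theta.sum() - 1.0)       # toy worst-case safety surrogate

theta = torch.zeros(3, requires_grad=True)
lam = 1.0
for outer in range(20):                        # multiplier updates
    opt = torch.optim.Adam([theta], lr=0.05)
    for inner in range(100):                   # unconstrained problem for fixed lam
        opt.zero_grad()
        loss = data_loss(theta) + lam * safety_loss(theta)
        loss.backward()
        opt.step()
    if safety_loss(theta).item() <= 1e-6:      # constraint (approximately) met
        break
    lam *= 2.0                                 # otherwise weight safety more
print(theta.detach(), lam)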

4 Differentiating through a Symbolic Executor

Background on Symbolic Execution. Dse uses a refinement of symbolic execution (baldoni2018survey), a classic technique for systematic formal analysis of programs. Symbolic execution has been recently used to analyze the safety of neural networks (gopinath2018symbolic). However, we are not aware of any prior attempt to incorporate symbolic execution into neural network training.

A symbolic executor systematically searches the set of symbolic trajectories of programs, which we now define. Consider a program P as in Section 2. First, we fix a syntactically restricted class Φ of boolean constraints over V and θ. Φ is required to be closed under conjunction and must include all expressions that can appear as a guard in P. In particular, in our implementation, Φ consists of formulas that specify closed or open intervals, with bounds given by expressions over the parameters, in which each variable lies.

A symbolic state of P is a pair s = (ℓ, F), where ℓ is a location and F ∈ Φ. Intuitively, s represents the set of states of P that are at location ℓ and satisfy the property F. We designate a finite set S_0 of initial symbolic states, each of the form (ℓ_init, F). It is required that every initial state of P satisfies F for some (ℓ_init, F) ∈ S_0. Intuitively, the initial symbolic states “cover” all the possible initial states of the program P.

We also construct an overapproximate update U♯ for each update U in P. For all F ∈ Φ, we have U♯(F) ∈ Φ, where U♯(F) is implied by the exact postcondition of applying U to states that satisfy F. Intuitively, U♯(F) is an overapproximate representation of the set of states reached by applying U to a state that satisfies F.

Let us call a transition (ℓ, G, U, ℓ′) enabled at a symbolic state (ℓ, F) if F ∧ G is satisfiable. A symbolic trajectory of P is a sequence σ = s_1, …, s_k of symbolic states with the following properties:


  • s_1 is an initial symbolic state.

  • Let s_i = (ℓ_i, F_i) and s_{i+1} = (ℓ_{i+1}, F_{i+1}). Then there is a transition t = (ℓ_i, G, U, ℓ_{i+1}) such that: (i) t is enabled at s_i, and (ii) F_{i+1} = U♯(F_i ∧ G). In this case, we write s_i →_t s_{i+1}.

Intuitively, σ represents an overapproximation of the set of concrete trajectories of P that pass through the “control path” ℓ_1, …, ℓ_k.

The above symbolic trajectory is safe if for all i, F_i implies Safe(ℓ_i). Intuitively, in this case, all concrete trajectories that the symbolic trajectory represents follow the program’s safety requirements.
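The following sketch (ours; the Box type, helper names, and the specific y-interval are illustrative assumptions) shows one symbolic step on the interval representation used in our implementation: the current box is intersected with a guard and then pushed through a simple update.

# Interval (box) version of one symbolic-trajectory step, assumptions ours.
from typing import Dict, Optional, Tuple

Box = Dict[str, Tuple[float, float]]  # variable -> (lower, upper)

def meet_guard(box: Box, var: str, op: str, c: float) -> Optional[Box]:
    """Intersect a box with the guard `var op c`; None if unsatisfiable."""
    lo, hi = box[var]
    if op == "<=":
        hi = min(hi, c)
    else:  # ">="
        lo = max(lo, c)
    if lo > hi:
        return None
    return {**box, var: (lo, hi)}

def add_const(box: Box, src: str, dst: str, c: float) -> Box:
    """Overapproximate update dst := src + c (exact for intervals)."""
    lo, hi = box[src]
    return {**box, dst: (lo + c, hi + c)}

# Running example (Figure 1): suppose the box over (x, y) at the conditional is
# x in [-5, 5], y in [-1.5, 1.5]. The two enabled transitions yield two boxes.
box = {"x": (-5.0, 5.0), "y": (-1.5, 1.5)}
then_box = meet_guard(box, "y", "<=", 1.0)       # y <= 1.0 branch
else_box = meet_guard(box, "y", ">=", 1.0)       # complementary branch
print(add_const(then_box, "x", "z", 10.0)["z"])  # z in [5.0, 15.0]  -> unsafe
print(add_const(else_box, "x", "z", -5.0)["z"])  # z in [-10.0, 0.0] -> safe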

Example(Cont.) Consider the example program in Figure 1. Let us assume a single initial symbolic state in which x ∈ [-5, 5], and suppose that the output interval of NN_θ on this input set contains values on both sides of the branch threshold 1.0. The program then has two symbolic trajectories from the initial symbolic state:

  • σ_1, which follows the branch that sets z := x + 10.0, reaching the assertion with z ∈ [5, 15]; and

  • σ_2, which follows the branch that sets z := x - 5.0, reaching the assertion with z ∈ [-10, 0].

We note that only σ_2 is safe.

Probabilistic Symbolic Execution. A key difficulty with using symbolic execution to estimate the safety loss is that our programs need not be differentiable, or even continuous, in θ. To understand the issues here, consider our running example once again. Here, the parameter θ is used to calculate the variable y, which is used in a discontinuous conditional statement that assigns very different values to the variable z in its two branches. Additionally, a slight change in θ can sometimes cause the guard of the conditional to go from holding on some values of x to holding on all values of x. In the second case, the program would have one, rather than two, symbolic trajectories. If the symbolic trajectory that disappears in this way serves as an edge case for judging whether the program is worst-case-safe, learning safe parameters would be difficult, as there would be no gradient guiding the optimizer back to the setting in which both symbolic trajectories exist.

Dse overcomes these difficulties using a probabilistic approach to symbolic execution. Specifically, at a symbolic state s, the symbolic executor in Dse samples its next action following a probability distribution p_θ(a | s), where a ranges over program transitions or a special terminate action ⊥.

For a boolean expression F over the program variables and parameters θ, let vol(F) denote the volume of the set of assignments to V that satisfy F. We assume a procedure to compute vol(F) for any F ∈ Φ. (Note that such a procedure is easy to implement when every F ∈ Φ is an interval, as is the case in our implementation.) We define:


  • If there is a unique program transition t that is enabled at s, then p_θ(t | s) = 1.

  • If there is no transition that is enabled at s, then p_θ(⊥ | s) = 1.

  • Otherwise, let t_1, …, t_m be the transitions that are enabled at s = (ℓ, F), with guards G_1, …, G_m. Then p_θ(t_i | s) = vol(F ∧ G_i) / Σ_{j=1}^{m} vol(F ∧ G_j).

We also define a volume-weighted probability distribution p_0 over the (finite set of) initial symbolic states S_0. Specifically, for each s = (ℓ_init, F) ∈ S_0, we have

(3)    p_0(s) = vol(F) / Σ_{(ℓ_init, F′) ∈ S_0} vol(F′).

Dse uses the distributions p_0 and p_θ to sample symbolic trajectories. Let σ = s_1 →_{t_1} s_2 →_{t_2} ⋯ →_{t_{k−1}} s_k, with s_1 ∈ S_0. The probability of sampling σ is

(4)    p_θ(σ) = p_0(s_1) · ∏_{i=1}^{k−1} p_θ(t_i | s_i).

We note that p_θ(σ) is differentiable in θ.
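The next sketch (ours; the single-variable setting, the linear stand-in for the neural network, and the small eps constant are assumptions) shows how the volume-based branch probabilities can be computed with differentiable operations, so that their gradients with respect to θ are available.

# Differentiable branch probabilities from interval volumes, assumptions ours.
import torch

theta = torch.tensor(0.3, requires_grad=True)
x_lo, x_hi = -5.0, 5.0
# Interval of y = theta * x over x in [x_lo, x_hi] (theta > 0 assumed here).
y_lo, y_hi = theta * x_lo, theta * x_hi

c = 1.0       # branch threshold: y <= c vs. y > c
eps = 1e-8
vol_then = torch.clamp(torch.minimum(y_hi, torch.tensor(c)) - y_lo, min=0.0)
vol_else = torch.clamp(y_hi - torch.maximum(y_lo, torch.tensor(c)), min=0.0)
total = vol_then + vol_else + eps
p_then, p_else = vol_then / total, vol_else / total

p_else.backward()           # gradient of a branch probability w.r.t. theta
print(float(p_then), float(p_else), float(theta.grad))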

Approximate Safety Loss. A key decision in Dse is to approximate the overall worst-case loss by the expectation of a differentiable safety loss computed per sampled symbolic trajectory. Recall the safety loss Q(s) for individual states. For a symbolic state s = (ℓ, F), we define Q♯(s) as a differentiable approximation of the worst-case loss max_{v ⊨ F} Q((ℓ, v)). We lift this loss to symbolic trajectories by defining Q♯(σ) = max_{s ∈ σ} Q♯(s). We observe that Q♯(σ) is differentiable in θ.

Our approximation to the safety loss is now given by:

(5)    Q̂_P(θ) = E_{σ ∼ p_θ}[ Q♯(σ) ].

Intuitively, Q̂_P(θ) guides the learner towards either making the unsafe trajectories safer or lowering the probability of the program falling into unsafe trajectories. If Q̂_P(θ) equals zero, the program is provably safe. That said, in practice, we must estimate the expectation in Equation 5 using sampling. As this sampling process may miss trajectories, a low empirical estimate of Q̂_P(θ) need not guarantee worst-case safety. As an extreme example, suppose a program has one unsafe symbolic trajectory, but the probability of this symbolic trajectory is near-zero. This trajectory is likely to be missed by the sampling process, leading to an artificially low value of Q̂_P(θ).

Example (Cont.) Consider the symbolic trajectories σ_1 and σ_2 in our running example. Their probabilities are determined entirely by the branch on y, as the other transitions have probability 1; the two branch probabilities are proportional to the volumes of the parts of the symbolic state that satisfy and violate the guard. Because σ_1 reaches the assertion with z ∈ [5, 15], we have Q♯(σ_1) > 0; similarly, Q♯(σ_2) = 0 because z ∈ [-10, 0] satisfies the assertion.

Gradient Estimation. Our ultimate goal is to compute the gradient of the approximate safety loss Q̂_P(θ). At first sight, the classic Reinforce estimator (williams1992simple) or one of its relatives (schulman2015gradient) seems suitable for this task, given that they can differentiate an integral over sampled “paths”. However, a subtlety in our setting is that the losses for individual paths (symbolic trajectories) are themselves functions of θ. This calls for a generalization of traditional Reinforce-like estimators.

More precisely, we adapt the derivation of Reinforce to our setting as follows:

∇_θ Q̂_P(θ) = ∇_θ Σ_σ p_θ(σ) Q♯(σ)
           = Σ_σ ( Q♯(σ) ∇_θ p_θ(σ) + p_θ(σ) ∇_θ Q♯(σ) )
           = E_{σ ∼ p_θ}[ Q♯(σ) ∇_θ log p_θ(σ) + ∇_θ Q♯(σ) ].

The above derivation can be further refined to incorporate the variance-reduction techniques for Reinforce that are commonly employed in generative modeling and reinforcement learning. The use of such techniques is orthogonal to our main contribution, and we ignore it in this paper.
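Concretely, the estimator can be implemented with a surrogate objective whose automatic gradient equals the expression above. The sketch below (ours; the toy log-probabilities and per-trajectory losses are placeholders) detaches Q♯(σ) inside the score-function term so that the pathwise term ∇_θ Q♯(σ) is not double-counted.

# Generalized REINFORCE-style surrogate for theta-dependent path losses.
import torch

def surrogate(log_prob: torch.Tensor, q_loss: torch.Tensor) -> torch.Tensor:
    """log_prob and q_loss are per-trajectory tensors, both differentiable in theta."""
    # detach() keeps the score-function term from double-counting grad Q.
    return (q_loss.detach() * log_prob + q_loss).mean()

# Toy illustration with a single parameter; p_theta and Q here are stand-ins.
theta = torch.tensor(0.5, requires_grad=True)
log_prob = torch.stack([theta.log(), (1 - theta).log()])   # two sampled paths
q_loss = torch.stack([theta * 2.0, theta * 0.1])           # theta-dependent losses
surrogate(log_prob, q_loss).backward()
print(theta.grad)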

5 Evaluation

Our experimental evaluation seeks to answer the following research questions:

  1. Is Dse effective in learning safe and performant parameters?

  2. How does data size influence Dse and baselines?

  3. How well does Dse scale as the program being learned gets more complex?

5.1 Experimental Setup

System Setup. Our framework is built on top of PyTorch (NEURIPS2019_9015). We use the Adam optimizer (kingma2014adam) for all the experiments with default parameters and weight decay. We ran all the experiments using a single-thread implementation on a Linux system with Intel Xeon Gold 5218 2.30GHz CPUs and GeForce RTX 2080 Ti GPUs. (Please refer to Section A.7 for more training details.)

Provably Safe Portion
Template  Model    DiffAI+  Dse
Pattern1  NNSmall  0.0      1.0
Pattern1  NNMed    0.0      1.0
Pattern1  NNBig    0.0      1.0
Pattern2  NNSmall  0.0      0.70
Pattern2  NNMed    0.0      0.78
Pattern2  NNBig    0.0      0.73
Pattern3  NNSmall  1.0      1.0
Pattern3  NNMed    1.0      1.0
Pattern3  NNBig    1.0      1.0
Pattern4  NNSmall  0.0      0.76
Pattern4  NNMed    0.0      0.80
Pattern4  NNBig    0.58     0.97
Pattern5  NNSmall  0.0      1.0
Pattern5  NNMed    0.0      1.0
Pattern5  NNBig    0.0      1.0
Figure 2: Results of synthetic microbenchmarks of Dse and DiffAI+.

Baselines. We use two types of baselines: (i) Ablation, which ignores the safety constraint and learns only from data; (ii) DiffAI+, an extended version of the original DiffAi method (mirman2018differentiable). DiffAi does not directly fit our setting of neurosymbolic programs as it does not handle general conditional statements. DiffAI+ extends DiffAi by adding the meet and join operations following the abstract interpretation framework (cousot1977abstract). Also, we allow DiffAI+ to split the input set into several (100 if not specified) initial symbolic states to increase the precision of symbolic execution of each trajectory. (The original DiffAi does not need to perform such splits because, being focused on adversarial robustness, it only propagates small regions around each input point through the model.) However, the parts of DiffAi that propagate losses and their gradients through neural networks remain the same in DiffAI+.

We use the box (interval) domain for both Dse and DiffAI+ in all the experiments. Dse samples 50 symbolic trajectories for each starting component.

Benchmarks. Our evaluation uses 5 synthetic microbenchmarks and 4 case studies. The microbenchmarks (Section A.5) are small programs consisting of basic neural network modules, arithmetic function assignment, and conditional branches. These programs are designed to highlight the relationship between DiffAI+ and Dse. The case studies, described below, model real-world control and navigation tasks (see Section A.4 for more details):


  • In Thermostat, we want to learn safe parameters for two neural networks that control the temperature of a room.

  • Racetrack is a navigation benchmark (barto1995learning; christakis2021automated). The two vehicle controllers are trained separately, from data produced by path planners. The racetrack system is expected to learn two safe controllers so that the vehicles do not crash into walls or into each other.

  • In Aircraft-Collision (AC), the goal is to learn safe parameters for an airplane that performs maneuvers to avoid a collision with a second plane.

  • In Cartpole, the goal is to train a cart to imitate expert demonstrations while keeping the cart within a safe position range.

For each benchmark, we first write a ground-truth program that does not include neural networks and provably satisfies a safety requirement. We identify certain modules of these programs as candidates for learning. Then we replace the modules with neural networks with trainable parameters.

For all benchmarks with the exception of Racetrack, we execute the ground-truth programs on uniformly sampled inputs to collect input-output datasets for each module that is replaced by a neural network. As for Racetrack, this benchmark involves two simulated vehicles with neural controllers; learned parameters for these controllers are safe if the vehicles do not crash into each other or a wall. To model real-world navigation, we generate the training data for the controllers from a ground-truth trajectory constructed using a path planner. This path planner only tries to avoid the walls in the map and does not have a representation of distances from other vehicles. Thus, in this problem, even a very large data set does not contain all the information needed for learning safe parameters.

Evaluation of Safety and Data Loss. Once training is over, we evaluate the learned program’s test data loss by running it on 10000 initial states that were not seen during training. We evaluate the safety loss using an abstract interpreter (cousot1977abstract) that splits the initial condition into a certain number of boxes ( for Cartpole, 10000 for the other benchmarks), then constructs symbolic trajectories from these boxes. This analysis is sound (unlike the approximate loss employed during Dse training), meaning that the program is provably worst-case safe if the safety loss evaluates to 0. Our safety metric is the provably safe portion, which is the fraction of the program’s symbolic trajectories that are worst-case safe.
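For intuition, the following sketch (ours; the analysis is hard-coded to the Figure 1 example and a fixed θ, both assumptions) computes a provably-safe-portion-style metric by splitting the input range into boxes and checking the output interval of each against the assertion.

# Toy "provably safe portion" computation for the Figure 1 example.
from typing import List, Tuple

Interval = Tuple[float, float]

def analyze(x: Interval, theta: float = 0.3) -> List[Interval]:
    """Return the z-intervals of all feasible branches for x in the given box."""
    y = (min(theta * x[0], theta * x[1]), max(theta * x[0], theta * x[1]))
    outs = []
    if y[0] <= 1.0:                       # then-branch feasible: z = x + 10
        outs.append((x[0] + 10.0, x[1] + 10.0))
    if y[1] > 1.0:                        # else-branch feasible: z = x - 5
        outs.append((x[0] - 5.0, x[1] - 5.0))
    return outs

def provably_safe_portion(n_boxes: int = 100) -> float:
    lo, hi, safe = -5.0, 5.0, 0
    width = (hi - lo) / n_boxes
    for i in range(n_boxes):
        box = (lo + i * width, lo + (i + 1) * width)
        if all(z_hi <= 1.0 for (_, z_hi) in analyze(box)):
            safe += 1
    return safe / n_boxes

print(provably_safe_portion())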

Figure 3: The provably safe portion (a: Thermostat, b: AC, c: Racetrack, d: Cartpole) and the test data loss (e: Thermostat, f: AC, g: Racetrack, h: Cartpole) of Ablation, DiffAI+ and Dse when varying the size of the training data.
Figure 4: Trajectories during training. Each row shows, from left to right, the results of Ablation (Final), DiffAI+ (Final), Dse (Middle), and Dse (Final); from top to bottom, the rows are Thermostat, AC, Racetrack with the position property, Racetrack with the distance property, and Cartpole. Each panel shows the trajectories (concrete: red, symbolic: green rhombus) of the programs learnt by the method at different training stages, denoted “Middle” and “Final”. We separate the input state set into 100 components (81 components for Cartpole) evenly to plot the symbolic trajectories clearly. During evaluation, we separate the input set into 10000 components (a different number of components for Cartpole) evenly to get a more accurate symbolic-trajectory measurement.

5.2 Results

RQ1: Overall Quality of Learned Parameters. Figure 2 shows the provably safe portion achieved by DiffAI+ and Dse on our microbenchmarks. Here, the first two patterns only use neural modules to evaluate the guards of a conditional branch, and the last three patterns use the neural modules in the guards as well as in updates (assignments) that directly affect the program’s safety. We see that DiffAI+ cannot learn safe parameters when the neural module feeds a highly non-differentiable branch (Pattern1, Pattern2), while Dse can. DiffAI+ can handle the cases where optimization across branches is not required to obtain a safe program (Pattern3). For the patterns with neural assignments, DiffAI+ finds it hard to learn to jump from one conditional branch to another to increase safety (Pattern4, Pattern5). We refer the reader to Section A.9 for a more detailed analysis of the patterns.

We show the training-time and test-time safety and data losses for our more complex benchmarks in Figure 3. Figures 3(a)–3(d) show that Dse can learn programs with a provably safe portion of 0.8 even with only 200 data points, and that the results hold as the number of training data points is increased. Meanwhile, both DiffAI+ and Ablation fail to provide safe programs for AC, Racetrack and Cartpole. For Thermostat, Ablation and DiffAI+ achieve a provably safe portion of 0.8 only when using 5000 and 10000 data points to train.

Dse’s test data loss is sometimes larger than, but comparable to, that of Ablation and DiffAI+ (Figures 3(e) and 3(g)). Figure 3(h) exhibits the tension between achieving a good data loss and safety: the test data loss of Dse increases as the provably safe portion becomes larger. For AC specifically, the safety constraint can help the learner escape locally optimal yet unsafe regions and reach a result that is both safer and more accurate.

Overall, we find that Dse outperforms DiffAI+ when neural modules are used in conditional statements and the interaction between the neural modules and the surrounding program is important for the safety property. This is the case in our larger benchmarks. For example, in Thermostat and AC, the neural modules’ outputs decide the actions that the program takes in the next steps. In Racetrack, the distance between the two neurally controlled vehicles constrains each step of the vehicles’ controllers. In Cartpole, the neural module outputs the force applied to the cart.

We display representative symbolic trajectories from the programs learned by Ablation, DiffAI+ and Dse in Figure 4. A larger portion of the symbolic trajectories from Dse is provably safe than of those from the baselines, as indicated by less overlap between the green trajectories and the gray (unsafe) area. Because the symbolic trajectories are overapproximate, we also depict 100 randomly sampled representative concrete trajectories for each approach. We note that more concrete trajectories from Dse fall into the safe area than from DiffAI+ and Ablation on Thermostat, AC, the distance property of Racetrack, and Cartpole.

RQ2: Impact of Data Size. Figure 3 compares the different methods’ performance on the test metrics as the number of training data points is varied. Ablation and DiffAI+ can reach a 0.95 provably safe portion with 10000 data points for Thermostat. However, the variance of the results is large: the minimum provably safe portion across training runs drops to 0.74 for Ablation and 0.0 for DiffAI+. In the Thermostat benchmark, DiffAI+’s performance mainly comes from the guidance of the data loss rather than the safety loss.

For the other three cases, a larger training data set does not help Ablation and DiffAI+ produce much safer programs. For AC, as shown in Figure 4(e), the program learnt by Ablation is very close to the safe area, but it is still not accurate enough to be safe in the third step. In Racetrack, increasing the data size used to train the vehicles’ controllers can only satisfy the requirement of “not crashing into walls”; the networks cannot learn to satisfy the property of not crashing into other vehicles, as this information is not available in the training data. In Cartpole, the baselines mainly follow the guidance of the data loss and cannot give safer programs.

RQ3: Scalability. As our approach to the safety loss does not rely on the number of data points, it is scalable in terms of both data size and neural network size. In Figure 2, we use three types of neural networks: (i) NNSmall (about 33000 parameters); (ii) NNMed (over 0.5 million parameters); and (iii) NNBig (over 2 million parameters). The training time per epoch is 0.09 sec, 0.10 sec and 0.11 sec for NNSmall, NNMed and NNBig, respectively. For the synthetic benchmarks, training converges within about 200 epochs, which takes 18 sec, 20 sec and 22 sec, respectively.

Our Thermostat, AC and Racetrack benchmarks consist of neural networks with 4933, 8836 and 4547 parameters respectively, and loops with 20, 15 and 20 iterations respectively. There are 2, 1 and 2 neural modules (respectively) in each benchmark. The training times for DiffAI+ and Dse are comparable on these benchmarks. Specifically, for Thermostat, AC, and Racetrack, the total training time of Dse is less than 1 hour, less than 90 minutes, and around 13 hours. The corresponding numbers are over 90 minutes, more than 2 hours, and more than 17 hours for DiffAI+. A more detailed scalability analysis is available in Section A.11.

6 Related Work

Verification of Neural and Neurosymbolic Models. There are many recent papers on the verification of worst-case properties of neural networks (anderson2019optimization; gehr2018ai2; katz2017reluplex; elboher2020abstraction; wang2018formal). There are also several recent papers (ivanov2019verisig; tran2020nnv; sun2019formal; christakis2021automated) on the verification of compositions of neural networks and symbolic systems (for example, plant models). To our knowledge, the present effort is the first to integrate a method of this sort — propagation of worst-case intervals through a neurosymbolic program — with the gradient-based training of neurosymbolic programs.

Verified Deep Learning. There is a growing literature on methods that incorporate worst-case objectives (such as safety and robustness) into neural network training (zhang2019towards; mirman2018differentiable; madry2017towards; cohen2019certified; singh2018fast). Most work on this topic focuses on the training of single neural networks. There are a few domain-specific efforts that consider the environment of the neural networks being trained. For example, (shi2019neural) uses spectral normalization to constrain the neural network module of a neurosymbolic controller and ensure that it respects certain stability properties. Unlike Dse, these methods treat neural network modules in an isolated way and do not consider the interactions between these modules and surrounding code (stone2016artificial; qin2000overview; nassi2020phantom; katz2021scenario).

Verified Parameter Synthesis for Symbolic Code. There is a large body of work on parameter synthesis for traditional symbolic code (alur2013syntax; chaudhuri2014bridging) with respect to worst-case correctness constraints. Especially relevant in this literature is chaudhuri2014bridging, which introduced a method to smoothly approximate abstract interpreters of programs that is closely related to Dse and DiffAi. However, these methods do not use contemporary gradient-based learning, and as a result, scaling them to programs with neural modules is impractical.

7 Conclusion

We presented Dse, the first approach to worst-case-safe parameter learning for programs that embed neural networks inside potentially discontinuous symbolic code. The method is based on a new mechanism for differentiating through a symbolic executor. We demonstrate that Dse outperforms DiffAi, a state-of-the-art approach to worst-case-safe deep learning, in a range of tasks.

Our current implementation of Dse uses an interval representation of symbolic states. Future work should explore more precise representations such as zonotopes. Incorporating modern variance reduction techniques in our gradient estimator is another natural next step. Finally, one challenge in Dse is that learning here can get harder as the symbolic state representation gets more precise. In particular, if we increase the number of initial symbolic states beyond a point, each state would only lead to a unique symbolic trajectory, and there would be no gradient signal to adjust the relative weights of the different symbolic trajectories. Future work should seek to identify good, possibly adaptive, tradeoffs between the precision of symbolic states and the ease of learning.

Acknowledgments: We thank our anonymous reviewers for their insightful comments and Radoslav Ivanov (Rensselaer Polytechnic Institute) for his help with benchmark design. Work on this paper was supported by the United States Air Force and DARPA under Contract No. FA8750-20-C-0002, by ONR under Award No. N00014-20-1-2115, and by NSF under grant #1901376.

References

Appendix A Appendix

a.1 Learning Framework

Using an equivalence between constrained and regularized learning that (agarwal2018reductions; le2019batch), among others, have recently developed in other learning settings, we reduce our problem to a series of unconstrained optimization tasks.

for t = 1, 2, … do
        θ_t := BESTθ(λ_t)
        η_t := uniform mixture over {θ_1, …, θ_t}
        λ̂_t := BESTλ(η_t)
        if |L(η_t, λ̂_t) − L(η_t, λ_t)| ≤ ν then return η_t
        λ_{t+1} := UPDATE(λ_t, θ_t)
Algorithm 1 Learning Safe, Optimal Parameter Mixtures (agarwal2018reductions; le2019batch)

We convexify the program set by considering stochastic mixtures (le2019batch) and represent an element of the convexified set as a distribution η over parameter values θ. We rewrite Equation 2 in terms of these mixtures:

(6)    min_η L_D(η)   subject to   Q(η) ≤ 0.

We convert Equation 6 to a Lagrangian function

(7)    L(η, λ) = L_D(η) + λ · Q(η),
(8)    where L_D(η) = E_{θ∼η}[ L_data(θ) ] and Q(η) = E_{θ∼η}[ Q_P(θ) ],

and λ ∈ [0, λ_max] is a Lagrange multiplier. Following the equilibrium computation technique of freund1996game, Equation 6 can be rewritten as

(9)    min_η max_{λ ∈ [0, λ_max]} L(η, λ).

Solutions to this problem can be interpreted as equilibria of a game between a θ-player and a λ-player, in which the θ-player minimizes L(η, λ) given the current λ, and the λ-player maximizes L(η, λ) given the current η. Our overall algorithm is shown in Algorithm 1. In this pseudocode, ν is a predefined positive real, and η_t is the mixture that selects a θ out of θ_1, …, θ_t uniformly at random.

BESTθ(λ) refers to the θ-player’s best response for a given value of λ (it can be shown that this best response is a single parameter value, rather than a mixture of parameters). Computing this best response amounts to solving the unconstrained optimization problem from Section 3, i.e.,

(10)    BESTθ(λ) = argmin_θ ( L_data(θ) + λ · Q_P(θ) ).

BESTλ(η) is the λ-player’s best response to a particular parameter mixture η. We define this function as

(11)    BESTλ(η) = λ_max if Q(η) > 0, and 0 otherwise,

where λ_max is the upper bound on λ. Intuitively, when Q(η) is non-positive, the current parameter mixture is safe; as L(η, λ) is a linear function of λ, the best response is then achieved when λ is zero. In the other case, the best response is reached by setting λ to the maximum value λ_max.

Finally, UPDATE computes, in constant time, a new value of λ based on the most recent best response by the θ-player. We do not describe this function in detail; please see (agarwal2018reductions) for more details.

Following prior work, we can show that Algorithm 1 converges to a value that is within additive distance ν of a saddle point of Equation 9. To solve our original problem (Equation 2), we take the returned mixture and then return the real-valued parameter θ to which this mixture assigns the highest probability. It is easy to see that this value is an approximate solution to Equation 2.

a.2 Abstract Update for Neural Network Modules

We consider the box domain in our implementation. For a program with n variables, each element of the domain represents an n-dimensional box. Each element of the domain is a pair (c, δ), where c ∈ R^n is the center of the box and δ ∈ R^n (non-negative) represents the deviations. The interval concretization of the i-th dimension variable of (c, δ) is given by [c_i − δ_i, c_i + δ_i].

Now we give the abstract update for the box domain following mirman2018differentiable.

Add.

For a concrete function f_add that replaces the i-th element of the input vector by the sum of the i-th and j-th elements:

f_add(x) = (x_1, …, x_{i−1}, x_i + x_j, x_{i+1}, …, x_n),

the abstraction function of f_add is given by:

f_add♯(c, δ) = (f_add(c), f_add(δ)),

i.e., applying f_add to the center replaces the i-th element of c by c_i + c_j, and applying it to the deviation replaces the i-th element of δ by δ_i + δ_j.

Multiplication.

For a concrete function f_mul that multiplies the i-th element of the input vector by a constant a:

f_mul(x) = (x_1, …, a · x_i, …, x_n),

the abstraction function of f_mul is given by:

f_mul♯(c, δ) = (f_mul(c), f_|mul|(δ)),

where f_mul multiplies the i-th element of c by a and f_|mul| multiplies the i-th element of δ by |a|.

Matrix Multiplication.

For a concrete function f_M that multiplies the input by a fixed matrix M:

f_M(x) = M x,

the abstraction function of f_M is given by:

f_M♯(c, δ) = (M c, |M| δ),

where |·| is an element-wise absolute value operation. Convolutions follow the same approach, as they are also linear operations.

ReLU.

For a concrete element-wise ReLU operation over x:

ReLU(x) = max(x, 0),

the abstraction function of ReLU is given by:

ReLU♯(c, δ) = ( (ReLU(c + δ) + ReLU(c − δ)) / 2 ,  (ReLU(c + δ) − ReLU(c − δ)) / 2 ),

where c + δ and c − δ denote the element-wise sum and element-wise difference of c and δ.

Sigmoid.

As Sigmoid and ReLU are both monotonic functions, their abstraction functions follow the same approach. For a concrete element-wise Sigmoid operation over x:

σ(x) = 1 / (1 + e^{−x}),

the abstraction function of Sigmoid is given by:

σ♯(c, δ) = ( (σ(c + δ) + σ(c − δ)) / 2 ,  (σ(c + δ) − σ(c − δ)) / 2 ),

where c + δ and c − δ denote the element-wise sum and element-wise difference of c and δ. All the above abstract updates are differentiable and can easily be parallelized on the GPU.
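The following PyTorch sketch (ours; the helper names and the example layer are illustrative) implements the affine and monotone-activation transformers above on a (center, deviation) pair, which is all that is needed to push a box through a feed-forward network.

# Box-domain abstract transformers on (center, deviation) pairs.
import torch

def affine_abs(c: torch.Tensor, d: torch.Tensor,
               M: torch.Tensor, b: torch.Tensor):
    """Abstract linear layer: the center goes through M, the deviation through |M|."""
    return c @ M.t() + b, d @ M.t().abs()

def monotone_abs(c: torch.Tensor, d: torch.Tensor, f):
    """Abstract transformer for an element-wise monotone function f
    (e.g. ReLU or sigmoid): evaluate f at the box endpoints c ± δ."""
    upper, lower = f(c + d), f(c - d)
    return (upper + lower) / 2, (upper - lower) / 2

# Example: propagate the box [-1, 1]^2 through a tiny affine + ReLU layer.
c = torch.zeros(1, 2)
d = torch.ones(1, 2)
M = torch.tensor([[1.0, -2.0], [0.5, 0.5]])
b = torch.tensor([0.1, -0.1])
c, d = affine_abs(c, d, M, b)
c, d = monotone_abs(c, d, torch.relu)
print(c, d)   # center and deviation of the output box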

a.3 Instantiation of the Q and Q♯ Functions

In general, the function Q over individual states can be any differentiable distance function between a point and a set which satisfies the property that Q((ℓ, v)) = 0 if v is in the safe set Safe(ℓ), and Q((ℓ, v)) > 0 if v is not in Safe(ℓ). We give the following instantiation as the unsafeness score over individual states:

Q((ℓ, v)) = min_{v′ ⊨ Safe(ℓ)} Dist(v, v′),

where Dist denotes the Euclidean distance between two points.

Similarly, the function Q♯ over symbolic states can be any differentiable distance function between two sets which satisfies the property that Q♯((ℓ, F)) = 0 if the set of states represented by F is contained in Safe(ℓ), and Q♯((ℓ, F)) > 0 otherwise. Our implementation uses an analogous distance-based instantiation as a differentiable approximation of the worst-case loss; a sketch of one such instantiation for interval safe sets appears below.
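As a hedged example (this is one possible instantiation consistent with the requirements above, not necessarily the exact function used in our code), the sketch below defines Q and Q♯ for the case where the safe set is an interval, using ReLU-shaped distances so that both functions are differentiable.

# One possible differentiable instantiation of Q and Q# for interval safe sets.
import torch

def q_point(v: torch.Tensor, safe_lo: float, safe_hi: float) -> torch.Tensor:
    """Zero iff v lies inside [safe_lo, safe_hi]; grows with the distance."""
    return torch.relu(safe_lo - v) + torch.relu(v - safe_hi)

def q_box(lo: torch.Tensor, hi: torch.Tensor,
          safe_lo: float, safe_hi: float) -> torch.Tensor:
    """Zero iff [lo, hi] is contained in the safe interval; upper-bounds the
    worst-case q_point over the box."""
    return torch.relu(safe_lo - lo) + torch.relu(hi - safe_hi)

z_lo, z_hi = torch.tensor(5.0), torch.tensor(15.0)   # then-branch of Figure 1
print(q_box(z_lo, z_hi, float("-inf"), 1.0))         # 14.0: box sticks out of z <= 1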

a.4 Benchmarks for Case Studies

The detailed programs describing Aircraft Collision, Thermostat and Racetrack are shown in Figures 5, 6 and 7. The neural module NN_θ of Aircraft Collision and the two neural modules NN1_θ, NN2_θ of Thermostat are 3-layer feed-forward nets with 64 nodes in each layer and a ReLU after each layer except the last one; a sigmoid layer serves as the last layer. Both neural modules NN1_θ and NN2_θ in Racetrack are 3-layer feed-forward nets with 64 nodes in each layer and a ReLU after each layer. The Cartpole benchmark is a linear approximation following bastani2018verifiable, with a 3-layer feed-forward net with 64 nodes in each layer as the controller.
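For reference, the following sketch (ours; the input and output dimensions are assumptions) builds the controller architecture just described in PyTorch.

# 3-layer feed-forward controller: 64 units per layer, ReLU activations, and
# either a final sigmoid (Thermostat, AC) or a final ReLU (Racetrack).
import torch.nn as nn

def controller(in_dim: int, out_dim: int, final_sigmoid: bool = True) -> nn.Sequential:
    layers = [
        nn.Linear(in_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, out_dim),
    ]
    layers.append(nn.Sigmoid() if final_sigmoid else nn.ReLU())
    return nn.Sequential(*layers)

thermostat_net = controller(in_dim=1, out_dim=2)                     # e.g. (isOn, heat)
racetrack_net = controller(in_dim=2, out_dim=3, final_sigmoid=False)  # e.g. 3 actions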

1aircraftcollision(x):
2    x1, y1 := x, -15.0
3    x2, y2 := 0.0, 0.0
4    N := 15
5    i = 0
6    while i < N:
7        p0, p1, p2, p3, step := NN_θ(x1, y1, x2, y2, step, stage)
8        stage := argmax(p0, p1, p2, p3)
9        if stage == CRUISE:
10            x1, y1 := MOVE_CRUISE(x1, y1)
11        else if stage == LEFT:
12            x1, y1 := MOVE_LEFT(x1, y1)
13        else if stage == STRAIGHT:
14            x1, y1 := MOVE_STRAIGHT(x1, y1)
15        else if stage == RIGHT:
16            x1, y1 := MOVE_RIGHT(x1, y1)
17
18        x2, y2 := MOVE_2(x2, y2)
19        i := i + 1
20        assert (!IS_CRASH(x1, y1, x2, y2))
21    return
Figure 5: Aircraft Collision
1thermostat(x):
2    N = 20
3    isOn = 0.0
4    i = 0
5    while i < N:
6        if isOn <= 0.5:
7            isOn := NN1_θ(x)
8            x := COOLING(x)
9        else:
10            isOn, heat := NN2_θ(x)
11            x := WARMING(x, heat)
12        i := i + 1
13        assert(!EXTREME_TEMPERATURE(x))
14
15    return
Figure 6: Thermostat
1racetrack(x):
2    x1, y1, x2, y2 := x, 0.0, x, 0.0
3    N := 20
4    while i < N:
5        p10, p11, p12 := NN1_θ(x1, y1)
6        p20, p21, p22 := NN2_θ(x2, y2)
7        action1 := argmax(p10, p11, p12)
8        action2 := argmax(p20, p21, p22)
9        x1, y1 := MOVE(x1, y1, action1)
10        x2, y2 := MOVE(x2, y2, action2)
11        i := i + 1
12        assert(!CRASH_WALL(x1, y1) && !CRASH_WALL(x2, y2)
13            && !CRASH(x1, y1, x2, y2))
14
15    return
Figure 7: Racetrack

a.5 Synthetic Microbenchmarks

Figure 8 exhibits the 5 synthetic microbenchmarks in our evaluation. Specifically, the neural network module is used only in the branch condition in Pattern 1 and Pattern 2, and it appears in both the condition and an assignment in Pattern 3, Pattern 4 and Pattern 5. To highlight the impact of the safety loss, we train with only the safety loss on the synthetic microbenchmarks.

1pattern1(x):
2    # x ∈ [-5, 5]
3    y := NN_θ(x)
4    if y <= 1.0:
5        z := 10
6    else:
7        z := 1
8    assert(z <= 1)
9    return z
(a) Pattern1
1pattern2(x):
2    # x ∈ [-5, 5]
3    y := NN_θ(x)
4    if y <= 1.0:
5        z := x + 10
6    else:
7        z := x - 5
8    assert(z <= 0)
9    return z
(b) Pattern2
1pattern3(x):
2    # x ∈ [-5, 5]
3    y := NN_θ(x)
4    if y <= 1.0:
5        z := 10 - y
6    else:
7        z := 1
8    assert(z <= 1)
9    return z
(c) Pattern3
1pattern4(x):
2    # x ∈ [-5, 5]
3    y := NN_θ(x)
4    if y <= -1.0:
5        z := 1
6    else:
7        z := 2 + y * y
8    assert(z <= 1)
9    return z
(d) Pattern4
1pattern5(x):
2    # x ∈ [-1, 1]
3    y := NN_θ(x)
4    if y <= 1.0:
5        z := y
6    else:
7        z := -10
8    assert(z <= 0 && z >= -5)
9    return z
(e) Pattern5
Figure 8: Programs for Patterns

a.6 Data Generation

In this section, we provide more details about our data generation process. We give our ground-truth programs and the description of each benchmark:

  • Thermostat: Figure 6 exhibits the program used in training, which involves a 20-iteration loop over the temperature x. COOLING and WARMING are two differentiable functions that update the temperature. NN1_θ and NN2_θ are two linear feed-forward neural networks. We fix a safe temperature range, and EXTREME_TEMPERATURE measures whether x falls outside this range. The ground-truth program replaces NN1_θ and NN2_θ with two different functional mechanisms (including branches and assignments) that decide the values of isOn and heat. To mimic the real-world signal-sensing process of a thermostat, we carefully add noise to the outputs of these functional mechanisms and restrict the noise so that it does not influence the safety of the ground-truth program.

  • AC: Figure 5 illustrates the program used in training, whose initial input is the x-axis position of aircraft1 and which involves a 15-iteration loop. The MOVE_* functions update the positions of the aircraft in a differentiable way. NN_θ is one linear feed-forward neural network. IS_CRASH measures the distance between the two aircraft; the safe distance is required to be larger than 40.0. In the ground-truth program, NN_θ is replaced by a functional mechanism mimicking a real-world aircraft planner. If the stage is CRUISE, the ground-truth planner detects the distance between the aircraft and decides the next step's stage by assigning values to p0, p1, p2, p3; step is set to 0. When the stage is LEFT or RIGHT, the stage is kept and step increases if the number of steps in this stage is within a threshold; otherwise, the stage changes to STRAIGHT or CRUISE and step is set to 0. Specifically, we convert the argmax to a nested If-Then-Else (ITE) block to select the index of the input with the maximum value.

  • Racetrack: Figure 7 illustrates the program used in training, whose input is the x-axis position of each agent (the two agents start from the same location). This program has a 20-iteration loop and a dynamic length of 223 lines (the unfolded ITE of argmax consists of 3 lines of code). In training, NN1_θ and NN2_θ are two linear feed-forward neural networks. MOVE is a differentiable function that updates the agents' positions. CRASH_WALL measures whether an agent's position collides with a wall, and CRASH measures whether the two agents crash into each other. In the ground-truth program, NN1_θ and NN2_θ are replaced by functional path planners guiding the agents' directions. The path planner only tries to avoid wall collisions in the map; that is, in the generated trajectory data, an arbitrary pair of trajectories of agent1 and agent2 is not guaranteed to be collision-free with respect to each other. To model real-world navigation, we also add noise in the next-step selection for each position to generate independent trajectories for each agent. In the ground-truth program, the noise is added by uniformly selecting from the safe next-step area.

  • Cartpole: In the Cartpole experiment, we use trajectory data generated from the expert model of the Imitation package. We train a cart whose position must stay within a safe range. The state space is 4-dimensional and the action space is 2-dimensional. We use a 3-layer fully connected neural network (each layer followed by a ReLU, and the last layer followed by a Sigmoid) as the controller. We follow the two simplifications of bastani2018verifiable below to set up the experiment:

    • We approximate the system using a finite time horizon T = 10;

    • we use a linear approximation f(s, a) ≈ As + Ba.

When generating the trajectory dataset, we uniformly sample the input from the input space and run the ground-truth program to get the trajectory. Specifically, each trajectory in the dataset is represented by a sequence of neural-network input-output pairs together with the index of the neural network being used. Taking Thermostat as an example, for one starting point we get a sequence of (input, output, index) tuples, where the i-th tuple represents the data collected in the i-th step of the program.

a.7 Training Procedure

Our framework is built on top of PyTorch (NEURIPS2019_9015). We use the Adam Optimizer (kingma2014adam) for all the experiments with a fixed learning rate and weight decay. We ran all the experiments using a single-thread implementation on a Linux system with Intel Xeon Gold 5218 2.30GHz CPUs and GeForce RTX 2080 Ti GPUs.

We represent the data loss L_data(θ) following imitation learning techniques. The trajectory dataset is converted into several sets of input-output example pairs, one set per neural network. L_data(θ) is calculated as the average of the mean squared errors of the individual neural networks over their corresponding input-output pair sets.

The safety loss of Dse follows the algorithm given in Section 4, with one symbolic state covering the input space. In DiffAI+, we use the sound join operation from classic abstract interpretation (cousot1977abstract) whenever a branch is encountered. Sound join means that the symbolic state after the branch block is the join of the symbolic states computed from each branch; the concretization of the joined symbolic state covers the union of the concretizations of the symbolic states from each branch. Intuitively, starting from one input symbolic state, DiffAI+ obtains one symbolic trajectory that covers all the potential trajectories starting from a concrete input satisfying the input symbolic state. For the safety loss of DiffAI+, we first split the input space evenly into 100 subregions to give DiffAI+ more accuracy. After that, we obtain 100 symbolic trajectories and compute the average of their safety losses as the training safety loss of DiffAI+.

We set the convergence criterion as a combination of the epoch number and the loss. In each training iteration, the final loss is L_data(θ) + λ · Q̂(θ). We set the maximum number of epochs for Thermostat, AC and Racetrack to 1500, 1200 and 6000, respectively. We stop early if the final loss decreases by less than 1% over 200 epochs.
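The stopping rule can be sketched as follows (ours; data_loss, safety_loss, params, and lam are placeholders for the quantities defined earlier, and the 1% threshold and 200-epoch window follow the criterion above).

# Training loop with the combined loss and the early-stopping window.
import torch

def train(params, data_loss, safety_loss, lam, max_epochs, lr=1e-3):
    opt = torch.optim.Adam(params, lr=lr)
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        opt.zero_grad()
        loss = data_loss() + lam * safety_loss()
        loss.backward()
        opt.step()
        if loss.item() < 0.99 * best:          # >1% improvement resets the window
            best, best_epoch = loss.item(), epoch
        elif epoch - best_epoch >= 200:        # early stop after a flat window
            break
    return params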

a.8 Training Performance

Figure 9 and Figure 10 illustrate the training performance in terms of safety loss and data loss when we vary the number of data points (2%, 10%, 20%, 50%, 100% of the training dataset) used by DiffAI+ and Dse. We can see that the safety loss of Dse converges, while the variance of the safety loss of DiffAI+ is quite large for these cases and it does not converge to a small safety loss.

While we found that starting from random initialization can give us safe programs for Thermostat, we achieved much quicker convergence by initializing the Thermostat programs with parameters trained purely on data and then continuing to train with the safety loss. We show the results and training performance for Thermostat with the data-loss-initialized program. The other benchmarks do not benefit much from data-loss initialization, so we keep training them from random initialization.

a.8.1 Safety Loss

Figure 9: Training performance of the safety loss on different benchmarks when varying the number of training data points. Panels (a)–(e) show Thermostat with 200, 1k, 2k, 5k and 10k data points; (f)–(j) show AC with 150, 750, 1.5k, 3.75k and 7.5k; (k)–(o) show Racetrack with 200, 1k, 2k, 5k and 10k; and (p)–(t) show Cartpole with 200, 1k, 2k, 5k and 10k. The y-axis represents the safety loss (Q̂) and the x-axis gives the number of training epochs. Overall, Dse converges within 1500, 1000, 6000, and 2000 epochs for Thermostat, AC, Racetrack and Cartpole, while DiffAI+ easily gets stuck or fluctuates on these benchmarks.

a.8.2 Data Loss

Figure 10: Training performance of the data loss on different benchmarks when varying the number of training data points. Panels (a)–(e) show Thermostat with 200, 1k, 2k, 5k and 10k data points; (f)–(j) show AC with 150, 750, 1.5k, 3.75k and 7.5k; (k)–(o) show Racetrack with 200, 1k, 2k, 5k and 10k; and (p)–(t) show Cartpole with 200, 1k, 2k, 5k and 10k. The y-axis represents the data loss (L_data) and the x-axis gives the number of training epochs. Since Thermostat is initialized with data-trained programs to give quicker convergence, its data loss does not change much during training. For the other benchmarks, the data losses of Dse and DiffAI+ both converge. Overall, Dse sacrifices some data loss to give safer programs.

a.8.3 Details of Training and Test Data

We also report the detailed training data loss (L_data), training safety loss (Q̂), test data loss and provably safe portion for Thermostat (Figure 11), AC (Figure 12) and Racetrack (Figure 13). Specifically, the training safety loss measures the expectation of the quantitative safety loss over the symbolic trajectories, and the provably safe portion measures the portion of symbolic trajectories that are provably safe. Intuitively, in training, the safety loss is larger when more symbolic trajectories are unsafe and when the unsafe symbolic trajectories are further away from the safe areas. In testing, the provably safe portion is larger when more symbolic trajectories are provably safe.

Data Size Approach Training Data Loss Training Safety Loss Test Data Loss Provably Safe Portion
200 Dse 0.13 0.02 0.24 0.99
DiffAI+ 0.07 25.37 0.21 0.28
1000 Dse 0.09 0.0 0.22 0.99
DiffAI+ 0.05 1.96 0.14 0.55
2000 Dse 0.08 0.0 0.21 0.99
DiffAI+ 0.05 2.70 0.15 0.46
5000 Dse 0.07 0.0 0.19 0.99
DiffAI+ 0.04 4.58 0.18 0.40
10000 Dse 0.04 0.0 0.19 0.99
DiffAI+ 0.02 1.23 0.19 0.80
Figure 11: Training and Test Results of Thermostat
Data Size Approach