# Formal Methods with a Touch of Magic

Machine learning and formal methods have complementary benefits and drawbacks. In this work, we address the controller-design problem with a combination of techniques from both fields. The use of black-box neural networks in deep reinforcement learning (deep RL) poses a challenge for such a combination. Instead of reasoning formally about the output of deep RL, which we call the wizard, we extract from it a decision-tree-based model, which we refer to as the magic book. Using the extracted model as an intermediary, we are able to handle problems that are infeasible for either deep RL or formal methods by themselves. First, we suggest, for the first time, incorporating a magic book in a synthesis procedure. We synthesize a stand-alone correct-by-design controller that enjoys the favorable performance of RL. Second, we incorporate a magic book in a bounded model checking (BMC) procedure. BMC allows us to find numerous traces of the plant under the control of the wizard, which a user can use to increase the trustworthiness of the wizard and direct further training.

## I Introduction

Machine-learning techniques and, in particular, the use of neural networks (NNs), are exploding in popularity and becoming a vital part of the development of many technologies. There is a challenge, however, in deploying systems that use trained components, which are inherently black-box. For a system to be used by a human, it must be trustworthy: provably correct or, at the least, predictable. Current trained systems lack both of these properties.

In this work, we focus on the controller-design problem. Abstractly speaking, a controller is a device that interacts with a plant. At each time step, the plant outputs its state and the controller feeds back an action. Combining techniques from both formal methods and machine learning is especially appealing in the controller-design problem, since it is critical both that the designed controller is correct and that it optimizes plant performance.

Reinforcement learning (RL) is the main machine-learning tool for designing controllers. The RL approach is based on "trial and error": the agent randomly explores its environment, receives rewards, and learns from experience how to maximize them. RL has made a quantum leap in terms of scalability since the recent introduction of NNs into the approach, termed deep RL [1]. We call the output of deep RL the wizard: it optimizes plant performance but, since it is an NN, it does not reveal its decision procedure. More importantly, there are no guarantees on the wizard, and it can behave unexpectedly and even incorrectly.

Reasoning about systems that use NNs poses a challenge for formal methods: first, in terms of scalability, since NNs tend to be large; and second, because the operations that NNs depend on are difficult for formal-methods tools, namely, NNs use numerical rather than Boolean operations, and ReLU neurons use the max operator, with which SMT solvers struggle.

We propose a novel approach based on extracting a decision-tree-based model from the wizard, which approximates its operation and is intended to reveal its decision-making process. Hence, we refer to this model as the magic book. Our requirements for the magic book are that it is (1) simple enough for formal methods to use, and (2) a good approximation of the NN.

Extracting decision-tree-based models that approximate a complicated function is an established practice [2]. The assumption that allows this extraction to work is that an NN contains substantial redundancy. During training, the NN "learns" heuristics that it uses to optimize plant performance. The heuristics can be compactly captured in a small model, e.g., in a decision tree. This assumption has led, for example, to attempts at distilling knowledge from a trained NN into a second NN during its training [3, 4], and at minimizing NNs (e.g., [5]). The extraction of a simple model is especially common in explainable AI (XAI) [6], where the goal is to explain the operation of a learned system to a human user.

We use the tree-based magic book to solve problems that are infeasible both for deep RL and for formal methods alone. Specifically, we illustrate the magic book’s benefit in two approaches for designing controllers as we elaborate below.

Reactive synthesis [7] is a formal approach to designing controllers. The input is a qualitative specification and the output is a correct-by-design controller. The fact that the controller is provably correct is the strength of synthesis. A first weakness of traditional synthesis is that it is purely qualitative and specifications cannot naturally express quantitative performance. There is a recent surge of quantitative approaches to synthesis (e.g., [8, 9, 10]). However, these approaches suffer from other weaknesses of synthesis: deep RL vastly outperforms synthesis in terms of scalability. Also, in the average case, RL-based controllers beat synthesized controllers, since the goal in synthesis is to maximize worst-case performance.

Synthesis is often reduced to solving a two-player graph game: Player 1 represents the controller and Player 2 represents the plant. In each step, Player 2 reveals the current state of the plant and Player 1 responds by choosing an action. In our construction, when Player 2 chooses an abstract state, we extract from the magic book the action that it takes there, and Player 1's action then depends on this suggestion, as we elaborate below. The construction of the game arena thus depends on the magic book, and using the wizard instead is infeasible.

We present a novel approach for introducing performance considerations into reactive synthesis. We synthesize a controller that satisfies a given qualitative specification while following the magic book as closely as possible. We formalize the latter as a quantitative objective: whenever Player 1 agrees with the choice of action suggested by Player 2, he receives a reward, and his goal is to maximize rewards. Since the magic book is a proxy for the RL-generated wizard, we obtain the best of both worlds: a provably correct controller that enjoys the high average-case performance of RL. In our experiments, we synthesize a controller for a taxi that travels on a grid for the specification "visit a gas station every fixed number of steps" while following advice from a wizard that is trained to collect as many passengers as possible in a given time frame.

In a second application, we use a magic book to relax the adversarial assumption on the environment in a multi-agent setting. We are thus able to synthesize controllers for specifications that are otherwise unrealizable, i.e., for which traditional synthesis does not return any controller. Our goal is to synthesize a controller for an agent that interacts with an environment that consists of other agents. Instead of modeling the other agents as adversarial, we assume that they operate according to a magic book. This restricts their possible actions and regains realizability. For example, suppose a taxi, which is out of our control, shares a network of roads with a bus, which is under our control. Our goal is to synthesize a controller that guarantees that the bus travels between two stations without crashing into the taxi. While an adversarial taxi can block the bus, by assuming that the taxi operates according to a magic book, we limit Player 2's actions in the game and find a winning Player 1 strategy that corresponds to a correct controller.

Bounded model checking [11] (BMC) is an established technique to find bounded traces of a system that satisfy a given specification. In a second approach to the controller-design problem, we use BMC as an XAI tool to increase the trustworthiness of a wizard before outputting it as the controller of the plant. We rely on BMC to find (many) traces of the plant under the control of the wizard that are tedious to find manually.

We solve BMC by constructing an SMT program that intuitively simulates the operation of the plant under the control of the magic book rather than under the control of the wizard. This leads to a simple reduction and a significant performance gain: in our experiments, we use the standard SMT solver Z3 [12] to extract thousands of witnesses within minutes, whereas Z3 is incapable of solving extremely modest wizard-based BMC instances. Since traces returned by BMC witness the magic book, a secondary simple test is required to check that the traces witness the wizard as well. In our experiments, we find that many traces are indeed shared between the two, since the magic book is a good approximation of the wizard. Thus, our procedure efficiently finds numerous traces of the plant under the control of the wizard.
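The secondary test amounts to a simple replay: step through the trace and check that the wizard would have chosen the same actions. A minimal sketch, where the `wizard` policy and the state encoding are hypothetical stand-ins for the trained model:

```python
def trace_witnesses(trace, actions, policy):
    """Check that `policy` reproduces `actions` along `trace`.

    trace   -- plant states s_0, ..., s_k
    actions -- actions a_0, ..., a_{k-1} chosen between them
    policy  -- positional policy mapping a state to an action
    """
    return all(policy(s) == a for s, a in zip(trace, actions))

def filter_wizard_traces(candidates, wizard):
    """Keep only the BMC-produced (trace, actions) pairs on which the wizard agrees."""
    return [(tr, acts) for tr, acts in candidates
            if trace_witnesses(tr, acts, wizard)]
```

In practice, the candidates come from magic-book BMC, and the fraction that survives this filter reflects how well MB approximates Wiz.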

A first application of BMC is in verification; namely, we find counterexamples for a given specification. For example, when controlling a taxi, a violation of a liveness property is an infinite loop in which no passenger is collected. We find it more appealing to use BMC as an XAI tool. For example, BMC allows us to find “suspicious” traces that are not necessarily incorrect; e.g., when controlling a taxi, a passenger that is not closest is collected first. Individual traces can serve as explanations. Alternatively, we use BMC’s ability to find many traces and gather a large dataset. We extract a small human-interpretable model from the dataset that attempts to explain the wizard’s decision-making procedure. For example, the model serves as an answer to the question: when does the wizard prefer collecting a passenger that is not closest?

### I-A Related work

We compare our synthesis approach to shielding [13, 14], which adds guarantees to a learned controller at runtime by monitoring the wizard and correcting its actions. Unlike shielding, the magic book allows us to open up the black-box wizard, which, for example, enables our controller to cross an obstacle that was not present in training, a task that is inherently impossible for a shield-based controller. A second key difference is that we produce stand-alone controllers, whereas a shield-based approach needs to execute the NN wizard in each step. Our method is thus preferable in settings where running an NN is costly, e.g., embedded or real-time systems.

To the best of our knowledge, synthesis in combination with a magic book has not been studied before. Previously, finding counterexamples for tree-based controllers that are extracted from NN controllers was studied in [15] and [16]. The ultimate goal in those works is to output a correct tree-based controller. A first weakness of this approach is that, since both wizard and magic book are trained, they exhibit many correctness violations; we believe that repairing them manually while maintaining high performance is a challenging task, and our synthesis procedure assists in automating it. Second, in some cases, a designer would prefer to use an NN controller rather than a tree-based one, since NNs tend to generalize better than tree-based models; hence, we promote the use of BMC for XAI to increase the trustworthiness of the wizard. Finally, the case studies those authors demonstrate are different from ours, which strengthens the claim that tree-based classifier extraction is not specific to our domain but rather a general concept.

A specialized wizard-based BMC tool was recently shown in [17]; unlike in our approach, there is no need to check that the output trace is also a witness for the wizard. More importantly, their method is "sound": if it terminates without finding a counterexample for a given bound, then there is indeed no violation of that length. Beyond the differences listed above, the main disadvantage of their approach is scalability, which is not made clear in the paper. As we describe in the experiments section, our experience is that a wizard-based BMC implemented in Z3 does not scale.

Our BMC procedure finds traces that witness a temporal behavior of the plant. This is very different from finding adversarial examples, which are inputs perturbed slightly so as to lead to a different output. Finding adversarial examples and verifying robustness have attracted considerable attention for NNs (for example, [18, 19, 20]) as well as for random-forest classifiers (e.g., [21, 22]).

Somewhat similar in spirit to our approach is the application of program synthesis to extract a program from an NN [23, 24]; like the magic book, the extracted program is an alternative small model to which formal methods can be applied.

Finally, examples of other combinations of RL with synthesis include works that run an online version of RL (see [25] and references therein), an execution of RL restricted to correct traces [26], and RL with safety specifications [27].

## II Preliminaries

#### Plant and controller

We formalize the interaction between a controller and a plant. The plant is modelled as a Markov decision process (MDP) $\mathcal{M} = \langle S, s_0, A, R, P \rangle$, where $S$ is a finite set of states, $s_0 \in S$ is an initial state, $A$ is a finite collection of actions, $R$ is a reward provided in each state, and $P: S \times A \to \mathcal{D}(S)$ is a probabilistic transition function that, given a state and an action, produces a probability distribution over states.

###### Example 1.

Our running example throughout the paper is a taxi that travels on an $n \times n$ grid and collects passengers. Whenever a passenger is collected, it re-appears in a random location. A state of the plant contains the locations of the taxi and the passengers, thus it is a tuple $\langle t, p_1, \ldots, p_m \rangle$, where each component is a position on the grid: $t$ is the position of the taxi and, for $1 \leq i \leq m$, the pair $p_i$ is the position of Passenger $i$. The set of actions is $A = \{\text{up}, \text{down}, \text{left}, \text{right}\}$. The transitions of $P$ are largely deterministic: given an action $a \in A$, we obtain the updated state by updating the position of the taxi deterministically, and if the taxi collects a passenger, i.e., $t = p_i$ for some $i$, then the new position of Passenger $i$ is chosen uniformly at random.
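This plant can be sketched in a few lines of Python; the 5×5 default grid size, the clamping at the borders, and all names below are illustrative choices, not fixed by the text:

```python
import random

def taxi_step(state, action, n=5, rng=random):
    """One plant transition. `state` is (taxi, p_1, ..., p_m), each an (x, y) cell.

    The taxi moves deterministically (clamped to the grid); a collected
    passenger re-appears uniformly at random -- the only probabilistic part."""
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    taxi, passengers = state[0], list(state[1:])
    dx, dy = moves[action]
    taxi = (min(n - 1, max(0, taxi[0] + dx)), min(n - 1, max(0, taxi[1] + dy)))
    reward = 0
    for i, p in enumerate(passengers):
        if p == taxi:  # passenger i collected: reward and random respawn
            reward += 1
            passengers[i] = (rng.randrange(n), rng.randrange(n))
    return (taxi, *passengers), reward
```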

The controller is a policy, which prescribes which action to take given the history of visited states, thus it is a function $\pi: S^+ \to A$. A policy is positional if the action that it prescribes depends only on the current state, thus it is a function $\pi: S \to A$. We are interested in finding an optimal and correct policy, as we define below.

#### Qualitative correctness

We consider a strong notion of qualitative correctness that disregards probabilistic events, often called surely correctness. A specification is a set of infinite runs $\varphi \subseteq S^\omega$. We define the support of $P$ at $s$ given $a$ as $\text{supp}(P(s, a)) = \{s' \in S : P(s, a)(s') > 0\}$ and, for a policy $\pi$, we define the support of $\pi$ at a history $s_0, \ldots, s_i$ to be $\text{supp}(P(s_i, \pi(s_0, \ldots, s_i)))$. We define the surely language of $\mathcal{M}$ w.r.t. $\pi$, denoted $L(\mathcal{M}, \pi)$, as follows. A run $r = s_0, s_1, \ldots$ is in $L(\mathcal{M}, \pi)$ iff $s_0$ is the initial state and, for every $i \geq 0$, we have $s_{i+1} \in \text{supp}(P(s_i, a_i))$, where $a_i = \pi(s_0, \ldots, s_i)$. We say that $\pi$ is surely-correct for plant $\mathcal{M}$ w.r.t. a specification $\varphi$ iff it allows only correct runs of $\mathcal{M}$, thus $L(\mathcal{M}, \pi) \subseteq \varphi$.

#### Quantitative performance and deep reinforcement learning

The goal of reinforcement learning (RL) is to find a policy in an MDP that maximizes the expected reward [28]. In a finite MDP $\mathcal{M}$, the state at a time step $t$ is a random variable, denoted $S_t$. Each time step entails a reward, which is also a random variable, denoted $R_t$. The probability that $S_t$ and $R_t$ get particular values depends solely on the previous state and action. Formally, for the initial state $s_0$, we define $\Pr[S_0 = s_0] = 1$, and for $t \geq 0$ and a policy $\pi$, we have $\Pr[S_{t+1} = s' \mid S_t = s] = P(s, a)(s')$, where $a$ is the action that $\pi$ prescribes. We consider discounted rewards. Let $\gamma \in (0, 1)$ be a discount factor. The expected reward that a policy $\pi$ ensures starting at state $s$ is $V^\pi(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t R_t\right]$, where the distribution is defined w.r.t. $\pi$ as in the above. The goal is to find an optimal policy $\pi^*$ that attains $V^*(s) = \max_\pi V^\pi(s)$.

We consider the Q-learning algorithm for solving MDPs, which relies on a function $Q: S \times A \to \mathbb{R}$ such that $Q(s, a)$ represents the expected value under the assumption that the initial state is $s$ and the first action to be taken is $a$, thus $V^*(s) = \max_{a \in A} Q(s, a)$. Clearly, given the function $Q$, one can obtain an optimal positional policy $\pi^*$ by defining $\pi^*(s) = \arg\max_{a \in A} Q(s, a)$, for every state $s \in S$. In Q-learning, the Q function is estimated and iteratively refined using the Bellman equation.
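A tabular sketch of this refinement; the learning rate, discount, and the toy one-state MDP in the usage note are illustrative assumptions:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Move the estimate Q(s, a) toward the Bellman target r + gamma * max_b Q(s', b)."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy_policy(Q, s, actions):
    """The positional policy induced by Q: the action with the largest Q-value."""
    return max(actions, key=lambda a: Q[(s, a)])
```

On a one-state MDP where action `good` always yields reward 1 and `bad` yields 0, repeated updates drive `greedy_policy` to select `good`.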

Traditional implementations of Q-learning assume that the MDP is represented explicitly. Deep RL [1] implements the Q-learning algorithm symbolically, representing the Q function as an NN. The NN takes as input a state $s$ and outputs, for each $a \in A$, an estimate of $Q(s, a)$. The technical challenge in deep RL is that it combines training of the NN with estimating the Q function. We call the NN that deep RL outputs the wizard. Even though deep RL does not provide any guarantees on the wizard, in practice it has shown remarkable success.

#### Magic books from decision-tree-based classifiers

Recall that the output of deep RL is a positional function Wiz that is represented by an NN. We are interested in extracting a small function MB of the same type that approximates Wiz well. We use decision-tree-based classifiers as our model of choice for MB. Each internal node of a decision tree is labeled with a predicate over the state variables and each leaf is labeled with an action in $A$. A plant state $s$ gives rise to a unique path in a decision tree $T$, denoted $\text{path}(T, s)$, in the expected manner: the first node is the root and, upon visiting an internal node $v$, the next node in the path depends on the satisfaction value of $v$'s predicate. Suppose $\theta_1, \ldots, \theta_k$ is the sequence of predicates traversed by a path $\rho$; we use $\theta(\rho)$ to denote their conjunction, with each predicate negated according to the branch the path takes. Thus, for every state $s$ we have $\text{path}(T, s) = \rho$ iff $s$ satisfies $\theta(\rho)$. When $\text{path}(T, s)$ ends in a leaf labeled $a$, we say that the tree votes for $a$. A forest contains several trees. On input $s$, each tree votes for an action and the action receiving the most votes is output.
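The path-and-vote semantics can be sketched directly; the nested-tuple tree encoding below is an illustrative choice, not the paper's representation:

```python
from collections import Counter

# A tree is either a leaf ("leaf", action) or an internal node
# ("node", predicate, subtree_if_true, subtree_if_false).
def tree_vote(tree, state):
    """Follow the unique path of `state` down the tree; return the action at its leaf."""
    while tree[0] == "node":
        _, pred, if_true, if_false = tree
        tree = if_true if pred(state) else if_false
    return tree[1]

def forest_vote(trees, state):
    """Each tree votes for an action; the action receiving the most votes is output."""
    return Counter(tree_vote(t, state) for t in trees).most_common(1)[0][0]
```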

To obtain MB from Wiz, we first execute Wiz with the plant for a considerable number of steps to collect pairs of the form $\langle s_t, \text{Wiz}(s_t) \rangle$, where $s_t$ is the plant state at time $t$. We then employ standard techniques on this dataset to construct either a decision tree or a forest of decision trees, using state-of-the-art techniques such as random forests [29] or the method of [30].
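The extraction pipeline is a plain loop followed by off-the-shelf tree learning. The sketch below collects the dataset and, in place of a real learner, fits a single-threshold stump just to show the interface; `wizard` and `plant_step` are hypothetical stand-ins for the trained model and the simulator:

```python
import random
from collections import Counter

def collect_dataset(wizard, plant_step, s0, n_steps, rng=None):
    """Run the plant under the wizard's control, recording (state, action) pairs."""
    rng = rng or random.Random(0)
    data, s = [], s0
    for _ in range(n_steps):
        a = wizard(s)
        data.append((s, a))
        s = plant_step(s, a, rng)
    return data

def fit_stump(data, feature):
    """Toy stand-in for tree learning: one threshold split on `feature`,
    predicting the majority wizard action on each side of the split."""
    majority = lambda acts: Counter(acts).most_common(1)[0][0]
    vals = sorted(feature(s) for s, _ in data)
    thr = vals[len(vals) // 2]
    below = [a for s, a in data if feature(s) < thr] or [a for _, a in data]
    above = [a for s, a in data if feature(s) >= thr]
    lo_a, hi_a = majority(below), majority(above)
    return lambda s: hi_a if feature(s) >= thr else lo_a
```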

###### Remark 1.

One might wonder whether the wizard is an essential step in the construction of the magic book; that is, whether it is possible to obtain a decision tree directly from RL. While some attempts have been made to use decision trees to succinctly represent a policy [31], the combination of decision trees with RL is not as natural as it is with other models (such as NNs); it has never shown great success and has largely been abandoned. Thus, we argue that the wizard is indeed essential. Extracting a decision-tree controller from an NN was also done in [15, 16].

## III Synthesis with a Touch of Magic

Our primary goal in this section is to automatically construct a correct controller; performance is a secondary consideration. We incorporate a magic book into synthesis and illustrate two applications of the construction that are infeasible without a magic book.

### III-A Constructing a game

Synthesis is often reduced to a two-player graph game (see [32]). In this section, we describe a construction of a game arena that is based on a magic book; in the next sections we complete the construction by describing the players' objectives and illustrate applications. In the traditional game, Player 2 represents the environment and, in each turn, he reveals the current state of the plant. Player 1, who represents the controller, answers with an action. A strategy for Player 1 corresponds to a policy (controller) since, given the history of observed plant states, it prescribes which action to feed to the plant next. The traditional goal is to find a Player 1 strategy that guarantees that a given specification is satisfied no matter how Player 2 plays. Traditional synthesis is purely qualitative; namely, it returns some correct policy with no consideration of its performance. When no correct controller exists, we say that the specification is unrealizable.

Formally, a graph game is played on an arena $\langle V, A_1, A_2, \delta \rangle$, where $V$ is a set of vertices, for $i \in \{1, 2\}$, Player $i$'s possible actions are $A_i$, and $\delta: V \times A_1 \times A_2 \to V$ is a deterministic transition function. The game proceeds by placing a token on a vertex in $V$. When the token is placed on $v \in V$, Player 2 moves first and chooses $b \in A_2$; then, Player 1 chooses $a \in A_1$ and the token proceeds to $\delta(v, a, b)$. In games, rather than using the term "policy", we use the term strategy. Two strategies $\sigma_1$ and $\sigma_2$ for the two players and an initial vertex $v_0$ induce a unique infinite play, which we denote by $\text{play}(v_0, \sigma_1, \sigma_2)$.

We describe our construction, in which the roles of the players are slightly altered. Consider a plant with state space $S$ and actions $A$. The arena of our synthesis game is based on two abstractions $\mathcal{U}$ and $\mathcal{B}$ of $S$, i.e., collections of sets of states. While we assume $\mathcal{U}$ is provided by a user, the partition $\mathcal{B}$ is extracted from the magic book. The arena is $\langle \mathcal{U}, \mathcal{B}, A, \delta \rangle$, where $\delta$ is defined below. Suppose that the token is placed on $U \in \mathcal{U}$ (see Fig. 1). Intuitively, the actual location of the plant is a state $s$ with $s \in U$. Player 2 moves first and chooses a set $B \in \mathcal{B}$ such that $U \cap B \neq \emptyset$. Intuitively, a Player 2 action reveals that the actual state of the plant is in $U \cap B$. Player 1 reacts by choosing an action $a \in A$. We denote by $\text{succ}(U \cap B, a)$ the set of possible next locations the plant can be in, thus $\text{succ}(U \cap B, a) = \bigcup_{s \in U \cap B} \text{supp}(P(s, a))$. Then, the next state in the game according to $\delta$ is the minimal-sized set $U' \in \mathcal{U}$ such that $\text{succ}(U \cap B, a) \subseteq U'$.

Suppose, for ease of presentation, that the magic book is a decision tree $T$; the construction easily generalizes to forests. Recall that a state $s$ produces a unique path $\text{path}(T, s)$, which corresponds to a conjunction of predicates $\theta(\text{path}(T, s))$. We define $\mathcal{B}$ to consist of the sets of states that agree on their path, i.e., the sets $\{s \in S : s \text{ satisfies } \theta(\rho)\}$ for paths $\rho$ of $T$. An immediate consequence of the construction is the following.
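The partition induced by a tree can be computed by grouping states according to the branch decisions along their paths. A small sketch (the nested-tuple tree encoding is an illustrative choice):

```python
# A tree is either a leaf ("leaf", action) or an internal node
# ("node", predicate, subtree_if_true, subtree_if_false).
def tree_path(tree, state):
    """Return the branch decisions along `state`'s unique path and the leaf action.
    Two states with the same decisions satisfy the same conjunction of
    (possibly negated) node predicates."""
    branches, node = [], tree
    while node[0] == "node":
        _, pred, if_true, if_false = node
        taken = bool(pred(state))
        branches.append(taken)
        node = if_true if taken else if_false
    return tuple(branches), node[1]

def magic_book_partition(tree, states):
    """Group concrete states into abstract cells, one per tree path; by
    construction, every state in a cell is assigned the same action (cf. Lemma 1)."""
    cells = {}
    for s in states:
        cells.setdefault(tree_path(tree, s), []).append(s)
    return cells
```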

###### Lemma 1.

For every $B \in \mathcal{B}$ there is an action $a \in A$ such that $\text{MB}(s) = a$, for all $s \in B$.

In the following lemma we formalize the intuition that Player 2 over-approximates the plant. It is not hard, given a Player 1 strategy $\sigma_1$, to obtain a policy $\pi$ that follows it. For $s \in S$, we use $U(s)$ and $B(s)$ to denote the unique abstract sets in $\mathcal{U}$ and $\mathcal{B}$, respectively, that $s$ belongs to.

###### Lemma 2.

Let $\sigma_1$ be a Player 1 strategy and let $\pi$ be a policy that follows it. Consider a trace $r = s_0, s_1, \ldots \in L(\mathcal{M}, \pi)$. Then, there is a Player 2 strategy $\sigma_2$ such that, for every $i$, the $i$-th vertex of $\text{play}(U(s_0), \sigma_1, \sigma_2)$ is $U(s_i)$.

###### Proof.

We define $\sigma_2$ inductively so that, for every $i$, the $i$-th vertex of the play is $U(s_i)$. Suppose the invariant holds for $i$; Player 2 chooses $B(s_i)$. The definition of $\delta$ implies that the invariant is maintained, thus the $(i+1)$-th vertex is $U(s_{i+1})$. ∎

We note that the converse of Lemma 2 is not necessarily correct, thus Player 2 strictly over-approximates the plant. Indeed, suppose that the token is placed on $U$, Player 2 chooses $B$, Player 1 chooses $a$, and the token proceeds to $U'$. Intuitively, the plant state was in $U \cap B$ and thus should now be in $\text{succ}(U \cap B, a)$. In the subsequent move, however, Player 2 is allowed to choose any $B'$ with $U' \cap B' \neq \emptyset$, even one that does not intersect $\text{succ}(U \cap B, a)$.

### III-B Synthesis with performance considerations

In this section, we abstain from solving the problem of finding a correct and optimal controller, a problem that is computationally hard for explicit systems, not to mention symbolically-represented systems like the ones we consider. Instead, in order to add performance considerations to synthesis, we think of the wizard as an authority on performance and solve the (hopefully simpler) problem of constructing a correct controller that follows the wizard's actions as closely as possible. We use the magic book as a proxy for the wizard and assume that following its actions most of the time results in favorable performance.

The game arena is constructed as in the previous section. Player 1's goal is to ensure that a given specification $\varphi$ is satisfied while optimizing a quantitative objective that we use to formalize the notion of "following the magic book". For simplicity, we consider finite plays; the definitions can be generalized to infinite plays. By Lem. 1, every Player 2 action $B$ corresponds to a unique action in $A$, which we denote by $\text{MB}(B)$. We think of Player 2 as "suggesting" the action $\text{MB}(B)$ since, for every $s \in B$, we have $\text{MB}(s) = \text{MB}(B)$. To motivate Player 1 to use the suggestion, when he "accepts" it and chooses the same action, he obtains a reward of $1$, and otherwise he obtains no reward. Then, Player 1's goal in the game is to maximize the sum of rewards that he obtains.

We formalize the guarantees of the controller that we synthesize w.r.t. an optimal strategy $\sigma_1$ for Player 1. Intuitively, the payoff that $\sigma_1$ guarantees in the game is a lower bound on the number of times the corresponding policy agrees with the magic book in any trace of the plant. Let $\sigma_1$ and $\sigma_2$ be two strategies for the two players. We use $\text{pay}(\sigma_1, \sigma_2)$ to denote the payoff of Player 1 in the game. When the play violates $\varphi$, we set $\text{pay}(\sigma_1, \sigma_2) = -\infty$, thus Player 1 first tries to ensure that $\varphi$ holds. If $\varphi$ holds, the payoff is the sum of rewards in the play. We assign a score to a policy $\pi$ in a path-based manner. Let $r = s_0, \ldots, s_n$ be a trace. For every $0 \leq i \leq n$, we issue a reward of $1$ if $\pi(s_0, \ldots, s_i) = \text{MB}(s_i)$, and we denote by $\text{score}(\pi, r)$ the sum of rewards, which represents the number of states in which $\pi$ agrees with MB throughout $r$. The following theorem follows from Lem. 2.
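The path-based score is just an agreement count. A sketch, with both policies taken to be positional for simplicity:

```python
def score(policy, magic_book, trace):
    """Number of states along `trace` at which `policy` and the magic book agree;
    the payoff of an optimal strategy lower-bounds this count on every trace."""
    return sum(1 for s in trace if policy(s) == magic_book(s))
```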

###### Theorem 1.

Let $\sigma_1$ be a Player 1 strategy that achieves the optimal payoff $c$, and let $\pi$ be a policy that follows it. If $c \neq -\infty$, then $\pi$ is surely-correct w.r.t. $\varphi$. Moreover, for every trace $r \in L(\mathcal{M}, \pi)$ we have $\text{score}(\pi, r) \geq c$.

### III-C Multi-agent synthesis

In this section, we design a controller in a multi-agent setting in which the specification is unrealizable for traditional synthesis, which thus does not return any controller.

For ease of presentation, we focus on two agents; the construction can be generalized to more agents in a straightforward manner. We assume that the set of actions is partitioned between the two agents: in each step, the agents simultaneously select actions, where for $i \in \{1, 2\}$, Agent $i$ selects an action in $A_i$. As before, the joint action determines a probability distribution on the next state according to $P$. Our goal is to find a controller for Agent 1 that satisfies a given specification no matter how Agent 2 plays.

###### Example 2.

Suppose that the grid has two means of transportation: a bus (Agent 1) and a taxi (Agent 2). We are interested in synthesizing a bus controller for the specification "travel between two stations while not hitting the taxi". If one models the taxi as an adversary, the specification is clearly not realizable: the taxi can park at one of the target stations so that the bus cannot visit it without crashing into the taxi.

We assume that Agent 2 operates according to a magic book. As in the previous section, we require an abstraction of the state space, and the abstraction is obtained from the magic book. We construct a game arena as in Section III-A, and Player 1 wins an infinite play iff it satisfies the specification $\varphi$.

The way the magic book is employed here is that it restricts the possible actions that Player 2 can take. Going back to the taxi-and-bus example, at each state, Player 2 essentially chooses how to move the taxi. Suppose the token is placed on an abstract state $U$. Player 2 cannot choose to move the taxi in an arbitrary direction; indeed, he can choose a taxi action $b$ only when there is a state $s \in U$ such that the magic book selects $b$ at $s$. The following theorem is an immediate consequence of Lem. 2.

###### Theorem 2.

Let $\sigma_1$ be a winning Player 1 strategy: for every Player 2 strategy $\sigma_2$, the play $\text{play}(v_0, \sigma_1, \sigma_2)$ satisfies $\varphi$. Then, a policy that follows $\sigma_1$ is surely-correct w.r.t. $\varphi$ under the assumption that Agent 2 operates according to MB.

In Remark 3 we discuss the guarantees on the magic book that are needed in order to assume that Agent 2 operates according to the wizard rather than the magic book.

## IV BMC Based on Magic Books

In this section, we describe a bounded-model-checking (BMC) [11] procedure that is based on a tree-based magic book. We use our procedure in verification and as an explainability tool to increase the trustworthiness of the wizard before outputting it as the controller for the plant.

###### Definition 1 (Bounded model checking).

Given a plant with state space $S$, a specification $\varphi$, a bound $k \in \mathbb{N}$, and a policy $\pi$, output a run of length $k$ in $L(\mathcal{M}, \pi) \cap \varphi$ if one exists.

BMC reduces to the satisfiability problem for satisfiability modulo theories (SMT), where the goal is, given a set of constraints over a set of variables, to either find a satisfying assignment or return that none exists. We are interested in solving BMC for wizards, i.e., finding a path in $L(\mathcal{M}, \text{Wiz}) \cap \varphi$. However, as can be seen in the proof of Thm. 3 below, the SMT program needs to simulate the execution of the policy; for the wizard, it thus becomes both large and challenging (due to the max operator) for standard SMT solvers. Instead, we solve BMC for magic books to find a path in $L(\mathcal{M}, \text{MB}) \cap \varphi$. Since MB is a good approximation of Wiz, such a path is often also a path in $L(\mathcal{M}, \text{Wiz})$.

###### Theorem 3.

BMC reduces to SMT. Specifically, given a plant with state space $S$, a specification $\varphi$, a policy $\pi$ given as a tree-based magic book, and a bound $k$, there is an SMT formula whose satisfying assignments correspond to paths of length $k$ in $L(\mathcal{M}, \pi) \cap \varphi$.

###### Proof.

The first steps of the reduction are standard. Consider a policy $\pi$ and a bound $k$. The variables consist of state variables $x_0, \ldots, x_k$ and action variables $y_0, \ldots, y_{k-1}$. We add constraints so that, for a satisfying assignment, each $x_i$ corresponds to a state in $S$ and each $y_i$ corresponds to an action in $A$. Moreover, for every $0 \leq i < k$, the constraints ensure that the state assigned to $x_{i+1}$ is a possible successor of the state assigned to $x_i$ under the action assigned to $y_i$, thus we obtain a path in $\mathcal{M}$.

We consider a specification $\varphi$ that can be represented as an SMT constraint over the state variables and add constraints so that the path we find is in $\varphi$.

The missing component of this construction ensures that the action assigned to $y_i$ is indeed the action that $\pi$ selects at the state assigned to $x_i$. For that, we need to simulate the operation of $\pi$ using constraints. Suppose first that $\pi$ is represented using a single decision tree $T$. For a path $\rho$ in $T$, recall that $\theta(\rho)$ is the predicate that is satisfied by exactly the states $s$ such that $\text{path}(T, s) = \rho$. Moreover, recall that each $\theta(\rho)$ is a predicate over the state variables. For $0 \leq i < k$, we create a copy $\theta_i(\rho)$ of $\theta(\rho)$ over the variables $x_i$, so that it is satisfied iff the state at time $i$ satisfies $\theta(\rho)$. For an action $a \in A$, let $\text{paths}(a)$ denote the set of paths in $T$ that end in the action $a$. We add a constraint stating that if $\theta_i(\rho)$ is true at time $i$ for some $\rho \in \text{paths}(a)$, then $y_i = a$. Finally, when MB is a forest, we need to count the number of trees that vote for each action and set $y_i$ to equal the action with the highest count. ∎
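To make the encoding concrete without a solver, the same unrolling can be phrased as a backtracking search: the plant's randomness becomes nondeterministic branching over successor states, the magic book fixes each action, and the property is checked on complete length-k paths. In a real implementation, each of these pieces would be emitted as Z3 constraints; the sketch below uses plain recursion as a stand-in, with all names illustrative:

```python
def bounded_model_check(s0, policy, successors, prop, k):
    """Search for a path s_0, ..., s_k of the plant under `policy` with prop(path).

    successors(s, a) -- finite list of possible next states (the randomness of
                        the plant replaced by nondeterminism, as in the encoding)
    policy(s)        -- the action the magic book selects at s
    prop(path)       -- the bounded specification, evaluated on complete paths
    """
    def search(path):
        if len(path) == k + 1:
            return path if prop(path) else None
        s = path[-1]
        for s_next in successors(s, policy(s)):
            witness = search(path + [s_next])
            if witness is not None:
                return witness
        return None
    return search([s0])
```

On a three-state ring with the lasso property "the path returns to its start", a bound of 3 yields a witness while a bound of 2 yields none.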

###### Remark 2.

(The size of the SMT program). In the construction in Theorem 3, as is standard in BMC, we use roughly $k$ copies of the transition constraints, where the size of each copy depends on the representation size of $P$. In addition, we need a constraint that represents $\varphi$, which, in our examples, is linear in $k$. The main bottleneck is the set of constraints that represent MB. Each path appears exactly once in a constraint, and we use $k$ copies, thus the total size of these constraints is proportional to $k$ times the total number of paths in the trees of the forest.

###### Example 3.

Recall the description of the plant in Example 1, in which a taxi travels on a grid. We illustrate how to simulate the plant using an SMT program. A state at time $i$ is a tuple of variables that take integer values in the range of grid coordinates; among them are the coordinates of the taxi at time $i$ and the coordinates of each Passenger $j$. The transition function is represented using constraints. For example, a constraint stating that when the action up is taken at time $i$, the taxi's vertical coordinate at time $i+1$ increases by one, means that the taxi moves one step up. A constraint stating that if Passenger $j$ is not collected by the taxi at time $i$, then its location at time $i+1$ equals its location at time $i$, ensures that uncollected passengers do not move. A key point is that when Passenger $j$ is collected, we do not constrain its new location, thus we replace the randomness in $P$ with nondeterminism.

#### Verification

In verification, our goal is to find runs of the plant under the wizard's control that violate a given specification.

###### Example 4.

We show how to express the specification "the taxi never enters a loop in which no passenger is collected" as an SMT constraint, based on the construction in Example 3. We simplify slightly and use a constraint stating that the taxi returns to its initial position, closing a cycle at the end of the trace, together with a second constraint stating that all passengers stay in their original positions throughout the trace. In Fig. 3 (right), we depict a lasso-shaped trace that witnesses a violation of this property.

###### Remark 3 (Soundness).

The benefit of using magic books is scalability, and the drawback is soundness. For example, when the SMT formula is unsatisfiable for a given bound, this only means that there are no violations of the magic book up to that length, and there can still be a violation of the wizard. To regain soundness, we would need guarantees on the relation between the magic book and the wizard. An example of a guarantee is that the two functions coincide on every state. However, if at all possible, we expect such a strong guarantee to come at the expense of a huge magic book, thus bringing us back to square one. We are more optimistic that one can find small magic books with approximation guarantees. For example, one can define a magic book as a function that “suggests” a set of actions rather than only one, and require that, for every state, the wizard's action belongs to the suggested set. Such guarantees suffice to regain soundness both in BMC and for the synthesis application in Section III-C. We leave obtaining such magic books for future work.

#### Explainability

We illustrate how BMC can be used as an XAI tool. BMC allows us to find corner-case traces that are hard to find in a manual simulation, and the individual traces can serve as explanations. For example, in Fig. 3 (left), we depict a trace that is obtained using BMC for the property “the first passenger to be collected is not the closest”.

A second application of BMC is based on gathering a large number of traces. We construct a small human-readable model that explains the decision procedure of the wizard. We note that while the magic book is already a small model that approximates the wizard, its size is far too large for a human to reason about. For us, a small model is a single decision tree of small depth. Moreover, the magic book is a “local” function: it maps states to actions, whereas a human is typically interested in “global” behavior, e.g., which passenger is collected next rather than which action is taken next.

We rely on the user to supply specifications. We gather a dataset that consists of configuration–specification pairs, where each configuration is such that when the plant starts at that configuration under the control of the wizard, the specification is satisfied. To find many traces that satisfy a specification, we iteratively call an SMT solver. Suppose it finds a trace. Then, before the next call, we add a blocking constraint to the SMT program so that the same trace is not found again. In practice, the amortized running time of this simple algorithm is low. One reason is that generating the SMT program takes considerable time, even compared to the time it takes to solve it. This generation time is amortized over all executions, since the running time of adding a single constraint is negligible. In addition, the SMT solver learns the structure of the SMT program and uses it to speed up subsequent executions.
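The enumeration loop can be sketched in Python with a brute-force "solver" standing in for the SMT solver (the toy constraint system is illustrative); the key mechanism is the blocking constraint added after each solution.

```python
import itertools

def solve(constraints, domain):
    """Return the first assignment (a pair over `domain`) satisfying all
    constraints, or None. Stand-in for one SMT solver call."""
    for assignment in itertools.product(domain, repeat=2):
        if all(c(assignment) for c in constraints):
            return assignment
    return None

def enumerate_solutions(constraints, domain, limit):
    found = []
    while len(found) < limit:
        sol = solve(constraints, domain)
        if sol is None:
            break
        found.append(sol)
        # blocking constraint: exclude this exact solution from now on
        constraints.append(lambda a, s=sol: a != s)
    return found

# toy "specification": the two coordinates sum to 3
sols = enumerate_solutions([lambda a: a[0] + a[1] == 3], range(4), limit=10)
print(sols)  # [(0, 3), (1, 2), (2, 1), (3, 0)]
```

With a real SMT solver, each blocking constraint is a clause negating the found model, and incremental solving lets the solver reuse learned structure across calls, which is exactly the amortization effect described above.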

###### Example 5.

Suppose we are interested in understanding if and how the wizard prioritizes collecting passengers. We consider the specifications “this passenger is collected first”, one per passenger. Each can be formalized using the following constraints. One constraint states that a passenger is not collected, since it stays in place throughout the whole trace; we add such a constraint for all but one passenger. A second constraint states that the remaining passenger must have been collected at least once, since its final position differs from its initial position. In Fig. 4 we depict a tree that we extract using these specifications.
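A Python sketch of the check these constraints encode, applied to a concrete trace (positions are illustrative): exactly one passenger's final position differs from its initial one, while all the others stay put.

```python
def collected_first(trace):
    """trace: one list of positions per passenger, indexed by time.
    Return the index of the passenger the trace witnesses as collected,
    following the constraints of Example 5: all others stay put for the
    whole trace, and the witness's final position differs from its first."""
    moved = [i for i, positions in enumerate(trace)
             if positions[-1] != positions[0]]
    stayed_put = [i for i, positions in enumerate(trace)
                  if all(p == positions[0] for p in positions)]
    if len(moved) == 1 and len(stayed_put) == len(trace) - 1:
        return moved[0]
    return None  # the trace does not witness any of the specifications

# passenger 1 is relocated once; passengers 0 and 2 never move
trace = [[(0, 0)] * 4, [(2, 2), (2, 2), (4, 1), (4, 1)], [(3, 3)] * 4]
print(collected_first(trace))  # 1
```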

## V Experiments

#### Setup

We illustrate our approach using an implementation of the case study that is our running example: a taxi traveling on a grid and collecting passengers. With the grid size and number of passengers we use, the state space is very large. All simulations were programmed in Python and run on a personal computer with an Intel Core i3-4130 3.40GHz CPU and 7.7 GiB of memory, running Ubuntu.

#### Training a wizard using deep RL

The plant state in our training is a tuple that, for each passenger, contains the distances to the taxi on both axes. When the taxi collects a passenger, the agent receives a fixed reward. Multi-objective RL is notoriously difficult because the agent gets confused by the various targets. We thus found it useful to add a “hint” when the taxi does not collect a passenger: at each time step in which no passenger is collected, the agent receives a negative reward proportional to the Manhattan distance between the taxi and a passenger. We use the Python library Keras [33] and the “Adam” optimizer [34] to minimize mean-squared-error loss. We train a NN with two hidden ReLU layers and a linear output layer. Each episode consists of a fixed number of steps, and we train for a fixed number of episodes.
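A sketch of the shaped reward in Python; the constants and the choice of the nearest passenger for the hint term are assumptions, since the exact values are elided in the text.

```python
# Reward shaping for the training setup, with illustrative constants: a
# fixed bonus when a passenger is collected, and otherwise a small "hint"
# penalty proportional to the nearest passenger's Manhattan distance,
# nudging the taxi toward someone to collect.
COLLECT_REWARD = 10.0   # assumed value; the paper's constant is elided
HINT_SCALE = -0.1       # assumed value

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def reward(taxi, passengers, collected):
    if collected:
        return COLLECT_REWARD
    return HINT_SCALE * min(manhattan(taxi, p) for p in passengers)

print(reward((0, 0), [(2, 3), (4, 4)], collected=False))  # -0.5
```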

#### Extracting the magic book

We extract configuration–action pairs from episodes of the trained agent. We use Python's scikit-learn library [35] to fit one of the tree-based classification models to the obtained dataset. Table I depicts a comparison between the models and the wizard over a number of episodes. Performance refers to the total number of passengers collected in a simulation. It is encouraging that small forests with shallow trees approximate the wizard well.

The specification we consider is “reach a gas station every time steps”, for some . Our controllers exhibit performance that is not too far from the wizard: see Table I for the performance with and synthesis based on different tree models (take into account that the wizard does not visit the gas station). We view this experiment as a success: we achieve our goal of synthesizing a correct controller that achieves favorable performance. We point out that since traditional synthesis does not address performance, a controller that it produces visits the gas station every steps but does not collect any passenger.

#### Comparing with a shield-based approach

A shield-based controller [13, 14] consists of a shield that uses a wizard as a black box: given a plant state , the wizard is run to obtain , then is fed to the shield to obtain , which is issued to the plant. We demonstrate how our synthesis procedure manages to open up the black-box wizard. In Fig. 5, we depict the result of an experiment in which we add a wall to the grid that was not present in training. Crossing a wall is inherently impossible for the shield-based controller since when the wizard suggests an action that is not allowed, the best the shield can do is choose an arbitrary substitute. Our controller, on the other hand, intuitively directs the taxi to areas in the grid where the magic book is “certain” of its actions (a notion which is convenient to define when the magic book is a forest). Since these positions are often located near passengers, the taxi manages to cross the wall.
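One convenient way to define the forest's “certainty” at a state, sketched in Python (the precise definition is our assumption; the text leaves it open), is the fraction of trees voting for the winning action.

```python
from collections import Counter

def certainty(votes):
    """votes: list of actions, one per tree in the forest. The certainty
    is the fraction of trees that agree with the majority action."""
    _, top = Counter(votes).most_common(1)[0]
    return top / len(votes)

print(certainty(["up", "up", "up", "left"]))        # 0.75: near-consensus
print(certainty(["up", "down", "left", "right"]))   # 0.25: maximal doubt
```

A controller that steers toward high-certainty regions can then prefer states where this value is close to 1, which is how the taxi in Fig. 5 is drawn toward passenger-adjacent cells.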

#### BMC: Scalability and success rate

We use the standard state-of-the-art SMT solver Z3 [12] to solve BMC. In Table II, we consider the following specifications for XAI: “Passenger  is collected first and at time , even though it is not closest”, where is the bound for BMC and for . We perform the following experiment times and average the results. We run BMC to collect traces. We depict the amortized running time of finding a trace, i.e., the total running time divided by . Recall that the traces witness the magic book. We count the number of traces out of the that also witness the wizard, and depict their ratio. We find both results encouraging: finding a dataset of non-trivial witness traces of the wizard is feasible.

#### Wizard-based BMC

We implemented a BMC procedure that simulates the wizard instead of the magic book and ran it using Z3. We observe extremely poor scalability: an extremely modest SMT query to find a short path timed out, and even when the initial state is fixed, the running time is measured in minutes!

#### BMC: Verification and Explainability

For verification, we consider the specifications “the taxi never hits the wall” and “the taxi never enters a loop in which no passenger is collected”. Even though violations of these specifications were not observed in numerous simulations, we find counterexamples for both (see a depiction for the second property in Fig. 3 on the right). We illustrate explainability with the property “the closest passenger is not collected first” by depicting an example trace for it in Fig. 3 on the left. In Fig. 4, we depict a decision tree, obtained from a dataset consisting of examples, as an attempt to explain when the wizard chooses to collect passenger  first, for .

### V-a Discussion

In this work, we address the controller-design problem using a combination of techniques from formal methods and machine learning. The challenge in this combination is that formal methods struggle with the use of neural networks (NNs). We bypass this difficulty using a novel procedure that, instead of reasoning about the NN that deep RL trains (the wizard), extracts from the wizard a small model that approximates its operation (the magic book). We illustrate the advantage of using the magic book by tackling problems that are out of reach for either formal methods or machine learning separately. Specifically, to the best of our knowledge, we are the first to incorporate a magic book in a reactive synthesis procedure, thereby synthesizing a stand-alone controller with performance considerations. In addition, we use a magic-book-based BMC procedure as an XAI tool to increase the trustworthiness of the wizard.

We list several directions for future work. We find it an interesting and important problem to extract magic books with provable guarantees (see Remark 3). Another line of future work is finding other domains in which magic books can be extracted and other applications for magic books. One concrete domain is speeding up solvers (e.g., SAT, SMT, QBF, etc.). Recently, there have been attempts at replacing traditional engineered heuristics with learned heuristics (e.g., [36, 37]). This approach was shown to be fruitful in [38], where an RL-based SAT solver performed fewer operations than a standard SAT solver. However, at runtime, the standard SAT solver has the upper hand since the bottleneck becomes the calls to the NN. We find it interesting to use a magic book instead of a NN in this domain, so that a solver would benefit from using a learned heuristic without paying the cost of a high runtime.

Our synthesis procedure is based on an abstraction of the plant. In the future, we plan to investigate an iterative refinement scheme for the abstraction. Refinement in our setting is not standard since it includes a quantitative game (e.g., [39]), and, more interestingly, there is inaccuracy introduced by the magic book and the wizard. Refinement can be applied both to the process of extracting the decision tree from the NN and to improving the performance of the wizard using training.

## References

• [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
• [2] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode reinforcement learning,” JMLR, vol. 6, no. Apr, pp. 503–556, 2005.
• [3] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
• [4] N. Frosst and G. Hinton, “Distilling a neural network into a soft decision tree,” 2017.
• [5] D. Shriver, D. Xu, S. Elbaum, and M. B. Dwyer, “Refactoring neural networks for verification,” arXiv preprint arXiv:1908.08026, 2019.
• [6] IEEE Access, vol. 6, pp. 52 138–52 160, 2018.
• [7] A. Pnueli and R. Rosner, “On the synthesis of a reactive module,” in Proc. 16th POPL, 1989, pp. 179–190.
• [8] R. Bloem, K. Chatterjee, T. A. Henzinger, and B. Jobstmann, “Better quality in synthesis through quantitative objectives,” in Proc. 21st CAV, 2009, pp. 140–156.
• [9] A. Bohy, V. Bruyère, E. Filiot, and J. Raskin, “Synthesis from LTL specifications with mean-payoff objectives,” in Proc. 19th TACAS, 2013, pp. 169–184.
• [10] S. Almagor, O. Kupferman, J. O. Ringert, and Y. Velner, “Quantitative assume guarantee synthesis,” in Proc. 29th CAV, 2017, pp. 353–374.
• [11] A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y. Zhu, “Bounded model checking,” Advances in Computers, vol. 58, pp. 117–148, 2003.
• [12] L. M. de Moura and N. Bjørner, “Z3: an efficient SMT solver,” in Proc. 14th TACAS 2008, ser. LNCS, vol. 4963.   Springer, 2008, pp. 337–340. [Online]. Available: https://doi.org/10.1007/978-3-540-78800-3_24
• [13] B. Könighofer, M. Alshiekh, R. Bloem, L. Humphrey, R. Könighofer, U. Topcu, and C. Wang, “Shield synthesis,” FMSD, vol. 51, no. 2, pp. 332–361, 2017.
• [14] G. Avni, R. Bloem, K. Chatterjee, T. A. Henzinger, B. Könighofer, and S. Pranger, “Run-time optimization for learned controllers through quantitative games,” in Proc. 31st CAV, 2019, pp. 630–649.
• [15] O. Bastani, Y. Pu, and A. Solar-Lezama, “Verifiable reinforcement learning via policy extraction,” in Proc. 31st NeurIPS, 2018, pp. 2499–2509.
• [16] J. Tornblom and S. Nadjm-Tehrani, “Formal verification of input-output mappings of tree ensembles,” CoRR, vol. abs/1905.04194, 2019, https://arxiv.org/abs/1905.04194.
• [17] Y. Kazak, C. W. Barrett, G. Katz, and M. Schapira, “Verifying deep-RL-driven systems,” in Proc. of NetAI@SIGCOMM, 2019, pp. 83–89.
• [18] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in Proc. 29th CAV, 2017, pp. 97–117.
• [19] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. T. Vechev, “AI2: safety and robustness certification of neural networks with abstract interpretation,” in Proc. 39th SP, 2018, pp. 3–18.
• [20] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in Proc. 29th CAV, 2017, pp. 3–29.
• [21] G. Einziger, M. Goldstein, Y. Sa’ar, and I. Segall, “Verifying robustness of gradient boosted models,” in Proc. 33rd AAAI, 2019, pp. 2446–2453.
• [22] S. Drews, A. Albarghouthi, and L. D’Antoni, “Proving data-poisoning robustness in decision trees,” CoRR, vol. abs/1912.00981, 2019. [Online]. Available: http://arxiv.org/abs/1912.00981
• [23] L. Valkov, D. Chaudhari, A. Srivastava, C. A. Sutton, and S. Chaudhuri, “HOUDINI: lifelong learning as program synthesis,” in Proc. 31st NeurIPS, 2018, pp. 8701–8712.
• [24] A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri, “Programmatically interpretable reinforcement learning,” in Proc. 35th ICML, 2018, pp. 5052–5061.
• [25] M. Jaeger, P. G. Jensen, K. G. Larsen, A. Legay, S. Sedwards, and J. H. Taankvist, “Teaching stratego to play ball: Optimal synthesis for continuous space MDPs,” in Proc. 17th ATVA, 2019, pp. 81–97.
• [26] J. Kretínský, G. A. Pérez, and J. Raskin, “Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints,” in Proc. 29th CONCUR, 2018, pp. 8:1–8:18.
• [27] M. Wen, R. Ehlers, and U. Topcu, “Correct-by-synthesis reinforcement learning with temporal logic constraints,” in Proc. IROS, 2015, pp. 4983–4990.
• [28] R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement learning is direct adaptive optimal control,” IEEE CSM, vol. 12, no. 2, pp. 19–22, 1992.
• [29] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
• [30] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD.   ACM, 2016, pp. 785–794.
• [31] L. D. Pyeatt and A. E. Howe, “Decision tree function approximation in reinforcement learning,” in Proc. 3rd International Symposium on Adaptive Systems, 2001, pp. 70–77.
• [32] R. Bloem, K. Chatterjee, and B. Jobstmann, “Graph games and reactive synthesis,” in Handbook of Model Checking, 2018, pp. 921–962.
• [33] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
• [34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
• [35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” JMLR, vol. 12, pp. 2825–2830, 2011.
• [36] M. Soos, R. Kulkarni, and K. S. Meel, “Crystalball: Gazing in the black box of SAT solving,” in Proc. 22nd SAT, 2019, pp. 371–387.
• [37] G. Lederman, M. N. Rabe, S. Seshia, and E. A. Lee, “Learning heuristics for quantified boolean formulas through reinforcement learning,” in Proc. 8th ICLR, 2020.
• [38] E. Yolcu and B. Póczos, “Learning local search heuristics for boolean satisfiability,” in Proc. 32nd NeurIPS, 2019, pp. 7990–8001.
• [39] G. Avni and O. Kupferman, “Making weighted containment feasible: A heuristic based on simulation and abstraction,” in Proc. 23rd CONCUR, 2012, pp. 84–99.