1 Introduction
One notable feature of human learners is that we are able to carry out counterfactual reasoning over unrealized events. That is, we contemplate potential answers to questions of the form, “What would I do in situations $s_1$, $s_2$, and $s_3$?” A related, and perhaps more pertinent, form of question is, “Are there situations $s_1$, $s_2$, and $s_3$, such that in these situations I would select actions $a_1$, $a_2$, and $a_3$?” In this case, the actions $a_1$, $a_2$, etc., might be actions that are likely to result in particularly good or bad outcomes, and answers $s_1$, $s_2$, etc., can be useful, especially when they are unexpected, since they reveal potential failures of robustness (in the case of bad examples) or potential strengths (in the case of good examples). In this paper, we describe a novel approach to answering and utilizing the answers to questions of this form when asked not of a human agent, but of a reinforcement learning agent. Our approach is not based solely on the deployment of techniques from the typical machine learning toolbox, as we make crucial use of SMT-solving, which is more familiar to researchers in the field of formal methods. In the theoretical development, we capture our use of SMT-solving technology via the abstraction of what we are calling
introspection oracles: oracles that may give us direct access to sets of (state, action) pairs satisfying fixed constraints with respect to the policy network. By querying the oracle during training it is possible to generate (state, action) pairs capturing failures/strengths of the agent with respect to properties of interest. For instance, if there are certain “obviously wrong” actions that the agent should never take (e.g., selecting a steering angle that would cause the automobile controlled by the policy network to drive off the road when there are no obstacles or other dangers present), we query the oracle as to whether there exist states in which the agent would select such actions. Our algorithm then uses this data in training so as to improve the safety of the agent, without requiring that such potentially dangerous or costly situations be encountered in real life. It is true that such (state, action) pairs are potentially discoverable in simulation/testing, but when the set of such pairs is known beforehand we save time and improve policy robustness by generating them analytically.
In this paper, we introduce a new algorithm for reinforcement learning, which we call the Introspection Learning Algorithm, that exploits introspection oracles to improve the training and robustness of reinforcement learning (RL) agents versus baseline training algorithms. This algorithm involves modifying the underlying MDP structure and we derive theoretical results that justify these modifications. Finally, we discuss several experimental results that clearly showcase the benefits to both performance and robustness of this approach. In particular, in the case of robustness, we evaluated our results by querying the weights after training to determine numbers of Sat (examples found), Unsat (examples mathematically impossible) and Timeout (ran out of time to find or refute existence of examples) results.
The paper is organized as follows. In Section 2 we introduce the mathematical abstraction of introspection oracles and discuss briefly their embodiment as SMT-solvers. Section 3 details our Introspection Learning Algorithm. Finally, Section 4 captures our empirical results. The Appendix (Section 6) includes the proof of a basic result that justifies the modification of MDPs made in our algorithms.
Related work
Previously, Linear Programming, which is itself a constraint solving technique, has been employed in reinforcement learning to constrain the exploration space for the agent’s policy to improve both the speed of convergence and the quality of the policy converged to [Amos2017] or as a replacement for more traditional Dynamic Programming methods in Q-Learning to solve for equilibrium policies in zero-sum multi-agent Markov game MDPs [Littman, Greenwald]. Previous work has also been done on incorporating Quadratic Program solvers to restrict agent exploration to “safe” trajectories by constraining the output of a neural network policy [Pham2017, Amos2017]. Introspection Learning is fundamentally different from these approaches: rather than restricting the action space or replacing our Q function, we shape our agents in policy space by asking our policy for state batches where it would satisfy stated constraints, without needing the agent to actually experience these states.
Exciting recent work on verification of neural networks (e.g., [Reluplex, Lomuscio]) is closely related to the work described here. In addition to the similarity of the techniques, we are indeed capturing verification results as a robustness measure (see below). One practical distinction is that we are using the dReal solver [dReal], which is able to handle networks with general nonlinear activations, but as a tradeoff (not made in other SMT-solvers) admits the possibility of “false-positive” satisfiable instances. In principle, our algorithm can be used with any compatible combination of SMT-solvers and neural network architectures.
2 Introspection Oracles
In order to set the appropriate theoretical stage, we will first introduce some notation and terminology.
Definition 1.
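A pre-Markov decision process (pre-MDP) $\mathcal{M}$ consists of a set $S$ of states, a set $A$ of actions, and transition probabilities $P(s' \mid s, a)$ in $[0, 1]$ for $s, s' \in S$ and $a \in A$ such that $\sum_{s' \in S} P(s' \mid s, a) = 1$. Intuitively, the value $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ on taking action $a$.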
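To make Definition 1 concrete, the following is a minimal sketch for finite state and action sets, assuming the transition probabilities are stored as a nested dictionary. The names `PreMDP`, `is_valid`, and `two_state` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PreMDP:
    states: list
    actions: list
    # transitions[(s, a)] maps each successor state s2 to P(s2 | s, a)
    transitions: dict

    def is_valid(self, tol=1e-9):
        """Check that every (state, action) row is a probability distribution."""
        for s in self.states:
            for a in self.actions:
                row = self.transitions.get((s, a), {})
                if any(p < 0.0 or p > 1.0 for p in row.values()):
                    return False
                if abs(sum(row.values()) - 1.0) > tol:
                    return False
        return True


# A hypothetical two-state example.
two_state = PreMDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s0": 1.0},
    },
)
```

Calling `two_state.is_valid()` checks the normalization condition row by row; a row whose probabilities do not sum to one makes the check fail.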
Definition 2.
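Given a pre-Markov decision process (pre-MDP) $\mathcal{M}$, a policy $\pi$ for $\mathcal{M}$ assigns to each state $s \in S$ a probability distribution $\pi(\cdot \mid s)$ over the set $A$. A pre-MDP is called an MDP\R in, e.g., [Abbeel].
Often we are concerned with cases where $A$ is finite and the policies under consideration are deterministic in the sense that, for each state $s$, $\pi(a \mid s) = 0$ for all but a single element $a$ of $A$. When $\pi(a \mid s) = 1$ we write $\pi(s) = a$. Given a pre-MDP $\mathcal{M}$, we denote by $\Pi(\mathcal{M})$ the set of all policies for $\mathcal{M}$.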
Definition 3.
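A Markov decision process (MDP) $\mu$ consists of a pre-MDP $\mathcal{M}$ together with a reward function $R : S \times A \to \mathbb{R}$ which is bounded, a subset $T \subseteq S \times A$ of terminal (state, action) pairs, and a new state $s_\infty$ not in $S$ such that:

For any $(s, a) \in T$, $P(s_\infty \mid s, a) = 1$;

For any $a \in A$, $P(s_\infty \mid s_\infty, a) = 1$; and

For any $a \in A$, $R(s_\infty, a) = 0$.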
One nonstandard feature of Definition 3 is that we consider terminal pairs rather than terminal states. This will be technically useful below. We also follow [SuttonBarto] in that the provision of terminal pairs modifies the pre-MDP structure by adding a dummy stable state $s_\infty$ to which all terminal pairs canonically transition, such that subsequent transitions from $s_\infty$ have no reward. This is a technical convenience which streamlines some of the theory.
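The effect of terminal pairs on the transition structure can be sketched as follows. This is an illustration under assumed data structures (dictionaries keyed by (state, action) pairs); the function name `add_absorbing_state` and the sink label `"s_inf"` are hypothetical.

```python
def add_absorbing_state(transitions, rewards, terminal_pairs, actions, sink="s_inf"):
    """Extend a finite MDP with a stable, zero-reward dummy state.

    Every terminal (state, action) pair is redirected to the sink with
    probability one, and every action taken in the sink stays there.
    The input dictionaries are left unmodified.
    """
    extended_t = {key: dict(row) for key, row in transitions.items()}
    extended_r = dict(rewards)
    for pair in terminal_pairs:
        extended_t[pair] = {sink: 1.0}
    for a in actions:
        extended_t[(sink, a)] = {sink: 1.0}
        extended_r[(sink, a)] = 0.0
    return extended_t, extended_r


# Hypothetical one-state example in which ("edge", "step") is terminal.
transitions = {("edge", "step"): {"edge": 1.0}, ("edge", "wait"): {"edge": 1.0}}
rewards = {("edge", "step"): -100.0, ("edge", "wait"): -1.0}
ext_t, ext_r = add_absorbing_state(
    transitions, rewards, [("edge", "step")], ["step", "wait"]
)
```

Because termination is attached to pairs rather than states, the non-terminal pair `("edge", "wait")` keeps its original transition even though it shares a state with a terminal pair.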
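We denote by $\mathrm{MDP}(\mathcal{M})$ the set of all Markov decision processes over the pre-MDP $\mathcal{M}$ and by $\Pi(\mathcal{M})$ the set of all policies over $\mathcal{M}$. Given an MDP $\mu$ in $\mathrm{MDP}(\mathcal{M})$, we denote by $\Pi^*(\mu)$ the subset of $\Pi(\mathcal{M})$ consisting of those policies that are optimal for $\mu$. In broad strokes, inverse reinforcement learning [Ng] is concerned with, given a policy $\pi$ in $\Pi(\mathcal{M})$ (or, more often, a set of its trajectories), determining an element $\mu$ of $\mathrm{MDP}(\mathcal{M})$ such that $\pi$ is in $\Pi^*(\mu)$. We are concerned with a closely related problem.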
One difference between our approach and that of inverse reinforcement learning is that instead of assuming access to a target policy or its trajectories, we assume that we have access to certain properties that target policies ought to have. In the simplest case, such a property is given by a subset $U$ of the set $S \times A$ of (state, action) pairs. [Footnote 1: In the more general case, the relevant properties should be (nonempty) subsets of the space of finite sequences of (state, action) pairs that are compatible with the underlying transition probabilities of $\mathcal{M}$. In this paper, we restrict attention to the more elementary notion.] We refer to policies with the required properties as good policies. There is considerable flexibility in the notion of goodness here, but in many cases it will be associated with safety and robustness. E.g., a good policy for driving a car would not make unexpected sharp turns when the road ahead is straight and clear of obstacles. Much of our focus is on these kinds of examples, but it is worth emphasizing that goodness could instead be associated with performance rather than safety.
In order to make the problem tractable, it is necessary to restrict to sufficiently well-behaved subsets of $S \times A$. For us, the well-behaved subsets are those definable in the first-order theory of real arithmetic with common nonlinear function symbols (e.g., $\sin$, $\cos$, $\exp$, $\tanh$, etc.). [Footnote 2: In the experimental results captured in this paper, we restricted further to semialgebraic subsets, i.e., those describable as finite unions of sets defined by finitely many polynomial equations and inequations.] Denote by $\mathcal{D}(S \times A)$ the set of all such subsets of $S \times A$. With this notation in place, we arrive at the definition of introspection oracle.
Definition 4.
Given a policy $\pi$ in $\Pi(\mathcal{M})$, an introspection oracle for $\pi$ is a map $\Omega_\pi : \mathcal{D}(S \times A) \to (S \times A) \cup \{\bot\}$ such that if $\Omega_\pi(U) = (s, a) \neq \bot$, then $\pi(s) = a$ and $(s, a)$ is in $U$. An introspection oracle is nontrivial when there exists $U$ in $\mathcal{D}(S \times A)$ such that $\Omega_\pi(U) \neq \bot$.
Intuitively, an introspection oracle for $\pi$ attempts to answer questions of the form: “Are there inputs that give rise via $\pi$ to a (state, action) pair with property $U$?” Here $\bot$ is an error signal which can be provided with several possible semantics. Here it is best understood as indicating that the oracle was unable to find an element of $U$ in a reasonable amount of time.
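The interface of an introspection oracle can be illustrated with a simple stand-in. The sketch below is not an SMT-solver: it assumes a deterministic policy given as a Python callable and searches a box of states by random sampling, returning `None` in the role of the error signal. The names `sampling_oracle`, `toy_policy`, and `far_right_and_left` are hypothetical.

```python
import random


def sampling_oracle(policy, in_property, box, max_samples=10_000, seed=0):
    """Look for a pair (s, policy(s)) that satisfies the given property.

    `box` is a list of (low, high) bounds, one per state dimension.
    Returns the pair if one is found within the sample budget, else None.
    Unlike an SMT-solver, a None result here proves nothing about
    whether a satisfying pair exists.
    """
    rng = random.Random(seed)
    for _ in range(max_samples):
        state = tuple(rng.uniform(lo, hi) for lo, hi in box)
        action = policy(state)
        if in_property(state, action):
            return state, action
    return None


def toy_policy(state):
    # Hypothetical one-dimensional policy.
    return "left" if state[0] > 0.0 else "right"


def far_right_and_left(state, action):
    # Property U: the state lies beyond 0.5 and the action is "left".
    return state[0] > 0.5 and action == "left"


witness = sampling_oracle(toy_policy, far_right_and_left, box=[(-1.0, 1.0)])
```

Any pair returned this way satisfies both conditions of Definition 4 by construction, since the action is computed by the policy itself before the property is checked.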
Before turning to describe our use of introspection oracles in reinforcement learning, we observe that nontrivial introspection oracles do indeed exist:
Observation 1.
For policy functions definable in the language of first-order real arithmetic with nonlinear function symbols ($\sin$, $\cos$, $\exp$, $\tanh$, etc.) there exist nontrivial introspection oracles.
The existence of such introspection oracles which are moreover practically useful, in the sense of returning outputs in a wide range of feasible cases, is guaranteed by the decision procedure of Gao, Avigad and Clarke [Gao], which is implemented in the dReal nonlinear SMT-solver. The novelty of dReal is that it overcomes the undecidability of real arithmetic with nonlinear function symbols by accepting a compromise: whereas unsatisfiable (Unsat) results are genuine, satisfiable (Sat) results may be false-positives. Note that, unlike in many of the other applications of SMT-solving to verification of neural networks such as [Reluplex, Lomuscio], dReal is able to handle all common nonlinear activations. In terms of our abstraction, spurious Sat results, which are easily detected by a forward pass of the network, can be regarded as instances where $\Omega_\pi(U) = \bot$.
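Detecting a spurious Sat result by a forward pass can be sketched as below. The example assumes the solver hands back a candidate state and that the policy is available as an ordinary Python callable; `confirm_candidate` and the toy policy are hypothetical names, and no solver API is used.

```python
def confirm_candidate(policy, in_property, candidate_state):
    """Re-evaluate a solver-proposed state with an exact forward pass.

    Returns the confirmed (state, action) pair, or None when the
    candidate does not actually satisfy the property, in which case it
    is handled like the oracle's error signal.
    """
    action = policy(candidate_state)
    if in_property(candidate_state, action):
        return candidate_state, action
    return None


def threshold_policy(state):
    # Hypothetical policy: action 1 for positive inputs, else action 0.
    return 1 if state[0] > 0.0 else 0


def picks_action_one(state, action):
    return action == 1


genuine = confirm_candidate(threshold_policy, picks_action_one, (0.25,))
# A candidate just on the wrong side of the decision boundary, as might
# arise when a solver only satisfies a slightly relaxed constraint.
spurious = confirm_candidate(threshold_policy, picks_action_one, (-1e-4,))
```

Only confirmed pairs would then be passed on to training; rejected candidates are simply dropped.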
3 The Introspection Learning Algorithm
We now describe the Introspection Learning Algorithm in detail, starting with its inputs. First, this algorithm assumes given an off-policy reinforcement learning algorithm (OPRL) and corresponding policy function $\pi$. It is furthermore assumed that $\pi$ is describable in the language of real arithmetic with nonlinear function symbols.
Additionally assume given a family of subsets $U_1, \ldots, U_n$ in $\mathcal{D}(S \times A)$, which will be used when we query the oracle $\Omega_\pi$. Having a sufficiently rich family will provide a mechanism for generating more useful examples and the design of these properties is one of the main engineering challenges involved in utilizing the algorithm effectively. Pairs obtained from the oracle as $(s, a) = \Omega_\pi(U_i)$ are added to the OPRL agent’s replay buffer.
Finally, we assume given a schedule determining when during training to perform queries and updates. For simplicity in describing the algorithm we assume that the schedule is controlled by two factors. First, a simple set $I$ of training indices. Second, a bound $b$ on moving average reward such that once moving average reward is greater than or equal to $b$ we no longer perform queries or updates on gathered examples.
In summary, given the aforementioned inputs, the Introspection Learning Algorithm (Algorithm 1) proceeds by training as usual according to the OPRL except that, when episode indices in $I$ are arrived at and the moving average reward remains below $b$, the oracle is queried with the specified family $U_1, \ldots, U_n$, examples are gathered (when possible) and inserted into the replay buffer as terminal.
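The control flow summarized above can be sketched in a few lines. This is an outline under stated assumptions rather than the authors' implementation: `run_episode` stands for one episode of ordinary off-policy training and returns that episode's reward, `oracle` returns a (state, action) pair or `None`, and the replay tuple layout `(state, action, reward, next_state, done)` and default penalty are illustrative choices.

```python
from collections import deque


def introspection_training(run_episode, oracle, properties, query_episodes,
                           reward_bound, num_episodes, penalty=-100.0, window=100):
    """Train as usual, inserting oracle examples on the query schedule."""
    replay_buffer = []
    recent_rewards = deque(maxlen=window)
    for episode in range(num_episodes):
        moving_avg = (sum(recent_rewards) / len(recent_rewards)
                      if recent_rewards else float("-inf"))
        # Query only at scheduled episodes while the moving average
        # reward remains below the bound.
        if episode in query_episodes and moving_avg < reward_bound:
            for prop in properties:
                found = oracle(prop)
                if found is not None:
                    state, action = found
                    # Stored as a terminal transition with a fixed reward.
                    replay_buffer.append((state, action, penalty, None, True))
        recent_rewards.append(run_episode(replay_buffer))
    return replay_buffer
```

In a real training loop `run_episode` would also append the agent's own experience to the buffer and sample minibatches from it; here it is left abstract so that only the scheduling logic is visible.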
Mathematically, this algorithm effectively produces a modified MDP structure by altering the terminal pairs and the reward structure. In the Appendix (Section 6), we show (Theorem 1) that, under reasonable hypotheses, the sets of optimal policies for the original MDP and the modified MDP coincide.
There are several parameters and variations of this algorithm possible, of which we now mention a few. First, in some cases it may be necessary or useful to post-process the gathered state batches (e.g., to ensure sufficient balance/symmetry properties). Here consideration should be paid to the bias introduced by state batches which are in one sense “on policy” (if the agent were in a state returned by the SMT-solver it would have taken the specified action with high probability), but are not guaranteed to be “on trajectory” as we have no guarantee the state would be reachable by policy $\pi$. In practice, we have found such processing to be unnecessary provided that suitable $U_i$ are selected and a reasonable schedule is followed.
In addition to varying the schedule, it is also possible to consider a range of options for the behavior of the replay buffer and how to train on the examples contained therein. We have found it to usually be sufficient to train on these as terminal states with high-negative or high-positive reward; however, other approaches can also be considered. It should be noted that treating these states as terminal will alter the optimal policy, which may or may not be desired. Alternatively, one could query the training environment with the state batches and specified actions to recover the reward signal and next state from the environment in order to reduce the change in the optimal policy. Our intention was to take a naïve approach as we are interested in applications where acquiring experience is potentially risky or expensive.
4 Experimental Environments and Results
Our experiments were conducted with the Double Deep Q Network algorithm (DDQN) [VanHasselt2015] with Prioritized Experience Replay [Schaul2015] on the OpenAI Gym “Lunar Lander” environment [OpenAI], the OpenAI Gym “Cliff Walk” environment [OpenAI] and the DeepMind AI Safety Gridworld “Absent Supervisor” environment [SafetyGridworlds]. Prioritized Experience Replay augments the selection of experience tuples from the DDQN replay buffer by preferentially selecting experience with high TD error and simultaneously correcting for the bias this introduces by scaling each sampled tuple's loss in the neural network update with an importance-sampling weight that shrinks as the tuple's sampling probability grows.
In the “Lunar Lander” environment the objective is to safely land a spacecraft on the surface of the moon by choosing among four discrete actions: doing nothing or firing one of its three engines. The state space is eight-dimensional with six continuous variables representing location in two-dimensional Cartesian coordinates, linear velocity, angle and angular velocity, and two boolean variables indicating whether or not contact is being made with the ground by each of the lander’s two legs. The reward signal positively reinforces movement toward the landing pad, with a bonus for making leg contact with the ground. Negative reward is given for moving away from the landing pad or losing contact with the ground. The environment is considered solved when the agent achieves a 100 episode moving average reward of at least 200.
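The two ingredients of proportional Prioritized Experience Replay can be written out directly. The sketch below follows the standard formulation of [Schaul2015], with priorities $p_i = |\delta_i| + \epsilon$, sampling probabilities $P(i) = p_i^\alpha / \sum_k p_k^\alpha$, and weights $w_i = (N \cdot P(i))^{-\beta}$ normalized by their maximum. The function names and the small constant `eps` are illustrative.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to |TD error| ** alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.6):
    """Bias-correcting loss weights, normalized so the largest is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    top = max(raw)
    return [w / top for w in raw]


probs = per_probabilities([0.1, 2.0, 0.5])
weights = importance_weights(probs)
```

The tuple with the largest TD error is sampled most often and, to compensate, receives the smallest loss weight.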
In the “Cliff Walk” gridworld environment (Figure 1) the objective is to reach the goal state while avoiding the row of terminal “cliff” states along the bottom edge by controlling four discrete actions: up, down, left, right. The state is encoded as a binary vector. The environment provides the agent a reward of $-1$ at each step and a reward of $-100$ for entering the cliff. The goal provides no reward and terminates the episode. In our experimentation the environment was considered solved when the agent achieved a 100 episode moving average reward of at least $-30$.
In the “Absent Supervisor” gridworld environment (Figure 2) the objective is to reach the goal state by controlling four discrete actions: up, down, left, right. The four center squares are impassable. For each episode a supervisor is absent or present with uniform probability. The state is encoded as a binary vector. The environment provides the agent a reward of $-1$ at each time step and a reward of $+50$ for entering the goal. When the supervisor is present, the orange state highlighted in Figure 2, located immediately above the goal state, provides a large negative reward ($-30$); no such reward is given when the supervisor is absent. We would like the agent to never pass through the orange punishment state. The intent of the environment is to demonstrate that, when provided the opportunity to cheat by passing through the orange state when the supervisor is absent, traditional deep reinforcement learning algorithms will do so.
In each case, the policy considered was a neural network with two hidden layers each having 32 nodes and hyperbolic tangent activations. The output activation was linear with one node for each action. DDQN with soft target network updates [Lillicrap2015], the proportional variant of Prioritized Experience Replay [Schaul2015], and an $\epsilon$-greedy exploration policy were employed to train the agent with the hyperparameters summarized in Table 1.
Table 1
Hyperparameter  Value
Experience replay interval (timesteps)  2
Replay buffer size  1e5
Batch size  64
$\gamma$ (Discount factor)  0.99
$\alpha$ (Learning rate)  1e-3
$\tau$ (Soft target network update rate)  1e-2
PER $\alpha$ (TD error prioritization)  0.6
PER $\beta$ (Bias correction)  0.6
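The network shape and the soft target update described here can be sketched in plain Python. The layer sizes `[8, 32, 32, 4]` match the Lunar Lander description (eight state variables, four actions); the uniform weight initialization and all function names are assumptions made for illustration only.

```python
import math
import random


def init_network(sizes, seed=0):
    """Create (weights, biases) per layer with a simple uniform init."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def q_values(layers, state):
    """Forward pass: tanh on hidden layers, linear output layer."""
    x = list(state)
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def soft_update(target, online, tau=1e-2):
    """Return target parameters moved a fraction tau toward the online ones."""
    updated = []
    for (t_w, t_b), (o_w, o_b) in zip(target, online):
        new_w = [[(1 - tau) * t + tau * o for t, o in zip(t_row, o_row)]
                 for t_row, o_row in zip(t_w, o_w)]
        new_b = [(1 - tau) * t + tau * o for t, o in zip(t_b, o_b)]
        updated.append((new_w, new_b))
    return updated


online = init_network([8, 32, 32, 4], seed=1)
target = init_network([8, 32, 32, 4], seed=2)
target = soft_update(target, online)
```

Because the network uses only affine maps and `tanh`, the same computation can be expressed as a formula over real arithmetic with nonlinear function symbols, which is what makes it amenable to the oracle queries of Section 2.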
In the “Lunar Lander” environment, the Introspection Learning parameters were set as follows. For the query schedule, we determine at what interval batches will be searched for and when searching for batches will cease and training will proceed as normal. We experimented with solving for state batches at a predetermined interval (every 100 episodes) and ceasing when the 100 episode moving average reward crossed a predetermined threshold. For training on state batches, states found were treated as terminal states with high negative reward ($-100$) as determined by the rules of the environment for terminal states. We have generally found that incorporating the state batches into the replay buffer is beneficial early in the learning process when the policy is poor, as it introduces bias (cf. [Schaul2015]).
The query constraints in both cases were to look for states whose coordinates were outside of the landing zone (to either side of it), such that the agent favors selecting an action that would result in it moving further away from the landing zone. [Footnote 3: Note that alternative choices of query constraints are also possible including, e.g., querying for those states that move the agent in the correct direction, which could be given extra reward. Our approach here is based on trying to minimize the number of obviously risky actions the agent is likely to carry out during training, while allowing the agent freedom to explore reasonable actions.]
This region of the state space was divided into boxes using a simple quantization scheme that ignored regions of state space where examples satisfying the query constraints would be impossible to find. In general, such quantization schemes should be sufficiently fine-grained to allow generation of many and diverse examples. Twenty training runs with a set of twenty random seeds were run with and without our approach for a maximum of 500,000 timesteps. Results averaged over the training runs are summarized in Figure 3. DDQN with Introspection Learning solved the environment in a mean of 893 episodes while DDQN without Introspection Learning (baseline) failed to successfully solve the environment on average within 500,000 timesteps.
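A quantization of this kind can be sketched as a uniform grid over an axis-aligned region. The paper does not give its exact scheme, so the uniform split, the function name `quantize`, and the example bounds below are all assumptions for illustration.

```python
from itertools import product


def quantize(bounds, cells_per_dim):
    """Split an axis-aligned region into equally sized boxes.

    `bounds` is a list of (low, high) pairs, one per dimension, and
    `cells_per_dim` gives the number of cells along each dimension.
    Each box is returned as a list of (low, high) pairs.
    """
    axes = []
    for (lo, hi), n in zip(bounds, cells_per_dim):
        step = (hi - lo) / n
        axes.append([(lo + i * step, lo + (i + 1) * step) for i in range(n)])
    return [list(box) for box in product(*axes)]


# Hypothetical two-dimensional region split into 2 x 4 boxes.
boxes = quantize([(0.0, 1.0), (0.0, 2.0)], [2, 4])
```

Each box can then serve as the state constraint of one oracle query, so a finer grid yields more queries and potentially more diverse examples, at the cost of more solver calls.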
In addition to observing performance benefits, we also evaluated the agents trained with Introspection Learning for robustness benefits. In particular, we periodically stored the weights of both the Introspection Learning agent and the baseline agent during training for each of the twenty runs. We then recorded, for different regions of state space, statistics regarding the Sat, Unsat and Timeout results obtained when querying the SMT-solver on these agents across training. To recall, in this case, a Sat result indicates that there exists a state in the specified region $U$ of state space such that an undesirable action (in this case, moving away from the landing zone) is selected by the agent. Likewise, an Unsat result indicates that there is a mathematical proof that there exists no state $s$ in $U$ such that $\pi(s)$ is undesirable. We gathered Sat, Unsat and Timeout data across a number of different selections of $U$. Tables 2 and 3 record the percentages of each kind of result across all twenty test runs that were captured at four points during training. The selection of $U$ queried here was a subset of the subsets of (state, action) space queried during the actual Introspection Learning training and the results show a clear improvement of robustness over the baseline. Timeouts were set to five seconds during training and to ten seconds during evaluation. One interesting point that we noticed in analyzing the robustness evaluation data is that larger numbers of Unsat results for the Introspection Learning agents were obtained at the beginning of training than at the end. This is illustrated, for a typical example (the run with ID number 480951), in Figure 4. This is likely due to the schedule employed as part of the Introspection Learning Algorithm and highlights the more general fact that reinforcement learning agents are sometimes subject to “forgetting” important learned behavior at later stages of training.
Since the agents at the end of training were typically very good at solving the task, the regions of state space in which this forgetfulness would manifest itself were likely off-trajectory (i.e., unreachable by the current policy).
In order to emphasize that this improvement is very much a function of the specific $U$ used during training, and tested at evaluation time, we include for comparison in Table 4 the average percentages for an alternative selection of $U$ used at evaluation time. Here the improvements are more modest.
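Turning a list of solver outcomes into the kind of percentages reported in the tables is a simple tally. The sketch below assumes outcomes are recorded as the strings `"Sat"`, `"Unsat"`, and `"Timeout"`; the function name and the example data are hypothetical.

```python
from collections import Counter


def result_percentages(results):
    """Percentage of Unsat, Sat and Timeout outcomes in a list of results."""
    counts = Counter(results)
    total = len(results)
    return {kind: 100.0 * counts[kind] / total
            for kind in ("Unsat", "Sat", "Timeout")}


# Hypothetical run: eight queries, five satisfiable and three timed out.
summary = result_percentages(["Sat"] * 5 + ["Timeout"] * 3)
```

For this hypothetical run the tally gives 0% Unsat, 62.5% Sat and 37.5% Timeout.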
Table 2
Run ID  Unsat  Sat  Timeout

34001  0%  62.5%  37.5% 
390797  0%  100%  0% 
747524  0%  75%  25% 
480621  25%  50%  25% 
475982  50%  25%  25% 
319324  25%  62.5%  12.5% 
449374  0%  50%  50% 
491386  0%  50%  50% 
532333  0%  50%  50% 
55487  0%  75%  25% 
4211  0%  50%  50% 
480951  0%  100%  0% 
219015  0%  87.5%  12.5% 
481614  0%  75%  25% 
367249  25%  50%  25% 
508732  0%  100%  0% 
521233  0%  50%  50% 
543696  0%  75%  25% 
998982  0%  100%  0% 
36067  0%  75%  25% 
Average  6.250%  68.125%  25.625% 
Table 3
Run ID  Unsat  Sat  Timeout

34001  50%  25%  25% 
390797  25%  25%  50% 
747524  0%  50%  50% 
480621  25%  0%  75% 
475982  25%  50%  25% 
319324  0%  75%  25% 
449374  0%  50%  50% 
491386  0%  45.8333%  54.1667% 
532333  25%  25%  50% 
55487  25%  50%  25% 
4211  50%  25%  25% 
480951  0%  75%  25% 
219015  25%  37.50%  37.50% 
481614  25%  75%  0% 
367249  25%  75%  0% 
508732  0%  75%  25% 
521233  50%  50%  0% 
543696  79.1667%  0%  20.8333% 
998982  25%  33.3333%  41.6667% 
36067  25%  0%  75% 
Average  23.958%  42.083%  33.958% 
Table 4
Agent  Unsat  Sat  Timeout

Baseline  83.3%  1.4%  15.3% 
Introspection  85.3%  0.6%  14.2% 
In the “Absent Supervisor” environment the Introspection Learning parameters were set as follows. Solving for state batches is unnecessary as in this discrete state environment we are only concerned with the agent choosing to enter the orange punishment state from the state directly above it. For the query schedule, solving for this specific behavior is performed at every timestep and during training this transition is treated as terminal with high negative reward ($-100$). Results for DDQN with and without Introspection Learning are provided in Figures 5 and 6 respectively. One interesting point about the “Absent Supervisor” environment is that, for the evident notion of good policy, one of the hypotheses (the “Strong Compatibility” assumption) of our Theorem 1 is violated.
In the “Cliff Walk” environment the Introspection Learning parameters were set as follows. Solving for state batches is unnecessary as in this discrete state environment we are only concerned with the agent choosing to enter the cliff states, which can only be done from the state directly above each cliff state respectively. For the query schedule, solving for these specific behaviors is performed at every timestep and during training this transition is treated as terminal with high negative reward ($-100$). It should be noted that in this particular case the environment already treats these transitions as terminal with high negative reward ($-100$) and thus Introspection Learning will not alter the optimal policy (in particular, the hypotheses of Theorem 1 are satisfied). In this experiment, five training runs with a set of five random seeds were run with and without our approach until the environment was solved. During training, at each timestep, a running count was kept of the number of states from which the agent would choose to enter the cliff states (which we call “lemming”). During training the policies were found to lemming on average 112 times with Introspection Learning and 29,501 times without. It was experimentally found that an agent with Introspection Learning would rarely learn a policy during training that would enter the cliff after the first training episode, while this was routine for an agent without Introspection Learning. Representative policies learned by DDQN with and without Introspection Learning after 30 training episodes are provided in Figures 7 and 8 respectively. Additionally, agents with Introspection Learning enjoyed a small performance benefit, solving the environment in 208 episodes on average over the five training runs, while agents without Introspection Learning averaged 229 episodes to solve the environment.
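The running “lemming” count can be illustrated by checking, for each state directly above the cliff, whether the current greedy policy chooses the action that steps into it. The grid coordinates (row 2, columns 1 to 10, as in a 4 x 12 layout) and the function and policy names below are assumptions for illustration; the paper encodes states as binary vectors.

```python
def count_lemming_states(policy, states_above_cliff, into_cliff_action="down"):
    """Count states from which the policy would step into the cliff."""
    return sum(1 for state in states_above_cliff
               if policy(state) == into_cliff_action)


# Hypothetical layout: the cells directly above the cliff row.
states_above_cliff = [(2, col) for col in range(1, 11)]


def cautious_policy(state):
    return "right"


def reckless_policy(state):
    # Steps off the edge from every even-numbered column.
    return "down" if state[1] % 2 == 0 else "right"


cautious_count = count_lemming_states(cautious_policy, states_above_cliff)
reckless_count = count_lemming_states(reckless_policy, states_above_cliff)
```

Summing such per-timestep counts over a training run gives a figure of the same kind as the averages reported above, under the assumed state encoding.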
5 Conclusions
In this paper we have introduced a novel reinforcement learning algorithm based on ideas coming from formal methods and SMT-solving. We have shown that, on suitable problems, these techniques can be employed in order to improve robustness of RL agents and to speed up their training. We have also given examples of how SMT-solving can be used to analyze reinforcement learning agent robustness. There are a number of extensions of this preliminary work possible. We mention several prominent directions here.
First, the focus here has been on single-step analysis of agent behavior, but a reachability analysis approach focused on trajectories leading to target states would likely generate more relevant data for learning. E.g., consider a geofenced space that we do not want the agent to enter and that is reachable through many different (state, action) combinations. Once a violation occurs, we would like to examine the trajectory in order to learn what earlier choices led the agent there.
Second, whereas in our “lunar lander” experiments we utilized an ad hoc quantization of the state space, it should be in many cases possible to learn such regions as part of the algorithm. This is a hard search problem so relying on these parameterizations is necessary and should therefore be automated. In conjunction with the reachability analysis mentioned above, this approach is likely to give more targeted and therefore useful data to include in the replay buffer.
Finally, while the SMT-solving technology being used is sufficient for low-dimensional state spaces, these techniques face scalability issues on large state spaces such as those coming from video data. How to handle these higher-dimensional state spaces in a similar way is one of the exciting challenges in this area.
Acknowledgments
We would like to thank Ramesh S, Doug Stuart, Huafeng Yu, Sicun Gao, Aleksey Nogin, and Pape Sylla for useful conversations on topics related to this paper. We are also grateful to Tom Bui, Bala Chidambaram, Cem Saraydar, Roy Matic, Mike Daily and Son Dao for their support of and guidance regarding this research. Finally, we would like to thank Alessio Lomuscio and Clark Barrett for their interest in this work and for encouraging us to capture these results in a paper.
6 Appendix: Theoretical Results
Fix a pre-MDP $\mathcal{M}$ and assume given a (nonempty) subset $G$ of $\Pi(\mathcal{M})$ which we regard as the good policies: those policies which have the properties of interest.
Definition 5.
MDPs and are equivalent whenever and .
Furthermore, throughout this section we assume given a fixed MDP $\mu$ in $\mathrm{MDP}(\mathcal{M})$. Additionally, assume given a fixed discount factor $\gamma$. We also adopt throughout this section two further hypotheses, which we now describe.
Assumption 1 (Bad Set).
There exists a subset $B \subseteq S \times A$ such that $\pi$ is in $G$ if and only if, for all $(s, a) \in B$, $\pi(a \mid s) = 0$.
Our next hypothesis guarantees that the reward structure is already sufficiently compatible with $G$.
Assumption 2 (Strong Compatibility).
All optimal policies for $\mu$ are in $G$. I.e., $\Pi^*(\mu) \subseteq G$.
We define a new MDP structure in by
It is straightforward to prove that is bounded since is. Note that we are also modifying the underlying preMDP here by now imposing the condition that .
An immediate proof of the following proposition can be obtained using the notion of bounded corecursive algebra from [Moss], where it is shown that the statevalue functions are canonically determined by the generating maps given by
where is the probability distribution monad.
Proposition 1.
If is in , then .
Proof.
It suffices to show that , which is trivial for in . ∎
Corollary 1.
If is in , then if and only if .
Lemma 1.
.
Proof.
Suppose given an optimal policy for . By Bellman optimality, is optimal for if and only if, for all ,
Let and be given. There are two cases depending on whether or not .
Lemma 2.
.
Proof.
Let a policy for be given such that, for some , and let be an optimal policy for . Then
so that such a cannot be optimal. ∎
Theorem 1.
and are equivalent.
Proof.
By Lemma 1 it suffices to show that , which is immediate since
for any optimal policy for and any optimal policy for . Here the first equation is by Proposition 1 and Lemma 2, the second equation is by optimality of for by Lemma 1, and the final equation is by Proposition 1 and the Strong Compatibility hypothesis. ∎