1 Introduction
Reinforcement learning (RL) is a technique for training a policy used to govern the interaction between an agent and an environment. It is based on repeated explorations of the environment, which yield rewards that the agent should aim to maximise. Deep reinforcement learning combines RL and deep learning, by using neural networks to store a representation of a learnt reward function or optimal policy. These methods have been increasingly successful across a wide range of challenging application domains, including, for example, autonomous driving [37], robotics [10] and healthcare [15].
In safety-critical domains, it is particularly important to assure that policies learnt via RL will be executed safely, which makes the application of formal verification to this problem appealing. This is challenging, especially for deep RL, since it requires reasoning about multidimensional, continuous state spaces and complex policies encoded as deep neural networks.
There are several approaches to assuring safety in reinforcement learning, often leveraging ideas from formal verification, such as the use of temporal logic to specify safety conditions, or the use of abstract interpretation to build discretised models. One approach is shielding (e.g., [20]), which synthesises override mechanisms to prevent the RL agent from acting upon bad decisions; another is constrained or safe RL (e.g. [29]), which generates provably safe policies, typically by restricting the training process to safe explorations.
An alternative approach, which we take in this paper, is to verify an RL policy’s correctness after it has been learnt, rather than placing restrictions on the learning process or on its deployment. Progress has been made in the formal verification of policies for RL [21] and also for the specific case of deep RL [36, 18, 14], in the latter case by building on advances in abstraction and verification techniques for neural networks; [18] also exploits the development of efficient abstract domains such as template polyhedra [46], previously applied to the verification of continuous-space and hybrid systems [23, 28].
A useful tool in reinforcement learning is the notion of a probabilistic policy (or stochastic policy), which chooses randomly between available actions in each state, according to a probability distribution specified by the policy. This brings a number of advantages (similarly to mixed strategies [43] in game theory and contextual bandits [39]), such as balancing the exploration-exploitation tradeoff [13], dealing with partial observability of the environment [44], handling multiple objectives [5] or learning continuous actions [42].
In this paper, we tackle the problem of verifying the safety of probabilistic policies for deep reinforcement learning. We define a formal model of their execution using (continuous-state, finite-branching) discrete-time Markov processes. We then build and solve sound abstractions of these models. This approach was also taken in earlier work [14], which used Markov decision process abstractions to verify deep RL policies in which actions may exhibit failures.
However, a particular challenge for probabilistic policies, as generated by deep RL, is that policies tend to specify very different action distributions across states. We thus propose a novel abstraction based on interval Markov decision processes (IMDPs), in which transitions are labelled with intervals of probabilities, representing the range of possible events that can occur. We solve these IMDPs, over a finite time horizon, which we show yields probabilistic guarantees, in the form of upper bounds on the actual probability of the RL policy leading the agent to a state designated to be unsafe.
We present methods to construct IMDP abstractions using template polyhedra as an abstract domain, and mixed-integer linear programming (MILP) to reason symbolically about the neural network policy encoding and a model of the RL agent’s environment. We extend existing MILP-based methods for neural networks to cope with the softmax encoding used for probabilistic policies. Naive approaches to constructing these IMDPs yield abstractions that are too coarse, i.e., where the probability intervals are too wide and the resulting safety probability bounds are too high to be useful. So, we present an iterative refinement approach based on sampling which splits abstract states via cross-entropy minimisation based on the uncertainty of the overapproximation.
We implement our techniques, building on an extension of the probabilistic model checker PRISM [7] to solve IMDPs. We show that our approach successfully verifies probabilistic policies trained for several reinforcement learning benchmarks and explore tradeoffs in precision and computational efficiency.
Related work. As discussed above, other approaches to assuring safety in reinforcement learning include shielding [20, 22, 51, 38, 34] and constrained or safe RL [29, 31, 27, 49, 32, 41, 35, 17]. By contrast, we verify policies independently, without limiting the training process or imposing constraints on execution.
Formal verification of RL, but in a non-probabilistic setting, includes: [21], which extracts and analyses decision trees; [36], which checks safety and liveness properties for deep RL; and [18], which also uses template polyhedra and MILP to build abstractions, but to check (non-probabilistic) safety invariants.
In the probabilistic setting, perhaps closest is our earlier work [14], which uses abstraction for finite-horizon probabilistic verification of deep RL, but for non-probabilistic policies, thus using a simpler (MDP) abstraction, as well as a coarser (interval) abstract domain and a different, more basic approach to refinement. Another approach to generating formal probabilistic guarantees is [19], which, unlike ours, does not need a model of the environment and instead learns an approximation and produces probably approximately correct (PAC) guarantees. Probabilistic verification of neural network policies on partially observable models, but for discrete state spaces, was considered in [16].
There is also a body of work on verifying continuous-space probabilistic models and stochastic hybrid systems, by building finite-state abstractions as, e.g., interval Markov chains [9] or interval MDPs [12, 25], but these do not consider control policies encoded as neural networks. Similarly, abstractions of discrete-state probabilistic models use similar ideas to our approach, notably via the use of interval Markov chains [4] and stochastic games [6].
2 Background
We first provide background on the two key probabilistic models used in this paper: discrete-time Markov processes (DTMPs), used to model RL policy executions, and interval Markov decision processes (IMDPs), used for abstractions.
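Notation. We write $\mathit{Dist}(X)$ for the set of discrete probability distributions over a set $X$, i.e., functions $\mu : X \to [0,1]$ where $\sum_{x \in X} \mu(x) = 1$. The support of $\mu$, denoted $\mathit{supp}(\mu)$, is defined as $\mathit{supp}(\mu) = \{x \in X \mid \mu(x) > 0\}$. We use the same notation where $X$ is uncountable but where $\mu$ has finite support. We write $\mathcal{P}(X)$ to denote the powerset of $X$ and $v_i$ for the $i$th element of a vector $v$.
Definition 1 (Discrete-time Markov process)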
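A (finite-branching) discrete-time Markov process is a tuple $(S, S_0, P, AP, L)$, where: $S$ is a (possibly uncountably infinite) set of states; $S_0 \subseteq S$ is a set of initial states; $P : S \times S \to [0,1]$ is a transition probability matrix, where $\sum_{s' \in \mathit{supp}(P(s,\cdot))} P(s,s') = 1$ for all $s \in S$; $AP$ is a set of atomic propositions; and $L : S \to \mathcal{P}(AP)$ is a labelling function.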
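A DTMP begins in some initial state $s_0 \in S_0$ and then moves between states at discrete time steps. From state $s$, the probability of making a transition to state $s'$ is $P(s,s')$. Note that, although the state space of DTMPs used here is continuous, each state only has a finite number of possible successors. This is always true for our models (where transitions represent policies choosing between a finite number of actions) and simplifies the model.
A path through a DTMP is an infinite sequence of states $s_0 s_1 s_2 \dots$ such that $P(s_i, s_{i+1}) > 0$ for all $i$. The set of all paths starting in state $s$ is denoted $\mathit{Path}(s)$ and we define a probability space $\Pr_s$ over $\mathit{Path}(s)$ in the usual way [3]. We use atomic propositions (from the set $AP$) to label states of interest for verification, e.g., to denote them as safe or unsafe. For $b \in AP$, we write $s \models b$ if $b \in L(s)$.
The probability of reaching a $b$-labelled state from $s$ within $k$ steps is:
$$\Pr_s(\Diamond^{\le k} b) = \Pr_s\big(\{ s_0 s_1 s_2 \dots \in \mathit{Path}(s) \mid s_i \models b \text{ for some } 0 \le i \le k \}\big)$$
which, since DTMPs are finite-branching models, can be computed recursively:
$$\Pr_s(\Diamond^{\le k} b) = \begin{cases} 1 & \text{if } s \models b \\ 0 & \text{if } s \not\models b \text{ and } k = 0 \\ \sum_{s' \in \mathit{supp}(P(s,\cdot))} P(s,s') \cdot \Pr_{s'}(\Diamond^{\le k-1} b) & \text{otherwise.} \end{cases}$$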
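The recursion for bounded reachability can be illustrated on a toy model. The following is a minimal sketch, assuming a small hypothetical chain stored as a dictionary (the models in this paper have continuous state spaces, so this only shows the shape of the recursion); the names `P`, `LABELS` and `reach_prob` are illustrative and not taken from the paper.

```python
from functools import lru_cache

# Hypothetical finite-branching model: state -> {successor: probability}.
P = {
    "s0": {"s0": 0.5, "s1": 0.4, "bad": 0.1},
    "s1": {"s1": 1.0},
    "bad": {"bad": 1.0},
}
# Hypothetical labelling: only "bad" carries the atomic proposition "fail".
LABELS = {"bad": {"fail"}}


@lru_cache(maxsize=None)
def reach_prob(state, k, label="fail"):
    """Probability of reaching a `label`-state from `state` within k steps."""
    if label in LABELS.get(state, set()):
        return 1.0  # already in a labelled state
    if k == 0:
        return 0.0  # no steps left
    # Weighted sum over the finitely many successors.
    return sum(p * reach_prob(nxt, k - 1, label) for nxt, p in P[state].items())
```

Memoisation keeps the cost linear in the number of (state, horizon) pairs actually visited, which is what makes the finite-branching assumption convenient.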
To build abstractions, we use interval Markov decision processes (IMDPs).
Definition 2 (Interval Markov decision process)
An interval Markov decision process is a tuple $(S, S_0, \mathbf{P}, AP, L)$, where: $S$ is a finite set of states; $S_0 \subseteq S$ are initial states; $\mathbf{P} : S \times \mathbb{N} \times S \to \mathbb{I} \cup \{0\}$ is the interval transition probability function, where $\mathbb{I}$ is the set of probability intervals $\mathbb{I} = \{[a,b] \mid 0 < a \le b \le 1\}$, assigning either a probability interval or the exact probability of 0 to any transition; $AP$ is a set of atomic propositions; and $L : S \to \mathcal{P}(AP)$ is a labelling function.
Like a DTMP, an IMDP evolves through states in a state space $S$, starting from an initial state $s_0 \in S_0$. In each state $s \in S$, an action $j$ must be chosen. Because of the way we use IMDPs, and to avoid confusion with the actions taken by RL policies, we simply use integer indices $j \in \mathbb{N}$ for actions. The probability of moving to each successor state $s'$ then falls within the interval $\mathbf{P}(s, j, s')$.
To reason about IMDPs, we use policies, which resolve the nondeterminism in terms of actions and probabilities. A policy $\sigma$ of the IMDP selects the choice to take in each state, based on the history of its execution so far. In addition, we have a so-called environment policy $\tau$ which selects probabilities for each transition that fall within the specified intervals. For a policy $\sigma$ and environment policy $\tau$, we have a probability space $\Pr_s^{\sigma,\tau}$ over the set of infinite paths starting in state $s$. As above, we can define, for example, the probability $\Pr_s^{\sigma,\tau}(\Diamond^{\le k} b)$ of reaching a $b$-labelled state from $s$ within $k$ steps, under $\sigma$ and $\tau$.
If $\psi$ is an event of interest defined by a measurable set of paths (e.g., $\Diamond^{\le k} b$), we can compute (through robust value iteration [8]) lower and upper bounds on, e.g., maximum probabilities, over the set of all allowable probability values:
$$\Pr_s^{\max\min}(\psi) = \sup_{\sigma} \inf_{\tau} \Pr_s^{\sigma,\tau}(\psi) \qquad \text{and} \qquad \Pr_s^{\max\max}(\psi) = \sup_{\sigma} \sup_{\tau} \Pr_s^{\sigma,\tau}(\psi)$$
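To make the upper bound concrete, here is a minimal sketch of finite-horizon interval value iteration for the best-case (maximising) resolution of both the action choice and the interval probabilities. It is not the PRISM implementation; the data layout (`imdp[state][choice] = {successor: (lo, hi)}`) and all function names are assumptions made for illustration, and the intervals are assumed consistent (lower bounds sum to at most 1, upper bounds to at least 1).

```python
def best_case_expectation(intervals, values):
    """Maximise sum_t p[t] * values[t] s.t. lo <= p[t] <= hi and sum_t p[t] = 1."""
    probs = {t: lo for t, (lo, hi) in intervals.items()}
    budget = 1.0 - sum(probs.values())  # mass still to be distributed
    # Greedily push the remaining mass towards the highest-valued successors.
    for t in sorted(intervals, key=lambda t: values[t], reverse=True):
        lo, hi = intervals[t]
        extra = min(hi - lo, budget)
        probs[t] += extra
        budget -= extra
    return sum(probs[t] * values[t] for t in intervals)


def max_max_reach(imdp, fail, k):
    """Upper bound on the probability of reaching `fail` within k steps."""
    values = {s: (1.0 if s in fail else 0.0) for s in imdp}
    for _ in range(k):
        new_values = {}
        for s, choices in imdp.items():
            if s in fail:
                new_values[s] = 1.0
            else:
                new_values[s] = max(
                    best_case_expectation(iv, values) for iv in choices.values()
                )
        values = new_values
    return values


# Hypothetical three-state IMDP used only to exercise the functions.
EXAMPLE = {
    "a": {0: {"a": (0.6, 0.9), "f": (0.1, 0.4)}, 1: {"ok": (1.0, 1.0)}},
    "ok": {0: {"ok": (1.0, 1.0)}},
    "f": {0: {"f": (1.0, 1.0)}},
}
```

The greedy inner step works because the feasible distributions form a simple box intersected with the probability simplex, so sorting successors by value gives the optimum without calling an LP solver.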
3 Modelling and Abstraction of Reinforcement Learning
We begin by giving a formal definition of our model for the execution of a reinforcement learning system, under the control of a probabilistic policy. We also define the problem of verifying that this policy is executed safely, namely that the probability of visiting an unsafe system state, within a specified time horizon, is below an acceptable threshold.
Then we define abstractions of these models, given an abstract domain over the states of the model, and show how an analysis of the resulting abstraction yields probabilistic guarantees in the form of sound upper bounds on the probability of a failure occurring. In this section, we make no particular assumption about the representation of the policy, nor about the abstract domain.
3.1 Modelling and Verification of Reinforcement Learning
Our model takes the form of a controlled dynamical system over a continuous $n$-dimensional state space $S \subseteq \mathbb{R}^n$, assuming a finite set of actions $A$ performed at discrete time steps. A (time-invariant) environment $E : S \times A \to S$ describes the effect of executing an action in a state, i.e., if $s_t$ is the state at time $t$ and $a_t$ is the action taken in that state, we have $s_{t+1} = E(s_t, a_t)$.
We assume a reinforcement learning system is controlled by a probabilistic policy, i.e., a function of the form $\pi : S \to \mathit{Dist}(A)$, where $\pi(s)(a)$ specifies the probability with which action $a$ should be taken in state $s$. Since we are interested in verifying the behaviour of a particular policy, not in the problem of learning such a policy, we ignore issues of partial observability. We also do not need to include any definition of rewards.
Furthermore, since our primary interest here is in the treatment of probabilistic policies, we do not consider other sources of stochasticity, such as the agent’s perception of its state or the environment’s response to an action. Our model could easily be extended with other discrete probabilistic aspects, such as the policy execution failure models considered in [14].
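Combining all of the above, we define an RL execution model as a (continuous-space, finite-branching) discrete-time Markov process (DTMP). In addition to a particular environment $E$ and policy $\pi$, we also specify a set $S_0 \subseteq S$ of possible initial states and a set $S_{fail} \subseteq S$ of failure states, representing unsafe states.
Definition 3 (RL execution model)
Assuming a state space $S \subseteq \mathbb{R}^n$ and action set $A$, and given an environment $E : S \times A \to S$, policy $\pi : S \to \mathit{Dist}(A)$, initial states $S_0 \subseteq S$ and failure states $S_{fail} \subseteq S$, the corresponding RL execution model is the DTMP $(S, S_0, P, AP, L)$ where $AP = \{\mathit{fail}\}$, for any $s \in S$, $\mathit{fail} \in L(s)$ iff $s \in S_{fail}$ and, for states $s, s' \in S$:
$$P(s, s') = \sum_{a \in A \,:\, E(s,a) = s'} \pi(s)(a)$$
The summation in Definition 3 is required since distinct actions $a$ and $a'$ applied in state $s$ could result in the same successor state $s'$.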
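The need for the summation can be seen in a small example. This is a minimal sketch, assuming a hypothetical one-dimensional environment whose state is clipped to the range 0 to 10, so that two different actions can lead to the same successor; `env`, `policy` and `successor_distribution` are illustrative names only.

```python
from collections import defaultdict

ACTIONS = (-1, 0, 1)  # hypothetical actions: move down, stay, move up


def env(state, action):
    """Hypothetical deterministic environment: move and clip to [0, 10]."""
    return min(10, max(0, state + action))


def policy(state):
    """Hypothetical probabilistic policy (the same distribution in every state)."""
    return {-1: 0.2, 0: 0.3, 1: 0.5}


def successor_distribution(state, env, policy, actions):
    """Transition probabilities from `state`: sum action probabilities per successor."""
    dist = defaultdict(float)
    action_probs = policy(state)
    for a in actions:
        dist[env(state, a)] += action_probs[a]
    return dict(dist)
```

At the boundary state 0, the actions -1 and 0 both yield successor 0, so its probability is the sum 0.2 + 0.3.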
Then, assuming the model above, we define the problem of verifying that an RL policy executes safely. We consider a fixed time horizon $k \in \mathbb{N}$ and an error probability threshold $p_{safe}$, and then check that the probability of reaching an unsafe state within $k$ time steps is always (from any start state) below $p_{safe}$.
Definition 4 (RL verification problem)
Given a DTMP model of an RL execution, as in Definition 3, a time horizon $k$ and a threshold $p_{safe}$, the RL verification problem is to check that $\Pr_s(\Diamond^{\le k} \mathit{fail}) < p_{safe}$ for all $s \in S_0$.
In practice, we often tackle a numerical version of the verification problem, and instead compute the worst-case probability of error for any start state or (as we do later) an upper bound on this value.
3.2 Abstractions for Verification of Reinforcement Learning
Because our models of RL systems are over continuous state spaces, in order to verify them in practice, we construct finite abstractions. These represent an overapproximation of the original model, by grouping states with similar behaviour into abstract states, belonging to some abstract domain $\hat{S} \subseteq \mathcal{P}(S)$.
Such abstractions are usually necessarily nondeterministic since an abstract state groups states with similar, but distinct, behaviour. For example, abstraction of a probabilistic model such as a discrete-time Markov process could be captured as a Markov decision process [14]. However, a further source of complexity for abstracting probabilistic policies, especially those represented as deep neural networks, is that states can also vary widely with regards to the probabilities with which policies select actions in those states.
So, in this work we represent abstractions as interval MDPs (IMDPs), in which transitions are labelled with intervals, representing a range of different possible probabilities. We will show that solving the IMDP (i.e., computing the maximum finite-horizon probability of reaching a failure state) yields an upper bound on the corresponding probability for the model being abstracted.
Below, we define this abstraction and state its correctness, first focusing separately on abstractions of an RL system’s environment and policy, and then combining these into a single IMDP abstraction.
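Assuming an abstract domain $\hat{S} \subseteq \mathcal{P}(S)$, we first require an environment abstraction $\hat{E}$, which soundly overapproximates the environment $E$, as follows.
Definition 5 (Environment abstraction)
For environment $E : S \times A \to S$ and set of abstract states $\hat{S} \subseteq \mathcal{P}(S)$, an environment abstraction is a function $\hat{E} : \hat{S} \times A \to \hat{S}$ such that: for any abstract state $\hat{s} \in \hat{S}$, concrete state $s \in \hat{s}$ and action $a \in A$, we have $E(s, a) \in \hat{E}(\hat{s}, a)$.
Additionally, we need, for any policy $\pi$, a policy abstraction $\hat{\pi}$, which gives a lower and upper bound on the probability with which each action is selected within the states grouped by each abstract state.
Definition 6 (Policy abstraction)
For a policy $\pi : S \to \mathit{Dist}(A)$ and a set of abstract states $\hat{S} \subseteq \mathcal{P}(S)$, a policy abstraction $\hat{\pi}$ is a pair $(\hat{\pi}_L, \hat{\pi}_U)$ of functions of the form $\hat{\pi}_L : \hat{S} \times A \to [0,1]$ and $\hat{\pi}_U : \hat{S} \times A \to [0,1]$, satisfying the following: for any abstract state $\hat{s} \in \hat{S}$, concrete state $s \in \hat{s}$ and action $a \in A$, we have $\hat{\pi}_L(\hat{s}, a) \le \pi(s)(a) \le \hat{\pi}_U(\hat{s}, a)$.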
Finally, combining these notions, we can define an RL execution abstraction, which is an IMDP abstraction of the execution of a policy in an environment.
Definition 7 (RL execution abstraction)
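Let $E$ and $\pi$ be an RL environment and policy, DTMP $(S, S_0, P, AP, L)$ be the corresponding RL execution model and $\hat{S} \subseteq \mathcal{P}(S)$ be a set of abstract states. Given also a policy abstraction $\hat{\pi}$ of $\pi$ and an environment abstraction $\hat{E}$ of $E$, an RL execution abstraction is an IMDP $(\hat{S}, \hat{S}_0, \hat{\mathbf{P}}, AP, \hat{L})$ satisfying the following:
(i) for all $s \in S_0$, $s \in \hat{s}$ for some $\hat{s} \in \hat{S}_0$;
(ii) for each $\hat{s} \in \hat{S}$, there is a partition $\{\hat{s}_1, \dots, \hat{s}_m\}$ of $\hat{s}$ such that, for each $1 \le j \le m$ we have $\hat{\mathbf{P}}(\hat{s}, j, \hat{s}') = [\hat{P}_L(\hat{s}, j, \hat{s}'), \hat{P}_U(\hat{s}, j, \hat{s}')]$ where:
$$\hat{P}_L(\hat{s}, j, \hat{s}') = \sum_{a \in A \,:\, \hat{E}(\hat{s}_j, a) = \hat{s}'} \hat{\pi}_L(\hat{s}_j, a) \qquad \hat{P}_U(\hat{s}, j, \hat{s}') = \sum_{a \in A \,:\, \hat{E}(\hat{s}_j, a) = \hat{s}'} \hat{\pi}_U(\hat{s}_j, a)$$
(iii) $\mathit{fail} \in \hat{L}(\hat{s})$ iff $s \in S_{fail}$ for some $s \in \hat{s}$.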
Intuitively, the nondeterminism in each abstract state $\hat{s}$ (i.e., the choice between actions $j$) represents a partition of $\hat{s}$ into groups of states that behave the same under the specified environment and policy abstractions.
Finally, we state the correctness of the abstraction, i.e., that solving the IMDP provides upper bounds on the probability of policy execution resulting in a failure. This is formalised as follows (see Appendix A for a proof).
Theorem 3.1
Given a state $s \in S$ of an RL execution model DTMP, and an abstract state $\hat{s} \in \hat{S}$ of the corresponding abstraction IMDP for which $s \in \hat{s}$:
$$\Pr_s(\Diamond^{\le k} \mathit{fail}) \le \Pr_{\hat{s}}^{\max\max}(\Diamond^{\le k} \mathit{fail})$$
In particular, this means that we can tackle the RL verification problem of checking that the error probability is below a threshold $p_{safe}$ for all possible start states (see Definition 4). We can do this by finding an abstraction for which $\Pr_{\hat{s}}^{\max\max}(\Diamond^{\le k} \mathit{fail}) < p_{safe}$ for all initial abstract states $\hat{s} \in \hat{S}_0$.
Although $\Pr_{\hat{s}}^{\max\min}(\Diamond^{\le k} \mathit{fail})$ is not necessarily a lower bound on the failure probability, the value may still be useful to guide abstraction-refinement.
4 Template-based Abstraction of Neural Network Policies
We now describe in more detail the process for constructing an IMDP abstraction, as given in Definition 7, to verify the execution of an agent with its environment, under the control of a probabilistic policy. We assume that the policy is encoded in neural network form and has already been learnt, prior to verification, and we use template polyhedra to represent abstract states.
The overall process works by building a $k$-step unfolding of the IMDP, starting from a set of initial states $\hat{S}_0 \subseteq \hat{S}$. For each abstract state $\hat{s}$ explored during this process, we need to split $\hat{s}$ into an appropriate partition $\{\hat{s}_1, \dots, \hat{s}_m\}$. Then, for each $\hat{s}_j$ and each action $a$, we determine lower and upper bounds on the probabilities with which $a$ is selected in states in $\hat{s}_j$, i.e., we construct a policy abstraction $\hat{\pi}$. We also find the successor abstract state that results from executing $a$ in $\hat{s}_j$, i.e., we build an environment abstraction $\hat{E}$. Construction of the IMDP then follows directly from Definition 7.
In the following sections, we describe our techniques in more detail. First, we give brief details of the abstract domain used: bounded polyhedra. Next, we describe how to construct policy abstractions via MILP. Lastly, we describe how to partition abstract states via refinement. We omit details of the environment abstraction since we reuse the symbolic post operator over template polyhedra given in [18], also performed with MILP. This supports environments specified as linear, piecewise linear or nonlinear systems defined with polynomial and transcendental functions. The latter is dealt with using linearisation, subdividing into small intervals and overapproximating using interval arithmetic.
4.1 Bounded Template Polyhedra
We represent abstract states using template polyhedra [46], convex shapes constrained within a fixed set of directions. A finite set of directions $T \subset \mathbb{R}^n$ is called a template and a $T$-polyhedron is a polyhedron with facets that are normal to the directions in $T$. Given a (convex) abstract state $\hat{s} \subseteq \mathbb{R}^n$, the $T$-polyhedron of $\hat{s}$ is the tightest $T$-polyhedron enclosing $\hat{s}$:
$$\bigcap_{d \in T} \big\{ s \;\big|\; \langle d, s \rangle \le \max \{ \langle d, t \rangle \mid t \in \hat{s} \} \big\}$$
where $\langle \cdot, \cdot \rangle$ denotes scalar product. In this paper, we restrict our attention to bounded template polyhedra (also called polytopes), in which every variable in the state space is bounded by a direction of the template, since this is needed for our refinement scheme. Important special cases of template polyhedra, which we use later, are rectangles (i.e., intervals) and octagons.
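As an illustration of how a template fixes the shape of an abstract state, the sketch below computes the tightest template polyhedron around a finite set of points (for instance the vertices of a convex polytope, where a linear function attains its maximum). The templates `RECT` and `OCT` for two dimensions and the helper names are assumptions made for this example.

```python
import numpy as np

# Hypothetical 2-D templates: axis directions (rectangles) and, in addition,
# the four diagonals (octagons).
RECT = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
OCT = np.vstack([RECT, [[1, 1], [1, -1], [-1, 1], [-1, -1]]])


def template_bounds(points, template):
    """For each direction d, return the maximum of <d, x> over the given points."""
    points = np.asarray(points, dtype=float)
    return (points @ template.T).max(axis=0)


def contains(x, template, bounds, tol=1e-9):
    """Check whether x satisfies <d, x> <= bound for every template direction."""
    return bool(np.all(template @ np.asarray(x, dtype=float) <= bounds + tol))


# A triangle enclosed by a rectangle and by an octagon.
TRIANGLE = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
```

For the triangle above, the rectangle is the square with corners (0, 0) and (2, 2), which also contains points such as (1.5, 1.5) that lie outside the triangle; the octagon adds the constraint x + y <= 2 and excludes them, showing how a richer template gives a tighter enclosure.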
4.2 Constructing Policy Abstractions
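We focus first on the abstraction of the probabilistic policy $\pi$. Recall that the state space $S \subseteq \mathbb{R}^n$ is over $n$ real-valued variables and assume there are $l$ actions: $A = \{a_1, \dots, a_l\}$. Let $\pi$ be encoded by a neural network comprising $n$ input neurons, $h$ hidden layers, each containing $n_i$ neurons ($1 \le i \le h$), and $l$ output neurons, and using ReLU activation functions.
The policy is encoded as follows. We use variable vectors $z_0, \dots, z_{h+1}$ to denote the values of the neurons at each layer. The current state of the environment is fed to the input layer $z_0$, each hidden layer’s values are as follows:
$$z_i = \mathrm{ReLU}(W_i z_{i-1} + b_i) \quad \text{for } i = 1, \dots, h$$
and the final layer is $z_{h+1} = W_{h+1} z_h + b_{h+1}$, where each $W_i$ is a matrix of weights connecting layers $i-1$ and $i$ and each $b_i$ is a vector of biases. In the usual fashion, $\mathrm{ReLU}(z) = \max(z, 0)$. Finally, the probability that the encoded policy selects action $a_j$ is given by $p_j$ based on a softmax normalisation of the final layer:
$$p_j = \mathrm{softmax}(z_{h+1})_j = \frac{e^{(z_{h+1})_j}}{\sum_{i=1}^{l} e^{(z_{h+1})_i}}$$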
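For an abstract state $\hat{s}$, we compute the policy abstraction, i.e., lower and upper bounds $\hat{\pi}_L(\hat{s}, a_j)$ and $\hat{\pi}_U(\hat{s}, a_j)$ for all actions $a_j$ (see Definition 6), via mixed-integer linear programming (MILP), building on existing MILP encodings of neural networks [50, 26, 11]. The probability bounds cannot be directly computed via MILP due to the nonlinearity of the softmax function so, as a proxy, we maximise the corresponding entry (the $j$th logit) of the output layer ($z_{h+1}$). For the upper bound (the lower bound is computed analogously), we optimise:
$$\begin{array}{ll} \text{maximise} & (z_{h+1})_j \\ \text{subject to} & z_0 \in \hat{s}, \\ & 0 \le z_i - W_i z_{i-1} - b_i \le M z'_i \quad \text{for } i = 1, \dots, h, \\ & 0 \le z_i \le M - M z'_i \quad \text{for } i = 1, \dots, h, \\ & z_{h+1} = W_{h+1} z_h + b_{h+1} \end{array} \qquad (1)$$
over the variables $z_0 \in \mathbb{R}^n$, $z_{h+1} \in \mathbb{R}^l$, and $z_i \in \mathbb{R}^{n_i}$ and $z'_i \in \{0,1\}^{n_i}$, for $1 \le i \le h$.
Since abstract state $\hat{s}$ is a convex polyhedron, the initial constraint $z_0 \in \hat{s}$ on the vector of values fed to the input layer is represented by linear inequalities. ReLU functions are modelled using a big-M encoding [50], where we add integer variable vectors $z'_i$ and $M$ is a constant representing an upper bound for the possible values of neurons.
We solve two MILPs per action (one maximisation and one minimisation) to obtain lower and upper bounds on the logits for all actions. We then calculate bounds on the probabilities of each action by combining these values as described below. Since the exponential function in softmax is monotonic, it preserves the order of the intervals, allowing us to compute the bounds on the probabilities achievable in $\hat{s}$.
Let $x^L_j$ and $x^U_j$ denote the lower and upper bounds, respectively, obtained for the logit of each action $a_j$ via MILP (i.e., the optimised values in (1) above). Then, the upper bound for the probability of choosing action $a_j$ is $y^U_j$:
$$y^U_j = \mathrm{softmax}(z^{U,j})_j \quad \text{where} \quad z^{U,j}_i = \begin{cases} x^U_i & \text{if } i = j \\ x^L_i & \text{otherwise} \end{cases}$$
and where $z^{U,j}$ is an intermediate vector of size $l$. Again, the computation for the lower bound is performed analogously.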
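The step from logit bounds to probability bounds can be sketched directly. The code below is an illustrative sketch rather than the paper's implementation: it assumes the per-action logit bounds have already been obtained (by MILP or otherwise) and builds, for each action, the intermediate vector that takes the extreme logit for that action and the opposite extreme for all others. The function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def softmax_bounds(logit_lo, logit_hi):
    """Sound per-action probability bounds from per-action logit bounds."""
    lo = np.asarray(logit_lo, dtype=float)
    hi = np.asarray(logit_hi, dtype=float)
    p_lo = np.empty(len(lo))
    p_hi = np.empty(len(lo))
    for j in range(len(lo)):
        # Upper bound: own logit as large as possible, all others as small as possible.
        z_up = lo.copy()
        z_up[j] = hi[j]
        # Lower bound: own logit as small as possible, all others as large as possible.
        z_down = hi.copy()
        z_down[j] = lo[j]
        p_hi[j] = softmax(z_up)[j]
        p_lo[j] = softmax(z_down)[j]
    return p_lo, p_hi
```

Because softmax is increasing in its own logit and decreasing in every other logit, any probability produced by logits inside the box lies between these two values. The bounds may still be loose, since the extreme logits for different actions need not be attained at the same state.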
4.3 Refinement of Abstract States
As discussed above, each abstract state $\hat{s}$ in the IMDP is split into a partition $\{\hat{s}_1, \dots, \hat{s}_m\}$ and, for each $\hat{s}_j$, the probability bounds $\hat{\pi}_L(\hat{s}_j, a)$ and $\hat{\pi}_U(\hat{s}_j, a)$ are determined for each action $a$. If these intervals are too wide, the abstraction is too coarse and the results uninformative. To determine a good partition (i.e., one that groups states with similar behaviour in terms of the probabilities chosen by the policy), we use refinement, repeatedly splitting $\hat{s}$ into finer partitions.
We define the maximum probability spread of $\hat{s}_j$, denoted $\Delta^{\max}_{\hat{\pi}}(\hat{s}_j)$, as:
$$\Delta^{\max}_{\hat{\pi}}(\hat{s}_j) = \max_{a \in A} \big( \hat{\pi}_U(\hat{s}_j, a) - \hat{\pi}_L(\hat{s}_j, a) \big)$$
and we refine $\hat{s}_j$ until $\Delta^{\max}_{\hat{\pi}}(\hat{s}_j)$ falls below a specified threshold $\phi$. Varying $\phi$ allows us to tune the desired degree of precision.
When refining, our aim is to minimise $\Delta^{\max}_{\hat{\pi}}(\hat{s}_j)$, i.e., to group areas of the state space that have similar probability ranges, but also to minimise the number of splits performed. We try to find a good compromise between improving the accuracy of the abstraction and reducing partition growth, which generates additional abstract states and increases the size of the IMDP abstraction.
Calculating the range via MILP, as described above, can be time consuming. So, during the first part of refinement for each abstract state, we sample probabilities for some states to compute an underestimate of the true range. If the sampled range is already wide enough to trigger further refinement, we do so; otherwise we calculate the exact range of probabilities using MILP to check whether there is a need for further refinement.
Each refinement step comprises three phases, described in more detail below: (i) sampling policy probabilities; (ii) selecting a direction to split; (iii) splitting. Figure 1 gives an illustrative example of a full refinement.
Sampling the neural network policy. We first generate a sample of the probabilities chosen by the policy within the abstract state. Since this is a convex region, we sample state points within it randomly using the Hit & Run method [48]. We then obtain, from the neural network, the probabilities of picking actions at each sampled state. We consider each action separately, and then later split according to the most promising one (i.e., with the widest probability spread across all actions). The probabilities for each action are computed in a one-vs-all fashion: we generate a point cloud representing the probability of taking that action as opposed to any other action.
The number of samples used (and hence the time needed) is kept fixed, rather than fixing the density of the sampled points. We sample 1000 points per abstract state split but this parameter can be tuned depending on the machine and the desired time/accuracy tradeoff. This ensures that ever more accurate approximations are generated as the size of the polyhedra decreases.
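For readers unfamiliar with Hit & Run sampling, the following is a minimal sketch for a bounded polytope written as A x <= b. It is an illustrative implementation under stated assumptions (a strictly interior starting point, a bounded region, no burn-in or thinning), not the sampler used in the paper's code; `hit_and_run` is a hypothetical name.

```python
import numpy as np


def hit_and_run(A, b, x0, n_samples, rng=None):
    """Sample points from the bounded polytope {x | A x <= b}, starting at x0."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        # Pick a uniformly random direction on the unit sphere.
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)
        # Feasible steps t satisfy t * (A d) <= b - A x, row by row.
        ad = A @ d
        slack = b - A @ x
        with np.errstate(divide="ignore", invalid="ignore"):
            t = slack / ad
        t_max = np.min(t[ad > 1e-12])   # tightest limit moving forwards
        t_min = np.max(t[ad < -1e-12])  # tightest limit moving backwards
        # Jump to a uniformly random point on the feasible chord.
        x = x + rng.uniform(t_min, t_max) * d
        samples[i] = x
    return samples
```

Each sampled point would then be passed through the policy network to obtain the action probabilities at that state.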
Choosing candidate directions. We refine abstract states (represented as template polyhedra) by bisecting them along a chosen direction from the set used to define them. Since the polyhedra are bounded, we are free to pick any one. To find the direction that contributes most to reducing the probability spread, we use cross-entropy minimisation to find the optimal boundary at which to split each direction, and then pick the direction that yields the lowest value.
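Let $\tilde{S}$ be the set of sampled points and $\tilde{p}_s$ denote the true probability of choosing action $a$ in each point $s \in \tilde{S}$, as extracted from the probabilistic policy. For a direction $d \in T$, we project all points in $\tilde{S}$ onto $d$ and sort them accordingly, i.e., we let $\tilde{S} = \{s_1, \dots, s_N\}$, where $N = |\tilde{S}|$ and index $i$ is sorted by $\langle d, s_i \rangle$. We determine the optimal boundary for splitting in direction $d$ by finding the optimal index $c$ that splits $\tilde{S}$ into $\{s_1, \dots, s_c\}$ and $\{s_{c+1}, \dots, s_N\}$. To do so, we first define the function $Y^c$ classifying the $i$th point according to this split:
$$Y^c_i = \begin{cases} 1 & \text{if } i > c \\ 0 & \text{if } i \le c \end{cases}$$
and then minimise, over $c$, the binary cross-entropy loss function:
$$H(Y^c, \tilde{p}) = -\frac{1}{N} \sum_{i=1}^{N} \Big( Y^c_i \log \tilde{p}_{s_i} + (1 - Y^c_i) \log (1 - \tilde{p}_{s_i}) \Big)$$
which reflects how well the true probability for each point matches the separation into the two groups.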
One problem with this approach is that, if the distribution of probabilities is skewed to strongly favour some probabilities, a good decision boundary may not be picked. To counter this, we perform sample weighting by grouping the sampled probabilities into small bins, and counting the number of samples in each bin to calculate how much weight to give to each sample.
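The search for a split point along one direction can be sketched as follows. This is an illustrative version under several assumptions: sampled probabilities are clipped away from 0 and 1 to keep the logarithms finite, both orientations of the two classes are tried, and the sample weighting described above is omitted. The function name `best_split` and its return convention are hypothetical.

```python
import numpy as np


def best_split(points, probs, direction, eps=1e-6):
    """Return (loss, cut) minimising binary cross-entropy along `direction`.

    `cut` is the projected value halfway between the two points that the
    best split separates.
    """
    points = np.asarray(points, dtype=float)
    proj = points @ np.asarray(direction, dtype=float)
    order = np.argsort(proj)
    proj = proj[order]
    p = np.clip(np.asarray(probs, dtype=float)[order], eps, 1.0 - eps)
    log_p, log_q = np.log(p), np.log(1.0 - p)
    n = len(p)
    best_loss, best_cut = np.inf, None
    for c in range(1, n):
        # First c points labelled 0 and the rest labelled 1 ...
        loss_a = -(log_q[:c].sum() + log_p[c:].sum()) / n
        # ... or the opposite labelling (an assumption of this sketch).
        loss_b = -(log_p[:c].sum() + log_q[c:].sum()) / n
        loss = min(loss_a, loss_b)
        if loss < best_loss:
            best_loss, best_cut = loss, 0.5 * (proj[c - 1] + proj[c])
    return best_loss, best_cut
```

Calling `best_split` once per template direction and keeping the direction with the lowest loss mirrors the selection rule described in the text.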
Abstract state splitting. Once a direction and bisection point are chosen, the abstract state is split into two with a corresponding pair of constraints that splits the polyhedron. Because we are constrained to the directions of the template, and the decision boundary is highly nonlinear, sometimes the bisection point falls close to the interval boundary and the resulting slices are extremely thin. This would cause the creation of an unnecessarily high number of polyhedra, which we prevent by imposing a minimum size of the split relative to the dimension chosen. By doing so we are guaranteed a minimum degree of progress and the complex shapes in the nonlinear policy space which are not easily classified (such as nonconvex shapes) are broken down into more manageable regions.
5 Experimental Evaluation
We evaluate our approach by implementing the techniques described in Section 4 and applying them to 3 reinforcement learning benchmarks, analysing performance and the impact of various configurations and optimisations.
5.1 Experimental Setup
Implementation. The code is developed in a mixture of Python and Java. Neural network manipulation is done through PyTorch [2], MILP solution through Gurobi [30], graph analysis with NetworkX [1] and cross-entropy minimisation with scikit-learn [45]. IMDPs are constructed and solved using an extension of PRISM [7] which implements robust value iteration [8]. The code is available from https://github.com/phate09/SafeDRL.
Benchmarks. We use the following three RL benchmark environments:
(i) Bouncing ball [33]: The agent controls a ball with height and vertical velocity , choosing to either hit the ball downward with a paddle, adding speed, or do nothing. The ball accelerates while falling and bounces on the ground losing 10% of its energy; it eventually stops bouncing if its height is too low and it is out of reach of the paddle. The initial heights and speed vary. In our experiments, we consider two possible starting regions: “large” (), where and , and “small” (), where and . The safety constraint is that the ball never stops bouncing.
(ii) Adaptive cruise control [18]: The problem has two vehicles , whose state is determined by variables and for the position and speed of each car, respectively. The lead car proceeds at constant speed (), and the agent controls the acceleration () of using two actions. The range of possible start states allows a relative distance of metres and the speed of the ego vehicle is in m/s. Safety means preserving .
(iii) Inverted pendulum: This benchmark is a modified (discrete action) version of the “Pendulum-v0” environment from the OpenAI Gym [24] where an agent applies left or right rotational force to a pole pivoting around one of its ends, with the aim of balancing the pole in an upright position. The state is modelled by 2 variables: the angular position and velocity of the pole. We consider initial conditions of an angle and speed . Safety constitutes remaining within a range of positions and velocities such that an upright position can be recovered. This benchmark is more challenging than the previous two: it allows 3 actions (no-op, push left, push right) and the dynamics of the system are highly nonlinear, making the problem more complex.
Policy training. All agents have been trained using proximal policy optimisation (PPO) [47] in an actor-critic configuration with the Adam optimiser. The training is distributed over 8 actors with 10 instances of each environment, managing the collection of results and the update of the network with RLlib [40]. Hyperparameters have been mostly kept unchanged from their default values except the learning rate and batch size which have been set to and , respectively. We used a standard feed-forward architecture with 2 hidden layers (size 32 for the bouncing ball and size 64 for the adaptive cruise control and inverted pendulum problems) and ReLU activation functions.
Abstract domains. The abstraction techniques we present in Section 4 are based on the use of template polyhedra as an abstract domain. As special cases, this includes rectangles (intervals) and octagons. We use both of these in our experiments, but also the more general case of arbitrary bounded template polyhedra. In the latter case, we choose a set of directions by sampling a representative portion of the state space where the agent is expected to operate, and choosing appropriate slopes for the directions to better represent the decision boundaries. The effect of the choice of different templates can be seen in Fig. 2, where we show a representative abstract state and how the refinement algorithm is affected by the choice of template: as expected, increasing the generality of the abstract domain results in a smaller number of abstract states.
Containment checks. Lastly, we describe an optimisation implemented for construction of IMDP abstractions, whose effectiveness we will evaluate in the next section. When calculating the successors of abstract states to construct an IMDP, we sometimes find successors that are partially or fully contained within previously visited abstract states. At the possible cost of decreasing the accuracy of the abstraction, we can attempt to reduce the total size of the IMDP that is constructed by aggregating together states which are fully contained within previously visited abstract states.
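The idea behind the containment check can be sketched for the rectangle domain. This is a simplified illustration, assuming abstract states are axis-aligned boxes stored as tuples of (low, high) pairs and that a contained successor is simply replaced by the first visited box that encloses it; the helper names are hypothetical and the actual implementation may differ.

```python
def box_contains(outer, inner):
    """True if the box `inner` lies entirely within the box `outer`."""
    return all(
        lo_o <= lo_i and hi_i <= hi_o
        for (lo_o, hi_o), (lo_i, hi_i) in zip(outer, inner)
    )


def merge_if_contained(successor, visited):
    """Reuse an enclosing visited box if one exists, else record the new box."""
    for old in visited:
        if box_contains(old, successor):
            return old  # aggregate: coarser, but adds no new abstract state
    visited.append(successor)
    return successor
```

Replacing a successor by a larger enclosing state keeps the abstraction sound but can widen the resulting probability bounds, which matches the precision and size tradeoff discussed above.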
5.2 Experimental Results
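| Benchmark environment | Horizon $k$ | Abs. dom. | Threshold $\phi$ | Contain. check | Num. poly. | Num. visited | IMDP size | Prob. bound | Runtime (min.) |
|---|---|---|---|---|---|---|---|---|---|
| Bouncing ball () | 20 | Rect | 0.1 | ✓ | 337 | 28 | 411 | 0.0 | 1 |
| | 20 | Oct | 0.1 | ✓ | 352 | 66 | 484 | 0.0 | 2 |
| Bouncing ball () | 20 | Rect | 0.1 | ✓ | 1727 | 5534 | 7796 | 0.63 | 30 |
| | 20 | Oct | 0.1 | ✓ | 2489 | 3045 | 6273 | 0.0 | 33 |
| | 20 | Rect | 0.1 | ✗ | 18890 | 0 | 23337 | 0.006 | 91 |
| | 20 | Oct | 0.1 | ✗ | 13437 | 0 | 16837 | 0.0 | 111 |
| Adaptive cruise control | 7 | Rect | 0.33 | ✓ | 1522 | 4770 | 10702 | 0.084 | 85 |
| | 7 | Oct | 0.33 | ✓ | 1415 | 2299 | 6394 | 0.078 | 60 |
| | 7 | Temp | 0.33 | ✓ | 2440 | 2475 | 9234 | 0.47 | 70 |
| | 7 | Rect | 0.5 | ✓ | 593 | 1589 | 3776 | 0.62 | 29 |
| | 7 | Oct | 0.5 | ✓ | 801 | 881 | 3063 | 0.12 | 30 |
| | 7 | Temp | 0.5 | ✓ | 1102 | 1079 | 4045 | 0.53 | 34 |
| | 7 | Rect | 0.33 | ✗ | 11334 | 0 | 24184 | 0.040 | 176 |
| | 7 | Oct | 0.33 | ✗ | 7609 | 0 | 16899 | 0.031 | 152 |
| | 7 | Temp | 0.33 | ✗ | 6710 | 0 | 14626 | 0.038 | 113 |
| | 7 | Rect | 0.5 | ✗ | 3981 | 0 | 8395 | 0.17 | 64 |
| | 7 | Oct | 0.5 | ✗ | 2662 | 0 | 5895 | 0.12 | 52 |
| | 7 | Temp | 0.5 | ✗ | 2809 | 0 | 6178 | 0.16 | 48 |
| Inverted pendulum | 6 | Rect | 0.5 | ✓ | 1494 | 3788 | 14726 | 0.057 | 71 |
| | 6 | Rect | 0.5 | ✗ | 5436 | 0 | 16695 | 0.057 | 69 |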
Table 1 summarises the experimental results across the different benchmark environments; the numeric column following the environment name gives the time horizon considered. We use a range of configurations, varying: the abstract domain used (rectangles, octagons or general template polyhedra); the maximum probability spread threshold; and whether the containment check optimisation is used.
The table lists, for each case: the number of independent polyhedra generated; the number of instances in which polyhedra are contained in previously visited abstract states and aggregated together; the final size of the IMDP abstraction (number of abstract states); the generated upper bound on the probability of encountering an unsafe state from an initial state; and the runtime of the whole process. Experiments were run on a 4-core 4.2 GHz PC with 64 GB RAM.
Verification successfully produced probability bounds for all environments considered. Typically, the time horizons shown are the largest we could check, assuming a 3-hour timeout for verification. The majority of the runtime is spent constructing the abstraction, not solving the IMDP.
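For intuition about the IMDP solution step, the sketch below computes an upper bound on the probability of reaching an unsafe state within a given number of steps, using interval value iteration in which both the choice of action and the resolution of the probability intervals maximise that probability. This is a standard technique shown on a hypothetical three-state IMDP; it is not the tool chain used in the experiments, and the data layout (nested dictionaries) is an assumption made for brevity.

```python
def maximising_distribution(intervals, values):
    """Pick probabilities within the intervals, summing to 1, that maximise the
    expected value: start from the lower bounds and give the remaining mass to
    successors in order of decreasing value."""
    probs = {succ: lo for succ, (lo, hi) in intervals.items()}
    remaining = 1.0 - sum(probs.values())
    for succ in sorted(intervals, key=lambda s: values[s], reverse=True):
        lo, hi = intervals[succ]
        extra = min(hi - lo, remaining)
        probs[succ] += extra
        remaining -= extra
    return probs


def max_unsafe_probability(imdp, unsafe, horizon):
    """imdp maps state -> action -> successor -> (lower, upper) probability bound."""
    values = {s: (1.0 if s in unsafe else 0.0) for s in imdp}
    for _ in range(horizon):
        updated = {}
        for state, actions in imdp.items():
            if state in unsafe:
                updated[state] = 1.0
                continue
            best = 0.0
            for intervals in actions.values():
                probs = maximising_distribution(intervals, values)
                best = max(best, sum(p * values[t] for t, p in probs.items()))
            updated[state] = best
        values = updated
    return values


# Hypothetical IMDP: from "a" the agent may stay, move to the safe sink "b", or fail.
toy_imdp = {
    "a": {0: {"a": (0.25, 0.5), "b": (0.25, 0.5), "fail": (0.25, 0.5)}},
    "b": {0: {"b": (1.0, 1.0)}},
    "fail": {0: {"fail": (1.0, 1.0)}},
}
for k in (1, 2):
    print(k, max_unsafe_probability(toy_imdp, {"fail"}, k)["a"])
```

Each iteration is cheap compared with building the abstraction, which is consistent with the observation that most of the runtime goes into abstraction construction.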
As can be seen, the various configurations result in different safety probability bounds and runtimes for the same environments, so we are primarily interested in the impact that these choices have on the tradeoff between abstraction precision and performance. We summarise findings for each benchmark separately.
Bouncing ball. These are the quickest abstractions to construct and verify due to the low number of variables and the simplicity of the dynamics. For both initial regions considered, we can actually verify that the system is fully safe (maximum probability 0). However, for the larger one, rectangles (particularly with containment checks) are not accurate enough to show this.
Two main areas of the policy are identified for refinement: one where it can reach the ball and should hit it, and one where the ball is out of reach and the paddle should not be activated, to preserve energy. But even for a maximum probability spread threshold of 0.1 (lower than used for other benchmarks), rectangular abstractions resulted in large abstract states containing most of the other states visited by the agent, and which ultimately overlapped with the unsafe region.
Adaptive cruise control. On this benchmark, we use a wider range of configurations. Firstly, as expected, for smaller values of the maximum probability spread threshold, the probability bound obtained is lower (the overestimation error from the abstraction decreases, making it closer to the true maximum probability) but the abstraction size and runtime increase. Applying the containment check for previously visited states involves a similar tradeoff: it helps reduce the computation time, but at the expense of overapproximation (higher bounds).
The choice of abstract domain also has a significant impact. Octagons yield more precise results than rectangles, for the same threshold values, and also produce smaller abstractions (and therefore lower runtime). On the other hand, general template polyhedra (chosen to better approximate the decision boundary) do not appear to provide an improvement in time or precision on this example, instead causing higher probability bounds, especially when combined with the containment check. Our hypothesis is that this abstract domain groups large areas of the state space (as shown in Fig. 2) and this eventually leads to overlaps with the unsafe region.
Inverted pendulum. This benchmark is more challenging and, while we successfully generate bounds on the probability of unsafe behaviour, for smaller threshold values and other abstract domains, experiments timed out due to the high number of abstract states generated and the time needed for MILP solution. The abstract states generated were sufficiently small that the containment check could be used to reduce runtime without increasing the probability bound.
6 Conclusion
We presented an approach for verifying probabilistic policies for deep reinforcement learning agents. This is based on a formal model of their execution as a continuous-space discrete-time Markov process, and a novel abstraction represented as an interval MDP. We propose techniques to implement this framework with MILP and a sampling-based refinement method using cross-entropy minimisation. Experiments on several RL benchmarks illustrate its effectiveness and show how we can tune the approach to trade off accuracy and performance.
Future work includes automating the selection of an appropriate template for abstraction and using lower bounds from the abstraction to improve refinement.
References
[1] NetworkX: network analysis in Python. https://networkx.github.io/ Accessed: 2020-05-07. Cited by: §5.1.
[2] PyTorch. https://pytorch.org/ Accessed: 2020-05-07. Cited by: §5.1.
[3] (1976) Denumerable Markov chains. 2nd edition, Springer-Verlag. Cited by: §2.
[4] (2006) Don’t know in probabilistic systems. In Proc. SPIN’06, A. Valmari (Ed.), LNCS, Vol. 3925, pp. 71–88. Cited by: §1.
[5] (2009) Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In Proc. Australasian Conference on Artificial Intelligence, LNCS, Vol. 5866, pp. 340–349. Cited by: §1.
[6] (2010) A game-based abstraction-refinement framework for Markov decision processes. Formal Methods in System Design 36 (3), pp. 246–280. Cited by: §1.
[7] (2011) PRISM 4.0: verification of probabilistic real-time systems. In Proc. 23rd International Conference on Computer Aided Verification (CAV’11), LNCS, Vol. 6806, pp. 585–591. Cited by: §1, §5.1.
[8] (2012) Robust control of uncertain Markov decision processes with temporal logic specifications. In Proc. 51st IEEE Conference on Decision and Control (CDC’12), pp. 3372–3379. Cited by: §2, §5.1.
[9] (2015) Formal verification and synthesis for discrete-time stochastic systems. IEEE Transactions on Automatic Control 60 (8), pp. 2031–2045. Cited by: §1.
[10] (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proc. 2017 IEEE International Conference on Robotics and Automation (ICRA’17), pp. 3389–3396. Cited by: §1.
[11] (2018) A unified view of piecewise linear neural network verification. In Proc. 32nd International Conference on Neural Information Processing Systems (NIPS’18), pp. 4795–4804. Cited by: §4.2.
[12] (2018) Approximate abstractions of Markov chains with interval decision processes. In Proc. 6th IFAC Conference on Analysis and Design of Hybrid Systems. Cited by: §1.
[13] (2018) Probabilistic policy reuse for safe reinforcement learning. ACM Transactions on Autonomous and Adaptive Systems 13 (3), pp. 1–24. Cited by: §1.
[14] (2020) Probabilistic guarantees for safe deep reinforcement learning. In Proc. 18th International Conference on Formal Modelling and Analysis of Timed Systems (FORMATS’20), LNCS, Vol. 12288, pp. 231–248. Cited by: §1, §1, §1, §3.1, §3.2.
[15] (2021) Reinforcement learning in healthcare: a survey. ACM Computing Surveys 55 (1), pp. 1–36. Cited by: §1.
[16] (2021) Task-aware verifiable RNN-based policies for partially observable Markov decision processes. Journal of Artificial Intelligence Research 72, pp. 819–847. Cited by: §1.
[17] (2021) Verifiably safe exploration for end-to-end reinforcement learning. In Proc. 24th International Conference on Hybrid Systems: Computation and Control (HSCC’21). Cited by: §1.
[18] (2021) Verifying reinforcement learning up to infinity. In Proc. 30th International Joint Conference on Artificial Intelligence (IJCAI’21), pp. 2154–2160. Cited by: §1, §1, §4, §5.1.
[19] (2022) Distillation of RL policies with formal guarantees via variational abstraction of Markov decision processes. In Proc. 36th AAAI Conference on Artificial Intelligence (AAAI’22). Cited by: §1.
[20] (2018) Safe reinforcement learning via shielding. In Proc. 32nd AAAI Conference on Artificial Intelligence (AAAI’18), pp. 2669–2678. External Links: 1708.08611, ISBN 9781577358008. Cited by: §1, §1.
[21] (2018) Verifiable reinforcement learning via policy extraction. In Proc. 2018 Annual Conference on Neural Information Processing Systems (NeurIPS’18), pp. 2499–2509. Cited by: §1, §1.
[22] (2021) Safe Reinforcement Learning with Nonlinear Dynamics via Model Predictive Shielding. In Proceedings of the American Control Conference, pp. 3488–3494. Cited by: §1.
[23] (2017) Counterexample-guided refinement of template polyhedra. In TACAS (1), pp. 589–606. Cited by: §1.
[24] (2016-06) OpenAI Gym. External Links: 1606.01540. Cited by: §5.1.
[25] (2019) Efficiency through uncertainty: scalable formal synthesis for stochastic hybrid systems. In 22nd ACM International Conference on Hybrid Systems: Computation and Control. Cited by: §1.
[26] (2017) Maximum resilience of artificial neural networks. In Proc. International Symposium on Automated Technology for Verification and Analysis (ATVA’17), LNCS, Vol. 10482, pp. 251–268. External Links: 1705.01040, ISBN 9783319681665, ISSN 1611-3349. Cited by: §4.2.
[27] (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In AAAI, pp. 3387–3395. Cited by: §1.
[28] (2018) Space-time interpolants. In CAV (1), pp. 468–486. Cited by: §1.
[29] (2018) Safe reinforcement learning via formal methods: toward safe control through proof and learning. In AAAI, pp. 6485–6492. Cited by: §1, §1.
[30] (2021) Gurobi Optimizer Reference Manual. Cited by: §5.1.
[31] (2019) Logically-constrained neural fitted Q-iteration. In AAMAS, pp. 2012–2014. Cited by: §1.
[32] (2020) Cautious reinforcement learning with logical constraints. In AAMAS, pp. 483–491. Cited by: §1.
[33] (2019) Teaching Stratego to play ball: optimal synthesis for continuous space MDPs. In ATVA, pp. 81–97. Cited by: §5.1.
[34] (2020) Safe reinforcement learning using probabilistic shields. In Proc. 31st International Conference on Concurrency Theory (CONCUR’20), Vol. 171, pp. 3:1–3:16. External Links: 1807.06096, ISBN 9783959771603, ISSN 1868-8969. Cited by: §1.
[35] (2021-06) Learning on Abstract Domains: A New Approach for Verifiable Guarantee in Reinforcement Learning. External Links: 2106.06931. Cited by: §1.
[36] (2019) Verifying deep-RL-driven systems. In Proceedings of the 2019 Workshop on Network Meets AI & ML, NetAI@SIGCOMM’19, pp. 83–89. Cited by: §1, §1.
[37] (2019) Learning to drive in a day. In ICRA, pp. 8248–8254. Cited by: §1.
[38] (2020-10) Shield Synthesis for Reinforcement Learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 12476 LNCS, pp. 290–306. External Links: ISBN 9783030613617, ISSN 1611-3349. Cited by: §1.
[39] (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems 20 (1), pp. 96–1. Cited by: §1.
[40] (2018, 10–15 Jul) RLlib: abstractions for distributed reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3053–3062. Cited by: §5.1.
[41] (2021) Feasible Actor-Critic: Constrained Reinforcement Learning for Ensuring Statewise Safety. External Links: 2105.10682. Cited by: §1.
[42] (2016) Asynchronous methods for deep reinforcement learning. In Proc. 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Vol. 48, pp. 1928–1937. Cited by: §1.
[43] (2004) An introduction to game theory. Vol. 3, Oxford University Press, New York. Cited by: §1.
[44] (2021) Agent modelling under partial observability for deep reinforcement learning. In Proceedings of the Neural Information Processing Systems (NeurIPS). Cited by: §1.
[45] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §5.1.
[46] (2005) Scalable analysis of linear systems using mathematical programming. In VMCAI, pp. 25–41. Cited by: §1, §4.1.
[47] (2017) Proximal policy optimization algorithms. arXiv:1707.06347. Cited by: §5.1.
[48] (1984) Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research 32 (6), pp. 1296–1308. External Links: ISSN 0030-364X, 1526-5463. Cited by: §4.3.
[49] (2020) Learning to be Safe: Deep RL with a Safety Critic. External Links: 2010.14603. Cited by: §1.
[50] (2017) Evaluating robustness of neural networks with mixed integer programming. Cited by: §4.2, §4.2.
[51] (2019-06) An inductive synthesis framework for verifiable reinforcement learning. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 686–701. External Links: 1907.07273, ISBN 9781450367127. Cited by: §1.
Appendix
Appendix 0.A Proof of Theorem 3.1
Given a state of an RL execution model DTMP, and abstract state of the corresponding controller abstraction IMDP for which , we have:
By the definition of , it suffices to show that there is some policy and some environment policy in the IMDP such that:
(2) 
Recall that, in the construction of the IMDP (see Definition 7), an abstract state is associated with a partition of subsets of , each of which is used to define the labelled choice in state . Let be the policy that picks in each state (regardless of history) the unique index such that . Then, let be the environment policy that selects the upper bound of the interval for every transition probability. We use function to denote the chosen probabilities, i.e., we have for any .
The probabilities for these policies, starting in , are defined similarly to those for discrete-time Markov processes (see Section 2):
Since this is defined recursively, we prove (2) by induction over . For the case , the definitions of and are equivalent: they equal 1 if (or ) and 0 otherwise. From Definition 7, implies . Therefore, .
Next, for the inductive step, we will assume, as the inductive hypothesis, that for and with . If then . Otherwise we have:
which completes the proof.
Appendix 0.B Further Experimental Results
We provide some additional experimental results omitted from the main paper.
Figure 3 shows the impact of refinement settings (abstract domain and maximum probability spread threshold) on the abstraction process applied to a toy example: a Gaussian distribution, with abstracted results generated by sampling. Figure 3(a) shows the true distribution; Figures 3(b)–(d) show several refinements, each annotated with the number of polyhedra generated. By increasing the complexity of the template, we reduce the number of abstract states since we have more directions across which we can choose how to split. By reducing the maximum probability spread threshold, the number of abstract states increases exponentially but each abstract state gives a better representation of the true distribution.
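The effect of the threshold on the number of abstract states can be reproduced in one dimension with the sketch below. It is a simplification under stated assumptions: the spread of a cell is taken to be the difference between the largest and smallest sampled value of the function in that cell, cells are intervals split by bisection rather than along template directions, and sampling is on a regular grid. The function names and the Gaussian-shaped test function are hypothetical.

```python
import math


def sampled_spread(p, lo, hi, samples=50):
    """Difference between the largest and smallest value of p on a grid over [lo, hi]."""
    values = [p(lo + (hi - lo) * i / (samples - 1)) for i in range(samples)]
    return max(values) - min(values)


def refine(p, lo, hi, threshold, samples=50):
    """Bisect [lo, hi] until the sampled spread in every cell is within the threshold."""
    if sampled_spread(p, lo, hi, samples) <= threshold:
        return [(lo, hi)]
    mid = 0.5 * (lo + hi)
    return refine(p, lo, mid, threshold, samples) + refine(p, mid, hi, threshold, samples)


def gaussian_bump(x):
    """Unnormalised Gaussian used as a stand-in for the true distribution."""
    return math.exp(-x * x)


for threshold in (0.5, 0.25, 0.1):
    cells = refine(gaussian_bump, -3.0, 3.0, threshold)
    print(threshold, len(cells), "cells")
```

Lower thresholds yield more, smaller cells, concentrated where the function changes fastest.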
Figure 4 shows the abstraction process applied to another example (a state space fragment from the inverted pendulum benchmark) using both rectangles and octagons. It illustrates the probability of choosing one of three actions, coded by RGB colour: noop (red), right (green) and left (blue). The X axis represents angular speed and the Y axis represents the angle of the pendulum in radians. Notice the grey area towards the centre where all 3 actions have the same probability, the centre right area with yellow tints (red and green), and the centre left area with purple tints (red and blue). Towards the bottom of the heatmap, the colour fades to green as the agent tries to push the pendulum to cause it to spin and balance once it reaches the opposite side.