Learning on Abstract Domains: A New Approach for Verifiable Guarantee in Reinforcement Learning

Formally verifying Deep Reinforcement Learning (DRL) systems is a challenging task due to the dynamic continuity of system behaviors and the black-box feature of embedded neural networks. In this paper, we propose a novel abstraction-based approach to train DRL systems on finite abstract domains instead of concrete system states. It yields neural networks whose input states are finite, making hosting DRL systems directly verifiable using model checking techniques. Our approach is orthogonal to existing DRL algorithms and off-the-shelf model checkers. We implement a resulting prototype training and verification framework and conduct extensive experiments on the state-of-the-art benchmark. The results show that the systems trained in our approach can be verified more efficiently while they retain comparable performance against those that are trained without abstraction.




1 Introduction

Despite the unparalleled potential that Deep Reinforcement Learning (DRL) techniques have shown in many control fields DBLP:journals/nature/MnihKSRVBGRFOPB15 ; DBLP:journals/corr/abs-1810-03259 ; DBLP:journals/ral/LambertDYLCP19 , real-world DRL applications remain quite limited in safety-critical domains because they need certificates of their reliability. A typical example is fully autonomous driving, which is still argued to be a long way off due to safety concerns gomes2016will . Verifiable guarantees on safety and reliability are both desirable and necessary for those DRL systems hasanbeig2020towards . Unfortunately, formally verifying DRL systems is a challenging task due to the dynamic continuity of system behaviors and the black-box nature of the AI models (neural networks) embedded in the systems. The dynamic continuity results in an uncountably infinite state space DBLP:conf/pldi/ZhuXMJ19 , while the black-box nature makes neural networks hard to explain DBLP:journals/csr/HuangKRSSTWY20 .

Instead of directly verifying DRL systems, most of the existing approaches rely on transforming them into verifiable models. Representative works include extracting decision trees bastani2018verifiable and programmatic policies DBLP:conf/icml/VermaMSKC18 , synthesizing deterministic programs DBLP:conf/pldi/ZhuXMJ19 and linear controllers xiong2021scalable , and transforming into hybrid systems DBLP:conf/hybrid/IvanovWAPL19 and star sets tran2019safety . Although these transformation-based approaches are effective, they have limitations: extracted policies may not equivalently represent the source neural networks, and the properties that can be verified may be limited. For instance, hybrid system and star set models support only safety properties, which are checked by reachability analysis. Thus, it is desirable that a trained DRL system can be directly and efficiently verified without transformation.

In this paper, we propose a novel training approach for DRL that learns on finite abstract domains, unlike traditional approaches, which learn directly on concrete system states. Specifically, we discretize continuous states into finite abstract states, on which we train a DRL system. Because the set of abstract states is finite, the neural network trained on them is essentially a finite function that maps abstract states to actions. Because the trained neural network adopts the same action for all the concrete states of the same abstract state, we can leverage the abstract interpretation technique cousot1977abstract to model the DRL system as a finite-state transition system, which can be efficiently model checked.

Our training approach has two main features that distinguish it from existing DRL approaches. Firstly, a DRL system trained in our approach is directly verifiable, and thus it avoids the shortcomings of transformation-based approaches. Learning on abstract domains makes it possible to model a DRL system as a finite-state system by abstracting continuous concrete states into corresponding abstract domains. The subsequent verification becomes straightforward, as off-the-shelf model-checking tools such as Spot duret2016spot can be used to verify various properties efficiently. Secondly, our approach is orthogonal to existing DRL algorithms and can be naturally implemented by extending them. We have implemented a prototype training and verification framework, and performed extensive experiments on four classic continuous control tasks. The experimental results demonstrate that the systems trained in our approach have performance comparable to those trained by existing DRL algorithms. Meanwhile, they can be formally verified against desired properties.

In summary, this paper makes the following two major contributions:

  1. A novel abstraction-based DRL approach to train continuous control systems on abstract domains such that the trained systems are amenable to formal verification while retaining comparable performance to those trained on concrete states.

  2. A subsequent abstraction-based verification approach and a resulting prototype tool for model checking the trained DRL systems, coupled with a benchmark of four verified DRL systems for corresponding classic control problems.

2 DRL and its Formal Verification

DRL is usually modeled as a Markov Decision Process (MDP) feinberg2012handbook , which is a 4-tuple $(S, A, T, R)$, where $S$ is a set of states called the state space, $A$ is a set of actions called the action space, $T(s, a, s')$ is the probability of the transition from $s$ to $s'$ based on action $a$, and $R(s, a, s')$ is the reward received by the controller after the given transition from $s$ to $s'$. Since the system dynamics of safety-critical systems are generally known and deterministic bastani2018verifiable ; DBLP:conf/hybrid/IvanovWAPL19 ; DBLP:conf/pldi/ZhuXMJ19 , the effect of an action on a state results in only one successor state. Thus, we write $s \xrightarrow{a} s'$ to indicate that there is a transition from $s$ to $s'$ due to action $a$.

DRL aims to train a DNN-based controller to learn a deterministic policy that specifies a unique action adopted in a state to achieve specific goals. A trained DRL system can be represented as a tuple with being a set of initial states of the system. Let be the set of all the reachable states of . We have , and for two states , if and then there is .

Figure 1: State transition of Inverted Pendulum DBLP:conf/pldi/ZhuXMJ19

The formal verification of a DRL system is to check whether satisfies some desired properties that are formalized as logical formulas in some logic such as Linear Temporal Logic (LTL) pnueli1977temporal . satisfies , denoted as , if and only if all the paths of satisfy . There are two key factors that make it intractable to directly verify . One is that the number of paths of is infinite when contains infinitely many states. The other is that the set of successor states is difficult to compute and represent due to the non-linearity of the system dynamics. Figure 1 shows an example of computing the successor state of the state using the rate of change , where time is discretized into time intervals and the transition from time to is approximated by the equation DBLP:conf/pldi/ZhuXMJ19 . Further, building the state transition system of a DRL system requires querying the DNN for the control action in every transition, which drastically reduces the efficiency of verification.
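One common reading of the time-discretized transition described above is a forward-Euler update, in which the successor state is the current state plus its rate of change multiplied by the time interval. The sketch below illustrates that reading only; the names `euler_step` and `toy_dynamics`, the step size, and the two-dimensional dynamics are assumptions made for the example and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]


def euler_step(state: State,
               action: float,
               rate_of_change: Callable[[State, float], Sequence[float]],
               dt: float) -> State:
    """Approximate the successor state as state + (d state / dt) * dt."""
    derivative = rate_of_change(state, action)
    return tuple(x + dx * dt for x, dx in zip(state, derivative))


def toy_dynamics(state: State, action: float) -> Sequence[float]:
    """Hypothetical dynamics: position changes with velocity,
    velocity changes with the applied action."""
    _position, velocity = state
    return (velocity, action)


successor = euler_step((0.0, 1.0), action=2.0,
                       rate_of_change=toy_dynamics, dt=0.5)
print(successor)
```

Because the controller must be queried for `action` at every such step, the cost of building a transition system over concrete states grows with the number of steps explored.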

Perturbation is another factor that makes the verification of DRL systems more difficult. A trained controller may face perturbations that arise in the real world or are caused by modeling errors and differences between training and test scenarios DBLP:conf/icml/TesslerEM19 ; zhang2020robust . Ensuring the robustness of a DRL system is necessary, so perturbations must be taken into account during verification. Perturbations may cause nondeterministic transitions between states because the actual successor state may deviate from the expected state lutjens2020certified . We use the perturbation vector to describe the offset range. Then the target system for verification can be modeled as a tuple , where , , and are the same as previously defined. Given the perturbation vector , denotes all reachable states after applying to the states . Specifically, for the expected transition from to , actual reachable states , where is the dimension of the state. Apparently, perturbation to concrete states may lead to state space explosion.
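As a rough illustration of how a perturbation vector turns one expected successor into a set of reachable states, the sketch below widens each component of the expected successor by the corresponding offset. The symmetric per-dimension widening and the names `perturb`, `successor`, and `epsilon` are assumptions made for this example; the exact definition used by the authors is not recoverable from the text here.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def perturb(successor: Tuple[float, ...],
            epsilon: Tuple[float, ...]) -> List[Interval]:
    """Return, per dimension, the interval of states reachable when the
    expected successor may deviate by at most epsilon[i] in dimension i."""
    if len(successor) != len(epsilon):
        raise ValueError("successor and epsilon must have the same dimension")
    return [(x - e, x + e) for x, e in zip(successor, epsilon)]


box = perturb((1.0, -0.5), (0.25, 0.5))
print(box)
```

A single concrete transition thus becomes a box of possible successors, which is why exploring perturbed concrete states directly does not scale.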

Figure 2: Abstraction-based deep reinforcement learning

3 Abstraction-Based Reinforcement Learning

Figure 2 shows the framework with an illustrative example. The state of the mountain car is a pair of position and velocity. We suppose a region where the position is in and the velocity is in . Before a concrete state, e.g., , is fed to the neural network, we transform it into the representation of its corresponding region, i.e., the interval vector , as the actual input. The neural network produces an action based on its current setting and the input. The action takes effect on the concrete state to drive the system under training to proceed.

The essential difference of our framework from classic DRL approaches is that the states fed into neural networks are abstract states. An abstract state corresponds to an infinite set of concrete states, and is represented as a vector of intervals in our framework. Thus, we call our learning approach abstraction-based reinforcement learning.

3.1 State Discretization and Abstraction

Our abstraction mechanism is based on the assumption that a trained controller usually adopts the same action for those concrete states that are adjacent DBLP:conf/formats/Bacci020 . We consider a concrete state to be a vector of () real numbers. The distance between two states can be measured by norms.

Definition 1 (Adjacent states).

Two states are called adjacent with respect to an -norm distance , denoted by , if and only if .

Given a state and an -norm distance , the set of all the adjacent states of is essentially an -norm ball .

Let $l_i$ and $u_i$ be the lower and upper bounds for the $i$-th dimension element in $S$. Then the state space of the control system is $[l_1, u_1] \times \cdots \times [l_n, u_n]$. The basic idea of state discretization and abstraction is to classify all adjacent states into a set, and represent the set as an abstract domain such as Polyhedra, Octagon, and Interval singh2017practical .

In our abstraction approach, we choose Interval as the abstract domain for its simplicity and efficiency. Specifically, we divide the interval of each dimension into a finite set of unit intervals. For each dimension, let () be the diameter of each unit interval, represent the vector of diameters for the dimensions. We call the abstraction granularity of , and use to represent the set of all the divided unit intervals. Then, we obtain an abstract-state space , where an abstract state is essentially a vector of unit intervals . Apparently, is finite. A concrete state belongs to the abstract state , denoted by , if and only if for each .

Definition 2 (Interval-based abstraction).

Given a state space and an abstraction granularity , a state is abstracted to be an interval vector where for each with .
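The interval-based abstraction can be sketched as a small function that snaps each component of a concrete state to the unit interval containing it. This is a minimal illustration: the name `abstraction_mapping`, the half-open cell convention, and the numeric bounds are assumptions for the example, not values from the paper.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def abstraction_mapping(state: Sequence[float],
                        lower: Sequence[float],
                        granularity: Sequence[float]) -> List[Interval]:
    """Map a concrete state to the vector of unit intervals that contains it.

    Cells are treated as half-open [low, high); lower[i] is the lower bound
    of dimension i and granularity[i] is the diameter of its unit intervals.
    """
    cell = []
    for x, lo, d in zip(state, lower, granularity):
        k = math.floor((x - lo) / d)          # index of the unit interval
        cell.append((lo + k * d, lo + (k + 1) * d))
    return cell


def to_network_input(cell: Sequence[Interval]) -> List[float]:
    """Flatten an interval vector into 2n numbers (low and high per dimension)."""
    return [bound for interval in cell for bound in interval]


cell = abstraction_mapping((0.3, 0.6), lower=(-1.0, 0.0), granularity=(0.5, 0.25))
print(cell, to_network_input(cell))
```

Every concrete state inside the same cell yields the same flattened vector, so the network cannot distinguish between them.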

3.2 Learning on Abstract States

The abstraction-based reinforcement learning approach is orthogonal to most of the state-of-the-art DRL algorithms and can be smoothly implemented atop them. We only need to insert an abstract transformer between the control system and the neural network to transform concrete states into abstract ones before feeding them to the neural network.

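for episode = 1, $M$ do
      Initialize $s_1$ after resetting the Environment
      $\hat{s}_1$ = abstractionMapping($s_1$, granularity)
      for t = 1, $T$ do
            Take action $a_t$ based on $Q(\hat{s}_t)$ ($\epsilon$-greedy)
            Execute $a_t$, then observe $s_{t+1}$ and $r_t$
            $\hat{s}_{t+1}$ = abstractionMapping($s_{t+1}$, granularity)
            Store $(\hat{s}_t, a_t, r_t, \hat{s}_{t+1})$ in Buffer
            Sample batch from Buffer
            Update parameters based on Loss Function
Algorithm 1 Abstraction-Based Deep Q-Learning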

We consider incorporating the operation to extend Deep Q-Learning (DQL) DBLP:journals/corr/MnihKSGAWR13 as an illustrative example. Algorithm 1 depicts the main workflow, where abstractionMapping is an abstraction function that maps concrete states to their corresponding abstract states and is the abstraction granularity. In our abstraction approach, given a concrete state , we first compute the unit interval according to the preset abstraction granularity such that with . Then the interval vector is fed into neural network. It is worth mentioning that we need to double the input dimension of the neural network in order to accept the interval vector. We omit explanations of other steps as they are well-established in DQL.
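One way to insert the abstract transformer between the control system and the learner without touching the learning algorithm is an observation wrapper. The sketch below assumes a hypothetical environment object exposing `reset()` and `step(action)` returning `(state, reward, done)`; the class names, the toy environment, and the numeric values are all illustrative and are not part of TF2RL or of the authors' code.

```python
import math
from typing import List, Sequence, Tuple


class AbstractObservationWrapper:
    """Wraps an environment so the agent only ever sees abstract states."""

    def __init__(self, env, lower: Sequence[float], granularity: Sequence[float]):
        self.env = env
        self.lower = tuple(lower)
        self.granularity = tuple(granularity)

    def _abstract(self, state: Sequence[float]) -> List[float]:
        # Each dimension contributes two inputs (low, high), so the
        # network input dimension is twice the state dimension.
        flat = []
        for x, lo, d in zip(state, self.lower, self.granularity):
            k = math.floor((x - lo) / d)
            flat.extend((lo + k * d, lo + (k + 1) * d))
        return flat

    def reset(self) -> List[float]:
        return self._abstract(self.env.reset())

    def step(self, action: float) -> Tuple[List[float], float, bool]:
        # The action still takes effect on the concrete state inside env.
        state, reward, done = self.env.step(action)
        return self._abstract(state), reward, done


class ToyEnv:
    """Hypothetical 1-D environment used only to exercise the wrapper."""

    def __init__(self):
        self.x = 0.0

    def reset(self):
        self.x = 0.375
        return (self.x,)

    def step(self, action):
        self.x += action
        return (self.x,), -abs(self.x), self.x >= 1.0


env = AbstractObservationWrapper(ToyEnv(), lower=(0.0,), granularity=(0.25,))
first = env.reset()
second, reward, done = env.step(0.5)
print(first, second, reward, done)
```

With this design the replay buffer stores abstract states automatically, because the learner never receives anything else.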

We also applied the abstraction technique to the Deep Deterministic Policy Gradient (DDPG) DBLP:journals/corr/LillicrapHPHETS15 and Proximal Policy Optimization (PPO) DBLP:journals/corr/SchulmanWDRK17 algorithms, and then conducted experiments using the extended learning algorithms, built on the open-source DRL library TF2RL, where various DRL algorithms are implemented using TensorFlow 2.x.

Abstraction plays a crucial role in our framework. Its granularity determines the performance of a trained network and the verification difficulty of the hosting system. The finer the abstraction is, the better the trained network performs, but the more costly the verification becomes due to state space explosion. This assertion is confirmed by the experimental results in Section 5.2. Therefore, it is important to determine an appropriate abstraction granularity to reach a trade-off between the performance and verification cost. We set the abstraction granularity as a hyperparameter in training algorithms, which means that it is adjusted according to the corresponding training performance.
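The verification side of this trade-off can be made concrete by counting abstract states: with interval abstraction, the count is the product over dimensions of the number of unit intervals. The sketch below is only an illustration; `abstract_state_count` and the bounds and granularities used are assumed values, not the settings from the experiments.

```python
import math
from typing import Sequence


def abstract_state_count(lower: Sequence[float],
                         upper: Sequence[float],
                         granularity: Sequence[float]) -> int:
    """Number of abstract states: product of unit intervals per dimension."""
    count = 1
    for lo, hi, d in zip(lower, upper, granularity):
        count *= math.ceil((hi - lo) / d)
    return count


# Hypothetical 2-D state space [0, 2] x [0, 1].
coarse = abstract_state_count((0.0, 0.0), (2.0, 1.0), (0.5, 0.5))
fine = abstract_state_count((0.0, 0.0), (2.0, 1.0), (0.25, 0.125))
print(coarse, fine)
```

Halving the diameter in every one of n dimensions multiplies the number of abstract states by 2 to the power n, which is the source of the state space explosion mentioned above.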

4 Abstraction-Based Formal Verification

In this section, we propose an abstraction-based verification approach to model check the DRL systems trained on abstract domains. The basic idea of our approach is based on the Abstract Interpretation technique cousot1977abstract , which builds transition systems on finite abstract-state spaces by transforming concrete states into abstract ones for the purpose of model checking. Because the abstract state space is finite, its verification can be achieved by classic model-checking techniques DBLP:journals/fac/Konnov19 .

4.1 Building Abstract-State Transition System

We abstract a continuous state space into a finite abstract-state space in the same way as we do in the training phase, and then build an abstract-state transition system by establishing the transition relations among abstract states according to the actions produced by the trained neural network.

As mentioned in Section 2, a trained DRL system can be modeled as when perturbation is considered. Here, is a neural network that can be modeled as a black-box function . Let be a set of abstract states such that if and only if there exists a state such that .

Next we define the relation between abstract states. Figure 3 depicts the abstract transformer for abstract states. Given an abstract state , we can obtain a unique action by feeding it to the trained network. After applying to , we calculate the interval vector to cover the irregular state space generated by . If perturbation is considered, can be smoothly expanded to to include extra reachable states. Then we use a set of abstract states to over-approximate . Let be the -th interval in the vector ; then must be a sub-interval of either a unit interval or the concatenation of multiple unit intervals of . Without loss of generality, we assume at least () unit interval(s) must be concatenated to cover . So we need abstract states, whose union is the least over-approximation of the resulting vector. There is a transition relation from to each abstract state in the union, e.g., in the figure.

Figure 3: Transitions between abstract states

Let us consider an example of the mountain car shown in Figure 2. We assume that the current abstract state is . The trained DNN takes the same action on all the concrete states that are represented by the abstract state. We assume that action is to accelerate the car to the right side. We calculate the maximal and minimal values on 2 dimensions based on the system dynamics. Then we construct an interval vector with them to represent all states transited from those in the preceding abstract state. We assume the vector is . It can be over-approximated by four abstract states, i.e., , , and .
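The covering step in this example can be sketched as follows: for each dimension, find the unit intervals that overlap the computed interval, then take the Cartesian product. The function name `covering_abstract_states`, the half-open cell convention, and the numbers below are assumptions for illustration; they are not the values from the figure.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def covering_abstract_states(box: Sequence[Interval],
                             lower: Sequence[float],
                             granularity: Sequence[float]
                             ) -> List[Tuple[Interval, ...]]:
    """Return the abstract states whose union covers the interval vector `box`.

    Cells are half-open [low, high), so a box bound that lies exactly on a
    cell boundary is assigned to the cell that starts there.
    """
    per_dimension = []
    for (box_lo, box_hi), lo, d in zip(box, lower, granularity):
        first = math.floor((box_lo - lo) / d)
        last = math.floor((box_hi - lo) / d)
        per_dimension.append(
            [(lo + k * d, lo + (k + 1) * d) for k in range(first, last + 1)]
        )
    return list(itertools.product(*per_dimension))


# A box straddling one cell boundary in each of two dimensions.
cells = covering_abstract_states(
    box=[(0.3, 0.7), (0.1, 0.3)], lower=(0.0, 0.0), granularity=(0.5, 0.25)
)
print(len(cells), cells)
```

A box that straddles one boundary in each of two dimensions needs four abstract states, which matches the count in the example above.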

4.2 Model Checking of LTL Properties

Input: Initial abstract state , LTL formula , threshold , transition function , Neural Network , perturbation
Output: True, False
1 = spot.formula.Not()
2 = spot.translate()
3 = spot.make_kripke_graph()
6 while  is not empty and  do
7       Fetch from
8       for  do
11       for  do
12             if  is not traversed then
13                   add_edge(, )
14                   Push into
return spot.intersecting_run(, )
Algorithm 2 LTL Model Checking

Since we can construct the explicit finite-state transition system, the verification work can be delivered to existing model-checking tools. This observation indicates that the abstraction is decoupled from the subsequent verification procedure, which means that our approach can benefit from any future improvement in model-checking techniques.

In practice, we leverage Spot duret2016spot to complete the verification work. Algorithm 2 describes the implementation details of our verification framework, where Input lists the settings that users need to provide and functions that start with "spot." can be called directly from Spot.

is the automaton that corresponds to the negation of the LTL formula (Lines 1-2); we refer readers to DBLP:journals/fac/Konnov19 for more details of LTL verification. We traverse the abstract states via breadth-first search to build the explicit transition system , where successor abstract states are computed in the way explained in Section 4.1. Function (Line 9) takes an interval vector and the corresponding action returned by the DNN, and returns the irregular state space, which we do not compute explicitly. Instead, we directly obtain the set of abstract states after applying functions and (mentioned in Section 4.3). Besides, threshold forces the verification to terminate when the model checker cannot verify all reachable abstract states.
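The breadth-first construction with a threshold can be sketched independently of any model checker. In the sketch below, `successors` stands in for the composition of the trained network, the system dynamics, and the covering step; it and every other name here are hypothetical, and handing the resulting graph to Spot is deliberately left out.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def build_transition_system(initial: Hashable,
                            successors: Callable[[Hashable], Iterable[Hashable]],
                            threshold: int
                            ) -> Tuple[Dict[Hashable, List[Hashable]], bool]:
    """Explore abstract states breadth-first, expanding at most `threshold` states.

    Returns the explicit edge map and a flag telling whether every reachable
    state was expanded (True) or the threshold cut the search short (False).
    """
    edges: Dict[Hashable, List[Hashable]] = {}
    seen = {initial}
    queue = deque([initial])
    while queue and len(edges) < threshold:
        current = queue.popleft()
        edges[current] = []
        for nxt in successors(current):
            edges[current].append(nxt)
            if nxt not in seen:          # enqueue each state only once
                seen.add(nxt)
                queue.append(nxt)
    return edges, not queue


# Toy successor function over three abstract states arranged in a cycle.
def cycle(state: int) -> List[int]:
    return [(state + 1) % 3]


full_edges, full_complete = build_transition_system(0, cycle, threshold=10)
cut_edges, cut_complete = build_transition_system(0, cycle, threshold=2)
print(full_edges, full_complete, cut_edges, cut_complete)
```

When the flag is False, any property result only covers the explored part of the abstract state space.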

Then, we calculate the propositions satisfied by each abstract state in . Note that, to guarantee the soundness of the verification results, when judging whether an abstract state satisfies proposition , we consider that it satisfies only if all concrete states in it satisfy . Finally, we call the method in Spot to construct the transition diagram generated by and to obtain the verification result.
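For propositions that bound a single state component, the "all concrete states satisfy it" rule reduces to interval containment. The sketch below illustrates this for a range proposition; the function name `satisfies_range` and the example cell are assumptions for the illustration.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def satisfies_range(cell: Sequence[Interval],
                    dimension: int,
                    low: float,
                    high: float) -> bool:
    """True only if every concrete state in `cell` has its component
    `dimension` inside [low, high], i.e. the cell's interval is contained
    in the proposition's range."""
    cell_low, cell_high = cell[dimension]
    return low <= cell_low and cell_high <= high


cell = ((0.0, 0.5), (0.25, 0.5))
fully_inside = satisfies_range(cell, dimension=0, low=-1.0, high=0.5)
partly_inside = satisfies_range(cell, dimension=0, low=-1.0, high=0.4)
print(fully_inside, partly_inside)
```

The second call returns False even though some concrete states in the cell do satisfy the bound. This conservative labelling keeps the result sound, at the price of possible spurious violations.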

4.3 Soundness of the Abstraction Transformer

We prove that the abstraction transformer is sound in that it preserves propositions. Let be the set of interval vectors of . The abstract transformer is a function , which is a composition of and . Intuitively, denotes the vector of intervals after the action determined by the neural network is applied to , and returns the least set of abstract states whose union is an over approximation of an interval vector . Note that for the brevity of proof, we omit the expansion operation for perturbation without loss of validity.

Lemma 1.

Given a state in , let denote the successor state after an action is applied to and be the abstract state of . Then, .

Lemma 1 says that guarantees that after an action is applied to , it generates an interval vector that contains the successor state caused by applying the same action to .

Lemma 2.

Given a vector and a state such that , let be the abstract state of . Then, .

Figure 4: Soundness of abstract transformer

Lemma 2 guarantees over-approximation of interval vectors. That is, for each state that is contained in , the abstract state of , i.e., , must be in . The formal definitions of and and proofs are provided in the appendix as supplementary document.

Figure 4 graphically shows the soundness of the abstract transformer . It says that for any abstract state , the transitions from to its successor abstract states in cover all the transitions from the concrete states that abstracts to their successor states that are caused by the same action.

Theorem 1 (Soundness).

For each , holds for all the states that abstracts.


The proof is straightforward with Lemmas 1 and 2. Let be an arbitrary state that is abstracted by , i.e., . By Lemma 1, there is . Let and , Because , we have according to Lemma 2. Namely, . Because and , we conclude . ∎

5 Experimental Evaluation

We first study the impact of abstraction granularity on the performance of trained systems by training a system under different abstraction granularities and comparing their performance. Then, we demonstrate the effectiveness of our approach by showing that the systems trained in our approach have performance comparable to those trained with classical DRL algorithms. Finally, we verify the trained systems against their desired properties to show the efficiency of the verification.

5.1 Benchmark and Experimental Settings

We choose three classic control problems from Gym 1606.01540 , including Pendulum, Mountain Car and Cartpole, and another adapted control task 4-Car Platoon DBLP:conf/pldi/ZhuXMJ19 .

  • Pendulum: A pendulum can rotate around an endpoint. Starting from a random position, the pendulum is expected to swing up and stay upright. The expected property of Pendulum is that its angle must always be in the preset range.

  • Mountain Car: A car is positioned on a one-dimensional track between two mountains. After training, it is expected to drive up the right mountain by first driving up the left one to gain enough momentum. We need to guarantee that the car can finally reach the destination.

  • Cartpole: A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The controller aims to keep the angle of the pole and the displacement of the cart within fixed thresholds, and these bounds must be guaranteed to hold.

  • 4-Car Platoon: Four cars on the road are supposed to drive in a platoon, one behind another. Each car aims to drive close to the car in front so as to save fuel and reduce driving time. A straightforward safety requirement is that the four cars must never collide.

Experimental settings

All experiments are conducted on a workstation running Ubuntu 18.04 with a 32-core AMD Ryzen Threadripper CPU @ 3.7GHz and 128GB RAM.

5.2 Impact of Abstraction Granularity

Controller Fine Intermediate Coarse
Mountain Car
4-Car platoon
Table 1: Values of abstraction granularity

We trained the four systems in the abstraction-based approach. To evaluate the impact of abstraction granularity, we set three different abstraction granularity values for each system, and examine their performance. Table 1 shows the values for the four systems. We use to indicate that there are consecutive s in the vector for convenience. Smaller intervals imply finer abstraction granularity. We classify the abstraction granularity into three levels, i.e., fine, intermediate and coarse.

To compute a precise result, we train each controller for 10 rounds and record its performance at corresponding steps. The performance of a system is measured by the average reward value based on 5 episodes. To make the comparison clear, we omit confidence intervals and only show the trend of mean rewards in the figure.

Figure 5 shows the comparison results of the four trained systems. The performance of the Pendulum controllers is almost the same under the three abstraction granularities, which implies that even the coarse one is enough to train a system with good performance. However, the performance of Mountain Car and Cartpole varies with the abstraction granularity: finer granularity leads to better performance. Note that the performance trajectories of the 4-Car platoon fluctuate heavily because, according to the reward setting, the controller receives a large negative reward when the cars collide.

The experiments also show that it is important to choose an appropriate abstraction granularity to achieve a trade-off between the performance of trained systems and the size of their abstract state spaces. One way to determine is that we can start training a controller with the relatively coarse abstraction granularity, and refine it until the controller converges steadily to the optimal reward that is indicated by the controller trained with classic DRL algorithms.
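The coarse-to-fine procedure suggested above can be sketched as a simple loop. Everything in this sketch is an assumption made for illustration: the callback `train_and_evaluate`, the halving rule for refinement, the tolerance, and the stand-in evaluator; the paper does not specify how granularity is refined.

```python
from typing import Callable, Optional, Sequence, Tuple

Granularity = Tuple[float, ...]


def refine_granularity(initial: Sequence[float],
                       train_and_evaluate: Callable[[Granularity], float],
                       target_reward: float,
                       tolerance: float,
                       max_refinements: int = 5
                       ) -> Optional[Tuple[Granularity, float]]:
    """Start coarse and halve every diameter until the trained controller's
    reward is within `tolerance` of `target_reward` (the reward reached by a
    controller trained without abstraction). Returns None if no granularity
    tried was good enough."""
    granularity = tuple(initial)
    for _ in range(max_refinements + 1):
        reward = train_and_evaluate(granularity)
        if reward >= target_reward - tolerance:
            return granularity, reward
        granularity = tuple(d / 2 for d in granularity)  # assumed refinement rule
    return None


# Stand-in evaluator: pretends that finer granularity yields higher reward.
def fake_evaluate(granularity: Granularity) -> float:
    return -sum(granularity)


result = refine_granularity((1.0, 0.5), fake_evaluate,
                            target_reward=0.0, tolerance=0.5)
print(result)
```

Stopping at the coarsest granularity that meets the target keeps the abstract state space, and hence the verification cost, as small as the performance requirement allows.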

5.3 Performance Comparison with Classical DRL Algorithms

Figure 5: Performance comparison under different abstraction granularity
Controller Network Algorithm Granularity
Pendulum DDPG
Mountain Car DQL
CartPole DQL
4-Car platoon DDPG
Table 2: Basic settings of four systems for training

We compare the performance of our training approach with classical DRL algorithms. We train each control system using a classic DRL algorithm and its corresponding extension with our abstraction technique, respectively. The main training settings can be found in Table 2. For those remaining adjustable hyperparameters, we use the default values in the TF2RL framework ota2020tf2rl .

Figure 6: Performance comparison of the four systems with and without abstraction

Figure 6 depicts the trend of four controllers’ performance as the training proceeds under different training frameworks. The blue line indicates the mean rewards of a controller trained with the classic DRL algorithm. The light blue area shows the corresponding confidence intervals. The orange dashed line and area represent the performance of the controllers that are trained with the same DRL algorithms but extended with our abstraction approach. It can be observed that the trends of mean rewards are similar in all the four cases. Although in Mountain Car and 4-Car platoon, there is a performance gap during the training process, the controller trained with abstraction can achieve the optimal reward eventually. Thus, the controllers trained by the abstraction technique can retain comparable performance against those trained without abstraction.

5.4 Verification Analysis

In this section, we model check the four controllers that are trained in our abstraction-based approach and present the verification results. Table 3 shows the experimental data, where represents the perturbation vector, All indicates whether the abstract state space is completely verified, States means the number of traversed abstract states, Verified indicates whether the property is verified to be true or false (✗), and Time denotes the time cost in seconds. We preset a threshold on the number of traversed abstract states to force the verification to terminate.


Pendulum

One property of Pendulum is that the pole’s angle must always be in . We use a tuple to define a state of Pendulum, where and denote the pole’s angle and angular velocity, respectively. Then the property can be defined by the LTL formula , where $\mathbf{G}$ is the globally operator indicating that the proposition following $\mathbf{G}$ must hold in all the reachable states. We assume the initial state space of the controller is . The property can be verified under different perturbations in several seconds. The other property is that the angular velocity must eventually be greater than if the angle is less than , which is not verified. Due to the over-approximation of reachable concrete states, the set of successor abstract states of one abstract state on the violating path contains not only the valid abstract state but also one with angular velocity less than , which means that not all paths satisfy the property.
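On an explicit finite transition system, a property of the "globally" form reduces to checking the proposition on every reachable state. The sketch below shows that reduction directly rather than through a model checker; the function name, the edge-map representation, and the toy graphs are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, List


def always_holds(initial_states: Iterable[Hashable],
                 edges: Dict[Hashable, List[Hashable]],
                 holds: Callable[[Hashable], bool]) -> bool:
    """Check a 'globally p' property: p must hold in every state that is
    reachable from some initial state."""
    seen = set(initial_states)
    queue = deque(seen)
    while queue:
        state = queue.popleft()
        if not holds(state):
            return False                 # a reachable state violates p
        for nxt in edges.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def in_range(state: int) -> bool:
    return state < 2


safe = always_holds([0], {0: [1], 1: [0], 2: [2]}, in_range)    # state 2 unreachable
unsafe = always_holds([0], {0: [1], 1: [2], 2: [2]}, in_range)  # state 2 reachable
print(safe, unsafe)
```

Because the abstract transitions over-approximate the concrete ones, a True result carries over to the concrete system, whereas a False result may stem from a spurious abstract path.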

Mountain Car

We use and to represent the position and velocity of the state in Mountain Car. There are two properties to ensure that the car can eventually reach the destination. The first says that the speed of the car must be greater than around position , which is represented by the LTL formula . The other property is that the car can always reach the position , which can be formulated as , where $\mathbf{F}$ is the finally operator in LTL. The initial position of the car is set to . Both properties can be verified to be true under different perturbations.
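A property of the "finally" form holds on all paths of a finite transition system exactly when no path can avoid the goal forever, that is, when the non-goal states reachable from the initial states contain neither a cycle nor a dead end. The sketch below checks this with a depth-first search. It is a hedged illustration, not the automata-based procedure Spot uses; treating dead ends as violations and all names are assumptions, and the recursion is only suitable for small graphs.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def eventually_holds(initial_states: Iterable[Hashable],
                     edges: Dict[Hashable, List[Hashable]],
                     goal: Callable[[Hashable], bool]) -> bool:
    """Check a 'finally q' property: every path from an initial state
    must reach a state satisfying q."""
    in_progress, finished = set(), set()

    def can_avoid_goal(state: Hashable) -> bool:
        if goal(state):
            return False
        if state in in_progress:
            return True                  # cycle made only of non-goal states
        if state in finished:
            return False                 # already shown to always reach the goal
        in_progress.add(state)
        nexts = edges.get(state, [])
        # A dead end is treated as a path that never reaches the goal.
        avoid = (not nexts) or any(can_avoid_goal(n) for n in nexts)
        in_progress.discard(state)
        finished.add(state)
        return avoid

    return not any(can_avoid_goal(s) for s in initial_states)


def at_goal(state: int) -> bool:
    return state == 2


reaches = eventually_holds([0], {0: [1], 1: [2], 2: [2]}, at_goal)
may_loop = eventually_holds([0], {0: [0, 1], 1: [2], 2: [2]}, at_goal)
print(reaches, may_loop)
```

The second graph has a self-loop at state 0, so one abstract path stays there forever and the property fails, even though most paths do reach the goal.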


Cartpole

One property of Cartpole is that the angle of the pole and the displacement of the cart should never exceed preset thresholds. We assume the thresholds are 2.4 and , respectively. The property can be defined as , where and represent the angle and displacement, respectively. It is partially verified to be true on abstract states in 4.3 hours, where is the threshold we set for the number of traversed abstract states.

4-Car Platoon

One safety property of the system is that there must be no collisions between cars. Using to denote the distance between the -th car and the -th car, the property can be formulated as . Due to the large state space, the property is verified to be true on only part of the abstract states, in nearly 6.0 hours.

Case Initial State Space Property All States Verified Time
Pendulum 2
Mountain Car 1164
Cartpole 14869
4-Car Platoon 21562
Table 3: Verification results of the four trained controllers

Efficiency and Scalability

The experimental results show that the time cost on verification mainly depends on the size of reachable abstract-state space. It is possible to verify a DRL system which has millions of abstract states in a few hours. For the systems that have larger abstract-state space, we can fine-tune the abstraction granularity to reduce the abstract-state space during the training phase and meanwhile we guarantee that all the desired properties must be verified to be true under that granularity. A case study in the supplementary document shows the feasibility of the approach.

6 Related Work

Our abstraction-based verification approach is inspired by several recent works on abstraction-based verification of neural networks singh2019abstract ; pulina2010abstraction ; prabhakar2019abstraction . These works have demonstrated the effectiveness of abstraction techniques in the formal verification of neural networks. Our abstraction approach differs from them in two major ways. One is that we introduce abstraction in the training process, and the other is that the objects of abstraction in our framework are system states, while those in these approaches are neurons.

To the best of our knowledge, existing verification approaches for DRL-enabled systems can be divided into three categories. One is based on model transformation, which transforms the embedded DNN model into an interpretable model such as decision trees and programs bastani2018verifiable ; DBLP:conf/icml/VermaMSKC18 . Another is to synthesize barrier functions that assist the DNN in decision making and ensure safety during deployment DBLP:conf/pldi/ZhuXMJ19 ; xiong2021scalable . The last is to incorporate the DNN into the system dynamics DBLP:conf/hybrid/IvanovWAPL19 ; tran2019safety . However, these approaches fail to simultaneously provide the three key capabilities that our approach achieves, i.e., scalability that is independent of the size of the DNN, support for temporal properties more complex than safety, and the ability to deal with perturbations in verification.

Runtime verification is another perspective of applying formal methods to DRL-enabled systems DBLP:journals/jmlr/GarciaF15 . For instance, runtime monitoring based on formal methods can guide agents to behave under predefined requirements DBLP:conf/aiia/HasanbeigKA19 ; DBLP:conf/aaai/FultonP18 . Agents can be prevented from unsafe behaviors by constructing the safety shield via formal verification DBLP:journals/corr/abs-1708-08611 ; jansen2018shielded . However, runtime approaches inevitably incur extra system overheads during training and deployment.

7 Conclusion and Future Work

We have presented an abstraction technique for training and verifying DRL-enabled systems. By abstraction, continuous state space is discretized into a finite set of abstract states, on which DNNs in control systems are trained. By the same abstraction, we can model the trained DRL-enabled systems by finite state transition systems and resort to state-of-the-art model checking techniques to verify various system properties. We proved the soundness of the abstraction in verification and implemented a training and verification framework. We conducted experiments on four classical control problems. The experimental results demonstrated that the controllers trained with abstraction have comparable performance with those trained without abstraction. The four trained controllers were verified, which showed the feasibility and efficiency of our verification approach.

We believe that learning on abstract domains is a promising technique for training verifiable AI-enabled systems. It provides a flexible mechanism for balancing the performance of trained systems against the size of the abstract-state space by fine-tuning the abstraction granularity. Based on this work, it would be possible to integrate training and verification techniques. Guided by the counterexamples generated during verification, the learning process can be more strategic and produce well-trained DRL systems with formal guarantees.


  • (1) Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, et al. Safe reinforcement learning via shielding. In AAAI’18, pages 2669–2678, 2018.
  • (2) Edoardo Bacci and David Parker. Probabilistic guarantees for safe deep reinforcement learning. In FORMATS’20, pages 231–248, 2020.
  • (3) Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. NeurIPS’18, 31:2494–2504, 2018.
  • (4) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • (5) Edmund M Clarke, Thomas A Henzinger, Helmut Veith, and Roderick Bloem. Handbook of model checking. Springer, 2018.
  • (6) Patrick Cousot and Radhia Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In POPL’77, pages 238–252, 1977.
  • (7) Alexandre Duret-Lutz, Alexandre Lewkowicz, Amaury Fauchille, Thibaud Michaud, Etienne Renault, and Laurent Xu. Spot 2.0: a framework for LTL and ω-automata manipulation. In International Symposium on Automated Technology for Verification and Analysis, pages 122–129. Springer, 2016.
  • (8) Eugene A Feinberg and Adam Shwartz. Handbook of Markov decision processes: methods and applications, volume 40. Springer Science & Business Media, 2012.
  • (9) Nathan Fulton and André Platzer. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In AAAI’18, pages 6485–6492, 2018.
  • (10) Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480, 2015.
  • (11) Lee Gomes. When will Google’s self-driving car really be ready? It depends on where you live and what you mean by “ready”. IEEE Spectrum, 53(5):13–14, 2016.
  • (12) Mohammadhosein Hasanbeig, Daniel Kroening, and Alessandro Abate. Towards verifiable and safe model-free reinforcement learning. CEUR Workshop Proceedings, 2020.
  • (13) Xiaowei Huang, Daniel Kroening, Wenjie Ruan, James Sharp, et al. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Comput. Sci. Rev., 37:100270, 2020.
  • (14) Radoslav Ivanov, James Weimer, Rajeev Alur, George J. Pappas, and Insup Lee. Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pages 169–178, 2019.
  • (15) Nils Jansen, Bettina Könighofer, Sebastian Junges, and Roderick Bloem. Shielded decision-making in mdps. arXiv preprint arXiv:1807.06096, 2018.
  • (16) Nathan Jay, Noga H. Rotman, Philip B. Godfrey, et al. Internet congestion control via deep reinforcement learning. CoRR, abs/1810.03259, 2018.
  • (17) D Kroening, A Abate, and M Hasanbeig. Towards verifiable and safe model-free reinforcement learning. CEUR Workshop Proceedings, 2020.
  • (18) Nathan O. Lambert, Daniel S. Drew, Joseph Yaconelli, et al. Low-level control of a quadrotor with deep model-based reinforcement learning. IEEE Robotics Autom. Lett., 4(4):4224–4230, 2019.
  • (19) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, et al. Continuous control with deep reinforcement learning. In ICLR’16, 2016.
  • (20) Björn Lütjens, Michael Everett, and Jonathan P How. Certified adversarial robustness for deep reinforcement learning. In Conference on Robot Learning, pages 1328–1337. PMLR, 2020.
  • (21) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
  • (22) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • (23) Kei Ota. TF2RL. https://github.com/keiohta/tf2rl/, 2020.
  • (24) Amir Pnueli. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46–57. IEEE, 1977.
  • (25) Pavithra Prabhakar and Zahra Afzal. Abstraction based output range analysis for neural networks. In NeurIPS’19, pages 15788–15798, 2019.
  • (26) Luca Pulina and Armando Tacchella. An abstraction-refinement approach to verification of artificial neural networks. In CAV’10, pages 243–257. Springer, 2010.
  • (27) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, et al. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • (28) Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. An abstract domain for certifying neural networks. In POPL’19, pages 1–30, 2019.
  • (29) Gagandeep Singh, Markus Püschel, and Martin Vechev. A practical construction for decomposing numerical abstract domains. Proceedings of the ACM on Programming Languages, 2(POPL):1–28, 2017.
  • (30) Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. In ICML’19, pages 6215–6224, 2019.
  • (31) Hoang-Dung Tran, Feiyang Cai, Manzanas Lopez Diego, Patrick Musau, Taylor T Johnson, and Xenofon Koutsoukos. Safety verification of cyber-physical systems with reinforcement learning control. ACM Trans. on Emb. Comp. Sys., 18(5s):1–22, 2019.
  • (32) Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In ICML’18, pages 5052–5061, 2018.
  • (33) Zikang Xiong and Suresh Jagannathan. Scalable synthesis of verified controllers in deep reinforcement learning. arXiv preprint arXiv:2104.10219, 2021.
  • (34) Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. arXiv preprint arXiv:2003.08938, 2020.
  • (35) He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework for verifiable reinforcement learning. In PLDI’19, pages 686–701, 2019.


The appendix consists of two sections. In Section A, we present an experiment of achieving the balance between the performance of trained systems and the size of their abstract-state space by adjusting its abstraction granularity during learning. The result can support our claim in our paper that by fine-tuning abstraction granularity we can reduce the abstract-state space and meanwhile guarantee that all the desired properties are verifiable. Section B gives the proofs of the lemmas that are used to prove the soundness of our abstraction in verification.

Appendix A Fine-tuning Abstraction Granularity

In this section, we use Cartpole as a supplementary example for Section 5.4 to illustrate the feasibility of our approach for achieving a balance between the performance of trained systems and the size of their abstract-state space by fine-tuning the abstraction granularity. The approach makes trained systems amenable to verification without sacrificing their performance.

A.1 Determining Abstraction Granularity Baseline

Figure 7: Performance comparison of the trained controller under different abstraction granularities

In the example of Cartpole, a state consists of four elements, i.e., the displacement and the velocity of the cart and the angle and the angular velocity of the pole. Because states are continuous, the whole state space is infinite. The controller aims to keep the angle of the pole and the displacement of the cart within fixed thresholds.

As discussed in our submitted paper, we can start training the example with a relatively coarse abstraction granularity, e.g., . We observed, however, a big gap in performance between the controller trained under this granularity and the one trained using a classical DRL algorithm without any abstraction. We call the latter the benchmark. We fine-tune the granularity until the controller converges steadily to the rewards achieved by the benchmark controller. As shown in Figure 7, the performance under the intermediate abstraction granularity () is close to the benchmark, and we therefore choose it as the baseline of the abstraction granularity. An abstraction granularity that is finer than the baseline basically guarantees that the controller’s performance is close to the benchmark.
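The baseline selection described above can be sketched as a simple search from coarse to fine granularities. This is a minimal illustration only: the function name `choose_baseline`, the callback `train_and_evaluate`, and the `tolerance` parameter are hypothetical and do not come from the paper, and the actual training is left to the caller.

```python
from typing import Callable, Optional, Sequence, TypeVar

G = TypeVar("G")  # type of a granularity descriptor (hypothetical)


def choose_baseline(
    granularities: Sequence[G],
    train_and_evaluate: Callable[[G], float],
    benchmark_reward: float,
    tolerance: float,
) -> Optional[G]:
    """Return the coarsest granularity whose reward is close to the benchmark.

    `granularities` must be ordered from coarse to fine.
    `train_and_evaluate` is a caller-supplied stand-in that trains a
    controller under the given granularity and returns its converged reward.
    """
    for granularity in granularities:
        reward = train_and_evaluate(granularity)
        # Accept the first granularity whose reward gap is within tolerance.
        if benchmark_reward - reward <= tolerance:
            return granularity
    return None  # no candidate reached the benchmark; refine further
```

Stopping at the coarsest acceptable granularity keeps the abstract-state space, and hence the verification cost, as small as possible.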

A.2 Model Checking under Different Abstraction Granularities

The property for verification can be formally defined by the following LTL formula:

where, and represent the displacement and the angle respectively. There are three possible verification results, i.e., completely verified to be true (), partially verified to be true (), or verified to be false (✗). Table 4 shows the verification results of the property under different abstraction granularities. The property fails in the first three cases: the verification framework found abstract states in which the property is violated before reaching the threshold. Note that finding a violation on the abstract model does not imply that the trained controller violates the property, because the abstraction over-approximates the concrete behaviors. This is a common phenomenon when abstraction is introduced into verification. We continued to refine the abstraction granularity to be . The property is verified to be true on abstract states before it reaches the state-space threshold. This result is the same as the one shown in the paper.

To better illustrate how we can achieve a complete verification of a desired property, we slightly relaxed the safety property by increasing the safe range of the pole’s angle by 0.1. That is, we replace (approximately ) in the property with . The relaxed safety property becomes:

The right-hand part of Table 4 shows the verification results of the relaxed property under different granularities. In the case of the coarse granularity, a violation of the property is found, indicating that the property may not be satisfied by the controller trained under this abstraction. When we fine-tuned the granularity to be , the relaxed property was completely verified within a reasonable time. Because this granularity is finer than the baseline and the property is verified, we can safely stop training with finer granularities in practice. In this experiment, we continued to train the controller under two finer granularities for the purpose of comparison. The results show that the property is preserved as expected under finer granularities. In the case under , the property was partially verified due to reaching the threshold of traversed abstract states.

Result States Time Result States Time
Coarse 697 710
799 793
462 456
Fine - 1361 - 1347
Table 4: Verification results with different properties on the angle and abstraction granularities. (The state threshold of the number of traversed abstract states is , and time is in seconds.)
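The three verification outcomes can be illustrated with a bounded breadth-first exploration of an abstract transition system. This is a sketch for plain safety properties only, not the LTL model checking procedure used in the paper. The names `check_safety`, `successors`, `is_safe`, and `max_states` are hypothetical.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Tuple


def check_safety(
    initial: Iterable[Hashable],
    successors: Callable[[Hashable], Iterable[Hashable]],
    is_safe: Callable[[Hashable], bool],
    max_states: int,
) -> Tuple[str, int]:
    """Explore reachable abstract states and report one of three outcomes.

    Returns ("verified" | "partial" | "violated", number of states seen).
    "partial" means the state threshold was reached before the reachable
    space was exhausted and no violation had been found so far.
    """
    seen = set(initial)
    queue = deque(seen)
    while queue:
        state = queue.popleft()
        if not is_safe(state):
            # A violation on the abstract model; it may be spurious
            # because the abstraction over-approximates concrete behavior.
            return "violated", len(seen)
        for nxt in successors(state):
            if nxt not in seen:
                if len(seen) >= max_states:
                    return "partial", len(seen)
                seen.add(nxt)
                queue.append(nxt)
    return "verified", len(seen)
```

A "violated" result on a coarse abstraction is therefore a signal to refine the granularity rather than proof that the controller is unsafe.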

Appendix B Proof of Lemmas in Section 4.3

We first formally define concretization function , abstraction function , concrete transformer , and abstract transformer in sequence.

Definition 3 (Concretization function).

A concretization function maps abstract states into sets of concrete system states , where .

Definition 4 (Abstraction function).

An abstraction function is the mapping from the concrete domain to the abstract domain such that for all there is and for each .
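As an illustration of Definitions 3 and 4, the following sketch implements an abstraction and a concretization function for a uniform grid. The uniform grid, the half-open cells, and the names `abstract`, `concretize`, `lower`, and `granularity` are assumptions made for this example; the paper's abstract domain may be defined differently.

```python
import math
from typing import Sequence, Tuple


def abstract(
    state: Sequence[float],
    lower: Sequence[float],
    granularity: Sequence[float],
) -> Tuple[int, ...]:
    """Map a concrete state to the index tuple of the grid cell containing it."""
    return tuple(
        math.floor((s - lo) / g) for s, lo, g in zip(state, lower, granularity)
    )


def concretize(
    cell: Sequence[int],
    lower: Sequence[float],
    granularity: Sequence[float],
) -> Tuple[Tuple[float, float], ...]:
    """Return the half-open box [lo_i, hi_i) of concrete states in a cell."""
    return tuple(
        (lo + k * g, lo + (k + 1) * g) for k, lo, g in zip(cell, lower, granularity)
    )
```

With half-open cells, every concrete state inside the grid belongs to exactly one abstract state, and concretizing that abstract state yields a box that contains the original state.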

Definition 5 (Concrete transformer).

A concrete transformer defines the state transition function based on the system dynamics.

Definition 6 (Abstract transformer).

An abstract transformer defines transitions between abstract states, which is a composition of an interval transition function and an over-approximation function .

Intuitively, denotes the interval vector after the action determined by the neural network is applied to , and returns the least set of abstract states whose union is an over-approximation of the interval vector .

We use the tuple to denote values of the concrete state and to denote the variation in each variable based on the system dynamics. Specifically, given the concrete state and the adopted action , the next state of can be represented by .

Definition 7 (Interval transition function).

An interval transition function returns an interval vector for an abstract state . Let and . Then for each , and .

Definition 8 (Over-approximation function).

An over-approximation function returns a set of abstract states for a given interval vector . Let the interval vector denote the range of the union of abstract states. Then , and . Let , there are abstract states in the set.
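The composition of an interval transition and an over-approximation (Definitions 6 to 8) can be sketched as follows for a uniform grid. This is an illustration under stated assumptions: the bounds on the per-variable variation are supplied by the caller rather than derived from system dynamics, and the names `interval_transition`, `over_approximate`, `delta_bounds`, `lower`, and `granularity` are hypothetical.

```python
import itertools
import math
from typing import Sequence, Set, Tuple

Interval = Tuple[float, float]


def interval_transition(
    box: Sequence[Interval], delta_bounds: Sequence[Interval]
) -> Tuple[Interval, ...]:
    """Shift each interval by the smallest and largest variation of its variable."""
    return tuple(
        (lo + dmin, hi + dmax) for (lo, hi), (dmin, dmax) in zip(box, delta_bounds)
    )


def over_approximate(
    interval: Sequence[Interval],
    lower: Sequence[float],
    granularity: Sequence[float],
) -> Set[Tuple[int, ...]]:
    """Return all grid cells that intersect the (closed) interval vector."""
    ranges = []
    for (lo, hi), base, g in zip(interval, lower, granularity):
        first = math.floor((lo - base) / g)  # cell containing the lower bound
        last = math.floor((hi - base) / g)   # cell containing the upper bound
        ranges.append(range(first, last + 1))
    # The result is the Cartesian product of the per-dimension index ranges.
    return set(itertools.product(*ranges))
```

The number of returned abstract states is the product of the per-dimension cell counts, so a finer granularity or a larger variation increases the branching of the abstract transition system.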

Lemma 1.

Let the state . Then . is the abstract state that can be represented by , where for each . is the interval vector . Since the actions adopted for states belonging to the same abstract state are the same, for each , we have and . Intuitively, , and , so we have , i.e., . So . ∎

We prove Lemma 2 by contradiction. That is, we assume and deduce the contradiction.

Lemma 2.

Let the state , the vector and . According to the definition of , , we have and . Let . Since , we have or for each . However, we have for each , therefore , which contradicts the definition of . ∎