1 Introduction
Despite the unparalleled potential that Deep Reinforcement Learning (DRL) techniques have demonstrated in numerous control fields DBLP:journals/nature/MnihKSRVBGRFOPB15 ; DBLP:journals/corr/abs181003259 ; DBLP:journals/ral/LambertDYLCP19 , real-world DRL applications remain quite limited in safety-critical domains because such systems need certificates of their reliability. A typical example is fully autonomous driving, which is still argued to be a long way off due to safety concerns gomes2016will . Verifiable guarantees on safety and reliability are both desirable and necessary for those DRL systems hasanbeig2020towards . Unfortunately, formally verifying DRL systems is a challenging task due to the dynamic continuity of system behaviors and the black-box nature of the AI models (neural networks) embedded in the systems. The dynamic continuity results in an uncountably infinite state space DBLP:conf/pldi/ZhuXMJ19 , while the black-box feature causes the inexplainability of neural networks DBLP:journals/csr/HuangKRSSTWY20 .
Instead of directly verifying DRL systems, most existing approaches rely on transforming them into verifiable models. Representative works include extracting decision trees bastani2018verifiable and programmatic policies DBLP:conf/icml/VermaMSKC18 , synthesizing deterministic programs DBLP:conf/pldi/ZhuXMJ19 and linear controllers xiong2021scalable , and transforming DRL systems into hybrid systems DBLP:conf/hybrid/IvanovWAPL19 and star sets tran2019safety . Although these transformation-based approaches are effective solutions, they have some limitations, e.g., extracted policies may not equivalently represent the source neural networks, and the properties that can be verified may be limited. For instance, only safety properties are supported by hybrid-system and star-set models via reachability analysis. Thus, it is desirable that a trained DRL system can be directly and efficiently verified without transformation.

In this paper, we propose a novel training approach for DRL by learning on finite abstract domains, unlike traditional approaches which learn directly on concrete system states. Specifically, we discretize continuous states into finite abstract states, on which we train a DRL system. Owing to the finiteness of abstract states, the neural network trained on them is essentially a finite function that maps abstract states to actions. Because the trained neural network adopts the same action for all concrete states of the same abstract state, we can leverage the abstract interpretation technique cousot1977abstract to model the DRL system as a finite-state transition system, which can be efficiently model checked.
Our training approach has two main features that distinguish it from existing classic DRL approaches. Firstly, a DRL system trained with our approach is directly verifiable, and thus it avoids the limitations of transformation-based approaches. The novelty of learning on abstract domains makes it possible to model a DRL system as a finite-state system by abstracting continuous concrete states into corresponding abstract domains. The subsequent verification becomes straightforward, as off-the-shelf model-checking tools such as Spot duret2016spot can be used to verify various properties efficiently. Secondly, our approach is orthogonal to existing DRL algorithms and can be naturally implemented by extending them. We have implemented a prototype training and verification framework and performed extensive experiments on four classic continuous control tasks. The experimental results demonstrate that the systems trained with our approach have performance comparable to those trained by existing DRL algorithms. Meanwhile, they can be formally verified against desired properties.
In summary, this paper makes the following two major contributions:

A novel abstraction-based DRL approach to train continuous control systems on abstract domains such that the trained systems are amenable to formal verification while retaining comparable performance to those trained on concrete states.

A subsequent abstraction-based verification approach and a resulting prototype tool for model checking the trained DRL systems, coupled with a benchmark of four verified DRL systems for corresponding classic control problems.
2 DRL and its Formal Verification
DRL is usually modeled as a Markov Decision Process (MDP) feinberg2012handbook , which is a 4-tuple $(S, A, P, R)$, where $S$ is a set of states called the state space, $A$ is a set of actions called the action space, $P(s' \mid s, a)$ is the probability of the transition from $s$ to $s'$ based on action $a$, and $R(s, a, s')$ is the reward received by the controller after the given transition from $s$ to $s'$. Since the system dynamics of safety-critical systems are generally known and deterministic bastani2018verifiable ; DBLP:conf/hybrid/IvanovWAPL19 ; DBLP:conf/pldi/ZhuXMJ19 , the effect of an action on a state results in only one successor state. Thus, we write $s \xrightarrow{a} s'$ to indicate that there is a transition from $s$ to $s'$ due to action $a$.

DRL aims to train a DNN-based controller to learn a deterministic policy $\pi$ that specifies the unique action adopted in a state to achieve specific goals. A trained DRL system can be represented as a tuple $M = (S, S_0, \pi, \rightarrow)$ with $S_0 \subseteq S$ being the set of initial states of the system. Let $S_M$ be the set of all the reachable states of $M$. We have $S_0 \subseteq S_M$, and for two states $s, s' \in S$, if $s \in S_M$ and $s \xrightarrow{\pi(s)} s'$ then $s' \in S_M$.
The formal verification of a DRL system $M$ is to check whether $M$ satisfies some desired property $\varphi$ that is formalized as a logical formula in some logic such as Linear Temporal Logic (LTL) pnueli1977temporal . $M$ satisfies $\varphi$, denoted as $M \models \varphi$, if and only if all the paths of $M$ satisfy $\varphi$. There are two key factors that make it intractable to directly verify $M$. One is that the number of paths of $M$ is infinite when $M$ contains infinitely many states. The other is that the set of successor states is difficult to compute and represent due to the nonlinearity of the system dynamics. Figure 1 shows an example of computing the successor of a state using its change of rate, where time is discretized into intervals of length $\Delta t$ and the transition from time $t$ to $t + \Delta t$ is approximated by $s_{t+\Delta t} = s_t + \Delta t \cdot \dot{s}_t$ DBLP:conf/pldi/ZhuXMJ19 . Further, one needs to compute the control action by querying the DNN in every transition in order to build the state transition system of a DRL system, which drastically reduces the efficiency of verification.
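As an illustration of this discrete-time approximation, a single Euler step for a pendulum-like plant can be sketched as below; the dynamics and physical constants here are assumptions for the sketch, not the exact model used in the paper.

```python
import math

def euler_successor(state, action, dt=0.05):
    """One Euler step: s(t + dt) ~= s(t) + dt * ds/dt, for a pendulum-like
    system. The change of rate below is illustrative, not the paper's model."""
    theta, omega = state                      # angle and angular velocity
    g, l, m = 9.8, 1.0, 1.0                   # assumed physical constants
    # nonlinear change of rate under control torque `action`
    omega_dot = (3 * g / (2 * l)) * math.sin(theta) + (3.0 / (m * l ** 2)) * action
    omega_next = omega + dt * omega_dot
    theta_next = theta + dt * omega_next      # semi-implicit Euler update
    return (theta_next, omega_next)
```

Building the full transition system repeats such a step once per transition, each time preceded by a DNN query for the action, which is exactly the per-transition cost discussed above.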
Perturbation is another factor that makes the verification of DRL systems more difficult. A trained controller may face perturbations in the real world, caused by modeling errors or by differences between training and test scenarios DBLP:conf/icml/TesslerEM19 ; zhang2020robust . It is necessary to ensure the robustness of DRL systems, so perturbations must be taken into account in verification. Perturbations may cause nondeterministic transitions between states because the actual successor state may deviate from the expected one lutjens2020certified . We use a perturbation vector $\delta = (\delta_1, \ldots, \delta_n)$ to describe the offset range. Then the target system for verification can be modeled as a tuple $M_\delta = (S, S_0, \pi, \rightarrow, \delta)$, where $S$, $S_0$, $\pi$, and $\rightarrow$ are the same as previously defined. Given the perturbation vector $\delta$, the transition relation denotes all states reachable after applying an action to a state. Specifically, for the expected transition from $s$ to $s'$, the actual reachable states are $\{s'' \mid |s''_i - s'_i| \le \delta_i, 1 \le i \le n\}$, where $n$ is the dimension of the state. Apparently, perturbation to concrete states may lead to state-space explosion.

3 Abstraction-Based Reinforcement Learning
Figure 2 shows the framework with an illustrative example. The state of the mountain car is a pair of position and velocity. We suppose a region where the position is in and the velocity is in . Before a concrete state, e.g., , is fed to the neural network, we transform it into the representation of its corresponding region, i.e., the interval vector , as the actual input. The neural network produces an action based on its current setting and the input. The action takes effect on the concrete state to drive the system under training to proceed.
The essential difference of our framework from classic DRL approaches is that the states fed into neural networks are abstract states. An abstract state corresponds to an infinite set of concrete states, and is represented as a vector of intervals in our framework. Thus, we call our learning approach abstractionbased reinforcement learning.
3.1 State Discretization and Abstraction
Our abstraction mechanism is based on the assumption that a trained controller usually adopts the same action on adjacent concrete states DBLP:conf/formats/Bacci020 . We consider a concrete state to be a vector of $n$ real numbers. The distance between two states can be measured by $L^p$ norms.
Definition 1 (Adjacent states).
Two states $s, s'$ are called adjacent with respect to an $L^p$-norm distance $\epsilon$, denoted by $s \sim_\epsilon s'$, if and only if $\|s - s'\|_p \le \epsilon$.
Given a state $s$ and an $L^p$-norm distance $\epsilon$, the set of all the adjacent states of $s$ is essentially an $L^p$-norm ball centered at $s$ with radius $\epsilon$.
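Under the interval domain used below, the natural choice is the $L^\infty$ norm, whose ball is an axis-aligned box. A small sketch of the adjacency check (the helper is ours, generic in $p$):

```python
def adjacent(s1, s2, eps, p=float("inf")):
    """Return True iff ||s1 - s2||_p <= eps. With p = inf, the set of states
    adjacent to s is an axis-aligned box, which matches interval abstraction."""
    diffs = [abs(a - b) for a, b in zip(s1, s2)]
    if p == float("inf"):
        dist = max(diffs)
    else:
        dist = sum(d ** p for d in diffs) ** (1.0 / p)
    return dist <= eps
```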
Let $l_i, u_i$ be the lower and upper bounds of the $i$th dimension element in $S$. Then the state space of the control system is $S = [l_1, u_1] \times \cdots \times [l_n, u_n]$. The basic idea of state discretization and abstraction is to classify all adjacent states into a set, and represent the set as an abstract domain such as Polyhedra, Octagon, or Interval singh2017practical .

In our abstraction approach, we choose Interval as the abstract domain for its simplicity and efficiency. Specifically, we divide the interval of each dimension into a finite set of unit intervals. For each dimension $i$, let $d_i$ ($d_i > 0$) be the diameter of each unit interval, and let $d = (d_1, \ldots, d_n)$ represent the vector of diameters for the $n$ dimensions. We call $d$ the abstraction granularity of $S$, and use $I_i$ to represent the set of all the divided unit intervals of dimension $i$. Then, we obtain an abstract-state space $S^\# = I_1 \times \cdots \times I_n$, where an abstract state $s^\#$ is essentially a vector of unit intervals. Apparently, $S^\#$ is finite. A concrete state $s$ belongs to the abstract state $s^\#$, denoted by $s \in s^\#$, if and only if $s_i \in s^\#_i$ for each $i$.
Definition 2 (Interval-based abstraction).
Given a state space $S$ and an abstraction granularity $d$, a state $s = (s_1, \ldots, s_n)$ is abstracted to the interval vector $s^\# = (\iota_1, \ldots, \iota_n)$, where for each $i$, $\iota_i$ is the unit interval in $I_i$ with $s_i \in \iota_i$.
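Definition 2 amounts to one floor computation per dimension. A minimal sketch (the function name and argument layout are ours):

```python
import math

def abstract_state(s, lower, gran):
    """Map a concrete state s to its interval abstract state: dimension i falls
    into the unit interval [l_i + k*d_i, l_i + (k+1)*d_i], k = floor((s_i - l_i)/d_i)."""
    out = []
    for x, lo, d in zip(s, lower, gran):
        k = math.floor((x - lo) / d)
        out.append((lo + k * d, lo + (k + 1) * d))
    return out
```

For example, with lower bound 0 and granularity 0.1, the concrete value 0.35 maps to the unit interval [0.3, 0.4].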
3.2 Learning on Abstract States
The abstractionbased reinforcement learning approach is orthogonal to most of the stateoftheart DRL algorithms and can be smoothly implemented atop them. We only need to insert an abstract transformer between the control system and the neural network to transform concrete states into abstract ones before feeding them to the neural network.
We consider incorporating the operation to extend Deep Q-Learning (DQL) DBLP:journals/corr/MnihKSGAWR13 as an illustrative example. Algorithm 1 depicts the main workflow, where abstractionMapping is an abstraction function that maps concrete states to their corresponding abstract states and $d$ is the abstraction granularity. In our abstraction approach, given a concrete state $s$, we first compute, according to the preset abstraction granularity, the unit interval containing $s_i$ for each dimension $i$. Then the resulting interval vector is fed into the neural network. It is worth mentioning that we need to double the input dimension of the neural network in order to accept the interval vector (each interval contributes a lower and an upper bound). We omit explanations of the other steps as they are well established in DQL.
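A sketch of how the mapping slots into one interaction step of the extended loop; `q_policy` and `env_step` are placeholders for the real network and environment, and the flattened `[l1, u1, ..., ln, un]` layout realizes the doubled input dimension mentioned above:

```python
import math

def to_abstract_input(s, lower, gran):
    """abstractionMapping followed by flattening: each dimension contributes
    the lower and upper bound of its unit interval, doubling the input size."""
    flat = []
    for x, lo, d in zip(s, lower, gran):
        k = math.floor((x - lo) / d)
        flat += [lo + k * d, lo + (k + 1) * d]
    return flat

def dql_step(env_step, q_policy, state, lower, gran):
    """One interaction step of an abstraction-extended DQL loop (illustrative):
    the only change from plain DQL is the to_abstract_input call before the
    network query; replay entries also store the abstract input."""
    abs_in = to_abstract_input(state, lower, gran)
    action = q_policy(abs_in)                     # the network only sees abstract states
    next_state, reward = env_step(state, action)  # the action applies to the concrete state
    transition = (abs_in, action, reward, to_abstract_input(next_state, lower, gran))
    return next_state, transition
```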
We also applied the abstraction technique to Deep Deterministic Policy Gradient (DDPG) DBLP:journals/corr/LillicrapHPHETS15 and Proximal Policy Optimization (PPO) DBLP:journals/corr/SchulmanWDRK17 algorithms, and then conducted experiments using the extended learning algorithms based on the open-sourced DRL library TF2RL ota2020tf2rl , where various DRL algorithms are implemented using TensorFlow 2.x.
Abstraction plays a crucial role in our framework. Its granularity determines both the performance of a trained network and the verification difficulty of the hosting system. The finer the abstraction, the better the performance of the trained network, but the more costly the verification becomes due to state-space explosion. This assertion is confirmed by the experimental results in Section 5.2. Therefore, it is important to determine an appropriate abstraction granularity to reach a trade-off between performance and verification cost. We set the granularity $d$ as a hyperparameter in the training algorithms, which means that its adjustment depends on the corresponding training performance.
4 AbstractionBased Formal Verification
In this section, we propose an abstraction-based verification approach to model check the DRL systems trained on abstract domains. The basic idea is based on the abstract interpretation technique cousot1977abstract , which builds transition systems on finite abstract-state spaces by transforming concrete states into abstract ones for the purpose of model checking. Because the abstract-state space is finite, its verification can be achieved by classic model-checking techniques DBLP:journals/fac/Konnov19 .
4.1 Building AbstractState Transition System
We abstract a continuous state space into a finite abstract-state space in the same way as we do in the training phase, and then build an abstract-state transition system by establishing the transition relations among abstract states according to the actions produced by the trained neural network.
As mentioned in Section 2, a trained DRL system can be modeled as $M_\delta = (S, S_0, \pi, \rightarrow, \delta)$ when perturbation is considered. Here, the policy $\pi$ is implemented by a neural network that can be modeled as a black-box function from abstract states to actions. Let $S_0^\#$ be the set of initial abstract states such that $s^\# \in S_0^\#$ if and only if there exists a state $s \in S_0$ with $s \in s^\#$.
Next we define the transition relation between abstract states. Figure 3 depicts the abstract transformer for abstract states. Given an abstract state $s^\#$, we can obtain a unique action $a$ by feeding it to the trained network. After applying $a$ to $s^\#$, we calculate an interval vector that covers the irregular state space generated by the system dynamics. If perturbation is considered, the vector can be smoothly expanded by $\delta$ to include the extra reachable states. Then we use a set of abstract states to over-approximate the vector. Let $[l_i, u_i]$ be the $i$th interval in the vector; then $[l_i, u_i]$ must be a subinterval of either a unit interval or the concatenation of multiple unit intervals of $I_i$. Without loss of generality, we assume at least $k_i$ ($k_i \ge 1$) unit intervals are needed, concatenated with each other, to cover $[l_i, u_i]$. So we need $\prod_{i=1}^{n} k_i$ abstract states, whose union is the least over-approximation of the resulting vector. There is a transition from $s^\#$ to each abstract state in the union, as illustrated in the figure.
Let us consider the example of the mountain car shown in Figure 2. The trained DNN takes the same action on all the concrete states represented by the current abstract state; we assume the action is to accelerate the car to the right. We calculate the maximal and minimal values on the two dimensions based on the system dynamics, and construct an interval vector from them to represent all states transited from those in the preceding abstract state. This vector can then be over-approximated by four abstract states.
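The over-approximation step (with the optional perturbation expansion) can be sketched as follows, under the same interval layout as before; the helper names are ours:

```python
import itertools
import math

def expand(iv, delta):
    """Widen each interval by the perturbation bound delta_i on both sides."""
    return [(l - d, u + d) for (l, u), d in zip(iv, delta)]

def covering_abstract_states(iv, lower, gran):
    """Least set of abstract states (tuples of unit intervals) whose union
    covers the interval vector iv: per dimension, collect every unit interval
    touched by [l_i, u_i], then take the Cartesian product across dimensions."""
    per_dim = []
    for (l, u), lo, d in zip(iv, lower, gran):
        k_lo = math.floor((l - lo) / d)
        k_hi = max(k_lo, math.ceil((u - lo) / d) - 1)   # last touched unit interval
        per_dim.append([(lo + k * d, lo + (k + 1) * d) for k in range(k_lo, k_hi + 1)])
    return [tuple(cell) for cell in itertools.product(*per_dim)]
```

With two unit intervals touched on each of the two dimensions, the Cartesian product yields exactly four successor abstract states, as in the mountain-car example.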
4.2 Model Checking of LTL Properties
Since we can construct the explicit finitestate transition system, the verification work can be delivered to existing modelchecking tools. This observation indicates that the abstraction is decoupled from the subsequent verification procedure, which means that our approach can benefit from any future improvement in modelchecking techniques.
In practice, we leverage Spot duret2016spot to complete the verification work. Algorithm 2 describes the implementation details of our verification framework, where Input lists the settings that users need to provide and functions that start with "spot." can be called directly from Spot.
The automaton corresponding to the negation of the LTL formula is constructed at Line 12; we refer readers to DBLP:journals/fac/Konnov19 for more details of LTL verification. We traverse the abstract states via breadth-first search to build the explicit transition system, where successor abstract states are computed in the way explained in Section 4.1. The function at Line 9 takes an interval vector and the corresponding action returned by the DNN, and returns the irregular state space, which we do not compute explicitly. Instead, we directly obtain the set of abstract states by applying the two abstraction functions mentioned in Section 4.3. Besides, a threshold forces the verification to terminate when the model checker cannot verify all reachable abstract states.
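The breadth-first construction at the core of Algorithm 2 can be sketched as a minimal stand-in; `successors` abstracts away one DNN query followed by the over-approximating transformer, and the threshold realizes the forced termination described above:

```python
from collections import deque

def build_transition_system(init_abs, successors, threshold=10000):
    """Breadth-first exploration of the abstract-state space. Returns the
    explicit transition relation and whether exploration was complete
    (False means only a partial state space can be verified)."""
    visited = set(init_abs)
    edges = {}
    queue = deque(init_abs)
    complete = True
    while queue:
        if len(visited) > threshold:       # forced termination on huge spaces
            complete = False
            break
        s = queue.popleft()
        edges[s] = successors(s)           # one DNN query + abstract transformer
        for t in edges[s]:
            if t not in visited:
                visited.add(t)
                queue.append(t)
    return edges, complete
```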
Then, we calculate the propositions satisfied by each abstract state. Note that to guarantee the soundness of the verification results, when judging whether an abstract state satisfies a proposition $p$, we conclude that it satisfies $p$ only if all concrete states in it satisfy $p$. Finally, we call Spot to construct the product of the transition system and the automaton and to obtain the verification result.
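For the interval domain, this sound labeling reduces to an inclusion check on interval endpoints. A sketch for a box-shaped proposition such as "the angle stays in $[-c, c]$" (the helper name is ours):

```python
def satisfies_box_prop(abstract_state, c, dim=0):
    """An abstract state is labeled with the proposition "dimension `dim` lies
    in [-c, c]" only if its whole interval on that dimension is inside [-c, c],
    i.e., only if every concrete state it covers satisfies the proposition."""
    l, u = abstract_state[dim]
    return -c <= l and u <= c
```

An abstract state straddling the bound is conservatively left unlabeled, which is why over-approximation can fail to verify a property that in fact holds on the concrete system.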
4.3 Soundness of the Abstraction Transformer
We prove that the abstract transformer is sound in that it preserves propositions. The abstract transformer is a function $T$ from abstract states to sets of abstract states, which is a composition of $\eta$ and $\alpha$. Intuitively, $\eta(s^\#, a)$ denotes the vector of intervals after the action $a$ determined by the neural network is applied to the abstract state $s^\#$, and $\alpha(v)$ returns the least set of abstract states whose union is an over-approximation of an interval vector $v$. Note that for the brevity of the proof, we omit the expansion operation for perturbation without loss of validity.
Lemma 1.
Given a state $s$ in an abstract state $s^\#$, let $s'$ denote the successor state after an action $a$ is applied to $s$. Then, $s' \in \eta(s^\#, a)$.
Lemma 1 says that $\eta$ guarantees that after an action is applied to $s^\#$, the generated interval vector contains the successor state caused by applying the same action to any concrete state in $s^\#$.
Lemma 2.
Given an interval vector $v$ and a state $s'$ such that $s' \in v$, let $s'^\#$ be the abstract state of $s'$. Then, $s'^\# \in \alpha(v)$.
Lemma 2 guarantees the over-approximation of interval vectors. That is, for each state $s'$ contained in $v$, the abstract state of $s'$, i.e., $s'^\#$, must be in $\alpha(v)$. The formal definitions of $\eta$ and $\alpha$ and the proofs are provided in the appendix as a supplementary document.
Figure 4 graphically shows the soundness of the abstract transformer $T$. It says that for any abstract state $s^\#$, the transitions from $s^\#$ to its successor abstract states in $T(s^\#)$ cover all the transitions from the concrete states that $s^\#$ abstracts to their successor states caused by the same action.
Theorem 1 (Soundness).
For each abstract state $s^\#$ and every concrete state $s$ that $s^\#$ abstracts, the abstract state of the successor of $s$ under the action chosen for $s^\#$ belongs to $T(s^\#)$.
5 Experimental Evaluation
We first study the impact of abstraction granularity on the performance of trained systems by training a system under different abstraction granularities and comparing the results. Then, we demonstrate the effectiveness of our approach by showing that the systems trained with it have performance comparable to those trained with classical DRL algorithms. Finally, we verify the trained systems against their desired properties to show the efficiency of the verification.
5.1 Benchmark and Experimental Settings
We choose three classic control problems from Gym 1606.01540 , including Pendulum, Mountain Car and Cartpole, and another adapted control task 4Car Platoon DBLP:conf/pldi/ZhuXMJ19 .

Pendulum It delineates a pendulum that can rotate around an endpoint. Starting from a random position, the pendulum is expected to swing up and stay upright. The expected property of Pendulum is that its angle must always stay in the preset range.

Mountain Car A car is positioned on a one-dimensional track between two mountains. It is expected to drive up the right mountain, by first driving up the left one to gain enough momentum via inertia. We need to guarantee that the car can finally reach the destination.

Cartpole A pole is attached by an unactuated joint to a cart, which moves along a frictionless track. The controller aims to keep the angle of the pole and the displacement of the cart within fixed thresholds, which must be guaranteed.

4Car Platoon Four cars on the road are supposed to drive in a platoon behind each other. Each car aims to drive close to the front car so as to save fuel and reduce driving time. A straightforward safety requirement is that the four cars must never cause any collision.
Experimental settings
All experiments are conducted on a workstation running Ubuntu 18.04 with a 32core AMD Ryzen Threadripper CPU @ 3.7GHz and 128GB RAM.
5.2 Impact of Abstraction Granularity
Controller  Fine  Intermediate  Coarse 

Pendulum  
Mountain Car  
CartPole  
4Car platoon 
We trained the four systems with the abstraction-based approach. To evaluate the impact of abstraction granularity, we set three different abstraction granularity values for each system and examine their performance. Table 4 shows the values for the four systems. For convenience, we write a repeated diameter once, annotated with the number of its consecutive occurrences in the vector. Smaller intervals imply finer abstraction granularity. We classify the abstraction granularities into three levels, i.e., fine, intermediate, and coarse.
To obtain reliable results, we train each controller for 10 rounds and record its performance at the corresponding steps. The performance of a system is measured by the average reward over 5 episodes. To make the comparison clear, we omit confidence intervals and only show the trend of mean rewards in the figure.
Figure 5 shows the comparison results for the four trained systems. It can be seen that the performance of the Pendulum controllers is almost the same under the three different abstraction granularities, which implies that even the coarse one suffices to train the system to a good performance. However, the performance of Mountain Car and Cartpole varies with the abstraction granularity: finer granularity leads to better performance. Note that the trajectories of the 4Car platoon's performance fluctuate heavily because, under the reward setting, the controller receives a large negative reward when the cars collide.
The experiments also show that it is important to choose an appropriate abstraction granularity to achieve a trade-off between the performance of trained systems and the size of their abstract-state spaces. One way to determine it is to start training a controller with a relatively coarse abstraction granularity, and refine it until the controller converges steadily to the optimal reward indicated by a controller trained with classic DRL algorithms.
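This coarse-to-fine strategy can be sketched as a simple loop; `train_and_eval`, the refinement factor, and the stopping rule are all illustrative choices rather than the paper's prescribed procedure:

```python
def tune_granularity(train_and_eval, initial_gran, target_reward,
                     refine=0.5, max_rounds=5):
    """Start from a coarse abstraction granularity and shrink every diameter
    until the controller trained under it reaches the reward of a baseline
    trained without abstraction (target_reward), or the round budget runs out."""
    gran = list(initial_gran)
    reward = float("-inf")
    for _ in range(max_rounds):
        reward = train_and_eval(gran)      # train and evaluate a controller under `gran`
        if reward >= target_reward:
            break                          # granularity is fine enough: stop refining
        gran = [d * refine for d in gran]  # refine every dimension
    return gran, reward
```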
5.3 Performance Comparison with Classical DRL Algorithms
Controller  Network  Algorithm  Granularity 

Pendulum  DDPG  
Mountain Car  DQL  
CartPole  DQL  
4Car platoon  DDPG 
We compare the performance of our training approach with classical DRL algorithms. We train each control system using a classic DRL algorithm and its corresponding extension with our abstraction technique, respectively. The main training settings can be found in Table 2. For those remaining adjustable hyperparameters, we use the default values in the TF2RL framework ota2020tf2rl .
Figure 6 depicts the trend of the four controllers' performance as training proceeds under the different training frameworks. The blue line indicates the mean rewards of a controller trained with the classic DRL algorithm, and the light blue area shows the corresponding confidence intervals. The orange dashed line and area represent the performance of the controllers trained with the same DRL algorithms extended with our abstraction approach. It can be observed that the trends of mean rewards are similar in all four cases. Although there is a performance gap during training in Mountain Car and 4Car platoon, the controllers trained with abstraction achieve the optimal reward eventually. Thus, the controllers trained with the abstraction technique retain performance comparable to those trained without abstraction.
5.4 Verification Analysis
In this section, we model check the four controllers trained with our abstraction-based approach and present the verification results. Table 3 shows the experimental data, where $\delta$ represents the perturbation vector, All indicates whether the abstract-state space is completely verified, States means the number of traversed abstract states, Verified indicates whether the property is verified true or false (✗), and Time denotes the time cost in seconds. We preset a threshold on the number of traversed abstract states to force the verification to terminate.
Pendulum
One property of Pendulum is that the pole's angle must always stay within the preset range. We use a tuple $(\theta, \omega)$ to define a state of Pendulum, where $\theta$ and $\omega$ denote the pole's angle and angular velocity, respectively. Then the property can be defined by an LTL formula of the form $\square\,\varphi$, where $\square$ is the globally operator indicating that the proposition $\varphi$ must hold in all the reachable states. We fix an initial state space for the controller. The property can be verified under different perturbations in several seconds. The other property is that the angular velocity must eventually exceed a given bound whenever the angle is below a given threshold, which is not verified. Due to the over-approximation of reachable concrete states, the set of successor abstract states of the specific one in the violated path contains not only the valid abstract state but also one whose angular velocity is below the bound, which means that not all paths satisfy the property.
Mountain Car
We use $p$ and $v$ to represent the position and velocity of a state in Mountain Car. There are two properties to ensure that the car can eventually reach the destination. The first says that the speed of the car must be greater than a given bound around a given position, which is represented by an LTL formula under the globally operator. The other property is that the car can always eventually reach the goal position, which can be formulated with $\lozenge$, the finally operator in LTL. The initial position of the car is preset. Both properties can be verified to be true under different perturbations.
Cartpole
One property of Cartpole is that the angle of the pole and the displacement of the cart should never exceed preset thresholds. We assume the displacement threshold is 2.4 and the angle threshold is a preset bound. The property can be defined as a globally formula over the conjunction of the two bounds, where $\theta$ and $x$ represent the angle and displacement respectively. It is partially verified to be true in 4.3 hours, terminating at the preset threshold on the number of traversed abstract states.
4Car Platoon
One safety property of the system is that there must be no collisions between cars. Using $dist_i$ to denote the distance between the $i$th and the $(i{+}1)$th car, the property can be formulated as an invariant requiring every $dist_i$ to remain positive. Due to the large state space, the property is verified to be true on a subset of the reachable states in nearly 6.0 hours.
Case  Initial State Space  Property  All  States  Verified  Time  
Pendulum  2  
2  
16  
✗  2  
Mountain Car  1164  
1555  
3717  
1154  
1542  
3742  
Cartpole  ✗  14869  
✗  15443  
4Car Platoon  ✗  21562  
Efficiency and Scalability
The experimental results show that the time cost of verification mainly depends on the size of the reachable abstract-state space. It is possible to verify a DRL system with millions of abstract states in a few hours. For systems with larger abstract-state spaces, we can fine-tune the abstraction granularity to reduce the abstract-state space during the training phase, while guaranteeing that all the desired properties are verified to be true under that granularity. A case study in the supplementary document shows the feasibility of this approach.
6 Related Work
Our abstraction-based verification approach is inspired by a number of recent works on abstraction-based verification of neural networks singh2019abstract ; pulina2010abstraction ; prabhakar2019abstraction . These works have demonstrated the effectiveness of abstraction techniques for the formal verification of neural networks. However, there are two major differences in our abstraction approach. One is that we introduce abstraction into the training process; the other is that the abstraction objects in our framework are system states, while the abstraction objects in these approaches are neurons.
To the best of our knowledge, existing verification approaches for DRL-enabled systems can be divided into three categories. One is based on model transformation, which transforms the embedded DNN model into an interpretable model such as decision trees or programs bastani2018verifiable ; DBLP:conf/icml/VermaMSKC18 . Another is to synthesize barrier functions that assist the DNN in decision making to ensure safety during deployment DBLP:conf/pldi/ZhuXMJ19 ; xiong2021scalable . The last is to incorporate the DNN into the system dynamics DBLP:conf/hybrid/IvanovWAPL19 ; tran2019safety . However, these approaches fail to simultaneously fulfill the three key capabilities that our approach achieves: scalability that is impervious to the size of the DNN, support for temporal properties beyond safety, and the ability to deal with perturbations in verification.
Runtime verification is another way of applying formal methods to DRL-enabled systems DBLP:journals/jmlr/GarciaF15 . For instance, runtime monitoring based on formal methods can guide agents to behave under predefined requirements DBLP:conf/aiia/HasanbeigKA19 ; DBLP:conf/aaai/FultonP18 . Agents can also be prevented from unsafe behaviors by constructing safety shields via formal verification DBLP:journals/corr/abs170808611 ; jansen2018shielded . However, runtime approaches inevitably incur extra system overhead during training and deployment.
7 Conclusion and Future Work
We have presented an abstraction technique for training and verifying DRL-enabled systems. By abstraction, the continuous state space is discretized into a finite set of abstract states, on which the DNNs in control systems are trained. By the same abstraction, we can model the trained DRL-enabled systems as finite-state transition systems and resort to state-of-the-art model-checking techniques to verify various system properties. We proved the soundness of the abstraction in verification and implemented a training and verification framework. We conducted experiments on four classical control problems. The results demonstrate that the controllers trained with abstraction have performance comparable to those trained without abstraction, and the four trained controllers were verified, which shows the feasibility and efficiency of our verification approach.
We believe that learning on abstract domains is a promising technique for training verifiable AI-enabled systems. It provides a flexible mechanism for balancing the performance of trained systems against the size of the abstract-state space by fine-tuning the abstraction granularity. Based on this work, it would be possible to integrate training and verification techniques: guided by the counterexamples generated during verification, the learning process can be made more strategic to produce well-trained DRL systems with formal guarantees.
References
 (1) Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, et al. Safe reinforcement learning via shielding. In AAAI’18, pages 2669–2678, 2018.
 (2) Edoardo Bacci and David Parker. Probabilistic guarantees for safe deep reinforcement learning. In FORMATS’20, pages 231–248, 2020.
 (3) Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. NeurIPS’18, 31:2494–2504, 2018.
 (4) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 (5) Edmund M Clarke, Thomas A Henzinger, Helmut Veith, and Roderick Bloem. Handbook of model checking. Springer, 2018.
 (6) Patrick Cousot and Radhia Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In POPL’77, pages 238–252, 1977.
 (7) Alexandre Duret-Lutz, Alexandre Lewkowicz, Amaury Fauchille, Thibaud Michaud, Etienne Renault, and Laurent Xu. Spot 2.0: a framework for LTL and automata manipulation. In ATVA’16, pages 122–129, 2016.
 (8) Eugene A Feinberg and Adam Shwartz. Handbook of Markov decision processes: methods and applications, volume 40. Springer Science & Business Media, 2012.
 (9) Nathan Fulton and André Platzer. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In AAAI’18, pages 6485–6492, 2018.
 (10) Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480, 2015.
 (11) Lee Gomes. When will Google’s self-driving car really be ready? It depends on where you live and what you mean by “ready". IEEE Spectrum, 53(5):13–14, 2016.
 (12) Mohammadhosein Hasanbeig, Daniel Kroening, and Alessandro Abate. Towards verifiable and safe modelfree reinforcement learning. CEUR Workshop Proceedings, 2020.
 (13) Xiaowei Huang, Daniel Kroening, Wenjie Ruan, James Sharp, et al. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Comput. Sci. Rev., 37:100270, 2020.
 (14) Radoslav Ivanov, James Weimer, Rajeev Alur, George J. Pappas, and Insup Lee. Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pages 169–178, 2019.
 (15) Nils Jansen, Bettina Könighofer, Sebastian Junges, and Roderick Bloem. Shielded decision-making in MDPs. arXiv preprint arXiv:1807.06096, 2018.
 (16) Nathan Jay, Noga H. Rotman, Philip B. Godfrey, et al. Internet congestion control via deep reinforcement learning. CoRR, abs/1810.03259, 2018.
 (17) D Kroening, A Abate, and M Hasanbeig. Towards verifiable and safe modelfree reinforcement learning. CEUR Workshop Proceedings, 2020.
 (18) Nathan O. Lambert, Daniel S. Drew, Joseph Yaconelli, et al. Low-level control of a quadrotor with deep model-based reinforcement learning. IEEE Robotics Autom. Lett., 4(4):4224–4230, 2019.
 (19) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, et al. Continuous control with deep reinforcement learning. In ICLR’16, 2016.
 (20) Björn Lütjens, Michael Everett, and Jonathan P How. Certified adversarial robustness for deep reinforcement learning. In Conference on Robot Learning, pages 1328–1337. PMLR, 2020.
 (21) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
 (22) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 (23) Kei Ota. TF2RL. https://github.com/keiohta/tf2rl/, 2020.
 (24) Amir Pnueli. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46–57. IEEE, 1977.
 (25) Pavithra Prabhakar and Zahra Afzal. Abstraction based output range analysis for neural networks. In NeurIPS’19, pages 15788–15798, 2019.
 (26) Luca Pulina and Armando Tacchella. An abstractionrefinement approach to verification of artificial neural networks. In CAV’10, pages 243–257. Springer, 2010.
 (27) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, et al. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 (28) Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. An abstract domain for certifying neural networks. In POPL’19, pages 1–30, 2019.
 (29) Gagandeep Singh, Markus Püschel, and Martin Vechev. A practical construction for decomposing numerical abstract domains. Proceedings of the ACM on Programming Languages, 2(POPL):1–28, 2017.
 (30) Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. In ICML’19, pages 6215–6224, 2019.
 (31) Hoang-Dung Tran, Feiyang Cai, Manzanas Lopez Diego, Patrick Musau, Taylor T Johnson, and Xenofon Koutsoukos. Safety verification of cyber-physical systems with reinforcement learning control. ACM Trans. on Emb. Comp. Sys., 18(5s):1–22, 2019.
 (32) Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In ICML’18, pages 5052–5061, 2018.
 (33) Zikang Xiong and Suresh Jagannathan. Scalable synthesis of verified controllers in deep reinforcement learning. arXiv preprint arXiv:2104.10219, 2021.
 (34) Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and ChoJui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. arXiv preprint arXiv:2003.08938, 2020.
 (35) He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework for verifiable reinforcement learning. In PLDI’19, pages 686–701, 2019.
Appendix
The appendix consists of two sections. Section A presents an experiment on balancing the performance of trained systems against the size of their abstract-state space by adjusting the abstraction granularity during learning. The result supports the claim in our paper that, by fine-tuning the abstraction granularity, we can reduce the abstract-state space while guaranteeing that all the desired properties remain verifiable. Section B gives the proofs of the lemmas used to prove the soundness of our abstraction in verification.
Appendix A Fine-tuning Abstraction Granularity
In this section, we use Cartpole as a supplementary example for Section 5.4 to illustrate how our approach achieves a balance between the performance of trained systems and the size of their abstract-state space by fine-tuning the abstraction granularity. This makes the verification of trained systems amenable without sacrificing their performance.
A.1 Determining the Abstraction Granularity Baseline
In the Cartpole example, a state consists of four elements: the displacement and the velocity of the cart, and the angle and the angular velocity of the pole. Because the states are continuous, the whole state space is infinite. The controller aims to keep the angle of the pole and the displacement of the cart within fixed thresholds.
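To make the discretization concrete, the following sketch shows how a continuous four-dimensional Cartpole state can be mapped to a finite abstract state (an index tuple of its interval box). The granularity and bound values here are hypothetical illustrations, not the ones used in our experiments:

```python
import numpy as np

# Hypothetical interval widths per dimension:
# (displacement, velocity, angle, angular velocity).
GRANULARITY = np.array([0.5, 0.5, 0.05, 0.5])
# Assumed lower bounds of the bounded state space.
LOWER = np.array([-2.4, -3.0, -0.21, -3.0])

def abstract_state(s):
    """Map a concrete 4-D Cartpole state to the index tuple of the
    interval box (abstract state) containing it."""
    return tuple(np.floor((np.asarray(s) - LOWER) / GRANULARITY).astype(int))

def box_of(idx):
    """Concretization: the half-open interval box [lo, hi) that the
    abstract state represents."""
    lo = LOWER + np.asarray(idx) * GRANULARITY
    return lo, lo + GRANULARITY
```

Two concrete states falling into the same box yield the same abstract state, so the network trained on abstract states necessarily adopts the same action for both of them.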
As discussed in our submitted paper, we started training with a relatively coarse abstraction granularity, but observed a large performance gap between the controller trained under this granularity and one trained with a classical DRL algorithm without any abstraction, which we call the benchmark. We fine-tuned the granularity until the controller converged steadily to the rewards achieved by the benchmark controller. As shown in Figure 7, the performance under the intermediate abstraction granularity is close to the benchmark, and we therefore chose it as the baseline abstraction granularity. Any granularity finer than the baseline essentially guarantees that the controller’s performance stays close to the benchmark.
A.2 Model Checking under Different Abstraction Granularities
The property for verification can be formally defined by the following LTL formula:
$\square\,(|x| \le \delta_x \wedge |\theta| \le \delta_\theta)$
where $x$ and $\theta$ represent the displacement and the angle, respectively, and $\delta_x$ and $\delta_\theta$ are their safety thresholds. There are three possible verification results: completely verified to be true (✓), partially verified to be true (○), or verified to be false (✗). Table 4 shows the verification results of the property under different abstraction granularities. The property fails in the first three cases: the verification framework found some abstract states where the property is violated before reaching the threshold. Note that finding violations does not imply that the trained controllers do not satisfy the property; that is a common phenomenon caused by introducing abstraction into verification. We continued to refine the abstraction granularity, after which the property was verified to be true on the traversed abstract states before reaching the state-space threshold. This result is the same as the one shown in the paper.
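The three outcomes above arise from a bounded exploration of the finite abstract transition system. The loop can be sketched as a breadth-first reachability analysis; the interfaces below are illustrative placeholders, not our framework's actual API:

```python
from collections import deque

def check_safety(init_states, successors, is_safe, budget):
    """Breadth-first reachability over the abstract transition system.
    Returns ('false', s) on a violating abstract state, ('true', None)
    if the full reachable set was explored safely, or ('partial', None)
    if the state budget (threshold) was hit first.
    `successors` plays the role of the abstract transformer."""
    seen, frontier = set(init_states), deque(init_states)
    while frontier:
        s = frontier.popleft()
        if not is_safe(s):
            return ('false', s)              # counterexample abstract state
        for t in successors(s):
            if t not in seen:
                if len(seen) >= budget:
                    return ('partial', None)  # state-space threshold reached
                seen.add(t)
                frontier.append(t)
    return ('true', None)
```

A finer granularity enlarges the abstract state space, which is why a property can flip from completely verified to only partially verified once the budget is exhausted.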
To better illustrate how a complete verification of a desired property can be achieved, we slightly relaxed the safety property by increasing the safe range of the pole’s angle by 0.1. That is, we replace the angle threshold $\delta_\theta$ in the property with $\delta_\theta + 0.1$. The relaxed safety property becomes:
$\square\,(|x| \le \delta_x \wedge |\theta| \le \delta_\theta + 0.1)$
The right-hand part of Table 4 shows the verification results of the relaxed property under different granularities. Under the coarse granularity, a violation of the property is found, indicating that the property may not be satisfied by the controller trained under this abstraction. When we fine-tuned the granularity, the relaxed property was completely verified within a reasonable time. Because this granularity is finer than the baseline and the property is verified, we can safely stop training with finer granularities in practice. In this experiment, we nevertheless continued to train the controller under two finer granularities for comparison, which showed that the property is preserved under finer granularities as expected. Under the finest granularity, the property was only partially verified, due to reaching the threshold of traversed abstract states.
Granularity | Original property            | Relaxed property
            | Result   States   Time       | Result   States   Time
Coarse      | ✗        697      –          | ✗        710      –
–           | ✗        799      –          | ✓        793      –
–           | ✗        462      –          | ✓        456      –
Fine        | –        1361     –          | –        1347     –
Appendix B Proofs of the Lemmas in Section 4.3
We first formally define the concretization function $\gamma$, the abstraction function $\alpha$, the concrete transformer $f$, and the abstract transformer $f^\sharp$, in sequence.
Definition 3 (Concretization function).
A concretization function $\gamma: \hat{S} \to 2^{S}$ maps each abstract state $\hat{s} \in \hat{S}$ into the set of concrete system states $\gamma(\hat{s}) \subseteq S$ that it represents.
Definition 4 (Abstraction function).
An abstraction function $\alpha: S \to \hat{S}$ is the mapping from the concrete domain to the abstract domain such that for all $s \in S$ there is a unique abstract state $\alpha(s)$ with $s \in \gamma(\alpha(s))$.
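A minimal one-dimensional instance of these two definitions, assuming unit-width intervals as the abstract states, can illustrate the required relationship between the two functions:

```python
import math

# Abstract state k represents the concrete interval [k, k + 1).

def alpha(s):
    """Abstraction function: concrete state -> containing abstract state."""
    return math.floor(s)

def gamma(k):
    """Concretization function: abstract state -> its concrete interval,
    given here as a (lower, upper) pair of the half-open box."""
    return (float(k), float(k) + 1.0)

# Definition 4 requires s in gamma(alpha(s)) for every concrete s:
lo, hi = gamma(alpha(3.7))
assert lo <= 3.7 < hi
```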
Definition 5 (Concrete transformer).
A concrete transformer $f: S \to S$ defines the state transition function based on the system dynamics.
Definition 6 (Abstract transformer).
An abstract transformer $f^\sharp$ defines the transitions between abstract states; it is the composition $f^\sharp = O \circ I$ of an interval transition function $I$ and an over-approximation function $O$.
Intuitively, $I(\hat{s})$ denotes the interval vector obtained after the action determined by the neural network is applied to $\hat{s}$, and $O$ returns the least set of abstract states whose union is an over-approximation of that interval vector.
We use the tuple $(v_1, \dots, v_n)$ to denote the values of a concrete state $s$ and $(\Delta_1, \dots, \Delta_n)$ to denote the variation of each variable based on the system dynamics. Specifically, given the concrete state $s$ and the adopted action $a$, the next state of $s$ can be represented by $(v_1 + \Delta_1, \dots, v_n + \Delta_n)$.
Definition 7 (Interval transition function).
An interval transition function $I$ returns an interval vector for an abstract state $\hat{s}$. Let $\hat{s} = ([l_1, u_1], \dots, [l_n, u_n])$ and $I(\hat{s}) = ([l'_1, u'_1], \dots, [l'_n, u'_n])$. Then, for each $i$, $l'_i$ is the minimum and $u'_i$ the maximum of $v_i + \Delta_i$ over all concrete states in $\gamma(\hat{s})$.
Definition 8 (Overapproximation function).
An over-approximation function $O$ returns a set of abstract states for a given interval vector $([l'_1, u'_1], \dots, [l'_n, u'_n])$: the least set of abstract states whose union covers the interval vector. Letting $([L_1, U_1], \dots, [L_n, U_n])$ denote the range of the union of these abstract states, we have $L_i \le l'_i$ and $u'_i \le U_i$ for each $i$; the number of abstract states in the set is determined by how many abstraction intervals the vector spans in each dimension.
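The two components of the abstract transformer can be sketched on a uniform grid of half-open boxes [k*g, (k+1)*g) per dimension. The grid layout and function names are assumptions for illustration:

```python
import itertools
import math

def interval_transition(box, delta_bounds):
    """I: apply assumed per-dimension variation bounds [dmin_i, dmax_i]
    of the chosen action to the box ([l_i, u_i])_i, yielding the
    successor interval vector ([l_i + dmin_i, u_i + dmax_i])_i."""
    return [(l + dmin, u + dmax)
            for (l, u), (dmin, dmax) in zip(box, delta_bounds)]

def over_approx(ivec, g):
    """O: the least set of grid boxes (as index tuples, box k covering
    [k*g, (k+1)*g)) whose union covers the interval vector."""
    ranges = [range(math.floor(l / g), math.ceil(u / g)) for (l, u) in ivec]
    return set(itertools.product(*ranges))
```

Composing the two, `over_approx(interval_transition(box, d), g)`, gives a finite set of successor abstract states for each abstract state, which is what makes the transition system model-checkable.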
Lemma 1.
For every abstract state $\hat{s}$ and every concrete state $s \in \gamma(\hat{s})$, we have $f(s) \in \bigcup_{\hat{s}' \in f^\sharp(\hat{s})} \gamma(\hat{s}')$.
Proof: Let the state $s = (v_1, \dots, v_n) \in \gamma(\hat{s})$, where $\hat{s} = ([l_1, u_1], \dots, [l_n, u_n])$ with $l_i \le v_i \le u_i$ for each $i$, and let $I(\hat{s}) = ([l'_1, u'_1], \dots, [l'_n, u'_n])$. Since the actions adopted for states belonging to the same abstract state are the same, for each $i$ we have $l'_i \le v_i + \Delta_i$ and $v_i + \Delta_i \le u'_i$. Hence $f(s)$ lies within the interval vector $I(\hat{s})$, and since the union of the abstract states in $O(I(\hat{s}))$ covers $I(\hat{s})$, we have $f(s) \in \bigcup_{\hat{s}' \in O(I(\hat{s}))} \gamma(\hat{s}')$, i.e., $f(s) \in \bigcup_{\hat{s}' \in f^\sharp(\hat{s})} \gamma(\hat{s}')$. ∎
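Lemma 1 can be sanity-checked empirically on a toy one-dimensional system in which the variation depends only on the abstract state, mirroring the fact that the same action is adopted for all concrete states of a box. The dynamics and box width below are illustrative assumptions:

```python
import math
import random

g = 1.0                                            # assumed box width
alpha = lambda s: math.floor(s / g)                # abstraction function
delta = lambda k: 0.3 if k % 2 == 0 else -0.7      # toy per-box variation
f = lambda s: s + delta(alpha(s))                  # concrete transformer

def f_sharp(k):
    """Abstract transformer: shift box k by its variation (I), then
    collect the least set of boxes covering the result (O)."""
    lo, hi = k * g + delta(k), (k + 1) * g + delta(k)
    return set(range(math.floor(lo / g), math.ceil(hi / g)))

# Every concrete successor must land in a box covered by f_sharp:
random.seed(0)
for _ in range(1000):
    s = random.uniform(-10.0, 10.0)
    assert alpha(f(s)) in f_sharp(alpha(s))
```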
We prove Lemma 2 by contradiction: we assume the negation of the lemma and deduce a contradiction.
Lemma 2.
For every abstract state $\hat{s}$ and every concrete state $s \in \gamma(\hat{s})$, $\alpha(f(s)) \in f^\sharp(\hat{s})$.
Proof: Let the state $s \in \gamma(\hat{s})$, the vector $I(\hat{s}) = ([l'_1, u'_1], \dots, [l'_n, u'_n])$, and $s' = f(s) = (v'_1, \dots, v'_n)$, and assume $\alpha(s') \notin O(I(\hat{s}))$. According to the definition of $O$, the range $([L_1, U_1], \dots, [L_n, U_n])$ of the union of the abstract states in $O(I(\hat{s}))$ satisfies $L_i \le l'_i$ and $u'_i \le U_i$ for each $i$. By the assumption, there is some $i$ with $v'_i < L_i$ or $v'_i > U_i$. However, by Lemma 1 we have $l'_i \le v'_i \le u'_i$ for each $i$, and therefore $L_i \le v'_i \le U_i$, which contradicts the assumption. ∎