On Assessing The Safety of Reinforcement Learning algorithms Using Formal Methods

The increasing adoption of Reinforcement Learning in safety-critical domains such as autonomous vehicles, healthcare, and aviation raises the need to ensure its safety. Existing safety mechanisms such as adversarial training, adversarial detection, and robust learning are not always adapted to all the disturbances under which the agent is deployed. Those disturbances include moving adversaries whose behavior can be unpredictable for the agent and therefore harmful to its learning. Ensuring the safety of critical systems also requires methods that give formal guarantees on the behaviour of the agent evolving in a perturbed environment. It is therefore necessary to propose new solutions adapted to the learning challenges faced by the agent. In this paper, we first generate moving adversaries that expose flaws in the agent's policy. Second, we use reward shaping and a modified Q-learning algorithm as defense mechanisms to improve the agent's policy when facing adversarial perturbations. Finally, probabilistic model checking is employed to evaluate the effectiveness of both mechanisms. We have conducted experiments on a discrete grid world with a single agent facing non-learning and learning adversaries. Our results show a reduction in the number of collisions between the agent and the adversaries. Probabilistic model checking provides lower and upper probabilistic bounds regarding the agent's safety in the adversarial environment.




I Introduction

Machine Learning (ML) has gained tremendous interest in the academic and industrial fields, due to increased data volumes, advanced algorithms, and improvements in computing power and storage [18]. Among ML algorithms, Reinforcement Learning (RL) has been recognized for the ability of agents to learn through their interactions with the environment in which they are deployed. This technique has been useful for collecting information about the environment in some applications [34, 15]. On the downside, interacting with the environment gives rise to safety challenges whose severity depends on the criticality of the applications and domains involved.

The safety of RL agents has been studied in the literature. Some approaches consist of retraining the model with adversarial examples to improve the agent’s resilience to them [16, 30]. Other approaches consist of designing a controller to find an optimal policy for the agent in the presence of disturbances [20, 26]. Improving the safety of RL agents has also been studied in the context of adversaries represented by fixed (non-moving) obstacles. In [36, 5], the agent was trained to avoid obstacles existing in the environment. Collision avoidance applications have been studied in the literature [35, 39] but, to the best of our knowledge, no research has been conducted on moving obstacles, which we refer to as moving adversaries. Moreover, providing formal guarantees on the safety of an agent evolving in a perturbed environment has not received enough attention in the literature. This is due to the uncertainty of the environment, which makes the problem very challenging [4].

In this paper, we aim to assess the safety of an RL agent against moving adversaries and give provable guarantees. The agent is trained through an online RL algorithm in the presence of moving adversaries. At the end of the training, the trained agent shows weaknesses linked to its vulnerability during training. We first consider two types of moving adversaries that uncover flaws in the learnt policy of the agent: a static adversary (non-adaptive and non-learnable) and a learnable adversary. These adversaries create threats that prevent the agent from achieving its objective. Then, we implement two defense mechanisms to improve the learning process. One of them is based on reward shaping, a method for engineering the reward function to guide the agent during training. The second defense mechanism consists of modifying the RL algorithm by instructing the agent to avoid the adversary during training. Finally, we provide formal probabilistic guarantees regarding both defense mechanisms. Experimentally, we have verified reachability properties, i.e., the ability of the agent to reach specific states of the environment [2], on the well-known grid world environment to examine the safety and overall performance. The results show that by applying the proposed defense mechanisms, the agent is capable of performing more safely in the presence of various adversaries, e.g., by reducing the collision rate. Formal guarantees are provided to assess the effectiveness of our defense mechanisms, by computing the probabilities of our reachability properties. Our key findings in this paper are:

  • An adversarial environment consisting of static/learnable moving adversaries can badly harm the agent policy,

  • Strengthening the agent policy with our proposed defense mechanisms helps in improving the agent’s behaviour,

  • Probabilistic model checking provides a range of guarantees on the safety of the agent’s behaviour.

The rest of this paper is organized as follows. In Section II, we present the required background knowledge on RL, reward shaping, and probabilistic model checking. Section III describes our methodology and Section IV presents the results of our experiments. Section V briefly reviews related work on the safety of RL agents. Finally, we present concluding remarks in Section VI.

II Background

II-A Reinforcement Learning

An RL problem can be mathematically described as a Markov decision process (MDP) [38]. Figure 1 illustrates the interaction of the agent with its environment in an MDP. An MDP is defined by:

  • A set of states $S$,

  • A set of actions $A$,

  • A transition function $p$ determining the probability that the agent moves to the state $s'$ and gets $r$ as a reward, given the preceding state $s$ and the action $a$ taken:

    $$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \qquad (1)$$

  • The reward function $R$ representing the result of the agent’s action,

  • The starting state $s_0$,

  • The discount factor $\gamma \in [0, 1]$, which expresses how much a reward counts [31].

The beginning of a trajectory in an MDP can be written as follows:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots \qquad (2)$$

The trajectory describes the succession of states, the actions taken by the agent, and the rewards it receives over time. The agent (also called the learner) interacts with the environment and takes actions that move it through different states of the environment. The environment sends back feedback in the form of rewards, values that the agent attempts to maximize over time.

Fig. 1: Interaction between the agent and the environment in a Markov decision process.


The agent’s objective is to find the optimal policy that maximizes the expected discounted cumulative reward, which can be done using the Bellman equations. They express how good it is for an agent to be in a state $s$ when following the policy $\pi$ (Equation 3).
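For reference, a common textbook form of the state-value function and its Bellman expectation equation is given below. The notation ($v_\pi$, $\pi(a \mid s)$, $p(s', r \mid s, a)$) is the standard one and is assumed here, since the symbols of the original equation are not recoverable from the text:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]
         = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```

The second form makes the recursion explicit: the value of a state is the expected immediate reward plus the discounted value of the successor state.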


The Q-learning algorithm [6] is a well-known RL algorithm for training the agent when the model of the environment is unknown. In Q-learning, the action-value function $Q_\pi(s, a)$ tells us how good it is to take a particular action $a$ from a particular state $s$ when following the policy $\pi$ (Equation 4).


At the $t$-th step of an episode, the agent:

  • Observes its current state $s_t$,

  • Selects and performs an action $a_t$,

  • Observes the next state $s_{t+1}$,

  • Receives an immediate reward $r_{t+1}$,

  • Adjusts its Q-values by using the learning rate $\alpha$ and the discount factor $\gamma$.
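The last step of the list above can be sketched in Python as the standard tabular Q-learning update, $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\,]$. This is a minimal sketch and not the authors' implementation: the function name `q_update`, the dictionary-based Q-table, and the values chosen for `alpha`, `gamma`, and the reward are illustrative assumptions.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(s, a)."""
    # Value of the best action available in the next state.
    best_next = max(Q[(s_next, b)] for b in actions)
    # Temporal-difference target and in-place update of Q(s, a).
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical example: one move to the right with an assumed step reward of -1.
ACTIONS = ("right", "up", "down", "left")
Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_update(Q, (0, 0), "right", -1.0, (0, 1), ACTIONS)
```

Starting from an all-zero table, the single update moves `Q[((0, 0), "right")]` a fraction `alpha` of the way towards the target.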


The agent has to decide between exploration and exploitation [33]. The agent can explore the environment to obtain a better estimate of the Q-values, or it can exploit its knowledge and always choose the action that maximizes the Q-values. To trade off between exploration and exploitation, one can implement the $\epsilon$-greedy algorithm ($0 \leq \epsilon \leq 1$), which selects a random action a fraction $\epsilon$ of the time and the action that maximizes $Q(s, a)$ the remaining $1 - \epsilon$ of the time. It is possible to change $\epsilon$ over time so that the agent selects actions randomly at the beginning and acts more greedily once the environment has been explored.
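The selection rule and a decaying exploration rate can be sketched as follows. The exponential decay schedule and its constants (`start`, `end`, `decay`) are assumptions chosen for illustration; the text only states that the exploration rate may change over time.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> Q-value for the current state."""
    if rng.random() < epsilon:
        # Explore: uniformly random action.
        return rng.choice(list(q_values))
    # Exploit: action with the highest Q-value.
    return max(q_values, key=q_values.get)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Assumed schedule: exponential decay from `start`, floored at `end`."""
    return max(end, start * decay ** episode)
```

With `epsilon = 0` the rule is purely greedy, and with `epsilon = 1` it is purely random, which matches the two extremes described above.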

II-B Reward Shaping

Reward shaping consists of modifying the original reward function with a shaping function that incorporates domain knowledge [13]. The new reward function can be formalized as in Equation 6:

$$R'(s, a, s') = R(s, a, s') + F(s, a, s') \qquad (6)$$

where $R$ is the original reward function, $F$ is the shaping reward function, and $R'$ is the modified reward function. Potential-based reward shaping (PBRS) [27] is the first approach that guarantees invariance of the optimal policy. Specifically, PBRS defines $F$ as the difference of potential values:

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s) \qquad (7)$$

where $\Phi$ is a potential function that gives hints about states [13].

Ng et al. [27] demonstrate that every optimal policy produced under the shaped reward function is also an optimal policy without the shaping process.
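One way to see why potential-based shaping leaves the optimal policy unchanged is that the discounted shaping terms telescope along any trajectory: their sum depends only on the first and last potentials, not on the path taken. The short sketch below checks this numerically; the potential values and the discount factor are arbitrary illustrative choices.

```python
def shaping_term(phi_s, phi_next, gamma):
    """Potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def discounted_shaping_sum(potentials, gamma):
    """Sum of gamma^t * F(s_t, s_{t+1}) along a trajectory of potentials."""
    return sum(
        gamma ** t * shaping_term(potentials[t], potentials[t + 1], gamma)
        for t in range(len(potentials) - 1)
    )


# Hypothetical potentials Phi(s_0), ..., Phi(s_3) along one trajectory.
potentials = [3.0, 2.0, 1.0, 0.0]
gamma = 0.9
total = discounted_shaping_sum(potentials, gamma)
# Telescoped closed form: gamma^T * Phi(s_T) - Phi(s_0).
expected = gamma ** 3 * potentials[-1] - potentials[0]
```

Here `total` and `expected` agree up to floating-point error, whatever intermediate potentials are used.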

II-C Probabilistic Model Checking

Probabilistic model checking is a formal verification technique for systems with stochastic behaviour. It allows us to analyse quantitative properties of those systems. PRISM [12] is the tool that we use to verify whether the RL agent satisfies its properties. It is a probabilistic model checker that takes as input a probabilistic model and a temporal logic property, and returns whether or not the property is satisfied on the model. We have extracted a DTMC (discrete-time Markov chain) from the agent policy, whose behaviour at each time step is described as a discrete probabilistic choice over several outcomes. A DTMC is a tuple $(S, s_0, P, AP, L)$ [17] where:

  • $S$ is a set of states,

  • $s_0 \in S$ is an initial state,

  • $P : S \times S \rightarrow [0, 1]$ is a transition probability matrix such that $\sum_{s' \in S} P(s, s') = 1$ for all $s \in S$,

  • $AP$ is a set of atomic propositions,

  • $L : S \rightarrow 2^{AP}$ is a labelling function that assigns, to each state $s \in S$, a set of atomic propositions.

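PCTL (Probabilistic Computation Tree Logic) [10] specifications can be verified on a DTMC. Their syntax is as follows:

$$\Phi ::= true \mid a \mid \neg \Phi \mid \Phi \wedge \Phi \mid P_{\bowtie p}[\phi]$$
$$\phi ::= X\,\Phi \mid \Phi\,U^{\leq k}\,\Phi \mid \Phi\,U\,\Phi$$

where $a$ is an atomic proposition, $\bowtie \in \{<, \leq, \geq, >\}$, $p \in [0, 1]$, and $k \in \mathbb{N}$. A state $s$ satisfies a PCTL formula $\Phi$ if $\Phi$ is true for $s$. $P_{\bowtie p}[\phi]$ means that the probability of the path formula $\phi$ being satisfied meets the bound $\bowtie p$. For a DTMC path $\omega = s_0 s_1 s_2 \ldots$, the next state formula $X\,\Phi$ holds iff $\Phi$ is satisfied in the next path state (i.e., in state $s_1$); the bounded until formula $\Phi_1\,U^{\leq k}\,\Phi_2$ holds iff, before $\Phi_2$ becomes true in some state $s_x$ with $x \leq k$, $\Phi_1$ is true for states $s_0$ to $s_{x-1}$; and the unbounded until formula $\Phi_1\,U\,\Phi_2$ removes the constraint $x \leq k$ from the bounded until formula.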
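To make the bounded until operator concrete, the sketch below computes, for a small hand-written DTMC, the probability of reaching a goal state within $k$ steps, i.e., the quantity checked by a formula of the form $true\ U^{\leq k}\ goal$. The three-state chain and the function name `bounded_reachability` are illustrative assumptions; a model checker such as PRISM performs this kind of computation on the full model.

```python
def bounded_reachability(P, goal, k):
    """Return, for each state s, Pr(reach a goal state within k steps from s).

    P is a row-stochastic matrix given as a list of lists; goal is a set of
    state indices.
    """
    n = len(P)
    for row in P:
        # Each row of a DTMC transition matrix must sum to 1.
        assert abs(sum(row) - 1.0) < 1e-9
    # With 0 steps allowed, only goal states satisfy the property.
    x = [1.0 if s in goal else 0.0 for s in range(n)]
    for _ in range(k):
        # One more step: goal states stay at 1, others average over successors.
        x = [
            1.0 if s in goal else sum(P[s][t] * x[t] for t in range(n))
            for s in range(n)
        ]
    return x


# Hypothetical 3-state chain: 0 -> 1 -> 2, with self-loops; state 2 is the goal.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
probs = bounded_reachability(P, goal={2}, k=2)
```

From state 0, the only way to reach state 2 within two steps is the path 0, 1, 2, so the first entry of `probs` is 0.5 × 0.5 = 0.25.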

III Approach

III-A Environment

First, we consider an environment that consists of a set of locations. An agent and an adversary reside in locations of the environment and can move around. The agent’s objective is to reach a specific location, called the goal. In real-world environments, e.g., autonomous driving, agents perceive their environment via their sensors, i.e., through observations. In this paper, we define the observations of an agent as a set of the environment’s locations: at each time step, the agent can observe its current location and the content of its neighboring locations. As such, the agent can observe the adversary when it is in one of its neighboring locations. The location of the goal is fixed and announced to the agent before it starts its experience in the environment. At each time step: 1) the agent receives its observations from the environment, 2) it selects an action to be taken in the environment (i.e., moving to a neighboring location), and 3) the environment returns a reward signal to the agent.

The agent employs an RL algorithm to learn the optimal policy for reaching the goal in the environment. Without loss of generality, we assume that the agent employs the Q-learning algorithm to learn a path (a series of locations to move through consecutively) to achieve its objective. As shown in Algorithm 1, the Q-learning algorithm is modified to take the observations into account in order to preemptively avoid the adversary at the next move. We keep track of a separate Q function that we update when the agent has the adversary in its surroundings. In such a case, the Q-value associated with the cell where the adversary is located is negatively rewarded. This new Q function is employed whenever the adversary is seen. Otherwise, the normal Q-learning algorithm is employed.

III-B Adversaries

We design two kinds of adversarial policies, non-learnable and learnable, to test the agent and assess its safety and overall performance. Both adversaries can move in the environment, which distinguishes them from typical non-moving obstacles. The non-learnable adversaries are designed to patrol the environment around the goal location to prevent the agent from achieving its objective. Their policies are deterministic, and they harm the agent when both are in the same location, which is called a collision. A collision is a threat to the safety of the agent, since the ideal path to the goal should be collision-free. We formulate the agent’s safety properties with reference to this ideal path. A safety property describes the consequences of the agent’s behavior when taking the ideal path. During the formal verification process, we search for collision-free paths to determine whether the agent’s behavior is safe. The learnable adversary is designed to be smarter, with no predefined behaviour and the capability of adapting itself to the agent’s behaviour as the agent learns how to behave. The learning adversary also uses the Q-learning algorithm.

III-C Defense mechanisms

The adversaries expose flaws in the behaviour of the agent, which leads us to design defense mechanisms to help the agent. We propose two defense mechanisms in this paper. The first one is based on PBRS, similar to the work presented in [27]. Our potential functions are as follows:

  • The Manhattan distance between the agent’s current location and the goal’s location.

  • An extra reward for the agent, given by the original reward obtained at its current location discounted by the probability of observing no collisions at that location. This probability is computed as the number of moves made to a location without colliding with the adversary, divided by the total number of moves made to that same location.
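The two ingredients of these potential functions can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the names `manhattan`, `distance_potential`, and `NoCollisionEstimator` are hypothetical, the sign convention of the distance potential (negated, so that the potential grows as the agent approaches the goal) is an assumption, and treating an unvisited location as collision-free is also an assumption.

```python
from collections import defaultdict


def manhattan(a, b):
    """Manhattan distance between two (row, column) grid locations."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_potential(location, goal):
    # Assumption: negated distance, so locations closer to the goal score higher.
    return -manhattan(location, goal)


class NoCollisionEstimator:
    """Empirical probability of no collision per location (safe moves / all moves)."""

    def __init__(self):
        self.moves = defaultdict(int)
        self.safe_moves = defaultdict(int)

    def record(self, location, collided):
        self.moves[location] += 1
        if not collided:
            self.safe_moves[location] += 1

    def probability(self, location):
        # Assumption: a location never moved to is treated as safe.
        if self.moves[location] == 0:
            return 1.0
        return self.safe_moves[location] / self.moves[location]


est = NoCollisionEstimator()
est.record((4, 4), collided=True)
for _ in range(3):
    est.record((4, 4), collided=False)
```

After one collision and three safe moves to cell (4, 4), the estimated probability of no collision there is 3/4.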

In addition to PBRS, we implement another defense mechanism that is more conservative. The Q-learning algorithm is adapted, as shown in Algorithm 2, to instruct the agent to avoid any collision with the adversary. To do so, we slightly modify the algorithm in its exploitation phase. We check for the presence of the adversary in the neighbouring cells of the agent. When the selected action, i.e., the one with the best Q-value, leads to a collision with the adversary, we ignore that action and take the action with the second-best Q-value.

Result: act
if random.uniform(0, 1) < epsilon (the exploration probability) then
       Explore: select a random action act;
else
       Exploit: select the action act with the max q-value;
       if act leads to adversary then
             select new action act=actNew with the max q-value;
       end if
end if
Algorithm 1 Q-learning with observations
Result: act
if random.uniform(0, 1) < epsilon (the exploration probability) then
       Explore: select a random action act;
else
       Exploit: select the action act with the max q-value;
       if act leads to adversary then
             select the action act with the second max q-value;
       end if
end if
Algorithm 2 Modified Q-learning
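The action selection of Algorithm 2 can be sketched in Python as below. The function name `select_action` and the representation of the observation as a set of actions that would lead into the adversary's cell (`leads_to_adversary`) are assumptions made for illustration. As in the pseudocode, only the exploitation branch is guarded; a random exploratory action is not filtered.

```python
import random


def select_action(q_values, leads_to_adversary, epsilon, rng=random):
    """Epsilon-greedy choice that skips the greedy action if it hits the adversary.

    q_values: dict mapping action -> Q-value in the current state.
    leads_to_adversary: set of actions whose target cell holds the adversary.
    """
    if rng.random() < epsilon:
        # Explore: uniformly random action (not filtered, as in Algorithm 2).
        return rng.choice(list(q_values))
    # Exploit: rank actions from highest to lowest Q-value.
    ranked = sorted(q_values, key=q_values.get, reverse=True)
    best = ranked[0]
    if best in leads_to_adversary and len(ranked) > 1:
        # Greedy action would cause a collision: fall back to the second best.
        return ranked[1]
    return best


# Hypothetical Q-values for one state; the adversary sits in the cell to the right.
q = {"right": 2.0, "down": 1.5, "up": 0.1, "left": 0.0}
safe_choice = select_action(q, {"right"}, epsilon=0.0)
```

With exploration switched off, the agent takes "down" instead of the greedy "right", since "right" would lead into the adversary.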

III-D Formal method

We employ formal methods to investigate the guarantees that can be given for the proposed defense mechanisms, that is, whether we are able to provide formal guarantees on the agent’s behaviour in the environment. To do so, we evaluate the effectiveness of the defense approaches through probabilistic model checking. We verify the agent’s reachability properties using the PRISM model checker. We extracted a DTMC after the training process, reflecting the agent’s behavior. Since, at each location, the Q-learning algorithm uses the Q-function to determine the agent’s next action, the DTMC is constructed by relying on that Q-function to compute the probability of taking a specific action for each transition. We construct the latter for the next episodes after episodes. The nodes of the DTMC are the agent’s locations in the environment. The agent’s reachability properties are verified on the DTMC. We measure the safety of the agent through the probability of reaching the goal within a given number of steps in the presence of adversaries.
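One possible way to turn a Q-table into DTMC transition probabilities is sketched below. It is an assumption-laden illustration, not the construction used in the experiments: it assumes that action probabilities follow the $\epsilon$-greedy rule, that moves are deterministic, that a move off the grid leaves the agent in place, and it ignores the adversary. All names (`action_probabilities`, `dtmc_row`, `grid_step`, `MOVES`) are hypothetical.

```python
def action_probabilities(q_values, epsilon):
    """Epsilon-greedy action distribution for one state."""
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    # Every action gets epsilon / n; the greedy one gets the remaining mass.
    return {
        a: epsilon / n + (1.0 - epsilon if a == greedy else 0.0)
        for a in q_values
    }


MOVES = {"right": (0, 1), "up": (-1, 0), "down": (1, 0), "left": (0, -1)}


def grid_step(state, action, size=6):
    """Deterministic move on a size x size grid."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    # Assumption: a move off the grid leaves the agent where it is.
    return (nr, nc) if 0 <= nr < size and 0 <= nc < size else state


def dtmc_row(state, q_values, epsilon, step):
    """Outgoing transition probabilities of `state` as a dict next_state -> prob."""
    row = {}
    for action, p in action_probabilities(q_values, epsilon).items():
        nxt = step(state, action)
        row[nxt] = row.get(nxt, 0.0) + p
    return row


# Hypothetical Q-values for the upper-left cell of a 6 x 6 grid.
row = dtmc_row(
    (0, 0), {"right": 1.0, "up": 0.0, "down": 0.5, "left": 0.0}, 0.1, grid_step
)
```

Each row built this way sums to 1, as required of a DTMC transition matrix; here the two moves that leave the grid ("up" and "left") both contribute to the self-loop on (0, 0).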

We want to evaluate the agent’s policy augmented with the employed defense mechanisms and trained in an adversarial environment. The agent’s policy is verified when acting against the learnable adversary, as it is the most challenging adversary in our study. Its reachability properties are formalized in PRISM using PCTL as follows:

  • P1 : 
    With probability the agent will always reach its goal.

  • P2 : 
    What is the probability that the agent reaches its goal within 100 steps?

  • P3 : 
    What is the probability that the agent reaches its goal?

where is the goal’s location.
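As an illustration of how the two quantitative queries read in PCTL, the sketch below assumes an atomic proposition $goal$ labelling the goal state (a hypothetical label name) and uses the common abbreviation $F\,\Phi \equiv true\ U\ \Phi$, together with the quantitative operator $P_{=?}$ supported by PRISM. It is a reading of the informal descriptions of P2 and P3 above, not a transcription of the original formulas:

```latex
\text{P2: } P_{=?}\left[\, F^{\leq 100}\ goal \,\right]
\qquad
\text{P3: } P_{=?}\left[\, F\ goal \,\right]
```

P2 bounds the number of steps, whereas P3 asks for the probability of eventually reaching the goal, so the value of P3 is always at least that of P2.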

IV Experiments

In this section, we first describe our research questions. Then, we present the experimental design for collecting results. Finally, we explain the experiments conducted to assess the agent’s behaviour against adversaries and answer our research questions (RQs). The interested reader can refer to the source code (https://github.com/movingadv/On-Assessing-The-Safety-of-Reinforcement-Learning-algorithms-Using-Formal-Methods).

IV-A Research questions

We formulate the following research questions to address the proper assessment of the agent’s safety:
(RQ1) What impact do moving adversaries have on the learning of the RL agent?
In this research question, we want to determine the level of harm that a moving adversary (learnable or non-learnable) can cause.
(RQ2) To what extent can PBRS-based defense mechanisms improve the agent’s policy so that the agent can achieve its objective in presence of adversaries? This research question aims to evaluate the effectiveness of PBRS as a defense mechanism against the learnable adversary.
(RQ3) To what extent can the direct modification of the Q-learning algorithm improve the learning policy so that the agent can achieve its objective in the presence of adversaries? This research question aims to evaluate the effectiveness of the modified Q-learning algorithm as a defense mechanism against the learnable adversary.
(RQ4) What are the formal guarantees that can be provided to the defense mechanisms regarding the agent safety?
In this research question, we provide lower and upper bounds on the agent’s reachability properties. The latter assess the effectiveness of the defense strategies.

IV-B Experimental design

First of all, we set up an environment for our experiments. We implement a 6 × 6 grid world where the agent starts from the upper left of the grid and has to make its way to the goal’s location, i.e., the lower right of the grid. Figure 2(a) displays a possible location of the agent in the environment, and the black flags represent its possible locations at the next step after taking an action. It also shows the environment and the associated cell numbers. The actions correspond to moving in the four compass directions (right, up, down, left).

For every non-goal cell, the agent gets a reward of , and when reaching the goal cell. The agent will get a reward when a collision with the adversary occurs. The objective is to reach the goal using the shortest path and without interruption, which increases the cumulative reward. We define the cumulative reward at episode number , as the sum of the rewards obtained from episode number until episode number . Moreover, the agent’s behaviour is considered safe if there is no collision with the adversary. The agent is trained via the Q-learning algorithm with an $\epsilon$-greedy policy. All the results reported for this environment are averages over independent runs, and each run lasts for episodes. Three events may terminate an episode:

  1. The agent reaches its goal, cell ,

  2. A timeout of steps per episode is reached. In such a case, no additional reward is given to either the agent or the adversary for that episode,

  3. A collision occurred between the agent and the adversary.

For (RQ1), we have considered that the environment is threatened by three kinds of moving adversaries:

  • A 3-cell patrolling adversary, patrolling successively back and forth on the cells (5,4), (4,4), (4,5). The adversary moves around these cells at every episode, as shown in Figure 2(b).

  • A 5-cell patrolling adversary, patrolling successively back and forth on the cells (5,4), (4,4), (4,5), (3,5), (3,4). The adversary moves around these cells at every episode, as shown in Figure 2(c).

  • A learning adversary, which also uses Q-learning with an $\epsilon$-greedy policy, starting at the cell (0,0).
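The back-and-forth movement of the patrolling adversaries can be sketched as a simple cyclic generator. The helper name `patrol` is hypothetical, and the sketch yields one cell per call without fixing whether the adversary advances once per step or once per episode.

```python
from itertools import cycle, islice


def patrol(cells):
    """Iterate over cells back and forth: c0, c1, ..., cn, c(n-1), ..., c1, c0, ..."""
    cells = list(cells)
    # Forward pass followed by the return pass, excluding both end points
    # so that they are not visited twice in a row.
    path = cells + cells[-2:0:-1]
    return cycle(path)


three_cell = patrol([(5, 4), (4, 4), (4, 5)])
first_six = list(islice(three_cell, 6))

five_cell = patrol([(5, 4), (4, 4), (4, 5), (3, 5), (3, 4)])
first_nine = list(islice(five_cell, 9))
```

For the 3-cell patrol, the first six positions are (5,4), (4,4), (4,5), (4,4), (5,4), (4,4): the adversary reaches the far end, turns around, and repeats.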

Fig. 2: a) an example of the agent location, its surrounding and the goal location, b) 3-cell patrolling adversary trajectory, c) 5-cell patrolling adversary trajectory, d) The agent’s path with the modified Q-learning process employed.

The adversaries’ goal is to create as many collisions as possible with the agent. They get a reward of when a collision occurs and otherwise in each episode. By creating collisions with the agent, they demonstrate unsafe behavior throughout training. We have collected the rate of collisions caused by the adversarial policies as well as the cumulative reward obtained by the agent while acting against such adversaries. The rate of collisions represents the average number of collisions in the last episodes.
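The collision rate described above can be tracked with a sliding window over the most recent episodes. In the sketch below, the class name `CollisionRate` and the `window` parameter are assumptions, and the window length is left to the caller. Since an episode ends as soon as a collision occurs, each episode contributes at most one collision, so the rate is the fraction of recent episodes that ended in a collision.

```python
from collections import deque


class CollisionRate:
    """Fraction of the last `window` episodes that ended in a collision."""

    def __init__(self, window):
        # A bounded deque drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)

    def record(self, collided):
        self.outcomes.append(1 if collided else 0)

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


# Hypothetical run: five episodes with a window of four.
tracker = CollisionRate(window=4)
for collided in (True, False, False, False, True):
    tracker.record(collided)
```

Only the last four outcomes are kept, so the first collision has dropped out of the window and the rate is 1/4.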

For (RQ2), we have collected the collision rate and the cumulative reward obtained by the agent when facing the learning adversary as it is the most challenging adversary among those we presented in this paper. Also, the results are collected when the agent’s strategy is augmented with the PBRS-based defense mechanism using two different potential functions: the probability of avoiding a collision and the distance to the goal.

For (RQ3), we have implemented the modified Q-learning algorithm presented in Section III. We run this defense mechanism against the learning adversary and evaluate the agent’s behaviour through the collision rate and the cumulative reward. Additionally, for (RQ2) and (RQ3), the results are collected when the agent can observe its surrounding cells. An example of the agent and its surroundings is shown in Figure 2(a).

For (RQ4), we have conducted a verification process using the PRISM model checker (https://www.prismmodelchecker.org). We verify the agent’s behaviour in four scenarios; in all of them, the agent can observe its neighbouring cells:

  • Scenario 1: The agent is facing the learning adversary,

  • Scenario 2: The agent is acting against the learning adversary and the PBRS-based defense mechanism is employed using the Manhattan distance to the goal as the potential function,

  • Scenario 3: The agent is facing the learning adversary and the PBRS-based defense mechanism is employed using the probability of avoiding a collision as the potential function,

  • Scenario 4: The agent is acting against the learning adversary and the modified Q-learning algorithm is employed.

For simplicity, the agent’s states are encoded into PRISM from to . With the initial state and the goal state.

IV-C Results of the evaluation

In the following, we present the obtained results and describe answers to our four research questions.

RQ1: What impact do moving adversaries have on the learning policy of the RL agent?
At episode #, the agent faces collision rate when performing against the learning adversary, for the 3-cell patrolling adversary and for the 5-cell patrolling adversary. These results are collected when the agent and the learning adversary do not have any observations. The learning adversary appears to be the most harmful in terms of the collision rate with the agent. When allowing the agent to have observations, we witness almost zero collisions per episode when the agent faces the 5-cell patrolling adversary and less than collision rate when facing the 3-cell patrolling adversary. Figure 3(a) reports the mean collision rate for all episodes.

| Adversary | Collision rate | SD |
|---|---|---|
| 5-cell patrolling adversary | 0.00 | 0.00 |
| 3-cell patrolling adversary | 0.07 | 0.00 |
| Learning adversary with observations | 0.39 | 0.35 |
| Learning adversary without observations | 0.41 | 0.15 |

TABLE I: Collision rate between the agent and the adversaries at Episode #2000, when the agent has observations and no defense mechanism is applied against these adversaries.

Table I includes the value of the collision rate between the agent and the adversary and its standard deviation at the end of the experiment (i.e., Episode #2000) for the patrolling adversaries. No defense mechanisms have been applied against the patrolling adversaries due to the low collision rate. These values underline the effect of the observations on reducing the collision rate when the agent faces the 3-cell and 5-cell patrolling adversaries. Figures 3(b) and 3(c) show the cumulative reward obtained by the agent and the patrolling adversaries over runs. Numerical values along with standard deviations are reported in Table II. The larger cumulative reward of the agent compared to the patrolling adversaries indicates that the agent can reach its goal without interruption despite the presence of patrolling adversaries in the environment. The patrolling adversaries represent less of a threat to the agent’s safety once observations are added to the agent’s knowledge.

Fig. 3: a) Collision rate between the agent and patrolling adversaries, b) Cumulative reward of the agent and the 3-cell patrolling adversary, c) Cumulative reward of the agent and the 5-cell patrolling adversary.
| Configuration | Episode #500 | Episode #1000 | Episode #1500 | Episode #2000 |
|---|---|---|---|---|
| Agent with observations | 7892.12, 86.5 ± 4 | 20979.87, 63.25 ± 69.8 | 33515.62, -77 ± 0 | 45234.62, 91 ± 0 |
| 3-cell patrolling adversary | 50, 0 ± 0 | 4037.5, 12.5 ± 35.35 | 7900, 0 ± 0 | 15000, 0 ± 0 |
| Agent with observations | 5607.5, -14 ± 15.40 | 11043.75, -53.5 ± 10.13 | 19514.75, -53 ± 3 | 28974.75, 91 ± 0 |
| 5-cell patrolling adversary | 0, 0 ± 0 | 50, 0 ± 0 | 100, 0 ± 0 | 100, 0 ± 0 |

TABLE II: Cumulative reward of the agent and the 3-cell and 5-cell patrolling adversaries (cumulative reward, mean reward ± standard deviation). Each agent row is paired with the adversary row below it.

Nevertheless, the agent’s policy with observations is less successful in the presence of the learning adversary. The learning adversary learns to predict the agent’s moves, which results in a high collision rate, negative rewards, and thus a lower cumulative reward for the agent than against the patrolling adversaries. We ran two experiments on the behaviour of the learning adversary to find out whether a learning adversary with observations is more harmful. Figure 4(a) reports, at each episode, the collision rate between the agent and the learning adversary along with the corresponding standard deviation, when the agent and the learning adversary have observations and when they do not. Figure 4(b) reports the cumulative reward with the corresponding standard deviation for these experiments. Table I and Table III report, respectively, the collision rate between the agent and the adversary and the cumulative reward, with their standard deviations, at episode #2000. Even with observations, the learning adversary does not appear to be more harmful. Its observations do not affect the agent’s behaviour, since the agent still sees its surroundings and can manage to avoid the adversary. So, for the rest of our experiments, we focus on a learning adversary without observations.

| Configuration | Cumulative reward | Mean reward | SD |
|---|---|---|---|
| Agent | -208902.67 | -101.50 | 0.86 |
| Learning adversary without observations | 76557.14 | 28.57 | 48.79 |
| Learning adversary with observations | 45625.00 | 25.00 | 46.30 |

TABLE III: Comparison between the environment configurations at Episode #2000. The agent can observe its surroundings.

RQ2: To what extent can PBRS-based defense mechanisms improve the agent’s policy so that the agent can achieve its objective in presence of adversaries?
The collision rates after applying both PBRS-based defense mechanisms are shown in Figure 5 along with their standard deviations. As shown in Table IV, the collision rate obtained by the agent equipped with defense mechanisms at episode #2000 indicates that these mechanisms were able to reduce collisions between the agent and the learning adversary. However, PBRS with the probability of avoiding a collision as a potential function did not contribute to improving the learning, as the cumulative reward still decreases during training. Figure 6(b) reports this decrease in cumulative reward and the corresponding standard deviation. The reason is that the location of the learning adversary changes at each step, which reduces the accuracy of the estimated probability of avoiding a collision at the next step.

Fig. 4: a) Collision rate between the agent and the adversary with and without observation, b) Cumulative reward of the agent and the learning adversary with and without observation.
Fig. 5: Comparison between collision rates after applying the defense mechanisms against learning adversary.
| Defense mechanism | Episode #2000 |
|---|---|
| Modified Q-learning | 0.00 ± 0.14 |
| PBRS (distance to goal) | 0.05 ± 0.14 |
| PBRS (probability of collisions) | 0.30 ± 0.30 |

TABLE IV: Collision rate of the environment configurations after applying the defense mechanisms (collision rate ± standard deviation).
Fig. 6: a) The potential function is the distance to the goal, b) The potential function is the probability of avoiding a collision.
Fig. 7: Cumulative reward earned by the adversary and the agent with the modified Q-learning process

PBRS with the distance of the agent to the goal as a potential function performs better than the previous one. The agent is capable of reaching the goal while lowering its collision rate with the learning adversary and increasing its cumulative reward. Figure 6(a) reports the cumulative reward obtained by the agent with the corresponding standard deviation when the distance to the goal location is employed as a potential function. Table V reports, at episode #2000, a better cumulative reward when using PBRS with the distance of the agent to the goal as a potential function than when using the probability of avoiding a collision.

RQ3: To what extent can the direct modification of the Q-learning algorithm improve the learning policy so that the agent can achieve its objective in the presence of adversaries?
The collision rate collected after applying the modified Q-learning process is shown in Figure 5. We observe a low collision rate (Table IV) but a decreasing cumulative reward, as shown in Figure 7. The modified Q-learning algorithm appears to be conservative, as the priority for the agent is to avoid the learning adversary at the expense of not reaching the goal. Figure 2(d) shows an example of a path taken by the agent when observing the adversary in its surroundings. The agent was on its way to the goal location but, since it has to avoid the adversary, the chosen path eventually does not include the goal. Also, we observe in some cases that the standard deviation is greater than the mean. This is due to the disparity between the rewards obtained by the adversaries and by the agent.

Configuration | Cumulative reward | Mean reward | SD
Agent with observations + PBRS (distance to goal) | 45240.10 | 29.02 | 62.43
Learning adversary without observations | 10050 | 2.00 | 22.73
Agent with observations + PBRS (probability of collision) | -126488.30 | 28.73 | 681.28
Learning adversary without observations | 63330 | 10.00 | 666.67
Agent with observations + Modified Q-learning | -200476.54 | -101.00 | 12.22
Learning adversary without observations | 142 | 0.00 | 6.72
TABLE V: Comparison between different environment configurations.

RQ4: What formal guarantees can be provided for the defense mechanisms regarding the agent’s safety?
We choose reachability analysis to verify the capability of the agent to reach its goal without encountering an obstacle. The results of the verification are formal guarantees regarding the safety of the trained agent against the moving adversaries. The verification process consists of checking, for each property, the paths that satisfy it in the following scenarios. We have verified properties P1, P2 and P3 on Scenario 1, Scenario 2, Scenario 3 and Scenario 4, and report the results in Table VI. The decrease of the cumulative reward in Scenario 1, Scenario 3 and Scenario 4 is consistent with the results of the verification process, which shows that the environment remains unsafe after applying the defense mechanisms in these scenarios. In contrast, in Scenario 2 the applied defense mechanism helps the agent to reach its goal with a minimum probability of . In other words, within steps the agent has chance to complete its task. Its chance also increases when the timeout happens more frequently.

Scenario | P1 | P2 | P3
Scenario 1 | False | 0 | 0
Scenario 2 | True | 0.1 | 1
Scenario 3 | False | 0 | 0
Scenario 4 | False | 0 | 0
TABLE VI: Results of probabilistic model checking.
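
To make the kind of quantity reported above more concrete, the following is a minimal sketch of a step-bounded reachability computation on a toy discrete-time Markov chain. It is not the model or the properties P1 to P3 used in the experiments: the four-state chain, the state labels and the function name `bounded_reach_prob` are assumptions introduced only for illustration. The computed value corresponds to what a probabilistic model checker would return for a bounded "avoid collision until goal" query.

```python
from typing import Dict, Set

# Hypothetical toy chain: state -> {successor: transition probability}.
# 0 = start, 1 = safe corridor, 2 = goal (absorbing), 3 = collision (absorbing).
toy_chain: Dict[int, Dict[int, float]] = {
    0: {1: 0.5, 3: 0.5},
    1: {2: 1.0},
    2: {2: 1.0},
    3: {3: 1.0},
}


def bounded_reach_prob(
    chain: Dict[int, Dict[int, float]],
    goal: Set[int],
    unsafe: Set[int],
    steps: int,
) -> Dict[int, float]:
    """Probability, per state, of reaching `goal` within `steps` transitions
    without ever visiting an `unsafe` state."""
    # With zero steps left, only goal states satisfy the property.
    prob = {s: (1.0 if s in goal else 0.0) for s in chain}
    for _ in range(steps):
        nxt: Dict[int, float] = {}
        for s, successors in chain.items():
            if s in goal:
                nxt[s] = 1.0   # already at the goal
            elif s in unsafe:
                nxt[s] = 0.0   # a collision violates the property
            else:
                # One-step lookahead over the transition distribution.
                nxt[s] = sum(p * prob[t] for t, p in successors.items())
        prob = nxt
    return prob


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, bounded_reach_prob(toy_chain, {2}, {3}, k)[0])
```

In this toy chain the goal is two transitions away from the start, so the probability from state 0 is zero for a one-step bound and rises to 0.5 once the bound allows two steps, which mirrors how the agent's chance of completing its task depends on the step bound.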

V Related work

An RL process is usually formulated as a Markov Decision Process (MDP), and one can see its components as opportunities for an adversary to attack. Hence, the ability to attack RL applications sometimes comes from the interdependence of different actions taken by the agent [14]. Initially, the agent is in an exploration phase, making it difficult to differentiate between legitimate and illegitimate actions. Also, the policy followed by the agent to achieve the objective may be deterministic or stochastic, with loopholes that can be exploited by an attacker. Finally, the manipulation of the environment modifies the observations and the actions taken by the agent. If those manipulations lead to unsafe states, the final reward obtained by the agent can be corrupted. By compromising the policy, the overall performance of the agent can be degraded. Lin et al. [21] showed the presence of adversarial samples on neural network policies by leveraging a frame prediction module. Moreover, RL agents face misspecifications [22] due to setting changes between the training environment and the test environment. These modelling errors have a negative impact on the agent’s policy and future rewards [26]. Adversarial samples can also be detected by introducing agents with adversarial policies in the environment. Wang et al. [40] propose techniques to uncover flaws in the training of the agent by introducing an attacker which leads the agent to failure by perturbing the observed rewards. Pinto et al. [32] use a zero-sum game configuration to model failures inside the environment, in which antagonist players follow a policy that is harmful to the agent. Our work is similar to these previous works since we introduce an adversary which hinders the agent in achieving its goal.

Several works have investigated mechanisms to help agents face perturbations that might occur during training. Augmenting the reward distribution of the agent can improve its learning policy when done appropriately. Other works on reward shaping to improve the learning policy include automatic reward shaping approaches [8] [24], multi-agent reward shaping [7] [37], and novel approaches such as belief reward shaping [23], ethics shaping [41], and reward shaping via meta-learning [42]. In our work, we implement PBRS as our defense strategy against adversarial policies. Our potential function is partially inspired by the work in [27] and adapted to model-free RL. The authors propose the Manhattan distance between the agent’s location and its goal as a potential function. They show that it is an effective approach that researchers can implement when trying to improve their RL systems.
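
As an illustration of the distance-based potential discussed above, the following minimal sketch applies the potential-based shaping term of [27], F(s, s') = γΦ(s') − Φ(s), on a grid. The sign convention (the potential is the negated Manhattan distance, so that states closer to the goal have a higher potential), the default discount factor and all function names are assumptions made for this example and are not taken from the implementation evaluated in this paper.

```python
from typing import Tuple

Cell = Tuple[int, int]  # (row, column) position on the grid


def manhattan(a: Cell, b: Cell) -> int:
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def potential(state: Cell, goal: Cell) -> float:
    """Assumed potential: negated distance, so it grows towards the goal."""
    return -float(manhattan(state, goal))


def shaped_reward(
    reward: float,
    state: Cell,
    next_state: Cell,
    goal: Cell,
    gamma: float = 0.9,  # illustrative discount factor
) -> float:
    """Environment reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    shaping = gamma * potential(next_state, goal) - potential(state, goal)
    return reward + shaping


if __name__ == "__main__":
    goal = (0, 0)
    # A step towards the goal receives a bonus, a step away a penalty.
    print(shaped_reward(0.0, (2, 2), (2, 1), goal, gamma=1.0))
    print(shaped_reward(0.0, (2, 2), (2, 3), goal, gamma=1.0))
```

Because the shaping term is a difference of potentials, it rewards progress towards the goal without changing which policies are optimal, which is the invariance property established in [27].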

Adversarial training, which consists of retraining a machine learning model with adversarial examples to increase its robustness against such examples, can be implemented as a defense mechanism. Researchers in [16], [30], [9], [3] proposed to retrain the model with perturbations generated by adding noise to states and rewards during training. In game-theoretic approaches, the attacker is considered to be competing in a game with the agent, and both adjust their choices based on the payoff during training. Ogunmolu et al. [28] propose a minimax iterative dynamic game framework for designing robust policies in the presence of adversarial inputs. Pinto et al. [32] model the interactions between the agent and the attacker as a zero-sum game, in which the agent improves its policy by trying to win against the attacker. In the presence of modelling errors or misspecifications of training parameters, one can implement a robust method that incites the agent to learn only the optimal policy [22], [26]. Li et al. [20] propose a robust controller that allows the agent to face model uncertainties and external disturbances. In comparison to adversarial training, in our work the agent is retrained in the presence of adversaries, but with additional knowledge about them. This additional information is incorporated in its training algorithm.

The above studies have contributed to improving the RL agent’s learning. However, they lack guarantees on the safety of the agent, which reduces their applicability. In this paper, we propose to use formal methods, which have provided strong guarantees on the expected behaviour of systems, including software systems and critical systems. Some researchers have used formal methods to improve the agent’s policy, considering the MDP of the environment to build a verification strategy. Mason et al. [25] generate a set of policies and force the agent to learn only policies that satisfy a set of predefined constraints. Probabilistic model checking has also been applied to verify the satisfiability of safety requirements [11], [29]. Li et al. [19] formulate safety requirements as temporal logic formulas, which they use to construct a finite-state automaton. The automaton then synthesizes a safe optimal controller which can notify when the system enters traps or executes tasks that are impossible to complete. Alshiekh et al. [1] propose a method for the RL agent to learn optimal policies on the MDP of the environment while enforcing temporal logic properties. These works assume that the model of the environment is known. In our work, the agent is trained through Q-learning, a model-free RL algorithm, and the policy is represented by a state-action value function (Q function). In [4], the authors provide probabilistic guarantees by computing probabilistic reachability properties using a value iteration algorithm. Our work is similar in that we also provide probabilistic reachability properties for a set of states, but we rely on a model checker to compute those probabilities. Moreover, we are able to provide formal guarantees after the verification process on a set of desirable properties.

VI Conclusion and Discussions

In this paper, we have presented an approach to assess the safety of an RL agent using formal methods. The proposed approach consists of designing moving adversaries that harm the agent’s behaviour, which leads to implementing defense mechanisms to strengthen the agent’s policy. The experiments show that the defense mechanisms improve learning by reducing the collision rate with the adversaries. We employ probabilistic model checking on four scenarios, as described in the Experiments section (Section IV), and consider simple reachability properties to provide provable guarantees on the safety of the agent’s behaviour. A learnable adversary, which is the most harmful adversary that we design, demonstrates resilience to our defense mechanisms in comparison to the non-learnable adversaries. This is due to its high capability to threaten learning. Nevertheless, according to our formal verification, none of the evaluated scenarios is safe for the agent. For future work, we plan to investigate more effective defense mechanisms against the learnable adversary. The main idea is to deceive the adversary so that it cannot predict the agent’s behaviour and cause collisions. Moreover, we plan to examine the proposed adversaries and defense mechanisms in more complex environments.


  • [1] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018) Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [2] C. Baier and J. Katoen (2008) Principles of model checking. MIT press.
  • [3] V. Behzadan and A. Munir (2017) Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344.
  • [4] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova (2019) Reinforcement learning with probabilistic guarantees for autonomous driving. arXiv preprint arXiv:1904.07189.
  • [5] R. Cimurs, J. H. Lee, and I. H. Suh (2020) Goal-oriented obstacle avoidance with deep reinforcement learning in continuous action space. Electronics 9 (3), pp. 411.
  • [6] P. Dayan and C. Watkins (1992) Q-learning. Machine learning 8 (3), pp. 279–292.
  • [7] S. Devlin and D. Kudenko (2011) Theoretical considerations of potential-based reward shaping for multi-agent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems, pp. 225–232.
  • [8] M. Grzes and D. Kudenko (2008) Learning potential for reward shaping in reinforcement learning with tile coding. In Proceedings AAMAS 2008 Workshop on Adaptive and Learning Agents and Multi-Agent Systems (ALAMAS-ALAg 2008), pp. 17–23.
  • [9] Y. Han, B. I. Rubinstein, T. Abraham, T. Alpcan, O. De Vel, S. Erfani, D. Hubczenko, C. Leckie, and P. Montague (2018) Reinforcement learning for autonomous defence in software-defined networking. In International Conference on Decision and Game Theory for Security, pp. 145–165.
  • [10] H. Hansson and B. Jonsson (1994) A logic for reasoning about time and reliability. Formal aspects of computing 6 (5), pp. 512–535.
  • [11] M. Hasanbeig, A. Abate, and D. Kroening (2018) Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099.
  • [12] A. Hinton, M. Kwiatkowska, G. Norman, and D. Parker (2006) PRISM: a tool for automatic verification of probabilistic systems. In International conference on tools and algorithms for the construction and analysis of systems, pp. 441–444.
  • [13] Y. Hu, W. Wang, H. Jia, Y. Wang, Y. Chen, J. Hao, F. Wu, and C. Fan (2020) Learning to utilize shaping rewards: a new approach of reward shaping. arXiv preprint arXiv:2011.02669.
  • [14] I. Ilahi, M. Usama, J. Qadir, M. U. Janjua, A. Al-Fuqaha, D. T. Hoang, and D. Niyato (2020) Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv preprint arXiv:2001.09684.
  • [15] P. Kormushev, S. Calinon, and D. G. Caldwell (2013) Reinforcement learning in robotics: applications and real-world challenges. Robotics 2 (3), pp. 122–148.
  • [16] J. Kos and D. Song (2017) Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452.
  • [17] M. Kwiatkowska, G. Norman, and D. Parker (2010) Advances and challenges of probabilistic model checking. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1691–1698.
  • [18] A. L’heureux, K. Grolinger, H. F. Elyamany, and M. A. Capretz (2017) Machine learning with big data: challenges and approaches. IEEE Access 5, pp. 7776–7797.
  • [19] X. Li, Z. Serlin, G. Yang, and C. Belta (2019) A formal methods approach to interpretable reinforcement learning for robotic planning. Science Robotics 4 (37).
  • [20] Z. Li, S. Xue, W. Lin, and M. Tong (2018) Training a robust reinforcement learning controller for the uncertain system based on policy gradient method. Neurocomputing 316, pp. 313–321.
  • [21] Y. Lin, M. Liu, M. Sun, and J. Huang (2017) Detecting adversarial attacks on neural network policies with visual foresight. arXiv preprint arXiv:1710.00814.
  • [22] D. J. Mankowitz, N. Levine, R. Jeong, Y. Shi, J. Kay, A. Abdolmaleki, J. T. Springenberg, T. Mann, T. Hester, and M. Riedmiller (2019) Robust reinforcement learning for continuous control with model misspecification. arXiv preprint arXiv:1906.07516.
  • [23] O. Marom and B. Rosman (2018) Belief reward shaping in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • [24] B. Marthi (2007) Automatic shaping and decomposition of reward functions. In Proceedings of the 24th International Conference on Machine learning, pp. 601–608.
  • [25] G. R. Mason, R. C. Calinescu, D. Kudenko, and A. Banks (2017) Assured reinforcement learning with formally verified abstract policies. In 9th International Conference on Agents and Artificial Intelligence (ICAART).
  • [26] J. Morimoto and K. Doya (2005) Robust reinforcement learning. Neural computation 17 (2), pp. 335–359.
  • [27] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
  • [28] O. Ogunmolu, N. Gans, and T. Summers (2018) Minimax iterative dynamic game: application to nonlinear robot control tasks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6919–6925.
  • [29] S. Pathak, L. Pulina, and A. Tacchella (2018) Verification and repair of control policies for safe reinforcement learning. Applied Intelligence 48 (4), pp. 886–908.
  • [30] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary (2018) Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2040–2042.
  • [31] M. Petrik and B. Scherrer (2009) Biasing approximate dynamic programming with a lower discount factor. In Advances in neural information processing systems, pp. 1265–1272.
  • [32] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2817–2826.
  • [33] D. L. Poole and A. K. Mackworth (2010) Artificial intelligence: foundations of computational agents. Cambridge University Press.
  • [34] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017 (19), pp. 70–76.
  • [35] B. Sangiovanni, A. Rendiniello, G. P. Incremona, A. Ferrara, and M. Piastra (2018) Deep reinforcement learning for collision avoidance of robotic manipulators. In 2018 European Control Conference (ECC), pp. 2063–2068.
  • [36] A. Singla, S. Padakandla, and S. Bhatnagar (2019) Memory-based deep reinforcement learning for obstacle avoidance in uav with limited environment knowledge. IEEE Transactions on Intelligent Transportation Systems.
  • [37] F. Sun, Y. Chang, Y. Wu, and S. Lin (2018) Designing non-greedy reinforcement learning agents with diminishing reward shaping. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 297–302.
  • [38] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press.
  • [39] S. Theie Havenstrøm, A. Rasheed, and O. San (2020) Deep reinforcement learning controller for 3d path-following and collision avoidance by autonomous underwater vehicles. arXiv e-prints, pp. arXiv–2006.
  • [40] J. Wang, Y. Liu, and B. Li (2020) Reinforcement learning with perturbed rewards. In AAAI, pp. 6202–6209.
  • [41] Y. Wu and S. Lin (2018) A low-cost ethics shaping approach for designing reinforcement learning agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • [42] H. Zou, T. Ren, D. Yan, H. Su, and J. Zhu (2019) Reward shaping via meta-learning. arXiv preprint arXiv:1901.09330.