Reconnaissance and Planning algorithm for constrained MDP

09/20/2019 ∙ by Shin-ichi Maeda, et al. ∙ Preferred Infrastructure The University of Tokyo 5

Practical reinforcement learning problems are often formulated as constrained Markov decision process (CMDP) problems, in which the agent has to maximize the expected return while satisfying a set of prescribed safety constraints. In this study, we propose a novel simulator-based method to approximately solve a CMDP problem without making any compromise on the safety constraints. We achieve this by decomposing the CMDP into a pair of MDPs: a reconnaissance MDP (R-MDP) and a planning MDP (P-MDP). The purpose of the R-MDP is to evaluate the set of actions that are safe, and the purpose of the P-MDP is to maximize the return while using only the actions authorized by the R-MDP. The R-MDP can define a set of safe policies for any given set of safety constraints, and this set of safe policies can be reused to solve another CMDP problem with a different reward. Our method is not only computationally less demanding than previous simulator-based approaches to CMDP, but also capable of finding a competitive reward-seeking policy in high-dimensional environments, including those involving multiple moving obstacles.




1 Introduction

With recent advances in reinforcement learning (RL), it is becoming possible to learn complex reward-maximizing policies in increasingly complex environments Mnih et al. (2015); Silver et al. (2016); Andrychowicz et al. (2018); James et al. (2018); Kalashnikov et al. (2018). However, not all policies found by standard RL methods are physically safe in real-world applications, and a naive application of RL can lead to catastrophic results. This has long been one of the greatest challenges in the application of reinforcement learning to mission-critical systems. In a popular setup, one assumes a Markovian system together with a predefined set of dangerous states that must be avoided, and formulates the problem as a type of constrained Markov decision process (CMDP) problem. That is, based on the classical RL notation in which $\pi$ represents a policy of the agent, we aim to solve

$$\operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{\pi}[R(h)] \quad \text{s.t.} \quad \mathbb{E}_{\pi}[D(h)] \le c, \tag{1}$$

where $h$ is a trajectory of state-action pairs, $R(h)$ is the total return that can be obtained by $h$, $D(h)$ is the measure of how dangerous the trajectory $h$ is, and $c$ is the prescribed safety threshold. Most reinforcement learning methods solve the optimization problem over $\pi$ by a sequence of iterative updates. The difficulty of the CMDP problem lies in the evaluation of the safety of the suggested $\pi$ at every update. This evaluation requires integrals over future possibilities, whose cardinality increases exponentially with the length of the future and the number of randomly moving objects in the environment.

Lagrange multiplier-based methods Altman (1999); Geibel and Wysotzki (2005) tackle this problem by aiming to satisfy the constraint softly, and guarantee that the obtained solution is safe if the optimal multiplier $\lambda$ is chosen. Trust region optimization (TRO) Achiam et al. (2017a); Chow et al. (2019) and the methods based on Lyapunov functions Chow et al. (2018, 2019) take the approach of constructing, at each update step, a pool of policies that are most likely safe. The precise construction of the safe pool and the search for the optimal hyper-parameter, however, are computationally heavy tasks in high-dimensional state spaces, and a strong regularity assumption about the system becomes necessary to use these methods in practice.

The presence of a good simulator is particularly important for safe applications when the event to be avoided is a "rare" catastrophic accident, because an immense number of samples will be required to collect information about the cause of the accident.

Model Predictive Control (MPC) is perhaps the oldest family of simulator-based methods Falcone et al. (2007); Wang and Boyd (2010); Di Cairano et al. (2013); Weiskircher et al. (2017) for carrying out tasks under safety constraints. MPC uses the philosophy of receding horizon and predicts the future outcome of actions in order to determine what action the agent should take in the next step. If the future horizon to consider is sufficiently short and the dynamics are deterministic, the prediction can often be approximated well by linear dynamics, which can be evaluated instantly. Because MPC must finish its assessment of the future before taking an action, its performance is limited by the speed of the predictions. If only a short horizon is taken into account, MPC may suggest a move to a state leading to a catastrophe.
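The short-horizon failure mode described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the one-dimensional car, the deterministic `step` model, the wall position, and the function names are hypothetical and are not the MPC implementation used in the experiments.

```python
def mpc_action(state, actions, step, is_crash, reward, depth):
    """Pick the best-reward action that stays crash-free over a `depth`-step lookahead.

    `step(state, action)` is a deterministic toy model (an assumption of this
    sketch); a real simulator would generally be stochastic.
    """
    def survivable(s, k):
        # True if s is not a crash and some k-step action sequence stays crash-free.
        if is_crash(s):
            return False
        if k == 0:
            return True
        return any(survivable(step(s, a), k - 1) for a in actions)

    safe = [a for a in actions if survivable(step(state, a), depth - 1)]
    candidates = safe or list(actions)  # fall back to all actions if none is safe
    return max(candidates, key=lambda a: reward(state, a))


# Hypothetical 1-D car: state = (position, velocity), wall at position >= 5.
ACTIONS = (-1, 0, 1)  # change in velocity per step


def step(state, a):
    x, v = state
    v = v + a
    return (x + v, v)


def is_crash(state):
    return state[0] >= 5


def reward(state, a):
    return state[1] + a  # a higher resulting velocity is rewarded


short_horizon = mpc_action((0, 2), ACTIONS, step, is_crash, reward, depth=1)
long_horizon = mpc_action((0, 2), ACTIONS, step, is_crash, reward, depth=3)
```

With a one-step lookahead the sketch accelerates into a state from which every action hits the wall, whereas the three-step lookahead rules that action out, at the price of an exhaustive search whose cost grows exponentially with the depth.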

In this study, we propose a novel simulator-based approach that looks for a solution of a CMDP problem by decomposing the CMDP into a pair of MDPs: a reconnaissance MDP (R-MDP) and a planning MDP (P-MDP). The purpose of the R-MDP is to (1) reconnoiter the state space, (2) evaluate the threat function that measures the potential danger at each state, and (3) construct a pool of policies that are safe in the sense of satisfying a user-specified constraint. After solving the R-MDP problem, we solve the P-MDP problem, which consists of the original MDP with the policy search restricted to the pool of safe policies specified by the R-MDP. If we can find one safe policy, we can use the threat function to construct a non-empty set of policies that are guaranteed to be safe.

The threat function we compute in the R-MDP is mathematically close to the Lyapunov function considered previously by Chow et al. (2018, 2019). However, unlike these prior works, we do not have to evaluate the safety of a given policy more than once. Because our method is computationally light, it can be used to solve CMDP problems in high-dimensional spaces with relative ease. Fig. 1 illustrates the routes on the circuit taken by agents trained with various CMDP methods and the locations of the accidents caused by the agents. The agent trained with our algorithm finds a safe and efficient route.

Figure 1: The trajectories produced by the policy trained by our proposed method ((a) and (d)), 4-step MPC ((b) and (e)), and the policy trained with penalized DQN ((c) and (f)). The trajectories on the circular circuit were produced by the policies trained on the original circuit. The initial position of the agent is marked in each panel. The red marks represent the places at which the agent crashed into the wall. Our method can maneuver through the environment without any accident. The DQN policy cannot adapt to the new environment because it is not aware of the new location of the wall in the new circuit. Meanwhile, the policy found by our method can skirt the danger in the new environment because it is selected from a set of policies defined by a reward-independent threat function constructed just for the sake of safety. MPC can also finish a lap without any accidents. MPC, however, takes an order of magnitude more computation time than our method (see Table 1).

The advantages of our approach are multifold. Because the threat function alone specifies the pool of safe policies in our framework, we can re-use the pool specified by the R-MDP constructed for one CMDP problem to solve another CMDP problem with a different reward function and the same safety constraint. Our formulation of the threat function can also be used to solve an MDP problem with a constraint on the probability of catastrophic failure. By applying a basic rule of probability measure to a set of threat functions, we can solve a CMDP problem with multiple safety constraints as well. This allows us to find a good reward-seeking policy for a sophisticated task like safely navigating through a crowd of randomly moving objects. Although our method does not guarantee finding the optimal solution of the CMDP problem, it prioritizes safety and is still able to find a safe policy that is competitive in terms of reward-seeking ability. To the best of our knowledge, there has not been any study to date that has succeeded in solving a CMDP in dynamical environments as high-dimensional as the ones discussed in this study.

The work that is algorithmically closest to our approach is Bouton et al. (2018), which computes for each state the state-dependent set of actions that guarantee safety when the constraint can be written in Linear Temporal Logic. The present paper provides a similar approach for the form of risk that is more commonly used in the field of safe reinforcement learning. It also provides a more solid theoretical justification for the safety guarantee. The following list summarizes the advantages of our new framework.

  • The R-MDP problem of identifying the set of safe policies needs to be solved only once. Moreover, one does not necessarily need to obtain the absolute optimal solution for the R-MDP problem in order to find a good reward-seeking safe policy from the ensuing CMDP problem.

  • The policy proposed by our method is almost always safe. If we can find a safe policy from the R-MDP problem, we can always guarantee safety.

  • The threat function evaluated by the R-MDP can be re-used for another CMDP problem with safety constraints on the same quantities.

  • The P-MDP can be solved with or without access to a simulator.

2 Method

2.1 Problem formulation and setup

We begin this section with the notations and assumptions that we are going to use throughout the paper. We assume that the system in consideration is a discrete-time constrained Markov Decision Process with finite horizon, defined by a tuple $(S, A, r, d, P, P_0)$, where $S$ is the set of states, $A$ is the set of actions, $P(s' \mid s, a)$ is the density of the state transition probability from $s$ to $s'$ when the action is $a$, $r(s, a)$ is the reward obtained by action $a$ at state $s$, $d(s, a)$ is the non-negative danger of taking action $a$ at state $s$, and $P_0$ is the distribution of the initial state. We use $\pi(a \mid s)$ to denote the policy $\pi$'s probability of taking an action $a$ at a state $s$. Also, for ease of notation, we use to denote , and to denote . Likewise, we will use and to denote and respectively. Finally, for an arbitrary set , we will use to denote its complement.

Next, we present the optimization problem (1) more formally. The ultimate goal of the CMDP (Constrained Markov Decision Process) problem is to find the policy that solves


where , and denotes the expectation with respect , and . Unless otherwise denoted, we will use to refer to the integration with respect to both and . In our formulation, we use the following threat function as a danger-analogue of the action-value function. We define the threat function for a policy at by


Informally, we can think of as the aggregated measure of threat that the agent with policy must face after taking the action in the state at time . We may say that a policy is safe if . To reiterate, our strategy is to (1) evaluate the threat function for a baseline policy, (2) construct a pool of safe policies using the threat function, and (3) look for a reward-seeking policy in the pool of safe policies. Before we proceed further, we describe several key definitions and lemmas that stem from the definition of the threat function.
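To make the notion of a threat function concrete, the following minimal sketch evaluates it by backward induction on a small tabular MDP. It assumes an undiscounted sum of dangers over the remaining steps, which may differ from the exact weighting in the paper's definition; the three-state toy system and all names (`threat_function`, `P`, `d`, `baseline`) are hypothetical.

```python
def threat_function(P, d, policy, horizon):
    """Backward induction for the threat T[t][s][a] of a fixed policy.

    P[s][a] is a dict {next_state: probability}, d[s][a] >= 0 is the danger,
    and policy[s] is a dict {action: probability}. The threat is taken to be
    the expected undiscounted sum of dangers from step t to the horizon
    (an assumption of this sketch).
    """
    states = list(P)
    T = [None] * (horizon + 1)
    T[horizon] = {s: {a: 0.0 for a in P[s]} for s in states}  # nothing left to incur
    for t in range(horizon - 1, -1, -1):
        T[t] = {}
        for s in states:
            T[t][s] = {}
            for a in P[s]:
                # Expected threat after the transition, following `policy` afterwards.
                future = sum(
                    p * sum(policy[s2][a2] * T[t + 1][s2][a2] for a2 in policy[s2])
                    for s2, p in P[s][a].items()
                )
                T[t][s][a] = d[s][a] + future
    return T


# Hypothetical toy system: "go" at the edge crashes with probability 0.5.
P = {
    "safe": {"stay": {"safe": 1.0}, "go": {"edge": 1.0}},
    "edge": {"stay": {"safe": 1.0}, "go": {"crash": 0.5, "edge": 0.5}},
    "crash": {"stay": {"crash": 1.0}, "go": {"crash": 1.0}},
}
d = {
    "safe": {"stay": 0.0, "go": 0.0},
    "edge": {"stay": 0.0, "go": 0.5},
    "crash": {"stay": 0.0, "go": 0.0},
}
baseline = {s: {"stay": 1.0, "go": 0.0} for s in P}  # cautious baseline policy

T = threat_function(P, d, baseline, horizon=3)
```

In this toy system the cautious baseline incurs no danger on its own, so the threat of an action is just the danger of the single deviation followed by a return to the baseline.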

2.2 Properties of threat functions and secure policies

For now, let us consider a time-dependent safety threshold defined at each time , and let be a baseline policy. Then the set of -secure actions is the set of actions that are deemed safe by for the risk threshold , in the sense that the agent's safety is guaranteed if it follows afterward.

Definition 1 (-secure actions).

where , and is a non-negative time-dependent constant.

Let . Then is a set of actions that very much represents the agent's freedom in seeking reward when following the policy . This set of actions, however, is not always non-empty. Let us define the -secure states to be the set of states for which the set of -secure actions is non-empty. Over such a set of states, we want the agent to take actions that are safer than . Let

We are going to use the following set of policies as the first candidate of pool from which to look for a reward-seeking safe policy:


We will refer to this set as the set of -secure policies. Intuitively, this set should grow as becomes safer. While this intuition unfortunately does not hold in general, it holds for its lower bound subset:


That is, whenever for all .
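As a rough illustration of the definitions in this subsection, the sketch below thresholds a tabulated threat function at one time step to obtain the secure actions of each state, and collects the states whose secure action set is non-empty. The threat values, the threshold, and the function names are hypothetical; the paper's threshold is time-dependent, whereas a single constant is used here for brevity.

```python
def secure_action_sets(threat_t, threshold):
    """Actions whose baseline threat does not exceed the threshold, per state.

    threat_t[s][a] is the baseline policy's threat at one fixed time step.
    """
    return {
        s: [a for a, value in actions.items() if value <= threshold]
        for s, actions in threat_t.items()
    }


def secure_states(action_sets):
    """States that have at least one secure action."""
    return {s for s, actions in action_sets.items() if actions}


# Hypothetical threat values at a single time step.
threat_t = {
    "safe": {"stay": 0.0, "go": 0.0},
    "edge": {"stay": 0.0, "go": 0.5},
    "cliff": {"stay": 0.4, "go": 0.9},
}
sets = secure_action_sets(threat_t, threshold=0.1)
states = secure_states(sets)
```

Here the state "cliff" has no secure action at this threshold, which is the situation in which the text asks the agent to act at least as safely as the baseline.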

We are still not yet done. Up until here, we have been defining the set of actions based on -defined measure of safety. As we will be using a policy other than to maximize the reward, we must take into account the risk that will be incurred in taking an action from a policy other than the one used for determining its risk:

Theorem 1.

Let be the total variation distance between two distributions and (footnote 1: the total variation distance is defined as ). For a given policy , let be a policy such that . If , then


where .

For the proof, see the appendix. The bound (7) in its raw form is not too useful because the RHS depends on and the threat of is bounded implicitly. However, if we appeal to the trivial upper bound for the total variation distance and set , we can achieve . The summation term in parentheses is the very penalty that the agent must pay for taking an action other than that of , the safety-evaluating policy. We can thus guarantee by just setting to a value smaller than :

Corollary 2.

Let . Then is safe.

Thus, from any baseline policy satisfying , we can construct a pool of absolutely safe policies whose membership condition is based explicitly on and alone. That the threshold expression in Eq. (2) is free of is what allows us to decompose the CMDP problem into two separate MDP problems. We can seek a solution to the CMDP problem by (1) looking for an satisfying , and (2) looking for the reward maximizing policy in . We address the first problem by R-MDP, and the second problem by P-MDP.

Now, several remarks are in order. First, if we take the limit of , the in the above statement will approach , and this just gives us the requirement that itself must be safe in order for to serve as a pool of safe policies. Next, if we set , then as . This is in agreement with the law of large numbers; that is, any accident with positive probability is bound to happen at some point. Also, recall that we have whenever for any . Thus, by finding the risk-minimizing , we can maximize the pool of safe policies. Whenever we can, we shall therefore look not just for a baseline policy that satisfies , but also for the threat-minimizing policy. Lastly, if is -safe, then is guaranteed so that is not empty. Unless otherwise denoted, we will use to denote , and use to denote .

3 Reconnaissance-MDP (R-MDP) and Planning-MDP (P-MDP)

As stated in the previous section, we can obtain a set of safe policies from any baseline policy satisfying , and we can maximize this set by constructing it from the threat-minimizing . The purpose of the R-MDP is thus to reconnoiter the system prior to the reward maximization and look for the policy with minimal threat (maximal ). As a process, the R-MDP is the same as the original MDP except that we have a danger function instead of a reward function, and the goal of the agent in the system is to find the minimizer of the risk: . This is, indeed, more than what we need when safety is our only concern. So long as , the pool of policies is guaranteed to be safe.

The purpose of Planning-MDP (P-MDP) is to search within for a good reward-seeking policy.

The P-MDP is the same as the original MDP, except that the action set is state- and time-dependent; that is, the agent is allowed to take actions only from whenever , and takes the deterministic action whenever .

The purpose of P-MDP is to find the policy


The following algorithm will find a safe good policy if

Algorithm 1 RP-algorithm
1: Obtain defined in Eq. (3) for any , or prepare a heuristically selected .
2: Evaluate and construct .
3: Obtain using either model-free or model-based RL.
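The three steps of the RP-algorithm can be sketched end to end on a small tabular system. This is only an illustration under stated assumptions: both MDPs are solved exactly by finite-horizon dynamic programming rather than by DQN, the threshold is a single constant rather than the time-dependent threshold of Section 2.2, the fallback to the baseline action in states without secure actions is an assumption of the sketch, and the toy system and all names are hypothetical.

```python
def backward_induction(P, value, horizon, allowed=None, maximize=True):
    """Finite-horizon dynamic programming; returns Q[t][s][a] and a greedy policy."""
    pick = max if maximize else min
    V = {s: 0.0 for s in P}
    Q, pi = [None] * horizon, [None] * horizon
    for t in reversed(range(horizon)):
        Q[t] = {s: {a: value[s][a] + sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in P[s]} for s in P}
        pi[t] = {}
        for s in P:
            acts = allowed[t][s] if allowed is not None else list(P[s])
            pi[t][s] = pick(acts, key=Q[t][s].get)
        V = {s: Q[t][s][pi[t][s]] for s in P}
    return Q, pi


def rp_algorithm(P, r, d, horizon, threshold):
    # Step 1 (R-MDP): threat-minimizing baseline policy and its threat function.
    threat, baseline = backward_induction(P, d, horizon, maximize=False)
    # Step 2: keep actions whose threat is under the threshold; if none exists,
    # fall back to the baseline's own action (an assumption of this sketch).
    allowed = [{s: [a for a in P[s] if threat[t][s][a] <= threshold] or [baseline[t][s]]
                for s in P} for t in range(horizon)]
    # Step 3 (P-MDP): maximize reward using only the allowed actions.
    _, policy = backward_induction(P, r, horizon, allowed=allowed, maximize=True)
    return threat, allowed, policy


# Hypothetical toy system: "go" at the edge is lucrative but crashes half the time.
P = {
    "safe": {"stay": {"safe": 1.0}, "go": {"edge": 1.0}},
    "edge": {"stay": {"safe": 1.0}, "go": {"crash": 0.5, "edge": 0.5}},
    "crash": {"stay": {"crash": 1.0}, "go": {"crash": 1.0}},
}
r = {"safe": {"stay": 0.0, "go": 1.0},
     "edge": {"stay": 0.5, "go": 5.0},
     "crash": {"stay": 0.0, "go": 0.0}}
d = {"safe": {"stay": 0.0, "go": 0.0},
     "edge": {"stay": 0.0, "go": 0.5},
     "crash": {"stay": 0.0, "go": 0.0}}

threat, allowed, rp_policy = rp_algorithm(P, r, d, horizon=4, threshold=0.1)
_, unconstrained = backward_induction(P, r, horizon=4)
```

In this toy system the unconstrained planner takes the risky action at the edge, while the P-MDP policy still collects reward by moving between the safe state and the edge without ever taking it. Because step 1 does not depend on `r`, the same `allowed` sets could be reused with a different reward table.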

3.1 Variants of Reconnaissance and Planning Algorithm

In what follows, we describe important variants of the RP-algorithm that are useful in practice.

3.1.1 Constraint on the probability of fatal accident

So far, we have considered the constraints of the form . If the danger to be avoided is so catastrophic that one accident alone is enough to ruin the project, one might want to directly constrain the probability of an accident. Our RP-Algorithm can be used to find a safe solution for a CMDP with this type of constraint as well. Let us use to represent the binary indicator that takes the value 1 only if the agent encounters the accident upon taking the action at the state . Using this notation, we can write our constraint as , and our threat function for this case can be recursively defined as follows:


Notice that this is a variant of the Bellman relation for the original (3) in which is replaced with . With straightforward computations, it can be verified that theorem 2.2 follows if we replace in the statement with the maximum possible value of . We can replace with its upper bound () as well. That is, we may construct a set of safe policies by setting . With this strict constraint, however, approaches 0 as approaches infinity. This is in agreement with the law of large numbers; any accident with non-zero probability will happen almost surely if we wait for an infinite length of time.
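The long-horizon behavior noted above can be checked with a minimal sketch. It assumes a recursion of the form "accident now, or no accident now and an accident later", applied to a hypothetical single-state system in which every step has the same accident probability `o`; the exact recursion in the paper may be written differently.

```python
def accident_probability(o, horizon):
    """Probability of at least one accident within `horizon` steps.

    Uses the recursion T = o + (1 - o) * T_next with T = 0 beyond the horizon,
    for a single-state toy system with per-step accident probability `o`
    (an assumption of this sketch).
    """
    T = 0.0
    for _ in range(horizon):
        T = o + (1.0 - o) * T
    return T


three_steps = accident_probability(0.1, 3)
long_run = accident_probability(0.01, 2000)
```

For three steps the recursion agrees with the closed form 1 - (1 - o)^3, and for a long horizon the probability approaches one even when the per-step probability is small, which is why a fixed bound on the accident probability forces the allowed per-step risk toward zero.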

3.1.2 Constraint on the probability of multiple fatal accidents

Many applications of CMDP involve multiple fatal events. For example, while navigating a highway with heavy traffic, the driver must be wary of the movements of multiple other cars. Industrial robots in hazardous environments might also have to avoid numerous obstacles.

Our setup in the previous subsection can be used to find a solution for this type of problem. Let us consider a model in which the full state of the system is given by , where the state of the -th obstacle is and the aggregate state of all other objects in the system is (i.e. the location of the agent, etc). Let be the probability of collision in the subsystem containing only the agent and the -th obstacle. Under this set up, we can appeal to a basic property of probability measure regarding a union of events to obtain the following interesting result:

Theorem 3.

Let us assume that agent can take action based solely on , and that are all conditionally independent of each other given . Then


We can then follow the procedure described in Section 2.2 with this new threat function to construct the set of secure policies in Corollary 2.
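The superposition idea behind Theorem 3 can be sketched numerically: the probability that at least one of several collision events occurs is never larger than the sum of the individual probabilities. The per-obstacle values below are hypothetical, and the "exact" figure assumes fully independent events purely for comparison; capping the sum at one is a convenience of this sketch, valid because a probability cannot exceed one.

```python
def threat_upper_bound(pairwise_threats):
    """Union bound: sum of per-obstacle collision probabilities, capped at 1."""
    return min(1.0, sum(pairwise_threats))


def exact_collision_probability(pairwise_threats):
    """Probability of at least one collision if the events were independent."""
    no_collision = 1.0
    for p in pairwise_threats:
        no_collision *= 1.0 - p
    return 1.0 - no_collision


# Hypothetical threats of the agent with respect to three separate obstacles.
pairwise = [0.1, 0.2, 0.05]
bound = threat_upper_bound(pairwise)
exact = exact_collision_probability(pairwise)
```

Since the bound only needs the threat of each agent-obstacle pair, each pairwise threat function can be learned in a much smaller state space than the full system.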

Theorem 3 is closely related to the risk potential approaches Wolf and Burdick (2008); Rasekhipour et al. (2016); Ji et al. (2016). These methods also work by evaluating the risk of collision with each obstacle for each location in the environment and by superimposing the results. However, most of them define the risk potential heuristically.

4 Experiment

We conducted a series of experiments to find answers to the following set of questions:

  1. What does the threat function obtained from our method look like?

  2. How effective is the safe policy obtained from our method when suboptimal policy was used for the baseline policy ?

  3. How well does the policy trained by our RP-method perform in new environments?

We compared our algorithm’s results on these experiments against those of other methods, including (1) classical MPC, (2) DQN with Lagrange penalty, and (3) Constrained Policy Optimization (CPO) Achiam et al. (2017a). At every step, the version of MPC we implemented in our study selects the best reward-seeking action among the set of actions that were deemed safe by the lookahead search. DQN with Lagrange penalty is a version of DQN for which the reward is penalized by the risk function with Lagrange weight. We tested this method with three choices of Lagrange weights. As for CPO, we used the implementation available on Github Achiam et al. (2017b).

For our method, we used a neural network to approximate the threat function in the R-MDP for a heuristically chosen baseline policy that is not necessarily safe, and solved the P-MDP with DQN. When solving the P-MDP with DQN, it becomes cumbersome to compute on states outside . We therefore constructed an MDP defined on that is equivalent to the original MDP for the policies in . Namely, we constructed the tuple , whose components are defined as follows. The function is the restriction of the reward to for all . is a transition probability function derived from the original state transition probability such that, for all , where is the set of all trajectories from to that (1) take a detour to at least once after taking the action at , (2) take the action for all , and (3) lead to without visiting any other states in .

4.1 Tasks

We conducted experiments on three tasks on 2-D fields (see Fig. 2 for visualizations).

Point Gather

In this task, the agent’s goal is to collect as many green apples as possible while avoiding red bombs.


Circuit

The agent's goal is to complete one lap around the circuit without crashing into a wall. The agent can control its movement by regulating its acceleration and steering. Lidar sensors are used to compute the distance to obstacles.

Jam

The agent's goal is to navigate its way out of a room from the exit located at the top right corner as quickly as possible without bumping into 8 randomly moving objects.

Figure 2: Panels (a) and (b): the fields for Circuit task and Jam task. For Jam, the light blue circles are obstacles, and the yellow circle is the agent. The arrow attached to each object shows its direction of movement. Panels (c) and (d): the heat maps of the trained threat function in the neighborhood of the agent. The shape of the heat map changes with the speed and the direction of the object’s movement. (e),(f),(g) are the heat maps of the upper bound of the threat function (Theorem 3), computed for different velocity settings of the agent. The assumed movement of the agent is indicated at the left bottom corner of each map. These maps can be interpreted as a heat map of risk potential.

4.2 Learning the threat function

The state spaces of Jam and Circuit are high-dimensional, because there are multiple obstacles in the environment that must be avoided. We therefore used the method described in Section 3 to construct an upper bound for the true threat function by considering a set of separate R-MDPs in each of which there is only one obstacle. We also treated the wall as a set of immobile obstacles so that we can construct the threat function for a circuit of any shape. For more detail, see the appendix.

Fig. 2 shows the heat map for the upper bound of the threat function computed in the way of Theorem 3. Note that the threat map changes with the state of the agent. We see that our threat function plays a role similar to that of a risk potential function. Because our threat function is computed using all aspects of the agent's state (acceleration, velocity, location), it provides a more comprehensive measure of risk in high-dimensional environments than other risk metrics used in applications, such as the TTC (Time To Collision) Lee (1976) used in smart automobiles, which considers only 1D movement.

4.3 Learning performance

Fig. 3 plots the average reward and the crash rate of the policy against the training iteration for various methods. The curve plotted for our method (RP) corresponds to the result obtained from training on the P-MDP. The average and the standard deviation at each point were computed over 10 seeds. As we can see in the figure, our method achieves the highest reward at almost all phases of training for both Jam and Circuit, while maintaining the lowest crash rate. In particular, our method performs significantly better than the other methods in terms of both safety and average reward for Jam, the most challenging environment. The RP-trained policy can consistently navigate its way out of the dynamically changing environment safely, even when the number of randomly moving obstacles differs from that in the R-MDP used to construct the secure set of policies.

Penalized DQN performs better than our method in terms of reward for Point Gather, but at the cost of a very high crash rate (). Our method is also safer than the 3-step MPC, a method with significantly higher computational cost, for both Jam and Circuit.

Figure 3: Comparison of multiple CMDP methods in terms of rewards and crash rate. For both Circuit and Jam, our method (P-MDP) achieves the highest average reward and lowest crash rate throughout the training process. DQN performs better in terms of reward for Point Gather, but at the cost of a very high crash rate.

4.4 Robustness of the learned policy to the change of environment

We conducted two sets of experiments in which a policy learned in one environment was applied to tasks in another environment. For the first set of experiments, we trained a safe policy for the Circuit task, and evaluated its performance on circuit environments that differ from the original circuit used in training: (1) a narrowed circuit with the original shape, and (2) a differently shaped circuit with the same width. For the second set of experiments, we trained a safe policy for the Jam task, and tested its performance on other Jam tasks with different numbers of randomly moving obstacles.

Table 1 and Table 2 show the results. For the modified Jam, we have no results for MPC with more than 3-step prediction, since the search cannot be completed within a reasonable time frame. The 4-step MPC requires 36.5 s per episode (200 steps) for Circuit, and the 3-step MPC requires 285 s per episode (100 steps) for the original Jam. We find that, even in varying environments, the policy obtained by our method can guarantee safety with high probability while seeking high reward.

| Environment | RP | MPC 4-step | DQN λ=0 | DQN λ=200 |
|---|---|---|---|---|
| Training env. | 1439 (0) | 1055 (0.35) | 1432 (0.05) | 933 (0.4) |
| Narrowed env. | 377 (0) | 959 (0.55) | -151 (1.0) | -145 (0.99) |
| Circle | 130 (0) | 351 (0) | -189 (1.0) | -171 (1.0) |
| Computation Time (s) | 1.0 | 36.5 | 0.9 | 0.9 |

Table 1: Performance of trained policies on unknown Circuit environments. The values in the table are the obtained rewards, with the probabilities of crashing within parentheses. The agent was penalized 200 pts for each collision, and rewarded for the geodesic distance traveled along the course in the right direction. For details concerning the reward settings, please see the appendix.

| Environment | RP | MPC 2-step | MPC 3-step | DQN λ=5 | DQN λ=500 |
|---|---|---|---|---|---|
| 3 obstacles | 78.2 (0) | 47.45 (0.33) | 77.5 (0.05) | 77.2 (0.04) | 4.4 (0.17) |
| 8 obstacles (training env.) | 69.1 (0) | 21.32 (0.59) | 65.3 (0.2) | 47.1 (0.38) | -1.0 (0.24) |
| 15 obstacles | 33.0 (0.02) | -2.5 (0.8) | 36.6 (0.45) | 16.5 (0.66) | -16.8 (0.51) |
| Computation Time (s) | 1.2 | 2.8 | 285 | 0.4 | 0.4 |

Table 2: Performance of trained policies on the training and unknown Jam environments.

5 Conclusion

Our study is the first of its kind in providing a framework for solving CMDP problems that performs well in practice on high-dimensional dynamic environments like Jam. Although our method does not guarantee finding the optimal reward-seeking safe policy, empirically it is able to find a policy that performs significantly better than classical methods in terms of both safety and rewards. Our treatment of the threat function helps us obtain a more sophisticated and comprehensive measure of danger at each state than conventional methods. Our bound on the threat function also seems to have close connections with previous Lyapunov-based methods. Overall, we find that utilizing threat functions is a promising approach to safe RL, and further research on this framework may lead to new CMDP methods applicable to complex, real-world environments.


We thank Wesley Chung for his valuable comments on the manuscript and proposal of a tighter bound.

6 Appendix

6.1 Proof of Theorem 1


This bound can be proved by the recursion derived from the Bellman equation. Define . Then has the following recurrence.


From this recursion (11), we can derive an inequality.


where . Here we replaced in the 2nd term by its maximum value on the support of when and replaced in the 3rd term by its minimum value, zero.

For convenience, let . Then the above inequality can be written as .

Since for the finite horizon MDP with length , we have the inequality below by combining the above two inequalities and repeating the recursion.


This means


Note that this bound is tight in the sense that there exist a policy and an environment that achieve it with equality. To keep small, we should make small by making close enough to . One example is the -policy, which takes actions according to the policy with probability and according to the uniform random policy with probability ; its distance from becomes small when is small.
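The closeness of such a mixture policy to the baseline can be checked with a small sketch. It assumes the standard definition of total variation distance for discrete distributions (half the sum of absolute differences) and a mixture that follows the baseline with probability 1 - eps and a uniform random policy with probability eps; the action names and the value of `eps` are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


def epsilon_mixture(policy_probs, eps):
    """Follow `policy_probs` with probability 1 - eps, act uniformly otherwise."""
    n = len(policy_probs)
    return {a: (1.0 - eps) * pa + eps / n for a, pa in policy_probs.items()}


# Hypothetical baseline action distribution at one state.
baseline_probs = {"left": 1.0, "stay": 0.0, "right": 0.0}
eps = 0.3
mixed = epsilon_mixture(baseline_probs, eps)
distance = total_variation(mixed, baseline_probs)
```

The distance equals eps times the distance between the uniform policy and the baseline, so it never exceeds eps.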

6.2 Proof of Theorem 3


Let us suppose that we can write , and let us denote the accident of the -th type by . Then, by a basic property of probability distributions,


Thus, if the transition probability is given by and if the policy being followed is , we have


Now, let us consider the problem of constraining the probability of an accident . Then, using the fact that , we see that we can bound the threat by


where is the probability of the collision with the -th obstacle at state in the subsystem consisting of only the agent and the -th obstacle, and is the threat function for the R-MDP on such a subsystem.

6.3 Environments


Circuit

In this task, the agent's goal is to complete one lap around the circuit without crashing into the wall. Each state in the system was set to be the tuple of (1) location, (2) velocity, and (3) the direction of the movement. The set of actions allowed for the agent was {0.15 rad to left, 0.05 rad to left, stay course, 0.05 rad to right, 0.15 rad to right} × {0.02 unit acceleration, no acceleration, 0.02 unit deceleration} (15 options). At all times, the speed of the agent was truncated at . We rewarded the agent for the geodesic distance traveled along the course during each time interval, which accumulates to pts for one lap, while we gave negative rewards for stopping and for collision during the time step, which are pts and pts, respectively. We set the length of the episode to 200 steps, which is the approximate number of steps required to make one lap.
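The 15 options of the Circuit action set arise as the product of five steering choices and three acceleration choices. A minimal sketch of this enumeration follows; the sign convention (left as negative, deceleration as negative) and the variable names are assumptions made for illustration.

```python
from itertools import product

# Steering changes in radians; negative = left is an assumed sign convention.
STEERING = (-0.15, -0.05, 0.0, 0.05, 0.15)
# Acceleration in the paper's (unspecified) units; negative = deceleration.
ACCELERATION = (0.02, 0.0, -0.02)

# Each discrete action pairs one steering choice with one acceleration choice.
CIRCUIT_ACTIONS = list(product(STEERING, ACCELERATION))
```

A DQN head for this task would then output one value per entry of `CIRCUIT_ACTIONS`.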


Jam

In this task, the agent's goal is to navigate its way out of a room from the exit located at the top left corner without bumping into 8 randomly moving objects. We set three circular safety zones, one centered at each corner except for the top left corner, i.e., the exit. Any moving obstacle that enters a safety zone disappears. Without the safety zones, the task seems to be too difficult; i.e., there are situations in which the agent cannot avoid a collision even if it tries its best. We set the safety zones to ease the problem, hoping that the probability that the agent can solve the task when employing the optimal policy becomes reasonably high. The field was square, and the radius of the safety zones located at the three corners was set to 0.5. The radius of the agent and of the moving obstacles was set to 0.1. We rewarded the agent for its distance from the exit, and controlled its value so that the accumulated reward at the goal will be around . The agent was given points when it reached the goal, was penalized points for stopping its advance, and was given a penalty of pts for each collision. As in Circuit, the agent was allowed to change the direction and the acceleration of its movement simultaneously at each time point. The set of actions allowed for the agent was {0.30 rad to left, 0.10 rad to left, stay course, 0.10 rad to right, 0.30 rad to right} × {0.02 unit acceleration, no acceleration, 0.02 unit deceleration}. At all times, the speed of the agent was truncated at . Each obstacle in the environment was allowed to take a random action from {0.15 rad to left, 0.05 rad to left, stay course, 0.05 rad to right, 0.15 rad to right} × {0.02 unit acceleration, no acceleration, 0.02 unit deceleration}. The speed of each obstacle was truncated at . We set the length of each episode to steps.

Point Gather

In this task, the goal is to collect as many green apples as possible while avoiding red bombs. There are 2 apples and 10 bombs in the field. The agent was rewarded 10 pts when it collected an apple, and also reward The point-mass agent receives a 29-dimensional state and can take two-dimensional continuous actions. The state variables take real values, including the position and velocity, the direction of and distance to the bombs, etc. The action variables determine the direction and the velocity of the agent. For the implementation of DQN, we discretized each action variable into 9 values. We used exactly the same task setting as the one used in the original paper.
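The discretization used for DQN can be illustrated as follows. This is a sketch under assumptions: the action bounds of -1 and 1, the even spacing, and the use of the full 9 × 9 product as the discrete action set are not stated in the text, and the names `discretize`, `GRID`, and `JOINT_ACTIONS` are hypothetical.

```python
from itertools import product


def discretize(low, high, n=9):
    """Return n evenly spaced values covering [low, high]."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


# Assumed bounds of -1 and 1 for both action variables.
GRID = discretize(-1.0, 1.0)

# One discrete action per pair of grid values: 9 x 9 = 81 actions.
JOINT_ACTIONS = list(product(GRID, GRID))
```

With an odd number of grid points, the midpoint of the range (here 0) is itself an available value for each action variable.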

6.4 Model architecture and optimization details

6.4.1 RP-algorithm

We implemented our method on Chainer Tokui et al. (2015).

Learning of Threat function

For the Circuit and Jam tasks, the agent must avoid collisions with both moving obstacles and the wall. For these environments, it is computationally difficult to obtain the best . Thus, we computed a threat function for the collision with every object in the environment individually under an arbitrary baseline policy , and constructed the pool of approximately safe policies using the upper bound of the threat function computed in the way we described in Section 3.1.2. We used a baseline policy that (1) decides the direction of the movement by staying course with probability , turning right with probability and turning left with probability , and (2) decides its speed by accelerating with probability and decelerating with probability . Then, we trained two threat functions, one predicting the collision with an immobile point and the other predicting the collision with a moving obstacle, each placed randomly in the 2D region. Since the baseline policy is fixed, the threat function can be obtained by supervised learning, in which the task is to predict the future collision when starting from a given current state and employing the baseline policy. The threat function for the collision with the immobile point is used to avoid collision with the wall, which can be considered as a set of immobile points. We emphasize that, with our method described in Section 3, the environment used in P-MDP is different from the environment used in R-MDPs, because each threat function in the summand of (3) is computed on the assumption that there is only one obstacle, either an immobile point or a moving obstacle, in the environment.
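To make the supervised-learning setup concrete, the sketch below shows one way the regression targets could be produced: roll out the fixed baseline policy in a single-obstacle simulator and record how often a collision occurs. It is an illustration under assumptions, not the authors' code. The callbacks `step`, `baseline_policy`, and `collided`, the toy 1D environment, and all function names are hypothetical; the summation in `threat_upper_bound` mirrors the sum over single-obstacle threats mentioned above and is a union bound on the probability of any collision.

```python
import random


def estimate_threat(state, action, step, baseline_policy, collided,
                    horizon, n_paths, rng):
    """Monte Carlo estimate of the collision probability within `horizon`
    steps when `action` is taken first and the baseline policy afterwards."""
    hits = 0
    for _ in range(n_paths):
        s = step(state, action, rng)
        hit = collided(s)
        for _ in range(horizon - 1):
            if hit:
                break
            s = step(s, baseline_policy(s, rng), rng)
            hit = collided(s)
        hits += int(hit)
    return hits / n_paths


def threat_upper_bound(single_obstacle_threats):
    """Sum of single-obstacle threats (a union bound on any collision)."""
    return sum(single_obstacle_threats)


# Toy 1D stand-in for the simulator: a collision occurs at position >= 3.
def toy_step(s, a, rng):
    return s + a


def toy_baseline(s, rng):
    return 0  # the toy baseline policy never moves


def toy_collided(s):
    return s >= 3


rng = random.Random(0)
near = estimate_threat(2, 1, toy_step, toy_baseline, toy_collided, 5, 10, rng)
far = estimate_threat(0, 1, toy_step, toy_baseline, toy_collided, 5, 10, rng)
```

The estimates for sampled (state, action) pairs would then serve as targets for the threat network, one output per action.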

We used a neural network with three fully connected layers (100-50-15) for the threat function of the immobile point and one with four fully connected layers (500-500-500-15) for that of the moving obstacle. For the training dataset, we sampled 100,000 initial states and simulated 10,000 paths of length from each initial state. We trained the networks with Adam. The parameter settings for training the threat functions of the immobile point and of the moving obstacle are ( , batch size = 512, number of epochs = 20) and ( , batch size = 512, number of epochs = 25), respectively.

For the Point Gather task, we again used the upper bound approximation explained in Section 3.1.2 for the threat function. The threat function was estimated using a two-layer fully connected neural network.

Solving P-MDP

For the Planning MDP, we used DQN with a convolutional neural network consisting of one convolutional layer and three fully connected layers, and we trained the network with Adam ( ). We linearly decayed the learning rate from to over 3M iterations.
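A linear learning-rate decay of this kind can be written as a small helper. This is a sketch only: the function name `linear_decay`, the parameters `lr_start` and `lr_end`, and the choice to hold the final value after the decay ends are assumptions, and the actual start and end values are not reproduced here.

```python
def linear_decay(step, lr_start, lr_end, total_steps=3_000_000):
    """Learning rate at `step`, interpolated linearly from lr_start to
    lr_end over total_steps and held at lr_end afterwards (assumed)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)
```

The returned value would be assigned to the optimizer's learning rate before each update.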

6.4.2 DQN with Lagrange coefficient

For DQN, we used the same network as the one used for P-MDP. We tested three Lagrange coefficients for each task: for Circuit, for Jam, and for Point Gather. For the Jam task, the Lagrange coefficients were all initialized to 5 and gradually incremented to their final values, as done in Miyashita et al. (2018). This heuristic pushes the agent to learn the goal-oriented policy first and the collision avoidance afterwards.
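The Lagrangian baseline can be sketched as reward shaping with a penalty weight that grows during training. The snippet assumes the common form in which the constraint cost is subtracted from the reward with weight equal to the Lagrange coefficient, and it assumes a linear ramp from the initial value of 5; neither the exact penalty form nor the shape of the increment schedule is specified in the text. The names `penalized_reward` and `lagrange_schedule` are hypothetical.

```python
def penalized_reward(reward, cost, lam):
    """Reward minus the constraint cost weighted by the coefficient lam."""
    return reward - lam * cost


def lagrange_schedule(step, lam_final, total_steps, lam_init=5.0):
    """Coefficient at `step`, ramped linearly from lam_init to lam_final.

    The linear shape and total_steps are assumptions for illustration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_init + frac * (lam_final - lam_init)
```

A small initial coefficient keeps the collision penalty weak early on, which matches the stated intent of learning the goal-oriented behavior first.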

6.4.3 Constrained Policy Optimization

For CPO, we used the implementation published on GitHub Achiam et al. (2017b); that is, the policy is a Gaussian policy implemented by a two-layer MLP with 64 and 32 hidden units for all tasks.

6.4.4 Model Predictive Control

We tested MPC with receding horizon , and at every step selected the action with the highest reward among those that were deemed safe by the prediction.
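The selection rule of this MPC baseline can be sketched as a filter followed by an argmax. The callbacks `predict_safe` and `predict_reward` stand in for the simulator-based prediction over the receding horizon and are hypothetical, as is the function name `mpc_action`; the behavior when no action is predicted safe is not described in the text, so the sketch simply returns `None` in that case.

```python
def mpc_action(state, actions, predict_safe, predict_reward, horizon):
    """Pick the highest-reward action among those predicted to be safe.

    predict_safe(state, action, horizon) -> bool and
    predict_reward(state, action, horizon) -> float are assumed callbacks.
    """
    safe = [a for a in actions if predict_safe(state, a, horizon)]
    if not safe:
        return None  # fallback behaviour is not specified in the text
    return max(safe, key=lambda a: predict_reward(state, a, horizon))
```

Calling this function at every time step with the current state yields the receding-horizon behavior, since only the first selected action is ever executed.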

6.5 Related works

CPO Achiam et al. (2017a) is a method that gradually improves a safe policy by making a local search for a better safe policy in the neighborhood of the current safe policy. By nature, at each update, CPO has to determine which members of the neighborhood of the current safe policy satisfy the safety constraint. In implementation, this is done by the evaluation of Lagrange coefficients. Accurate selection of safe policies in the neighborhood is especially difficult when the danger to be avoided is "rare" and "catastrophic", because a massive number of samples is needed to verify whether a given policy is safe. Moreover, because each update is incremental, this process has to be repeated multiple times (usually several dozen times for Point Circle, and several thousand times for Ant Gather and Humanoid Circle). The Lyapunov-based approaches Chow et al. (2018, 2019) are similar in nature. At each step of the algorithm, a Lyapunov-based approach constructs a set of safe policies from a neighborhood of the baseline policy, and computes a safety-margin function (a state-dependent measure of how bold an action the agent can take while remaining safe) to specify the neighborhood in which to look for a better policy. For the accurate computation of the margin, one must use the transition probability and solve a linear programming problem over a space whose dimension equals the number of states. The subset of the neighborhood computed from an approximate margin may contain unsafe policies. Model checking is another approach to guaranteeing safety. Once the constraints are represented in the form of temporal logic constraints or computation-tree logic Baier et al. (2003); Wen et al. (2015); Bouton et al. (2019), we can ensure safety by using model checking systems. However, it is sometimes difficult to express the constraints in such a structured form. Also, even when the constraints are represented in a structured form, computational issues arise again: when the state-action space becomes large, the computation required for model checking becomes prohibitively heavy because the number of candidate solutions increases.


  • J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017a) Constrained policy optimization. arXiv preprint arXiv:1705.10528. Cited by: §1, §4, §6.5.
  • J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017b) Constrained policy optimization. Note: Cited by: §4, §6.4.3.
  • E. Altman (1999) Constrained Markov decision processes. Vol. 7, CRC Press. External Links: ISBN 0849303826 Cited by: §1.
  • M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, and A. Ray (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177. Cited by: §1.
  • C. Baier, B. Haverkort, H. Hermanns, and J. Katoen (2003) Model-checking algorithms for continuous-time Markov chains. IEEE Transactions on Software Engineering 29 (6), pp. 524–541. External Links: ISSN 0098-5589 Cited by: §6.5.
  • M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova (2018) Reinforcement learning with probabilistic guarantees for autonomous driving. In Workshop on Safety Risk and Uncertainty in Reinforcement Learning, Conference on Uncertainty in Artificial Intelligence (UAI). Cited by: §1.
  • M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova (2019) Reinforcement learning with probabilistic guarantees for autonomous driving. arXiv preprint arXiv:1904.07189. Cited by: §6.5.
  • Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh (2018) A Lyapunov-based approach to safe reinforcement learning. In Neural Information Processing Systems 2018. Cited by: §1, §1, §6.5.
  • Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. Duenez-Guzman (2019) Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031. Cited by: §1, §1, §6.5.
  • S. Di Cairano, D. Bernardini, A. Bemporad, and I. V. Kolmanovsky (2013) Stochastic mpc with learning for driver-predictive vehicle control and its application to hev energy management. IEEE Transactions on Control Systems Technology 22 (3), pp. 1018–1031. External Links: ISSN 1063-6536 Cited by: §1.
  • P. Falcone, F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat (2007) A model predictive control approach for combined braking and steering in autonomous vehicles. In Mediterranean Conference on Control & Automation, pp. 1–6. External Links: ISBN 1424412811 Cited by: §1.
  • P. Geibel and F. Wysotzki (2005) Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, pp. 81–108. Cited by: §1.
  • S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis (2018) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. arXiv preprint arXiv:1812.07252. Cited by: §1.
  • J. Ji, A. Khajepour, W. W. Melek, and Y. Huang (2016) Path planning and tracking for vehicle collision avoidance based on model predictive control with multiconstraints. IEEE Transactions on Vehicular Technology 66 (2), pp. 952–964. External Links: ISSN 0018-9545 Cited by: §3.1.2.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, and V. Vanhoucke (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §1.
  • D. N. Lee (1976) A theory of visual control of braking based on information about time-to-collision. Perception 5 (4), pp. 437–459. External Links: Document Cited by: §4.2.
  • M. Miyashita, S. Maruyama, Y. Fujita, M. Kusumoto, T. Pfeiffer, E. Matsumoto, R. Okuta, and D. Okanohara (2018) Toward onboard control system for mobile robots via deep reinforcement learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS). Cited by: §6.4.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: ISSN 0028-0836 Cited by: §1.
  • Y. Rasekhipour, A. Khajepour, S. Chen, and B. Litkouhi (2016) A potential field-based model predictive path-planning controller for autonomous road vehicles. IEEE Transactions on Intelligent Transportation Systems 18 (5), pp. 1255–1267. External Links: ISSN 1524-9050 Cited by: §3.1.2.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. External Links: ISSN 0028-0836 Cited by: §1.
  • S. Tokui, K. Oono, S. Hido, and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), Vol. 5, pp. 1–6. Cited by: §6.4.1.
  • Y. Wang and S. Boyd (2010) Fast model predictive control using online optimization. IEEE Transactions on control systems technology 18 (2), pp. 267–278. External Links: ISSN 1063-6536 Cited by: §1.
  • T. Weiskircher, Q. Wang, and B. Ayalew (2017) Predictive guidance and control framework for (semi-) autonomous vehicles in public traffic. IEEE Transactions on control systems technology 25 (6), pp. 2034–2046. External Links: ISSN 1063-6536 Cited by: §1.
  • M. Wen, R. Ehlers, and U. Topcu (2015) Correct-by-synthesis reinforcement learning with temporal logic constraints. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4983–4990. External Links: ISBN 1479999946 Cited by: §6.5.
  • M. T. Wolf and J. W. Burdick (2008) Artificial potential functions for highway driving with collision avoidance. In 2008 IEEE International Conference on Robotics and Automation, pp. 3731–3736. External Links: ISBN 1424416469 Cited by: §3.1.2.