Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

03/03/2020 ∙ by Lu Wen, et al. ∙ University of Michigan

Reinforcement learning (RL) is attracting increasing interest in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles because of two predominant problems: their behaviours are unexplainable, and they cannot guarantee safety under new scenarios. This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for two autonomous driving tasks. PCPO extends today's common actor-critic architecture to a three-component learning framework, in which three neural networks are used to approximate the policy function, value function and a newly added risk function, respectively. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. To ensure the feasibility of safety-constrained problems, synchronized parallel learners are employed to explore different state spaces, which accelerates learning and policy updates. Simulations of two scenarios for autonomous vehicles confirm that the algorithm can ensure safety while achieving fast learning.




I Introduction

Autonomous driving has the potential to improve the safety and accessibility of ground vehicles. The approaches to designing the control policy for an autonomous vehicle generally fall into two categories: (1) rule-based methods and (2) learning-based methods. Because real-world driving involves a great deal of complexity, rule-based methods (e.g., finite state machines) may not be able to handle all scenarios, and they place a significant burden on engineers to cover all possibilities [4]. Learning-based methods can imitate and learn from drivers’ manipulation implicitly. In this study, we develop a learning-based method while making an important improvement.

Reinforcement learning (RL) has been actively studied for autonomous driving in recent years. The goal of RL is to find policies that maximize the accumulated reward without relying on labeled human driving data [13]. Karavolos (2013) first applied the vanilla Q-learning algorithm to efficiently train a driver in a racing game on the simulator TORCS [10]. Lillicrap et al. (2015) proposed the deep deterministic policy gradient (DDPG) algorithm by introducing deep learning into the DPG algorithm [19], which effectively solves problems in a continuous action space [14]. Wang et al. (2018) successfully applied DDPG to end-to-end policy learning for autonomous driving on TORCS [22]. Pan et al. (2017) built a new framework on the basis of A3C to train a self-driving vehicle by interacting with a synthesized real environment [16]. Recently, Zhang et al. (2019) employed RL with model-based exploration for autonomous lane change decision-making on highways [23]. Duan et al. (2020) employed RL with a hierarchical architecture for autonomous decision-making on highways [5][3].

To date, most RL methods have been developed on simulation platforms, with little work on real vehicles. A main reason is that the policy cannot be guaranteed to be safe, and the back-propagation-driven process may lead to unforeseen accidents. Safety is the most basic requirement for autonomous vehicles, so a training process that only looks at reward, and not at potential risk, is not acceptable. The notion of safe RL is defined in [7] as the process of learning policies that maximize the expectation of accumulated return, while respecting security constraints in the learning and deployment process. More specifically, safe RL can be divided into strictly safe and approximately safe methods. The algorithm developed in this paper is an approximately safe one. Approaches to this problem fall into two categories: (1) modifying the optimization criterion and (2) modifying the exploration process.

Modifying the optimization criterion means incorporating risk into the optimization objective, whereas risk-neutral control neglects the variance in the probability distribution of rewards. We categorize these optimization criteria into four groups: maximin, risk-sensitive, constrained, and others. The maximin criterion considers a policy to be optimal if it has the maximum worst-case return. Gaskett (2003) considered the inherent uncertainty related to the stochastic nature of the system by proposing a β-pessimistic extension to Q-learning. Nilim and El Ghaoui (2005) considered the uncertainty related to some of the parameters of the Markov decision process [15]. The risk-sensitive criterion includes the notion of risk and return variance in the long-term reward maximization objective. Geibel and Wysotzki (2005) transformed the optimization criterion into the probability of entering an error state [9]. The constrained criterion ensures that the expectation of return is subject to one or more constraints. Castro et al. (2012) used a constrained criterion in which the variance of the return must not exceed a given threshold. More optimization criteria were explored to enforce safety as well [21]. Alshiekh et al. (2018) proposed a preemptive-shielding system, which acts each time the learning agent is about to make a decision and provides a list of safe actions [2]. However, these methods share the drawbacks of being overly pessimistic or computationally intractable.

Modification of the exploration process can be categorized into two approaches: (1) incorporating external knowledge and (2) risk-directed exploration. Incorporating external knowledge can provide initial knowledge to the agent, but it is not sufficient to prevent dangerous situations in later exploration. Siebel and Sommer (2007) used external knowledge as a form of population seeding in neuroevolution approaches [18]. Alshiekh et al. (2018) introduced a system named post-posed shielding, in which the shield monitors the agent’s actions and corrects an action if it would cause a violation. Risk-directed exploration encourages the agent to explore controllable regions of the environment by introducing a risk metric as an exploration bonus. Garcia et al. (2012) successfully applied this approach to helicopter hovering control in the RL Competition [6]. Gehring and Precup (2013) defined a risk metric based on controllability [8]. These methods have limited performance and are not always reliable, because they cannot detect risky situations both in early steps and in the long term.

A main contribution of this paper is a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for autonomous driving policies. PCPO keeps the policy safe during the learning process and improves the convergence speed. First, PCPO introduces a risk function and bounds the expected risk within predefined hard constraints. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. The policy, value function and newly defined risk function are all approximated by neural networks. Second, synchronized parallel learners are employed to explore different state sub-spaces of the system, which reduces the correlation of sample sets, increases the possibility of finding feasible states and accelerates convergence.

The rest of this paper is organized as follows: Section II introduces the PCPO algorithm, including the safety constraints, the trust region and the parallel learning framework. Section III presents two simulation studies for autonomous driving tasks: lane-keeping and intersection crossing. Finally, we provide concluding remarks in Section IV.

II Methodology

II-A Preliminaries

In this work, we formalize the RL problem as a Markov Decision Process (MDP) [20]. An infinite-horizon discounted MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, r, p, d_0, \gamma)$, where $\mathcal{S}$ is the finite set of states, $\mathcal{A}$ is the finite set of actions, $r$ is the reward function, $p$ is the transition probability distribution, $d_0$ is the distribution of the initial state $s_0$, and $\gamma$ is the discount factor.

For each state at time , the expected accumulated return is defined as . The action is chosen according to a stochastic policy , with denoting the probability of choosing action in state . The value function is , where , for , and the state-action value function is denoted as .

RL aims to get the policy which maximizes the accumulated return in infinite horizon. Let denote the objective function of policy update, it follows that , where is a sequence of action-state: . Obviously we can express with and . Firstly, let’s define the advantage function as

So the advantage function and the objective function satisfy

where represents the old policy. It is sensible to use the state distribution corresponding to the old policy in place of that corresponding to the new policy. In a large continuous state space, we can generally construct an estimator of the surrogate objective using importance sampling:

The goal of RL is to find the optimal that maximizes . Since and are independent of , the policy optimization process can then be formulated as:

For conciseness, we define the following surrogate objective function:
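The importance-sampled surrogate described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the function name `surrogate_objective` and its arguments are hypothetical, the log-probabilities are assumed to come from the new and old policies evaluated on the same sampled state-action pairs, and the advantages are assumed to be estimated under the old policy.

```python
import math


def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Sample mean of (pi_new(a|s) / pi_old(a|s)) * A_old(s, a).

    All three arguments are equal-length lists over the same samples,
    which are assumed to have been collected with the old policy.
    """
    # Probability ratio computed in log space for numerical stability.
    ratios = [math.exp(n - o) for n, o in zip(new_log_probs, old_log_probs)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)
```

When the new policy equals the old one, every ratio is 1 and the estimate reduces to the mean advantage.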

II-B Actor-Critic

RL methods usually employ an actor-critic (AC) architecture to approximate both the policy and the value function by iteratively solving the Bellman optimality equation within the generalized policy iteration framework. The AC architecture consists of two structures: the policy network (the so-called actor) and the value approximation (the so-called critic) [11]. In this study, both the actor and the critic are approximated by neural networks (NNs), which directly map the state to the probability distribution of the action and to the expected cumulative return, respectively. We adopt a stochastic policy whose outputs are the mean and standard deviation of a Gaussian distribution. We represent the policy network with parameters , and the state-action value network with parameters .

In previous AC methods, the parameters of the value network are tuned by iteratively minimizing the following loss function

where is usually called the temporal-difference (TD) error. The expected accumulated reward is usually estimated in the forward view using the n-step return:


Then the specific gradient update for the parameters of the value network is

The parameters of policy network are updated to maximize the surrogate objective function . Therefore, the update gradient of policy parameters is

Any standard NN optimization methods can be used to update these two NNs, including stochastic gradient descent (SGD), RMSProp, Adam, etc. Taking the SGD method as an example, the updating rules of the value network and the policy network are:

where and denote the learning rate of the value and policy networks, respectively.
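The n-step return and the TD error used to train the critic can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, the rollout segment is assumed to end with a bootstrap value supplied by the critic, and the exact loss and update equations of the paper are not reproduced.

```python
def n_step_targets(rewards, bootstrap_value, gamma):
    """Forward-view return targets for one rollout segment.

    rewards: r_t, ..., r_{t+n-1} collected along the segment.
    bootstrap_value: critic estimate of the state reached after the segment.
    Entry i of the result is the (n - i)-step return from step t + i.
    """
    targets = []
    running = bootstrap_value
    # Accumulate backwards: R_k = r_k + gamma * R_{k+1}.
    for reward in reversed(rewards):
        running = reward + gamma * running
        targets.append(running)
    targets.reverse()
    return targets


def td_errors(targets, values):
    """TD error per step: return target minus the critic's current estimate."""
    return [target - value for target, value in zip(targets, values)]
```

The same helper applies to the risk signal if rewards are replaced by per-step risks.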

Note that the objective function is guaranteed to improve monotonically throughout the learning process only when the policy learning rate is small enough [17]. However, small learning steps usually lead to slow convergence. Another disadvantage is that policy safety cannot be guaranteed during the learning process.

II-C Algorithm

II-C1 Constrained Policy Optimization

Inspired by the study of Achiam et al. (2017) [1], we introduce a policy security constraint based on the newly defined risk function to ensure agent security during the learning process. This method is called Constrained Policy Optimization (CPO). Besides the reward signal , the vehicle also observes a scalar risk signal at each step. The risk signal is usually designed by human experts and is typically assigned a large value when the vehicle moves into an unsafe state. Similar to the definition of and , the expected accumulated risk and the risk function (also called the cost function) are defined as and respectively. Similar to , we define the objective of :

To ensure policy security, the risk function of policy should always be bounded above by the safe bound . So, the policy security constraint can be formulated as:

In this study, the risk function is also represented by an NN with parameters . The risk network is updated in the same way as the value network , so the details are omitted in this paper. The policy, value and risk networks together constitute a new Actor-Critic-Risk (ACR) architecture (see Fig. 1).

Fig. 1: Actor-Critic-Risk architecture.

II-C2 Trust Region Constraint

Since both and are estimates, the monotonic improvement condition can be guaranteed only when the policy changes are not very large. Therefore, inspired by [17], we add a constraint that prevents excessive policy updates, which permits relatively large update steps without breaking the monotonic improvement guarantee. The policy constraint is described as:

where is the corresponding step size bound and is the Kullback-Leibler (KL) divergence, which is used to measure the difference between the new policy and old policy . This constraint is also called the trust region constraint [17].
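For the Gaussian policy used in this study, the KL divergence has a closed form, so the trust region check can be sketched directly. The following is a minimal sketch under assumptions: the action is one-dimensional, the divergence is taken from the old policy to the new one and averaged over sampled states, and the names `gaussian_kl`, `within_trust_region` and `delta` are hypothetical.

```python
import math


def gaussian_kl(mean_old, std_old, mean_new, std_new):
    """KL( N(mean_old, std_old^2) || N(mean_new, std_new^2) ) for scalars."""
    return (math.log(std_new / std_old)
            + (std_old ** 2 + (mean_old - mean_new) ** 2) / (2.0 * std_new ** 2)
            - 0.5)


def within_trust_region(old_params, new_params, delta):
    """Check that the mean KL over sampled states does not exceed delta.

    old_params / new_params: lists of (mean, std) pairs, one per sampled
    state, produced by the old and the candidate policy respectively.
    """
    kls = [gaussian_kl(mo, so, mn, sn)
           for (mo, so), (mn, sn) in zip(old_params, new_params)]
    return sum(kls) / len(kls) <= delta
```

A candidate update that fails this check would be too far from the old policy for the local estimates to be trusted.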

Therefore, the policy optimization problem can be formulated as:


The optimization problem in (2) is equivalent to the following one, written in terms of expectations:


The nonlinear constrained optimization problem formulated above is difficult to solve in practice due to the high-dimensional policy parameters . However, for small step sizes , both the objective function and risk function can be approximated through linearizing around , and the trust region constraint can also be well-approximated by the second order expansion at . The local approximation to (3) is:


where is the gradient of the objective , is the gradient of risk function , stands for the Hessian of the KL-divergence , and is defined as .
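For orientation, the corresponding local problem in Achiam et al. [1] can be written as below. The symbols $g$, $b$, $H$, $c$, $\delta$, $d$, $J_C$ and $\theta_k$ follow that paper and are assumed, not confirmed, to match the notation of this section.

```latex
\begin{aligned}
\theta_{k+1} = \arg\max_{\theta}\;& g^{\top}(\theta - \theta_k) \\
\text{s.t.}\;& c + b^{\top}(\theta - \theta_k) \le 0, \\
& \tfrac{1}{2}\,(\theta - \theta_k)^{\top} H \,(\theta - \theta_k) \le \delta,
\qquad c = J_C(\pi_k) - d .
\end{aligned}
```

Here $g$ and $b$ are the gradients of the objective and of the risk constraint at $\theta_k$, $H$ is the Hessian of the KL divergence, and $c$ measures how far the current policy is from the safe bound $d$.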

The optimization problem above is convex and can be solved efficiently using duality because is always positive semi-definite. We will assume it is positive-definite in the following. Denoting the Lagrange multipliers as and , the dual to the original problem can be expressed as:


If (4) is feasible, and are the optimal solution to (5), then the update rule for policy is:


II-C3 Parallel Constrained Policy Optimization

Sometimes we cannot find a feasible solution to (4). The first reason is a dangerous state, i.e., the risk function value is very high when the agent is in a dangerous state. Another reason is a dangerous action, because CPO may take a bad update and produce an unsafe action due to approximation errors in (4). Achiam et al. handle this situation by proposing a recovery rule to decrease the constraint value [1]. The recovery rule is as follows:


After applying the recovery update, the constraint value is reduced so that the problem becomes feasible again. However, this recovery rule is not suited to dangerous-state cases, because the policy may already act well in safe states. In such cases, adopting this rule results in slower convergence.
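A recovery update of this kind can be sketched as follows, using the form given in Achiam et al. [1]: a step along the negative natural gradient of the constraint, scaled to the trust region boundary. This is a sketch under assumptions: the trust region is taken as one half of the quadratic form bounded by `delta`, the product of the inverse KL Hessian with the constraint gradient (`h_inv_b`) is assumed to be computed elsewhere (for example by conjugate gradient), and all names are hypothetical.

```python
import math


def recovery_step(theta, b, h_inv_b, delta):
    """Return theta - sqrt(2 * delta / (b^T H^-1 b)) * H^-1 b.

    theta: current policy parameters (list of floats).
    b: gradient of the risk constraint with respect to theta.
    h_inv_b: H^-1 b, with H the Hessian of the KL divergence.
    delta: trust region bound, assuming 0.5 * x^T H x <= delta.
    """
    s = sum(bi * hi for bi, hi in zip(b, h_inv_b))  # b^T H^-1 b
    if s <= 0.0:
        raise ValueError("b^T H^-1 b must be positive for a recovery step")
    scale = math.sqrt(2.0 * delta / s)
    return [t - scale * hi for t, hi in zip(theta, h_inv_b)]
```

The step decreases the linearized constraint value as much as the trust region allows, without regard to the reward objective.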

To overcome this problem, we employ multiple agents to explore different state spaces in parallel. The general structure of the parallel algorithm is illustrated in Fig. 2. At each learning step, each agent synchronously generates samples based on the shared policy and uses its samples to solve (4). We call the samples collected by agent feasible if (4) is feasible. All samples from the different agents are used to update the value network and the risk network after each iteration. However, only feasible samples can be used to update the policy network. Whether a set of samples is feasible can be inferred mathematically from the following two indexes: , and . A simple analysis shows that a sample set is feasible only when and .
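One way to test feasibility of the linearized problem is sketched below. The two indexes used in the paper are not reproduced here; instead the sketch uses a condition that follows from the standard local problem of Achiam et al. [1], assuming the trust region is one half of the quadratic form bounded by `delta`. Within that region the smallest achievable value of the linearized constraint is c - sqrt(2 * delta * s), with s = b^T H^-1 b, so a feasible point exists exactly when that value is not positive. The names `is_feasible` and `split_feasible` are hypothetical.

```python
def is_feasible(c, s, delta):
    """Feasibility of: c + b^T x <= 0 subject to 0.5 * x^T H x <= delta.

    c: current constraint violation (expected risk minus safe bound).
    s: b^T H^-1 b, which is non-negative when H is positive definite.
    """
    if c <= 0.0:
        return True  # x = 0 already satisfies the constraint
    # Equivalent to c - sqrt(2 * delta * s) <= 0 for c > 0.
    return c * c <= 2.0 * delta * s


def split_feasible(agent_stats, delta):
    """Partition agents by whether their sample set gives a feasible problem.

    agent_stats: dict mapping agent id -> (c, s) estimated from its samples.
    """
    feasible, infeasible = [], []
    for agent_id, (c, s) in agent_stats.items():
        (feasible if is_feasible(c, s, delta) else infeasible).append(agent_id)
    return feasible, infeasible
```

Only the sample sets in the first list would then contribute to the policy update, while all sample sets still train the value and risk networks.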

Fig. 2: Parallel Constrained Policy Optimization structure.

If no feasible samples are collected in one learning step, (7) is adopted to update the policy. An advantage of parallel agent learning is that it reduces the correlation and increases the coverage of the collected samples, which leads to higher convergence speed and learning stability. The algorithm combining CPO and synchronous parallel agent learning is called Parallel CPO (PCPO) in this study. The pseudocode for the PCPO algorithm is given below:

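  Initialize the policy, value and risk networks with arbitrary parameters, and the initial state of each agent
  for each parallel agent do
     Explore a sample set with the shared policy
     Update the value network using (1)
     Update the risk network in the same way
     Estimate the terms of (4) from the sample set
     Store the sample set in the buffer if it is feasible
  end for
  if the buffer is not empty then
     Solve (5) for the optimal Lagrange multipliers
     Update the policy network using (6)
  else
     Recover the policy using (7)
  end if
Algorithm 1 Parallel Constrained Policy Optimization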
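The branching logic of one PCPO policy step can be sketched as a small function. This is a structural sketch only: `is_feasible`, `cpo_update` and `recovery_update` are hypothetical callables standing in for the feasibility test on (4), the update via (5) and (6), and the recovery rule (7).

```python
def pcpo_policy_step(sample_sets, is_feasible, cpo_update, recovery_update):
    """Choose the policy update for one synchronized learning step.

    sample_sets: one sample set per parallel agent.
    is_feasible: callable returning True if a sample set yields a feasible
        local problem.
    cpo_update: callable applied to the list of feasible sample sets.
    recovery_update: callable applied to all sample sets when none is feasible.
    """
    feasible = [samples for samples in sample_sets if is_feasible(samples)]
    if feasible:
        # At least one agent found a feasible region: do a constrained update.
        return cpo_update(feasible)
    # No agent produced feasible samples: fall back to the recovery rule.
    return recovery_update(sample_sets)
```

With several agents exploring different parts of the state space, the fallback branch is expected to be taken less often than with a single learner.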

III Experiments and Results

We implement the PCPO algorithm to design autonomous driving functions. Two experiments were studied: the first is a single-vehicle lane-keeping task, and the second is a multi-vehicle interaction at an intersection.

III-A Lane keeping

III-A1 Problem Description

The goal of the experiment is to keep the car as close to the center of the lane as possible while not deviating from the road throughout the learning process. The test road used in this experiment is a closed loop with a width of 3 m, which is shown in Fig. 3. The road position and direction information have been acquired by GPS every 0.015 meters.

Fig. 3: Test field map for the lane-keeping experiment.

The state space of the lane-keeping task is represented by a tuple , where denotes the relative lateral distance between the host vehicle and the lane center-line, and denotes the angle between the vehicle’s heading and the tangent direction of the current trajectory. Each parallel car-learner is initialized at a random position on this road. In this experiment, we focus only on the lateral control of the vehicle and assume that the self-driving vehicle travels at a constant speed of 50 km/h. The action space is denoted by , referring to the front wheel angle and . Given an action signal, the vehicle moves according to a two-degree-of-freedom bicycle dynamic model [12].

The reward function is defined as follows:

Besides, the vehicle gets a risk of 100 if it leaves the lane boundary.

Fig. 4: Lane-keeping experiment. Four parallel learning agents trained with the PCPO algorithm: (a) car-learner 1, (b) car-learner 2, (c) car-learner 3, (d) car-learner 4. The solid lines correspond to the mean and the shaded regions span the maximum and minimum values over 5 runs. The red dashed lines show the lane boundaries.
Fig. 5: Learning curves comparison of the lane-keeping experiment: (a) lateral deviation comparison, (b) risk comparison, (c) return comparison. The safe limit () is set as 1, and the boundary delta () is set as . The red dashed line stands for the safe limit. The solid lines correspond to the mean and the shaded regions correspond to the standard deviation over 5 runs. This figure style is also applied in Fig. 7 and Fig. 8.

III-A2 Algorithm Details

We employ four parallel cars to explore different state spaces and learn the shared policy synchronously. Each car explores 16 steps at each iteration to form a sample set. Each epoch contains 25 episodes. The discount factor . We learn the NN parameters with a learning rate of for value and risk networks, while for the policy network. For each NN, the input layer is composed of the states, followed by 2 fully-connected layers with 100 hidden units each. We use exponential linear units (ELUs) for the hidden layers. The output layers of the value and risk networks are both fully-connected linear layers with one scalar output. The policy network, however, has 2 outputs: 1) the mean, with a tanh activation function, and 2) the standard deviation, with a softplus activation function.
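The policy network described above can be sketched as a plain forward pass. This is an illustrative sketch under assumptions, not the authors' code: it uses only the Python standard library, the weight initialization is arbitrary, the state is assumed to be two-dimensional (lateral distance and heading angle), the action one-dimensional, and any scaling of the tanh output to a physical steering range is left out. All function names are hypothetical.

```python
import math
import random


def elu(x):
    """Exponential linear unit with alpha = 1."""
    return x if x > 0.0 else math.exp(x) - 1.0


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the std positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def init_layer(n_in, n_out, rng):
    """Fully-connected layer as (weights, biases) with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, inputs):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_policy(state_dim, hidden=100, seed=0):
    """Two hidden layers of `hidden` units plus separate mean and std heads."""
    rng = random.Random(seed)
    return {
        "h1": init_layer(state_dim, hidden, rng),
        "h2": init_layer(hidden, hidden, rng),
        "mean": init_layer(hidden, 1, rng),
        "std": init_layer(hidden, 1, rng),
    }


def policy_forward(params, state):
    """Map a state to the mean and standard deviation of a Gaussian action."""
    h = [elu(v) for v in dense(params["h1"], state)]
    h = [elu(v) for v in dense(params["h2"], h)]
    mean = math.tanh(dense(params["mean"], h)[0])   # bounded in [-1, 1]
    std = softplus(dense(params["std"], h)[0])      # strictly positive here
    return mean, std
```

The value and risk networks would share the same hidden structure but end in a single linear output instead of the two heads.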

III-A3 Results

Fig. 4 shows the evolution of the average lateral deviation of the four cars in 5 different runs during PCPO learning. All parallel cars stay inside the lane throughout the learning process, while the deviation of each car drops quickly within about 70 epochs. This demonstrates PCPO’s ability to ensure policy security during the learning process while quickly converging to the optimum.

We compare the PCPO algorithm with two other RL algorithms, CPO and PPO. We used the same NN architecture and hyper-parameters for all three algorithms. Note that we use the clipped surrogate objective function with in the PPO experiment. Fig. 5(a) shows the training performance of all algorithms in 5 different runs. All three methods eventually learn a safe lane-keeping policy; however, the vehicle deviates from the lane multiple times during the learning process of PPO. Besides, in this task, PCPO improves learning speed by approximately 35% compared to CPO, and by more than 70% compared to PPO. Fig. 5(b) and 5(c) plot the average risk value and return of all three algorithms, respectively. Fig. 5(b) indicates that only CPO and PCPO kept the expected risk value below the predefined risk limit. Fig. 5(c) indicates that PCPO can learn a relatively optimal policy while ensuring security and efficiency.

III-B Decision-making of multi-vehicles at an intersection

III-B1 Problem Description

In this experiment, an unsignalized intersection is chosen as the simulation scenario, where each direction is a bidirectional single carriageway. We consider three vehicles in the intersection, the trajectories of which are pre-assigned and fixed. We randomly initialize the velocity and position of each vehicle along its track, then implement algorithms to learn a centralized policy for the three vehicles to pass through the crossing as fast as possible and without collision.

Fig. 6: The intersection crossing experiment. The colored bars show the pre-assigned trajectories. The green dot is an example to illustrate the middle point of the green vehicle’s trajectory.

As Fig. 6 shows, the state space is represented by a tuple , where denotes the distance of the vehicle to the middle point of its track, and (m/s) denotes the velocity. As the trajectory of each vehicle is fixed, the action space consists of the accelerations of the three cars , where (m/s²). The agents receive a reward of 10 for every passing vehicle, as well as a reward of -1 for every time step and an additional reward of 10 for terminal success. The agents are given a risk of 50 when a collision occurs.
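The reward and risk signals of this task can be sketched as a per-step function. This is a sketch under assumptions: "terminal success" is taken to mean that all vehicles have passed without a collision, the passing reward is assumed to be granted once at the step on which a vehicle passes, and the function and argument names are hypothetical.

```python
def intersection_signals(num_newly_passed, terminal_success, collided):
    """Return (reward, risk) for one time step of the crossing task.

    num_newly_passed: vehicles that finished crossing during this step.
    terminal_success: True on the step where the episode ends successfully.
    collided: True if a collision occurred during this step.
    """
    reward = -1.0                       # time penalty for every step
    reward += 10.0 * num_newly_passed   # reward for every passing vehicle
    if terminal_success:
        reward += 10.0                  # additional terminal reward
    risk = 50.0 if collided else 0.0    # risk signal on collision
    return reward, risk
```

Keeping the collision penalty in the separate risk signal, rather than in the reward, is what allows the constrained formulation to bound it independently.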

III-B2 Results

Fig. 7: Risk comparison of the crossing experiment. The safe limit () is set as 5, and the boundary delta () is set as . The closer to the limit is better.

We use the same comparison algorithms and settings, neural network structure, hyper-parameters, etc. as in the lane-keeping experiment. Fig. 7 shows that the risks of PCPO and CPO are monotonically reduced and kept around the safe bound throughout the process, which validates that PCPO and CPO are better at guaranteeing safety during the learning process. Because observing safety constraints and obtaining high rewards conflict with each other (crossing quickly and staying safe are competing goals), the closer the risk is to the limit, the better.

Fig. 8: Return comparison of the crossing experiment. The theoretical optimal return is 0.

In addition, Fig. 8 shows that PCPO has better learning performance in both learning speed and the optimality of the converged policy. The PCPO algorithm converges to 0 (the theoretical highest return) after approximately 800 epochs of learning. In comparison, the CPO algorithm reaches a less optimal return of -5 and learns more slowly, not converging until 1400 epochs (75% slower than PCPO). The PPO algorithm appears to improve fastest in the initial epochs; however, given its risk of 30, much higher than the limit of 5, and its return curve, which converges to -15, we can infer that it converges to a sub-optimum that is also an unsafe policy.

IV Conclusion

This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for autonomous driving tasks. PCPO is formulated as a constrained optimization problem, in which an expected risk function bounded above by a risk limit is introduced to guarantee policy safety. The algorithm extends today’s actor-critic architecture to a three-component learning framework, in which three fully connected NNs are used to approximate the policy, the value function and the newly defined risk function, respectively. Besides, a trust region constraint is added to allow large policy updates without breaking the monotonic improvement condition. A synchronized parallel learning strategy is developed to accelerate exploration and improve the possibility of reaching the optimal solution. We apply our algorithm to two tasks: single-vehicle lane-keeping and multi-vehicle decision-making at a crossing. The experimental results show the contributions of the PCPO algorithm to solving autonomous driving problems in the following aspects:

  • It can guarantee safety constraints during the learning process for general autonomous driving tasks;

  • It has higher learning speed and higher data efficiency;

  • It is also more likely to keep learning agents from getting stuck at sub-optima, or at least to lead them to a safe sub-optimal policy.


  • [1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. arXiv preprint arXiv:1705.10528. Cited by: §II-C3.
  • [2] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018-02) Safe reinforcement learning via shielding. In The 32nd AAAI Conference on Artificial Intelligence, New Orleans, United States. Cited by: §I.
  • [3] J. Duan, Y. Guan, S. E. Li, Y. Ren, and B. Cheng (2020) Distributional soft actor-critic: off-policy reinforcement learning for addressing value estimation errors. arXiv preprint arXiv:2001.02811. Cited by: §I.
  • [4] J. Duan, R. Li, L. Hou, W. Wang, G. Li, S. E. Li, B. Cheng, and H. Gao (2017) Driver braking behavior analysis to improve autonomous emergency braking systems in typical chinese vehicle-bicycle conflicts. Accident Analysis & Prevention 108, pp. 74–82. Cited by: §I.
  • [5] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng (2020) Hierarchical reinforcement learning for self-driving decision-making without reliance on labeled driving data. IET Intelligent Transport Systems. doi:10.1049/iet-its.2019.0317. Cited by: §I.
  • [6] J. Garcia and F. Fernández (2012) Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research 45, pp. 515–564. Cited by: §I.
  • [7] J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480. Cited by: §I.
  • [8] C. Gehring and D. Precup (2013) Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 AAMAS, Saint Paul, United States, pp. 1037–1044. Cited by: §I.
  • [9] P. Geibel and F. Wysotzki (2005) Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, pp. 81–108. Cited by: §I.
  • [10] D. Karavolos (2013) Q-learning with heuristic exploration in simulated car racing. Cited by: §I.
  • [11] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In NIPS, Denver, United States, pp. 1008–1014. Cited by: §II-B.
  • [12] K. Lee, S. E. Li, and D. Kum (2018) Synthesis of robust lane keeping systems: impact of controller and design parameters on system performance. IEEE Transactions on Intelligent Transportation Systems 20 (8), pp. 3129–3141. Cited by: §III-A1.
  • [13] S. E. Li (2019) Reinforcement learning and control: lecture notes. Tsinghua University. Cited by: §I.
  • [14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I.
  • [15] A. Nilim and L. El Ghaoui (2005) Robust control of markov decision processes with uncertain transition matrices. Operations Research 53 (5), pp. 780–798. Cited by: §I.
  • [16] X. Pan, Y. You, Z. Wang, and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952. Cited by: §I.
  • [17] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015-07) Trust region policy optimization. In ICML, Lille, France, pp. 1889–1897. Cited by: §II-B, §II-C2, §II-C2.
  • [18] N. T. Siebel and G. Sommer (2007) Evolutionary reinforcement learning of artificial neural networks. International Journal of Hybrid Intelligent Systems 4 (3), pp. 171–183. Cited by: §I.
  • [19] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014-06) Deterministic policy gradient algorithms. In ICML, Beijing. Cited by: §I.
  • [20] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §II-A.
  • [21] A. Tamar, D. Di Castro, and S. Mannor (2012-06) Policy gradients with variance related risk criteria. In Proceedings of the 29-th ICML, Edinburgh, Scotland, pp. 387–396. Cited by: §I.
  • [22] S. Wang, D. Jia, and X. Weng (2018) Deep reinforcement learning for autonomous driving. arXiv preprint arXiv:1811.11329. Cited by: §I.
  • [23] S. Zhang, H. Peng, S. Nageshrao, and E. Tseng (2019) Discretionary lane change decision making using reinforcement learning with model-based exploration. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 844–850. Cited by: §I.