I Introduction
Autonomous driving has the potential to improve the safety and accessibility of ground vehicles. Approaches to designing the control policy for an autonomous vehicle generally fall into two categories: (1) rule-based methods and (2) learning-based methods. Because real-world driving involves many complexities, rule-based methods (e.g., the finite state machine) may not be able to handle all scenarios, and meanwhile pose a significant burden on engineers to cover all possibilities [4]. Learning-based methods can imitate and learn from drivers' manipulation implicitly. In this study, we develop a learning-based method while making an important improvement.
Reinforcement learning (RL) has been actively studied for autonomous driving in recent years. The goal of RL is to find policies that maximize the accumulated reward without reliance on labeled human driving data [13]. Karavolos et al. (2013) first applied the vanilla Q-learning algorithm to efficiently train a driver in a racing game on the simulator TORCS [10]. Lillicrap et al. (2015) proposed the deep deterministic policy gradient (DDPG) algorithm by introducing deep learning into the DPG algorithm [19], which effectively solves problems in a continuous action space [14]. Wang et al. (2018) successfully applied DDPG to end-to-end policy learning for autonomous driving on TORCS [22]. Pan et al. (2017) built a new framework on the basis of A3C to train a self-driving vehicle by interacting with a synthesized real environment [16]. Recently, Zhang et al. (2019) employed RL with model-based exploration for autonomous lane-change decision-making on highways [23]. Duan et al. (2020) employed RL with a hierarchical architecture for autonomous decision-making on highways [5][3]. To date, most RL methods have been developed on simulation platforms, with little work on real vehicles. A main reason is that the policy cannot be guaranteed to be safe, and the back-propagation-driven training process may lead to unforeseen accidents. Safety is the most basic requirement for autonomous vehicles, so a training process that looks only at reward, and ignores potential risk, is not acceptable. The notion of safe RL is defined in [7] as the process of learning policies that maximize the expectation of accumulated return, while respecting safety constraints during learning and deployment. More specifically, safe RL can be divided into strictly safe and approximately safe methods. The algorithm developed in this paper is an approximately safe one. Approaches to this problem fall into two categories: (1) modifying the optimization criterion and (2) modifying the exploration process.
The first method incorporates risk into the optimization objective, whereas risk-neutral control neglects the variance in the probability distribution of rewards. We categorize these optimization criteria into four groups: maximin, risk-sensitive, constrained, and others. The maximin criterion considers a policy optimal if it has the maximum worst-case return. Gaskett (2003) addressed the inherent uncertainty related to the stochastic nature of the system by proposing a pessimistic extension to Q-learning. Nilim and El Ghaoui (2005) considered the uncertainty related to some of the parameters of the Markov decision process [15]. The risk-sensitive criterion includes the notion of risk and return variance in the long-term reward maximization objective. Geibel and Wysotzki (2005) transformed the optimization criterion into the probability of entering an error state [9]. The constrained criterion ensures that the expectation of return is subject to one or more constraints. Castro et al. (2012) used a constrained criterion in which the variance of the return must not exceed a given threshold [21]. More optimization criteria were explored to enforce safety as well. Alshiekh et al. (2018) proposed a preemptive-shielding system, acting each time the learning agent is about to make a decision and providing a list of safe actions [2]. However, these methods share the drawbacks of turning overly pessimistic or being computationally intractable.

Modification of the exploration process can be categorized into two approaches: (1) incorporating external knowledge and (2) risk-directed exploration. Incorporating external knowledge can provide initial knowledge to the agent, but it is not sufficient to prevent dangerous situations in later exploration. Siebel and Sommer (2007) used external knowledge as a form of population seeding in neuroevolution approaches [18]. Alshiekh et al. (2018) also introduced a post-posed shielding system: the shield monitors the agent's actions and corrects them if the chosen action would cause a violation. Risk-directed exploration encourages the agent to explore controllable regions of the environment by introducing a risk metric as an exploration bonus. García et al. (2012) successfully applied this approach to helicopter hovering control in the RL Competition [6]. Gehring and Precup (2013) defined a risk metric based on controllability [8]. These methods have limited performance and are not always reliable because of their inability to detect risky situations both in early steps and over the long term.
A main contribution of this paper is a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for learning autonomous driving policies. PCPO ensures that the policy remains safe during the learning process and improves convergence speed. First, PCPO considers a risk function and bounds the expected risk within predefined hard constraints. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. The policy, value function, and newly defined risk function are all approximated by neural networks. Second, synchronized parallel learners are employed to explore different state subspaces of the system to reduce the correlation of sample sets, which increases the possibility of finding feasible states and accelerates convergence.
The rest of this paper is organized as follows: Section II introduces the PCPO algorithm, including the safety constraints, the trust region, and the parallel learning framework. Section III presents two simulation studies for autonomous driving tasks: lane keeping and intersection crossing. Finally, we provide concluding remarks in Section IV.
II Methodology
II-A Preliminaries
In this work, we formalize the RL problem as a Markov Decision Process (MDP) [20]. An infinite-horizon discounted MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, r, P, d_0, \gamma)$, where $\mathcal{S}$ is the finite set of states, $\mathcal{A}$ is the finite set of actions, $r(s,a)$ is the reward function, $P(s'|s,a)$ is the transition probability distribution, $d_0$ is the distribution of the initial state $s_0$, and $\gamma \in (0,1)$ is the discount factor.

For each state $s_t$ at time $t$, the expected accumulated return is defined as $R_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$. The action $a_t$ is chosen according to a stochastic policy $\pi(a|s)$, with $\pi(a|s)$ denoting the probability of choosing action $a$ in state $s$. The value function is $V^{\pi}(s)=\mathbb{E}_{\pi}[R_t \mid s_t=s]$, where $r_{t+k}$, for $k \ge 0$, is the reward received $k$ steps after state $s$, and the state-action value function is denoted as $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[R_t \mid s_t=s, a_t=a]$.
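These definitions can be made concrete with a short numerical sketch (the function names are ours, not the paper's; the value estimate is a plain Monte Carlo average over rollouts):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Accumulated return R_t = sum_k gamma^k * r_{t+k}, here for t = 0."""
    return float(sum(gamma ** k * r for k, r in enumerate(rewards)))

def mc_value_estimate(reward_sequences, gamma=0.99):
    """Monte Carlo estimate of V(s): average return over rollouts started in s."""
    return float(np.mean([discounted_return(rs, gamma) for rs in reward_sequences]))

# e.g. three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```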
RL aims to find the policy that maximizes the accumulated return over an infinite horizon. Let $J(\pi)$ denote the objective function of the policy update; it follows that $J(\pi)=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\right]$, where $\tau$ is a sequence of state-action pairs: $\tau=(s_0,a_0,s_1,a_1,\ldots)$. We can express $J(\pi)$ in terms of $J(\pi_{\mathrm{old}})$ and the advantage of $\pi$ over the old policy. First, define the advantage function as

$$A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s).$$

The advantage function and the objective function then satisfy

$$J(\pi)=J(\pi_{\mathrm{old}})+\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}A^{\pi_{\mathrm{old}}}(s_t,a_t)\right],$$

where $\pi_{\mathrm{old}}$ represents the old policy. It is sensible to use the state distribution $d^{\pi_{\mathrm{old}}}$ corresponding to the old policy in place of that corresponding to policy $\pi$. In a large continuous state space, we can generally construct an estimator of the surrogate objective using importance sampling:

$$\mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\pi(a|s)}{\pi_{\mathrm{old}}(a|s)}A^{\pi_{\mathrm{old}}}(s,a)\right].$$

The goal of RL is to find the optimal $\pi$ that maximizes $J(\pi)$. Since $J(\pi_{\mathrm{old}})$ and the old-policy state distribution are independent of $\pi$, the policy optimization process can then be formulated as:

$$\pi_{\mathrm{new}}=\arg\max_{\pi}\ \mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\pi(a|s)}{\pi_{\mathrm{old}}(a|s)}A^{\pi_{\mathrm{old}}}(s,a)\right].$$

For conciseness, we define the surrogate objective function

$$L(\pi)=\mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\pi(a|s)}{\pi_{\mathrm{old}}(a|s)}A^{\pi_{\mathrm{old}}}(s,a)\right].$$
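As a numerical illustration of the importance-sampling estimator (the sample values below are hypothetical, chosen only to exercise the formula):

```python
import numpy as np

def surrogate_objective(pi_new, pi_old, advantages):
    """Importance-sampling estimate of L(pi): mean of ratio * advantage
    over samples drawn from the old policy."""
    ratios = pi_new / pi_old
    return float(np.mean(ratios * advantages))

# hypothetical action probabilities and advantage estimates for 3 samples
pi_new = np.array([0.5, 0.2, 0.4])
pi_old = np.array([0.4, 0.25, 0.4])
adv = np.array([1.0, -0.5, 0.2])
L = surrogate_objective(pi_new, pi_old, adv)
```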
II-B Actor-Critic
RL methods usually employ an actor-critic (AC) architecture to approximate both the policy and the value function by iteratively solving the Bellman optimality equation within the generalized policy iteration framework. The AC architecture consists of two structures: the policy network (the so-called actor) and the value approximation (the so-called critic) [11]. In this study, both the actor and critic are approximated by neural networks (NNs), which directly map the state to the probability distribution of the action and to the expected cumulative return, respectively. We adopt a stochastic policy, whose outputs are the mean and standard deviation of a Gaussian distribution. We represent the policy network as $\pi_{\theta}(a|s)$ with parameters $\theta$, and the value network as $V(s;w)$ with parameters $w$.

In previous AC methods, the parameters of the value network are tuned by iteratively minimizing the loss function

$$L(w)=\mathbb{E}\left[\tfrac{1}{2}\delta^{2}\right],$$

where $\delta=R_t^{(n)}-V(s_t;w)$ is usually called the temporal-difference (TD) error. The expected accumulated reward is usually estimated in the forward view using the n-step return:

$$R_t^{(n)}=\sum_{k=0}^{n-1}\gamma^{k}r_{t+k}+\gamma^{n}V(s_{t+n};w). \qquad (1)$$
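The n-step target and the resulting TD error can be sketched directly from (1) (function names are ours):

```python
def n_step_return(rewards, v_bootstrap, gamma=0.99):
    """Eq. (1): R_t^(n) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}; w),
    where v_bootstrap is the critic's estimate at the bootstrap state."""
    n = len(rewards)
    return sum(gamma ** k * r for k, r in enumerate(rewards)) + gamma ** n * v_bootstrap

def td_error(n_step_target, v_current):
    """TD error delta = target - V(s_t; w), which drives the critic's squared loss."""
    return n_step_target - v_current
```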
The gradient update for the parameters of the value network is then

$$\nabla_{w}L(w)=-\mathbb{E}\left[\delta\,\nabla_{w}V(s_t;w)\right].$$

The parameters of the policy network are updated to maximize the surrogate objective function $L(\theta)$. Therefore, the update gradient of the policy parameters is

$$\nabla_{\theta}L(\theta)=\mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\nabla_{\theta}\pi_{\theta}(a|s)}{\pi_{\mathrm{old}}(a|s)}A^{\pi_{\mathrm{old}}}(s,a)\right].$$

Any standard NN optimization method can be used to update these two networks, including stochastic gradient descent (SGD), RMSProp, Adam, etc. Taking SGD as an example, the updating rules of the value network and the policy network are:

$$w \leftarrow w-\alpha_{w}\nabla_{w}L(w), \qquad \theta \leftarrow \theta+\alpha_{\theta}\nabla_{\theta}L(\theta),$$

where $\alpha_{w}$ and $\alpha_{\theta}$ denote the learning rates of the value and policy networks, respectively.
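A minimal sketch of one critic SGD step, using linear features in place of the paper's neural network (an illustrative simplification):

```python
import numpy as np

def critic_sgd_step(w, phi, td_target, alpha_w=0.05):
    """One SGD step on the loss 0.5 * delta^2 for a linear critic V(s) = w . phi(s).
    phi is the feature vector of the visited state."""
    delta = td_target - w @ phi        # TD error
    return w + alpha_w * delta * phi   # descend the gradient -delta * phi

w = critic_sgd_step(np.zeros(2), np.array([1.0, 0.0]), td_target=1.0)
```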
Note that only when the policy learning rate is small enough can the objective function be guaranteed to improve monotonically throughout the learning process [17]. However, small learning steps usually lead to slow convergence. Another disadvantage is that policy safety cannot be guaranteed during the learning process.
II-C Algorithm
II-C1 Constrained Policy Optimization
Inspired by the study of Achiam et al. (2017), we introduce a policy safety constraint based on a newly defined risk function to ensure agent safety during the learning process. This method is called Constrained Policy Optimization (CPO). Besides the reward signal $r_t$, the vehicle also observes a scalar risk signal $\tilde{r}_t$ at each step. The risk signal is usually designed by human experts and is assigned a large value when the vehicle moves into an unsafe state. Similar to the definitions of $R_t$ and $V^{\pi}$, the expected accumulated risk and the risk function (also called the cost function) are defined as $\tilde{R}_t=\sum_{k=0}^{\infty}\gamma^{k}\tilde{r}_{t+k}$ and $\tilde{V}^{\pi}(s)=\mathbb{E}_{\pi}[\tilde{R}_t \mid s_t=s]$, respectively. Similar to $L(\pi)$, we define the risk objective $\tilde{L}(\pi)$:

$$\tilde{L}(\pi)=\mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\pi(a|s)}{\pi_{\mathrm{old}}(a|s)}\tilde{A}^{\pi_{\mathrm{old}}}(s,a)\right].$$

To ensure policy safety, the expected risk of policy $\pi$ should always be bounded above by the safe bound $\tilde{d}$. The policy safety constraint can thus be formulated as:

$$\tilde{L}(\pi)\le\tilde{d}.$$

In this study, the risk function is also represented by an NN, $\tilde{V}(s;\tilde{w})$, with parameters $\tilde{w}$. The update method of the risk network is similar to that of the value network and is omitted in this paper. The policy, value, and risk networks together constitute a new Actor-Critic-Risk (ACR) architecture (see Fig. 1).
II-C2 Trust Region Constraint
Since both $L(\pi)$ and $\tilde{L}(\pi)$ are estimates, the monotonic improvement condition can be guaranteed only when the policy change is not too large. Therefore, inspired by [17], we add a constraint to avoid excessive policy updates, so as to take relatively large update steps without breaking the monotonic improvement guarantee. The policy constraint is described as:

$$\bar{D}_{\mathrm{KL}}(\pi_{\mathrm{old}}\,\|\,\pi)\le\delta,$$

where $\delta$ is the corresponding step size bound and $\bar{D}_{\mathrm{KL}}$ is the average Kullback-Leibler (KL) divergence, which is used to measure the difference between the new policy $\pi$ and the old policy $\pi_{\mathrm{old}}$. This constraint is also called the trust region constraint [17].
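For the Gaussian policies used in this study, the per-state KL divergence has a closed form; a sketch for the diagonal case (function name is ours):

```python
import numpy as np

def kl_diag_gaussians(mu_old, sigma_old, mu_new, sigma_new):
    """D_KL(pi_old || pi_new) between diagonal Gaussian policies at one state,
    summed over action dimensions; averaging this over states gives the
    trust-region quantity."""
    return float(np.sum(
        np.log(sigma_new / sigma_old)
        + (sigma_old ** 2 + (mu_old - mu_new) ** 2) / (2.0 * sigma_new ** 2)
        - 0.5
    ))
```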
Therefore, the policy optimization problem can be formulated as:

$$\pi_{\mathrm{new}}=\arg\max_{\pi}\ L(\pi)\quad\text{s.t.}\quad\tilde{L}(\pi)\le\tilde{d},\quad\bar{D}_{\mathrm{KL}}(\pi_{\mathrm{old}}\,\|\,\pi)\le\delta. \qquad (2)$$

The optimization problem in (2) is equivalent to the following one, written in terms of expectations:

$$\begin{aligned} \max_{\theta}\ & \mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\mathrm{old}}(a|s)}A^{\pi_{\mathrm{old}}}(s,a)\right] \\ \text{s.t.}\ & \mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\mathrm{old}}(a|s)}\tilde{A}^{\pi_{\mathrm{old}}}(s,a)\right]\le\tilde{d}, \\ & \mathbb{E}_{s\sim d^{\pi_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\big(\pi_{\mathrm{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\big)\right]\le\delta. \end{aligned} \qquad (3)$$
The nonlinear constrained optimization problem formulated above is difficult to solve in practice due to the high-dimensional policy parameters $\theta$. However, for small step sizes $\delta$, both the objective function and the risk function can be approximated by linearizing around $\theta_{\mathrm{old}}$, and the trust region constraint can also be well approximated by its second-order expansion at $\theta_{\mathrm{old}}$. The local approximation to (3) is:

$$\begin{aligned} \max_{\theta}\ & g^{T}(\theta-\theta_{\mathrm{old}}) \\ \text{s.t.}\ & c+b^{T}(\theta-\theta_{\mathrm{old}})\le 0, \\ & \tfrac{1}{2}(\theta-\theta_{\mathrm{old}})^{T}H(\theta-\theta_{\mathrm{old}})\le\delta, \end{aligned} \qquad (4)$$

where $g$ is the gradient of the objective $L(\theta)$, $b$ is the gradient of the risk function $\tilde{L}(\theta)$, $H$ stands for the Hessian of the KL-divergence, and $c$ is defined as $\tilde{L}(\theta_{\mathrm{old}})-\tilde{d}$.
The optimization problem above is convex and can be solved efficiently using duality, because $H$ is always positive semidefinite; we assume it is positive definite in the following. Denoting the Lagrange multipliers as $\lambda$ and $\nu$, the dual of the original problem can be expressed as:

$$\max_{\lambda\ge 0,\,\nu\ge 0}\ \frac{-1}{2\lambda}\left(g^{T}H^{-1}g-2\nu\,b^{T}H^{-1}g+\nu^{2}\,b^{T}H^{-1}b\right)+\nu c-\frac{\lambda\delta}{2}. \qquad (5)$$

If (4) is feasible, and $\lambda^{*}$ and $\nu^{*}$ are the optimal solutions to (5), then the update rule for the policy is:

$$\theta_{\mathrm{new}}=\theta_{\mathrm{old}}+\frac{1}{\lambda^{*}}H^{-1}\left(g-\nu^{*}b\right). \qquad (6)$$
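To make (6) concrete, consider the special case where the risk constraint is inactive ($\nu^{*}=0$); the multiplier then has the closed form $\lambda^{*}=\sqrt{g^{T}H^{-1}g/(2\delta)}$ and the update reduces to the familiar natural-gradient trust-region step, which lands exactly on the KL boundary (a sketch under that simplifying assumption):

```python
import numpy as np

def trust_region_step(g, H, delta):
    """Update (6) with nu* = 0: lambda* = sqrt(g^T H^-1 g / (2 delta)),
    so the step dtheta = H^-1 g / lambda* satisfies 0.5 dtheta^T H dtheta = delta."""
    Hinv_g = np.linalg.solve(H, g)
    lam = np.sqrt(g @ Hinv_g / (2.0 * delta))
    return Hinv_g / lam

g = np.array([1.0, 0.0])
H = np.eye(2)
step = trust_region_step(g, H, delta=0.01)
kl_quad = 0.5 * step @ H @ step  # equals delta by construction
```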
II-C3 Parallel Constrained Policy Optimization
Sometimes we cannot find a feasible solution to (4). The first reason is a dangerous state, i.e., the risk function value is very high when the agent is in a dangerous state. Another reason is a dangerous action: CPO may take a bad update and produce an unsafe action due to the approximation errors in (4). Achiam et al. handle this situation by proposing a recovery rule that decreases the constraint value [1]:

$$\theta_{\mathrm{new}}=\theta_{\mathrm{old}}-\sqrt{\frac{2\delta}{b^{T}H^{-1}b}}\,H^{-1}b. \qquad (7)$$

After applying the recovery update, the constraint value is reduced so that the problem becomes feasible again. However, this recovery rule does not suit dangerous-state cases, because the policy may still act well in safe states; adopting the rule there results in slower convergence.
To overcome this problem, we employ multiple agents to explore different state spaces in parallel. The general structure of the parallel algorithm is illustrated in Fig. 2. At each learning step, each agent synchronously generates samples based on the shared policy and uses its samples to solve (4). We call the samples collected by an agent feasible if (4) is feasible. All samples from different agents are used to update the value network and the risk network after each iteration; however, only feasible samples can be used to update the policy network. Whether a sample set is feasible can be inferred from the two indexes $c$ and $c^{2}/s-\delta$, where $s=b^{T}H^{-1}b$: analysis shows that a sample set is infeasible only when both $c>0$ and $c^{2}/s-\delta>0$.
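The feasibility test can be sketched directly from the two indexes (a minimal sketch; the function name is ours):

```python
import numpy as np

def sample_set_feasible(c, b, H, delta):
    """Feasibility of the linearized problem (4): infeasible only when the
    risk constraint is violated (c > 0) and even the largest step inside the
    trust region cannot restore it (c^2 / s > delta, with s = b^T H^-1 b)."""
    s = b @ np.linalg.solve(H, b)
    return c <= 0.0 or c ** 2 / s - delta <= 0.0
```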
If no feasible samples are collected in a learning step, (7) is adopted to update the policy. An advantage of parallel-agent learning is that it reduces the correlation and increases the coverage of the collected samples, which leads to higher convergence speed and learning stability. The algorithm combining CPO with synchronous parallel-agent learning is called Parallel CPO (PCPO) in this study. The pseudocode for the PCPO algorithm is given below:
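A high-level sketch of the loop just described (the step labels refer to equations (4), (6), and (7) above):

```
for each learning step:
    for each agent i (synchronously):
        collect a sample batch B_i under the shared policy
        estimate g, b, H, c from B_i and form problem (4)
    update the value network and the risk network with ALL batches
    F <- the batches for which problem (4) is feasible
    if F is not empty:
        update the policy with rule (6) using the feasible batches
    else:
        update the policy with recovery rule (7)
```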
III Experiments and Results
We implement the PCPO algorithm to design autonomous driving functions. Two experiments are studied: the first is a single-vehicle lane-keeping task; the second involves multiple vehicles interacting at an intersection.
III-A Lane Keeping
III-A1 Problem Description
The goal of this experiment is to keep the car as close to the center of the lane as possible while not deviating from the road throughout the learning process. The test road is a closed loop with a width of 3 m, shown in Fig. 3. Road position and direction information were acquired by GPS every 0.015 m.
The state space of the lane-keeping task is represented by a tuple $(y, \varphi)$, where $y$ denotes the relative lateral distance between the host vehicle and the lane centerline, and $\varphi$ denotes the angle between the vehicle's heading and the tangent direction of the current trajectory. Each parallel car-learner is initialized at a random position on this road. In this experiment, we focus only on the lateral control of the vehicle and assume that the self-driving vehicle travels at a constant speed of 50 km/h. The action is the front wheel angle $\delta_f$. Given an action signal, the vehicle moves according to a two-degree-of-freedom bicycle dynamics model [12].
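The environment dynamics can be sketched with a simple kinematic bicycle model. Note that the paper uses a two-degree-of-freedom dynamic model [12]; the kinematic variant, the wheelbase value, and the step size below are illustrative assumptions:

```python
import numpy as np

def lane_keeping_step(y, phi, delta_f, v=50 / 3.6, wheelbase=2.7, dt=0.01):
    """One update of a kinematic bicycle model for the lane-keeping state.
    y: lateral deviation (m); phi: heading error (rad); delta_f: front wheel
    angle (rad); v: speed (m/s, 50 km/h as in the experiment)."""
    y_next = y + v * np.sin(phi) * dt
    phi_next = phi + (v / wheelbase) * np.tan(delta_f) * dt
    return y_next, phi_next
```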
The reward function is defined as follows:
In addition, the vehicle receives a risk signal of 100 if it leaves the lane boundary.
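The risk signal follows directly from the description above (a sketch; the half-width threshold assumes the 3 m lane is centered on the centerline):

```python
def lane_keeping_risk(y, lane_width=3.0):
    """Risk signal for lane keeping: 100 once the vehicle leaves the lane
    (|y| beyond half the 3 m lane width), else 0."""
    return 100.0 if abs(y) > lane_width / 2.0 else 0.0
```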
III-A2 Algorithm Details
We employ four parallel cars to explore different state spaces and learn the shared policy synchronously. Each car explores 16 steps at each iteration to form a sample set, and each epoch contains 25 episodes. The discount factor $\gamma$ and the learning rates of the value, risk, and policy networks are fixed hyperparameters. For each NN, the input layer is composed of the states, followed by 2 fully-connected hidden layers with 100 units each. We use exponential linear units (ELUs) for the hidden layers. The output layers of the value and risk networks are fully-connected linear layers with one scalar output. The policy network has 2 outputs: 1) the mean, with activation function tanh, and 2) the standard deviation, with activation function softplus.

III-A3 Results
Fig. 4 shows the evolution of the average lateral deviation of the four cars over 5 different runs during PCPO learning. All parallel cars stay inside the lane throughout the learning process, while the deviation of each car drops quickly within about 70 epochs. This demonstrates PCPO's ability to ensure policy safety during learning while quickly converging to the optimum.
We compare the PCPO algorithm with two other RL algorithms, CPO and PPO, using the same NN architecture and hyperparameters for all three. Note that we use the clipped surrogate objective function in the PPO experiment. Fig. 5(a) shows the training performance of all algorithms over 5 different runs. All three methods eventually learn a safe lane-keeping policy; however, the vehicle deviates from the lane multiple times during PPO learning. Moreover, in this task, PCPO improves learning speed by approximately 35% compared to CPO, and by more than 70% compared to PPO. Fig. 5(b) and 5(c) respectively plot the average risk value and return of the three algorithms. Fig. 5(b) indicates that only CPO and PCPO keep the expected risk below the predefined risk limit. Fig. 5(c) indicates that PCPO learns a relatively optimal policy while ensuring safety and efficiency.
III-B Decision-Making of Multiple Vehicles at an Intersection
III-B1 Problem Description
In this experiment, an unsignalized intersection is chosen as the simulation scenario, where each approach is a bidirectional single carriageway. We consider three vehicles in the intersection, whose trajectories are preassigned and fixed. We randomly initialize the velocity and position of each vehicle along its track, then apply the algorithms to learn a centralized policy for the three vehicles to pass through the crossing as fast as possible without collision.
As Fig. 6 shows, the state space is represented by the tuple $(d_i, v_i)_{i=1,2,3}$, where $d_i$ denotes the distance of vehicle $i$ to the middle point of its track, and $v_i$ (m/s) denotes its velocity. As the trajectory of each vehicle is fixed, the action space consists of the accelerations of the three cars, $(a_1, a_2, a_3)$ (m/s²). The agents receive a reward of 10 for every passing vehicle, as well as a reward of −1 for every time step and an additional reward of 10 for terminal success. The agents are given a risk of 50 when a collision occurs.
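The reward and risk specification above can be sketched as a per-step signal function (the −1 step penalty is consistent with the negative return curves reported in the results; the function name is ours):

```python
def intersection_step_signals(n_just_passed, all_passed, collided):
    """Per-step reward and risk for the intersection task: +10 per vehicle
    completing its crossing this step, -1 per time step, +10 on terminal
    success; risk 50 on collision."""
    reward = 10.0 * n_just_passed - 1.0 + (10.0 if all_passed else 0.0)
    risk = 50.0 if collided else 0.0
    return reward, risk
```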
III-B2 Results
We use the same baseline algorithms, settings, neural network structure, and hyperparameters as in the lane-keeping experiment. From Fig. 7 we can see that the risks of PCPO and CPO decrease monotonically and are kept around the safe bound throughout the process, which validates the ability of PCPO and CPO to guarantee safety during learning. Because observing the safety constraint and obtaining high rewards are adversarial (crossing quickly and staying safe conflict to some extent), the closer the risk stays to the limit, the better.
In addition, Fig. 8 shows that PCPO performs better in both learning speed and convergent optimality. The PCPO algorithm converges to 0 (the theoretical highest return) after approximately 800 epochs of learning. In comparison, the CPO algorithm reaches a less optimal return of −5 and learns more slowly, not converging until 1400 epochs (75% slower than PCPO). The PPO algorithm appears to improve fastest in the initial epochs; however, given its risk of 30, far above the limit of 5, and its return curve converging to −15, we can infer that it converges to a suboptimum, and indeed to an unsafe policy.
IV Conclusion
This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for autonomous driving tasks. PCPO is formulated as a constrained optimization problem, in which an expected risk function bounded above by a risk limit is introduced to guarantee policy safety. The algorithm extends today's actor-critic architecture to a three-component learning framework, in which three fully-connected NNs approximate the policy, the value function, and the newly defined risk function, respectively. Besides, a trust region constraint is added to allow large policy updates without breaking the monotonic improvement condition. A synchronized parallel learning strategy is developed to accelerate exploration and improve the possibility of reaching the optimal solution. We apply our algorithm to two tasks: single-vehicle lane keeping and multi-vehicle intersection crossing. The experimental results show the contributions of the PCPO algorithm to solving autonomous driving problems in the following aspects:

- It can guarantee safety constraints during the learning process for general autonomous driving tasks;
- It has higher learning speed and higher data efficiency;
- It is more likely to prevent learning agents from getting stuck at suboptima, or at least to converge to a safe suboptimal policy.
References
[1] (2017) Constrained policy optimization. arXiv preprint arXiv:1705.10528.
[2] (2018) Safe reinforcement learning via shielding. In The 32nd AAAI Conference on Artificial Intelligence, New Orleans, United States.
[3] (2020) Distributional soft actor-critic: off-policy reinforcement learning for addressing value estimation errors. arXiv preprint arXiv:2001.02811.
[4] (2017) Driver braking behavior analysis to improve autonomous emergency braking systems in typical Chinese vehicle-bicycle conflicts. Accident Analysis & Prevention 108, pp. 74–82.
[5] (2020) Hierarchical reinforcement learning for self-driving decision-making without reliance on labeled driving data. IET Intelligent Transport Systems. doi:10.1049/iet-its.2019.0317.
[6] (2012) Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research 45, pp. 515–564.
[7] (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
[8] (2013) Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 AAMAS, Saint Paul, United States, pp. 1037–1044.
[9] (2005) Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, pp. 81–108.
[10] (2013) Q-learning with heuristic exploration in simulated car racing.
[11] (2000) Actor-critic algorithms. In NIPS, Denver, United States, pp. 1008–1014.
[12] (2018) Synthesis of robust lane keeping systems: impact of controller and design parameters on system performance. IEEE Transactions on Intelligent Transportation Systems 20 (8), pp. 3129–3141.
[13] (2019) Reinforcement learning and control: lecture notes. Tsinghua University.
[14] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[15] (2005) Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53 (5), pp. 780–798.
[16] (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.
[17] (2015) Trust region policy optimization. In ICML, Lille, France, pp. 1889–1897.
[18] (2007) Evolutionary reinforcement learning of artificial neural networks. International Journal of Hybrid Intelligent Systems 4 (3), pp. 171–183.
[19] (2014) Deterministic policy gradient algorithms. In ICML, Beijing.
[20] (2018) Reinforcement learning: an introduction. MIT Press.
[21] (2012) Policy gradients with variance related risk criteria. In Proceedings of the 29th ICML, Edinburgh, Scotland, pp. 387–396.
[22] (2018) Deep reinforcement learning for autonomous driving. arXiv preprint arXiv:1811.11329.
[23] (2019) Discretionary lane change decision making using reinforcement learning with model-based exploration. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 844–850.