The development of Dedicated Short-Range Communication (DSRC) technology [12] and 5G technology enables Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication. The U.S. Department of Transportation (DOT) estimated that DSRC-based V2V information sharing could potentially address up to 82% of crashes in the United States and prevent thousands of automobile crashes every year. When Basic Safety Messages (BSMs), which include current velocities and positions, are communicated to coordinate Connected Autonomous Vehicles (CAVs), control approaches for scenarios such as cross intersections or lane merging [22, 14, 19] have shown the advantage of information sharing.
Once a lane-changing decision has been made, continuous state space controllers are designed for vehicles to merge into or leave the platoon. However, existing control frameworks for CAVs, such as platooning, Adaptive Cruise Control (ACC), and Cooperative Adaptive Cruise Control (CACC) algorithms, mainly focus on controller design after a decision about lane changing or lane keeping has been provided. Methods based on Model Predictive Control (MPC) or on hybrid system models usually aim to guarantee string stability of the platoon or to improve fuel efficiency. Other work uses vision to predict trajectory and speed for reinforcement learning training, but judging from its accompanying video, the cars do not change lanes.
Reinforcement learning (RL) has shown great success in robotics research [13]. When RL is applied to autonomous driving, one paramount concern is how to ensure the safety of the learned driving policy. The authors in [24] have proved that merely assigning a negative reward to accident trajectories cannot ensure driving safety, because doing so enlarges the variance of the expected reward, which in turn prevents the stochastic gradient descent algorithm used for policy learning from converging.
It would be efficient if autonomous vehicles could learn a driving policy that outputs control signals for steering angle and acceleration directly from the observed environment, for example with an end-to-end convolutional neural network that takes front camera images as input. However, such work only considers lane-keeping scenarios without lane-changing cases, and a pure learning method cannot provide a safety guarantee. To add safety guarantees, researchers have started to use deep learning to learn a policy whose control output is similar to that of MPC. For example, the paper [4] uses a constrained neural network to approximate the MPC law. The guided policy search in [28] uses MPC to generate samples to train Deep Reinforcement Learning (DRL), where DRL and MPC have the same action/decision space. Although it is an approximation of MPC, the learned policy does have the same safety property as the MPC.
The other popular approach is to decompose the learning and control phases. Learning-based methods can make a high-level decision, such as lane changing or lane keeping. For example, the work in [6] evaluates the effectiveness of making high-level decisions, such as "go straight" and "go left", based on images. The work discussing safety in [24] also separates the learning and control phases: it first learns a policy that determines the target speed, the lateral position, and whether to take or give way to another vehicle, and then conducts trajectory planning with hard constraints to ensure safety.
Splitting the learning and control steps is certainly a good way to combine the advantages of both. Is there a better way to integrate them? Can control help the learning process? Moreover, when an autonomous vehicle obtains extra knowledge about the environment via V2V or V2X communication, two challenges remain unsolved: how to make tactical decisions, such as whether to change or keep lane, so as to improve traffic efficiency, and whether sharing future planned velocities or trajectories brings benefits. The existing literature on multi-agent reinforcement learning [27] has not fully solved these CAV challenges, because it does not clearly analyze how communication among agents improves system performance or policy learning.
Hence, in this work, we design a learning algorithm that uses shared information to enhance operational performance for future CAVs, considers feedback from controllers and safety requirements, and shows the benefit of V2V communication. We focus on the behavior planning challenge of determining when connected autonomous vehicles should change or keep lane. We assume that autonomous vehicles can share current and future plans with their neighbors. We model the problem as a Markov decision process and solve it with an improved deep Q-learning algorithm, in which both the behavior action a vehicle takes and the reward for this action incorporate the feedback control decisions of the vehicle's continuous state space controller. In experiments, we show that feedback-DRL-based behavior planning can help increase traffic flow and provide a more comfortable driving experience.
The main contributions of this work are:
- The state space design of the RL approach for CAVs mitigates the curse of dimensionality. The approach scales with the number of vehicles, since the dimension of the state space does not depend directly on the number of vehicles that share information.
- The feedback deep Q-learning proposed in our work integrates the strengths of both reinforcement learning and optimal control. The learning algorithm can exploit historical data to find a good policy in a complex environment. The continuous state space controller has well-studied properties for handling safety constraints. The controller takes part in the proposed learning process by providing feedback on actions and rewards.
- We design a distributed policy learning algorithm that does not rely on a cloud center to collect data or run the training process. The experiment results reveal the potential of applying both centralized and distributed learning algorithms to behavior planning.
- The case study shows that the behavior planning policy learned by the proposed approach can deal with complicated scenarios such as road closures.
The rest of this paper is organized as follows. In Section II, the behavior planning problem is described. Also, the information sharing assumption and a physical model are introduced for CAVs. In Section III, we define the behavior planning as a Markov decision process and solve it through feedback deep Q-learning with both centralized and distributed versions. Experiment results are shown in Section IV concerning traffic flow and driving comfort. Conclusions are given in Section V.
II Problem Description
V2V and V2I communications extend the information gathered by a single vehicle beyond the range of its own sensing system. This work addresses how to use information sharing to make better decisions about when to change or keep lane for Connected Autonomous Vehicles (CAVs). This section first introduces the information sharing considered in this work. Afterward, the kinematic bicycle model is briefly introduced for the continuous dynamics of an ego vehicle.
As shown in the concept diagram in Fig. 1, the problem considered in this paper is to design a decision-making algorithm for behavior planning, assuming that information can be shared among CAVs. Each vehicle is expected to share a sequence of future velocities and lane numbers with its neighbors. A formal definition of the neighbors is given in the next subsection. Once a decision is made, the action is implemented by a controller. Because the action given by the RL agent might not be safe or feasible, the controller can change the action to guarantee safety.
Meanwhile, the resulting action provides feedback to the RL process, which modifies the traditional learning framework as introduced later in Section III-D. The corresponding controller is called a feedback controller. Researchers have proposed several state-of-the-art controllers, such as Model Predictive Control (MPC) [3, 8] and control barrier function based controllers [5, 1].
II-A Information Sharing
Autonomous vehicles are assumed to share information with their $\epsilon$-neighbors, defined as follows.
Definition 1 ($\epsilon$-neighbors).
A vehicle $j$ is said to be an $\epsilon$-neighbor of a vehicle $i$ if $|x_i(t) - x_j(t)| \le \epsilon$, where $i$ and $j$ are the vehicles' indexes, $x_i(t)$ and $x_j(t)$ represent the longitudinal positions of vehicles $i$ and $j$ at time instant $t$, respectively, and $\epsilon$ is a constant parameter. The set that includes all the $\epsilon$-neighbors of vehicle $i$ except for itself is denoted as $\mathcal{N}_i(\epsilon)$. This set is called vehicle $i$'s $\epsilon$-neighbors.
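As an illustration of Definition 1, the following minimal sketch computes a neighbor set from a dictionary of longitudinal positions. The function name `epsilon_neighbors`, the parameter name `eps` for the range constant, and the sample positions are assumptions made for this example; the sketch also ignores the wrap-around distance that would be needed on a looped road.

```python
from typing import Dict, Set


def epsilon_neighbors(positions: Dict[int, float], ego: int, eps: float) -> Set[int]:
    """Return the indexes of all vehicles (except `ego`) whose longitudinal
    distance to `ego` is at most `eps`."""
    x_ego = positions[ego]
    return {
        j
        for j, x_j in positions.items()
        if j != ego and abs(x_j - x_ego) <= eps
    }


# Hypothetical longitudinal positions, indexed by vehicle id.
positions = {0: 10.0, 1: 35.0, 2: 58.0, 3: 120.0}
neighbors_of_1 = epsilon_neighbors(positions, ego=1, eps=30.0)
```

With these sample values, vehicles 0 and 2 lie within range of vehicle 1, while vehicle 3 does not.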
In this work, we consider autonomous vehicles driving on a 3-lane freeway. Each autonomous vehicle $i$ is assumed to share its current state and future plan with its $\epsilon$-neighbors, denoted by the sequences $v_i(t), v_i(t+1), \dots, v_i(t+T)$ (velocities) and $l_i(t), l_i(t+1), \dots, l_i(t+T)$ (lane numbers, labeled as 1, 2, 3). Note that, although a lane number could be replaced by the lateral position of a vehicle, the lane number reflects lane-changing actions more clearly. According to the current development of BSM and DSRC security, the message can be both authenticated and encrypted [25]. Hence, in this work, we assume that the information received through vehicular communication is truthful and has not been manipulated by attackers.
II-B Physical Model
The physical dynamics of an ego vehicle are described by a kinematic bicycle model, as shown in Fig. 2. This model achieves a balance between accuracy and complexity [2, 11]. Point $A$ represents the left and right front wheels, while the two rear wheels are represented by point $B$. Point $C$ is the Center of Gravity (CoG). The lengths of the line segments $AC$ and $BC$ are represented by $l_f$ and $l_r$, respectively. $\delta_f$ is the steering angle of the front wheels. The planar motion of this vehicle is described by three coordinates: $x$, $y$, and $\psi$. $(x, y)$ is the location of the CoG, and $\psi$ illustrates the orientation of the vehicle. The $x$-axis represents the lane centerline. $v$ is the velocity at the CoG, and the slip angle $\beta$ denotes its angle with the longitudinal axis $AB$ of the vehicle.
The discrete-time equations of this model can be obtained by applying an explicit Euler method with a sampling time $\Delta t$ to the continuous states [3]. The control vector for this vehicle is defined as $u = [\alpha, \delta_f]^\top$, where $\alpha$ is its acceleration. The continuous state vector is defined as $q = [x, y, \psi, v, l]^\top$, where $l$ is the current lane number. The detailed equations can be found in the Appendix. More compactly, the update of the state vector is denoted as $q(t+1) = f(q(t), u(t))$.
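The explicit Euler update described above can be sketched as follows. The Appendix equations are not reproduced in this section, so the sketch assumes the commonly used form of the kinematic bicycle model, in which the slip angle is computed from the steering angle and the distances from the CoG to the front and rear axles. The names `VehicleState` and `bicycle_step` and the default values for the sampling time and axle distances are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float    # longitudinal position of the CoG
    y: float    # lateral position of the CoG
    psi: float  # heading angle in radians
    v: float    # speed at the CoG


def bicycle_step(s: VehicleState, accel: float, steer: float,
                 dt: float = 0.1, l_f: float = 1.2, l_r: float = 1.6) -> VehicleState:
    """One explicit Euler step of the (assumed) kinematic bicycle model.

    accel: longitudinal acceleration, steer: front steering angle (rad).
    dt, l_f, l_r are illustrative values, not taken from the paper.
    """
    # Slip angle at the CoG induced by the front steering angle.
    beta = math.atan(l_r / (l_f + l_r) * math.tan(steer))
    return VehicleState(
        x=s.x + dt * s.v * math.cos(s.psi + beta),
        y=s.y + dt * s.v * math.sin(s.psi + beta),
        psi=s.psi + dt * s.v / l_r * math.sin(beta),
        v=s.v + dt * accel,
    )


# Driving straight at 10 units/s while accelerating.
start = VehicleState(x=0.0, y=0.0, psi=0.0, v=10.0)
after = bicycle_step(start, accel=1.0, steer=0.0)
```

With zero steering the vehicle stays on the lane centerline; a positive steering angle produces a positive heading change and lateral displacement in this sign convention.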
III Behavior Policy Learning
The behavior planning problem is modeled as a Markov Decision Process (MDP). This section first introduces the action space, state space, and reward function of this MDP. Then, a feedback deep reinforcement learning algorithm (centralized version) is proposed to solve it. This algorithm can improve the traffic flow and driving comfort of autonomous vehicles, but its learning process relies on a cloud center. Therefore, a distributed version is put forward afterward to get rid of this dependency.
III-A Action Space
In this work, we consider a scenario where autonomous vehicles run on a 3-lane freeway, as shown in Fig. 3. In behavior planning, the action space consists of:
- Keep Lane (KL),
- Change Left (CL),
- Change Right (CR).
More complicated actions are either included in these actions or represented by their combinations. For example, an overtake is the combination of change left followed by change right.
III-A1 Keep Lane
With this action, a vehicle stays in its current lane. A feedback controller adjusts the speed of the ego vehicle according to the safety interval requirements, which are introduced in Sec. III-E. The ego vehicle may either accelerate or decelerate according to the control output of the feedback controller. In extreme cases, when the headway is too tight, the ego vehicle will come to a full stop in the current lane.
III-A2 Change Left & Change Right
With these two actions, an ego vehicle changes to the neighboring lane on its left or right. CL is symmetric to CR. After receiving the CL/CR action from a DRL agent, a feedback controller implements this action based on the current traffic.
III-B State Space
The incentive to change lane is to achieve a higher speed or to avoid obstacles and traffic jams. Intuitively, the states could be designed to include the positions, velocities, and lane numbers of all the vehicles. However, the dimension of such a state grows proportionally to the number of vehicles, and the computational complexity of behavior planning grows exponentially. To mitigate the curse of dimensionality, two state functions are defined: $f_v^k$ evaluates the future velocity quality of lane #$k$ based on shared information, and $f_c$ records the lane-changing frequency of each vehicle. As in the example shown in Fig. 3, the state space includes:
- $f_v^k$ with $k$ the left lane,
- $f_v^k$ with $k$ the current lane,
- $f_v^k$ with $k$ the right lane,
- $f_c$ for the lane-changing frequency.
Definition 2 (State function for future velocity).
The state function for future velocity of lane #$k$ is
$$f_v^k(t) = \frac{1}{n_k} \sum_{j} \sum_{\tau=0}^{T} \gamma^{\tau}\, v_j(t+\tau), \tag{1}$$
where the sum runs over the neighbors $j$ of the ego vehicle and the shared future time steps $\tau$ for which vehicle $j$ is planned to be on lane #$k$, $n_k$ is the number of terms in the sum, $\gamma$ is a decay coefficient, and $v_j(t+\tau)$ is the velocity of vehicle $j$ at time instant $t+\tau$.
This state function is the average discounted future velocity of vehicles on lane #$k$. It reflects the potential velocity quality of lane #$k$ and also helps to avoid unnecessary lane changes. For example, even though at time $t$ the vehicles on the neighboring lane have a higher speed than the ego vehicle, there may be a traffic jam in front of them. The ego vehicle cannot observe this traffic jam because its neighbors block its view. In this case, there is no need to change lane. The farther a time instant lies in the future, the less accurate the shared velocity is. Therefore, a decay coefficient is applied to discount future information.
The state function relies on shared information about both future velocities and lane numbers. The lane numbers are used to determine which vehicles are on lane #$k$, as the state function is defined for a specific lane. When the ego vehicle is on lane #1 or #3, the state of the nonexistent left/right lane is set to 0. For example, $f_v^0 = 0$.
The state function for future velocity scales with the number of vehicles. Unlike the aggregation of all vehicles' states, the average discounted future velocity always exists no matter how many vehicles there are. Note that when there is no vehicle nearby, the future velocities are intentionally set to a fixed default value. The state function is also robust to potential packet drops: although packets may be lost, the average velocity will not be affected dramatically by losing one or two data points.
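The idea behind Definition 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the exact normalization of the original formula is not recoverable from this section, so the sketch averages the discounted velocities over every shared (vehicle, time step) entry that falls on the queried lane. The names `future_velocity_state`, `plans`, `gamma`, and `default` are hypothetical, and a dropped message is modeled simply as a vehicle missing from `plans`.

```python
from typing import Dict, List, Tuple

# For each neighbor id: (planned velocities, planned lane numbers),
# both indexed by the future time offset 0, 1, ..., T.
Plan = Tuple[List[float], List[int]]


def future_velocity_state(plans: Dict[int, Plan], lane: int,
                          gamma: float = 0.9, default: float = 0.0) -> float:
    """Average discounted future velocity of the shared plans on `lane`.

    Returns `default` when no shared entry lies on `lane`
    (an assumed placeholder value for an empty neighborhood).
    """
    total, count = 0.0, 0
    for velocities, lanes in plans.values():
        for tau, (v, l) in enumerate(zip(velocities, lanes)):
            if l == lane:
                total += (gamma ** tau) * v  # discount later time steps
                count += 1
    return total / count if count else default


# Two neighbors that both plan to stay on lane 2 for two time steps.
plans = {1: ([20.0, 20.0], [2, 2]), 2: ([10.0, 10.0], [2, 2])}
state_lane_2 = future_velocity_state(plans, lane=2, gamma=0.5)
```

The result has the same dimension regardless of how many plans are received, which is the scalability property discussed above.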
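Definition 3 (State function for lane-changing frequency).
The state function for lane-changing frequency is
$$f_c(t) = -\sum_{\tau = t - W}^{t} I_c(\tau), \tag{2}$$
where $t$ is the current time instant, $W$ is a constant determining the window size of $f_c$, and the lane-changing indicator is
$$I_c(\tau) = \begin{cases} 1, & \text{if the vehicle changes lane at time } \tau, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Passengers may feel uncomfortable if a vehicle changes lane frequently. Because of the minus sign in front of the sum, frequent lane changes result in a smaller state function value.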
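A windowed lane-change count with a leading minus sign, as described in Definition 3, might be computed as in the sketch below. It assumes that a lane change is detected by comparing consecutive lane numbers and that the count is not normalized by the window size; both choices, and the name `lane_change_state`, are assumptions for illustration.

```python
from typing import Sequence


def lane_change_state(lane_history: Sequence[int], window: int) -> int:
    """Negative number of lane changes within the last `window` transitions.

    lane_history holds the vehicle's lane number at each past time step,
    with the most recent step last.
    """
    recent = lane_history[-(window + 1):]  # window transitions need window+1 samples
    changes = sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)
    return -changes


# Two lane changes (1 -> 2 and 2 -> 3) inside a window of five transitions.
busy = lane_change_state([1, 1, 2, 2, 3, 3], window=5)
# The single change lies outside a window of three transitions.
calm = lane_change_state([1, 2, 2, 2, 2, 2], window=3)
```

A vehicle that has changed lane often recently therefore carries a more negative value into the state.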
III-C Reward Function
A reward is assigned to each state-action pair. For connected autonomous vehicles, behavior planning in our work aims to improve two system-level evaluation criteria: traffic flow and driving comfort. When a cloud center collects information from all the agents, a global reward function can be used for multi-agent RL [26]. Therefore, the reward function is defined as a weighted sum of these two criteria:
Definition 4 (Reward function).
The global reward function (for centralized DRL) is
$$r(t) = w\, F(t) + (1 - w)\, C(t), \tag{4}$$
where $F(t)$ is the traffic flow, $C(t)$ is the driving comfort, and $w \in [0, 1]$ is a trade-off weight.
III-C1 Traffic Flow
Traffic flow reflects the throughput of the road with respect to traffic density. Traffic density $\rho$ is the ratio between the total number of vehicles and the road length. Traffic flow is calculated as
$$F = \rho\, \bar{v}, \tag{5}$$
where $\bar{v}$ is the average velocity of all the vehicles [23].
III-C2 Driving Comfort
The driving comfort of a road segment is defined as the average driving comfort of all the vehicles on this segment. The driving comfort of a single vehicle is related to its acceleration and driving behavior. The driving comfort $c_i(\alpha_i(t), a_i(t))$ of vehicle $i$ is defined as a piecewise function of its acceleration $\alpha_i(t)$ at time $t$ and its action $a_i(t)$, in which the acceleration is compared against a predefined acceleration threshold $\alpha_{\mathrm{th}}$.
Therefore, driving comfort (for the freeway) is calculated as
$$C(t) = \frac{1}{N} \sum_{i=1}^{N} c_i(\alpha_i(t), a_i(t)), \tag{7}$$
where $N$ is the total number of vehicles.
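To make the two criteria concrete, the sketch below computes traffic flow as density times average velocity and averages a per-vehicle comfort score. The per-vehicle comfort values used here (a score of 1 reduced by 0.5 for a large acceleration and by 0.5 for a lane change), the threshold of 2.0, and the convex combination with weight `w` are placeholders chosen for illustration, not the definitions used in the paper.

```python
from typing import Sequence


def traffic_flow(velocities: Sequence[float], road_length: float) -> float:
    """Flow = density * mean velocity, with density = vehicles / road length."""
    density = len(velocities) / road_length
    mean_velocity = sum(velocities) / len(velocities)
    return density * mean_velocity


def comfort(accel: float, action: str, accel_threshold: float = 2.0) -> float:
    """Hypothetical per-vehicle comfort score in [0, 1]."""
    score = 1.0
    if abs(accel) > accel_threshold:  # harsh acceleration or braking
        score -= 0.5
    if action in ("CL", "CR"):        # lane changes are less comfortable
        score -= 0.5
    return score


def global_reward(velocities, accels, actions, road_length, w=0.5):
    """Weighted sum of traffic flow and average comfort (assumed form)."""
    avg_comfort = sum(comfort(a, b) for a, b in zip(accels, actions)) / len(accels)
    return w * traffic_flow(velocities, road_length) + (1.0 - w) * avg_comfort


flow = traffic_flow([10.0, 20.0], road_length=100.0)
reward = global_reward([10.0, 20.0], [0.5, 3.0], ["KL", "CL"], road_length=100.0)
```

In practice the two terms may need rescaling so that neither dominates the sum; the sketch leaves them unscaled.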
III-D Feedback Deep Q-Learning
No environment model is available for behavior planning, so policy iteration and value iteration do not apply. Traditional sample-based reinforcement learning is good at exploiting historical data to find a good policy. However, it guarantees neither the optimality nor the safety of the converged policy. In contrast to traditional Deep Reinforcement Learning (DRL), a feedback controller here monitors the action given by the DRL agent and guarantees safety.
Agent-environment interaction for the feedback DRL is shown in Fig. 4. The DRL agent is like a smart toddler who is good at learning new knowledge but may do something dangerous, such as eating inedible things or playing with sharp items. In this case, an adult (the feedback controller) needs to look after her and guide her as a guardian. For the ego vehicle, Change Left (CL) or Change Right (CR) may be unsafe under some traffic scenarios. The controller monitors the action from the DRL agent and determines the feedback action based on the current traffic condition. In traditional DRL, a transition experience is represented by $(s, a, r, s')$, where $s$ is the current state, $a$ is the action taken by the DRL agent, $r$ is the resulting reward, and $s'$ is the next state. In feedback deep Q-learning, a transition experience is represented by $(s, a_f, r, s')$, where $a_f$ is the feedback action executed by the feedback controller. In feedback deep Q-learning (centralized version), every vehicle contributes its transition experience to a shared replay memory. This algorithm is shown in Alg. 1.
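The interaction loop described above can be sketched as follows. This is not Alg. 1 itself: it only illustrates how a controller can override the agent's proposal before the transition is stored. The sketch assumes that the stored action is the one actually executed by the controller; whether the original tuple also keeps the proposed action is not recoverable from this section. The names `select_action`, `feedback_step`, `controller`, and `env_step` are hypothetical, and a dictionary of Q-values stands in for the deep Q-network.

```python
import random
from collections import deque

ACTIONS = ("KL", "CL", "CR")


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over the three behavior actions."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])


def feedback_step(state, q_values, controller, env_step, memory,
                  epsilon=0.1, rng=random):
    """One interaction step with a feedback controller in the loop.

    controller(state, proposed) returns the action that is safe to execute.
    env_step(state, executed) returns (reward, next_state).
    """
    proposed = select_action(q_values, epsilon, rng)
    executed = controller(state, proposed)          # may differ from proposed
    reward, next_state = env_step(state, executed)  # reward of the executed action
    memory.append((state, executed, reward, next_state))
    return proposed, executed


# Shared replay memory (centralized version) with stub controller and environment.
memory = deque(maxlen=10_000)
q_values = {"KL": 0.0, "CL": 1.0, "CR": -1.0}
proposed, executed = feedback_step(
    "s0", q_values,
    controller=lambda s, a: "KL",     # stub: every lane change is rejected
    env_step=lambda s, a: (1.0, "s1"),
    memory=memory, epsilon=0.0,
)
```

Here the agent proposes CL, the stub controller falls back to KL, and the replay memory records the executed action together with the reward it produced.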
III-E Feedback Controller
The MPC controller proposed in [3] is used as an example to show how a feedback controller works. For each time horizon $H$, the control inputs can be generated by the following optimization program:
$$\begin{aligned} \min_{u(t), \dots, u(t+H-1)} \quad & \sum_{\tau=t+1}^{t+H} \| q(\tau) - q_{\mathrm{ref}}(\tau) \|_Q^2 + \sum_{\tau=t}^{t+H-1} \| u(\tau) \|_R^2 \\ \text{s.t.} \quad & q(\tau+1) = f(q(\tau), u(\tau)), \\ & q_{\min}(\tau) \le q(\tau) \le q_{\max}(\tau), \end{aligned} \tag{8}$$
where $Q$, $R$ are both positive definite weighting matrices for tuning, $q_{\mathrm{ref}}$ is the reference trajectory, and $q_{\min}$, $q_{\max}$ are the bounds of each state, which depend on the width of each lane.
This program can be solved by standard solvers [17]. The state bounds in (8) are used to guarantee that no two vehicles' positions overlap. Once it receives an action from the DRL agent, the MPC program evaluates whether it is feasible (safe) to implement this action. If it is not safe to change lane, the controller continues to use a lane-keeping trajectory as the reference. Our work in [9] shows how constraints in the state bounds can be relaxed by sharing a sequence of future positions (this information is not required in behavior planning).
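The fallback logic of the feedback controller can be illustrated without solving an MPC program. The sketch below replaces the MPC feasibility check with a much simpler stand-in, a minimum longitudinal gap test on the target lane, purely to show how an unsafe lane change is turned into lane keeping. The gap value, the convention that Change Left decreases the lane number, and the name `feedback_action` are assumptions for this example.

```python
from typing import List, Tuple


def feedback_action(proposed: str, ego_x: float, ego_lane: int,
                    others: List[Tuple[float, int]],
                    min_gap: float = 10.0, num_lanes: int = 3) -> str:
    """Return the action to execute: the proposed one if it passes a simple
    gap check, otherwise Keep Lane (KL).

    others: (longitudinal position, lane number) of surrounding vehicles.
    Assumes CL moves to lane - 1 and CR to lane + 1 (illustrative convention).
    """
    if proposed == "KL":
        return "KL"
    target = ego_lane - 1 if proposed == "CL" else ego_lane + 1
    if not 1 <= target <= num_lanes:
        return "KL"  # no such lane exists
    for x, lane in others:
        if lane == target and abs(x - ego_x) < min_gap:
            return "KL"  # target lane is occupied too close to the ego vehicle
    return proposed


blocked = feedback_action("CR", ego_x=50.0, ego_lane=2, others=[(55.0, 3)])
allowed = feedback_action("CR", ego_x=50.0, ego_lane=2, others=[(80.0, 3)])
```

A real controller would instead check whether the optimization program with the lane-changing reference trajectory is feasible under its state bounds.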
III-F Distributed Learning
In the centralized version of feedback deep Q-learning, all vehicles share one replay memory $D$ and one action-value function $Q(s, a)$. This means the training process needs support from a centralized cloud center. The reward (4) also needs this cloud center to calculate traffic flow and driving comfort. The algorithm can be simulated offline, but it is hard to implement in reality because of its dependency on the cloud center. When the number of vehicles increases, the communication cost increases dramatically. Therefore, a distributed learning algorithm is considered, where each vehicle $i$ has its own action-value function $Q_i(s, a)$ and learns it independently. The reward function for distributed DRL is limited to the information from the vehicle's $\epsilon$-neighbors and is defined as follows:
Definition 5 (Reward function).
The local reward function for distributed DRL is
$$r_i(t) = w\, f_v(t) + (1 - w)\, c_i(\alpha_i(t), a_i(t)), \tag{10}$$
where $f_v$ is the future velocity state, $c_i$ is the vehicle's own driving comfort, $\alpha_i(t)$ is vehicle $i$'s acceleration at time $t$, $a_i(t)$ is its action at state $s_i(t)$, and $w \in [0, 1]$.
The reward function is a weighted sum of the future velocity state and the vehicle's own driving comfort. Passengers' preferences can determine this weight. A smaller $w$ is set for passengers who dislike frequent lane changes; a larger $w$ is set for passengers who want to arrive at their destination faster. Because each vehicle conducts learning by itself, each vehicle can have a different weight preference.
The distributed version of the feedback deep Q-learning is shown in Alg. 2. Although it does not need the support of a cloud center, its reward does not directly contribute to system-level traffic flow and driving comfort, and its convergence is slower than that of the centralized version. One experiment in Section IV compares the centralized and distributed versions.
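The structural difference from the centralized version can be sketched as follows: each vehicle owns its replay memory, its value function, and its preference weight. This is not Alg. 2. A tabular Q-update stands in for the deep Q-network, the convex combination in `local_reward` is an assumed form of the weighted sum, and the class name `LocalLearner` and all default parameters are hypothetical.

```python
from collections import defaultdict, deque

ACTIONS = ("KL", "CL", "CR")


class LocalLearner:
    """Per-vehicle learner with its own memory, Q-table, and weight."""

    def __init__(self, weight: float, capacity: int = 1000):
        self.weight = weight                   # passenger preference
        self.memory = deque(maxlen=capacity)   # private replay memory
        self.q = defaultdict(float)            # tabular stand-in for a Q-network

    def local_reward(self, future_velocity: float, comfort: float) -> float:
        """Weighted sum of the future velocity state and own comfort."""
        return self.weight * future_velocity + (1.0 - self.weight) * comfort

    def update(self, s, a, r, s_next, lr: float = 0.1, discount: float = 0.9):
        """Store the transition and apply one Q-learning update."""
        self.memory.append((s, a, r, s_next))
        target = r + discount * max(self.q[(s_next, b)] for b in ACTIONS)
        self.q[(s, a)] += lr * (target - self.q[(s, a)])


# Two vehicles with different preferences learn independently.
fast = LocalLearner(weight=0.9)
calm = LocalLearner(weight=0.5)
r = calm.local_reward(future_velocity=20.0, comfort=1.0)
calm.update("s0", "KL", r, "s1")
```

Because nothing is shared between `fast` and `calm`, no cloud center is needed, at the cost of each learner seeing only its own experience.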
IV Experiments
This section presents the results of three experiments:
- Compare centralized learning with traditional control,
- Compare distributed learning with centralized learning,
- A case study for a freeway with road closures.
In these experiments, vehicles are initially scattered at random over the lanes of a freeway of length 1000. All vehicles loop on this freeway. The total number of vehicles ranges from 100 to 900. For each traffic density, an experiment runs for 4000 time steps. Traffic flow and driving comfort are evaluated based on the statistics of the last 1000 time steps. Each experiment is also repeated 30 times under different initializations for each traffic density.
IV-A First Comparison
This experiment compares the centralized version of feedback deep reinforcement learning in Alg. 1 with a traditional control method. The traditional control used here is a control synthesis algorithm put forward in [9]. Each autonomous vehicle is modeled as a hybrid system. The basic idea for behavior planning is to judge, based on a pre-tuned threshold, whether the benefit of lane changing is significant enough.
As shown in Fig. 5, the RL agent achieves both larger traffic flow and better driving comfort when the traffic density $\rho$ is low. When $\rho$ grows, the result of the RL agent gets worse, but it is still comparable with traditional control. The two dashed lines in this figure show two bounds for behavior planning. They consider two extreme cases: always keep lane and always change lane. When $\rho$ is low, lane changing can increase traffic flow significantly while sacrificing driving comfort. However, when the road is saturated, lane changing only degrades passengers' experience without achieving a higher speed. Consequently, the best choice is to keep lane when $\rho$ is high.
Note that $\rho$ is not included in the state space, which means autonomous vehicles do not know the current traffic density. Intuitively, the optimal policy should be different under different traffic densities. In fact, excluding traffic density is an intentional design choice for the state space of distributed learning, because vehicles cannot determine the current traffic density from the limited information provided by their $\epsilon$-neighbors. Traffic density is excluded in Alg. 1 in order to keep the state space consistent between the two algorithms. This experiment also shows the ability of the RL agent to adapt to different traffic densities by itself. When $\rho$ cannot be accurately identified by a single vehicle, it is more reliable and stable for a traditional controller to use a fixed threshold for different $\rho$. However, such a controller cannot adapt to achieve a better result under different traffic densities, as shown in Fig. 5.
IV-B Second Comparison
This experiment compares the distributed and centralized versions of feedback deep reinforcement learning in Alg. 2 and Alg. 1. Intuitively, centralized learning will have a better result, because its reward function (4) is directly related to the evaluation criteria: traffic flow and driving comfort. Nevertheless, distributed learning is easier to implement because it does not rely on any cloud center. Therefore, we examine to what extent the result deteriorates. The result is shown in Fig. 6. In most cases, distributed learning yields a slightly smaller traffic flow and slightly worse driving comfort. Although the difference sometimes becomes relatively larger, it is still acceptable.
IV-C Case Study
This experiment shows a case study with two road closures on one lane. As shown in Fig. 7, the road closures may be caused by an accident or by construction. Both road closures have a length of 200. We want to test whether the learning algorithm can work in a more complicated scenario. Autonomous vehicles are assumed to know the positions of the road closures once they come within the $\epsilon$-neighbor range of these segments. They can share this information with their $\epsilon$-neighbors (as additional shared information in this case) or gather it from infrastructure through V2I communication.
The state value of (1) for future velocity is defined to be 0 on the road closures. The result is shown in Fig. 8. When the traffic density $\rho$ is low, lane changing achieves a larger traffic flow while sacrificing driving comfort. Lane keeping becomes a better choice when $\rho$ is high. Because the road closures are relatively long in this case, the critical point of $\rho$ is small, and lane keeping is a better strategy for most traffic densities. Although the RL agent changes lanes more often, it avoids large accelerations and thus provides better driving comfort.
V Conclusion
This paper focuses on the behavior planning challenge for CAVs and designs a policy learning algorithm that uses shared information to enhance operational performance while considering feedback from controllers and safety requirements. We also show the benefit of V2V communication. The problem is modeled as an MDP. The solution, feedback deep reinforcement learning, uses $T$-time-step future information shared among neighboring autonomous vehicles. The action of the proposed DRL algorithm receives feedback from a controller, which provides a safety guarantee. The centralized learning method can improve the overall system performance in terms of traffic flow and driving comfort, but its training process needs support from a cloud center. The distributed learning method makes online learning easier to implement, but each vehicle only gets a local reward, and vehicles do not share their historical transition experience. In the experiments, feedback-DRL-based behavior planning shows advantages over a control synthesis algorithm. Distributed learning does not degrade performance much compared with centralized learning, which means it can be used for online learning. The case study with road closures shows that the feedback DRL policy can be applied to a more complicated scenario. In the future, more experiments can be done to verify the scalability and robustness of the proposed state space design.
References
[1] (2017) Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control 62 (8), pp. 3861–3876.
[2] (2019) Model predictive contouring control for collision avoidance in unstructured dynamic environments. IEEE Robotics and Automation Letters 4 (4), pp. 4459–4466.
[3] (2017) Scenario model predictive control for lane change assistance and autonomous driving on highways. IEEE Intelligent Transportation Systems Magazine 9 (3), pp. 23–35.
[4] (2018) Approximating explicit model predictive control using constrained neural networks. In 2018 Annual American Control Conference (ACC), pp. 1520–1527.
[5] (2017) Obstacle avoidance for low-speed autonomous vehicles with barrier function. IEEE Transactions on Control Systems Technology 26 (1), pp. 194–206.
[6] (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9.
[7] (2018) Fast trajectory planning for automated vehicles using gradient-based nonlinear model predictive control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7369–7374.
[8] (2018) Control of connected and automated vehicles: state of the art and future challenges. Annual Reviews in Control 45, pp. 18–40.
[9] (2019) Exploiting beneficial information sharing among autonomous vehicles. In IEEE 58th Conference on Decision and Control (CDC).
[10] (2019) Model-predictive policy learning with uncertainty regularization for driving in dense traffic. In 7th International Conference on Learning Representations (ICLR).
[11] (2018) Multirate lane-keeping system with kinematic vehicle model. IEEE Transactions on Vehicular Technology 67 (10), pp. 9211–9222.
[12] (2011) Dedicated short-range communications (DSRC) standards in the United States. Proceedings of the IEEE 99 (7), pp. 1162–1182.
[13] (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274.
[14] (2012) Development and evaluation of a cooperative vehicle intersection control algorithm under the connected vehicles environment. IEEE Transactions on Intelligent Transportation Systems 13 (1), pp. 81–90.
[15] (2017) Optimal control-based online motion planning for cooperative lane changes of connected and automated vehicles. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3689–3694.
[16] (2016) Heavy-duty vehicle platoon formation for fuel efficiency. IEEE Transactions on Intelligent Transportation Systems 17 (4), pp. 1051–1061.
[17] (2015) Linear and nonlinear programming. Springer.
[18] (2016) Correct-by-construction adaptive cruise control: two approaches. IEEE Transactions on Control Systems Technology 24 (4), pp. 1294–1307.
[19] (2018) Autonomous vehicle navigation in rural environments without detailed prior maps. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2040–2047.
[20] (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.
[21] (2014) Controller synthesis for string stability of vehicle platoons. IEEE Transactions on Intelligent Transportation Systems 15 (2), pp. 854–865.
[22] (2017) A survey on the coordination of connected and automated vehicles at intersections and merging at highway on-ramps. IEEE Transactions on Intelligent Transportation Systems 18 (5), pp. 1066–1077.
[23] (2018) Impact of partial penetrations of connected and automated vehicles on fuel consumption and traffic flow. IEEE Transactions on Intelligent Vehicles 3 (4), pp. 453–462.
[24] (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
[25] (2018) Decentralized and scalable privacy-preserving authentication scheme in VANETs. IEEE Transactions on Vehicular Technology 67 (9), pp. 8647–8655.
[26] (2018) Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems, pp. 9649–9660.
[27] (2019) Multi-agent reinforcement learning: a selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635.
[28] (2016) Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 528–535.